Real-World Use Cases
====================

This guide demonstrates spark-bestfit in production scenarios. Each use case
includes a complete Jupyter notebook with working code you can adapt to your needs.

Monte Carlo Risk Simulation
---------------------------

**Business Context:** Financial risk managers need to estimate potential losses
across portfolios of correlated assets. Monte Carlo simulation generates thousands
of scenarios to calculate Value-at-Risk (VaR) and other risk metrics.

**spark-bestfit Features Used:**

- ``GaussianCopula`` for modeling asset correlations
- Multi-column fitting for portfolio assets
- Distributed sampling for scenario generation
- ``lazy_metrics=True`` for performance

**Notebooks:**

- **Spark:** ``examples/spark/usecase_monte_carlo.ipynb``
- **Local:** ``examples/local/usecase_monte_carlo.ipynb``
- **Ray:** ``examples/ray/usecase_monte_carlo.ipynb``

.. code-block:: python

    from pyspark.sql import functions as F

    from spark_bestfit import DistributionFitter, GaussianCopula

    # Fit distributions to historical returns
    fitter = DistributionFitter(spark)
    results = fitter.fit(returns_df, columns=["AAPL", "GOOGL", "MSFT"], lazy_metrics=True)

    # Model correlations from fit results
    copula = GaussianCopula.fit(results, returns_df)

    # Generate correlated scenarios
    scenarios = copula.sample(n_samples=10000, seed=42)

    # Calculate portfolio VaR (weights: 40% AAPL, 35% GOOGL, 25% MSFT)
    portfolio_returns = scenarios.select(
        (F.col("AAPL") * 0.4 + F.col("GOOGL") * 0.35 + F.col("MSFT") * 0.25).alias("portfolio")
    )

ML Synthetic Data Generation
----------------------------

**Business Context:** Machine learning teams need synthetic data for model
training, testing, and privacy-preserving data sharing. Fitting distributions to
real data enables generating statistically similar synthetic datasets.
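
The fit-then-sample idea described above can be sketched with plain SciPy and NumPy. This is only an illustration and does not use the spark-bestfit API: the "real" column is simulated here, the lognormal family is an assumption chosen for the example, and names such as ``real_income`` and ``synthetic_income`` are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in for a real feature column (simulated here; in practice this
# would come from your dataset).
real_income = rng.lognormal(mean=10.5, sigma=0.6, size=5000)

# Fit a lognormal distribution, fixing the location at 0 (an assumption
# that suits strictly positive data such as income).
shape, loc, scale = stats.lognorm.fit(real_income, floc=0)

# Draw a synthetic column of the same size from the fitted distribution.
synthetic_income = stats.lognorm.rvs(
    shape, loc=loc, scale=scale, size=real_income.size, random_state=rng
)

# Compare real and synthetic samples with a two-sample KS test.
ks_stat, p_value = stats.ks_2samp(real_income, synthetic_income)
print(f"fitted shape={shape:.3f}, scale={scale:.1f}, KS={ks_stat:.4f}")
```

A small KS statistic indicates that the synthetic column is statistically close to the original one. It says nothing about correlations between columns, which is why the multi-column workflows in this guide pair per-column fits with a copula.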

**spark-bestfit Features Used:**

- Multi-column fitting for feature columns
- ``DiscreteDistributionFitter`` for categorical/count features
- Serialization for saving/loading fitted models
- Distributed sampling at scale

**Notebooks:**

- **Spark:** ``examples/spark/usecase_synthetic_data.ipynb``
- **Local:** ``examples/local/usecase_synthetic_data.ipynb``
- **Ray:** ``examples/ray/usecase_synthetic_data.ipynb``

.. code-block:: python

    from spark_bestfit import DistributionFitter, DiscreteDistributionFitter

    # Fit continuous features
    cont_fitter = DistributionFitter(spark)
    cont_results = cont_fitter.fit(df, columns=["age", "income", "score"])

    # Fit discrete features
    disc_fitter = DiscreteDistributionFitter(spark)
    disc_results = disc_fitter.fit(df, columns=["num_purchases", "category_id"])

    # Save for reproducibility
    cont_results.save("models/continuous_fits.json")

    # Generate synthetic data
    # (sample_from_results is not defined in this snippet; see the notebook
    # for the full sampling step)
    synthetic_df = sample_from_results(cont_results, n=100000)

A/B Test Analysis
-----------------

**Business Context:** Product teams run experiments to measure the impact of
changes. Distribution fitting helps model conversion rates, revenue per user, and
other metrics with proper uncertainty quantification.

**spark-bestfit Features Used:**

- Bounded fitting for proportions (0-1 range)
- Bootstrap confidence intervals
- ``lazy_metrics=True`` for quick model selection

**Notebooks:**

- **Spark:** ``examples/spark/usecase_ab_testing.ipynb``
- **Local:** ``examples/local/usecase_ab_testing.ipynb``
- **Ray:** ``examples/ray/usecase_ab_testing.ipynb``

.. code-block:: python

    from spark_bestfit import DistributionFitter

    fitter = DistributionFitter(spark)

    # Fit bounded distributions to conversion rates.
    # Beta naturally fits data bounded to [0, 1].
    results = fitter.fit(
        experiment_df,
        column="conversion_rate",
        bounded=True,
        lower_bound=0.0,
        upper_bound=1.0,
        lazy_metrics=True
    )

    # Get best fit (typically Beta for proportions)
    best = results.best(n=1, metric='aic')[0]
    samples = best.sample(size=10000)  # Samples for the bootstrap CI

Insurance Claims Modeling
-------------------------

**Business Context:** Actuaries model claim frequency and severity to set premiums
and reserves. Heavy-tailed distributions (Pareto, lognormal) are essential for
capturing extreme loss events.

**spark-bestfit Features Used:**

- Heavy-tail distributions (Pareto, lognormal, Weibull)
- ``DiscreteDistributionFitter`` for claim counts
- Bounded fitting for capped policies
- Q-Q plots for tail behavior validation

**Notebooks:**

- **Spark:** ``examples/spark/usecase_insurance.ipynb``
- **Local:** ``examples/local/usecase_insurance.ipynb``
- **Ray:** ``examples/ray/usecase_insurance.ipynb``

.. code-block:: python

    from spark_bestfit import DistributionFitter, DiscreteDistributionFitter

    # Fit claim severity (includes heavy-tailed: pareto, lognorm, burr, etc.)
    severity_fitter = DistributionFitter(spark)
    severity_results = severity_fitter.fit(
        claims_df,
        column="claim_amount",
        lazy_metrics=True
    )

    # Fit claim frequency (discrete: poisson, nbinom, etc.)
    freq_fitter = DiscreteDistributionFitter(spark)
    freq_results = freq_fitter.fit(claims_df, column="num_claims")

    # Get best fits and visualize
    best_severity = severity_results.best(n=1, metric='aic')[0]
    severity_fitter.plot(best_severity, claims_df, "claim_amount")

Risk Model Validation
---------------------

**Business Context:** Financial regulators require statistical validation of risk
models. Unlike model *selection* (AIC), model *validation* asks: "Does the data
actually come from this distribution?"
The Kolmogorov-Smirnov (KS) test provides formal hypothesis testing.

**spark-bestfit Features Used:**

- ``lazy_metrics=False`` to compute KS and Anderson-Darling statistics
- ``metric='ks_statistic'`` for goodness-of-fit based selection
- ``metric='ad_statistic'`` for tail-sensitive validation
- ``fit.pvalue`` for hypothesis test interpretation

**Notebooks:**

- **Spark:** ``examples/spark/usecase_model_validation.ipynb``
- **Local:** ``examples/local/usecase_model_validation.ipynb``
- **Ray:** ``examples/ray/usecase_model_validation.ipynb``

.. code-block:: python

    from spark_bestfit import DistributionFitter

    fitter = DistributionFitter(spark)

    # Fit with full metrics (not lazy) for validation
    results = fitter.fit(
        returns_df,
        column="daily_return",
        lazy_metrics=False  # Compute KS and AD statistics
    )

    # Select by goodness-of-fit (not prediction accuracy)
    best = results.best(n=1, metric='ks_statistic')[0]

    # Interpret hypothesis test
    if best.pvalue > 0.05:
        print("Model PASSES validation (cannot reject H0)")
    else:
        print("Model FAILS validation (reject H0)")

Discrete Event Simulation
-------------------------

**Business Context:** Operations teams need to answer "what-if" questions about
staffing, capacity, and process changes. Rather than experimenting with real
operations (expensive and risky), you can fit distributions to historical data and
simulate scenarios.

**spark-bestfit Features Used:**

- ``DistributionFitter`` for inter-arrival and service times
- ``DiscreteDistributionFitter`` for hourly/daily volumes
- ``lazy_metrics=False`` to validate distributional assumptions
- ``get_scipy_dist()`` to sample from fitted distributions in simulations

**Notebooks:**

- **Spark:** ``examples/spark/usecase_simulation.ipynb``
- **Local:** ``examples/local/usecase_simulation.ipynb``
- **Ray:** ``examples/ray/usecase_simulation.ipynb``

.. code-block:: python

    from spark_bestfit import DistributionFitter

    fitter = DistributionFitter(spark)

    # Fit distributions to operational data
    arrival_results = fitter.fit(df, column='inter_arrival_seconds', lazy_metrics=False)
    service_results = fitter.fit(df, column='service_time_seconds', lazy_metrics=False)

    # Get best fits for simulation
    arrival_dist = arrival_results.best(n=1, metric='aic')[0].get_scipy_dist()
    service_dist = service_results.best(n=1, metric='aic')[0].get_scipy_dist()

    # Simulate queue with fitted distributions
    inter_arrivals = arrival_dist.rvs(size=1000)
    service_times = service_dist.rvs(size=1000)

    # Run what-if scenarios (add agents, change volume, etc.)

Data Drift Detection
--------------------

**Business Context:** Production ML models degrade when underlying data
distributions shift. Drift detection monitors feature distributions over time and
alerts when significant changes occur, enabling proactive model retraining and data
quality monitoring.

**spark-bestfit Features Used:**

- ``lazy_metrics=False`` for KS statistics and p-values
- ``DistributionFitter`` for baseline and monitoring period fitting
- ``scipy.stats.ks_2samp`` for direct sample comparison
- Multi-column fitting for multi-feature monitoring

**Notebook:** ``examples/spark/usecase_drift_detection.ipynb``

.. code-block:: python

    from spark_bestfit import DistributionFitter
    from scipy.stats import ks_2samp

    fitter = DistributionFitter(spark)

    # Establish baseline from historical data
    baseline_results = fitter.fit(
        baseline_df,
        column='feature',
        lazy_metrics=False  # Need KS statistics
    )
    baseline_samples = baseline_df.toPandas()['feature'].values

    # Monitor new data for drift
    new_samples = new_df.toPandas()['feature'].values
    ks_stat, p_value = ks_2samp(baseline_samples, new_samples)

    if p_value < 0.05:
        print(f"DRIFT DETECTED: KS={ks_stat:.4f}, p={p_value:.4e}")
        # Trigger retraining, alert, or investigation

Capital Budgeting Monte Carlo (Ray)
-----------------------------------

**Business Context:** Finance teams evaluate capital investments (new plants,
equipment, acquisitions) under uncertainty. Traditional DCF analysis uses single
"best estimate" values, but Monte Carlo simulation provides probability
distributions for NPV, IRR, and payback period, enabling risk-informed decision
making.

**spark-bestfit Features Used:**

- ``FitterConfigBuilder`` for reusable configuration
- ``RayBackend`` for distributed computation
- ``GaussianCopula`` for correlated economic parameters
- Multi-column fitting for uncertain inputs (growth rates, costs, discount rates)

**Notebook:** ``examples/ray/usecase_capital_budgeting.ipynb``

.. code-block:: python

    from spark_bestfit import (
        DistributionFitter, FitterConfigBuilder, GaussianCopula, RayBackend
    )

    # Create reusable configuration
    config = (FitterConfigBuilder()
        .with_bins(50)
        .with_sampling(enabled=False)
        .with_lazy_metrics(False)
        .build())

    # Fit distributions to uncertain parameters
    backend = RayBackend()
    fitter = DistributionFitter(backend=backend)
    results = fitter.fit(
        historical_data,
        columns=['revenue_growth', 'cost_ratio', 'discount_rate'],
        config=config
    )

    # Capture parameter correlations
    copula = GaussianCopula.fit(results, historical_data, backend=backend)

    # Generate 10,000 correlated scenarios
    scenarios = copula.sample(n=10000, random_state=42)

    # Calculate NPV, IRR, Payback for each scenario
    # Analyze P(NPV > 0), VaR, sensitivity rankings

Which Use Case Fits Your Needs?
-------------------------------

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Use Case
     - Key Features
     - Best For
   * - Monte Carlo
     - Copula, sampling
     - Risk management, finance, simulations
   * - Synthetic Data
     - Multi-column, serialization
     - ML training, privacy, testing
   * - A/B Testing
     - Bounded, bootstrap CI
     - Product experiments, marketing
   * - Insurance
     - Heavy-tail, discrete
     - Actuarial, loss modeling
   * - Model Validation
     - KS/AD tests, p-values
     - Regulatory compliance, backtesting
   * - Discrete Event Simulation
     - get_scipy_dist(), sampling
     - Operations, staffing, capacity planning
   * - Data Drift Detection
     - KS tests, multi-feature monitoring
     - ML monitoring, data quality, model retraining
   * - Capital Budgeting (Ray)
     - FitterConfigBuilder, RayBackend, NPV/IRR
     - Investment decisions, project evaluation

See Also
--------

- :doc:`quickstart` - Basic usage and installation
- :doc:`/features/config` - FitterConfig builder pattern
- :doc:`/features/copula` - Detailed copula documentation
- :doc:`/features/sampling` - Distributed sampling guide
- :doc:`/features/bounded` - Bounded distribution fitting