Real-World Use Cases
This guide demonstrates spark-bestfit in production scenarios. Each use case includes a complete Jupyter notebook with working code you can adapt to your needs.
Monte Carlo Risk Simulation
Business Context: Financial risk managers need to estimate potential losses across portfolios of correlated assets. Monte Carlo simulation generates thousands of scenarios to calculate Value-at-Risk (VaR) and other risk metrics.
spark-bestfit Features Used:
GaussianCopula for modeling asset correlations
Multi-column fitting for portfolio assets
Distributed sampling for scenario generation
lazy_metrics=True for performance
Notebooks:
from pyspark.sql import functions as F
from spark_bestfit import DistributionFitter, GaussianCopula
# Fit distributions to historical returns
fitter = DistributionFitter(spark)
results = fitter.fit(returns_df, columns=["AAPL", "GOOGL", "MSFT"], lazy_metrics=True)
# Model correlations from fit results
copula = GaussianCopula.fit(results, returns_df)
# Generate correlated scenarios
scenarios = copula.sample(n_samples=10000, seed=42)
# Calculate portfolio VaR
portfolio_returns = scenarios.select(
    (F.col("AAPL") * 0.4 + F.col("GOOGL") * 0.35 + F.col("MSFT") * 0.25).alias("portfolio")
)
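The snippet above ends with the weighted portfolio return and leaves the VaR step itself implicit. As a minimal sketch of that step, assuming the simulated returns have been collected into a plain Python list, the following computes a historical-simulation VaR as the negated lower-tail quantile. The function name `value_at_risk` and the normally distributed stand-in returns are illustrative assumptions, not part of spark-bestfit.

```python
import random

def value_at_risk(returns, confidence=0.95):
    """Return the loss threshold exceeded in roughly (1 - confidence) of scenarios."""
    if not returns:
        raise ValueError("returns must not be empty")
    ordered = sorted(returns)  # worst outcomes first
    # Position of the lower-tail quantile, clamped to a valid index.
    idx = min(int(round((1 - confidence) * len(ordered))), len(ordered) - 1)
    return -ordered[idx]  # report the loss as a positive number

# Stand-in for simulated portfolio returns (hypothetical: mean 0.05%, sd 2% per day).
rng = random.Random(42)
simulated_returns = [rng.gauss(0.0005, 0.02) for _ in range(10_000)]

var_95 = value_at_risk(simulated_returns, confidence=0.95)
var_99 = value_at_risk(simulated_returns, confidence=0.99)
print(f"95% VaR: {var_95:.4f}, 99% VaR: {var_99:.4f}")
```

On a Spark DataFrame, the same quantile would normally be computed with Spark's own aggregation functions rather than by collecting the scenarios to the driver.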
ML Synthetic Data Generation
Business Context: Machine learning teams need synthetic data for model training, testing, and privacy-preserving data sharing. Fitting distributions to real data enables generating statistically similar synthetic datasets.
spark-bestfit Features Used:
Multi-column fitting for feature columns
DiscreteDistributionFitter for categorical/count features
Serialization for saving/loading fitted models
Distributed sampling at scale
Notebooks:
from spark_bestfit import DistributionFitter, DiscreteDistributionFitter
# Fit continuous features
cont_fitter = DistributionFitter(spark)
cont_results = cont_fitter.fit(df, columns=["age", "income", "score"])
# Fit discrete features
disc_fitter = DiscreteDistributionFitter(spark)
disc_results = disc_fitter.fit(df, columns=["num_purchases", "category_id"])
# Save for reproducibility
cont_results.save("models/continuous_fits.json")
# Generate synthetic data
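# NOTE: sample_from_results is not defined or imported in this snippet;
# treat it as a placeholder for your own sampling helper
synthetic_df = sample_from_results(cont_results, n=100000)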
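The fit-then-sample idea behind synthetic data can be shown without Spark. The sketch below is an illustration only: it uses the standard library's `NormalDist` as a stand-in for a fitted distribution, and the feature name `real_ages` and its parameters are made up. It estimates parameters from "real" values, draws synthetic values from the fitted distribution, and compares summary statistics.

```python
from statistics import NormalDist, mean, stdev

# Stand-in "real" feature: 200 evenly spaced quantiles of Normal(50, 10).
real_ages = [NormalDist(50, 10).inv_cdf((i + 0.5) / 200) for i in range(200)]

# Fit: estimate the distribution parameters from the observed data.
fitted = NormalDist.from_samples(real_ages)

# Sample: draw as many synthetic values as needed from the fitted distribution.
synthetic_ages = fitted.samples(5_000, seed=7)

print(f"real:      mean={mean(real_ages):.2f}, sd={stdev(real_ages):.2f}")
print(f"synthetic: mean={mean(synthetic_ages):.2f}, sd={stdev(synthetic_ages):.2f}")
```

Sampling each column independently like this preserves the marginal distributions but not the relationships between columns.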
A/B Test Analysis
Business Context: Product teams run experiments to measure the impact of changes. Distribution fitting helps model conversion rates, revenue per user, and other metrics with proper uncertainty quantification.
spark-bestfit Features Used:
Bounded fitting for proportions (0-1 range)
Bootstrap confidence intervals
lazy_metrics=True for quick model selection
Notebooks:
from spark_bestfit import DistributionFitter
fitter = DistributionFitter(spark)
# Fit bounded distributions to conversion rates
# Beta naturally fits [0, 1] bounded data
results = fitter.fit(
    experiment_df,
    column="conversion_rate",
    bounded=True,
    lower_bound=0.0,
    upper_bound=1.0,
    lazy_metrics=True
)
# Get best fit (typically Beta for proportions)
best = results.best(n=1, metric='aic')[0]
samples = best.sample(size=10000) # For bootstrap CI
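The bootstrap step mentioned in the comment above can be sketched as follows. This is a minimal percentile-bootstrap example in plain Python, not the library's own implementation: the helper name `bootstrap_mean_ci` is hypothetical, and the conversion rates are stand-in draws from a Beta(2, 18) distribution.

```python
import random

def bootstrap_mean_ci(values, n_resamples=2_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    tail = (1 - confidence) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[min(int((1 - tail) * n_resamples), n_resamples - 1)]
    return lower, upper

# Stand-in conversion rates drawn from Beta(2, 18), whose mean is 0.10.
rng = random.Random(1)
conversion_rates = [rng.betavariate(2, 18) for _ in range(500)]

sample_mean = sum(conversion_rates) / len(conversion_rates)
low, high = bootstrap_mean_ci(conversion_rates)
print(f"mean={sample_mean:.4f}, 95% CI=({low:.4f}, {high:.4f})")
```

Running the same helper on the control and treatment groups gives two intervals; little or no overlap between them suggests a real difference.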
Insurance Claims Modeling
Business Context: Actuaries model claim frequency and severity to set premiums and reserves. Heavy-tailed distributions (Pareto, lognormal) are essential for capturing extreme loss events.
spark-bestfit Features Used:
Heavy-tail distributions (Pareto, lognormal, Weibull)
DiscreteDistributionFitter for claim counts
Bounded fitting for capped policies
Q-Q plots for tail behavior validation
Notebooks:
from spark_bestfit import DistributionFitter, DiscreteDistributionFitter
# Fit claim severity (includes heavy-tailed: pareto, lognorm, burr, etc.)
severity_fitter = DistributionFitter(spark)
severity_results = severity_fitter.fit(
    claims_df,
    column="claim_amount",
    lazy_metrics=True
)
# Fit claim frequency (discrete: poisson, nbinom, etc.)
freq_fitter = DiscreteDistributionFitter(spark)
freq_results = freq_fitter.fit(claims_df, column="num_claims")
# Get best fits and visualize
best_severity = severity_results.best(n=1, metric='aic')[0]
severity_fitter.plot(best_severity, claims_df, "claim_amount")
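Once frequency and severity are fitted separately, they are typically combined into an aggregate loss distribution. The sketch below shows that combination in plain Python under stated assumptions: Poisson claim counts and lognormal severities with made-up parameters. The helper names `poisson_draw` and `simulate_annual_losses` are hypothetical and not part of spark-bestfit.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_annual_losses(n_years, claim_rate, sev_mu, sev_sigma, seed=0):
    """Aggregate loss per simulated year: Poisson counts, lognormal severities."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_years):
        n_claims = poisson_draw(rng, claim_rate)
        totals.append(sum(rng.lognormvariate(sev_mu, sev_sigma) for _ in range(n_claims)))
    return totals

# Hypothetical parameters: about 3 claims per year, median severity exp(8).
losses = simulate_annual_losses(n_years=5_000, claim_rate=3.0, sev_mu=8.0, sev_sigma=1.2)
ordered = sorted(losses)
print(f"mean annual loss: {sum(losses) / len(losses):,.0f}")
print(f"99th percentile:  {ordered[int(0.99 * len(ordered))]:,.0f}")
```

The gap between the mean and the 99th percentile is what makes the choice of a heavy-tailed severity distribution matter for reserving.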
Risk Model Validation
Business Context: Financial regulators require statistical validation of risk models. Unlike model selection (AIC), model validation asks: “Does the data actually come from this distribution?” The Kolmogorov-Smirnov (KS) test provides formal hypothesis testing.
spark-bestfit Features Used:
lazy_metrics=False to compute KS and Anderson-Darling statistics
metric='ks_statistic' for goodness-of-fit based selection
metric='ad_statistic' for tail-sensitive validation
fit.pvalue for hypothesis test interpretation
Notebooks:
from spark_bestfit import DistributionFitter
fitter = DistributionFitter(spark)
# Fit with full metrics (not lazy) for validation
results = fitter.fit(
    returns_df,
    column="daily_return",
    lazy_metrics=False  # Compute KS and AD statistics
)
# Select by goodness-of-fit (not prediction accuracy)
best = results.best(n=1, metric='ks_statistic')[0]
# Interpret hypothesis test
if best.pvalue > 0.05:
    print("Model PASSES validation (cannot reject H0)")
else:
    print("Model FAILS validation (reject H0)")
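To make the KS statistic concrete, the sketch below computes it by hand: it is the largest vertical gap between the empirical CDF of the sample and the CDF of the candidate model. This is an illustration in plain Python, not the library's implementation. The function name `ks_statistic` and the normal model for daily returns are assumptions, and the p-value is deliberately not computed.

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """One-sample KS statistic: largest gap between the empirical CDF and `cdf`."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

model = NormalDist(mu=0.0, sigma=0.01)  # hypothetical fitted model for daily returns

# A sample that matches the model closely, and the same sample shifted by one sigma.
matching = [model.inv_cdf((i + 0.5) / 100) for i in range(100)]
shifted = [x + 0.01 for x in matching]

print(f"matching sample: D={ks_statistic(matching, model.cdf):.4f}")
print(f"shifted sample:  D={ks_statistic(shifted, model.cdf):.4f}")
```

A small statistic means the sample tracks the model closely; the p-value converts that gap into a probability under the null hypothesis, taking the sample size into account.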
Discrete Event Simulation
Business Context: Operations teams need to answer “what-if” questions about staffing, capacity, and process changes. Rather than experimenting with real operations (expensive and risky), you can fit distributions to historical data and simulate scenarios.
spark-bestfit Features Used:
DistributionFitter for inter-arrival and service times
DiscreteDistributionFitter for hourly/daily volumes
lazy_metrics=False to validate distributional assumptions
get_scipy_dist() to sample from fitted distributions in simulations
Notebooks:
from spark_bestfit import DistributionFitter
fitter = DistributionFitter(spark)
# Fit distributions to operational data
arrival_results = fitter.fit(df, column='inter_arrival_seconds', lazy_metrics=False)
service_results = fitter.fit(df, column='service_time_seconds', lazy_metrics=False)
# Get best fits for simulation
arrival_dist = arrival_results.best(n=1, metric='aic')[0].get_scipy_dist()
service_dist = service_results.best(n=1, metric='aic')[0].get_scipy_dist()
# Simulate queue with fitted distributions
inter_arrivals = arrival_dist.rvs(size=1000)
service_times = service_dist.rvs(size=1000)
# Run what-if scenarios (add agents, change volume, etc.)
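The queue simulation itself is left open above. A minimal sketch, assuming a single server that handles customers in arrival order, looks like this. The function name `simulate_queue_waits` is hypothetical, and the exponential samples stand in for draws from the fitted distributions.

```python
import random

def simulate_queue_waits(inter_arrivals, service_times):
    """Waiting time of each customer in a single-server, first-come-first-served queue."""
    waits = []
    clock = 0.0           # arrival time of the current customer
    server_free_at = 0.0  # time the server finishes the previous customer
    for gap, service in zip(inter_arrivals, service_times):
        clock += gap
        start = max(clock, server_free_at)  # wait if the server is still busy
        waits.append(start - clock)
        server_free_at = start + service
    return waits

# Stand-in samples; in practice these would come from the fitted distributions.
rng = random.Random(3)
inter_arrivals = [rng.expovariate(1 / 60) for _ in range(1_000)]  # mean 60 s
service_times = [rng.expovariate(1 / 45) for _ in range(1_000)]   # mean 45 s

waits = simulate_queue_waits(inter_arrivals, service_times)
print(f"mean wait: {sum(waits) / len(waits):.1f} s, max wait: {max(waits):.1f} s")
```

A what-if scenario then amounts to rerunning the function with scaled inputs, for example shorter service times to approximate a faster process.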
Data Drift Detection
Business Context: Production ML models degrade when underlying data distributions shift. Drift detection monitors feature distributions over time and alerts when significant changes occur, enabling proactive model retraining and data quality monitoring.
spark-bestfit Features Used:
lazy_metrics=False for KS statistics and p-values
DistributionFitter for baseline and monitoring period fitting
scipy.stats.ks_2samp for direct sample comparison
Multi-column fitting for multi-feature monitoring
Notebook: examples/spark/usecase_drift_detection.ipynb
from spark_bestfit import DistributionFitter
from scipy.stats import ks_2samp
fitter = DistributionFitter(spark)
# Establish baseline from historical data
baseline_results = fitter.fit(
    baseline_df,
    column='feature',
    lazy_metrics=False  # Need KS statistics
)
baseline_samples = baseline_df.toPandas()['feature'].values
# Monitor new data for drift
new_samples = new_df.toPandas()['feature'].values
ks_stat, p_value = ks_2samp(baseline_samples, new_samples)
if p_value < 0.05:
    print(f"DRIFT DETECTED: KS={ks_stat:.4f}, p={p_value:.4e}")
    # Trigger retraining, alert, or investigation
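For intuition about what `ks_2samp` measures, the sketch below computes the two-sample KS statistic directly: the largest gap between the two empirical CDFs. It is a plain-Python illustration with a hypothetical function name, `ks_two_sample_statistic`, and synthetic normal samples. It does not compute the p-value, which `ks_2samp` returns alongside the statistic.

```python
from bisect import bisect_right
import random

def ks_two_sample_statistic(a, b):
    """Largest absolute difference between the empirical CDFs of `a` and `b`."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / n  # share of `a` that is <= x
        fb = bisect_right(b_sorted, x) / m  # share of `b` that is <= x
        d = max(d, abs(fa - fb))
    return d

rng = random.Random(11)
baseline = [rng.gauss(0.0, 1.0) for _ in range(1_000)]
same = [rng.gauss(0.0, 1.0) for _ in range(1_000)]     # no drift
drifted = [rng.gauss(0.5, 1.0) for _ in range(1_000)]  # mean shifted by 0.5

print(f"no drift: D={ks_two_sample_statistic(baseline, same):.3f}")
print(f"drifted:  D={ks_two_sample_statistic(baseline, drifted):.3f}")
```

With large samples, even a tiny shift yields a small p-value, so it can help to alert on the size of the statistic as well as on the p-value.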
Capital Budgeting Monte Carlo (Ray)
Business Context: Finance teams evaluate capital investments (new plants, equipment, acquisitions) under uncertainty. Traditional DCF analysis uses single “best estimate” values, whereas Monte Carlo simulation produces probability distributions for NPV, IRR, and payback period, which supports risk-informed decision making.
spark-bestfit Features Used:
FitterConfigBuilder for reusable configuration
RayBackend for distributed computation
GaussianCopula for correlated economic parameters
Multi-column fitting for uncertain inputs (growth rates, costs, discount rates)
Notebook: examples/ray/usecase_capital_budgeting.ipynb
from spark_bestfit import (
    DistributionFitter, FitterConfigBuilder,
    GaussianCopula, RayBackend
)
# Create reusable configuration
config = (FitterConfigBuilder()
    .with_bins(50)
    .with_sampling(enabled=False)
    .with_lazy_metrics(False)
    .build())
# Fit distributions to uncertain parameters
backend = RayBackend()
fitter = DistributionFitter(backend=backend)
results = fitter.fit(
    historical_data,
    columns=['revenue_growth', 'cost_ratio', 'discount_rate'],
    config=config
)
# Capture parameter correlations
copula = GaussianCopula.fit(results, historical_data, backend=backend)
# Generate 10,000 correlated scenarios
scenarios = copula.sample(n=10000, random_state=42)
# Calculate NPV, IRR, Payback for each scenario
# Analyze P(NPV > 0), VaR, sensitivity rankings
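The per-scenario NPV step can be sketched as follows. This is an illustration under simplifying assumptions, not the notebook's model: the helper names `npv` and `project_npv`, the cash-flow structure, and all parameter values are hypothetical, and the scenarios are drawn independently rather than from a copula.

```python
import random

def npv(rate, cash_flows):
    """Net present value; cash_flows[0] occurs today, cash_flows[t] at end of year t."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def project_npv(initial_outlay, first_year_cash, growth, discount_rate, years=5):
    """NPV of a project whose annual cash flow grows at a constant rate."""
    flows = [-initial_outlay] + [first_year_cash * (1 + growth) ** t for t in range(years)]
    return npv(discount_rate, flows)

# Stand-in scenarios; independent draws here, whereas a copula would correlate them.
rng = random.Random(42)
scenarios = [
    {"growth": rng.gauss(0.03, 0.02), "discount_rate": rng.uniform(0.06, 0.12)}
    for _ in range(10_000)
]

npvs = [project_npv(1_000.0, 250.0, s["growth"], s["discount_rate"]) for s in scenarios]
prob_positive = sum(v > 0 for v in npvs) / len(npvs)
print(f"mean NPV: {sum(npvs) / len(npvs):.1f}, P(NPV > 0): {prob_positive:.2%}")
```

Replacing the independent draws with correlated scenarios changes the spread of the NPV distribution, which is why the copula step matters for the risk figures.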
Which Use Case Fits Your Needs?
| Use Case | Key Features | Best For |
|---|---|---|
| Monte Carlo | Copula, sampling | Risk management, finance, simulations |
| Synthetic Data | Multi-column, serialization | ML training, privacy, testing |
| A/B Testing | Bounded, bootstrap CI | Product experiments, marketing |
| Insurance | Heavy-tail, discrete | Actuarial, loss modeling |
| Model Validation | KS/AD tests, p-values | Regulatory compliance, backtesting |
| Discrete Event Simulation | get_scipy_dist(), sampling | Operations, staffing, capacity planning |
| Data Drift Detection | KS tests, multi-feature monitoring | ML monitoring, data quality, model retraining |
| Capital Budgeting (Ray) | FitterConfigBuilder, RayBackend, NPV/IRR | Investment decisions, project evaluation |
See Also
Quick Start - Basic usage and installation
FitterConfig Builder - FitterConfig builder pattern
Gaussian Copula - Detailed copula documentation
Distributed Sampling - Distributed sampling guide
Bounded Distribution Fitting - Bounded distribution fitting