Real-World Use Cases

This guide demonstrates spark-bestfit in production scenarios. Each use case includes a complete Jupyter notebook with working code you can adapt to your needs.

Monte Carlo Risk Simulation

Business Context: Financial risk managers need to estimate potential losses across portfolios of correlated assets. Monte Carlo simulation generates thousands of scenarios to calculate Value-at-Risk (VaR) and other risk metrics.

spark-bestfit Features Used:

  • GaussianCopula for modeling asset correlations

  • Multi-column fitting for portfolio assets

  • Distributed sampling for scenario generation

  • lazy_metrics=True for performance

Notebooks:

from pyspark.sql import functions as F  # needed for F.col below
from spark_bestfit import DistributionFitter, GaussianCopula

# Fit distributions to historical returns
fitter = DistributionFitter(spark)
results = fitter.fit(returns_df, columns=["AAPL", "GOOGL", "MSFT"], lazy_metrics=True)

# Model correlations from fit results
copula = GaussianCopula.fit(results, returns_df)

# Generate correlated scenarios
scenarios = copula.sample(n_samples=10000, seed=42)

# Weighted portfolio return per scenario (the input to a VaR calculation)
portfolio_returns = scenarios.select(
    (F.col("AAPL") * 0.4 + F.col("GOOGL") * 0.35 + F.col("MSFT") * 0.25).alias("portfolio")
)
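
The snippet above produces one portfolio return per scenario but does not compute the risk figure itself. One common convention is historical-simulation VaR: the loss at the empirical (1 - confidence) quantile of the simulated returns. The sketch below is plain Python and is not part of spark-bestfit; `value_at_risk` and the sample returns are hypothetical, and on a real Spark DataFrame the same quantile could be taken with `DataFrame.approxQuantile` on the `portfolio` column.

```python
import math


def value_at_risk(returns, confidence=0.95):
    """Historical-simulation VaR, reported as a positive loss."""
    if not returns:
        raise ValueError("returns must be non-empty")
    ordered = sorted(returns)  # worst returns first
    # Number of scenarios in the tail; round() guards against float noise
    # such as (1 - 0.9) * 10 evaluating to 0.9999999999999998.
    tail_count = math.ceil(round((1.0 - confidence) * len(ordered), 9))
    k = max(0, tail_count - 1)
    return -ordered[k]


# Hypothetical simulated portfolio returns, one value per scenario.
simulated = [-0.05, -0.03, -0.02, -0.01, 0.0, 0.005, 0.01, 0.015, 0.02, 0.03]
var_90 = value_at_risk(simulated, confidence=0.90)
var_70 = value_at_risk(simulated, confidence=0.70)
```

Other VaR conventions exist (interpolated quantiles, parametric VaR), so treat the quantile rule here as one reasonable choice rather than the definition.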

ML Synthetic Data Generation

Business Context: Machine learning teams need synthetic data for model training, testing, and privacy-preserving data sharing. Fitting distributions to real data enables generating statistically similar synthetic datasets.

spark-bestfit Features Used:

  • Multi-column fitting for feature columns

  • DiscreteDistributionFitter for categorical/count features

  • Serialization for saving/loading fitted models

  • Distributed sampling at scale

Notebooks:

from spark_bestfit import DistributionFitter, DiscreteDistributionFitter

# Fit continuous features
cont_fitter = DistributionFitter(spark)
cont_results = cont_fitter.fit(df, columns=["age", "income", "score"])

# Fit discrete features
disc_fitter = DiscreteDistributionFitter(spark)
disc_results = disc_fitter.fit(df, columns=["num_purchases", "category_id"])

# Save for reproducibility
cont_results.save("models/continuous_fits.json")

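# Generate synthetic data
# NOTE: sample_from_results is not imported in this snippet; it is assumed
# to be a helper defined elsewhere (for example, in the accompanying notebook).
synthetic_df = sample_from_results(cont_results, n=100000)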

A/B Test Analysis

Business Context: Product teams run experiments to measure the impact of changes. Distribution fitting helps model conversion rates, revenue per user, and other metrics with proper uncertainty quantification.

spark-bestfit Features Used:

  • Bounded fitting for proportions (0-1 range)

  • Bootstrap confidence intervals

  • lazy_metrics=True for quick model selection

Notebooks:

from spark_bestfit import DistributionFitter

fitter = DistributionFitter(spark)

# Fit bounded distributions to conversion rates
# Beta naturally fits [0, 1] bounded data
results = fitter.fit(
    experiment_df,
    column="conversion_rate",
    bounded=True,
    lower_bound=0.0,
    upper_bound=1.0,
    lazy_metrics=True
)

# Get best fit (typically Beta for proportions)
best = results.best(n=1, metric='aic')[0]
samples = best.sample(size=10000)  # For bootstrap CI
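
The last line above draws samples "for bootstrap CI" without showing the interval itself. A minimal percentile-bootstrap sketch follows, using only the standard library. `bootstrap_ci` is a hypothetical helper, not a spark-bestfit function, and the Beta(20, 180) data stands in for real conversion rates.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.fmean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical conversion rates with a true mean of 0.10.
data_rng = random.Random(42)
rates = [data_rng.betavariate(20, 180) for _ in range(500)]

ci_low, ci_high = bootstrap_ci(rates)
observed_mean = statistics.fmean(rates)
```

To compare two experiment arms, the same helper can be applied to each arm separately, or the resampling can be done on the difference in means.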

Insurance Claims Modeling

Business Context: Actuaries model claim frequency and severity to set premiums and reserves. Heavy-tailed distributions (Pareto, lognormal) are essential for capturing extreme loss events.

spark-bestfit Features Used:

  • Heavy-tail distributions (Pareto, lognormal, Weibull)

  • DiscreteDistributionFitter for claim counts

  • Bounded fitting for capped policies

  • Q-Q plots for tail behavior validation

Notebooks:

from spark_bestfit import DistributionFitter, DiscreteDistributionFitter

# Fit claim severity (includes heavy-tailed: pareto, lognorm, burr, etc.)
severity_fitter = DistributionFitter(spark)
severity_results = severity_fitter.fit(
    claims_df,
    column="claim_amount",
    lazy_metrics=True
)

# Fit claim frequency (discrete: poisson, nbinom, etc.)
freq_fitter = DiscreteDistributionFitter(spark)
freq_results = freq_fitter.fit(claims_df, column="num_claims")

# Get best fits and visualize
best_severity = severity_results.best(n=1, metric='aic')[0]
severity_fitter.plot(best_severity, claims_df, "claim_amount")
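
Frequency and severity fits are usually combined into an aggregate loss distribution. The sketch below simulates annual losses as a compound Poisson sum in plain Python. The Poisson rate and lognormal parameters are made-up values standing in for fitted ones, and `poisson_draw` and `simulate_annual_losses` are hypothetical helpers rather than library functions.

```python
import math
import random


def poisson_draw(rng, lam):
    """One Poisson(lam) draw via Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_annual_losses(n_years, claim_rate, mu, sigma, seed=0):
    """Total loss per simulated year: a Poisson number of lognormal claims."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_years):
        n_claims = poisson_draw(rng, claim_rate)
        totals.append(sum(rng.lognormvariate(mu, sigma) for _ in range(n_claims)))
    return totals


# Assumed parameters: 3 claims per year, lognormal(mu=8.0, sigma=0.5) severity.
losses = simulate_annual_losses(20000, claim_rate=3.0, mu=8.0, sigma=0.5, seed=7)
mean_loss = sum(losses) / len(losses)
p99_loss = sorted(losses)[int(0.99 * len(losses))]  # tail figure for reserving
```

The simulated mean should sit close to the analytic value `claim_rate * exp(mu + sigma**2 / 2)`, which is a quick sanity check before trusting the tail percentiles.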

Risk Model Validation

Business Context: Financial regulators require statistical validation of risk models. Unlike model selection (AIC), model validation asks: “Does the data actually come from this distribution?” The Kolmogorov-Smirnov (KS) test provides formal hypothesis testing.

spark-bestfit Features Used:

  • lazy_metrics=False to compute KS and Anderson-Darling statistics

  • metric='ks_statistic' for goodness-of-fit based selection

  • metric='ad_statistic' for tail-sensitive validation

  • fit.pvalue for hypothesis test interpretation

Notebooks:

from spark_bestfit import DistributionFitter

fitter = DistributionFitter(spark)

# Fit with full metrics (not lazy) for validation
results = fitter.fit(
    returns_df,
    column="daily_return",
    lazy_metrics=False  # Compute KS and AD statistics
)

# Select by goodness-of-fit (KS statistic) rather than by an information criterion such as AIC
best = results.best(n=1, metric='ks_statistic')[0]

# Interpret hypothesis test
if best.pvalue > 0.05:
    print("Model PASSES validation (cannot reject H0)")
else:
    print("Model FAILS validation (reject H0)")
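
To make the KS statistic concrete, the sketch below computes the one-sample statistic by hand: the largest gap between the empirical CDF and a reference CDF. It uses only the standard library; `ks_statistic` is a hypothetical helper, and the normal reference distribution is an assumption for illustration. The constant 1.36 / sqrt(n) is the usual asymptotic 5% critical value for a fully specified reference distribution. When the parameters were estimated from the same data, that threshold tends to be too lenient, so a pass should be read with some caution.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """One-sample KS statistic: max gap between the empirical CDF and `cdf`."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


n = 200
reference = NormalDist(mu=0.0, sigma=0.01)  # assumed daily-return model

# A sample placed exactly on the reference quantiles, and a shifted copy.
well_matched = [reference.inv_cdf((i + 0.5) / n) for i in range(n)]
shifted = [x + 0.01 for x in well_matched]

d_good = ks_statistic(well_matched, reference.cdf)
d_bad = ks_statistic(shifted, reference.cdf)
critical_5pct = 1.36 / n ** 0.5  # asymptotic approximation
```

Here `d_good` falls well under the critical value while `d_bad` exceeds it, which mirrors the pass/fail branch in the snippet above.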

Discrete Event Simulation

Business Context: Operations teams need to answer “what-if” questions about staffing, capacity, and process changes. Rather than experimenting with real operations (expensive and risky), you can fit distributions to historical data and simulate scenarios.

spark-bestfit Features Used:

  • DistributionFitter for inter-arrival and service times

  • DiscreteDistributionFitter for hourly/daily volumes

  • lazy_metrics=False to validate distributional assumptions

  • get_scipy_dist() to sample from fitted distributions in simulations

Notebooks:

from spark_bestfit import DistributionFitter

fitter = DistributionFitter(spark)

# Fit distributions to operational data
arrival_results = fitter.fit(df, column='inter_arrival_seconds', lazy_metrics=False)
service_results = fitter.fit(df, column='service_time_seconds', lazy_metrics=False)

# Get best fits for simulation
arrival_dist = arrival_results.best(n=1, metric='aic')[0].get_scipy_dist()
service_dist = service_results.best(n=1, metric='aic')[0].get_scipy_dist()

# Simulate queue with fitted distributions
inter_arrivals = arrival_dist.rvs(size=1000)
service_times = service_dist.rvs(size=1000)

# Run what-if scenarios (add agents, change volume, etc.)
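
The what-if step can be sketched for the simplest case, a single first-come-first-served server, using the Lindley recursion: each customer waits for whatever backlog the previous one left behind. `simulate_waits` is a hypothetical helper, and the short hand-written lists stand in for the arrays returned by `arrival_dist.rvs(...)` and `service_dist.rvs(...)`.

```python
def simulate_waits(inter_arrivals, service_times):
    """Queueing delay per customer at one FIFO server (Lindley recursion).

    inter_arrivals[k] is the gap between customer k-1 and customer k;
    inter_arrivals[0] is unused because the first customer never waits.
    """
    if not service_times:
        return []
    waits = [0.0]
    for k in range(1, len(service_times)):
        backlog = waits[-1] + service_times[k - 1] - inter_arrivals[k]
        waits.append(max(0.0, backlog))
    return waits


# Hypothetical data in seconds.
inter_arrivals = [0.0, 2.0, 2.0, 10.0]
service_times = [5.0, 1.0, 1.0, 1.0]

baseline_waits = simulate_waits(inter_arrivals, service_times)

# What-if: service is twice as fast (e.g. better tooling for agents).
faster_waits = simulate_waits(inter_arrivals, [s / 2 for s in service_times])
```

Multi-server or priority queues need an event-driven simulator; the recursion above only covers the single-server case.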

Data Drift Detection

Business Context: Production ML models degrade when underlying data distributions shift. Drift detection monitors feature distributions over time and alerts when significant changes occur, enabling proactive model retraining and data quality monitoring.

spark-bestfit Features Used:

  • lazy_metrics=False for KS statistics and p-values

  • DistributionFitter for baseline and monitoring period fitting

  • scipy.stats.ks_2samp for direct sample comparison

  • Multi-column fitting for multi-feature monitoring

Notebook: examples/spark/usecase_drift_detection.ipynb

from spark_bestfit import DistributionFitter
from scipy.stats import ks_2samp

fitter = DistributionFitter(spark)

# Establish baseline from historical data
baseline_results = fitter.fit(
    baseline_df,
    column='feature',
    lazy_metrics=False  # Need KS statistics
)
baseline_samples = baseline_df.toPandas()['feature'].values

# Monitor new data for drift
new_samples = new_df.toPandas()['feature'].values
ks_stat, p_value = ks_2samp(baseline_samples, new_samples)

if p_value < 0.05:
    print(f"DRIFT DETECTED: KS={ks_stat:.4f}, p={p_value:.4e}")
    # Trigger retraining, alert, or investigation
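
When many features are monitored at once, testing each at p < 0.05 raises the chance of a false alarm somewhere. One simple guard is a Bonferroni correction, which divides the significance level by the number of features. The sketch below assumes the per-feature p-values have already been computed (for example with `ks_2samp`); `flag_drift` and the feature names are hypothetical.

```python
def flag_drift(p_values, alpha=0.05):
    """Features whose p-value clears a Bonferroni-corrected threshold."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return sorted(name for name, p in p_values.items() if p < threshold)


# Hypothetical p-values from per-feature two-sample KS tests.
p_values = {"age": 0.030, "income": 0.0004, "score": 0.40, "tenure": 0.012}

drifted = flag_drift(p_values)             # corrected threshold: 0.05 / 4
uncorrected = flag_drift(p_values, alpha=0.05 * len(p_values))
```

Bonferroni is conservative; with dozens of correlated features a false-discovery-rate procedure may be a better fit, at the cost of a little more code.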

Capital Budgeting Monte Carlo (Ray)

Business Context: Finance teams evaluate capital investments (new plants, equipment, acquisitions) under uncertainty. Traditional DCF analysis uses single “best estimate” values, whereas Monte Carlo simulation produces probability distributions for NPV, IRR, and payback period, which supports risk-informed decision making.

spark-bestfit Features Used:

  • FitterConfigBuilder for reusable configuration

  • RayBackend for distributed computation

  • GaussianCopula for correlated economic parameters

  • Multi-column fitting for uncertain inputs (growth rates, costs, discount rates)

Notebook: examples/ray/usecase_capital_budgeting.ipynb

from spark_bestfit import (
    DistributionFitter, FitterConfigBuilder,
    GaussianCopula, RayBackend
)

# Create reusable configuration
config = (FitterConfigBuilder()
    .with_bins(50)
    .with_sampling(enabled=False)
    .with_lazy_metrics(False)
    .build())

# Fit distributions to uncertain parameters
backend = RayBackend()
fitter = DistributionFitter(backend=backend)
results = fitter.fit(
    historical_data,
    columns=['revenue_growth', 'cost_ratio', 'discount_rate'],
    config=config
)

# Capture parameter correlations
copula = GaussianCopula.fit(results, historical_data, backend=backend)

# Generate 10,000 correlated scenarios
scenarios = copula.sample(n=10000, random_state=42)

# Calculate NPV, IRR, Payback for each scenario
# Analyze P(NPV > 0), VaR, sensitivity rankings
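
The per-scenario calculation left as comments above can be sketched in plain Python. The cash-flow model here (a fixed upfront investment, revenue growing at a constant rate, costs as a fixed share of revenue) is an assumption made for illustration, as are the helper names and the four hand-written scenarios that stand in for copula samples.

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[0] occurs today (t = 0)."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))


def project_cash_flows(investment, base_revenue, growth, cost_ratio, years):
    """Upfront outlay followed by `years` of growing net cash flows."""
    flows = [-investment]
    revenue = base_revenue
    for _ in range(years):
        revenue *= 1.0 + growth
        flows.append(revenue * (1.0 - cost_ratio))
    return flows


# Hypothetical scenarios; in practice one row per sampled scenario.
scenarios = [
    {"revenue_growth": 0.05, "cost_ratio": 0.60, "discount_rate": 0.08},
    {"revenue_growth": 0.10, "cost_ratio": 0.50, "discount_rate": 0.07},
    {"revenue_growth": 0.00, "cost_ratio": 0.70, "discount_rate": 0.10},
    {"revenue_growth": 0.08, "cost_ratio": 0.55, "discount_rate": 0.09},
]

npvs = [
    npv(
        s["discount_rate"],
        project_cash_flows(1000.0, 500.0, s["revenue_growth"], s["cost_ratio"], years=5),
    )
    for s in scenarios
]
prob_positive = sum(v > 0 for v in npvs) / len(npvs)  # estimate of P(NPV > 0)
```

With thousands of scenarios the same list of NPVs also yields percentiles for downside analysis and, by correlating NPV with each input, a rough sensitivity ranking.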

Which Use Case Fits Your Needs?

Use Case                   Key Features                              Best For
-------------------------  ----------------------------------------  ---------------------------------------------
Monte Carlo                Copula, sampling                          Risk management, finance, simulations
Synthetic Data             Multi-column, serialization               ML training, privacy, testing
A/B Testing                Bounded, bootstrap CI                     Product experiments, marketing
Insurance                  Heavy-tail, discrete                      Actuarial, loss modeling
Model Validation           KS/AD tests, p-values                     Regulatory compliance, backtesting
Discrete Event Simulation  get_scipy_dist(), sampling                Operations, staffing, capacity planning
Data Drift Detection       KS tests, multi-feature monitoring        ML monitoring, data quality, model retraining
Capital Budgeting (Ray)    FitterConfigBuilder, RayBackend, NPV/IRR  Investment decisions, project evaluation

See Also