Gaussian Copula¶
The GaussianCopula class enables correlated multi-column sampling at scale.
Unlike standard copula libraries that require loading data into memory, spark-bestfit
computes correlation via Spark ML and generates samples across the cluster.
Why Use a Copula?¶
When you fit distributions to multiple columns independently, the correlation structure between columns is lost:
# Independent fitting loses correlation
results = fitter.fit(df, columns=["price", "quantity", "revenue"])
# Sampling each column independently - correlation is LOST
price_samples = results.for_column("price").best(n=1)[0].sample(1000)
quantity_samples = results.for_column("quantity").best(n=1)[0].sample(1000)
# These are uncorrelated! Not realistic.
A Gaussian copula preserves both:
Marginal distributions: Each column follows its fitted distribution
Correlation structure: Columns maintain their original relationships
When to Use spark-bestfit Copula¶
spark-bestfit is not faster than statsmodels for small data. The value is scale:
Scenario |
statsmodels |
spark-bestfit |
|---|---|---|
Data < 10M rows |
Faster (use this) |
Slower (Spark overhead) |
Data > 100M rows |
Crashes (OOM) |
Works (distributed) |
Data already in Spark |
Requires |
Native (no conversion) |
100M+ samples needed |
May OOM |
sample_distributed() scales |
Basic Usage¶
Fit a copula from multi-column fit results:
from spark_bestfit import DistributionFitter, GaussianCopula
from spark_bestfit.backends import BackendFactory
# Fit multiple columns
fitter = DistributionFitter(spark)
results = fitter.fit(df, columns=["price", "quantity", "revenue"])
# Fit copula - correlation computed via Spark ML (scales to billions)
copula = GaussianCopula.fit(results, df)
# Generate correlated samples locally
samples = copula.sample(n=10000) # Dict[str, np.ndarray]
# Or distributed via any backend
backend = BackendFactory.create("spark", spark_session=spark)
samples_df = copula.sample_distributed(n=100_000_000, backend=backend)
The df parameter is required to compute the correlation matrix. The copula
uses Spearman rank correlation, which is robust to non-linear relationships.
Local vs Distributed Sampling¶
Method |
Use Case |
Output |
|---|---|---|
|
Small to medium samples (< 10M) |
Dict[str, np.ndarray] |
|
Large samples (> 10M) |
DataFrame (Spark/pandas) |
For small samples, sample() is faster due to distributed overhead:
# Local sampling - fast for small n
samples = copula.sample(n=10000, random_state=42)
df = pd.DataFrame(samples)
# Distributed sampling - efficient for large n
backend = BackendFactory.create("spark", spark_session=spark)
samples_df = copula.sample_distributed(n=100_000_000, backend=backend, random_seed=42)
Backend Options¶
Use any backend for distributed copula sampling:
from spark_bestfit.backends import BackendFactory
# Spark
backend = BackendFactory.create("spark", spark_session=spark)
samples_df = copula.sample_distributed(n=100_000_000, backend=backend)
# Ray
backend = BackendFactory.create("ray")
samples_df = copula.sample_distributed(n=100_000_000, backend=backend)
# Local (for testing)
backend = BackendFactory.create("local", max_workers=4)
samples_df = copula.sample_distributed(n=100_000, backend=backend)
Fast Uniform Sampling¶
Both sample() and sample_distributed() support a return_uniform=True parameter
that skips the marginal distribution transforms, returning uniform [0,1] samples instead.
This matches the behavior of statsmodels and is significantly faster:
# Fast path - returns uniform samples without marginal transforms
uniform_samples = copula.sample(n=10_000_000, return_uniform=True)
# Full transform - slower but returns samples in fitted marginal distributions
marginal_samples = copula.sample(n=10_000_000)
When to use ``return_uniform=True``:
You only need the correlation structure, not the exact marginal distributions
You’re doing correlation analysis or downstream transforms
Performance is critical
Performance Benchmarks¶
Sampling performance comparison (3-column copula, local mode):
N Samples |
statsmodels |
return_uniform |
with transform |
|---|---|---|---|
1,000,000 |
73 ms |
56 ms |
1,547 ms |
10,000,000 |
725 ms |
555 ms |
15,485 ms |
50,000,000 |
3,706 ms |
2,820 ms |
77,820 ms |
Key findings:
return_uniform=Trueis ~24% faster than statsmodels (same output format)Full marginal transforms add ~28x overhead due to scipy’s PPF using iterative root-finding
Use
return_uniform=Truewhen you don’t need the exact marginal distributions
Fast PPF Optimization¶
Added in version 2.7.0.
For common distributions, spark-bestfit bypasses scipy’s generic PPF machinery (which uses iterative root-finding) by calling scipy.special functions directly. This optimization is applied automatically during copula sampling.
Supported distributions with fast PPF:
norm- Normal/Gaussianexpon- Exponentialuniform- Uniformlognorm- Log-normalweibull_min- Weibull (minimum)gamma- Gammabeta- Beta
For these distributions, marginal transforms are ~10-20x faster than the generic scipy path. Other distributions automatically fall back to scipy.stats.
Usage:
No code changes required - the optimization is applied automatically in both sample()
and sample_distributed():
# Fast PPF is used automatically for supported distributions
samples = copula.sample(n=1_000_000) # Uses fast_ppf for norm, gamma, etc.
Direct access (advanced):
If you need to use the fast PPF implementation directly:
from spark_bestfit.fast_ppf import fast_ppf, has_fast_ppf
import numpy as np
# Check if a distribution has fast PPF support
has_fast_ppf("gamma") # True
has_fast_ppf("pareto") # False
# Compute PPF directly
q = np.array([0.1, 0.5, 0.9])
values = fast_ppf("gamma", (2.0, 0.0, 1.0), q) # shape=2, loc=0, scale=1
Serialization¶
Save and load copulas for later use:
# Save to JSON (recommended)
copula.save("copula.json")
# Or pickle for faster serialization
copula.save("copula.pkl")
# Load later
loaded = GaussianCopula.load("copula.json")
samples = loaded.sample(n=1000)
The JSON format includes metadata for debugging:
{
"schema_version": "1.0",
"spark_bestfit_version": "2.6.0",
"created_at": "2026-01-04T20:00:00Z",
"type": "gaussian_copula",
"column_names": ["price", "quantity", "revenue"],
"correlation_matrix": [[1.0, 0.8, 0.9], ...],
"marginals": {
"price": {"distribution": "gamma", "parameters": [2.0, 0.0, 5.0]},
...
}
}
How It Works¶
The Gaussian copula sampling process:
Fit phase: Compute Spearman correlation matrix via Spark ML (no
.toPandas())Sample phase:
Generate multivariate normal samples with the correlation matrix
Transform each normal sample -> uniform via phi (standard normal CDF)
Transform each uniform -> target marginal via PPF (inverse CDF)
This ensures that:
Each column follows its fitted marginal distribution
Columns maintain the correlation structure from the original data
Removed: sample_spark()¶
Changed in version 3.0.1: The sample_spark() method was removed in v3.0.1 (deprecated since v2.2.0).
Use sample_distributed() with an explicit backend instead.
If you are migrating from code that used sample_spark():
# Removed (was deprecated since v2.2.0)
# samples_df = copula.sample_spark(n=100_000_000)
# Use instead
from spark_bestfit.backends import BackendFactory
backend = BackendFactory.create("spark", spark_session=spark)
samples_df = copula.sample_distributed(n=100_000_000, backend=backend)
API Reference¶
See spark_bestfit.copula.GaussianCopula for full API documentation.