ADR-0007: Copula Implementation¶
- Status:
Accepted
- Date:
2025-12-30 (v1.3.0)
Context¶
spark-bestfit fits marginal distributions independently for each column. However, real-world data often has correlations between columns:
Financial data: asset returns are correlated
Sensor data: temperature and humidity co-vary
Biological data: gene expressions have dependencies
Users need to generate synthetic samples that preserve both:
Marginal distributions (individual column distributions)
Dependency structure (correlations between columns)
Copulas provide a mathematical framework to separate marginal distributions from dependency structure, enabling this.
Decision¶
We implemented Gaussian Copula in copula.py:
class GaussianCopula:
def __init__(
self,
marginals: Dict[str, FitResult],
correlation_matrix: Optional[np.ndarray] = None,
):
self.marginals = marginals
self.correlation = correlation_matrix
def fit(self, df: Any, columns: List[str]) -> "GaussianCopula":
# Compute Spearman correlation matrix
# Store with marginal distributions
def sample(self, n: int) -> pd.DataFrame:
# 1. Generate correlated normal samples via Cholesky
# 2. Transform to uniform via normal CDF
# 3. Apply inverse CDF of each marginal
Algorithm:
Fitting: Compute Spearman rank correlation (robust to non-normality)
Sampling: - Generate
Z ~ N(0, Sigma)using Cholesky decomposition:L @ standard_normal- Transform to uniform:U = Phi(Z)where Phi is standard normal CDF - Apply marginal inverse CDF:X_i = F_i^{-1}(U_i)
Optimizations (v2.7.0, v2.8.0):
# Cached Cholesky decomposition
self._cholesky = np.linalg.cholesky(self.correlation)
# Fast PPF using scipy.special.ndtri instead of norm.ppf
from scipy.special import ndtri
uniform = ndtri(standard_normal) # 10x faster than norm.cdf
Distributed sampling (v1.3.0):
def sample_distributed(
self,
n: int,
backend: ExecutionBackend,
num_partitions: Optional[int] = None,
) -> Any:
# Each partition generates subset of samples
# Returns backend-native DataFrame (Spark/Ray/pandas)
Consequences¶
Positive:
Preserves both marginal distributions and correlations
Gaussian copula is computationally efficient
Distributed sampling scales to billions of rows
Backend-agnostic via ExecutionBackend protocol
Negative:
Gaussian copula assumes elliptical dependency; tail dependencies (common in finance) are not captured
Correlation matrix must be positive semi-definite
Memory scales as O(columns^2) for correlation matrix
Neutral:
Spearman correlation chosen over Pearson for robustness to non-linearity
Future work could add t-copula for tail dependencies
References¶
PR #50: Gaussian Copula (v1.3.0)
Commit 8ea8f20: fast_ppf optimization (v2.7.0)
PR #125: Cholesky+ndtr optimization (v2.8.0)
Nelsen, R. B. (2006). “An Introduction to Copulas”
Related: ADR-0001: Multi-Backend Architecture (distributed sampling)