ADR-0007: Copula Implementation ================================ :Status: Accepted :Date: 2025-12-30 (v1.3.0) Context ------- spark-bestfit fits marginal distributions independently for each column. However, real-world data often has correlations between columns: - Financial data: asset returns are correlated - Sensor data: temperature and humidity co-vary - Biological data: gene expressions have dependencies Users need to generate synthetic samples that preserve both: 1. Marginal distributions (individual column distributions) 2. Dependency structure (correlations between columns) Copulas provide a mathematical framework to separate marginal distributions from dependency structure, enabling this. Decision -------- We implemented Gaussian Copula in ``copula.py``:: class GaussianCopula: def __init__( self, marginals: Dict[str, FitResult], correlation_matrix: Optional[np.ndarray] = None, ): self.marginals = marginals self.correlation = correlation_matrix def fit(self, df: Any, columns: List[str]) -> "GaussianCopula": # Compute Spearman correlation matrix # Store with marginal distributions def sample(self, n: int) -> pd.DataFrame: # 1. Generate correlated normal samples via Cholesky # 2. Transform to uniform via normal CDF # 3. Apply inverse CDF of each marginal **Algorithm:** 1. **Fitting**: Compute Spearman rank correlation (robust to non-normality) 2. **Sampling**: - Generate ``Z ~ N(0, Sigma)`` using Cholesky decomposition: ``L @ standard_normal`` - Transform to uniform: ``U = Phi(Z)`` where Phi is standard normal CDF - Apply marginal inverse CDF: ``X_i = F_i^{-1}(U_i)`` **Optimizations** (v2.7.0, v2.8.0):: # Cached Cholesky decomposition self._cholesky = np.linalg.cholesky(self.correlation) # Fast PPF using scipy.special.ndtri instead of norm.ppf from scipy.special import ndtri uniform = ndtri(standard_normal) # 10x faster than norm.cdf **Distributed sampling** (v1.3.0):: def sample_distributed( self, n: int, backend: ExecutionBackend, num_partitions: Optional[int] = None, ) -> Any: # Each partition generates subset of samples # Returns backend-native DataFrame (Spark/Ray/pandas) Consequences ------------ **Positive:** - Preserves both marginal distributions and correlations - Gaussian copula is computationally efficient - Distributed sampling scales to billions of rows - Backend-agnostic via ExecutionBackend protocol **Negative:** - Gaussian copula assumes elliptical dependency; tail dependencies (common in finance) are not captured - Correlation matrix must be positive semi-definite - Memory scales as O(columns^2) for correlation matrix **Neutral:** - Spearman correlation chosen over Pearson for robustness to non-linearity - Future work could add t-copula for tail dependencies References ---------- - `PR #50 `_: Gaussian Copula (v1.3.0) - `Commit 8ea8f20 `_: fast_ppf optimization (v2.7.0) - `PR #125 `_: Cholesky+ndtr optimization (v2.8.0) - Nelsen, R. B. (2006). "An Introduction to Copulas" - Related: :doc:`ADR-0001-multi-backend-architecture` (distributed sampling)