ADR-0007: Copula Implementation

Status:

Accepted

Date:

2025-12-30 (v1.3.0)

Context

spark-bestfit fits marginal distributions independently for each column. However, real-world data often has correlations between columns:

  • Financial data: asset returns are correlated

  • Sensor data: temperature and humidity co-vary

  • Biological data: gene expressions have dependencies

Users need to generate synthetic samples that preserve both:

  1. Marginal distributions (individual column distributions)

  2. Dependency structure (correlations between columns)

Copulas provide a mathematical framework to separate marginal distributions from dependency structure, enabling this.
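
For reference, the separation described above is usually stated through Sklar's theorem. The notation below (F for the joint CDF, F_i for the marginal CDFs, C for the copula, Phi_Sigma for the multivariate normal CDF with correlation matrix Sigma) is chosen here to match the symbols used later in this record, and is a summary rather than anything specific to spark-bestfit:

```latex
% Sklar's theorem: any joint CDF factors into its marginals and a copula C
F(x_1, \dots, x_d) = C\bigl(F_1(x_1), \dots, F_d(x_d)\bigr)

% Gaussian copula with correlation matrix \Sigma
C_{\Sigma}(u_1, \dots, u_d) = \Phi_{\Sigma}\bigl(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_d)\bigr)
```

The marginals F_i carry the per-column shape, while C carries only the dependency structure.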

Decision

We implemented a Gaussian copula in copula.py:

class GaussianCopula:
    def __init__(
        self,
        marginals: Dict[str, FitResult],
        correlation_matrix: Optional[np.ndarray] = None,
    ):
        self.marginals = marginals
        self.correlation = correlation_matrix

    def fit(self, df: Any, columns: List[str]) -> "GaussianCopula":
        # Compute Spearman correlation matrix
        # Store with marginal distributions
        ...  # body elided in this outline

    def sample(self, n: int) -> pd.DataFrame:
        # 1. Generate correlated normal samples via Cholesky
        # 2. Transform to uniform via normal CDF
        # 3. Apply inverse CDF of each marginal
        ...  # body elided in this outline

Algorithm:

  1. Fitting: Compute Spearman rank correlation (robust to non-normality)

  2. Sampling:

       • Generate Z ~ N(0, Sigma) using Cholesky decomposition: L @ standard_normal

       • Transform to uniform: U = Phi(Z), where Phi is the standard normal CDF

       • Apply the marginal inverse CDF: X_i = F_i^{-1}(U_i)
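
The sampling steps above can be sketched as follows. This is a minimal illustration, not the code in copula.py: the function name sample_gaussian_copula is hypothetical, and frozen scipy.stats distributions stand in for the fitted marginals.

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import ndtr


def sample_gaussian_copula(marginals, correlation, n, seed=None):
    """Draw n rows with the given marginals and Gaussian-copula dependence.

    marginals: dict of column name -> frozen scipy.stats distribution
               (an assumed stand-in for the fitted marginals).
    correlation: (d, d) positive-definite correlation matrix.
    """
    rng = np.random.default_rng(seed)
    columns = list(marginals)
    chol = np.linalg.cholesky(correlation)  # Sigma = L @ L.T
    # Each row is L @ eps with eps ~ N(0, I), so rows are N(0, Sigma).
    z = rng.standard_normal((n, len(columns))) @ chol.T
    u = ndtr(z)  # U = Phi(Z): every column is uniform on (0, 1)
    # X_i = F_i^{-1}(U_i): push each uniform column through its marginal PPF.
    data = {name: marginals[name].ppf(u[:, i]) for i, name in enumerate(columns)}
    return pd.DataFrame(data)


marginals = {
    "temp": stats.norm(loc=20.0, scale=5.0),
    "humidity": stats.expon(scale=2.0),
}
correlation = np.array([[1.0, 0.7], [0.7, 1.0]])
samples = sample_gaussian_copula(marginals, correlation, n=20_000, seed=0)
```

Each column of samples follows its own marginal, while the rank correlation between the columns is inherited from the correlated normal draws.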

Optimizations (v2.7.0, v2.8.0):

# Cached Cholesky decomposition
self._cholesky = np.linalg.cholesky(self.correlation)

# Fast normal CDF using scipy.special.ndtr instead of norm.cdf
# (ndtri is the inverse function, i.e. the fast counterpart of norm.ppf)
from scipy.special import ndtr
uniform = ndtr(correlated_normal)  # 10x faster than norm.cdf
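
As a quick check of which scipy.special function does what, the sketch below compares both against their scipy.stats.norm equivalents. It only illustrates the relationship between the functions; the variable names are illustrative and no timing claim is tested here.

```python
import numpy as np
from scipy.special import ndtr, ndtri
from scipy.stats import norm

z = np.linspace(-4.0, 4.0, 9)

# ndtr is the standard normal CDF: it maps normal draws into (0, 1).
u = ndtr(z)
cdf_matches = np.allclose(u, norm.cdf(z))

# ndtri is the inverse CDF (PPF): it maps (0, 1) back to the real line.
ppf_matches = np.allclose(ndtri(u), norm.ppf(u))
round_trip = np.allclose(ndtri(u), z)
```

The normal-to-uniform step in sampling therefore needs ndtr; ndtri only applies where uniforms are converted back to normal scores.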

Distributed sampling (v1.3.0):

def sample_distributed(
    self,
    n: int,
    backend: ExecutionBackend,
    num_partitions: Optional[int] = None,
) -> Any:
    # Each partition generates subset of samples
    # Returns backend-native DataFrame (Spark/Ray/pandas)
    ...  # body elided in this outline
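
One way to split the work so that each partition generates its own subset is sketched below. This is an assumption-laden illustration, not the ExecutionBackend implementation: partition_plan and sample_partition are hypothetical helpers, and a plain list comprehension stands in for the backend's parallel map.

```python
import numpy as np


def partition_plan(n, num_partitions, seed=0):
    """Split n rows across partitions and give each an independent seed."""
    base, extra = divmod(n, num_partitions)
    sizes = [base + (1 if i < extra else 0) for i in range(num_partitions)]
    # spawn() yields statistically independent child seeds, so partitions
    # do not repeat each other's random stream.
    child_seeds = np.random.SeedSequence(seed).spawn(num_partitions)
    return list(zip(sizes, child_seeds))


def sample_partition(size, seed_seq, chol):
    """Generate one partition's correlated normal draws (would run on a worker)."""
    rng = np.random.default_rng(seed_seq)
    return rng.standard_normal((size, chol.shape[0])) @ chol.T


chol = np.linalg.cholesky(np.array([[1.0, 0.5], [0.5, 1.0]]))
plan = partition_plan(n=10, num_partitions=3)
parts = [sample_partition(size, seq, chol) for size, seq in plan]
combined = np.vstack(parts)
```

Only the small Cholesky factor and the marginal parameters need to reach each worker; the samples themselves never pass through the driver.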

Consequences

Positive:

  • Preserves both marginal distributions and correlations

  • Gaussian copula is computationally efficient

  • Distributed sampling scales to billions of rows

  • Backend-agnostic via ExecutionBackend protocol

Negative:

  • Gaussian copula assumes elliptical dependency; tail dependencies (common in finance) are not captured

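  • Correlation matrix must be positive definite for the Cholesky decomposition; a matrix that is only positive semi-definite (singular) can fail to factorize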

  • Memory scales as O(columns^2) for correlation matrix
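
The definiteness constraint can be checked before sampling, since np.linalg.cholesky raises LinAlgError for a matrix that is not positive definite. The sketch below shows such a check and one simple repair (eigenvalue clipping followed by rescaling to a unit diagonal). Both helper names are hypothetical, and the repair is an illustrative choice, not something this ADR states the library does.

```python
import numpy as np


def is_positive_definite(matrix):
    """Return True if the Cholesky factorization succeeds."""
    try:
        np.linalg.cholesky(matrix)
        return True
    except np.linalg.LinAlgError:
        return False


def clip_to_correlation(matrix, min_eig=1e-6):
    """Clip eigenvalues, then rescale to a unit diagonal.

    A simple repair; it is not the optimal nearest correlation matrix.
    """
    sym = (matrix + matrix.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, min_eig, None)
    repaired = (eigvecs * clipped) @ eigvecs.T
    scale = 1.0 / np.sqrt(np.diag(repaired))
    return repaired * np.outer(scale, scale)


# Each pair looks plausible, but the three correlations are jointly impossible.
bad = np.array([
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
])
fixed = clip_to_correlation(bad)
```

Matrices like bad can arise when pairwise correlations are estimated on different subsets of rows, for example with missing values.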

Neutral:

  • Spearman correlation chosen over Pearson because it depends only on ranks, making it robust to non-normal marginals and to non-linear (monotone) relationships

  • Future work could add t-copula for tail dependencies
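
The Spearman choice can be illustrated with a small experiment: applying strictly increasing transforms to each column changes the marginals but leaves the ranks, and therefore the Spearman correlation, unchanged, while the Pearson correlation shifts. The data and column names below are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.standard_normal(5_000)
y = 0.8 * x + 0.6 * rng.standard_normal(5_000)

raw = pd.DataFrame({"x": x, "y": y})
# Strictly increasing transforms: new marginal shapes, same ordering.
skewed = pd.DataFrame({"x": np.exp(x), "y": y ** 3})

spearman_raw = raw.corr(method="spearman").loc["x", "y"]
spearman_skewed = skewed.corr(method="spearman").loc["x", "y"]
pearson_raw = raw.corr(method="pearson").loc["x", "y"]
pearson_skewed = skewed.corr(method="pearson").loc["x", "y"]
```

Because the copula is defined on ranks (the uniform scale), a rank-based estimate measures the dependency without being distorted by the shape of each column.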

References