Multivariate Normal¶
Added in version 2.9.0.
The MultivariateNormalFitter class fits a multivariate normal (MVN) distribution
directly, as an alternative to copula-based correlation modeling. Use it when the joint
distribution is assumed to be multivariate normal and you need interpretable joint
parameters (mean vector, covariance matrix).
When to Use Multivariate Normal¶
This is an alternative to copulas, not a replacement:
| Scenario | GaussianCopula | MultivariateNormalFitter |
|---|---|---|
| Arbitrary marginals | Recommended | Not suitable |
| Joint MVN assumption | Works (via correlations) | Preferred (direct fit) |
| Interpretable parameters | Marginals + correlation | Mean + covariance |
| Statistical testing (MVN) | Not applicable | Preferred |
| Many columns (20+) | Scales better | May be unstable |
Use copulas when you want flexible marginal distributions. Use multivariate normal when your data is truly jointly normal and you need the mean vector and covariance matrix.
Basic Usage¶
Fit a multivariate normal distribution to multi-column data:
from spark_bestfit import MultivariateNormalFitter, LocalBackend
import numpy as np
import pandas as pd
# Sample data
df = pd.DataFrame({
"x": np.random.normal(10, 2, 1000),
"y": np.random.normal(20, 3, 1000),
"z": np.random.normal(30, 4, 1000),
})
# Fit multivariate normal
fitter = MultivariateNormalFitter(backend=LocalBackend())
result = fitter.fit(df, columns=["x", "y", "z"])
# Access fitted parameters
print(result.mean) # mean vector, close to [10, 20, 30] for this data
print(result.cov) # 3x3 covariance matrix
# Generate correlated samples
samples = result.sample(n=10000)
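For intuition, the quantities a multivariate normal fit produces can be reproduced with plain NumPy. The sketch below is not spark_bestfit's implementation; it only assumes that the fit amounts to estimating a per-column mean and a sample covariance matrix, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 1000 rows, 3 independent normal columns (illustrative only).
data = np.column_stack([
    rng.normal(10, 2, 1000),
    rng.normal(20, 3, 1000),
    rng.normal(30, 4, 1000),
])

# Mean vector: one entry per column.
mean = data.mean(axis=0)

# Covariance matrix: rowvar=False treats columns as variables;
# ddof=1 gives the unbiased estimator.
cov = np.cov(data, rowvar=False, ddof=1)

# Draw new rows from the estimated distribution.
samples = rng.multivariate_normal(mean, cov, size=10_000)

print(mean.shape, cov.shape, samples.shape)
```

Because the three columns here are generated independently, the off-diagonal covariance entries come out close to zero; correlated input data would give non-zero off-diagonals.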
With Spark¶
The fitter works with Spark DataFrames:
from spark_bestfit import MultivariateNormalFitter
from spark_bestfit.backends import BackendFactory
# Fit from a Spark DataFrame (spark_df and the SparkSession spark are assumed to exist)
fitter = MultivariateNormalFitter() # Auto-detects backend
result = fitter.fit(spark_df, columns=["price", "quantity", "revenue"])
# Distributed sampling
backend = BackendFactory.create("spark", spark_session=spark)
samples_df = result.sample_distributed(n=100_000_000, backend=backend)
Local vs Distributed Sampling¶
| Method | Use Case | Output |
|---|---|---|
| sample() | Small to medium samples (< 10M) | Dict[str, np.ndarray] |
| sample_distributed() | Large samples (> 10M) | DataFrame (Spark/pandas) |
# Local sampling - fast for small n
samples = result.sample(n=10000, random_state=42)
df = pd.DataFrame(samples)
# Distributed sampling - efficient for large n
backend = BackendFactory.create("spark", spark_session=spark)
samples_df = result.sample_distributed(n=100_000_000, backend=backend, random_seed=42)
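One general way to generate very large samples reproducibly is to split the request into chunks, each with its own independent random stream. The sketch below illustrates that idea with NumPy only; it is an assumption about how such sampling can be organized, not a description of how sample_distributed works internally, and sample_in_chunks is a hypothetical helper.

```python
import numpy as np

def sample_in_chunks(mean, cov, n, n_chunks, seed):
    """Yield MVN sample arrays whose row counts sum to n (hypothetical helper)."""
    # spawn() derives statistically independent child seeds from one root seed.
    child_seeds = np.random.SeedSequence(seed).spawn(n_chunks)
    base, extra = divmod(n, n_chunks)
    for i, child in enumerate(child_seeds):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        rng = np.random.default_rng(child)
        yield rng.multivariate_normal(mean, cov, size=size)

mean = np.array([10.0, 20.0, 30.0])
cov = np.diag([4.0, 9.0, 16.0])

chunks = list(sample_in_chunks(mean, cov, n=10_003, n_chunks=4, seed=42))
total_rows = sum(len(chunk) for chunk in chunks)
print(total_rows)
```

Each chunk could be produced by a separate worker, and rerunning with the same seed and chunk count reproduces the same rows.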
Result Methods¶
The MultivariateNormalResult provides several useful methods:
PDF and Log-PDF:
# Evaluate probability density
point = result.mean
density = result.pdf(point) # PDF at mean (maximum)
# Log-PDF (more numerically stable)
log_density = result.logpdf(point)
# Batch evaluation
points = np.array([[10, 20, 30], [11, 21, 31]])
densities = result.pdf(points)
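The reason the log-PDF is more stable can be seen with SciPy's scipy.stats.multivariate_normal, used here as a stand-in for the result object. The mean and covariance values are illustrative, and the sketch makes no claim about how spark_bestfit evaluates densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([10.0, 20.0, 30.0])
cov = np.array([
    [4.0, 2.0, 1.0],
    [2.0, 9.0, 3.0],
    [1.0, 3.0, 16.0],
])
dist = multivariate_normal(mean=mean, cov=cov)

# Batch evaluation: one density per row.
points = np.array([[10, 20, 30], [11, 21, 31]])
densities = dist.pdf(points)

# Far from the mean the density underflows to zero in double precision,
# while the log-density stays finite and can still be compared or summed.
far_point = mean + 200.0
pdf_far = dist.pdf(far_point)
logpdf_far = dist.logpdf(far_point)

print(densities.shape, pdf_far, np.isfinite(logpdf_far))
```

This matters most when summing log-likelihoods over many rows, where multiplying raw densities would quickly underflow.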
Mahalanobis Distance:
The Mahalanobis distance accounts for the covariance structure and is useful for outlier detection:
# Distance from distribution center
distances = result.mahalanobis(test_data)
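# Outlier detection: a cutoff of 3.0 covers ~99.7% only for a single column.
# For d columns the squared distance follows a chi-square distribution with
# d degrees of freedom, so 3.0 covers ~97.1% of points when d = 3.
outliers = distances > 3.0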
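A dimension-aware cutoff can be derived from the chi-square distribution, since squared Mahalanobis distances of MVN data with d columns follow a chi-square distribution with d degrees of freedom. The sketch below computes the distances by hand with NumPy and SciPy; it assumes the distance (not its square) is the quantity being thresholded, and the data and names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

mean = np.array([10.0, 20.0, 30.0])
cov = np.array([
    [4.0, 2.0, 1.0],
    [2.0, 9.0, 3.0],
    [1.0, 3.0, 16.0],
])

rng = np.random.default_rng(42)
test_data = rng.multivariate_normal(mean, cov, size=5000)

# d(x) = sqrt((x - mean)^T cov^{-1} (x - mean)), computed row by row.
diff = test_data - mean
solved = np.linalg.solve(cov, diff.T).T  # avoids forming the explicit inverse
distances = np.sqrt(np.einsum("ij,ij->i", diff, solved))

# Cutoff that covers 99.7% of an MVN with d columns.
d = len(mean)
threshold = np.sqrt(chi2.ppf(0.997, df=d))
outliers = distances > threshold

# Coverage actually obtained with a fixed cutoff of 3.0 in d dimensions.
coverage_at_3 = chi2.cdf(3.0 ** 2, df=d)

print(round(threshold, 2), round(coverage_at_3, 4), outliers.mean())
```

For three columns the 99.7% cutoff is noticeably larger than 3.0, and the gap widens as the number of columns grows.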
Correlation Matrix:
# Get correlation matrix from covariance
corr = result.correlation_matrix()
# Diagonal is 1.0, off-diagonal are correlation coefficients
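The conversion from covariance to correlation divides each entry by the product of the two corresponding standard deviations. A minimal NumPy sketch with illustrative values (not the library's code):

```python
import numpy as np

cov = np.array([
    [4.0, 2.0, 1.0],
    [2.0, 9.0, 3.0],
    [1.0, 3.0, 16.0],
])

# Standard deviations are the square roots of the diagonal (variances).
std = np.sqrt(np.diag(cov))

# corr[i, j] = cov[i, j] / (std[i] * std[j])
corr = cov / np.outer(std, std)

print(corr)
```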
Bias Parameter¶
The bias parameter controls the covariance estimator:
# Unbiased estimate (default) - ddof=1
result_unbiased = fitter.fit(df, columns=cols, bias=False)
# Biased (MLE) estimate - ddof=0
result_biased = fitter.fit(df, columns=cols, bias=True)
The two estimates differ only by a factor of (n - 1)/n, so for large samples the
difference is negligible. Use bias=True when you need the maximum likelihood estimate.
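The relationship between the two estimators can be checked with NumPy's np.cov, whose ddof argument plays the role described above (ddof=1 for unbiased, ddof=0 for MLE). This is an illustration with synthetic data, not the fitter's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))  # small n makes the difference visible
n = data.shape[0]

cov_unbiased = np.cov(data, rowvar=False, ddof=1)  # divides by n - 1
cov_mle = np.cov(data, rowvar=False, ddof=0)       # divides by n

# The MLE estimate is the unbiased one scaled by (n - 1) / n.
ratio = cov_mle[0, 0] / cov_unbiased[0, 0]
print(ratio, (n - 1) / n)
```

With n = 50 the scale factor is 0.98; at n = 100,000 it is 0.99999, which is why the choice rarely matters for large data.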
Numerical Stability¶
The fitter warns if the covariance matrix has a high condition number (> 1e10), which indicates near-collinear columns:
# Near-collinear data
df["z_copy"] = df["z"] + np.random.normal(0, 0.00001, len(df))
# This will emit a warning
result = fitter.fit(df, columns=["x", "y", "z", "z_copy"])
# Warning: Covariance matrix has high condition number (1.23e+12)...
Consider removing highly correlated columns if you see this warning.
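The same diagnostic can be run by hand before fitting. The sketch below uses np.linalg.cond (the 2-norm condition number) on a sample covariance matrix; whether the fitter computes its condition number exactly this way is an assumption. The noise scale of 1e-5 is chosen so that the near-duplicate column pushes the value past 1e10.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10, 2, 1000)
z = rng.normal(30, 4, 1000)
z_copy = z + rng.normal(0, 1e-5, 1000)  # almost an exact duplicate of z

# Well-conditioned: two unrelated columns.
cov_ok = np.cov(np.column_stack([x, z]), rowvar=False)

# Ill-conditioned: z and z_copy are nearly collinear.
cov_bad = np.cov(np.column_stack([x, z, z_copy]), rowvar=False)

cond_ok = np.linalg.cond(cov_ok)
cond_bad = np.linalg.cond(cov_bad)

print(f"{cond_ok:.2e}", f"{cond_bad:.2e}")
```

Dropping one column of each near-duplicate pair brings the condition number back to a small value.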
Serialization¶
Save and load results for later use:
# Save to JSON (recommended)
result.save("mvn_model.json")
# Or pickle for faster serialization
result.save("mvn_model.pkl")
# Load later
from spark_bestfit.multivariate import MultivariateNormalResult
loaded = MultivariateNormalResult.load("mvn_model.json")
samples = loaded.sample(n=1000)
The JSON format includes metadata:
{
"schema_version": "1.0",
"spark_bestfit_version": "2.9.0",
"created_at": "2026-01-10T22:00:00Z",
"type": "multivariate_normal",
"column_names": ["x", "y", "z"],
"mean": [10.0, 20.0, 30.0],
"cov": [[4.0, 2.0, 1.0], [2.0, 9.0, 3.0], [1.0, 3.0, 16.0]],
"n_samples": 5000
}
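Because the JSON format is plain data, a saved model can also be inspected without spark_bestfit. The sketch below parses a payload shaped like the example above using only the standard library and NumPy, and runs basic sanity checks on the covariance matrix. The checks are a suggestion, not something the library is known to perform on load.

```python
import json
import numpy as np

payload = """
{
  "schema_version": "1.0",
  "type": "multivariate_normal",
  "column_names": ["x", "y", "z"],
  "mean": [10.0, 20.0, 30.0],
  "cov": [[4.0, 2.0, 1.0], [2.0, 9.0, 3.0], [1.0, 3.0, 16.0]],
  "n_samples": 5000
}
"""

model = json.loads(payload)
columns = model["column_names"]
mean = np.asarray(model["mean"], dtype=float)
cov = np.asarray(model["cov"], dtype=float)

# Shape and symmetry checks.
is_consistent = mean.shape == (len(columns),) and cov.shape == (len(columns),) * 2
is_symmetric = np.allclose(cov, cov.T)

# Cholesky succeeds only for positive-definite matrices
# (it raises np.linalg.LinAlgError otherwise).
chol = np.linalg.cholesky(cov)

print(columns, is_consistent, is_symmetric)
```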
Comparison with Copula¶
| Feature | GaussianCopula | MultivariateNormalFitter |
|---|---|---|
| Marginal distributions | Any (100+ supported) | Gaussian only |
| Correlation type | Spearman (rank) | Pearson (linear) |
| Parameters | Per-column + correlation | Joint mean + covariance |
| Sampling process | Transform through marginals | Direct MVN sampling |
| PDF evaluation | Not available | Available |
| Outlier detection | Not built-in | Mahalanobis distance |
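The difference between the two correlation types shows up once marginals are not Gaussian. In the sketch below, which uses NumPy and SciPy only with illustrative names, a monotonic transform (exp) skews both columns. Spearman rank correlation is unchanged because ranks are preserved, while Pearson correlation shifts.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
x, y = z[:, 0], z[:, 1]

# Monotonic transform: jointly normal columns become lognormal (skewed) columns.
x_skewed, y_skewed = np.exp(x), np.exp(y)

pearson_before = np.corrcoef(x, y)[0, 1]
pearson_after = np.corrcoef(x_skewed, y_skewed)[0, 1]

spearman_before, _ = spearmanr(x, y)
spearman_after, _ = spearmanr(x_skewed, y_skewed)

print(pearson_before, pearson_after)
print(spearman_before, spearman_after)
```

This is why a rank-based measure suits arbitrary marginals, while a mean-plus-covariance description is only complete when the data are jointly normal.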
API Reference¶
See spark_bestfit.multivariate.MultivariateNormalFitter and
spark_bestfit.multivariate.MultivariateNormalResult for full API documentation.