Maximum Spacing Estimation
Added in version 2.5.0.
spark-bestfit supports Maximum Spacing Estimation (MSE) as an alternative to Maximum Likelihood Estimation (MLE) for parameter fitting. MSE is particularly robust for heavy-tailed distributions where MLE may fail or produce poor estimates.
What is Maximum Spacing Estimation?
MSE estimates distribution parameters by maximizing the geometric mean of spacings between consecutive order statistics of the CDF-transformed data.
For data points x₁ ≤ x₂ ≤ … ≤ xₙ and CDF F with parameters θ:
Transform data: uᵢ = F(xᵢ; θ) where uᵢ ∈ [0,1]
Compute spacings: Dᵢ = u₍ᵢ₎ - u₍ᵢ₋₁₎ for i = 1, …, n+1, with boundary values u₍₀₎ = 0 and u₍ₙ₊₁₎ = 1
Maximize: S(θ) = (1/(n+1)) Σᵢ log(Dᵢ), which is the log of the geometric mean of the n+1 spacings
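The three steps above can be sketched directly with NumPy and SciPy. This is a minimal illustration of the objective S(θ), not spark-bestfit's implementation: the helper names `mean_log_spacing` and `fit_by_spacing`, the Nelder-Mead optimizer, and the 1e-300 floor on spacings are all assumptions made for the example.

```python
import numpy as np
from scipy import optimize, stats


def mean_log_spacing(params, dist, sorted_data):
    """S(theta): mean log spacing of the CDF-transformed, sorted sample."""
    u = dist.cdf(sorted_data, *params)
    if not np.all(np.isfinite(u)):
        # Invalid parameters (e.g. a negative scale) give NaN CDF values.
        return -np.inf
    # Pad with the boundary values 0 and 1, which yields n + 1 spacings.
    spacings = np.diff(np.concatenate(([0.0], u, [1.0])))
    # Floor zero spacings (ties, rounding) so the logarithm stays finite.
    spacings = np.clip(spacings, 1e-300, None)
    return np.mean(np.log(spacings))


def fit_by_spacing(dist, data, start):
    """Maximize S(theta) by minimizing its negative (hypothetical helper)."""
    x = np.sort(np.asarray(data, dtype=float))
    result = optimize.minimize(
        lambda p: -mean_log_spacing(p, dist, x),
        start,
        method="Nelder-Mead",
    )
    return result.x


rng = np.random.default_rng(0)
sample = rng.normal(10.0, 2.0, size=500)
loc_hat, scale_hat = fit_by_spacing(
    stats.norm, sample, start=(np.median(sample), 1.0)
)
print(f"loc={loc_hat:.2f}, scale={scale_hat:.2f}")
```

Sorting once outside the objective keeps each evaluation to a single CDF call plus a difference, which is the main design choice in this sketch.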
Key advantages over MLE:
Always well-defined when the CDF exists (MLE can be unbounded)
More robust to outliers and extreme values
Better convergence for heavy-tailed distributions (Pareto, Cauchy, etc.)
Consistent and, under standard regularity conditions, asymptotically as efficient as MLE
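To make the first point concrete, the sketch below uses a shifted gamma distribution with shape 0.5, whose density is unbounded at the lower end of its support. As the location parameter approaches the sample minimum, the log-likelihood keeps growing, while the mean log spacing can never exceed log(1/(n+1)) because the n+1 spacings sum to 1. The data, parameter values, and helper name are illustrative assumptions, not part of spark-bestfit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.sort(rng.gamma(shape=0.5, size=200) + 3.0)  # true location is 3.0
n = x.size


def mean_log_spacing(loc):
    """Mean log spacing for a gamma(0.5) model with the given location."""
    u = stats.gamma.cdf(x, 0.5, loc=loc)
    spacings = np.diff(np.concatenate(([0.0], u, [1.0])))
    return np.mean(np.log(np.clip(spacings, 1e-300, None)))


log_likelihoods = []
spacing_scores = []
for eps in (1e-2, 1e-6, 1e-10):
    loc = x[0] - eps  # push the location toward the smallest observation
    log_likelihoods.append(stats.gamma.logpdf(x, 0.5, loc=loc).sum())
    spacing_scores.append(mean_log_spacing(loc))

# The spacing objective is bounded above by log(1 / (n + 1)).
upper_bound = np.log(1.0 / (n + 1))

for ll, s in zip(log_likelihoods, spacing_scores):
    print(f"log-likelihood={ll:.2f}  mean log spacing={s:.3f}")
```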
When to Use MSE
| Method | Best For | Limitations |
|---|---|---|
| "mle" | Most distributions, large samples | Can fail for heavy tails, unbounded likelihood |
| "mse" | Heavy-tailed distributions, outliers | Slightly slower than MLE |
| "auto" | Unknown data characteristics | Adds detection overhead |
Use MSE when:
Fitting heavy-tailed distributions (Pareto, Cauchy, Levy, etc.)
Data has extreme outliers
MLE fails to converge or produces unreasonable estimates
You want more robust parameter estimates
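If it is unclear whether a dataset falls into these cases, a quick diagnostic can help before choosing a method. The sketch below flags heavy tails by sample excess kurtosis. The function name `choose_estimation_method` and the threshold of 6.0 are hypothetical choices for illustration; this is not the rule spark-bestfit uses for its own heavy-tail detection.

```python
import numpy as np
from scipy import stats


def choose_estimation_method(values, kurtosis_threshold=6.0):
    """Hypothetical rule: suggest "mse" when excess kurtosis is large."""
    excess_kurtosis = stats.kurtosis(values, fisher=True, bias=False)
    return "mse" if excess_kurtosis > kurtosis_threshold else "mle"


rng = np.random.default_rng(1)
light_tailed = rng.normal(size=5000)
heavy_tailed = rng.pareto(1.5, size=5000) + 1.0

print(choose_estimation_method(light_tailed))
print(choose_estimation_method(heavy_tailed))
```

The returned string could then be passed as the `estimation_method` argument described in the next section.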
API: estimation_method Parameter
The estimation_method parameter accepts three values:
"mle"(default): Maximum Likelihood Estimation viascipy.stats.fit()"mse": Maximum Spacing Estimation"auto": Automatically select MSE for heavy-tailed data, MLE otherwise
Direct parameter usage:
from spark_bestfit import DistributionFitter, LocalBackend
import pandas as pd
import numpy as np
# Generate heavy-tailed data
np.random.seed(42)
data = np.random.pareto(1.5, 1000) + 1
df = pd.DataFrame({"value": data})
fitter = DistributionFitter(backend=LocalBackend())
# Use MSE for heavy-tailed data
results = fitter.fit(df, column="value", estimation_method="mse")
# Auto-detect and select appropriate method
results = fitter.fit(df, column="value", estimation_method="auto")
Via FitterConfig:
from spark_bestfit import FitterConfigBuilder
# Build config with MSE
config = (FitterConfigBuilder()
    .with_estimation_method("mse")
    .with_bins(100)
    .build())
results = fitter.fit(df, column="value", config=config)
Examples
Example 1: Fitting Pareto Distribution
Pareto distributions are notoriously difficult for MLE when the shape parameter is small. MSE handles this robustly:
from scipy import stats
import numpy as np
import pandas as pd
from spark_bestfit import DistributionFitter, LocalBackend
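# Generate Pareto data with shape=1.5. scipy's pareto already has support
# [1, inf); adding 1 shifts the support to [2, inf), i.e. loc=1.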
np.random.seed(42)
data = stats.pareto.rvs(b=1.5, size=1000, random_state=42) + 1
df = pd.DataFrame({"value": data})
fitter = DistributionFitter(backend=LocalBackend())
# MSE provides more stable estimates
results = fitter.fit(
    df,
    column="value",
    estimation_method="mse",
    max_distributions=10,
)
best = results.best(n=1)[0]
print(f"Best fit: {best.distribution}")
print(f"Parameters: {best.params}")
Example 2: Auto Mode for Unknown Data
When you don’t know if your data is heavy-tailed, use "auto":
# Auto mode detects heavy tails and switches to MSE
results = fitter.fit(
    df,
    column="value",
    estimation_method="auto",
)
# No heavy-tail warning when auto selects MSE
Example 3: Cauchy Distribution
The Cauchy distribution has no finite mean or variance, so moment-based summaries are uninformative and MLE can be unstable. MSE works well:
# Generate Cauchy data
data = stats.cauchy.rvs(loc=5.0, scale=2.0, size=500, random_state=42)
df = pd.DataFrame({"value": data})
# MSE gives stable parameter estimates
results = fitter.fit(
    df,
    column="value",
    estimation_method="mse",
    max_distributions=5,
)
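The point about the undefined mean and variance can be checked with SciPy alone. The sketch below contrasts moment-based summaries of Cauchy data with quantile-based ones: the median estimates the location, and because the quartiles of a Cauchy distribution sit at loc ± scale, half the interquartile range estimates the scale. The variable names are illustrative, and using these values as starting guesses for a fit is a suggestion rather than documented spark-bestfit behavior.

```python
import numpy as np
from scipy import stats

sample = stats.cauchy.rvs(loc=5.0, scale=2.0, size=5000, random_state=42)

# Moment-based summaries: unreliable, since the Cauchy mean and
# variance do not exist.
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)

# Quantile-based summaries: stable estimates of location and scale.
loc_guess = np.median(sample)
q25, q75 = np.percentile(sample, [25, 75])
scale_guess = (q75 - q25) / 2.0

print(f"mean={sample_mean:.2f}, std={sample_std:.2f}")
print(f"median={loc_guess:.2f}, IQR/2={scale_guess:.2f}")
```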
Low-Level API
For direct access to MSE fitting:
from spark_bestfit.fitting import fit_mse
from scipy import stats
import numpy as np
# Generate data
np.random.seed(42)
data = np.random.normal(10.0, 2.0, 1000)
# Fit using MSE
params = fit_mse(stats.norm, data)
print(f"Parameters: loc={params[0]:.2f}, scale={params[1]:.2f}")
# With initial parameter guess (for faster convergence)
params = fit_mse(stats.norm, data, initial_params=(9.0, 1.5))
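As an independent cross-check, recent SciPy releases expose maximum spacing estimation through `scipy.stats.fit` with `method="mse"`. The sketch below assumes a SciPy version that supports this argument and uses illustrative parameter bounds. It does not call spark-bestfit, so its numbers may differ slightly from what `fit_mse` returns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(10.0, 2.0, 1000)

# scipy.stats.fit keeps loc and scale fixed unless bounds are supplied,
# so both are bounded explicitly here (the ranges are assumptions).
bounds = {"loc": (0.0, 20.0), "scale": (0.1, 10.0)}

mle = stats.fit(stats.norm, data, bounds, method="mle")
mse = stats.fit(stats.norm, data, bounds, method="mse")

print(f"MLE: loc={mle.params.loc:.2f}, scale={mle.params.scale:.2f}")
print(f"MSE: loc={mse.params.loc:.2f}, scale={mse.params.scale:.2f}")
```

For well-behaved data such as this normal sample, the two methods should agree closely. Larger differences are expected for heavy-tailed data.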
Integration with Heavy-Tail Detection
MSE integrates seamlessly with spark-bestfit’s heavy-tail detection:
When estimation_method="auto", heavy-tail detection runs automatically
If heavy tails are detected, MSE is used instead of MLE
When explicitly using estimation_method="mse", the heavy-tail warning is suppressed (since you’re already using the recommended approach)
import warnings
# With auto: heavy-tail detection runs, and MSE is selected if heavy tails are found
results = fitter.fit(df, "value", estimation_method="auto")
# With explicit mse: no warning (you know what you're doing)
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    results = fitter.fit(df, "value", estimation_method="mse")
    heavy_tail_warnings = [x for x in w if "heavy-tail" in str(x.message)]
    assert len(heavy_tail_warnings) == 0  # No warning
Performance Considerations
MSE is slower than MLE because it requires numerical optimization of the spacing objective function. Typical overhead:
Small datasets (<1000 points): ~2x slower than MLE
Large datasets (>10000 points): ~1.5x slower than MLE
For performance-critical applications with known non-heavy-tailed data,
stick with the default estimation_method="mle".
References
Ranneby, B. (1984). “The Maximum Spacing Method. An Estimation Method Related to the Maximum Likelihood Method.” Scandinavian Journal of Statistics, 11(2), 93-112.