Pre-filtering
spark-bestfit supports smart pre-filtering that skips distributions mathematically incompatible with your data. This eliminates unnecessary fitting attempts based on data characteristics like skewness and kurtosis.
Why Pre-filter?
Distribution fitting is expensive. Each scipy dist.fit() call runs a numerical
optimization that takes 50-500 ms depending on the distribution. With ~90 distributions (default),
that is roughly 4.5-45 seconds of fitting work per column, yet many distributions have
intrinsic shape properties that make them poor fits for your data.
Example: If your data is clearly left-skewed (skewness < -1), distributions like
expon, gamma, chi2, and lognorm (which are intrinsically right-skewed)
cannot fit well, because scipy's loc and scale parameters shift and stretch a
distribution but never mirror it. Pre-filtering skips these before the expensive fitting step.
Filtering Layers
Pre-filtering uses a layered approach based on intrinsic shape properties (not location/scale):
| Layer | Reliability | Description |
|---|---|---|
| Skewness sign | ~95% | Skip positive-skew-only distributions for left-skewed data |
| Kurtosis | ~80% | Skip low-kurtosis distributions for heavy-tailed data (aggressive mode) |
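The two layers can be pictured as successive checks applied to each candidate distribution. The sketch below is a minimal illustration and not spark-bestfit's implementation: the ShapeProfile class, the CATALOG tags, the prefilter_sketch function, and the thresholds (skewness < -1 and kurtosis > 10, borrowed from the figures quoted on this page) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapeProfile:
    """Hypothetical shape tags for one distribution (not the library's catalog)."""
    name: str
    positive_skew_only: bool  # skewness is positive for every parameter choice
    low_kurtosis: bool        # tails assumed too light for heavy-tailed data


# Illustrative catalog. The right-skewed entries are the ones named on this page;
# the low_kurtosis tags on norm and uniform are assumptions for this sketch.
CATALOG = [
    ShapeProfile("norm", positive_skew_only=False, low_kurtosis=True),
    ShapeProfile("uniform", positive_skew_only=False, low_kurtosis=True),
    ShapeProfile("t", positive_skew_only=False, low_kurtosis=False),
    ShapeProfile("expon", positive_skew_only=True, low_kurtosis=False),
    ShapeProfile("gamma", positive_skew_only=True, low_kurtosis=False),
    ShapeProfile("chi2", positive_skew_only=True, low_kurtosis=False),
    ShapeProfile("lognorm", positive_skew_only=True, low_kurtosis=False),
]


def prefilter_sketch(catalog, skewness, kurtosis, mode="safe",
                     skew_threshold=-1.0, kurtosis_threshold=10.0):
    """Return the names of the distributions that survive the shape filters."""
    kept = []
    for dist in catalog:
        # Layer 1 (both modes): left-skewed data vs. right-skew-only shapes.
        if skewness < skew_threshold and dist.positive_skew_only:
            continue
        # Layer 2 (aggressive mode only): heavy-tailed data vs. light-tailed shapes.
        if (mode == "aggressive" and kurtosis > kurtosis_threshold
                and dist.low_kurtosis):
            continue
        kept.append(dist.name)
    return kept


# Left-skewed data drops the four right-skew-only entries.
print(prefilter_sketch(CATALOG, skewness=-2.0, kurtosis=3.0))
```

Keeping the kurtosis check behind the mode flag mirrors the reliability column: the less reliable layer only runs when explicitly requested.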
Note
We do NOT filter by support bounds (dist.a/dist.b) because scipy’s
loc parameter can shift any distribution to cover any data range.
For example, expon(loc=-100) has support [-100, inf) and can fit
negative data. Skewness and kurtosis are intrinsic shape properties that
cannot be changed by loc/scale.
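The invariance described in the note can be checked numerically. The following sketch uses only the Python standard library and hypothetical helper names (sample_skewness, sample_excess_kurtosis); it applies a shift and a positive stretch to an exponential sample, as loc and scale do, and compares the shape statistics before and after.

```python
import random


def _central_moment(xs, order):
    """Population central moment of the given order."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** order for x in xs) / len(xs)


def sample_skewness(xs):
    """Third standardized moment (biased estimator)."""
    return _central_moment(xs, 3) / _central_moment(xs, 2) ** 1.5


def sample_excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (biased estimator)."""
    return _central_moment(xs, 4) / _central_moment(xs, 2) ** 2 - 3.0


rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(10_000)]  # right-skewed sample

# Shift and stretch the sample the way loc and scale would (scale > 0).
loc, scale = -100.0, 3.0
moved = [loc + scale * x for x in data]

# The support moves to start at loc, but the shape statistics do not change.
print("min before/after:", min(data), min(moved))
print("skewness before/after:", sample_skewness(data), sample_skewness(moved))
print("kurtosis before/after:",
      sample_excess_kurtosis(data), sample_excess_kurtosis(moved))

# Only mirroring flips the sign of the skewness, and a positive scale cannot mirror.
mirrored = [-x for x in data]
print("skewness mirrored:", sample_skewness(mirrored))
```

The two skewness values agree up to floating-point rounding, which is why the filter can rely on the sign of the skewness while ignoring where the data sits on the number line.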
Using Pre-filtering
Using FitterConfig (v2.2+, recommended):
from spark_bestfit import DistributionFitter, FitterConfigBuilder
fitter = DistributionFitter(spark)
# Safe mode (recommended) - skewness filtering
config = FitterConfigBuilder().with_prefilter().build()
results = fitter.fit(df, "value", config=config)
# Aggressive mode - adds kurtosis filtering
config = FitterConfigBuilder().with_prefilter(mode="aggressive").build()
results = fitter.fit(df, "value", config=config)
Using parameters directly:
# Safe mode
results = fitter.fit(df, "value", prefilter=True)
# Aggressive mode
results = fitter.fit(df, "value", prefilter="aggressive")
# Disabled (default)
results = fitter.fit(df, "value", prefilter=False)
Performance Impact
Pre-filtering effectiveness depends on your data’s shape characteristics:
| Data Characteristic | Distributions Filtered | Example |
|---|---|---|
| Symmetric (skew ~ 0) | 0% | No shape-based filtering applies |
| Strongly left-skewed (skew < -1) | 20-30% | Positive-skew-only distributions skipped |
| Strongly right-skewed (skew > 1) | 0% | Right-skewed data fits most distributions |
| Heavy-tailed (aggressive, kurtosis > 10) | Additional 5-10% | Low-kurtosis distributions skipped |
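Typical savings: 20-30% fewer distributions to fit for strongly left-skewed data, plus a further 5-10% in aggressive mode on heavy-tailed data, translating to roughly proportional time savings during the fitting phase. Symmetric and right-skewed data see no reduction from the skewness filter.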
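To put the percentages in concrete terms, the arithmetic below combines the ~90 default distributions and the 50-500 ms per-fit range quoted earlier with the filtered fractions from the table. It is a back-of-the-envelope sketch: estimate_fit_seconds is a hypothetical helper, and it assumes fits run one after another, so actual wall-clock time on a Spark cluster will differ.

```python
def estimate_fit_seconds(n_distributions, filtered_fraction, seconds_per_fit):
    """Return (remaining fits, total seconds) if fits ran sequentially."""
    remaining = round(n_distributions * (1.0 - filtered_fraction))
    return remaining, remaining * seconds_per_fit


# 0% = no filtering; 20% and 30% = range quoted for strongly left-skewed data.
for fraction in (0.0, 0.2, 0.3):
    remaining, fast = estimate_fit_seconds(90, fraction, 0.05)
    _, slow = estimate_fit_seconds(90, fraction, 0.5)
    print(f"{fraction:.0%} filtered: {remaining} fits, {fast:.1f}-{slow:.1f} s")
```

The estimate treats every fit as equally expensive; if the skipped distributions happen to be the slow ones, the real saving is larger, and smaller if they are the fast ones.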
Fallback Behavior
If pre-filtering removes all candidate distributions (which can happen with unusual data), spark-bestfit automatically falls back to fitting all distributions and logs a warning:
WARNING: Pre-filter removed all 90 distributions; falling back to fitting all distributions
This ensures you always get results, even if the pre-filter was too aggressive.
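The fallback pattern itself is simple to express. The sketch below shows one way such a guard could look; apply_with_fallback and the logger name are hypothetical and are not taken from spark-bestfit's source.

```python
import logging

logger = logging.getLogger("prefilter_sketch")  # hypothetical logger name


def apply_with_fallback(candidates, keep):
    """Filter candidates with keep(); if nothing survives, return them all."""
    kept = [name for name in candidates if keep(name)]
    if not kept:
        # An empty result would mean no fit at all, so undo the filter instead.
        logger.warning(
            "Pre-filter removed all %d distributions; "
            "falling back to fitting all distributions",
            len(candidates),
        )
        return list(candidates)
    return kept


# A filter that rejects everything triggers the fallback.
print(apply_with_fallback(["expon", "gamma"], lambda name: False))
# A filter that keeps at least one candidate is applied as usual.
print(apply_with_fallback(["norm", "expon"], lambda name: name == "norm"))
```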
When to Use Pre-filtering
Use prefilter=True when:
- Your data is clearly left-skewed (skewness < -1); right-skewed data sees no skewness filtering
- You want faster fitting with little risk to accuracy (the skewness filter is ~95% reliable)
- You’re fitting many distributions and want to skip shape-incompatible ones
Use prefilter="aggressive" when:
- Your data is heavy-tailed (high kurtosis) and you want to skip light-tailed distributions
- You’re comfortable with ~80% reliability on the kurtosis filter
- Speed is more important than fitting every possible distribution
Use prefilter=False when:
- Your data is approximately symmetric (skewness ~ 0)
- You want to fit all distributions regardless of theoretical compatibility
- You need complete control over which distributions are attempted
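The guidance above can be condensed into a small decision helper. This is a sketch under stated assumptions: choose_prefilter is a hypothetical function, the thresholds (skewness < -1, kurtosis > 10) come from the Performance Impact table, where only strongly left-skewed data triggers the skewness layer, and whether the library's kurtosis threshold refers to excess kurtosis is not stated on this page.

```python
def choose_prefilter(skewness, kurtosis, allow_aggressive=False,
                     skew_threshold=-1.0, kurtosis_threshold=10.0):
    """Map sample shape statistics to a value for the prefilter argument."""
    # Aggressive mode is opt-in because its kurtosis layer is less reliable.
    if allow_aggressive and kurtosis > kurtosis_threshold:
        return "aggressive"
    # Safe mode only pays off when the skewness layer has something to remove.
    if skewness < skew_threshold:
        return True
    # Symmetric or right-skewed data: nothing would be filtered.
    return False


print(choose_prefilter(skewness=-2.0, kurtosis=3.0))
print(choose_prefilter(skewness=0.0, kurtosis=25.0, allow_aggressive=True))
print(choose_prefilter(skewness=1.5, kurtosis=3.0))
```

The returned value has the same three forms that the prefilter parameter accepts in the examples above (True, "aggressive", False), so it could be passed straight through as fitter.fit(df, "value", prefilter=choose_prefilter(...)).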
Combining with Lazy Metrics
Pre-filtering and lazy metrics are complementary optimizations:
Using FitterConfig (v2.2+, recommended):
from spark_bestfit import FitterConfigBuilder
# Maximum performance: fewer distributions + deferred KS/AD
config = (FitterConfigBuilder()
    .with_prefilter()      # Skip incompatible distributions
    .with_lazy_metrics()   # Defer KS/AD computation
    .build())
results = fitter.fit(df, "value", config=config)
# Fast model selection
best = results.best(n=1, metric="aic")[0]
Using parameters directly:
results = fitter.fit(
    df, "value",
    prefilter=True,
    lazy_metrics=True,
)
Combined benefit: Pre-filtering reduces the number of distributions to fit, and lazy metrics defer the expensive KS/AD computation. Together, they can reduce total fitting time by 50-80% for typical workflows.