Pre-filtering
=============

spark-bestfit supports **smart pre-filtering** that skips distributions mathematically incompatible with your data. This eliminates unnecessary fitting attempts based on data characteristics like skewness and kurtosis.

Why Pre-filter?
---------------

Distribution fitting is expensive. Each scipy ``dist.fit()`` call involves numerical optimization that takes 50-500ms depending on the distribution. With ~90 distributions (default), this adds up to significant time - but many distributions have intrinsic shape properties that make them poor fits for your data.

**Example:** If your data is clearly left-skewed (skewness < -1), distributions like ``expon``, ``gamma``, ``chi2``, ``lognorm`` (which are intrinsically right-skewed) cannot possibly fit well regardless of how scipy shifts them via ``loc``. Pre-filtering skips these before the expensive fitting step.

Filtering Layers
----------------

Pre-filtering uses a layered approach based on **intrinsic shape properties** (not location/scale):

.. list-table::
   :header-rows: 1
   :widths: 20 20 60

   * - Layer
     - Reliability
     - Description
   * - Skewness sign
     - ~95%
     - Skip positive-skew-only distributions for left-skewed data
   * - Kurtosis
     - ~80%
     - Skip low-kurtosis distributions for heavy-tailed data (aggressive mode)

.. note::

   We do NOT filter by support bounds (``dist.a``/``dist.b``) because scipy's ``loc`` parameter can shift any distribution to cover any data range. For example, ``expon(loc=-100)`` has support ``[-100, inf)`` and can fit negative data. Skewness and kurtosis are intrinsic shape properties that cannot be changed by ``loc``/``scale``.

Using Pre-filtering
-------------------

**Using FitterConfig (v2.2+, recommended):**

.. code-block:: python

    from spark_bestfit import DistributionFitter, FitterConfigBuilder

    fitter = DistributionFitter(spark)

    # Safe mode (recommended) - skewness filtering
    config = FitterConfigBuilder().with_prefilter().build()
    results = fitter.fit(df, "value", config=config)

    # Aggressive mode - adds kurtosis filtering
    config = FitterConfigBuilder().with_prefilter(mode="aggressive").build()
    results = fitter.fit(df, "value", config=config)

**Using parameter directly:**

.. code-block:: python

    # Safe mode
    results = fitter.fit(df, "value", prefilter=True)

    # Aggressive mode
    results = fitter.fit(df, "value", prefilter="aggressive")

    # Disabled (default)
    results = fitter.fit(df, "value", prefilter=False)

Performance Impact
------------------

Pre-filtering effectiveness depends on your data's **shape characteristics**:

.. list-table::
   :header-rows: 1
   :widths: 35 25 40

   * - Data Characteristic
     - Distributions Filtered
     - Example
   * - Symmetric (skew ~ 0)
     - 0%
     - No shape-based filtering applies
   * - Strongly left-skewed (skew < -1)
     - 20-30%
     - Positive-skew-only distributions skipped
   * - Strongly right-skewed (skew > 1)
     - 0%
     - Right-skewed data fits most distributions
   * - Heavy-tailed (aggressive, kurtosis > 10)
     - Additional 5-10%
     - Low-kurtosis distributions like ``uniform`` skipped

**Typical savings:** 20-50% fewer distributions to fit for skewed data, translating to proportional time savings during the fitting phase.

Fallback Behavior
-----------------

If pre-filtering removes all candidate distributions (which can happen with unusual data), spark-bestfit automatically falls back to fitting all distributions and logs a warning:

.. code-block:: text

    WARNING: Pre-filter removed all 90 distributions; falling back to fitting all distributions

This ensures you always get results, even if the pre-filter was too aggressive.

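
The filter-then-fall-back behavior can be pictured with a small, self-contained sketch. This is not spark-bestfit's implementation: the names ``prefilter_candidates``, ``sample_skewness`` and ``POSITIVE_SKEW_ONLY`` are hypothetical, the set contains only the four right-skewed distributions named earlier on this page, and the ``-1.0`` cutoff is assumed from the "skewness < -1" example.

.. code-block:: python

    import logging

    logger = logging.getLogger("prefilter_sketch")

    # Hypothetical set: only the right-skewed distributions named on this page.
    POSITIVE_SKEW_ONLY = {"expon", "gamma", "chi2", "lognorm"}


    def sample_skewness(values):
        """Biased sample skewness m3 / m2**1.5 (scipy.stats.skew's default)."""
        n = len(values)
        mean = sum(values) / n
        m2 = sum((v - mean) ** 2 for v in values) / n
        m3 = sum((v - mean) ** 3 for v in values) / n
        return m3 / m2 ** 1.5 if m2 > 0 else 0.0


    def prefilter_candidates(candidates, values, skew_threshold=-1.0):
        """Drop positive-skew-only candidates for left-skewed data."""
        if sample_skewness(values) >= skew_threshold:
            # Not clearly left-skewed (assumed cutoff): keep every candidate.
            return list(candidates)
        kept = [name for name in candidates if name not in POSITIVE_SKEW_ONLY]
        if not kept:
            # Everything was filtered out: warn and fit all candidates anyway.
            logger.warning(
                "Pre-filter removed all %d distributions; "
                "falling back to fitting all distributions",
                len(candidates),
            )
            return list(candidates)
        return kept


    # One large negative outlier makes this sample strongly left-skewed.
    left_skewed = [-10.0, -1.0, -0.5, 0.0, 0.0, 0.2, 0.3, 0.5, 0.5, 0.6]
    print(prefilter_candidates(["norm", "expon", "gamma"], left_skewed))
    print(prefilter_candidates(["expon", "gamma"], left_skewed))

The first call keeps only ``norm``. The second call would leave nothing to fit, so the sketch logs the warning and returns the original candidate list unchanged.
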
When to Use Pre-filtering
-------------------------

**Use prefilter=True when:**

- Your data is clearly skewed (skewness < -1 or > 1)
- You want faster fitting without sacrificing accuracy
- You're fitting many distributions and want to skip shape-incompatible ones

**Use prefilter="aggressive" when:**

- Your data is heavy-tailed (high kurtosis) and you want to skip light-tailed distributions
- You're comfortable with ~80% reliability on the kurtosis filter
- Speed is more important than fitting every possible distribution

**Use prefilter=False when:**

- Your data is approximately symmetric (skewness ~ 0)
- You want to fit all distributions regardless of theoretical compatibility
- You need complete control over which distributions are attempted

Combining with Lazy Metrics
---------------------------

Pre-filtering and lazy metrics are complementary optimizations:

**Using FitterConfig (v2.2+, recommended):**

.. code-block:: python

    from spark_bestfit import FitterConfigBuilder

    # Maximum performance: fewer distributions + deferred KS/AD
    config = (FitterConfigBuilder()
        .with_prefilter()       # Skip incompatible distributions
        .with_lazy_metrics()    # Defer KS/AD computation
        .build())

    results = fitter.fit(df, "value", config=config)

    # Fast model selection
    best = results.best(n=1, metric="aic")[0]

**Using parameters directly:**

.. code-block:: python

    results = fitter.fit(
        df,
        "value",
        prefilter=True,
        lazy_metrics=True,
    )

**Combined benefit:** Pre-filtering reduces the number of distributions to fit, and lazy metrics defers expensive KS/AD computation. Together, they can reduce total fitting time by 50-80% for typical workflows.
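
To decide which ``prefilter`` value to pass alongside ``lazy_metrics=True``, one option is to compute skewness and kurtosis on a small local sample first. The sketch below is an illustration, not part of spark-bestfit: ``shape_stats`` and ``suggest_fit_kwargs`` are hypothetical helpers, the cutoffs (``1.0`` and ``10.0``) are taken from the thresholds quoted on this page, and it assumes "kurtosis" means excess kurtosis (0 for a normal distribution).

.. code-block:: python

    def shape_stats(values):
        """Return (skewness, excess kurtosis) from biased central moments."""
        n = len(values)
        mean = sum(values) / n
        m2 = sum((v - mean) ** 2 for v in values) / n
        if m2 == 0:
            return 0.0, 0.0  # constant sample: no shape information
        m3 = sum((v - mean) ** 3 for v in values) / n
        m4 = sum((v - mean) ** 4 for v in values) / n
        return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


    def suggest_fit_kwargs(values, skew_cutoff=1.0, kurtosis_cutoff=10.0):
        """Map sample shape to the prefilter/lazy_metrics keyword arguments."""
        skewness, excess_kurtosis = shape_stats(values)
        if excess_kurtosis > kurtosis_cutoff:
            prefilter = "aggressive"   # heavy tails: add the kurtosis layer
        elif abs(skewness) > skew_cutoff:
            prefilter = True           # clearly skewed: safe mode
        else:
            prefilter = False          # roughly symmetric: nothing to skip
        return {"prefilter": prefilter, "lazy_metrics": True}


    symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
    left_skewed = [-10.0, -1.0, -0.5, 0.0, 0.0, 0.2, 0.3, 0.5, 0.5, 0.6]
    heavy_tailed = [0.0] * 38 + [100.0, -100.0]

    for sample in (symmetric, left_skewed, heavy_tailed):
        print(suggest_fit_kwargs(sample))

    # With a fitter and DataFrame as in the examples above, the result could
    # then be passed on as: fitter.fit(df, "value", **suggest_fit_kwargs(sample))

The moments here match the defaults of ``scipy.stats.skew`` and ``scipy.stats.kurtosis``, which could be substituted when scipy is already available on the driver. How the sample is drawn from the Spark DataFrame is left open.
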