Heavy-Tail Detection
spark-bestfit automatically detects heavy-tailed data characteristics and warns you when standard distributions may provide poor fits. This helps identify data that may need special handling.
What Are Heavy-Tailed Distributions?
Heavy-tailed distributions have slower tail decay than normal or exponential distributions. They exhibit:
High kurtosis: More extreme values than a normal distribution
Extreme outliers: Maximum values far beyond the 99th percentile
Potentially undefined moments: Some (like Cauchy) have undefined variance
| Distribution | Tail Behavior | Use Case |
|---|---|---|
| Cauchy | Infinite variance | Ratios of normals, resonance phenomena |
| Pareto | Power-law decay | Income distribution, file sizes, network traffic |
| Student's t | Heavy for df < 5 | Financial returns, robust regression |
| Lévy | Extreme heavy tail | Anomalous diffusion |
| Burr | Flexible heavy tail | Reliability analysis |
Automatic Detection
When fitting distributions, spark-bestfit checks two indicators:
Excess kurtosis > 6: A normal distribution has excess kurtosis 0; a t-distribution with 5 degrees of freedom has exactly 6; for Cauchy the population kurtosis is undefined, and sample estimates are typically very large
Extreme ratio > 3: The ratio of max value to 99th percentile
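The two checks can be approximated with plain NumPy. The sketch below is not the library's implementation: the function name `heavy_tail_indicators`, the biased moment estimator for kurtosis, and the handling of a non-positive 99th percentile are all assumptions made for illustration. Only the thresholds (6 and 3) come from the text above.

```python
import numpy as np

def heavy_tail_indicators(values, kurtosis_threshold=6.0, ratio_threshold=3.0):
    """Illustrative re-creation of the two heuristics (not spark-bestfit's code)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    excess_kurtosis = float(m4 / m2 ** 2 - 3.0)  # 0 for a normal distribution

    p99 = float(np.percentile(x, 99))
    # Assumption: the ratio is only meaningful when the 99th percentile is positive.
    extreme_ratio = float(x.max() / p99) if p99 > 0 else float("nan")

    return {
        "excess_kurtosis": excess_kurtosis,
        "extreme_ratio": extreme_ratio,
        "is_heavy_tailed": bool(
            excess_kurtosis > kurtosis_threshold or extreme_ratio > ratio_threshold
        ),
    }

rng = np.random.default_rng(0)
normal_check = heavy_tail_indicators(rng.normal(size=10_000))
cauchy_check = heavy_tail_indicators(rng.standard_cauchy(size=10_000))
print(normal_check["is_heavy_tailed"], cauchy_check["is_heavy_tailed"])
```

Either indicator alone is enough to set the flag, which mirrors the "if either indicator triggers" rule described here.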
If either indicator triggers, a UserWarning is emitted:
from spark_bestfit import DistributionFitter, LocalBackend
import numpy as np
import pandas as pd
import warnings
# Generate heavy-tailed data
np.random.seed(42)
data = np.random.standard_cauchy(1000)
df = pd.DataFrame({"value": data})
fitter = DistributionFitter(backend=LocalBackend())
# Warning is emitted automatically
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    results = fitter.fit(df, column="value", max_distributions=5)
    if w:
        print(f"Warning: {w[0].message}")
# UserWarning: Column 'value' exhibits heavy-tail characteristics
# (high kurtosis (299.7 > 6.0), extreme values (max/p99 = 17.2)).
# Consider: (1) heavy-tail distributions like pareto, cauchy, t;
# (2) data transformation (log, sqrt); (3) checking for outliers.
Direct API Usage
You can also use the detection function directly for diagnostic purposes:
from spark_bestfit.fitting import detect_heavy_tail, HEAVY_TAIL_DISTRIBUTIONS
# Detect heavy-tail characteristics
result = detect_heavy_tail(data)
print(result)
# {
# 'is_heavy_tailed': True,
# 'kurtosis': 299.7,
# 'extreme_ratio': 17.2,
# 'indicators': ['high kurtosis (299.7 > 6.0)', 'extreme values (max/p99 = 17.2)']
# }
# Custom threshold
result = detect_heavy_tail(data, kurtosis_threshold=10.0)
# List of known heavy-tail distributions
print(HEAVY_TAIL_DISTRIBUTIONS)
# frozenset({'cauchy', 'pareto', 't', 'levy', 'burr', 'burr12', 'fisk',
# 'levy_l', 'levy_stable', 'lomax', 'powerlaw', 'invgauss',
# 'genhyperbolic', 'johnsonsu'})
Data Statistics
The fit results now include kurtosis and skewness in the data statistics:
# After fitting
best = results.best(n=1)[0]
# Access via the internal DataFrame (underscore-prefixed, so treat it as private and subject to change)
print(results._df[['data_kurtosis', 'data_skewness']].iloc[0])
# Or compute directly
from spark_bestfit.fitting import compute_data_stats
stats = compute_data_stats(data)
print(f"Kurtosis: {stats['data_kurtosis']:.2f}")
print(f"Skewness: {stats['data_skewness']:.2f}")
Handling Heavy-Tailed Data
When you see the heavy-tail warning, consider these approaches:
1. Use Heavy-Tail Distributions
Limit fitting to heavy-tail distributions:
from spark_bestfit.fitting import HEAVY_TAIL_DISTRIBUTIONS
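# Size the candidate budget to the heavy-tail set
heavy_tail_list = list(HEAVY_TAIL_DISTRIBUTIONS)
# Note: as its name suggests, max_distributions caps how many candidates are
# fitted; this call does not pass heavy_tail_list itself to the fitter
results = fitter.fit(df, "value", max_distributions=len(heavy_tail_list))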
# Or exclude non-heavy-tail distributions from default set
fitter = DistributionFitter(
    backend=LocalBackend(),
    excluded_distributions=("norm", "expon", "gamma", "beta"),
)
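To narrow fitting toward the heavy-tail set through `excluded_distributions`, one option is to exclude everything else. The sketch below builds that complement from SciPy's continuous distributions. It assumes the fitter's candidates are SciPy distribution names and that `excluded_distributions` accepts a tuple of such names, as in the example above. The heavy-tail names are copied from the `HEAVY_TAIL_DISTRIBUTIONS` listing on this page, and `non_heavy_tail` is a hypothetical variable name.

```python
import scipy.stats as st

# Names copied from the HEAVY_TAIL_DISTRIBUTIONS listing on this page.
heavy_tail = frozenset({
    "cauchy", "pareto", "t", "levy", "burr", "burr12", "fisk",
    "levy_l", "levy_stable", "lomax", "powerlaw", "invgauss",
    "genhyperbolic", "johnsonsu",
})

# Every continuous distribution that this SciPy version exposes by name.
all_continuous = {
    name for name in dir(st)
    if isinstance(getattr(st, name), st.rv_continuous)
}

# Complement: candidate value for excluded_distributions.
non_heavy_tail = tuple(sorted(all_continuous - heavy_tail))
print(len(non_heavy_tail), non_heavy_tail[:5])
```

Passing a custom tuple may replace rather than extend the fitter's default exclusions, so it is worth checking the library documentation before relying on this.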
2. Transform Data
Apply transformations to reduce tail heaviness:
import numpy as np
# Log transform (intended for non-negative data; values below -1 yield NaN)
df["log_value"] = np.log1p(df["value"])
# Square root transform
df["sqrt_value"] = np.sqrt(np.abs(df["value"]))
# Winsorize (clip extremes)
lower, upper = np.percentile(df["value"], [1, 99])
df["winsorized"] = df["value"].clip(lower, upper)
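The log transform above assumes non-negative input, and the square-root variant discards the sign, while samples such as the Cauchy example take both signs. One common workaround is a sign-preserving log transform, sign(x) * log(1 + |x|). The sketch below applies it and winsorizing to synthetic Cauchy data and compares excess kurtosis before and after. `signed_log1p` and `excess_kurtosis` are illustrative helpers, not spark-bestfit functions.

```python
import numpy as np

def excess_kurtosis(x):
    """Biased moment estimate of excess kurtosis (0 for a normal distribution)."""
    c = np.asarray(x, dtype=float)
    c = c - c.mean()
    return float(np.mean(c ** 4) / np.mean(c ** 2) ** 2 - 3.0)

def signed_log1p(x):
    """Sign-preserving log transform: sign(x) * log(1 + |x|)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

rng = np.random.default_rng(42)
raw = rng.standard_cauchy(5_000)

transformed = signed_log1p(raw)
lower, upper = np.percentile(raw, [1, 99])
winsorized = np.clip(raw, lower, upper)

for label, sample in [("raw", raw), ("signed log", transformed), ("winsorized", winsorized)]:
    print(f"{label:>11}: excess kurtosis = {excess_kurtosis(sample):.2f}")
```

Both transforms change the quantity being modeled, so fitted parameters describe the transformed values rather than the original ones.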
3. Check for Outliers
Investigate whether extreme values are errors:
# Identify extreme values
threshold = np.percentile(data, 99.9)
outliers = data[data > threshold]
print(f"Extreme values: {len(outliers)}")
# Consider removing if they're data errors
clean_data = data[data <= threshold]
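The percentile cut above looks only at the upper tail. For a two-sided check, Tukey's interquartile-range fences are a common alternative. This is a swapped-in technique for illustration, not something spark-bestfit applies; the multiplier `k=1.5` is the conventional default and `iqr_outlier_mask` is a hypothetical helper name.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0, -80.0])
mask = iqr_outlier_mask(sample)
print(sample[mask])    # flagged points, to inspect before deciding to drop them
clean = sample[~mask]  # remaining points
```

On genuinely heavy-tailed data such fences flag many legitimate points, so the mask is better treated as a list of values to inspect than as a list of values to delete.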
Suppressing Warnings
If you’re aware of the heavy-tail nature and want to suppress warnings:
import warnings
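# filterwarnings() is not a context manager; scope it with catch_warnings()
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message=".*heavy-tail.*")
    results = fitter.fit(df, column="value")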
# Or globally
warnings.filterwarnings("ignore", message=".*heavy-tail.*")
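Scoped suppression relies on `warnings.catch_warnings()`, which restores the previous filters when the block exits. The sketch below shows that behavior with a stand-in function, `emit_heavy_tail_warning`, which is hypothetical and only mimics the message text shown earlier on this page; it does not call spark-bestfit.

```python
import warnings

def emit_heavy_tail_warning():
    # Stand-in for fitter.fit(); the text mimics the documented warning message.
    warnings.warn("Column 'value' exhibits heavy-tail characteristics", UserWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    # Inside the inner block the "ignore" filter matches and nothing is recorded.
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message=".*heavy-tail.*")
        emit_heavy_tail_warning()
    suppressed_count = len(caught)

    # The inner filter is gone after the block exits, so this one is recorded.
    emit_heavy_tail_warning()
    total_count = len(caught)

print(suppressed_count, total_count)
```

The `message` argument is a regular expression matched case-insensitively against the start of the warning text, which is why the pattern begins with `.*`.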
When Detection Doesn’t Apply
The heavy-tail detection is a heuristic and can be wrong in both directions:
False positive: It may flag data that contains a few outliers but is not truly heavy-tailed
False negative: It may miss heavy-tailed data when the sample is small or the values have been clipped
Use it as a diagnostic aid, not a definitive classification.
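Both failure modes are easy to reproduce on synthetic data. The sketch below re-creates the two indicators with NumPy (an approximation using the thresholds stated on this page, not the library's code; `looks_heavy_tailed` is a hypothetical helper) and applies them to normal data with a few injected outliers and to Cauchy data clipped at its 5th and 95th percentiles.

```python
import numpy as np

def looks_heavy_tailed(x, kurtosis_threshold=6.0, ratio_threshold=3.0):
    """Approximate the two heuristics: excess kurtosis and max / 99th percentile."""
    x = np.asarray(x, dtype=float)
    c = x - x.mean()
    kurt = np.mean(c ** 4) / np.mean(c ** 2) ** 2 - 3.0
    p99 = np.percentile(x, 99)
    ratio = x.max() / p99 if p99 > 0 else np.nan
    return bool(kurt > kurtosis_threshold or ratio > ratio_threshold)

rng = np.random.default_rng(7)

# False positive: light-tailed data plus three bad readings.
contaminated = np.concatenate([rng.normal(size=1_000), [50.0, 50.0, 50.0]])

# False negative: heavy-tailed data whose extremes were clipped upstream.
cauchy = rng.standard_cauchy(5_000)
lo, hi = np.percentile(cauchy, [5, 95])
clipped = np.clip(cauchy, lo, hi)

print("contaminated normal flagged:", looks_heavy_tailed(contaminated))
print("clipped Cauchy flagged:     ", looks_heavy_tailed(clipped))
```

Three bad readings are enough to push the kurtosis of otherwise normal data far past the threshold, while clipping removes exactly the extremes that both indicators depend on.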