Heavy-Tail Detection
====================

spark-bestfit automatically detects **heavy-tailed data characteristics** and
warns you when standard distributions may provide poor fits. This helps identify
data that may need special handling.

What Are Heavy-Tailed Distributions?
------------------------------------

Heavy-tailed distributions have **slower tail decay** than normal or exponential
distributions. They exhibit:

- **High kurtosis**: More extreme values than a normal distribution
- **Extreme outliers**: Maximum values far beyond the 99th percentile
- **Potentially undefined moments**: Some (like Cauchy) have neither a defined mean nor a finite variance

.. list-table:: Common Heavy-Tailed Distributions
   :header-rows: 1
   :widths: 25 30 45

   * - Distribution
     - Tail Behavior
     - Use Case
   * - ``cauchy``
     - Undefined mean and variance
     - Ratios of normals, resonance phenomena
   * - ``pareto``
     - Power-law decay
     - Income distribution, file sizes, network traffic
   * - ``t`` (low df)
     - Heavy for df < 5
     - Financial returns, robust regression
   * - ``levy``
     - Extreme heavy tail
     - Anomalous diffusion
   * - ``burr``
     - Flexible heavy tail
     - Reliability analysis
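
To make "slower tail decay" concrete, the sketch below compares the probability
of exceeding the same point under light-tailed and heavy-tailed distributions.
It uses only ``scipy.stats`` and is independent of spark-bestfit; the cutoff
``x = 10`` and the shape parameters (``df=3``, ``b=2``) are arbitrary choices
made for illustration.

.. code-block:: python

   from scipy import stats

   x = 10.0  # arbitrary cutoff chosen for illustration

   # Survival function sf(x) = P(X > x) for standard (loc=0, scale=1) variants
   tail_probs = {
       "norm": stats.norm.sf(x),
       "expon": stats.expon.sf(x),
       "t (df=3)": stats.t.sf(x, df=3),
       "pareto (b=2)": stats.pareto.sf(x, b=2),
       "cauchy": stats.cauchy.sf(x),
   }

   for name, p in tail_probs.items():
       print(f"{name:>12}: P(X > {x:g}) = {p:.3e}")

The heavy-tailed entries keep orders of magnitude more probability mass beyond
the cutoff than the normal or exponential distribution does.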

Automatic Detection
-------------------

When fitting distributions, spark-bestfit checks two indicators:

1. **Excess kurtosis > 6**: A normal distribution has excess kurtosis 0 and a
   t-distribution with 5 degrees of freedom has exactly 6. For Cauchy data the
   theoretical value is undefined, and the sample value is typically very large

2. **Extreme ratio > 3**: The ratio of max value to 99th percentile
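
The two indicators can be approximated by hand with NumPy and SciPy. This is a
minimal sketch, not the library's implementation: ``tail_indicators`` is a
hypothetical helper, and details such as how non-positive percentiles are
handled are assumptions.

.. code-block:: python

   import numpy as np
   from scipy import stats

   def tail_indicators(values, kurtosis_threshold=6.0, ratio_threshold=3.0):
       """Hypothetical helper that computes the two indicators described above."""
       values = np.asarray(values, dtype=float)
       # Fisher definition: a normal distribution gives roughly 0
       excess_kurtosis = float(stats.kurtosis(values, fisher=True))
       p99 = np.percentile(values, 99)
       # Assumption: the ratio is only meaningful when the 99th percentile is positive
       extreme_ratio = float(values.max() / p99) if p99 > 0 else float("nan")
       flagged = excess_kurtosis > kurtosis_threshold or extreme_ratio > ratio_threshold
       return {
           "excess_kurtosis": excess_kurtosis,
           "extreme_ratio": extreme_ratio,
           "flagged": bool(flagged),
       }

   rng = np.random.default_rng(0)
   normal_result = tail_indicators(rng.normal(size=10_000))
   cauchy_result = tail_indicators(rng.standard_cauchy(size=10_000))
   print(normal_result)
   print(cauchy_result)

   # Theoretical excess kurtosis of a t-distribution is 6 / (df - 4) for df > 4
   t5_kurtosis = float(stats.t.stats(5, moments="k"))
   print(t5_kurtosis)

The normal sample stays well below both thresholds, while the Cauchy sample
exceeds them by a wide margin.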

If either indicator triggers, a ``UserWarning`` is emitted:

.. code-block:: python

   from spark_bestfit import DistributionFitter, LocalBackend
   import numpy as np
   import pandas as pd
   import warnings

   # Generate heavy-tailed data
   np.random.seed(42)
   data = np.random.standard_cauchy(1000)
   df = pd.DataFrame({"value": data})

   fitter = DistributionFitter(backend=LocalBackend())

   # Warning is emitted automatically
   with warnings.catch_warnings(record=True) as w:
       warnings.simplefilter("always")
       results = fitter.fit(df, column="value", max_distributions=5)

       if w:
           print(f"Warning: {w[0].message}")
           # UserWarning: Column 'value' exhibits heavy-tail characteristics
           # (high kurtosis (299.7 > 6.0), extreme values (max/p99 = 17.2)).
           # Consider: (1) heavy-tail distributions like pareto, cauchy, t;
           # (2) data transformation (log, sqrt); (3) checking for outliers.

Direct API Usage
----------------

You can also use the detection function directly for diagnostic purposes:

.. code-block:: python

   from spark_bestfit.fitting import detect_heavy_tail, HEAVY_TAIL_DISTRIBUTIONS

   # Detect heavy-tail characteristics
   result = detect_heavy_tail(data)
   print(result)
   # {
   #     'is_heavy_tailed': True,
   #     'kurtosis': 299.7,
   #     'extreme_ratio': 17.2,
   #     'indicators': ['high kurtosis (299.7 > 6.0)', 'extreme values (max/p99 = 17.2)']
   # }

   # Custom threshold
   result = detect_heavy_tail(data, kurtosis_threshold=10.0)

   # List of known heavy-tail distributions
   print(HEAVY_TAIL_DISTRIBUTIONS)
   # frozenset({'cauchy', 'pareto', 't', 'levy', 'burr', 'burr12', 'fisk',
   #            'levy_l', 'levy_stable', 'lomax', 'powerlaw', 'invgauss',
   #            'genhyperbolic', 'johnsonsu'})

Data Statistics
---------------

Fit results include the kurtosis and skewness of the input data among the data statistics:

.. code-block:: python

   # After fitting
   best = results.best(n=1)[0]

   # Access via the internal DataFrame (_df is private and may change between versions)
   print(results._df[['data_kurtosis', 'data_skewness']].iloc[0])

   # Or compute directly
   from spark_bestfit.fitting import compute_data_stats

   stats = compute_data_stats(data)
   print(f"Kurtosis: {stats['data_kurtosis']:.2f}")
   print(f"Skewness: {stats['data_skewness']:.2f}")

Handling Heavy-Tailed Data
--------------------------

When you see the heavy-tail warning, consider these approaches:

**1. Use Heavy-Tail Distributions**

Limit fitting to heavy-tail distributions:

.. code-block:: python

   from spark_bestfit.fitting import HEAVY_TAIL_DISTRIBUTIONS

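   # NOTE: this passes only a count to fit(); heavy_tail_list itself is never
   # handed to the fitter, so this call alone does not select those distributions
   heavy_tail_list = list(HEAVY_TAIL_DISTRIBUTIONS)
   results = fitter.fit(df, "value", max_distributions=len(heavy_tail_list))

   # To steer the candidate set, exclude light-tailed distributions explicitly
   fitter = DistributionFitter(
       backend=LocalBackend(),
       excluded_distributions=("norm", "expon", "gamma", "beta")
   )
   results = fitter.fit(df, column="value")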

**2. Transform Data**

Apply transformations to reduce tail heaviness:

.. code-block:: python

   import numpy as np

   # Log transform: log1p(x) = log(1 + x) is only defined for x > -1,
   # so apply it to non-negative data
   df["log_value"] = np.log1p(df["value"])

   # Square root of the magnitude (discards the sign of negative values)
   df["sqrt_value"] = np.sqrt(np.abs(df["value"]))

   # Winsorize: clip values outside the 1st to 99th percentile range
   lower, upper = np.percentile(df["value"], [1, 99])
   df["winsorized"] = df["value"].clip(lower, upper)
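
To check whether a transformation actually helps, compare the excess kurtosis
before and after. The sketch below does this on synthetic, strictly positive
lognormal data using ``scipy.stats.kurtosis``; the data and variable names are
illustrative and not part of spark-bestfit.

.. code-block:: python

   import numpy as np
   from scipy import stats

   rng = np.random.default_rng(0)
   # Strictly positive, right-skewed synthetic data
   skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

   lower, upper = np.percentile(skewed, [1, 99])
   variants = {
       "raw": skewed,
       "log1p": np.log1p(skewed),
       "sqrt": np.sqrt(skewed),
       "winsorized": np.clip(skewed, lower, upper),
   }

   # Excess kurtosis (Fisher definition) of each variant
   kurtosis_by_variant = {
       name: float(stats.kurtosis(values)) for name, values in variants.items()
   }
   for name, value in kurtosis_by_variant.items():
       print(f"{name:>10}: excess kurtosis = {value:.2f}")

Each transformed variant should show a markedly lower excess kurtosis than the
raw data. Remember that a distribution fitted to transformed values describes
the transformed scale, not the original one.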

**3. Check for Outliers**

Investigate whether extreme values are errors:

.. code-block:: python

   # Identify extreme values in the upper tail (above the 99.9th percentile)
   threshold = np.percentile(data, 99.9)
   outliers = data[data > threshold]
   print(f"Extreme values: {len(outliers)}")

   # Remove them only if they are confirmed data errors
   clean_data = data[data <= threshold]
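
The percentile check above inspects only the upper tail. For data that can be
extreme in both directions, a robust two-sided alternative is the modified
z-score based on the median absolute deviation (MAD). This is a sketch of a
general technique, not something spark-bestfit provides; the 3.5 cutoff is the
conventional Iglewicz-Hoaglin rule of thumb.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   sample = rng.standard_cauchy(1000)  # synthetic two-sided heavy-tailed data

   median = np.median(sample)
   mad = np.median(np.abs(sample - median))

   # Modified z-score: 0.6745 rescales the MAD to match a normal standard deviation
   robust_z = 0.6745 * (sample - median) / mad
   is_extreme = np.abs(robust_z) > 3.5

   print(f"Flagged in lower tail: {np.sum(is_extreme & (sample < median))}")
   print(f"Flagged in upper tail: {np.sum(is_extreme & (sample > median))}")

On genuinely heavy-tailed data a sizeable share of points is flagged in both
tails, which suggests the extremes belong to the distribution and are not
isolated errors.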

Suppressing Warnings
--------------------

If you already know the data is heavy-tailed and want to silence the warning:

.. code-block:: python

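   import warnings

   # filterwarnings() is not a context manager; scope it with catch_warnings()
   with warnings.catch_warnings():
       warnings.filterwarnings("ignore", message=".*heavy-tail.*")
       results = fitter.fit(df, column="value")

   # Or globally
   warnings.filterwarnings("ignore", message=".*heavy-tail.*")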

When Detection Doesn't Apply
----------------------------

The heavy-tail detection is a heuristic and can misclassify data:

- **False positives**: Data with a few outliers can be flagged even though it is not truly heavy-tailed
- **False negatives**: Heavy-tailed data can go undetected when the sample is small or the values were clipped

Use it as a diagnostic aid, not a definitive classification.
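
Both failure modes are easy to reproduce on synthetic data. The sketch below
uses ``looks_heavy_tailed``, a hypothetical re-implementation of the
two-indicator rule with the thresholds quoted on this page; it is an
assumption about the logic, not the library's code.

.. code-block:: python

   import numpy as np
   from scipy import stats

   def looks_heavy_tailed(values, kurtosis_threshold=6.0, ratio_threshold=3.0):
       """Hypothetical stand-in for the heuristic: kurtosis OR max/p99 check."""
       values = np.asarray(values, dtype=float)
       p99 = np.percentile(values, 99)
       ratio = values.max() / p99 if p99 > 0 else 0.0
       return bool(stats.kurtosis(values) > kurtosis_threshold or ratio > ratio_threshold)

   rng = np.random.default_rng(0)

   # False positive: light-tailed data plus a single bad reading
   contaminated = np.append(rng.normal(size=1000), 50.0)

   # False negative: heavy-tailed data whose extremes were clipped upstream
   clipped = np.clip(rng.standard_cauchy(10_000), -3.0, 3.0)

   print(looks_heavy_tailed(contaminated))  # True: one outlier dominates both indicators
   print(looks_heavy_tailed(clipped))       # False: clipping removed the tail evidence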