Maximum Spacing Estimation
==========================

.. versionadded:: 2.5.0

spark-bestfit supports **Maximum Spacing Estimation (MSE)** as an alternative
to Maximum Likelihood Estimation (MLE) for parameter fitting. MSE is particularly
robust for heavy-tailed distributions where MLE may fail or produce poor estimates.

What is Maximum Spacing Estimation?
-----------------------------------

MSE estimates distribution parameters by maximizing the **geometric mean of spacings**
between consecutive order statistics of the CDF-transformed data.

For data points x₁ ≤ x₂ ≤ ... ≤ xₙ and CDF F with parameters θ:

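1. Transform the data: uᵢ = F(xᵢ; θ), so that each uᵢ ∈ [0, 1]
2. Compute the spacings: Dᵢ = u₍ᵢ₎ - u₍ᵢ₋₁₎ for i = 1, ..., n+1, with boundary values u₍₀₎ = 0 and u₍ₙ₊₁₎ = 1
3. Maximize the mean log-spacing (the log of the geometric mean): S(θ) = (1/(n+1)) Σᵢ log(Dᵢ)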
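
The three steps above can be sketched with NumPy and SciPy. This is a minimal
illustration, not spark-bestfit's implementation: the helper names
``mean_log_spacing`` and ``fit_by_spacing``, the Nelder-Mead optimizer, and the
clipping constant are all assumptions made for the example.

.. code-block:: python

   import numpy as np
   from scipy import optimize, stats


   def mean_log_spacing(params, dist, sorted_data):
       """Return S(theta), the mean log-spacing of the CDF-transformed data."""
       u = dist.cdf(sorted_data, *params)
       # Invalid parameters (e.g. a negative scale) give NaN; treat them as worst case
       if not np.all(np.isfinite(u)):
           return -np.inf
       # Add the boundary values 0 and 1, then take consecutive differences
       spacings = np.diff(np.concatenate(([0.0], u, [1.0])))
       # Guard against zero spacings (ties or underflow) before taking logs
       spacings = np.clip(spacings, 1e-300, None)
       return np.mean(np.log(spacings))


   def fit_by_spacing(dist, data, initial_params):
       """Maximize S(theta) by minimizing its negative with Nelder-Mead."""
       sorted_data = np.sort(np.asarray(data, dtype=float))
       result = optimize.minimize(
           lambda p: -mean_log_spacing(p, dist, sorted_data),
           x0=np.asarray(initial_params, dtype=float),
           method="Nelder-Mead",
       )
       return result.x


   rng = np.random.default_rng(42)
   sample = rng.normal(loc=10.0, scale=2.0, size=1000)
   loc_hat, scale_hat = fit_by_spacing(stats.norm, sample, initial_params=(9.0, 1.5))
   print(f"loc={loc_hat:.2f}, scale={scale_hat:.2f}")

Because the data are sorted once up front, each objective evaluation costs one
CDF call and one ``np.diff`` over the sample.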

**Key advantages over MLE:**

- Bounded objective whenever the CDF exists, whereas the likelihood can be unbounded for some models
- More robust to outliers and extreme values
- Better convergence for heavy-tailed distributions (Pareto, Cauchy, etc.)
- Consistent and asymptotically efficient
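
The first advantage can be seen numerically. The sketch below assumes a gamma
model with shape 0.5 and a free location parameter (an illustrative choice, not
something spark-bestfit prescribes). As the location approaches the smallest
observation, the log-likelihood keeps growing, while the mean log-spacing can
never exceed -log(n+1) because the n+1 spacings sum to 1.

.. code-block:: python

   import numpy as np
   from scipy import stats

   rng = np.random.default_rng(0)
   x = np.sort(stats.gamma.rvs(a=0.5, size=200, random_state=rng))
   n = x.size


   def log_likelihood(loc):
       """Log-likelihood of the sample for a gamma(a=0.5) model shifted by loc."""
       return np.sum(stats.gamma.logpdf(x, a=0.5, loc=loc))


   def mean_log_spacing(loc):
       """Mean log-spacing S for the same model and location."""
       u = stats.gamma.cdf(x, a=0.5, loc=loc)
       spacings = np.diff(np.concatenate(([0.0], u, [1.0])))
       return np.mean(np.log(spacings))


   # Move the location closer and closer to the smallest observation
   gaps = [1e-2, 1e-4, 1e-6, 1e-8]
   log_liks = [log_likelihood(x[0] - gap) for gap in gaps]
   spacing_scores = [mean_log_spacing(x[0] - gap) for gap in gaps]
   upper_bound = -np.log(n + 1)

   for gap, ll, s in zip(gaps, log_liks, spacing_scores):
       print(f"gap={gap:.0e}  log-likelihood={ll:.2f}  mean log-spacing={s:.3f}")

The log-likelihood rises at every step, so an optimizer chasing it would push
the location onto the smallest data point. The spacing objective stays below
``upper_bound`` throughout.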

When to Use MSE
---------------

.. list-table:: Estimation Method Comparison
   :header-rows: 1
   :widths: 20 40 40

   * - Method
     - Best For
     - Limitations
   * - ``mle``
     - Most distributions, large samples
     - Can fail for heavy tails, unbounded likelihood
   * - ``mse``
     - Heavy-tailed distributions, outliers
     - Roughly 1.5-2x slower than MLE
   * - ``auto``
     - Unknown data characteristics
     - Adds detection overhead

**Use MSE when:**

- Fitting heavy-tailed distributions (Pareto, Cauchy, Levy, etc.)
- Data has extreme outliers
- MLE fails to converge or produces unreasonable estimates
- You want more robust parameter estimates

API: estimation_method Parameter
--------------------------------

The ``estimation_method`` parameter accepts three values:

- ``"mle"`` (default): Maximum Likelihood Estimation via ``scipy.stats.fit()``
- ``"mse"``: Maximum Spacing Estimation
- ``"auto"``: Automatically select MSE for heavy-tailed data, MLE otherwise

**Direct parameter usage:**

.. code-block:: python

   from spark_bestfit import DistributionFitter, LocalBackend
   import pandas as pd
   import numpy as np

   # Generate heavy-tailed data
   np.random.seed(42)
   data = np.random.pareto(1.5, 1000) + 1
   df = pd.DataFrame({"value": data})

   fitter = DistributionFitter(backend=LocalBackend())

   # Use MSE for heavy-tailed data
   results = fitter.fit(df, column="value", estimation_method="mse")

   # Auto-detect and select appropriate method
   results = fitter.fit(df, column="value", estimation_method="auto")

**Via FitterConfig:**

.. code-block:: python

   from spark_bestfit import FitterConfigBuilder

   # Build config with MSE
   config = (FitterConfigBuilder()
       .with_estimation_method("mse")
       .with_bins(100)
       .build())

   results = fitter.fit(df, column="value", config=config)

Examples
--------

**Example 1: Fitting Pareto Distribution**

Pareto distributions are notoriously difficult for MLE when the shape parameter
is small. MSE handles this robustly:

.. code-block:: python

   from scipy import stats
   import pandas as pd
   from spark_bestfit import DistributionFitter, LocalBackend

   # Generate Pareto data with shape b=1.5; random_state makes it reproducible.
   # scipy's pareto already has support x >= 1, so no "+ 1" shift is needed.
   data = stats.pareto.rvs(b=1.5, size=1000, random_state=42)
   df = pd.DataFrame({"value": data})

   fitter = DistributionFitter(backend=LocalBackend())

   # MSE provides more stable estimates
   results = fitter.fit(
       df,
       column="value",
       estimation_method="mse",
       max_distributions=10
   )

   best = results.best(n=1)[0]
   print(f"Best fit: {best.distribution}")
   print(f"Parameters: {best.params}")

**Example 2: Auto Mode for Unknown Data**

When you don't know if your data is heavy-tailed, use ``"auto"``:

.. code-block:: python

   # Auto mode detects heavy tails and switches to MSE
   results = fitter.fit(
       df,
       column="value",
       estimation_method="auto"
   )

   # No heavy-tail warning when auto selects MSE

**Example 3: Cauchy Distribution**

The Cauchy distribution has no defined mean or variance, which can make MLE unstable. MSE works well here:

.. code-block:: python

   # Generate Cauchy data
   data = stats.cauchy.rvs(loc=5.0, scale=2.0, size=500, random_state=42)
   df = pd.DataFrame({"value": data})

   # MSE gives stable parameter estimates
   results = fitter.fit(
       df,
       column="value",
       estimation_method="mse",
       max_distributions=5
   )

Low-Level API
-------------

For direct access to MSE fitting:

.. code-block:: python

   from spark_bestfit.fitting import fit_mse
   from scipy import stats
   import numpy as np

   # Generate data
   np.random.seed(42)
   data = np.random.normal(10.0, 2.0, 1000)

   # Fit using MSE
   params = fit_mse(stats.norm, data)
   print(f"Parameters: loc={params[0]:.2f}, scale={params[1]:.2f}")

   # With initial parameter guess (for faster convergence)
   params = fit_mse(stats.norm, data, initial_params=(9.0, 1.5))
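
As an independent cross-check, recent SciPy releases expose a maximum spacing
fit through ``scipy.stats.fit(..., method="mse")``. The sketch below assumes
such a SciPy version is installed and uses hypothetical search bounds. It does
not call spark-bestfit, and its result may differ slightly from ``fit_mse``
because the optimizers are not the same.

.. code-block:: python

   import numpy as np
   from scipy import stats

   rng = np.random.default_rng(42)
   data = rng.normal(10.0, 2.0, 1000)

   # Search bounds for loc and scale (illustrative values, wide enough for this data)
   bounds = {"loc": (0.0, 20.0), "scale": (0.1, 10.0)}

   # method="mse" minimizes the negative log-product of spacings
   reference = stats.fit(stats.norm, data, bounds=bounds, method="mse")
   print(f"SciPy MSE: loc={reference.params.loc:.2f}, scale={reference.params.scale:.2f}")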

Integration with Heavy-Tail Detection
-------------------------------------

MSE integrates seamlessly with spark-bestfit's heavy-tail detection:

- When ``estimation_method="auto"``, heavy-tail detection runs automatically
- If heavy tails are detected, MSE is used instead of MLE
- When explicitly using ``estimation_method="mse"``, the heavy-tail warning is suppressed
  (since you're already using the recommended approach)

.. code-block:: python

   import warnings

   # With auto: heavy-tail detection runs first, and MSE is used if heavy tails are found
   results = fitter.fit(df, "value", estimation_method="auto")

   # With explicit mse: no warning (you know what you're doing)
   with warnings.catch_warnings(record=True) as w:
       warnings.simplefilter("always")
       results = fitter.fit(df, "value", estimation_method="mse")
       heavy_tail_warnings = [x for x in w if "heavy-tail" in str(x.message)]
       assert len(heavy_tail_warnings) == 0  # No warning
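
The detection rule itself is not described here. Purely to illustrate the kind
of decision ``"auto"`` makes, the sketch below uses a hypothetical heuristic
based on sample excess kurtosis. The function name ``choose_estimation_method``
and the threshold of 6.0 are assumptions and do not reflect spark-bestfit's
actual detector.

.. code-block:: python

   import numpy as np
   from scipy import stats


   def choose_estimation_method(data, kurtosis_threshold=6.0):
       """Hypothetical rule: pick "mse" when the sample excess kurtosis is large."""
       # stats.kurtosis returns excess kurtosis by default (0 for a normal distribution)
       excess_kurtosis = stats.kurtosis(np.asarray(data, dtype=float))
       return "mse" if excess_kurtosis > kurtosis_threshold else "mle"


   rng = np.random.default_rng(42)
   heavy = rng.pareto(1.5, 1000) + 1    # classical Pareto, heavy right tail
   light = rng.normal(10.0, 2.0, 1000)  # light-tailed reference sample

   heavy_choice = choose_estimation_method(heavy)
   light_choice = choose_estimation_method(light)
   print(heavy_choice, light_choice)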

Performance Considerations
--------------------------

MSE is slower than MLE because it requires numerical optimization of the
spacing objective function. Typical overhead:

- **Small datasets (<1000 points)**: ~2x slower than MLE
- **Large datasets (>10000 points)**: ~1.5x slower than MLE

For performance-critical applications with known non-heavy-tailed data,
stick with the default ``estimation_method="mle"``.

References
----------

- Ranneby, B. (1984). "The Maximum Spacing Method. An Estimation Method
  Related to the Maximum Likelihood Method." *Scandinavian Journal of
  Statistics*, 11(2), 93-112.