Lazy Metrics
============

spark-bestfit supports **lazy metric evaluation**: KS/AD metrics are computed on
demand, only when you access them. This avoids most of the KS/AD cost in model
selection workflows that rank fits by AIC or BIC.

Metric Computation Cost
-----------------------

Not all goodness-of-fit metrics have the same computational cost:

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Metric
     - Cost
     - Notes
   * - SSE
     - Fast (~ms)
     - PDF evaluation at histogram bins
   * - AIC / BIC
     - Fast (~ms)
     - Log-likelihood sum
   * - KS-statistic
     - Moderate (~100ms)
     - O(n log n) sort + CDF computation
   * - AD-statistic
     - Slow (~200-500ms)
     - O(n log n) sort + 2n log operations
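
To make the Notes column concrete, here is a minimal standard-library sketch of
the textbook KS and AD statistics, evaluated against a standard normal CDF. It
is not spark-bestfit's implementation, and the names ``normal_cdf``,
``ks_statistic`` and ``ad_statistic`` are illustrative only. Both functions
start with a sort, and the AD loop evaluates two logarithms per observation.

.. code-block:: python

   import math


   def normal_cdf(x: float) -> float:
       """Standard normal CDF via the error function."""
       return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


   def ks_statistic(sample, cdf=normal_cdf) -> float:
       """Largest gap between the empirical CDF and the model CDF."""
       xs = sorted(sample)  # O(n log n)
       n = len(xs)
       d = 0.0
       for i, x in enumerate(xs, start=1):
           f = cdf(x)
           # The empirical CDF jumps from (i - 1) / n to i / n at x.
           d = max(d, i / n - f, f - (i - 1) / n)
       return d


   def ad_statistic(sample, cdf=normal_cdf) -> float:
       """Anderson-Darling A^2; assumes 0 < cdf(x) < 1 for every x."""
       xs = sorted(sample)  # O(n log n)
       n = len(xs)
       total = 0.0
       for i in range(1, n + 1):
           lower = math.log(cdf(xs[i - 1]))        # first log
           upper = math.log(1.0 - cdf(xs[n - i]))  # second log
           total += (2 * i - 1) * (lower + upper)
       return -n - total / n


   sample = [-1.0, 0.0, 1.0]
   print(ks_statistic(sample))
   print(ad_statistic(sample))

The AD statistic weights the tails more heavily than KS, which is why it needs
the extra logarithms and tends to cost more per distribution.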

With ~90 distributions (default), computing KS/AD for all can add **20-50 seconds** to the
total fitting time. With lazy metrics, you only pay this cost for the distributions
you actually access.

Using Lazy Metrics
------------------

Enable lazy metrics to skip initial KS/AD computation during fitting:

**Using FitterConfig (v2.2+, recommended):**

.. code-block:: python

   from spark_bestfit import DistributionFitter, FitterConfigBuilder

   fitter = DistributionFitter(spark)

   # Build config with lazy metrics
   config = FitterConfigBuilder().with_lazy_metrics().build()

   # Fast fitting: skip KS/AD computation initially
   results = fitter.fit(df, "value", config=config)

   # Check if results are lazy
   print(results.is_lazy)  # True

   # Get best by AIC - fast, no KS/AD needed
   best_aic = results.best(n=1, metric="aic")[0]
   print(best_aic.ks_statistic)  # None (not computed yet)

   # Get best by KS - triggers ON-DEMAND computation!
   best_ks = results.best(n=1, metric="ks_statistic")[0]
   print(best_ks.ks_statistic)  # 0.0234 (computed value!)

**Using parameter directly:**

.. code-block:: python

   # Alternatively, pass lazy_metrics parameter directly
   results = fitter.fit(df, "value", lazy_metrics=True)

**Key insight**: When you call ``best(n=N, metric="ks_statistic")`` on lazy results,
spark-bestfit automatically:

1. Takes the top ``N * 3`` candidates sorted by AIC (fast, already computed)
2. Computes KS/AD only for those candidates (not all ~90 distributions)
3. Re-sorts the candidates by their actual KS statistic and returns the top ``N``

Because only the AIC shortlist is re-ranked, you get the best KS fit among the
strongest AIC candidates while computing KS/AD for only ~5% of distributions.
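
The three steps above can be sketched in plain Python. This is a toy model
under stated assumptions, not the library's code: the ``Fit`` dataclass, the
``best_by_ks`` helper, the ``factor`` parameter and the made-up KS values are
all hypothetical stand-ins.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Callable, List, Optional


   @dataclass
   class Fit:
       """Hypothetical stand-in for one fitted distribution."""
       name: str
       aic: float
       ks_statistic: Optional[float] = None  # None until computed on demand


   def best_by_ks(fits: List[Fit], n: int,
                  compute_ks: Callable[[Fit], float],
                  factor: int = 3) -> List[Fit]:
       # Step 1: shortlist the top n * factor fits by AIC (already computed).
       shortlist = sorted(fits, key=lambda f: f.aic)[: n * factor]
       # Step 2: compute KS only for shortlisted fits that still lack it.
       for fit in shortlist:
           if fit.ks_statistic is None:
               fit.ks_statistic = compute_ks(fit)
       # Step 3: re-sort the shortlist by the actual KS statistic.
       return sorted(shortlist, key=lambda f: f.ks_statistic)[:n]


   # Toy data: 90 fits with made-up AIC values and a made-up KS lookup.
   fits = [Fit(name=f"dist_{i}", aic=100.0 + i) for i in range(90)]
   fake_ks = {"dist_0": 0.09, "dist_1": 0.07, "dist_2": 0.05}
   calls = []


   def compute_ks(fit: Fit) -> float:
       calls.append(fit.name)  # record each expensive evaluation
       return fake_ks.get(fit.name, 0.5)


   best = best_by_ks(fits, n=1, compute_ks=compute_ks)
   print(best[0].name, len(calls))  # dist_2 3

In this toy run the best fit by KS differs from the best fit by AIC, yet only
3 of the 90 fits are evaluated. A fit outside the AIC shortlist is never
considered, which is the trade-off of this kind of shortlist.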

Materializing All Metrics
-------------------------

If you need all metrics computed (e.g., before unpersisting the source DataFrame),
use the ``materialize()`` method:

.. code-block:: python

   # Fit with lazy metrics
   results = fitter.fit(df, "value", lazy_metrics=True)

   # Fast model selection
   best_aic = results.best(n=1, metric="aic")[0]

   # Before unpersisting, materialize all metrics
   materialized = results.materialize()
   print(materialized.is_lazy)  # False

   # Now safe to unpersist source data
   df.unpersist()

   # Access any metric on materialized results
   best_ks = materialized.best(n=1, metric="ks_statistic")[0]
   print(best_ks.ks_statistic)  # Computed value

.. warning::

   If you try to compute lazy metrics after the source DataFrame has been
   unpersisted, you'll get a ``RuntimeError``. Always call ``materialize()``
   before unpersisting if you need KS/AD metrics later.
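
If you cannot guarantee the ordering, one defensive pattern is to catch the
error at the call site. The sketch below relies only on the ``best()`` call and
the ``RuntimeError`` described above; the helper name ``best_ks_or_none``, the
stub class and its error message are made up for illustration.

.. code-block:: python

   def best_ks_or_none(results):
       """Return the best fit by KS, or None if it can no longer be computed."""
       try:
           return results.best(n=1, metric="ks_statistic")[0]
       except RuntimeError:
           # Per the warning above: lazy KS/AD needs the source DataFrame.
           return None


   class ReleasedResultsStub:
       """Stand-in for lazy results whose source data is gone (illustrative)."""

       def best(self, n=1, metric="aic"):
           raise RuntimeError("source data unavailable")  # made-up message


   print(best_ks_or_none(ReleasedResultsStub()))  # None

Catching ``RuntimeError`` this broadly can hide unrelated failures, so calling
``materialize()`` before unpersisting remains the safer choice.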

When to Use Lazy Metrics
------------------------

**Use lazy_metrics=True when:**

- You're doing model selection using AIC/BIC (recommended for most cases)
- You're iterating quickly and want faster feedback
- You're fitting many distributions but only need KS/AD for a few top candidates

**Use lazy_metrics=False (default) when:**

- You need KS/AD statistics for all distributions upfront
- You want to filter results by KS thresholds (``filter(ks_threshold=0.1)``)
- You need p-values for statistical significance testing on all fits
- You plan to serialize results and need complete data

Filter Behavior
---------------

Note that ``filter(ks_threshold=...)`` cannot trigger lazy computation: the
threshold has to be checked against the KS statistic of every result row, so
there is no shortlist to compute on demand. If you use filtering with lazy
results, a warning is emitted:

.. code-block:: python

   # This will warn - can't lazily compute for filter
   filtered = results.filter(ks_threshold=0.1)

   # Instead, materialize first, then filter
   materialized = results.materialize()
   filtered = materialized.filter(ks_threshold=0.1)
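
The materialize-then-filter pattern can be wrapped in a small helper. This is a
sketch that uses only the ``is_lazy``, ``materialize()`` and
``filter(ks_threshold=...)`` members shown on this page; the helper name
``filter_by_ks`` and the ``StubResults`` class (including its assumed "keep fits
at or below the threshold" behavior) are hypothetical and exist only to make
the example runnable without Spark.

.. code-block:: python

   def filter_by_ks(results, threshold):
       """Materialize lazy results before applying a KS threshold."""
       if results.is_lazy:
           results = results.materialize()
       return results.filter(ks_threshold=threshold)


   class StubResults:
       """Minimal stand-in for fit results (illustrative only)."""

       def __init__(self, ks_values, is_lazy):
           self.ks_values = ks_values
           self.is_lazy = is_lazy

       def materialize(self):
           return StubResults(self.ks_values, is_lazy=False)

       def filter(self, ks_threshold):
           # Assumed semantics: keep fits with KS at or below the threshold.
           kept = [v for v in self.ks_values if v <= ks_threshold]
           return StubResults(kept, self.is_lazy)


   lazy = StubResults([0.02, 0.08, 0.30], is_lazy=True)
   kept = filter_by_ks(lazy, threshold=0.1)
   print(kept.ks_values, kept.is_lazy)  # [0.02, 0.08] False

As with any materialization, the source DataFrame must still be available when
the helper runs.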

Why Lazy Metrics Matters
------------------------

.. image:: /_static/lazy_metrics.png
   :alt: Lazy metrics performance comparison
   :width: 100%

The value of lazy metrics isn't measured in wall-clock speedup for a single fit.
It comes from **skipping work you don't need** across your entire workflow.

**The core insight:** When fitting ~90 distributions (default), you typically only examine
the top 3-5 results. With eager evaluation, you compute KS/AD statistics for all
90 distributions. With lazy evaluation, you compute them for **only the ones you
actually access**.

.. list-table:: What Gets Computed
   :header-rows: 1
   :widths: 40 30 30

   * - Workflow
     - Eager Mode
     - Lazy Mode
   * - ``best(n=1, metric="aic")``
     - 90 KS/AD computations
     - **0** KS/AD computations
   * - ``best(n=1, metric="ks_statistic")``
     - 90 KS/AD computations
     - **~5** KS/AD computations
   * - ``materialize()`` then filter
     - 90 KS/AD computations
     - 90 KS/AD computations

Scaling Characteristics
-----------------------

**Why lazy metrics scales well for production workloads:**

1. **Fixed sample size**: KS/AD computation uses a fixed 10K sample regardless of
   data size. The savings are constant whether you have 100K rows or 1 billion rows.

2. **Multiplicative savings**: When fitting multiple columns or running repeated
   experiments, the savings multiply:

   .. code-block:: text

      10 columns x 5 iterations x 85 skipped distributions
      = 4,250 KS/AD computations avoided

3. **Interactive workflows**: During exploratory analysis, you iterate quickly
   using AIC/BIC for model selection. Lazy metrics gives you instant feedback
   without waiting for KS/AD computation you won't use until final validation.

4. **Surgical on-demand computation**: When you request ``best(metric="ks_statistic")``,
   spark-bestfit first takes the top candidates by AIC (already computed), then
   computes KS/AD for only those ~5 candidates rather than all 90 distributions.
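
The multiplicative savings in point 2 can be reproduced for your own workload
with a few lines of Python. The function name ``evaluations_avoided`` is
illustrative, and the time range is only a rough sequential estimate derived
from the per-metric figures in the cost table; actual savings depend on
parallelism and hardware.

.. code-block:: python

   def evaluations_avoided(columns, iterations, distributions=90, accessed=5):
       """KS/AD evaluations skipped when only `accessed` fits are examined."""
       return columns * iterations * (distributions - accessed)


   avoided = evaluations_avoided(columns=10, iterations=5)
   print(avoided)  # 4250

   # Rough sequential cost per evaluation from the cost table:
   # ~100 ms for KS plus ~200-500 ms for AD.
   low_s = avoided * (100 + 200) / 1000
   high_s = avoided * (100 + 500) / 1000
   print(low_s, high_s)  # 1275.0 2550.0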

**Production recommendation**: Use ``lazy_metrics=True`` as the default for
exploratory analysis and model selection. Only use ``lazy_metrics=False`` when you
need KS/AD statistics for all distributions upfront (e.g., for comprehensive reports
or filtering by KS threshold).