Gaussian Copula
===============

The ``GaussianCopula`` class enables correlated multi-column sampling at scale.
Unlike standard copula libraries that require loading data into memory, spark-bestfit
computes correlation via Spark ML and generates samples across the cluster.

Why Use a Copula?
-----------------

When you fit distributions to multiple columns independently, the correlation
structure between columns is lost:

.. code-block:: python

    # Independent fitting loses correlation
    results = fitter.fit(df, columns=["price", "quantity", "revenue"])

    # Sampling each column independently - correlation is LOST
    price_samples = results.for_column("price").best(n=1)[0].sample(1000)
    quantity_samples = results.for_column("quantity").best(n=1)[0].sample(1000)
    # These are uncorrelated! Not realistic.

A Gaussian copula preserves both:

- **Marginal distributions**: Each column follows its fitted distribution
- **Correlation structure**: Columns maintain their original relationships
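
The loss of correlation can be reproduced without Spark. The following is a
minimal NumPy sketch, not spark-bestfit code: the synthetic ``price`` and
``quantity`` arrays are made up for illustration. Permuting each column
independently keeps its marginal distribution exactly but breaks the row-wise
pairing, which mimics sampling each fitted column on its own.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic, positively correlated columns (illustrative data only)
    price = rng.gamma(shape=2.0, scale=5.0, size=50_000)
    quantity = 3.0 * price + rng.normal(0.0, 5.0, size=50_000)
    original_corr = np.corrcoef(price, quantity)[0, 1]

    # Shuffle each column independently: marginals are unchanged,
    # but the pairing between columns is destroyed
    price_ind = rng.permutation(price)
    quantity_ind = rng.permutation(quantity)
    independent_corr = np.corrcoef(price_ind, quantity_ind)[0, 1]

    print(f"original:    {original_corr:.3f}")     # strongly positive
    print(f"independent: {independent_corr:.3f}")  # close to zero
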

When to Use spark-bestfit Copula
---------------------------------

spark-bestfit is **not faster** than statsmodels for small data. The value is **scale**:

.. list-table::
   :header-rows: 1

   * - Scenario
     - statsmodels
     - spark-bestfit
   * - Data < 10M rows
     - Faster (use this)
     - Slower (Spark overhead)
   * - Data > 100M rows
     - Crashes (OOM)
     - **Works** (distributed)
   * - Data already in Spark
     - Requires ``.toPandas()``
     - **Native** (no conversion)
   * - 100M+ samples needed
     - May OOM
     - **sample_distributed()** scales

Basic Usage
-----------

Fit a copula from multi-column fit results:

.. code-block:: python

    from spark_bestfit import DistributionFitter, GaussianCopula
    from spark_bestfit.backends import BackendFactory

    # Fit multiple columns
    fitter = DistributionFitter(spark)
    results = fitter.fit(df, columns=["price", "quantity", "revenue"])

    # Fit copula - correlation computed via Spark ML (scales to billions)
    copula = GaussianCopula.fit(results, df)

    # Generate correlated samples locally
    samples = copula.sample(n=10000)  # Dict[str, np.ndarray]

    # Or distributed via any backend
    backend = BackendFactory.create("spark", spark_session=spark)
    samples_df = copula.sample_distributed(n=100_000_000, backend=backend)

The ``df`` parameter is required to compute the correlation matrix. The copula
uses Spearman rank correlation, which captures monotonic relationships even when
they are non-linear.
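
As a small illustration of what rank correlation measures, the sketch below
compares Pearson and Spearman correlation on made-up data. It is a standalone
SciPy example and says nothing about how spark-bestfit computes the matrix on
the cluster.

.. code-block:: python

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    y = np.exp(3.0 * x)  # monotonic in x, but strongly non-linear

    pearson_r, _ = stats.pearsonr(x, y)
    spearman_rho, _ = stats.spearmanr(x, y)

    # Pearson measures linear association and lands well below 1 here;
    # Spearman only looks at ranks, which x and y share
    print(f"Pearson:  {pearson_r:.3f}")
    print(f"Spearman: {spearman_rho:.3f}")
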

Local vs Distributed Sampling
-----------------------------

.. list-table::
   :header-rows: 1

   * - Method
     - Use Case
     - Output
   * - ``sample(n=N)``
     - Small to medium samples (< 10M)
     - Dict[str, np.ndarray]
   * - ``sample_distributed(n=N, backend=...)``
     - Large samples (> 10M)
     - DataFrame (Spark/pandas)

For small samples, ``sample()`` is faster because it avoids the overhead of distributing work:

.. code-block:: python

    import pandas as pd

    # Local sampling - fast for small n
    samples = copula.sample(n=10000, random_state=42)
    samples_pdf = pd.DataFrame(samples)  # one column per dictionary key

    # Distributed sampling - efficient for large n
    backend = BackendFactory.create("spark", spark_session=spark)
    samples_df = copula.sample_distributed(n=100_000_000, backend=backend, random_seed=42)

Backend Options
---------------

Use any backend for distributed copula sampling:

.. code-block:: python

    from spark_bestfit.backends import BackendFactory

    # Spark
    backend = BackendFactory.create("spark", spark_session=spark)
    samples_df = copula.sample_distributed(n=100_000_000, backend=backend)

    # Ray
    backend = BackendFactory.create("ray")
    samples_df = copula.sample_distributed(n=100_000_000, backend=backend)

    # Local (for testing)
    backend = BackendFactory.create("local", max_workers=4)
    samples_df = copula.sample_distributed(n=100_000, backend=backend)

Fast Uniform Sampling
---------------------

Both ``sample()`` and ``sample_distributed()`` support a ``return_uniform=True`` parameter
that skips the marginal distribution transforms, returning uniform [0,1] samples instead.
This matches what statsmodels returns and, in the benchmarks below, is roughly 28x faster
than the full transform:

.. code-block:: python

    # Fast path - returns uniform samples without marginal transforms
    uniform_samples = copula.sample(n=10_000_000, return_uniform=True)

    # Full transform - slower but returns samples in fitted marginal distributions
    marginal_samples = copula.sample(n=10_000_000)

**When to use** ``return_uniform=True``:

- You only need the correlation structure, not the exact marginal distributions
- You're doing correlation analysis or downstream transforms
- Performance is critical

Performance Benchmarks
----------------------

Sampling performance comparison (3-column copula, local mode):

.. list-table::
   :header-rows: 1

   * - N Samples
     - statsmodels
     - return_uniform
     - with transform
   * - 1,000,000
     - 73 ms
     - **56 ms**
     - 1,547 ms
   * - 10,000,000
     - 725 ms
     - **555 ms**
     - 15,485 ms
   * - 50,000,000
     - 3,706 ms
     - **2,820 ms**
     - 77,820 ms

**Key findings:**

- ``return_uniform=True`` takes ~23-24% less time than statsmodels (same output format)
- Full marginal transforms are ~28x slower than the uniform fast path, because scipy's PPF uses iterative root-finding
- Use ``return_uniform=True`` when you don't need the exact marginal distributions
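
The ratios quoted above follow directly from the table. A short check using
only the published timings (the dictionary layout and key names are just a
convenient way to hold them):

.. code-block:: python

    # Timings in milliseconds, copied from the benchmark table
    timings = {
        1_000_000: {"statsmodels": 73, "uniform": 56, "transform": 1_547},
        10_000_000: {"statsmodels": 725, "uniform": 555, "transform": 15_485},
        50_000_000: {"statsmodels": 3_706, "uniform": 2_820, "transform": 77_820},
    }

    savings = []    # fraction of statsmodels' time saved by return_uniform
    overheads = []  # full transform time relative to return_uniform

    for n, t in timings.items():
        saving = 1 - t["uniform"] / t["statsmodels"]
        overhead = t["transform"] / t["uniform"]
        savings.append(saving)
        overheads.append(overhead)
        print(f"{n:>11,}  saving={saving:.1%}  overhead={overhead:.1f}x")

The savings come out at about 23 to 24 percent and the overhead at about 28x
for every sample size, so both ratios are stable as ``n`` grows.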

Fast PPF Optimization
---------------------

.. versionadded:: 2.7.0

For common distributions, spark-bestfit bypasses scipy's generic PPF machinery (which uses
iterative root-finding) by calling scipy.special functions directly. This optimization is
applied automatically during copula sampling.

**Supported distributions with fast PPF:**

- ``norm`` - Normal/Gaussian
- ``expon`` - Exponential
- ``uniform`` - Uniform
- ``lognorm`` - Log-normal
- ``weibull_min`` - Weibull (minimum)
- ``gamma`` - Gamma
- ``beta`` - Beta

For these distributions, marginal transforms are **~10-20x faster** than the generic scipy path.
Other distributions automatically fall back to scipy.stats.
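
To illustrate the idea, the inverse CDF of several of these distributions can
be written with ``scipy.special`` primitives plus a ``loc``/``scale`` shift.
This is a sketch of the general technique using public SciPy functions, not the
library's actual ``fast_ppf`` implementation, and the helper names are
hypothetical.

.. code-block:: python

    import numpy as np
    from scipy import special, stats

    def norm_ppf(q, loc=0.0, scale=1.0):
        # ndtri is the inverse of the standard normal CDF
        return loc + scale * special.ndtri(q)

    def expon_ppf(q, loc=0.0, scale=1.0):
        # Closed form: -log(1 - q)
        return loc - scale * np.log1p(-q)

    def gamma_ppf(q, a, loc=0.0, scale=1.0):
        # Inverse of the regularized lower incomplete gamma function
        return loc + scale * special.gammaincinv(a, q)

    def weibull_min_ppf(q, c, loc=0.0, scale=1.0):
        # Closed form: (-log(1 - q)) ** (1 / c)
        return loc + scale * (-np.log1p(-q)) ** (1.0 / c)

    q = np.array([0.1, 0.5, 0.9])

    # Each direct formula should agree with the generic scipy.stats result
    checks = {
        "norm": np.allclose(norm_ppf(q, 1.0, 2.0),
                            stats.norm.ppf(q, loc=1.0, scale=2.0)),
        "expon": np.allclose(expon_ppf(q, 0.0, 3.0),
                             stats.expon.ppf(q, loc=0.0, scale=3.0)),
        "gamma": np.allclose(gamma_ppf(q, 2.0, 0.0, 1.0),
                             stats.gamma.ppf(q, 2.0, loc=0.0, scale=1.0)),
        "weibull_min": np.allclose(weibull_min_ppf(q, 1.5, 0.0, 2.0),
                                   stats.weibull_min.ppf(q, 1.5, loc=0.0, scale=2.0)),
    }
    print(checks)
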

**Usage:**

No code changes required - the optimization is applied automatically in both ``sample()``
and ``sample_distributed()``:

.. code-block:: python

    # Fast PPF is used automatically for supported distributions
    samples = copula.sample(n=1_000_000)  # Uses fast_ppf for norm, gamma, etc.

**Direct access (advanced):**

If you need to use the fast PPF implementation directly:

.. code-block:: python

    from spark_bestfit.fast_ppf import fast_ppf, has_fast_ppf
    import numpy as np

    # Check if a distribution has fast PPF support
    has_fast_ppf("gamma")  # True
    has_fast_ppf("pareto")  # False

    # Compute PPF directly
    q = np.array([0.1, 0.5, 0.9])
    values = fast_ppf("gamma", (2.0, 0.0, 1.0), q)  # shape=2, loc=0, scale=1

Serialization
-------------

Save and load copulas for later use:

.. code-block:: python

    # Save to JSON (recommended)
    copula.save("copula.json")

    # Or pickle for faster serialization
    copula.save("copula.pkl")

    # Load later
    loaded = GaussianCopula.load("copula.json")
    samples = loaded.sample(n=1000)

The JSON format includes metadata for debugging:

.. code-block:: javascript

    {
      "schema_version": "1.0",
      "spark_bestfit_version": "2.6.0",
      "created_at": "2026-01-04T20:00:00Z",
      "type": "gaussian_copula",
      "column_names": ["price", "quantity", "revenue"],
      "correlation_matrix": [[1.0, 0.8, 0.9], ...],
      "marginals": {
        "price": {"distribution": "gamma", "parameters": [2.0, 0.0, 5.0]},
        ...
      }
    }

How It Works
------------

The Gaussian copula sampling process:

1. **Fit phase**: Compute Spearman correlation matrix via Spark ML (no ``.toPandas()``)
2. **Sample phase**:

   a. Generate multivariate normal samples with the correlation matrix
   b. Transform each normal sample -> uniform via Φ (the standard normal CDF)
   c. Transform each uniform -> target marginal via PPF (inverse CDF)

This ensures that:

- Each column follows its fitted marginal distribution
- Columns maintain the correlation structure from the original data
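
The sample phase can be sketched in a few lines of NumPy and SciPy. This is an
illustration of the three steps, not the library's implementation: the
correlation matrix and the two marginals are made-up inputs, and using the
matrix directly as the covariance of the normal draw is an assumption of this
sketch.

.. code-block:: python

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Assumed inputs: a correlation matrix and one frozen marginal per column
    corr = np.array([[1.0, 0.8], [0.8, 1.0]])
    marginals = {
        "price": stats.gamma(a=2.0, scale=5.0),
        "quantity": stats.expon(scale=10.0),
    }

    # (a) Multivariate normal samples with the given correlation matrix
    z = rng.multivariate_normal(mean=np.zeros(len(corr)), cov=corr, size=50_000)

    # (b) The standard normal CDF maps each column to uniform [0, 1]
    u = stats.norm.cdf(z)

    # (c) Each marginal's PPF maps the uniforms to the target distribution
    samples = {
        name: dist.ppf(u[:, i]) for i, (name, dist) in enumerate(marginals.items())
    }

    rho, _ = stats.spearmanr(samples["price"], samples["quantity"])
    print(f"Spearman rho: {rho:.3f}")
    # The gamma mean is a * scale = 10, so the sample mean should be close to it
    print(f"mean price:   {samples['price'].mean():.2f}")
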

Removed: sample_spark()
-----------------------

.. versionchanged:: 3.0.1
   The ``sample_spark()`` method was removed in v3.0.1 (deprecated since v2.2.0).
   Use ``sample_distributed()`` with an explicit backend instead.

If you are migrating from code that used ``sample_spark()``:

.. code-block:: python

    # Removed (was deprecated since v2.2.0)
    # samples_df = copula.sample_spark(n=100_000_000)

    # Use instead
    from spark_bestfit.backends import BackendFactory

    backend = BackendFactory.create("spark", spark_session=spark)
    samples_df = copula.sample_distributed(n=100_000_000, backend=backend)

API Reference
-------------

See :class:`spark_bestfit.copula.GaussianCopula` for full API documentation.