Architecture
============

This page provides a visual overview of spark-bestfit's architecture,
showing component relationships and data flow during distribution fitting.

Component Architecture
----------------------

spark-bestfit uses a layered architecture with pluggable backends:

.. code-block:: text

    ┌─────────────────────────────────────────────────────────────────────┐
    │                         User Application                            │
    │                    (Your Python/Spark/Ray code)                     │
    └─────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │                         Public API Layer                            │
    ├─────────────────────────────────────────────────────────────────────┤
    │  DistributionFitter    │  DiscreteDistributionFitter                │
    │  GaussianCopula        │  GaussianMixtureFitter                     │
    │  MultivariateNormalFitter                                           │
    └─────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │                       Configuration Layer                           │
    ├─────────────────────────────────────────────────────────────────────┤
    │  FitterConfig          │  FitterConfigBuilder                       │
    │  DistributionRegistry  │  DiscreteDistributionRegistry              │
    └─────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │                   ExecutionBackend Protocol                         │
    │              (Abstract interface - PEP 544 Protocol)                │
    ├─────────────────────────────────────────────────────────────────────┤
    │  SparkBackend          │  RayBackend          │  LocalBackend       │
    │  (Apache Spark)        │  (Ray clusters)      │  (ThreadPool)       │
    └─────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │                      Core Fitting Engine                            │
    ├─────────────────────────────────────────────────────────────────────┤
    │  estimation.py         │  fitting.py          │  metrics.py         │
    │  (MLE/MSE estimation)  │  (scipy.stats fit)   │  (KS, AD, AIC, BIC) │
    └─────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
    ┌─────────────────────────────────────────────────────────────────────┐
    │                       Results & Output                              │
    ├─────────────────────────────────────────────────────────────────────┤
    │  FitResults            │  DistributionFitResult                     │
    │  (Collection)          │  (Individual result with metrics)          │
    └─────────────────────────────────────────────────────────────────────┘

Layer Descriptions
------------------

**Public API Layer**
    User-facing classes for distribution fitting. ``DistributionFitter`` handles
    continuous distributions (~90 scipy.stats distributions), while
    ``DiscreteDistributionFitter`` handles count data (16 discrete distributions).
    Specialized fitters exist for copulas, mixtures, and multivariate normals.

**Configuration Layer**
    Controls fitting behavior. ``FitterConfig`` specifies bins, thresholds,
    excluded distributions, and sampling modes. Distribution registries manage
    the mapping between distribution names and scipy.stats implementations.

**ExecutionBackend Protocol**
    Abstraction layer enabling pluggable backends. It relies on Python's
    structural subtyping (PEP 544): any class that implements the required
    methods qualifies as a backend, with no inheritance needed.
    The ``BackendFactory`` provides convenient creation and auto-detection.

**Core Fitting Engine**
    Pure Python fitting logic that runs on workers. It covers parameter
    estimation (MLE or MSE), goodness-of-fit metrics (KS, AD, AIC, BIC, SSE),
    and histogram-based fitting algorithms. This code is serialized and sent
    to distributed workers.

**Results & Output**
    Structured results with rich querying. ``FitResults`` holds all fitted
    distributions with methods like ``best(n=5)`` and ``filter()``.
    Individual ``DistributionFitResult`` objects contain parameters,
    metrics, and methods for confidence intervals and sampling.
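
The structural subtyping described for the ``ExecutionBackend`` protocol can be
illustrated with a minimal sketch. The method name ``parallel_map`` and the
``SerialBackend`` class are hypothetical and chosen only for illustration; the
methods the real protocol requires may differ.

.. code-block:: python

    from typing import Any, Callable, Iterable, List, Protocol, runtime_checkable


    @runtime_checkable
    class ExecutionBackend(Protocol):
        """Hypothetical protocol; the real one may declare different methods."""

        def parallel_map(
            self, func: Callable[[Any], Any], items: Iterable[Any]
        ) -> List[Any]:
            ...


    class SerialBackend:
        """Toy backend. Note that it does not inherit from ExecutionBackend."""

        def parallel_map(self, func, items):
            return [func(item) for item in items]


    def run(backend: ExecutionBackend, values):
        # Any object with a matching parallel_map method is accepted here.
        return backend.parallel_map(lambda v: v * v, values)


    backend = SerialBackend()
    is_backend = isinstance(backend, ExecutionBackend)  # checks method presence only
    squares = run(backend, [1, 2, 3])

Because conformance is decided by the methods a class exposes, a test double
such as ``SerialBackend`` can stand in for a cluster backend without touching
any base class.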

Data Flow: Distribution Fitting
-------------------------------

The following diagram shows the data flow during a ``fitter.fit()`` call:

.. code-block:: text

    ┌──────────────────────────────────────────────────────────────────────┐
    │                           INPUT DATA                                 │
    │              (Spark DataFrame / Ray Dataset / pandas)                │
    └──────────────────────────────────────────────────────────────────────┘
                                      │
                     ┌────────────────┴────────────────┐
                     ▼                                 ▼
         ┌──────────────────────┐         ┌──────────────────────┐
         │  Distributed Stats   │         │   Distributed        │
         │  (min, max, count)   │         │   Histogram          │
         │       ~O(N)          │         │   Computation        │
         └──────────────────────┘         │       ~O(N)          │
                     │                    └──────────────────────┘
                     │                                 │
                     └────────────────┬────────────────┘
                                      ▼
    ┌──────────────────────────────────────────────────────────────────────┐
    │                     DRIVER COLLECTS                                  │
    │              histogram (~8KB) + sample (~80KB)                       │
    │                    Raw data stays distributed                        │
    └──────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
    ┌──────────────────────────────────────────────────────────────────────┐
    │                   BROADCAST TO WORKERS                               │
    │              histogram + sample cached on each worker                │
    └──────────────────────────────────────────────────────────────────────┘
                                      │
            ┌─────────────────────────┼─────────────────────────┐
            ▼                         ▼                         ▼
    ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
    │   Worker 1    │         │   Worker 2    │         │   Worker N    │
    │  fit: norm    │         │  fit: gamma   │         │  fit: beta    │
    │  fit: expon   │         │  fit: weibull │         │  fit: lognorm │
    │     ...       │         │     ...       │         │     ...       │
    └───────────────┘         └───────────────┘         └───────────────┘
            │                         │                         │
            └─────────────────────────┼─────────────────────────┘
                                      ▼
    ┌──────────────────────────────────────────────────────────────────────┐
    │                    COLLECT RESULTS                                   │
    │          ~90 DistributionFitResult objects (~50KB total)             │
    └──────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
    ┌──────────────────────────────────────────────────────────────────────┐
    │                      FitResults                                      │
    │              .best(n=5)  .filter()  .to_pandas()                     │
    └──────────────────────────────────────────────────────────────────────┘

Key Design Decisions
--------------------

**Histogram-based fitting**
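    Rather than sending raw data to workers, spark-bestfit computes a histogram
    once and broadcasts it. Because fitting runs on this fixed-size summary,
    only the initial histogram pass (~O(N)) depends on row count, and total
    runtime scales sub-linearly in practice: fitting 1M rows takes nearly the
    same time as fitting 100K rows.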

**Embarrassingly parallel distribution fitting**
    Each distribution is fitted independently, making the workload perfectly
    parallelizable. Distributions are interleaved to spread slow ones
    (like ``burr``, ``t``) across partitions.

**Protocol-based backend abstraction**
    The ``ExecutionBackend`` protocol uses structural subtyping (static duck typing),
    so backends don't need to inherit from a base class. This keeps backend
    implementations simple and testable.

**Broadcast variables for efficiency**
    Histogram and sample data are broadcast once to all workers, avoiding
    repeated serialization overhead for each fitting task.
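
One way to interleave distributions so that slow ones land on different
partitions is a round-robin deal that places the slow names first. This is a
sketch of the idea only; the ``interleave`` helper and the choice of which
distributions count as slow are assumptions, and the library's actual ordering
scheme may differ.

.. code-block:: python

    def interleave(names, slow, num_partitions):
        """Deal slow distributions first, round-robin, so no partition gets them all."""
        ordered = [n for n in names if n in slow] + [n for n in names if n not in slow]
        partitions = [[] for _ in range(num_partitions)]
        for i, name in enumerate(ordered):
            partitions[i % num_partitions].append(name)
        return partitions


    names = ["norm", "expon", "burr", "t", "gamma", "beta"]
    partitions = interleave(names, slow={"burr", "t"}, num_partitions=2)
    # partitions == [["burr", "norm", "gamma"], ["t", "expon", "beta"]]

Each partition ends up with one slow distribution and two fast ones, which
keeps a single worker from becoming the straggler.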

See Also
--------

- :doc:`backends` - Detailed backend comparison and usage
- :doc:`performance` - Scaling characteristics and tuning recommendations
- :doc:`api` - Complete API reference