ADR-0002: Distribution Registry Pattern

Status:

Accepted

Date:

2025-12-24 (v0.4.0, discrete distributions); 2026-01-09 (v2.4.0, custom distributions)

Context

spark-bestfit fits data against scipy.stats distributions. However, not all ~100 scipy continuous distributions are practical to fit:

  1. Performance: Some distributions (levy_stable, studentized_range) take seconds per fit, making parallel fitting impractical

  2. Numerical stability: Some distributions (wald, geninvgauss) can hang or produce invalid results with certain data

  3. Discrete differences: Discrete distributions lack scipy’s fit() method and require custom parameter estimation logic

  4. Extensibility: Users may want to fit custom distributions not in scipy

We needed a centralized way to manage which distributions are available, excluded, and how they’re configured.

Decision

We created two registry classes in distributions.py:

DistributionRegistry (continuous distributions):

from typing import List, Optional

from scipy.stats import rv_continuous


class DistributionRegistry:
    DEFAULT_EXCLUSIONS = {
        "levy_stable",    # Extremely slow
        "studentized_range",  # Very slow
        "geninvgauss",    # Can hang
        # ... 19 total exclusions
    }

    SLOW_DISTRIBUTIONS = {
        "powerlognorm",   # ~160ms
        "t",              # ~144ms
        # ... used for partition weighting
    }

    def get_distributions(
        self,
        support_at_zero: bool = False,
        additional_exclusions: Optional[List[str]] = None,
    ) -> List[str]: ...

    def register_distribution(
        self,
        name: str,
        distribution: rv_continuous,
    ) -> None: ...
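To make the interface above concrete, here is a minimal, self-contained sketch of how exclusion filtering, support filtering, and registration might fit together. It is not the spark-bestfit implementation: ToyDistribution, ToyRegistry, and REQUIRED_METHODS are hypothetical names, and the stand-in objects only mimic the lower support bound attribute `a` that scipy distributions expose.

```python
import math
from typing import Dict, Iterable, List, Optional


class ToyDistribution:
    """Stand-in for a scipy distribution; `a` is the lower support bound."""

    def __init__(self, lower_bound: float) -> None:
        self.a = lower_bound

    def pdf(self, x: float) -> float:  # placeholder, not a real density
        return 0.0

    def cdf(self, x: float) -> float:  # placeholder, not a real CDF
        return 0.0


class ToyRegistry:
    # Assumption: the ADR does not say which methods are validated.
    REQUIRED_METHODS = ("pdf", "cdf")

    def __init__(self, distributions: Dict[str, ToyDistribution],
                 default_exclusions: Iterable[str]) -> None:
        self._distributions = dict(distributions)
        self._exclusions = set(default_exclusions)

    def remove_exclusion(self, name: str) -> None:
        self._exclusions.discard(name)

    def register_distribution(self, name: str, distribution: object) -> None:
        missing = [m for m in self.REQUIRED_METHODS
                   if not callable(getattr(distribution, m, None))]
        if missing:
            raise TypeError(f"{name!r} is missing required methods: {missing}")
        self._distributions[name] = distribution

    def get_distributions(
        self,
        support_at_zero: bool = False,
        additional_exclusions: Optional[List[str]] = None,
    ) -> List[str]:
        excluded = self._exclusions | set(additional_exclusions or [])
        names = []
        for name, dist in self._distributions.items():
            if name in excluded:
                continue
            if support_at_zero and dist.a < 0:
                continue  # keep only distributions with dist.a >= 0
            names.append(name)
        return sorted(names)


registry = ToyRegistry(
    {
        "norm": ToyDistribution(-math.inf),
        "expon": ToyDistribution(0.0),
        "levy_stable": ToyDistribution(-math.inf),
    },
    default_exclusions={"levy_stable"},
)
all_names = registry.get_distributions()
positive_only = registry.get_distributions(support_at_zero=True)
```

In this sketch the default exclusion hides levy_stable until remove_exclusion("levy_stable") is called, and support_at_zero=True drops norm because its lower bound is negative.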

DiscreteDistributionRegistry (discrete distributions):

from typing import Any, Dict


class DiscreteDistributionRegistry:
    def __init__(self):
        self._param_configs = self._build_param_configs()

    def get_param_config(self, dist_name: str) -> Dict[str, Any]:
        # Returns: initial estimates, bounds, param_names
        # Needed because discrete dists lack fit()
        ...
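As an illustration of why a parameter config is needed, the sketch below builds one for a Poisson distribution and evaluates the negative log-likelihood that a fitter would minimize. The function names and dictionary keys are assumptions modeled on the comment above ("initial estimates, bounds, param_names"); the ADR does not specify the actual layout or the estimation routine.

```python
import math
from typing import Any, Dict, List


def poisson_param_config(data: List[int]) -> Dict[str, Any]:
    """Hypothetical config for a Poisson fit: names, initial estimate, bounds."""
    mean = sum(data) / len(data)
    return {
        "param_names": ["mu"],
        "initial": [max(mean, 1e-6)],  # sample mean as the starting point
        "bounds": [(1e-6, None)],      # mu must be strictly positive
    }


def poisson_neg_log_likelihood(mu: float, data: List[int]) -> float:
    """Negative log-likelihood, using log pmf = k*log(mu) - mu - log(k!)."""
    return -sum(k * math.log(mu) - mu - math.lgamma(k + 1) for k in data)


data = [2, 0, 3, 1, 4, 2]
config = poisson_param_config(data)
mu0 = config["initial"][0]
nll_at_initial = poisson_neg_log_likelihood(mu0, data)
```

For the Poisson case the sample mean is already the maximum-likelihood estimate, so the initial value needs no further optimization; distributions with more parameters would generally need a bounded numerical optimizer started from the initial estimates.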

Key design choices:

  1. Default exclusions: Curated list of problematic distributions, not a blanket ban. Users can override with remove_exclusion().

  2. Support filtering: support_at_zero=True keeps only distributions whose support is non-negative (lower bound dist.a >= 0), which is useful for positive-only data.

  3. Slow distribution tracking: SLOW_DISTRIBUTIONS is used for partition weighting to balance load across workers (a partition that holds slow distributions is assigned fewer distributions overall).

  4. Custom distribution registration (v2.4.0): Users can register rv_continuous subclasses with validation of required methods.
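The partition weighting in item 3 can be sketched as a greedy assignment in which each distribution goes to the partition with the lowest accumulated weight. This is only an illustration under stated assumptions: assign_partitions and the slow_weight value of 3.0 are hypothetical, and the ADR does not describe the actual weighting scheme.

```python
import heapq
from typing import List, Set


def assign_partitions(
    names: List[str],
    slow: Set[str],
    num_partitions: int,
    slow_weight: float = 3.0,  # assumed relative cost of a slow distribution
) -> List[List[str]]:
    """Greedily place the heaviest distributions first on the lightest partition."""
    weights = {n: (slow_weight if n in slow else 1.0) for n in names}
    # Min-heap of (accumulated weight, partition index).
    heap = [(0.0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    for name in sorted(names, key=lambda n: (-weights[n], n)):
        load, idx = heapq.heappop(heap)
        partitions[idx].append(name)
        heapq.heappush(heap, (load + weights[name], idx))
    return partitions


partitions = assign_partitions(
    ["norm", "expon", "t", "beta"], slow={"t"}, num_partitions=2
)
```

With these inputs the slow distribution t ends up alone in one partition while the three fast distributions share the other, so both partitions carry the same total weight.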

Consequences

Positive:

  • Sensible defaults: Users get fast, stable fitting out of the box

  • Flexibility: Power users can include excluded distributions or add custom ones

  • Documentation: Exclusion reasons are documented in code comments

  • Partition balancing: Slow distributions don’t create stragglers

Negative:

  • Maintenance burden: New scipy versions may add distributions requiring evaluation for exclusion

  • Learning curve: Registering a custom distribution requires familiarity with scipy’s rv_continuous interface

Neutral:

  • Default exclusions are based on empirical timing measurements (documented in code comments with approximate durations)

References

  • PR #22: Discrete distribution fitting (v0.4.0)

  • PR #76: Distribution-aware partitioning (v1.7.0)

  • PR #102: Custom distribution support (v2.4.0)