ADR-0002: Distribution Registry Pattern
- Status: Accepted
- Date: 2025-12-24 (v0.4.0 discrete), 2026-01-09 (v2.4.0 custom)
Context
spark-bestfit fits data against scipy.stats distributions. However, not all ~100 scipy continuous distributions are practical to fit:
Performance: Some distributions (levy_stable, studentized_range) take seconds per fit, making parallel fitting impractical
Numerical stability: Some distributions (wald, geninvgauss) can hang or produce invalid results with certain data
Discrete differences: Discrete distributions lack scipy’s fit() method and require custom parameter estimation logic
Extensibility: Users may want to fit custom distributions not in scipy
We needed a centralized way to manage which distributions are available, excluded, and how they’re configured.
Decision
We created two registry classes in distributions.py:
DistributionRegistry (continuous distributions):
from typing import List, Optional

from scipy.stats import rv_continuous


class DistributionRegistry:
    DEFAULT_EXCLUSIONS = {
        "levy_stable",        # Extremely slow
        "studentized_range",  # Very slow
        "geninvgauss",        # Can hang
        # ... 19 total exclusions
    }

    SLOW_DISTRIBUTIONS = {
        "powerlognorm",  # ~160ms
        "t",             # ~144ms
        # ... used for partition weighting
    }

    def get_distributions(
        self,
        support_at_zero: bool = False,
        additional_exclusions: Optional[List[str]] = None,
    ) -> List[str]: ...

    def register_distribution(
        self,
        name: str,
        distribution: rv_continuous,
    ) -> None: ...
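The exclusion handling above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `ExclusionRegistry` and its `candidates` argument are hypothetical names, only three of the exclusions named in this ADR are listed, and `support_at_zero` is left out because it depends on scipy's support bounds.

```python
from typing import Iterable, List, Optional, Set


class ExclusionRegistry:
    """Illustrative stand-in for the exclusion logic (hypothetical class)."""

    # Subset of the default exclusions named in this ADR.
    DEFAULT_EXCLUSIONS = frozenset({"levy_stable", "studentized_range", "geninvgauss"})

    def __init__(self, candidates: Iterable[str]) -> None:
        # `candidates` stands in for the full list of scipy distribution names.
        self._candidates = sorted(set(candidates))
        # Copy per instance so overrides never mutate the class-level default.
        self._exclusions: Set[str] = set(self.DEFAULT_EXCLUSIONS)

    def remove_exclusion(self, name: str) -> None:
        """Opt back in to a distribution that is excluded by default."""
        self._exclusions.discard(name)

    def get_distributions(
        self, additional_exclusions: Optional[List[str]] = None
    ) -> List[str]:
        """Return candidates minus default and per-call exclusions."""
        excluded = self._exclusions | set(additional_exclusions or [])
        return [name for name in self._candidates if name not in excluded]


registry = ExclusionRegistry(["norm", "gamma", "levy_stable", "geninvgauss"])
default_names = registry.get_distributions()  # ['gamma', 'norm']
```

Keeping the defaults in a class-level frozen set and copying them per instance means one user's `remove_exclusion()` call cannot leak into another registry instance.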
DiscreteDistributionRegistry (discrete distributions):
from typing import Any, Dict


class DiscreteDistributionRegistry:
    def __init__(self):
        self._param_configs = self._build_param_configs()

    def get_param_config(self, dist_name: str) -> Dict[str, Any]:
        # Returns: initial estimates, bounds, param_names
        # Needed because discrete dists lack fit()
        ...
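To make the parameter configuration concrete, here is a rough sketch of what one entry might look like for a Poisson distribution. The dictionary keys (`param_names`, `initial`, `bounds`) and both function names are assumptions for illustration; the ADR only says that the config carries initial estimates, bounds, and parameter names.

```python
import math
from typing import Any, Dict, Sequence


def poisson_param_config(data: Sequence[int]) -> Dict[str, Any]:
    """Build a hypothetical config entry for fitting a Poisson distribution."""
    if not data:
        raise ValueError("data must be non-empty")
    mean = sum(data) / len(data)
    return {
        "param_names": ["mu"],
        # Method-of-moments starting point: the Poisson mean equals mu.
        "initial": [max(mean, 1e-6)],
        # mu must be strictly positive; no upper bound.
        "bounds": [(1e-6, None)],
    }


def poisson_neg_log_likelihood(mu: float, data: Sequence[int]) -> float:
    """Objective that an optimizer would minimize over mu."""
    # log pmf(k; mu) = k * log(mu) - mu - log(k!)
    return -sum(k * math.log(mu) - mu - math.lgamma(k + 1) for k in data)


sample = [2, 3, 4, 3]
config = poisson_param_config(sample)  # initial estimate is the sample mean
```

An optimizer would start from `config["initial"]`, stay within `config["bounds"]`, and minimize the negative log-likelihood. For the Poisson case the starting point is already the maximum-likelihood estimate, which is why a good initial estimate matters for distributions without a closed form.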
Key design choices:
Default exclusions: Curated list of problematic distributions, not a blanket ban. Users can override with remove_exclusion().
Support filtering: support_at_zero=True filters to non-negative distributions (where dist.a >= 0), useful for positive-only data.
Slow distribution tracking: Used for partition weighting to balance load across workers (slower dists get fewer per partition).
Custom distribution registration (v2.4.0): Users can register rv_continuous subclasses with validation of required methods.
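Validation at registration time could look roughly like the sketch below. The ADR does not say which methods are checked, so `REQUIRED_METHODS` is an assumed set, and `missing_methods` and `validate_distribution` are hypothetical helpers that use plain duck typing rather than scipy itself.

```python
from typing import Iterable, List

# Assumed set of required methods; the ADR does not list which ones are checked.
REQUIRED_METHODS = ("pdf", "cdf", "fit")


def missing_methods(
    distribution: object, required: Iterable[str] = REQUIRED_METHODS
) -> List[str]:
    """Return the names in `required` that are absent or not callable."""
    return [
        name for name in required
        if not callable(getattr(distribution, name, None))
    ]


def validate_distribution(name: str, distribution: object) -> None:
    """Fail at registration time instead of later, inside a worker."""
    missing = missing_methods(distribution)
    if missing:
        raise TypeError(
            f"{name!r} is missing required methods: {', '.join(missing)}"
        )
```

Checking up front keeps the error close to the user's `register_distribution()` call, where it is easier to diagnose than a failure raised during a distributed fit.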
Consequences
Positive:
Sensible defaults: Users get fast, stable fitting out of the box
Flexibility: Power users can include excluded distributions or add custom ones
Documentation: Exclusion reasons are documented in code comments
Partition balancing: Slow distributions don’t create stragglers
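One way to picture partition weighting is a greedy "heaviest first, lightest partition next" assignment. The sketch below is an assumption about the general technique, not the library's actual scheme: `weighted_partitions` is a hypothetical function and the `slow_weight` of 3.0 is an arbitrary illustrative value.

```python
import heapq
from typing import Iterable, List, Set


def weighted_partitions(
    names: Iterable[str],
    slow: Set[str],
    num_partitions: int,
    slow_weight: float = 3.0,  # assumed relative cost of a slow distribution
) -> List[List[str]]:
    """Greedily assign distributions so the total weight per partition is balanced."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Heaviest items first, so slow distributions are spread out early.
    weighted = sorted(
        ((slow_weight if name in slow else 1.0, name) for name in names),
        reverse=True,
    )
    # Min-heap of (current load, partition index).
    heap = [(0.0, index) for index in range(num_partitions)]
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    for weight, name in weighted:
        load, index = heapq.heappop(heap)  # currently lightest partition
        partitions[index].append(name)
        heapq.heappush(heap, (load + weight, index))
    return partitions


names = ["powerlognorm", "t", "norm", "gamma", "expon", "beta"]
parts = weighted_partitions(names, slow={"powerlognorm", "t"}, num_partitions=2)
```

With two partitions, each one ends up with a single slow distribution plus two fast ones, so neither worker becomes a straggler.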
Negative:
Maintenance burden: New scipy versions may add distributions requiring evaluation for exclusion
Custom distributions require scipy rv_continuous interface knowledge
Neutral:
Default exclusions are based on empirical timing measurements (documented in code comments with approximate durations)