spark-bestfit¶

Modern distribution fitting library with pluggable backends (Spark, Ray, Local).

Automatically fit ~90 scipy.stats continuous distributions and 16 discrete distributions to your data using parallel processing. Supports Apache Spark for production clusters, Ray for ML workflows, or local execution for development.

Supported Versions:

Python 3.11 - 3.13
Apache Spark 3.5.x and 4.x
Ray 2.x (optional)
See Quick Start for the full compatibility matrix

Scope & Limitations¶

spark-bestfit is designed for batch processing of statistical distribution fitting.

What it does well:

Fit ~90 continuous and 16 discrete scipy.stats distributions in parallel
Multi-column fitting: fit multiple columns efficiently in a single operation
Provide robust goodness-of-fit metrics (KS, A-D, AIC, BIC, SSE)
Generate publication-ready visualizations (histograms, Q-Q plots, P-P plots)
Compute bootstrap confidence intervals for parameters
Scale to 100M+ rows with Spark or Ray backends

Known limitations:

No real-time/streaming support (batch processing only)
Parameters and metrics use 32-bit floats (~7 significant digits) for Spark serialization efficiency. Very small values (e.g., p-values < 1e-7) may lose precision.

Features

Reference

spark-bestfit¶

Scope & Limitations¶

Indices and tables¶