spark-bestfit

Modern distribution fitting library with pluggable backends (Spark, Ray, Local).

Automatically fit ~90 scipy.stats continuous distributions and 16 discrete distributions to your data using parallel processing. Supports Apache Spark for production clusters, Ray for ML workflows, or local execution for development.

Supported Versions:

  • Python 3.11 - 3.13

  • Apache Spark 3.5.x and 4.x

  • Ray 2.x (optional)

  • See Quick Start for the full compatibility matrix

Scope & Limitations

spark-bestfit is designed for batch processing of statistical distribution fitting.

What it does well:

  • Fit ~90 continuous and 16 discrete scipy.stats distributions in parallel

  • Multi-column fitting: fit multiple columns efficiently in a single operation

  • Provide robust goodness-of-fit metrics (KS, A-D, AIC, BIC, SSE)

  • Generate publication-ready visualizations (histograms, Q-Q plots, P-P plots)

  • Compute bootstrap confidence intervals for parameters

  • Scale to 100M+ rows with Spark or Ray backends

Known limitations:

  • No real-time/streaming support (batch processing only)

  • Parameters and metrics use 32-bit floats (~7 significant digits) for Spark serialization efficiency. Very small values (e.g., p-values < 1e-7) may lose precision.

Features

Indices and tables