Diagnostic Plots

Added in version 2.6.0.

After fitting a distribution, you can assess the quality of the fit using diagnostic plots. spark-bestfit provides a diagnostics() method on each fit result that draws a 2x2 panel of diagnostic plots.

Quick Start

Generate a complete diagnostic panel from a fitted distribution:

from spark_bestfit import DistributionFitter

# Fit distribution
fitter = DistributionFitter(spark)
results = fitter.fit(df, column="value")
best = results.best(n=1)[0]

# Get the sample data as numpy array
data = df.select("value").toPandas()["value"].values

# Generate diagnostic plots
fig, axes = best.diagnostics(data, title="Distribution Fit Diagnostics")

The result is a 2x2 matplotlib figure with four diagnostic plots:

+-------------------+-------------------+
|      Q-Q Plot     |      P-P Plot     |
+-------------------+-------------------+
| Residual Histogram|   CDF Comparison  |
+-------------------+-------------------+

Diagnostic Plot Types

Q-Q Plot (Quantile-Quantile)

The Q-Q plot compares sample quantiles against theoretical quantiles from the fitted distribution. Points falling along the diagonal reference line indicate a good fit.

  • Use for: Detecting deviations in the tails of the distribution

  • Good fit: Points closely follow the y=x line

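  • Heavy tails: Points bend away from the line at both extremes, because the sample's extreme quantiles are more spread out than the fitted distribution predicts

  • Light tails: Points bend the opposite way at the extremes, because the sample's extreme quantiles are less spread out than the fitted distribution predicts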
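
To make the construction concrete, here is a minimal sketch of how Q-Q points can be computed using only the Python standard library. It is an illustration, not spark-bestfit's implementation: the qq_points helper, the (i - 0.5) / n plotting positions, and the use of statistics.NormalDist as a stand-in for a fitted distribution are all assumptions made for this example.

```python
from statistics import NormalDist


def qq_points(sample, dist):
    """Return (theoretical, observed) quantile pairs for a Q-Q plot.

    `dist` is any object with an inv_cdf(p) method (hypothetical interface).
    """
    ordered = sorted(sample)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1),
    # so the inverse CDF is always defined.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    theoretical = [dist.inv_cdf(p) for p in probs]
    return theoretical, ordered


# Stand-in for a fitted distribution; the sample is drawn from it directly.
fitted = NormalDist(mu=0.0, sigma=1.0)
sample = fitted.samples(500, seed=42)

theo, obs = qq_points(sample, fitted)

# Largest distance between a point and the y = x reference line.
max_gap = max(abs(t - o) for t, o in zip(theo, obs))
```

Plotting obs against theo gives the Q-Q scatter. Because quantiles are in the units of the data, gaps in the tails are not compressed, which is why this plot is the one to check for tail behavior.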

P-P Plot (Probability-Probability)

The P-P plot compares the empirical cumulative distribution function (CDF) against the theoretical CDF. It is particularly sensitive to deviations in the center of the distribution.

  • Use for: Assessing fit quality in the center of the distribution

  • Good fit: Points closely follow the y=x line

  • Bounded axes: Always [0, 1] for probabilities
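
The same idea can be sketched for the P-P plot. The example below is a hedged illustration using only the standard library; pp_points and max_deviation are hypothetical helpers, and statistics.NormalDist stands in for a fitted distribution. It compares one sample against a matching distribution and against a deliberately shifted one.

```python
from statistics import NormalDist


def pp_points(sample, dist):
    """Return (theoretical, empirical) cumulative probabilities for a P-P plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # Empirical probability assigned to the i-th smallest observation.
    empirical = [(i - 0.5) / n for i in range(1, n + 1)]
    # Theoretical CDF evaluated at the same observations.
    theoretical = [dist.cdf(x) for x in ordered]
    return theoretical, empirical


def max_deviation(theoretical, empirical):
    """Largest distance between a point and the y = x line."""
    return max(abs(t - e) for t, e in zip(theoretical, empirical))


sample = NormalDist(0.0, 1.0).samples(500, seed=1)

# Same distribution the sample was drawn from: points hug the diagonal.
good = max_deviation(*pp_points(sample, NormalDist(0.0, 1.0)))

# Mean shifted by one standard deviation: points bow away from the diagonal.
bad = max_deviation(*pp_points(sample, NormalDist(1.0, 1.0)))
```

Both coordinates are probabilities, so every point lies in the unit square regardless of the scale of the data.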

Residual Histogram

Shows the distribution of residuals (observed density - expected density). A good fit should have residuals centered around zero.

  • Use for: Identifying systematic bias in the fit

  • Good fit: Histogram centered at zero, symmetric

  • Metrics shown: Mean and standard deviation of residuals
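
As a rough sketch of the residual calculation, the example below bins a sample into a density histogram and subtracts the fitted density at each bin center. It assumes residuals are evaluated at bin centers and summarized with a population standard deviation; spark-bestfit may differ on both points. The density_histogram helper is hypothetical, and statistics.NormalDist stands in for a fitted distribution.

```python
from statistics import NormalDist, fmean, pstdev


def density_histogram(sample, bins):
    """Return (bin_centers, densities) for an equal-width density histogram."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(sample)
    centers = [lo + (k + 0.5) * width for k in range(bins)]
    # Dividing by n * width makes the histogram integrate to 1.
    densities = [c / (n * width) for c in counts]
    return centers, densities


fitted = NormalDist(mu=0.0, sigma=1.0)
sample = fitted.samples(2000, seed=7)

x_hist, y_hist = density_histogram(sample, bins=50)

# Residual = observed density - expected density at each bin center.
residuals = [y - fitted.pdf(x) for x, y in zip(x_hist, y_hist)]

mean_resid = fmean(residuals)
std_resid = pstdev(residuals)
```

A mean near zero says the fit is not biased overall; the standard deviation reflects how noisy the bin-by-bin agreement is, which also depends on the sample size and the number of bins.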

CDF Comparison

Overlays the empirical step-function CDF on top of the smooth theoretical CDF. Visual alignment indicates goodness of fit.

  • Use for: Direct visual comparison of distributions

  • Good fit: Step function closely follows smooth curve

  • Shows: KS statistic and p-value when available
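
The KS statistic shown on this panel is the largest vertical gap between the two curves. A minimal standard-library sketch of that calculation follows; ks_statistic is a hypothetical helper written for this example, not a spark-bestfit function, and statistics.NormalDist stands in for a fitted distribution. The p-value is not computed here.

```python
from statistics import NormalDist


def ks_statistic(sample, dist):
    """Largest vertical gap between the empirical and theoretical CDFs."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x,
        # so the gap is checked on both sides of the step.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


sample = NormalDist(0.0, 1.0).samples(500, seed=3)

# Matching distribution: the step function tracks the smooth curve closely.
d_good = ks_statistic(sample, NormalDist(0.0, 1.0))

# Standard deviation off by a factor of three: a visible gap opens up.
d_bad = ks_statistic(sample, NormalDist(0.0, 3.0))
```

Checking both sides of each step matters because the empirical CDF is discontinuous: the largest gap can occur just before a jump rather than just after it.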

API: diagnostics() Method

The diagnostics() method is available on DistributionFitResult objects:

result.diagnostics(
    data,                      # Sample data (numpy array)
    y_hist=None,               # Optional: pre-computed histogram density
    x_hist=None,               # Optional: pre-computed histogram bin centers
    bins=50,                   # Number of histogram bins
    title="",                  # Overall figure title
    figsize=(14, 12),          # Figure size (width, height)
    dpi=100,                   # Dots per inch for saved figures
    title_fontsize=16,         # Main title font size
    subplot_title_fontsize=12, # Subplot title font size
    label_fontsize=10,         # Axis label font size
    grid_alpha=0.3,            # Grid transparency
    save_path=None,            # Optional path to save figure
    save_format="png",         # Save format (png, pdf, svg)
)

Returns a tuple of (figure, axes) where axes is a 2x2 numpy array of matplotlib Axes objects.

Example Usage

Basic Diagnostics

from spark_bestfit import DistributionFitter
import matplotlib.pyplot as plt

# Fit and get best distribution
fitter = DistributionFitter(spark)
results = fitter.fit(df, "value")
best = results.best(n=1)[0]

# Get data for plotting
data = df.select("value").toPandas()["value"].values

# Generate diagnostics
fig, axes = best.diagnostics(data)
plt.show()

With Pre-computed Histogram

import numpy as np

# Pre-compute histogram (useful when reusing across multiple plots)
y_hist, x_edges = np.histogram(data, bins=50, density=True)
x_hist = (x_edges[:-1] + x_edges[1:]) / 2

# Use pre-computed histogram
fig, axes = best.diagnostics(
    data,
    y_hist=y_hist,
    x_hist=x_hist,
    title="Fit Quality Assessment"
)

Saving to File

# Save as PNG
fig, axes = best.diagnostics(
    data,
    title="Model Diagnostics",
    save_path="diagnostics.png",
    dpi=300
)

# Save as PDF for publications
fig, axes = best.diagnostics(
    data,
    save_path="diagnostics.pdf",
    save_format="pdf"
)

Comparing Multiple Fits

import matplotlib.pyplot as plt

# Get top 3 distributions
top_3 = results.best(n=3)

# Create diagnostics for each
for i, result in enumerate(top_3):
    fig, axes = result.diagnostics(
        data,
        title=f"Rank {i+1}: {result.distribution}",
        save_path=f"diagnostics_{i+1}.png"
    )
    plt.close(fig)

Individual Plot Functions

For more control, individual plotting functions are available:

from spark_bestfit.plotting import (
    plot_qq,
    plot_pp,
    plot_residual_histogram,
    plot_cdf_comparison,
)

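# `result` is a DistributionFitResult, e.g. results.best(n=1)[0].
# `data` is the sample array; `y_hist` and `x_hist` are the histogram density
# and bin centers computed as in "With Pre-computed Histogram".

# Q-Q plot only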
fig, ax = plot_qq(result, data, title="Q-Q Plot")

# P-P plot only
fig, ax = plot_pp(result, data, title="P-P Plot")

# Residual histogram only
fig, ax = plot_residual_histogram(result, y_hist, x_hist)

# CDF comparison only
fig, ax = plot_cdf_comparison(result, data)

Each function accepts extensive customization parameters for colors, fonts, markers, and line styles. See the API reference for full details.

Interpreting Results

Good Fit Indicators

  • Q-Q/P-P plots: Points closely follow the diagonal line

  • Residual histogram: Centered at zero, symmetric, small standard deviation

  • CDF comparison: Empirical CDF closely tracks theoretical CDF

  • KS p-value: > 0.05 (only a rough guideline; with very large samples the KS test tends to reject even fits that look good in the plots)

Poor Fit Indicators

  • Q-Q plot: Systematic curvature, especially in tails

  • P-P plot: S-shaped deviation from diagonal

  • Residual histogram: Mean far from zero, skewed distribution

  • CDF comparison: Visible gaps between empirical and theoretical CDFs

API Reference

See spark_bestfit.results.DistributionFitResult.diagnostics() for full API documentation.