Diagnostics Plots
=================

.. versionadded:: 2.6.0

After fitting a distribution, you can assess the quality of the fit using
diagnostic plots. spark-bestfit provides a ``diagnostics()`` method that
creates a 2x2 panel of diagnostic visualizations.

Quick Start
-----------

Generate a complete diagnostic panel from a fitted distribution:

.. code-block:: python

    from spark_bestfit import DistributionFitter

    # Fit distribution
    fitter = DistributionFitter(spark)
    results = fitter.fit(df, column="value")
    best = results.best(n=1)[0]

    # Get the sample data as a NumPy array (collects the column to the driver)
    data = df.select("value").toPandas()["value"].values

    # Generate diagnostic plots
    fig, axes = best.diagnostics(data, title="Distribution Fit Diagnostics")

The result is a 2x2 matplotlib figure with four diagnostic plots:

.. code-block:: text

    +-------------------+-------------------+
    |     Q-Q Plot      |     P-P Plot      |
    +-------------------+-------------------+
    | Residual Histogram|  CDF Comparison   |
    +-------------------+-------------------+

Diagnostic Plot Types
---------------------

Q-Q Plot (Quantile-Quantile)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Q-Q plot compares sample quantiles against theoretical quantiles from the
fitted distribution. Points falling along the diagonal reference line indicate
a good fit.

- **Use for**: Detecting deviations in the tails of the distribution
- **Good fit**: Points closely follow the y=x line
- **Heavy tails**: Points bend away from the line at the extremes, with sample
  quantiles more extreme than the fitted distribution predicts
- **Light tails**: Points bend away in the opposite direction, with sample
  quantiles less extreme than the fitted distribution predicts

P-P Plot (Probability-Probability)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The P-P plot compares the empirical cumulative distribution function (CDF)
against the theoretical CDF. It is particularly sensitive to deviations in the
center of the distribution.

- **Use for**: Assessing fit quality in the center of the distribution
- **Good fit**: Points closely follow the y=x line
- **Bounded axes**: Always [0, 1] for probabilities

Residual Histogram
~~~~~~~~~~~~~~~~~~

Shows the distribution of residuals (observed density - expected density).
A good fit should have residuals centered around zero.

- **Use for**: Identifying systematic bias in the fit
- **Good fit**: Histogram centered at zero, symmetric
- **Metrics shown**: Mean and standard deviation of residuals

CDF Comparison
~~~~~~~~~~~~~~

Overlays the empirical step-function CDF on top of the smooth theoretical CDF.
Visual alignment indicates goodness of fit.

- **Use for**: Direct visual comparison of distributions
- **Good fit**: Step function closely follows smooth curve
- **Shows**: KS statistic and p-value when available

API: diagnostics() Method
-------------------------

The ``diagnostics()`` method is available on ``DistributionFitResult`` objects:

.. code-block:: python

    result.diagnostics(
        data,                       # Sample data (numpy array)
        y_hist=None,                # Optional: pre-computed histogram density
        x_hist=None,                # Optional: pre-computed histogram bin centers
        bins=50,                    # Number of histogram bins
        title="",                   # Overall figure title
        figsize=(14, 12),           # Figure size (width, height)
        dpi=100,                    # Dots per inch for saved figures
        title_fontsize=16,          # Main title font size
        subplot_title_fontsize=12,  # Subplot title font size
        label_fontsize=10,          # Axis label font size
        grid_alpha=0.3,             # Grid transparency
        save_path=None,             # Optional path to save figure
        save_format="png",          # Save format (png, pdf, svg)
    )

Returns a tuple of ``(figure, axes)`` where ``axes`` is a 2x2 numpy array of
matplotlib Axes objects.

Example Usage
-------------

Basic Diagnostics
~~~~~~~~~~~~~~~~~

.. code-block:: python

    from spark_bestfit import DistributionFitter
    import matplotlib.pyplot as plt

    # Fit and get best distribution
    fitter = DistributionFitter(spark)
    results = fitter.fit(df, "value")
    best = results.best(n=1)[0]

    # Get data for plotting
    data = df.select("value").toPandas()["value"].values

    # Generate diagnostics
    fig, axes = best.diagnostics(data)
    plt.show()

With Pre-computed Histogram
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import numpy as np

    # Pre-compute histogram (useful when reusing across multiple plots)
    y_hist, x_edges = np.histogram(data, bins=50, density=True)
    x_hist = (x_edges[:-1] + x_edges[1:]) / 2  # Bin centers from bin edges

    # Use pre-computed histogram
    fig, axes = best.diagnostics(
        data,
        y_hist=y_hist,
        x_hist=x_hist,
        title="Fit Quality Assessment"
    )

Saving to File
~~~~~~~~~~~~~~

.. code-block:: python

    # Save as PNG
    fig, axes = best.diagnostics(
        data,
        title="Model Diagnostics",
        save_path="diagnostics.png",
        dpi=300
    )

    # Save as PDF for publications
    fig, axes = best.diagnostics(
        data,
        save_path="diagnostics.pdf",
        save_format="pdf"
    )

Comparing Multiple Fits
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import matplotlib.pyplot as plt

    # Get top 3 distributions
    top_3 = results.best(n=3)

    # Create diagnostics for each
    for i, result in enumerate(top_3):
        fig, axes = result.diagnostics(
            data,
            title=f"Rank {i+1}: {result.distribution}",
            save_path=f"diagnostics_{i+1}.png"
        )
        plt.close(fig)  # Release the figure once it has been saved

Individual Plot Functions
-------------------------

For more control, individual plotting functions are available:

.. code-block:: python

    from spark_bestfit.plotting import (
        plot_qq,
        plot_pp,
        plot_residual_histogram,
        plot_cdf_comparison,
    )

    # Q-Q plot only
    fig, ax = plot_qq(result, data, title="Q-Q Plot")

    # P-P plot only
    fig, ax = plot_pp(result, data, title="P-P Plot")

    # Residual histogram only
    fig, ax = plot_residual_histogram(result, y_hist, x_hist)

    # CDF comparison only
    fig, ax = plot_cdf_comparison(result, data)

Each function accepts extensive customization parameters for colors, fonts,
markers, and line styles.
See the API reference for full details.

Interpreting Results
--------------------

Good Fit Indicators
~~~~~~~~~~~~~~~~~~~

- Q-Q/P-P plots: Points closely follow the diagonal line
- Residual histogram: Centered at zero, symmetric, small standard deviation
- CDF comparison: Empirical CDF closely tracks theoretical CDF
- KS p-value: > 0.05 (though this is only a rough guideline)

Poor Fit Indicators
~~~~~~~~~~~~~~~~~~~

- Q-Q plot: Systematic curvature, especially in tails
- P-P plot: S-shaped deviation from diagonal
- Residual histogram: Mean far from zero, skewed distribution
- CDF comparison: Visible gaps between empirical and theoretical CDFs

API Reference
-------------

See :meth:`spark_bestfit.results.DistributionFitResult.diagnostics` for full
API documentation.

See also:

- :func:`spark_bestfit.plotting.plot_qq`
- :func:`spark_bestfit.plotting.plot_pp`
- :func:`spark_bestfit.plotting.plot_residual_histogram`
- :func:`spark_bestfit.plotting.plot_cdf_comparison`
- :func:`spark_bestfit.plotting.plot_diagnostics`
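
Worked Sketch: Q-Q and P-P Coordinates
--------------------------------------

To make the Q-Q, P-P, and KS descriptions on this page concrete, the sketch
below computes the underlying coordinates by hand with NumPy and SciPy. It is
not spark-bestfit's implementation: the synthetic gamma sample, the direct
``scipy.stats`` fit, the ``(i - 0.5) / n`` plotting positions, and all variable
names are assumptions chosen for illustration, and the library may use
different conventions.

.. code-block:: python

    import numpy as np
    from scipy import stats

    # Synthetic sample (assumption: stands in for the collected Spark column)
    rng = np.random.default_rng(0)
    sample = rng.gamma(shape=2.0, scale=1.5, size=500)

    # Fit with SciPy directly (assumption: stands in for a fitted result)
    params = stats.gamma.fit(sample, floc=0)
    fitted = stats.gamma(*params)

    n = sample.size
    sorted_sample = np.sort(sample)

    # Plotting positions (i - 0.5) / n; one common convention among several
    probs = (np.arange(1, n + 1) - 0.5) / n

    # Q-Q: theoretical quantiles paired with the sorted sample values
    theoretical_q = fitted.ppf(probs)

    # P-P: theoretical CDF at each sorted sample value, paired with probs
    theoretical_p = fitted.cdf(sorted_sample)

    # Rough numeric summaries of how closely each plot hugs the diagonal
    qq_corr = np.corrcoef(theoretical_q, sorted_sample)[0, 1]
    pp_max_gap = np.max(np.abs(theoretical_p - probs))

    # KS statistic: largest gap between empirical and theoretical CDFs
    ks_stat, ks_pvalue = stats.kstest(sample, fitted.cdf)

Plotting ``theoretical_q`` against ``sorted_sample`` gives a Q-Q plot, and
plotting ``theoretical_p`` against ``probs`` gives a P-P plot. Because the
P-P coordinates are probabilities, they stay within [0, 1], whereas Q-Q
coordinates are in the units of the data, which is why the Q-Q plot exposes
tail behavior more clearly.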