FAQ & Troubleshooting
Frequently asked questions and troubleshooting tips for spark-bestfit.
Installation Issues
Q: ModuleNotFoundError: No module named 'pyspark'
PySpark is an optional dependency. Install it with:
pip install spark-bestfit[spark]
Or use the local backend which doesn’t require Spark:
from spark_bestfit.backends import BackendFactory
backend = BackendFactory.create("local", max_workers=4)
Q: ImportError with Ray
Ray is also optional. Install it with:
pip install spark-bestfit[ray]
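When code has to run in environments where Spark or Ray may not be installed, one option is to probe for the optional packages before choosing a backend. The sketch below uses only the standard library. `is_installed` and `available_backend_name` are hypothetical helpers, not part of spark-bestfit, and the preference order (Spark, then Ray, then local) is an assumption.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return find_spec(module_name) is not None


def available_backend_name() -> str:
    """Pick a backend name based on which optional packages are installed.

    Hypothetical helper: the preference order is an assumption.
    """
    if is_installed("pyspark"):
        return "spark"
    if is_installed("ray"):
        return "ray"
    return "local"  # the local backend needs neither Spark nor Ray


print(available_backend_name())
```

The returned name can then be passed to BackendFactory.create, keeping in mind that the "spark" backend also expects a spark_session argument.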
Fitting Issues
Q: Why do I get a “heavy-tail characteristics” warning?
This warning appears when your data has high kurtosis, suggesting heavy-tailed distributions (like Pareto, Cauchy, or Student’s t) may fit better than standard distributions.
Solutions:
Use Maximum Spacing Estimation (MSE) for robust fitting:
results = fitter.fit(df, column="value", estimation_method="mse")
Filter to heavy-tail specific distributions:
results = fitter.fit(
    df,
    column="value",
    included_distributions=["pareto", "cauchy", "t", "levy", "burr"],
)
Transform your data (log, sqrt) to reduce tail effects
See Heavy-Tail Detection for detailed guidance.
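To see what "high kurtosis" means in practice, excess kurtosis can be computed by hand. This is a minimal standard-library sketch. `excess_kurtosis` is a hypothetical helper, and the exact statistic and threshold that spark-bestfit uses to raise the warning are not stated on this page, so treat it only as an illustration of the concept.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2**2 - 3.0


# Light tails: every value sits at one of two points.
light = [-1.0, 1.0, -1.0, 1.0]
# Heavy tails: mostly zeros plus two extreme values.
heavy = [0.0] * 98 + [10.0, -10.0]

print(excess_kurtosis(light))  # negative: lighter tails than a normal
print(excess_kurtosis(heavy))  # large and positive: heavy tails
```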
Q: My fits have poor p-values (< 0.05)
Low p-values indicate the data may not follow the fitted distribution well. Consider:
Check data quality: Remove outliers or invalid values
Try more distributions: Use included_distributions=None to test all ~90 distributions
Use bounded fitting: If your data has natural bounds (e.g., positive values):
results = fitter.fit(df, column="value", lower_bound=0)
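Check sample size: With very large samples, goodness-of-fit tests have enough statistical power to reject fits that are close but not exact, so a low p-value alone does not mean the fit is unusable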
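The sample-size effect can be made concrete with the asymptotic critical value of the one-sample Kolmogorov-Smirnov test, D_crit ≈ sqrt(-ln(alpha / 2) / 2) / sqrt(n). The sketch below is a standard-library illustration, not spark-bestfit code. `ks_critical_value` is a hypothetical helper, and the approximation assumes a fully specified reference distribution and a large n.

```python
import math


def ks_critical_value(n: int, alpha: float = 0.05) -> float:
    """Asymptotic critical value for the one-sample K-S statistic."""
    return math.sqrt(-math.log(alpha / 2.0) / 2.0) / math.sqrt(n)


for n in (100, 10_000, 1_000_000):
    # A K-S statistic above this value corresponds to p < alpha.
    print(f"n={n:>9,}: reject if D > {ks_critical_value(n):.4f}")
```

At a million rows the threshold is roughly 0.0014, so a fitted CDF that deviates from the empirical CDF by a fraction of a percent is already rejected, even though such a fit may be adequate for practical purposes. Comparing the K-S statistic itself across candidate distributions is often more informative than the p-value at that scale.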
Q: Fitting is slow
Several strategies to improve performance:
Use prefiltering to skip unlikely distributions:
results = fitter.fit(df, column="value", prefilter=True)
Reduce distribution count:
# Only fit common distributions
common = ["norm", "gamma", "lognorm", "expon", "weibull_min"]
results = fitter.fit(df, column="value", included_distributions=common)
Use appropriate backend for your data size:
< 1M rows: Local backend
1M-100M rows: Ray backend
> 100M rows: Spark backend
Skip expensive metrics with lazy evaluation:
from spark_bestfit import DistributionFitter, FitterConfig
config = FitterConfig().skip_ks_test(True).skip_ad_test(True)
fitter = DistributionFitter(spark, config=config)
See Performance & Scaling for benchmarks and tuning advice.
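The row-count guidance above can be encoded as a small helper. `backend_name_for` is hypothetical (not part of spark-bestfit), and the thresholds are the rules of thumb from this page rather than hard limits.

```python
def backend_name_for(n_rows: int) -> str:
    """Map a row count to a backend name using the rough thresholds above."""
    if n_rows < 1_000_000:
        return "local"
    if n_rows <= 100_000_000:
        return "ray"
    return "spark"


for rows in (50_000, 5_000_000, 500_000_000):
    print(rows, backend_name_for(rows))
```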
Sampling Issues
Q: Copula sampling is slow
The bottleneck is usually the marginal distribution transforms (PPF/inverse CDF).
Use return_uniform=True if you only need correlation structure:
# 20x faster - returns uniform [0,1] samples
samples = copula.sample(n=1_000_000, return_uniform=True)
Use common distributions that have fast PPF implementations: norm, expon, uniform, lognorm, weibull_min, gamma, beta
Use distributed sampling for large sample counts:
backend = BackendFactory.create("spark", spark_session=spark)
samples_df = copula.sample_distributed(n=100_000_000, backend=backend)
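To illustrate the step that return_uniform=True skips, the sketch below takes uniform values and pushes each one through the inverse CDF (PPF) of a marginal distribution. It uses only the standard library and is not spark-bestfit's implementation. `expon_ppf`, the chosen marginals, and their parameters are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist


def expon_ppf(u: float, rate: float = 1.0) -> float:
    """Inverse CDF of an exponential distribution with the given rate."""
    return -math.log(1.0 - u) / rate


rng = random.Random(0)
# Stand-in for the uniform [0,1] values a copula would produce.
uniform_samples = [rng.random() for _ in range(5)]

normal_marginal = NormalDist(mu=10.0, sigma=2.0)
as_normal = [normal_marginal.inv_cdf(u) for u in uniform_samples]
as_expon = [expon_ppf(u, rate=0.5) for u in uniform_samples]

print(as_normal)
print(as_expon)
```

Both marginals here have cheap inverse CDFs. For distributions without a closed-form PPF, this step generally has to be done by numerical inversion, which is likely where most of the sampling time goes.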
Q: Samples don’t match my original data distribution
Verify your fit quality before sampling:
# Check goodness-of-fit metrics
best = results.best(n=1)[0]
print(f"K-S statistic: {best.ks_statistic}")
print(f"p-value: {best.pvalue}")
# Visual inspection
best.diagnostics() # Shows Q-Q, P-P, histogram, and CDF plots
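For intuition about what the K-S statistic reports, it is the largest gap between the empirical CDF of the data and the CDF of the fitted distribution. A minimal standard-library sketch follows, with `ks_statistic` as a hypothetical helper rather than the library's implementation, and with made-up data and parameters.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest absolute gap between the empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    # Gap just after each step of the empirical CDF ...
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    # ... and just before it.
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)


data = [9.1, 10.4, 9.8, 11.2, 10.0, 8.7, 10.9, 9.5]  # made-up values
fitted = NormalDist(mu=10.0, sigma=1.0)  # pretend this came from a fit
print(ks_statistic(data, fitted.cdf))
```

Values near 0 mean the two CDFs nearly coincide; values approaching 1 mean the fitted distribution places its mass somewhere quite different from the data.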
Memory Issues
Q: OutOfMemoryError when fitting large data
Don’t collect to driver - Use distributed operations:
# Good - stays distributed
results = fitter.fit(df, column="value")
# Bad - collects all data to driver
pandas_df = df.toPandas()
Use lazy metrics to defer computation:
config = FitterConfig().skip_ks_test(True).skip_ad_test(True)
Increase Spark driver memory:
spark = SparkSession.builder \
    .config("spark.driver.memory", "8g") \
    .getOrCreate()
Q: Memory issues with copula sampling
For very large sample counts, use distributed sampling instead of local:
# May OOM for n > 10M
samples = copula.sample(n=100_000_000)
# Better - distributed across cluster
backend = BackendFactory.create("spark", spark_session=spark)
samples_df = copula.sample_distributed(n=100_000_000, backend=backend)
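A back-of-the-envelope estimate can help decide when to switch to distributed sampling. The sketch assumes samples are held as dense 64-bit floats and uses a hypothetical 3-column copula. `sample_bytes` is not part of spark-bestfit, and real memory use is likely to be higher because of intermediate arrays.

```python
def sample_bytes(n_samples: int, n_columns: int, bytes_per_value: int = 8) -> int:
    """Rough size of a dense float64 sample array, ignoring container overhead."""
    return n_samples * n_columns * bytes_per_value


for n in (10_000_000, 100_000_000):
    gib = sample_bytes(n, n_columns=3) / 2**30
    print(f"{n:,} samples x 3 columns ~ {gib:.2f} GiB")
```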
Serialization Issues
Q: SerializationError when loading a saved model
Common causes:
Missing required fields - Ensure JSON has distribution and parameters
Unknown distribution - The distribution name must exist in scipy.stats
Version mismatch - Check spark_bestfit_version in the JSON file
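The first cause can be checked before attempting to load the model. The sketch below uses only the standard library and assumes that distribution and parameters are top-level keys of the JSON document. `missing_fields` is a hypothetical helper, and the example payloads are made up, so the real file layout may differ.

```python
import json

REQUIRED_FIELDS = ("distribution", "parameters")  # field names mentioned above


def missing_fields(model_json: str) -> list:
    """Return the required top-level fields absent from a serialized model."""
    payload = json.loads(model_json)
    return [name for name in REQUIRED_FIELDS if name not in payload]


good = '{"distribution": "norm", "parameters": [0.0, 1.0]}'
bad = '{"distribution": "norm"}'
print(missing_fields(good))  # []
print(missing_fields(bad))   # ['parameters']
```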
Q: Can I load models saved with an older version?
Yes, spark-bestfit maintains backward compatibility. The schema_version field tracks the serialization format. Models saved with v1.x should load in v2.x.
Backend Issues
Q: How do I switch backends?
Use the BackendFactory for backend-agnostic code:
from spark_bestfit.backends import BackendFactory
# Development/testing
backend = BackendFactory.create("local", max_workers=4)
# Production with Spark
backend = BackendFactory.create("spark", spark_session=spark)
# ML workflows with Ray
backend = BackendFactory.create("ray")
Q: Spark jobs hang or fail silently
Check Spark UI (usually http://localhost:4040) for job status
Increase executor memory for large data:
spark = SparkSession.builder \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
Ensure Spark version compatibility (3.5.x or 4.x)
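The version requirement can be checked programmatically. In a live session the version string is available as spark.version (or pyspark.__version__). The helper below, `is_supported_spark_version`, is a hypothetical sketch that simply encodes the 3.5.x / 4.x rule stated above.

```python
def is_supported_spark_version(version: str) -> bool:
    """True for 3.5.x and 4.x version strings, per the compatibility note above."""
    parts = version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) == (3, 5) or major == 4


for v in ("3.4.1", "3.5.3", "4.0.0"):
    print(v, is_supported_spark_version(v))
```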
Getting Help
If your issue isn’t covered here:
Check the API Reference documentation for method signatures and parameters
Review Migrating to v2.0 for breaking changes between versions
Open an issue at https://github.com/dwsmith1983/spark-bestfit/issues