FAQ & Troubleshooting

Frequently asked questions and troubleshooting tips for spark-bestfit.

Installation Issues

Q: ModuleNotFoundError: No module named 'pyspark'

PySpark is an optional dependency. Install it with:

pip install spark-bestfit[spark]

Or use the local backend, which doesn’t require Spark:

from spark_bestfit.backends import BackendFactory

backend = BackendFactory.create("local", max_workers=4)

Q: ImportError with Ray

Ray is also optional. Install it with:

pip install spark-bestfit[ray]

Fitting Issues

Q: Why do I get a “heavy-tail characteristics” warning?

This warning appears when your data has high kurtosis, which suggests that heavy-tailed distributions (such as Pareto, Cauchy, or Student’s t) may fit better than lighter-tailed ones such as the normal.

Solutions:

Use Maximum Spacing Estimation (MSE) for robust fitting:

results = fitter.fit(df, column="value", estimation_method="mse")

Filter to heavy-tail specific distributions:

results = fitter.fit(
    df,
    column="value",
    included_distributions=["pareto", "cauchy", "t", "levy", "burr"]
)

Transform your data (log, sqrt) to reduce tail effects
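
To illustrate why a transform can help, here is a minimal sketch that measures excess kurtosis before and after a log transform. It uses only the Python standard library and synthetic lognormal data; `excess_kurtosis` is a hypothetical helper written for this example, not a spark-bestfit function, and the library's own heavy-tail check may use a different formula or threshold.

```python
import math
import random
import statistics


def excess_kurtosis(values):
    """Population excess kurtosis: about 0 for normal data, larger for heavier tails."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0


rng = random.Random(42)

# Synthetic lognormal data (sigma = 1) has a heavy upper tail.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(20_000)]

# The log of lognormal data is normally distributed, so the tail effect disappears.
logged = [math.log(x) for x in raw]

print(f"raw data:        {excess_kurtosis(raw):.2f}")
print(f"log-transformed: {excess_kurtosis(logged):.2f}")
```

A log transform only applies to strictly positive data, and parameters fitted on the transformed values describe the transformed scale, so results have to be mapped back before use.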

See Heavy-Tail Detection for detailed guidance.

Q: My fits have poor p-values (< 0.05)

Low p-values indicate the data may not follow the fitted distribution well. Consider:

Check data quality: Remove outliers or invalid values
Try more distributions: Use included_distributions=None to test all ~90 distributions
Use bounded fitting: If your data has natural bounds (e.g., positive values):
```
results = fitter.fit(df, column="value", lower_bound=0)
```
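Check sample size: With very large samples, goodness-of-fit tests have enough statistical power to reject even a good fit because of tiny deviations, so a low p-value alone does not mean the fit is poor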
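
The sample-size effect can be made concrete with the standard asymptotic critical value for the one-sample K-S test, which shrinks in proportion to 1/sqrt(n). The sketch below is a rough illustration in plain Python, not something spark-bestfit exposes: `ks_critical_value` is a hypothetical helper, and the formula assumes a fully specified distribution and a large sample, so thresholds differ when parameters are estimated from the same data.

```python
import math


def ks_critical_value(n, alpha=0.05):
    """Asymptotic one-sample K-S critical value; the test rejects when D exceeds it."""
    # c(alpha) = sqrt(-ln(alpha / 2) / 2), which is about 1.358 for alpha = 0.05
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)


for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}: reject when D > {ks_critical_value(n):.4f}")
```

At a million rows the threshold is roughly a hundred times tighter than at a hundred rows, so a maximum CDF gap of well under one percent is already enough to reject. Comparing the K-S statistic itself across candidate distributions is often more informative than the p-value at that scale.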

Q: Fitting is slow

Several strategies to improve performance:

Use prefiltering to skip unlikely distributions:

results = fitter.fit(df, column="value", prefilter=True)

Reduce distribution count:

# Only fit common distributions
common = ["norm", "gamma", "lognorm", "expon", "weibull_min"]
results = fitter.fit(df, column="value", included_distributions=common)

Use an appropriate backend for your data size:
- < 1M rows: Local backend
- 1M-100M rows: Ray backend
- > 100M rows: Spark backend
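
The size guidance above can be captured in a small helper. This is a sketch only: `suggest_backend` is a hypothetical function, not part of spark-bestfit, and it assigns the boundary values of exactly 1M and 100M rows to Ray, following the 1M-100M range.

```python
def suggest_backend(row_count):
    """Map a row count to the backend name suggested by the size guidance."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if row_count < 1_000_000:
        return "local"
    if row_count <= 100_000_000:
        return "ray"
    return "spark"


for rows in (50_000, 25_000_000, 2_000_000_000):
    print(f"{rows:>13,} rows -> {suggest_backend(rows)}")
```

The returned name matches the first argument accepted by `BackendFactory.create`; backend-specific arguments such as `max_workers` or `spark_session` still have to be supplied separately. Treat the thresholds as a starting point, since the best choice also depends on cluster availability.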

Skip expensive metrics with lazy evaluation:

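from spark_bestfit import DistributionFitter, FitterConfig

# Skip the K-S and A-D tests so they are not computed during fitting
config = FitterConfig().skip_ks_test(True).skip_ad_test(True)
fitter = DistributionFitter(spark, config=config)  # spark: an existing SparkSession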

See Performance & Scaling for benchmarks and tuning advice.

Sampling Issues

Q: Copula sampling is slow

The bottleneck is usually the marginal distribution transforms (PPF/inverse CDF).

Use return_uniform=True if you only need correlation structure:

# 20x faster - returns uniform [0,1] samples
samples = copula.sample(n=1_000_000, return_uniform=True)

Use common distributions that have fast PPF implementations: norm, expon, uniform, lognorm, weibull_min, gamma, beta

Use distributed sampling for large sample counts:

backend = BackendFactory.create("spark", spark_session=spark)
samples_df = copula.sample_distributed(n=100_000_000, backend=backend)
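
To show where the time goes, the following sketch separates the two stages of sampling from a Gaussian copula: generating correlated uniforms, then pushing each one through a marginal PPF. It is a standard-library illustration under assumed settings (two dimensions, correlation 0.8, exponential marginals), and the function names are hypothetical; it is not how spark-bestfit implements its copula.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gaussian_copula_uniforms(n, rho, rng):
    """Stage 1: draw n pairs in (0, 1) whose dependence follows a Gaussian copula."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Mix in z1 so that corr(z1, z2) equals rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        pairs.append((STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)))
    return pairs


def expon_ppf(u, rate):
    """Stage 2: inverse CDF of an exponential marginal (closed form, so cheap)."""
    return -math.log1p(-u) / rate


rng = random.Random(0)
uniforms = gaussian_copula_uniforms(10_000, rho=0.8, rng=rng)

# This per-value transform is the step that return_uniform=True leaves out.
samples = [(expon_ppf(u1, 1.0), expon_ppf(u2, 0.5)) for u1, u2 in uniforms]
print(samples[:3])
```

The exponential PPF has a closed form, which is why it is fast. Distributions whose PPF must be found numerically pay that cost once per sampled value, which is where the slowdown comes from.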

Q: Samples don’t match my original data distribution

Verify your fit quality before sampling:

# Check goodness-of-fit metrics
best = results.best(n=1)[0]
print(f"K-S statistic: {best.ks_statistic}")
print(f"p-value: {best.pvalue}")

# Visual inspection
best.diagnostics()  # Shows Q-Q, P-P, histogram, and CDF plots
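
For intuition about what the K-S statistic measures, here is a minimal standard-library sketch that computes the largest gap between the empirical CDF of some values and a candidate CDF. `ks_statistic` and `expon_cdf` are hypothetical helpers for this example; the library's own metric may differ in detail.

```python
import math
import random


def ks_statistic(data, cdf):
    """Largest absolute gap between the empirical CDF of data and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def expon_cdf(x, rate=1.0):
    """CDF of an exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


rng = random.Random(1)
draws = [rng.expovariate(1.0) for _ in range(5_000)]

print(f"correct model: {ks_statistic(draws, expon_cdf):.4f}")
print(f"wrong rate:    {ks_statistic(draws, lambda x: expon_cdf(x, rate=3.0)):.4f}")
```

Running the same check on generated samples against the fitted CDF is a quick way to tell a sampling problem apart from a poor fit: if the samples match the fitted distribution but not the original data, the fit is the issue.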

Memory Issues

Q: OutOfMemoryError when fitting large data

Don’t collect to the driver. Use distributed operations:

# Good - stays distributed
results = fitter.fit(df, column="value")

# Bad - collects all data to driver
pandas_df = df.toPandas()

Use lazy metrics to defer computation:

config = FitterConfig().skip_ks_test(True).skip_ad_test(True)

Increase Spark driver memory:

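from pyspark.sql import SparkSession

# Driver memory is fixed when the driver JVM starts, so set it before any
# session exists; getOrCreate() returns a running session unchanged.
spark = SparkSession.builder \
    .config("spark.driver.memory", "8g") \
    .getOrCreate()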

Q: Memory issues with copula sampling

For very large sample counts, use distributed sampling instead of local:

# May OOM for n > 10M
samples = copula.sample(n=100_000_000)

# Better - distributed across cluster
backend = BackendFactory.create("spark", spark_session=spark)
samples_df = copula.sample_distributed(n=100_000_000, backend=backend)
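
A back-of-the-envelope estimate helps decide when local sampling stops being practical. The sketch below assumes each sampled value is stored as an 8-byte float and uses a hypothetical three-column copula; `sample_memory_gib` is an illustrative helper, and real peak usage will be higher because of intermediate arrays and DataFrame overhead.

```python
def sample_memory_gib(n_samples, n_columns, bytes_per_value=8):
    """Lower-bound memory for the raw sample values, in GiB."""
    return n_samples * n_columns * bytes_per_value / 2**30


for n in (10_000_000, 100_000_000):
    print(f"n={n:>11,}: at least {sample_memory_gib(n, n_columns=3):.2f} GiB")
```

If the estimate approaches the memory available to the driver or local process, distributed sampling is the safer choice.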

Serialization Issues

Q: SerializationError when loading a saved model

Common causes:

Missing required fields - Ensure the JSON contains both the distribution and parameters fields
Unknown distribution - The distribution name must exist in scipy.stats
Version mismatch - Check spark_bestfit_version in the JSON file
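
Before loading, a saved model can be inspected with a quick pre-flight check. The sketch below uses only the `json` module and tests for the two required fields named above; `check_model_json` is a hypothetical helper, and the example payload (including the list form of `parameters`) is made up for illustration rather than taken from the library's actual schema.

```python
import json

REQUIRED_FIELDS = ("distribution", "parameters")


def check_model_json(text):
    """Return a list of problems found in a serialized model (empty if none)."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]
    return [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in payload
    ]


# Illustrative payloads only; the real file may contain more fields.
good = '{"distribution": "norm", "parameters": [0.0, 1.0]}'
bad = '{"distribution": "norm"}'

print(check_model_json(good))
print(check_model_json(bad))
```

This check does not verify that the distribution name exists in scipy.stats or that the versions match, so those still need to be confirmed separately.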

Q: Can I load models saved with an older version?

Yes, spark-bestfit maintains backward compatibility. The schema_version field tracks the serialization format. Models saved with v1.x should load in v2.x.

Backend Issues

Q: How do I switch backends?

Use the BackendFactory for backend-agnostic code:

from spark_bestfit.backends import BackendFactory

# Development/testing
backend = BackendFactory.create("local", max_workers=4)

# Production with Spark
backend = BackendFactory.create("spark", spark_session=spark)

# ML workflows with Ray
backend = BackendFactory.create("ray")

Q: Spark jobs hang or fail silently

Check the Spark UI (usually http://localhost:4040) for job status

Increase executor memory for large data:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

Ensure Spark version compatibility (3.5.x or 4.x)
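
A version check can be scripted so that an incompatible installation fails early. The sketch below parses a version string and accepts 3.5.x or 4.x, per the compatibility note above; `is_supported_spark_version` is a hypothetical helper, not a spark-bestfit function.

```python
def is_supported_spark_version(version):
    """Return True for Spark 3.5.x or 4.x version strings, False otherwise."""
    parts = version.split(".")
    try:
        major, minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return False  # not a recognizable major.minor version
    return (major, minor) == (3, 5) or major == 4


for candidate in ("3.4.2", "3.5.1", "4.0.0"):
    print(candidate, is_supported_spark_version(candidate))
```

In a real environment the string would typically come from `pyspark.__version__` or from `spark.version` on an active session.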

Getting Help

If your issue isn’t covered here:

Check the API Reference documentation for method signatures and parameters
Review Migrating to v2.0 for breaking changes between versions
Open an issue at https://github.com/dwsmith1983/spark-bestfit/issues