FAQ & Troubleshooting

Frequently asked questions and troubleshooting tips for spark-bestfit.

Installation Issues

Q: ModuleNotFoundError: No module named 'pyspark'

PySpark is an optional dependency. Install it with:

pip install spark-bestfit[spark]

Or use the local backend, which doesn’t require Spark:

from spark_bestfit.backends import BackendFactory

backend = BackendFactory.create("local", max_workers=4)

Q: ImportError with Ray

Ray is also optional. Install it with:

pip install spark-bestfit[ray]
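
When the same script has to run in environments with different extras installed, it can help to detect the optional dependencies before choosing a backend. The following is a minimal sketch using only the standard library. `installed_backends` is a hypothetical helper, not part of spark-bestfit, and it assumes the importable module names are `pyspark` and `ray`:

```python
from importlib.util import find_spec


def installed_backends():
    """Return backend names whose optional dependency can be imported.

    Hypothetical helper, not part of spark-bestfit. The local backend is
    always listed because it needs no extra package.
    """
    names = ["local"]
    # find_spec returns None when a top-level module is not installed.
    if find_spec("pyspark") is not None:
        names.append("spark")
    if find_spec("ray") is not None:
        names.append("ray")
    return names


print(installed_backends())
```

The returned names match the strings passed to BackendFactory.create elsewhere on this page.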

Fitting Issues

Q: Why do I get a “heavy-tail characteristics” warning?

This warning appears when your data has high kurtosis, suggesting heavy-tailed distributions (like Pareto, Cauchy, or Student’s t) may fit better than standard distributions.

Solutions:

  1. Use Maximum Spacing Estimation (MSE) for robust fitting:

    results = fitter.fit(df, column="value", estimation_method="mse")
    
  2. Filter to heavy-tail specific distributions:

    results = fitter.fit(
        df,
        column="value",
        included_distributions=["pareto", "cauchy", "t", "levy", "burr"]
    )
    
  3. Transform your data (log, sqrt) to reduce tail effects

See Heavy-Tail Detection for detailed guidance.
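
To see what "high kurtosis" means in practice, and why a log transform can help, the sketch below computes excess kurtosis by hand for synthetic log-normal data before and after taking logs. It is an illustration only: the threshold spark-bestfit uses for its warning is not stated here, and `excess_kurtosis` is a hypothetical helper rather than a library function.

```python
import math
import random


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2**2 - 3.0


rng = random.Random(42)
# Log-normal data is a simple stand-in for a heavy right tail.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(20_000)]
logged = [math.log(v) for v in raw]

print(f"raw:    {excess_kurtosis(raw):.2f}")     # large and positive
print(f"logged: {excess_kurtosis(logged):.2f}")  # close to 0
```

A log transform only applies to strictly positive data; for data with zeros or negative values, a shift or a different transform would be needed.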

Q: My fits have poor p-values (< 0.05)

Low p-values indicate the data may not follow the fitted distribution well. Consider:

  1. Check data quality: Remove outliers or invalid values

  2. Try more distributions: Use included_distributions=None to test all ~90 distributions

  3. Use bounded fitting: If your data has natural bounds (e.g., positive values):

    results = fitter.fit(df, column="value", lower_bound=0)
    
  4. Check sample size: With very large samples, goodness-of-fit tests have enough statistical power to reject a fit over deviations too small to matter in practice
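
The sample-size effect can be made concrete with the standard asymptotic critical value of the one-sample K-S test. The sketch below holds the K-S distance fixed at a small value and shows how the verdict flips as n grows. It assumes a fully specified reference distribution; when parameters are estimated from the same data, the exact critical values differ, so treat the numbers as indicative only.

```python
import math


def ks_critical_value(n, alpha=0.05):
    """Asymptotic one-sample K-S critical value: sqrt(-ln(alpha / 2) / 2) / sqrt(n)."""
    return math.sqrt(-math.log(alpha / 2.0) / 2.0) / math.sqrt(n)


# A fixed, small gap between the empirical and fitted CDFs.
observed_d = 0.01

for n in (1_000, 100_000, 10_000_000):
    crit = ks_critical_value(n)
    verdict = "reject" if observed_d > crit else "do not reject"
    print(f"n={n:>10,}  critical D={crit:.5f}  -> {verdict}")
```

For large datasets it is therefore often more informative to compare the K-S statistic itself across candidate distributions than to rely on the p-value alone.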

Q: Fitting is slow

Several strategies can improve performance:

  1. Use prefiltering to skip unlikely distributions:

    results = fitter.fit(df, column="value", prefilter=True)
    
  2. Reduce distribution count:

    # Only fit common distributions
    common = ["norm", "gamma", "lognorm", "expon", "weibull_min"]
    results = fitter.fit(df, column="value", included_distributions=common)
    
  3. Use an appropriate backend for your data size:

    • < 1M rows: Local backend

    • 1M-100M rows: Ray backend

    • > 100M rows: Spark backend

  4. Skip expensive metrics with lazy evaluation:

    from spark_bestfit import DistributionFitter, FitterConfig
    
    config = FitterConfig().skip_ks_test(True).skip_ad_test(True)
    fitter = DistributionFitter(spark, config=config)
    

See Performance & Scaling for benchmarks and tuning advice.
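
The size bands for backend selection can be encoded as a small helper when the row count is known up front. This is a sketch under the assumption that the bands above are applied literally; `suggest_backend` is a hypothetical name, not a spark-bestfit function.

```python
def suggest_backend(n_rows):
    """Map a row count to a backend name using the size bands listed above.

    Hypothetical helper, not part of spark-bestfit; the bands are rough
    guidance rather than hard limits.
    """
    if n_rows < 1_000_000:
        return "local"
    if n_rows <= 100_000_000:
        return "ray"
    return "spark"


for rows in (50_000, 5_000_000, 500_000_000):
    print(f"{rows:>11,} rows -> {suggest_backend(rows)}")
```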

Sampling Issues

Q: Copula sampling is slow

The bottleneck is usually the marginal distribution transforms (PPF/inverse CDF).

  1. Use return_uniform=True if you only need the correlation structure:

    # 20x faster - returns uniform [0,1] samples
    samples = copula.sample(n=1_000_000, return_uniform=True)
    
  2. Use common distributions that have fast PPF implementations: norm, expon, uniform, lognorm, weibull_min, gamma, beta

  3. Use distributed sampling for large sample counts:

    from spark_bestfit.backends import BackendFactory

    backend = BackendFactory.create("spark", spark_session=spark)
    samples_df = copula.sample_distributed(n=100_000_000, backend=backend)
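
To illustrate why the marginal transforms dominate, here is a conceptual sketch of a generic two-variable Gaussian copula in pure Python. It is not spark-bestfit's implementation; the function names and the exponential marginal are assumptions chosen for illustration. The uniform pairs already carry the dependence structure, and mapping them to data scale costs one extra inverse-CDF (PPF) call per value.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gaussian_copula_uniform(n, rho, seed=0):
    """Draw n pairs (u1, u2) from a bivariate Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Mix in an independent draw so that corr(z1, z2) == rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        pairs.append((STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)))
    return pairs


def exponential_ppf(u, rate=1.0):
    """Inverse CDF of an exponential marginal; u is clamped just below 1."""
    return -math.log(1.0 - min(u, 1.0 - 1e-12)) / rate


uniform_pairs = gaussian_copula_uniform(n=5, rho=0.8)
# The extra per-value PPF call below is the step that uniform output skips.
data_pairs = [(exponential_ppf(u1), exponential_ppf(u2)) for u1, u2 in uniform_pairs]
print(uniform_pairs[0], data_pairs[0])
```

The exponential PPF here has a closed form; distributions whose PPF requires numerical root-finding make this step far more expensive, which is why the choice of marginals matters.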
    

Q: Samples don’t match my original data distribution

Verify your fit quality before sampling:

# Check goodness-of-fit metrics
best = results.best(n=1)[0]
print(f"K-S statistic: {best.ks_statistic}")
print(f"p-value: {best.pvalue}")

# Visual inspection
best.diagnostics()  # Shows Q-Q, P-P, histogram, and CDF plots
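
To make the K-S statistic concrete, the sketch below computes it by hand as the largest gap between the empirical CDF of a sample and a candidate CDF. It uses only the standard library and is meant as an illustration of the metric, not as the computation spark-bestfit performs; `ks_statistic` is a hypothetical helper.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """One-sample K-S statistic: the largest gap between the empirical CDF and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
print(f"D vs N(0, 1): {ks_statistic(sample, NormalDist(0, 1).cdf):.3f}")
print(f"D vs N(5, 1): {ks_statistic(sample, NormalDist(5, 1).cdf):.3f}")  # near 1: poor fit
```

A statistic near 0 means the fitted CDF tracks the data closely; samples drawn from a fit with a large statistic will not resemble the original data.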

Memory Issues

Q: OutOfMemoryError when fitting large data

  1. Don’t collect to the driver - Use distributed operations:

    # Good - stays distributed
    results = fitter.fit(df, column="value")
    
    # Bad - collects all data to driver
    pandas_df = df.toPandas()
    
  2. Use lazy metrics to defer computation:

    from spark_bestfit import FitterConfig
    config = FitterConfig().skip_ks_test(True).skip_ad_test(True)
    
  3. Increase Spark driver memory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.driver.memory", "8g") \
        .getOrCreate()
    

Q: Memory issues with copula sampling

For very large sample counts, use distributed sampling instead of local sampling:

# May OOM for n > 10M
samples = copula.sample(n=100_000_000)

# Better - distributed across cluster
from spark_bestfit.backends import BackendFactory
backend = BackendFactory.create("spark", spark_session=spark)
samples_df = copula.sample_distributed(n=100_000_000, backend=backend)
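
A quick back-of-the-envelope estimate shows why local sampling runs out of memory at this scale. The sketch assumes samples are held as 64-bit floats and ignores any container overhead; the column count is a made-up example.

```python
def sample_bytes(n_samples, n_columns, bytes_per_value=8):
    """Raw size of an n_samples x n_columns block of 64-bit floats."""
    return n_samples * n_columns * bytes_per_value


# Hypothetical copula with 5 marginals, sampled 100 million times.
size = sample_bytes(100_000_000, 5)
print(f"{size / 1e9:.1f} GB")  # 4.0 GB before any container overhead
```

Intermediate arrays created during sampling can multiply this figure, so the practical limit on a single machine is lower than the raw estimate suggests.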

Serialization Issues

Q: SerializationError when loading a saved model

Common causes:

  1. Missing required fields - Ensure the JSON has distribution and parameters fields

  2. Unknown distribution - The distribution name must exist in scipy.stats

  3. Version mismatch - Check spark_bestfit_version in the JSON file
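
Before loading, a saved file can be inspected for the first and third causes with a few lines of standard-library code. This is a hypothetical pre-flight check, not part of spark-bestfit: it only looks at the field names mentioned above, and the example payload is made up for illustration.

```python
import json

REQUIRED_FIELDS = ("distribution", "parameters")


def check_model_json(text):
    """Return a list of problems found in a serialized model; empty if none.

    Hypothetical pre-flight check, not part of spark-bestfit. It only looks
    at the field names mentioned in this FAQ.
    """
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    if "spark_bestfit_version" not in payload:
        problems.append("no spark_bestfit_version recorded")
    return problems


# Made-up payload for illustration; a real file may carry more fields.
print(check_model_json('{"distribution": "gamma", "spark_bestfit_version": "2.0.0"}'))
# ['missing field: parameters']
```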

Q: Can I load models saved with an older version?

Yes, spark-bestfit maintains backward compatibility. The schema_version field tracks the serialization format. Models saved with v1.x should load in v2.x.

Backend Issues

Q: How do I switch backends?

Use the BackendFactory for backend-agnostic code:

from spark_bestfit.backends import BackendFactory

# Development/testing
backend = BackendFactory.create("local", max_workers=4)

# Production with Spark
backend = BackendFactory.create("spark", spark_session=spark)

# ML workflows with Ray
backend = BackendFactory.create("ray")

Q: Spark jobs hang or fail silently

  1. Check the Spark UI (usually http://localhost:4040) for job status

  2. Increase executor memory for large data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.executor.memory", "4g") \
        .getOrCreate()
    
  3. Ensure Spark version compatibility (3.5.x or 4.x)
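
The version requirement can be checked programmatically before starting a job. The sketch below parses a version string and tests it against the 3.5.x / 4.x range named above; `is_supported_spark` is a hypothetical helper, not a spark-bestfit function.

```python
def is_supported_spark(version):
    """True for 3.5.x and 4.x version strings, the range named above."""
    parts = version.split(".")
    try:
        major, minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return False
    return (major, minor) == (3, 5) or major == 4


for v in ("3.4.2", "3.5.1", "4.0.0"):
    print(v, is_supported_spark(v))
```

With PySpark installed, `pyspark.__version__` supplies the string to check.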

Getting Help

If your issue isn’t covered here:

  1. Check the API Reference documentation for method signatures and parameters

  2. Review Migrating to v2.0 for breaking changes between versions

  3. Open an issue at https://github.com/dwsmith1983/spark-bestfit/issues