FAQ & Troubleshooting
=====================

Frequently asked questions and troubleshooting tips for spark-bestfit.

Installation Issues
-------------------

**Q: ModuleNotFoundError: No module named 'pyspark'**

PySpark is an optional dependency. Install it with:

.. code-block:: bash

   pip install spark-bestfit[spark]

Or use the local backend, which doesn't require Spark:

.. code-block:: python

   from spark_bestfit.backends import BackendFactory

   backend = BackendFactory.create("local", max_workers=4)

**Q: ImportError with Ray**

Ray is also optional. Install it with:

.. code-block:: bash

   pip install spark-bestfit[ray]

Fitting Issues
--------------

**Q: Why do I get a "heavy-tail characteristics" warning?**

This warning appears when your data has high kurtosis, which suggests that
heavy-tailed distributions (such as Pareto, Cauchy, or Student's t) may fit
better than standard distributions.

Solutions:

1. Use Maximum Spacing Estimation (MSE) for robust fitting:

   .. code-block:: python

      results = fitter.fit(df, column="value", estimation_method="mse")

2. Restrict the fit to heavy-tailed distributions:

   .. code-block:: python

      results = fitter.fit(
          df,
          column="value",
          included_distributions=["pareto", "cauchy", "t", "levy", "burr"]
      )

3. Transform your data (log, sqrt) to reduce tail effects

See :doc:`features/heavy-tail` for detailed guidance.

**Q: My fits have poor p-values (< 0.05)**

Low p-values indicate the data may not follow the fitted distribution well.
Consider:

1. **Check data quality**: Remove outliers or invalid values
2. **Try more distributions**: Use ``included_distributions=None`` to test
   all ~90 distributions
3. **Use bounded fitting**: If your data has natural bounds (e.g., positive
   values):

   .. code-block:: python

      results = fitter.fit(df, column="value", lower_bound=0)

4. **Check sample size**: With very large samples, goodness-of-fit tests have
   enough statistical power to reject a good fit over practically negligible
   deviations

**Q: Fitting is slow**

Several strategies can improve performance:

1. **Use prefiltering** to skip unlikely distributions:

   .. code-block:: python

      results = fitter.fit(df, column="value", prefilter=True)

2. **Reduce distribution count**:

   .. code-block:: python

      # Only fit common distributions
      common = ["norm", "gamma", "lognorm", "expon", "weibull_min"]
      results = fitter.fit(df, column="value", included_distributions=common)

3. **Use appropriate backend** for your data size:

   - < 1M rows: Local backend
   - 1M-100M rows: Ray backend
   - > 100M rows: Spark backend

4. **Skip expensive metrics** with lazy evaluation:

   .. code-block:: python

      from spark_bestfit import DistributionFitter, FitterConfig

      # `spark` is an existing SparkSession
      config = FitterConfig().skip_ks_test(True).skip_ad_test(True)
      fitter = DistributionFitter(spark, config=config)

See :doc:`performance` for benchmarks and tuning advice.

Sampling Issues
---------------

**Q: Copula sampling is slow**

The bottleneck is usually the marginal distribution transforms (PPF/inverse
CDF).

1. **Use return_uniform=True** if you only need correlation structure:

   .. code-block:: python

      # 20x faster - returns uniform [0,1] samples
      samples = copula.sample(n=1_000_000, return_uniform=True)

2. **Use common distributions** that have fast PPF implementations: norm,
   expon, uniform, lognorm, weibull_min, gamma, beta
3. **Use distributed sampling** for large sample counts:

   .. code-block:: python

      backend = BackendFactory.create("spark", spark_session=spark)
      samples_df = copula.sample_distributed(n=100_000_000, backend=backend)

**Q: Samples don't match my original data distribution**

Verify your fit quality before sampling:

.. code-block:: python

   # Check goodness-of-fit metrics
   best = results.best(n=1)[0]
   print(f"K-S statistic: {best.ks_statistic}")
   print(f"p-value: {best.pvalue}")

   # Visual inspection
   best.diagnostics()  # Shows Q-Q, P-P, histogram, and CDF plots

Memory Issues
-------------

**Q: OutOfMemoryError when fitting large data**

1. **Don't collect to driver** - Use distributed operations:

   .. code-block:: python

      # Good - stays distributed
      results = fitter.fit(df, column="value")

      # Bad - collects all data to driver
      pandas_df = df.toPandas()

2. **Use lazy metrics** to defer computation:

   .. code-block:: python

      config = FitterConfig().skip_ks_test(True).skip_ad_test(True)

3. **Increase Spark driver memory**:

   .. code-block:: python

      from pyspark.sql import SparkSession

      # Must be set before the session (and its driver JVM) is created
      spark = SparkSession.builder \
          .config("spark.driver.memory", "8g") \
          .getOrCreate()

**Q: Memory issues with copula sampling**

For very large sample counts, use distributed sampling instead of local
sampling:

.. code-block:: python

   # May OOM for n > 10M
   samples = copula.sample(n=100_000_000)

   # Better - distributed across cluster
   backend = BackendFactory.create("spark", spark_session=spark)
   samples_df = copula.sample_distributed(n=100_000_000, backend=backend)

Serialization Issues
--------------------

**Q: SerializationError when loading a saved model**

Common causes:

1. **Missing required fields** - Ensure the JSON has ``distribution`` and
   ``parameters``
2. **Unknown distribution** - The distribution name must exist in scipy.stats
3. **Version mismatch** - Check ``spark_bestfit_version`` in the JSON file

**Q: Can I load models saved with an older version?**

Yes, spark-bestfit maintains backward compatibility. The ``schema_version``
field tracks the serialization format. Models saved with v1.x should load in
v2.x.

Backend Issues
--------------

**Q: How do I switch backends?**

Use the ``BackendFactory`` for backend-agnostic code:

.. code-block:: python

   from spark_bestfit.backends import BackendFactory

   # Development/testing
   backend = BackendFactory.create("local", max_workers=4)

   # Production with Spark
   backend = BackendFactory.create("spark", spark_session=spark)

   # ML workflows with Ray
   backend = BackendFactory.create("ray")

**Q: Spark jobs hang or fail silently**

1. Check the Spark UI (usually http://localhost:4040) for job status
2. Increase executor memory for large data:

   .. code-block:: python

      from pyspark.sql import SparkSession

      spark = SparkSession.builder \
          .config("spark.executor.memory", "4g") \
          .getOrCreate()

3. Ensure Spark version compatibility (3.5.x or 4.x)

Getting Help
------------

If your issue isn't covered here:

1. Check the :doc:`api` documentation for method signatures and parameters
2. Review :doc:`migration` for breaking changes between versions
3. Open an issue at https://github.com/dwsmith1983/spark-bestfit/issues
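
When opening an issue, it usually helps to state which versions of the
relevant packages are installed. The following is a minimal sketch that is not
part of spark-bestfit and uses only the Python standard library. The helper
names ``collect_versions`` and ``format_report`` are hypothetical, and the
package list is an assumption based on the dependencies mentioned in this FAQ:

.. code-block:: python

   import platform
   import sys
   from importlib import metadata

   # Assumed package list: the dependencies named in this FAQ
   DEFAULT_PACKAGES = ("spark-bestfit", "pyspark", "ray", "scipy")


   def collect_versions(packages=DEFAULT_PACKAGES):
       """Map each package name to its installed version, or None if absent."""
       versions = {}
       for name in packages:
           try:
               versions[name] = metadata.version(name)
           except metadata.PackageNotFoundError:
               # Optional dependencies (pyspark, ray) may not be installed
               versions[name] = None
       return versions


   def format_report(versions):
       """Render the versions as plain text suitable for pasting into an issue."""
       lines = [f"Python: {platform.python_version()} ({sys.platform})"]
       for name, version in versions.items():
           lines.append(f"{name}: {version or 'not installed'}")
       return "\n".join(lines)


   if __name__ == "__main__":
       print(format_report(collect_versions()))

Pasting the printed report into the issue, along with the full traceback,
shows at a glance whether an optional backend is missing or a version falls
outside the supported range.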