Distributed Sampling ==================== After fitting a distribution, you can generate samples using distributed computing capabilities. This is particularly useful when you need to generate millions of samples efficiently. Basic Usage ----------- Generate distributed samples from a fitted distribution using any backend: .. code-block:: python from spark_bestfit import DistributionFitter from spark_bestfit.backends import BackendFactory from spark_bestfit.sampling import sample_distributed # Fit distribution fitter = DistributionFitter(spark) results = fitter.fit(df, column="value") best = results.best(n=1)[0] # Generate 1 million distributed samples backend = BackendFactory.create("spark", spark_session=spark) samples_df = sample_distributed( distribution=best.distribution, parameters=best.parameters, n=1_000_000, backend=backend, ) samples_df.show(5) The result is a DataFrame that can be used for further processing: .. code-block:: text +-------------------+ | sample| +-------------------+ | 0.4691122931291924| |-0.2828633018445851| | 1.0093545783546243| | 0.582873245234523| | -1.23234234234234 | +-------------------+ Backend Options --------------- Use any backend for distributed sampling: .. code-block:: python from spark_bestfit.backends import BackendFactory from spark_bestfit.sampling import sample_distributed # Spark backend = BackendFactory.create("spark", spark_session=spark) samples_df = sample_distributed(best.distribution, best.parameters, n=1_000_000, backend=backend) # Ray backend = BackendFactory.create("ray") samples_df = sample_distributed(best.distribution, best.parameters, n=1_000_000, backend=backend) # Local (for testing) backend = BackendFactory.create("local", max_workers=4) samples_df = sample_distributed(best.distribution, best.parameters, n=1_000_000, backend=backend) Reproducibility --------------- Use the ``random_seed`` parameter for reproducible results: .. code-block:: python # Reproducible sampling samples1 = sample_distributed( best.distribution, best.parameters, n=10000, backend=backend, random_seed=42 ) samples2 = sample_distributed( best.distribution, best.parameters, n=10000, backend=backend, random_seed=42 ) # samples1 and samples2 will contain the same values Each partition receives a unique seed derived from the base seed plus the partition ID, ensuring both reproducibility and statistical independence across partitions. Partition Control ----------------- You can control the number of partitions for parallel sampling: .. code-block:: python # Use 16 partitions for sampling samples_df = sample_distributed( distribution=best.distribution, parameters=best.parameters, n=1_000_000, backend=backend, num_partitions=16, random_seed=42, ) If not specified, the default parallelism for the backend is used. Custom Column Names ------------------- Specify a custom column name for the output: .. code-block:: python samples_df = sample_distributed( distribution=best.distribution, parameters=best.parameters, n=10000, backend=backend, column_name="generated_values" ) # DataFrame has column "generated_values" instead of "sample" Local vs Distributed Sampling ----------------------------- spark-bestfit offers two sampling methods: .. list-table:: :header-rows: 1 * - Method - Use Case - Output * - ``sample(size=N)`` - Small to medium samples (< 10M) - NumPy array * - ``sample_distributed(n=N, backend=...)`` - Large samples (> 10M) - DataFrame (Spark/pandas) Performance Characteristics ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Benchmark results on local mode (your mileage may vary on a cluster): .. list-table:: :header-rows: 1 * - N Samples - Local (ms) - Spark (ms) - Winner * - 1,000 - 0.3 - 336 - Local * - 1,000,000 - 16 - 57 - Local * - 10,000,000 - 149 - 125 - **Spark** * - 50,000,000 - 777 - 481 - **Spark** **Key takeaways:** - **Crossover point**: ~10 million samples in local mode - **Spark overhead**: ~300ms baseline cost for job setup - **Cluster advantage**: On a multi-node cluster, the crossover point is lower due to true parallelism across workers - **Memory distribution**: Even when local is faster, distributed sampling distributes memory across the cluster, enabling sample sizes that wouldn't fit on a single node Removed: sample_spark() ----------------------- .. versionchanged:: 3.0.1 The ``sample_spark()`` method was removed in v3.0.1 (deprecated since v2.2.0). Use ``sample_distributed()`` with an explicit backend instead. If you are migrating from code that used ``sample_spark()``: .. code-block:: python # Removed (was deprecated since v2.2.0) # samples_df = best.sample_spark(n=1_000_000, spark=spark) # Use instead from spark_bestfit.backends import BackendFactory from spark_bestfit.sampling import sample_distributed backend = BackendFactory.create("spark", spark_session=spark) samples_df = sample_distributed( distribution=best.distribution, parameters=best.parameters, n=1_000_000, backend=backend, ) API Reference ------------- See :func:`spark_bestfit.sampling.sample_distributed` for full API documentation.