Bounded Distribution Fitting ============================ spark-bestfit supports fitting distributions with explicit bounds. This is useful for data that has natural constraints like percentages (0-100), ages (0+), prices (0+), or any domain-specific limits. Basic Usage ----------- Use the ``bounded`` parameter to enable automatic bound detection from your data: .. code-block:: python from spark_bestfit import DistributionFitter fitter = DistributionFitter(spark) # Auto-detect bounds from data min/max results = fitter.fit(df, column="percentage", bounded=True) # Get best fit - samples will respect the bounds best = results.best(n=1)[0] samples = best.sample(1000) # All samples within [data_min, data_max] Explicit Bounds --------------- For precise control, specify bounds explicitly: .. code-block:: python # Both bounds explicit results = fitter.fit( df, column="percentage", bounded=True, lower_bound=0.0, upper_bound=100.0, ) # Only lower bound (e.g., prices must be non-negative) results = fitter.fit( df, column="price", bounded=True, lower_bound=0.0, ) # Only upper bound results = fitter.fit( df, column="score", bounded=True, upper_bound=1.0, ) **Using FitterConfig (v2.2+):** .. code-block:: python from spark_bestfit import FitterConfigBuilder # Create reusable bounded config config = (FitterConfigBuilder() .with_bounds(lower=0.0, upper=100.0) .build()) results = fitter.fit(df, column="percentage", config=config) .. note:: When only one bound is specified and ``bounded=True``, the other bound is auto-detected from the data. Use ``-inf`` or ``inf`` to explicitly disable a bound while keeping the other explicit. How It Works ------------ Bounded fitting uses a two-step process: 1. **Fit the unbounded distribution**: Standard MLE fitting is performed on the data to estimate distribution parameters. 2. **Truncate the distribution**: The fitted distribution is truncated to the specified bounds using CDF inversion. This ensures: - PDF integrates to 1 over the bounded domain - Samples are always within bounds - All statistical methods (pdf, cdf, ppf, sample) respect bounds The truncation uses the formula: .. code-block:: text ppf_truncated(u) = ppf_original(cdf_lb + u * (cdf_ub - cdf_lb)) where: cdf_lb = CDF at lower bound cdf_ub = CDF at upper bound u ~ Uniform(0, 1) Working with Bounded Results ---------------------------- The ``DistributionFitResult`` object tracks bounds and applies them automatically: .. code-block:: python best = results.best(n=1)[0] # Check bounds print(f"Lower bound: {best.lower_bound}") # e.g., 0.0 print(f"Upper bound: {best.upper_bound}") # e.g., 100.0 # All methods respect bounds automatically samples = best.sample(1000) # Samples within bounds pdf_vals = best.pdf(x_values) # Normalized PDF cdf_vals = best.cdf(x_values) # CDF: 0 below lb, 1 above ub quantiles = best.ppf([0.25, 0.5, 0.75]) # Quantiles within bounds # Get scipy distribution (already truncated) dist = best.get_scipy_dist() dist.rvs(size=100) # Also respects bounds Serialization ------------- Bounds are preserved when saving and loading results: .. code-block:: python # Save best result with bounds best = results.best(n=1)[0] best.save("model.json") # Load - bounds are restored from spark_bestfit.results import DistributionFitResult loaded = DistributionFitResult.load("model.json") print(loaded.lower_bound, loaded.upper_bound) # Bounds preserved Multi-Column Bounded Fitting ---------------------------- You can specify **different bounds per column** using dictionaries: .. code-block:: python # Different bounds for each column results = fitter.fit( df, columns=["percentage", "price", "age"], bounded=True, lower_bound={"percentage": 0.0, "price": 0.0, "age": 0.0}, upper_bound={"percentage": 100.0, "price": 10000.0, "age": 120.0}, ) # Each column has its own bounds pct_result = results.for_column("percentage").best(n=1)[0] print(pct_result.lower_bound, pct_result.upper_bound) # 0.0, 100.0 price_result = results.for_column("price").best(n=1)[0] print(price_result.lower_bound, price_result.upper_bound) # 0.0, 10000.0 **Partial dictionaries** are supported - unspecified columns auto-detect from data: .. code-block:: python # Only specify bounds for some columns results = fitter.fit( df, columns=["col_a", "col_b", "col_c"], bounded=True, lower_bound={"col_a": 0.0}, # Only col_a has explicit lower bound upper_bound={"col_b": 100.0}, # Only col_b has explicit upper bound ) # col_c auto-detects both bounds from data **Scalar bounds** apply to all columns (backward compatible): .. code-block:: python # Same bounds for all columns results = fitter.fit( df, columns=["col_a", "col_b", "col_c"], bounded=True, lower_bound=0.0, # Applied to all columns upper_bound=1.0, # Applied to all columns ) Use Cases --------- **Percentages and Proportions (0-100 or 0-1)** .. code-block:: python results = fitter.fit( df, column="conversion_rate", bounded=True, lower_bound=0.0, upper_bound=1.0 ) **Non-Negative Values (prices, counts, durations)** .. code-block:: python results = fitter.fit( df, column="price", bounded=True, lower_bound=0.0 ) **Age Data** .. code-block:: python results = fitter.fit( df, column="age", bounded=True, lower_bound=0.0, upper_bound=120.0 ) **Score Ranges** .. code-block:: python results = fitter.fit( df, column="credit_score", bounded=True, lower_bound=300.0, upper_bound=850.0 ) Discrete Distributions ---------------------- Bounded fitting is also supported for discrete distributions: .. code-block:: python from spark_bestfit import DiscreteDistributionFitter # Auto-detect bounds fitter = DiscreteDistributionFitter(spark) results = fitter.fit(df, column="count", bounded=True) # Explicit bounds results = fitter.fit( df, column="count", bounded=True, lower_bound=0, upper_bound=100, ) best = results.best(n=1, metric="aic")[0] print(best.lower_bound, best.upper_bound) .. note:: For discrete distributions, bounds are stored with the fit result but sampling uses the underlying scipy distribution. The bounds serve as metadata for the valid range of the fitted distribution. Performance Considerations -------------------------- Bounded fitting adds minimal overhead: - Fitting time is unchanged (bounds are applied post-fit) - Sampling is ~10% slower due to CDF inversion transform - PDF/CDF/PPF evaluation has negligible overhead For very large sample generation, the overhead of truncation is small compared to the random number generation itself.