Bounded Distribution Fitting
spark-bestfit supports fitting distributions with explicit bounds. This is useful for data that has natural constraints like percentages (0-100), ages (0+), prices (0+), or any domain-specific limits.
Basic Usage
Use the bounded parameter to enable automatic bound detection from your data:
from spark_bestfit import DistributionFitter
fitter = DistributionFitter(spark)
# Auto-detect bounds from data min/max
results = fitter.fit(df, column="percentage", bounded=True)
# Get best fit - samples will respect the bounds
best = results.best(n=1)[0]
samples = best.sample(1000) # All samples within [data_min, data_max]
Explicit Bounds
For precise control, specify bounds explicitly:
# Both bounds explicit
results = fitter.fit(
    df,
    column="percentage",
    bounded=True,
    lower_bound=0.0,
    upper_bound=100.0,
)
# Only the lower bound is explicit (e.g., prices must be non-negative);
# the upper bound is auto-detected from the data
results = fitter.fit(
    df,
    column="price",
    bounded=True,
    lower_bound=0.0,
)
# Only the upper bound is explicit; the lower bound is auto-detected from the data
results = fitter.fit(
    df,
    column="score",
    bounded=True,
    upper_bound=1.0,
)
Using FitterConfig (v2.2+):
from spark_bestfit import FitterConfigBuilder
# Create reusable bounded config
config = (
    FitterConfigBuilder()
    .with_bounds(lower=0.0, upper=100.0)
    .build()
)
results = fitter.fit(df, column="percentage", config=config)
Note
When bounded=True and only one bound is specified, the other bound is auto-detected from the data. To leave one side unbounded while keeping the other explicit, pass -inf or inf for that side.
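The resolution rule described in the note can be sketched in plain Python. The helper name `resolve_bounds` and its signature are hypothetical, introduced only for illustration; the library's internal logic may differ.

```python
import math
from typing import Optional, Sequence, Tuple


def resolve_bounds(
    values: Sequence[float],
    lower_bound: Optional[float] = None,
    upper_bound: Optional[float] = None,
) -> Tuple[float, float]:
    """Hypothetical helper: fill in any unspecified bound from the data."""
    # An unspecified bound (None) falls back to the observed min or max
    lower = min(values) if lower_bound is None else lower_bound
    upper = max(values) if upper_bound is None else upper_bound
    if lower >= upper:
        raise ValueError(f"lower bound {lower} must be below upper bound {upper}")
    return lower, upper


data = [12.5, 40.0, 87.3, 99.1]
auto = resolve_bounds(data)                          # both bounds from the data
lower_only = resolve_bounds(data, lower_bound=0.0)   # upper bound from the data
open_above = resolve_bounds(data, lower_bound=0.0, upper_bound=math.inf)
```

Passing an infinite value is treated as an explicit choice, so it is not overwritten by the data maximum.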
How It Works
Bounded fitting uses a two-step process:
1. Fit the unbounded distribution: standard maximum likelihood estimation (MLE) is performed on the data to estimate the distribution parameters.
2. Truncate the distribution: the fitted distribution is truncated to the specified bounds using CDF inversion. This ensures that:
   - The PDF integrates to 1 over the bounded domain
   - Samples are always within the bounds
   - All statistical methods (pdf, cdf, ppf, sample) respect the bounds
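For reference, these guarantees follow from the standard definition of a truncated distribution. Writing F and f for the CDF and PDF of the fitted (unbounded) distribution, and a and b for the lower and upper bounds (notation introduced here for illustration, not taken from the library), the truncated CDF and PDF on [a, b] are:

```latex
F_T(x) = \frac{F(x) - F(a)}{F(b) - F(a)},
\qquad
f_T(x) = \frac{f(x)}{F(b) - F(a)},
\qquad a \le x \le b,

\int_a^b f_T(x)\,dx = \frac{F(b) - F(a)}{F(b) - F(a)} = 1 .
```

Outside [a, b] the truncated PDF is 0, and the truncated CDF is 0 below a and 1 above b.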
The truncation uses the formula:
ppf_truncated(u) = ppf_original(cdf_lb + u * (cdf_ub - cdf_lb))
where:
cdf_lb = CDF at lower bound
cdf_ub = CDF at upper bound
u ~ Uniform(0, 1)
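The formula above can be tried out with only the Python standard library. This is a minimal sketch that uses `statistics.NormalDist` as a stand-in for a fitted distribution; the function name `truncated_ppf` is hypothetical and this is not the library's implementation.

```python
import random
from statistics import NormalDist


def truncated_ppf(dist: NormalDist, lower: float, upper: float, u: float) -> float:
    """Map u in [0, 1) to a quantile of dist truncated to [lower, upper]."""
    cdf_lb = dist.cdf(lower)
    cdf_ub = dist.cdf(upper)
    # Rescale u into the slice of probability mass that lies between the bounds
    p = cdf_lb + u * (cdf_ub - cdf_lb)
    # Clamp to guard against floating-point rounding at the edges
    return min(max(dist.inv_cdf(p), lower), upper)


fitted = NormalDist(mu=50.0, sigma=30.0)  # stand-in for a fitted distribution
rng = random.Random(0)
samples = [truncated_ppf(fitted, 0.0, 100.0, rng.random()) for _ in range(1000)]
median = truncated_ppf(fitted, 0.0, 100.0, 0.5)
```

Because every uniform draw is mapped into the probability mass between the bounds, no draw is rejected, which is why sampling cost stays close to that of the unbounded distribution.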
Working with Bounded Results
The DistributionFitResult object tracks bounds and applies them automatically:
best = results.best(n=1)[0]
# Check bounds
print(f"Lower bound: {best.lower_bound}") # e.g., 0.0
print(f"Upper bound: {best.upper_bound}") # e.g., 100.0
# All methods respect bounds automatically
samples = best.sample(1000) # Samples within bounds
x_values = np.linspace(0.0, 100.0, 201) # Example grid (assumes: import numpy as np)
pdf_vals = best.pdf(x_values) # Normalized PDF
cdf_vals = best.cdf(x_values) # CDF: 0 below lb, 1 above ub
quantiles = best.ppf([0.25, 0.5, 0.75]) # Quantiles within bounds
# Get scipy distribution (already truncated)
dist = best.get_scipy_dist()
dist.rvs(size=100) # Also respects bounds
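To make the comments "Normalized PDF" and "CDF: 0 below lb, 1 above ub" concrete, here is a small standard-library sketch of a truncated distribution. The class `TruncatedNormal` is illustrative only and is not how DistributionFitResult is implemented.

```python
from statistics import NormalDist


class TruncatedNormal:
    """Illustrative wrapper around NormalDist; not the library's implementation."""

    def __init__(self, dist: NormalDist, lower: float, upper: float):
        self.dist, self.lower, self.upper = dist, lower, upper
        self.cdf_lb = dist.cdf(lower)
        # Probability mass of the unbounded distribution inside the bounds
        self.mass = dist.cdf(upper) - self.cdf_lb

    def pdf(self, x: float) -> float:
        if x < self.lower or x > self.upper:
            return 0.0
        return self.dist.pdf(x) / self.mass  # renormalize so the area is 1

    def cdf(self, x: float) -> float:
        if x <= self.lower:
            return 0.0
        if x >= self.upper:
            return 1.0
        return (self.dist.cdf(x) - self.cdf_lb) / self.mass


trunc = TruncatedNormal(NormalDist(50.0, 30.0), 0.0, 100.0)

# Midpoint-rule check that the truncated PDF integrates to 1 over [0, 100]
n = 10_000
width = 100.0 / n
area = sum(trunc.pdf((i + 0.5) * width) for i in range(n)) * width
```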
Serialization
Bounds are preserved when saving and loading results:
# Save best result with bounds
best = results.best(n=1)[0]
best.save("model.json")
# Load - bounds are restored
from spark_bestfit.results import DistributionFitResult
loaded = DistributionFitResult.load("model.json")
print(loaded.lower_bound, loaded.upper_bound) # Bounds preserved
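The on-disk format written by save is not described here. As a rough sketch of how bounds can survive a JSON round trip, the example below uses a hypothetical schema and hypothetical helpers (`dump_result`, `load_result`); it encodes infinite bounds as null because standard JSON has no representation for infinity.

```python
import json
import math


def dump_result(name, params, lower_bound, upper_bound):
    """Serialize a fit summary (hypothetical schema); infinite bounds become null."""
    def encode(bound):
        return None if math.isinf(bound) else bound

    return json.dumps({
        "distribution": name,
        "parameters": params,
        "lower_bound": encode(lower_bound),
        "upper_bound": encode(upper_bound),
    })


def load_result(text):
    """Restore the summary, turning null bounds back into -inf / inf."""
    payload = json.loads(text)
    if payload["lower_bound"] is None:
        payload["lower_bound"] = -math.inf
    if payload["upper_bound"] is None:
        payload["upper_bound"] = math.inf
    return payload


text = dump_result("gamma", [2.0, 0.0, 1.5], 0.0, math.inf)
restored = load_result(text)
```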
Multi-Column Bounded Fitting
You can specify different bounds per column using dictionaries:
# Different bounds for each column
results = fitter.fit(
    df,
    columns=["percentage", "price", "age"],
    bounded=True,
    lower_bound={"percentage": 0.0, "price": 0.0, "age": 0.0},
    upper_bound={"percentage": 100.0, "price": 10000.0, "age": 120.0},
)
# Each column has its own bounds
pct_result = results.for_column("percentage").best(n=1)[0]
print(pct_result.lower_bound, pct_result.upper_bound) # 0.0, 100.0
price_result = results.for_column("price").best(n=1)[0]
print(price_result.lower_bound, price_result.upper_bound) # 0.0, 10000.0
Partial dictionaries are supported; any bound not specified for a column is auto-detected from the data:
# Only specify bounds for some columns
results = fitter.fit(
    df,
    columns=["col_a", "col_b", "col_c"],
    bounded=True,
    lower_bound={"col_a": 0.0},    # Only col_a has an explicit lower bound
    upper_bound={"col_b": 100.0},  # Only col_b has an explicit upper bound
)
# col_a's upper bound and col_b's lower bound are auto-detected from the data
# col_c auto-detects both bounds from the data
Scalar bounds apply to all columns (backward compatible):
# Same bounds for all columns
results = fitter.fit(
    df,
    columns=["col_a", "col_b", "col_c"],
    bounded=True,
    lower_bound=0.0,  # Applied to all columns
    upper_bound=1.0,  # Applied to all columns
)
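The three ways of passing bounds (per-column dictionary, partial dictionary, scalar) can be summarized in a short plain-Python sketch. The helpers `bound_for` and `resolve_column_bounds` are hypothetical names used for illustration and operate on in-memory lists rather than a Spark DataFrame.

```python
from typing import Dict, Optional, Sequence, Tuple, Union

BoundSpec = Union[None, float, Dict[str, float]]


def bound_for(column: str, spec: BoundSpec) -> Optional[float]:
    """Return the explicit bound for a column, or None to auto-detect."""
    if spec is None:
        return None
    if isinstance(spec, dict):
        return spec.get(column)  # missing key -> auto-detect for this column
    return float(spec)           # scalar -> same bound for every column


def resolve_column_bounds(
    data: Dict[str, Sequence[float]],
    lower_bound: BoundSpec = None,
    upper_bound: BoundSpec = None,
) -> Dict[str, Tuple[float, float]]:
    resolved = {}
    for column, values in data.items():
        lo = bound_for(column, lower_bound)
        hi = bound_for(column, upper_bound)
        resolved[column] = (
            min(values) if lo is None else lo,
            max(values) if hi is None else hi,
        )
    return resolved


data = {
    "col_a": [1.0, 5.0, 9.0],
    "col_b": [10.0, 50.0, 90.0],
    "col_c": [-2.0, 0.5, 3.0],
}
partial = resolve_column_bounds(
    data, lower_bound={"col_a": 0.0}, upper_bound={"col_b": 100.0}
)
scalar = resolve_column_bounds(data, lower_bound=0.0, upper_bound=1.0)
```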
Use Cases
Percentages and Proportions (0-100 or 0-1)
results = fitter.fit(
    df, column="conversion_rate",
    bounded=True, lower_bound=0.0, upper_bound=1.0
)
Non-Negative Values (prices, counts, durations)
results = fitter.fit(
    df, column="price",
    bounded=True, lower_bound=0.0
)
Age Data
results = fitter.fit(
    df, column="age",
    bounded=True, lower_bound=0.0, upper_bound=120.0
)
Score Ranges
results = fitter.fit(
    df, column="credit_score",
    bounded=True, lower_bound=300.0, upper_bound=850.0
)
Discrete Distributions
Bounded fitting is also supported for discrete distributions:
from spark_bestfit import DiscreteDistributionFitter
# Auto-detect bounds
fitter = DiscreteDistributionFitter(spark)
results = fitter.fit(df, column="count", bounded=True)
# Explicit bounds
results = fitter.fit(
    df,
    column="count",
    bounded=True,
    lower_bound=0,
    upper_bound=100,
)
best = results.best(n=1, metric="aic")[0]
print(best.lower_bound, best.upper_bound)
Note
For discrete distributions, bounds are stored with the fit result, but sampling uses the underlying scipy distribution rather than a truncated one. The bounds serve as metadata describing the valid range of the fitted distribution.
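If samples from the underlying discrete distribution can fall outside the stored bounds (an assumption based on the note above), one generic way to enforce them yourself is rejection sampling. The sketch below uses only the standard library; `poisson_draw` and `sample_within_bounds` are hypothetical helpers, and the Poisson sampler stands in for whatever distribution was fitted.

```python
import math
import random


def poisson_draw(rng: random.Random, lam: float) -> int:
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_within_bounds(draw, lower, upper, n, max_tries=100_000):
    """Rejection sampling: keep only draws that fall inside [lower, upper]."""
    kept = []
    for _ in range(max_tries):
        value = draw()
        if lower <= value <= upper:
            kept.append(value)
            if len(kept) == n:
                return kept
    raise RuntimeError("too few draws fell inside the bounds")


rng = random.Random(42)
counts = sample_within_bounds(lambda: poisson_draw(rng, 4.0), lower=2, upper=8, n=500)
```

Rejection sampling is simple but wasteful when little probability mass lies inside the bounds; in that case the `max_tries` guard raises instead of looping indefinitely.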
Performance Considerations
Bounded fitting adds minimal overhead:
- Fitting time is unchanged (bounds are applied post-fit)
- Sampling is ~10% slower due to the CDF inversion transform
- PDF/CDF/PPF evaluation has negligible overhead
For very large sample generation, the overhead of truncation is small compared to the random number generation itself.