FitterConfig Builder
Added in version 2.2.0.
spark-bestfit provides a fluent builder pattern for configuring distribution fitting.
The FitterConfig dataclass and FitterConfigBuilder offer a cleaner alternative
to passing many parameters to fit().
Why Use FitterConfig?
The fit() method supports 15+ parameters for continuous distributions:
# Traditional approach - many parameters
results = fitter.fit(
    df, column="value",
    bins=100,
    use_rice_rule=False,
    support_at_zero=True,
    max_distributions=50,
    prefilter=True,
    enable_sampling=True,
    sample_fraction=0.1,
    max_sample_size=500_000,
    sample_threshold=5_000_000,
    bounded=True,
    lower_bound=0.0,
    upper_bound=100.0,
    num_partitions=16,
    lazy_metrics=True,
)
With FitterConfigBuilder, this becomes:
from spark_bestfit import FitterConfigBuilder
# Builder pattern - cleaner and reusable
config = (FitterConfigBuilder()
    .with_bins(100, use_rice_rule=False)
    .with_support_at_zero()
    .with_max_distributions(50)
    .with_prefilter()
    .with_sampling(fraction=0.1, max_size=500_000, threshold=5_000_000)
    .with_bounds(lower=0.0, upper=100.0)
    .with_partitions(16)
    .with_lazy_metrics()
    .build())
results = fitter.fit(df, column="value", config=config)
Benefits:
Cleaner code: Grouped, readable configuration
Reusable: Same config works across multiple fits
IDE-friendly: Better autocomplete and discoverability
Immutable: Frozen dataclass prevents accidental mutation
Backward compatible: Individual parameters still work
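To make the pattern concrete, here is a minimal, standard-library sketch of a fluent builder that produces a frozen dataclass. MiniConfig and MiniConfigBuilder are hypothetical stand-ins with only three fields; they illustrate the general technique and are not spark-bestfit's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniConfig:
    """Hypothetical stand-in for FitterConfig with only three fields."""
    bins: int = 50
    use_rice_rule: bool = True
    lazy_metrics: bool = False


class MiniConfigBuilder:
    """Hypothetical fluent builder; every with_* method returns self."""

    def __init__(self) -> None:
        self._values: dict = {}

    def with_bins(self, bins: int, use_rice_rule: bool = True) -> "MiniConfigBuilder":
        self._values["bins"] = bins
        self._values["use_rice_rule"] = use_rice_rule
        return self  # returning self is what makes chaining possible

    def with_lazy_metrics(self, enabled: bool = True) -> "MiniConfigBuilder":
        self._values["lazy_metrics"] = enabled
        return self

    def build(self) -> MiniConfig:
        # Fields never set on the builder fall back to the dataclass defaults.
        return MiniConfig(**self._values)


config = (MiniConfigBuilder()
    .with_bins(100, use_rice_rule=False)
    .with_lazy_metrics()
    .build())
print(config)
```

Because the builder only collects values and the dataclass is frozen, the object returned by build() can be shared between fits without risk of one call site changing it for another.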
Basic Usage
Create a configuration using the builder:
from spark_bestfit import DistributionFitter, FitterConfigBuilder, LocalBackend
# Create a configuration
config = (FitterConfigBuilder()
    .with_bins(100)
    .with_lazy_metrics()
    .build())
# Use with fitter
fitter = DistributionFitter(backend=LocalBackend())
results = fitter.fit(df, column="value", config=config)
Or create FitterConfig directly:
from spark_bestfit import FitterConfig
config = FitterConfig(
    bins=100,
    lazy_metrics=True,
)
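Both examples above enable lazy_metrics, which this page describes as deferring KS/AD computation until accessed. The following sketch shows one common way to get that behavior in plain Python, using functools.cached_property. LazyResult and its ks_calls counter are hypothetical and exist only for illustration; how spark-bestfit defers its metrics internally is not shown here.

```python
from functools import cached_property


class LazyResult:
    """Hypothetical result object; not the library's result class."""

    def __init__(self, sample, cdf):
        self._sample = sorted(sample)
        self._cdf = cdf
        self.ks_calls = 0  # counts how often the statistic is actually computed

    @cached_property
    def ks_statistic(self) -> float:
        # One-sample Kolmogorov-Smirnov statistic: the largest gap between the
        # empirical CDF and the model CDF, checked on both sides of each step.
        self.ks_calls += 1
        n = len(self._sample)
        return max(
            max((i + 1) / n - self._cdf(x), self._cdf(x) - i / n)
            for i, x in enumerate(self._sample)
        )


# Uniform(0, 1) model: the CDF is the identity on [0, 1].
result = LazyResult([0.1, 0.4, 0.7, 0.9], cdf=lambda x: x)
print("computed at construction:", result.ks_calls)
first = result.ks_statistic   # computed on first access
second = result.ks_statistic  # served from the cache
print("computed after two reads:", result.ks_calls)
```

The trade-off of this design is that the cost of the metric moves from fit time to first access, which pays off when only a few results are ever inspected.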
Builder Methods
| Method | Description |
|---|---|
| with_bins() | Configure histogram binning (continuous only) |
| with_bounds() | Enable bounded/truncated distribution fitting |
| with_sampling() | Configure data sampling for large datasets |
| with_lazy_metrics() | Defer KS/AD computation until accessed |
| with_prefilter() | Pre-filter incompatible distributions |
| with_support_at_zero() | Only fit non-negative distributions |
| with_max_distributions() | Limit number of distributions to fit |
| with_partitions() | Set parallel partition count |
| build() | Create immutable FitterConfig |
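The use_rice_rule flag accepted by with_bins() refers to the Rice rule, a standard heuristic that sets the number of histogram bins to the ceiling of 2 * n^(1/3) for n observations. The helper below, rice_rule_bins, is a hypothetical name used for illustration; whether spark-bestfit applies exactly this formula or adds rounding or caps of its own is an assumption.

```python
import math


def rice_rule_bins(n_rows: int) -> int:
    """Rice rule: ceil(2 * n^(1/3)) bins for n observations."""
    if n_rows < 1:
        raise ValueError("n_rows must be positive")
    return math.ceil(2 * n_rows ** (1 / 3))


# The bin count grows slowly with the number of rows.
for n in (8, 1_000, 100_000):
    print(n, rice_rule_bins(n))
```

Passing use_rice_rule=False, as in the earlier builder example, is how a fixed bin count is requested instead.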
Estimation Method
Added in version 2.5.0.
Configure the parameter estimation method:
config = (
    FitterConfigBuilder()
    .with_estimation_method('mse')  # or 'mle', 'auto'
    .build()
)
Options:
mle: Maximum Likelihood Estimation (default)
mse: Maximum Spacing Estimation (robust for heavy-tailed data)
auto: Automatically select based on data characteristics
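To show how the two estimators differ, the sketch below fits the rate of an exponential distribution both ways using only the standard library. Maximum likelihood has a closed form here (n divided by the sum of the sample), while maximum spacing estimation maximizes the mean log spacing between consecutive CDF values of the sorted sample. The function names, the sample, and the crude grid search are all illustrative assumptions; this is not how spark-bestfit performs the fit.

```python
import math


def exp_cdf(x: float, rate: float) -> float:
    return 1.0 - math.exp(-rate * x)


def mle_rate(sample) -> float:
    # Closed-form maximum likelihood estimate for the exponential rate.
    return len(sample) / sum(sample)


def spacing_objective(sample, rate: float) -> float:
    # Mean log spacing between consecutive CDF values, with 0 and 1 as the
    # end points. Assumes the sample has no ties (a tie gives a zero spacing).
    points = [0.0] + [exp_cdf(x, rate) for x in sorted(sample)] + [1.0]
    gaps = [b - a for a, b in zip(points, points[1:])]
    return sum(math.log(g) for g in gaps) / len(gaps)


def mse_rate(sample, grid) -> float:
    # Crude grid search; a real implementation would use a numerical optimizer.
    return max(grid, key=lambda rate: spacing_objective(sample, rate))


sample = [0.2, 0.5, 0.9, 1.4, 2.3]
grid = [0.01 * k for k in range(1, 500)]
print("MLE rate:", mle_rate(sample))
print("MSE rate:", mse_rate(sample, grid))
```

The spacing objective depends on the data only through CDF values, so it stays bounded even where a likelihood would not, which is one reason the method is often preferred for difficult distributions.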
Config Attributes
| Attribute | Default | Description |
|---|---|---|
| bins | 50 | Number of histogram bins or tuple of bin edges |
| use_rice_rule | True | Auto-determine bin count using Rice rule |
| support_at_zero | False | Only fit non-negative distributions |
| max_distributions | None | Limit distributions to fit (None = all) |
| prefilter | False | Pre-filter incompatible distributions |
| enable_sampling | True | Enable sampling for large datasets |
| sample_fraction | None | Explicit sample fraction (None = auto) |
| max_sample_size | 1,000,000 | Max rows when auto-determining sample |
| sample_threshold | 10,000,000 | Row count above which sampling applies |
| bounded | False | Enable truncated distribution fitting |
| lower_bound | None | Lower bound (scalar or per-column dict) |
| upper_bound | None | Upper bound (scalar or per-column dict) |
| num_partitions | None | Parallel partitions (None = auto) |
| lazy_metrics | False | Defer KS/AD computation |
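The four sampling attributes interact, so a short sketch may help. The function below is a hypothetical reading of the descriptions in the table: no sampling at or below sample_threshold, an explicit sample_fraction when one is given, and otherwise a fraction chosen so that roughly max_sample_size rows remain. The name effective_sample_fraction and the exact rules are assumptions, not the library's code.

```python
from typing import Optional


def effective_sample_fraction(
    row_count: int,
    enable_sampling: bool = True,
    sample_fraction: Optional[float] = None,
    max_sample_size: int = 1_000_000,
    sample_threshold: int = 10_000_000,
) -> float:
    """Hypothetical helper: fraction of rows to keep before fitting."""
    if not enable_sampling or row_count <= sample_threshold:
        return 1.0  # small enough (or sampling disabled): fit on all rows
    if sample_fraction is not None:
        return sample_fraction  # an explicit fraction wins over auto mode
    # Auto mode: aim for at most max_sample_size rows.
    return min(1.0, max_sample_size / row_count)


print(effective_sample_fraction(5_000_000))
print(effective_sample_fraction(50_000_000))
print(effective_sample_fraction(50_000_000, sample_fraction=0.1))
```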
Reusing Configurations
A key benefit of FitterConfig is reusability across multiple fits:
from spark_bestfit import DistributionFitter, FitterConfigBuilder
# Create config once
config = (FitterConfigBuilder()
    .with_bins(100)
    .with_bounds(lower=0)
    .with_lazy_metrics()
    .build())
fitter = DistributionFitter(spark)
# Reuse for multiple columns
for col in ["price", "quantity", "revenue"]:
    results = fitter.fit(df, column=col, config=config)
    best = results.best(n=1, metric="aic")[0]
    print(f"{col}: {best.distribution}")
# Reuse for different DataFrames
for df in [df_train, df_test, df_validation]:
    results = fitter.fit(df, column="value", config=config)
Per-Column Bounds
For multi-column fitting with different bounds per column:
config = (FitterConfigBuilder()
    .with_bounds(
        lower={"price": 0.0, "temperature": -40.0},
        upper={"price": 10000.0, "temperature": 50.0}
    )
    .build())
results = fitter.fit(df, columns=["price", "temperature"], config=config)
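Since a bound may be either a scalar or a per-column dict, each column needs its own bound looked up at fit time. The sketch below shows one plausible lookup. resolve_bound is a hypothetical helper, and treating a column that is missing from the dict as unbounded is an assumption about behavior this page does not specify.

```python
from typing import Dict, Optional, Union

Bound = Union[float, Dict[str, float], None]


def resolve_bound(bound: Bound, column: str) -> Optional[float]:
    """Hypothetical helper: pick the bound that applies to one column."""
    if isinstance(bound, dict):
        return bound.get(column)  # column not listed -> None (unbounded)
    return bound  # a scalar (or None) applies to every column


lower = {"price": 0.0, "temperature": -40.0}
upper = {"price": 10000.0, "temperature": 50.0}
for column in ["price", "temperature"]:
    print(column, resolve_bound(lower, column), resolve_bound(upper, column))
```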
Progress Callback Override
The progress_callback parameter can be passed directly to fit() even when
using a config. This allows different callbacks for different fits while reusing
the same config:
from spark_bestfit import console_progress
config = (FitterConfigBuilder()
    .with_lazy_metrics()
    .build())
# Different callback per fit
results1 = fitter.fit(df, column="col1", config=config, progress_callback=console_progress)
results2 = fitter.fit(df, column="col2", config=config) # No callback
Or set the callback on the config itself:
config = (FitterConfigBuilder()
    .with_lazy_metrics()
    .build()
    .with_progress_callback(console_progress))
Config for Discrete Distributions
FitterConfig works with both continuous and discrete fitters. Continuous-only
attributes (like bins, use_rice_rule, support_at_zero) are simply
ignored by DiscreteDistributionFitter:
from spark_bestfit import DiscreteDistributionFitter, DistributionFitter, FitterConfigBuilder
# Same config works for both fitters
config = (FitterConfigBuilder()
    .with_bounds(lower=0, upper=100)
    .with_lazy_metrics()
    .build())
# Continuous fitter
continuous_fitter = DistributionFitter(spark)
continuous_results = continuous_fitter.fit(df, column="value", config=config)
# Discrete fitter (bins/support_at_zero ignored)
discrete_fitter = DiscreteDistributionFitter(spark)
discrete_results = discrete_fitter.fit(df, column="counts", config=config)
Backward Compatibility
Individual parameters continue to work as before. When both config and
individual parameters are provided, config takes precedence:
config = FitterConfigBuilder().with_max_distributions(5).build()
# Config wins: max_distributions=5 is used, not 10
results = fitter.fit(
    df, column="value",
    config=config,
    max_distributions=10,  # Ignored when config is provided
)
Exception: progress_callback always overrides the config’s callback when
passed directly to fit().
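The precedence rule and its one exception can be summarized in a few lines. The sketch below is a hypothetical model of that rule: DemoConfig and resolve_options are illustrative names, and only two options are modeled. It mirrors the behavior described above rather than the library's internals.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for FitterConfig with two fields."""
    max_distributions: Optional[int] = None
    progress_callback: Optional[Callable] = None


def resolve_options(config=None, max_distributions=None, progress_callback=None):
    """Hypothetical resolution of fit() arguments against a config."""
    if config is None:
        # No config: the individual parameters are used as given.
        return max_distributions, progress_callback
    # Config present: its values win, except for an explicitly passed callback.
    callback = progress_callback if progress_callback is not None else config.progress_callback
    return config.max_distributions, callback


def from_config(*args):
    pass


def from_call(*args):
    pass


config = DemoConfig(max_distributions=5, progress_callback=from_config)
limit, callback = resolve_options(config, max_distributions=10, progress_callback=from_call)
print(limit, callback.__name__)
```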
Immutability
FitterConfig is a frozen dataclass. Attempting to modify it raises an error:
config = FitterConfig(bins=100)
config.bins = 200  # Raises dataclasses.FrozenInstanceError
To create a modified config, use dataclasses.replace():
from dataclasses import replace
config = FitterConfig(bins=100, lazy_metrics=True)
modified = replace(config, bins=200) # New config with bins=200
Or use the with_progress_callback() convenience method:
config = FitterConfig(lazy_metrics=True)
with_callback = config.with_progress_callback(my_callback)
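The frozen-dataclass behavior described in this section can be tried without spark-bestfit installed. The sketch below uses a hypothetical stand-in class, FrozenDemo, to show both the error raised on assignment and the copy-with-changes semantics of dataclasses.replace().

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenDemo:
    """Hypothetical stand-in for FitterConfig."""
    bins: int = 50
    lazy_metrics: bool = False


original = FrozenDemo(bins=100, lazy_metrics=True)

try:
    original.bins = 200
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

# replace() builds a new instance, copying every field except the overrides.
modified = replace(original, bins=200)
print(mutation_blocked, original, modified)
```

Because replace() returns a new object, any code still holding the original config keeps seeing the original values.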