FitterConfig Builder

Added in version 2.2.0.

spark-bestfit provides a fluent builder pattern for configuring distribution fitting. The FitterConfig dataclass and FitterConfigBuilder offer a cleaner alternative to passing many parameters to fit().

Why Use FitterConfig?

The fit() method supports 15+ parameters for continuous distributions:

# Traditional approach - many parameters
results = fitter.fit(
    df, column="value",
    bins=100,
    use_rice_rule=False,
    support_at_zero=True,
    max_distributions=50,
    prefilter=True,
    enable_sampling=True,
    sample_fraction=0.1,
    max_sample_size=500_000,
    sample_threshold=5_000_000,
    bounded=True,
    lower_bound=0.0,
    upper_bound=100.0,
    num_partitions=16,
    lazy_metrics=True,
)

With FitterConfigBuilder, this becomes:

from spark_bestfit import FitterConfigBuilder

# Builder pattern - cleaner and reusable
config = (FitterConfigBuilder()
    .with_bins(100, use_rice_rule=False)
    .with_support_at_zero()
    .with_max_distributions(50)
    .with_prefilter()
    .with_sampling(fraction=0.1, max_size=500_000, threshold=5_000_000)
    .with_bounds(lower=0.0, upper=100.0)
    .with_partitions(16)
    .with_lazy_metrics()
    .build())

results = fitter.fit(df, column="value", config=config)

Benefits:

  • Cleaner code: Grouped, readable configuration

  • Reusable: Same config works across multiple fits

  • IDE-friendly: Better autocomplete and discoverability

  • Immutable: Frozen dataclass prevents accidental mutation

  • Backward compatible: Individual parameters still work

Basic Usage

Create a configuration using the builder:

from spark_bestfit import DistributionFitter, FitterConfigBuilder, LocalBackend

# Create a configuration
config = (FitterConfigBuilder()
    .with_bins(100)
    .with_lazy_metrics()
    .build())

# Use with fitter
fitter = DistributionFitter(backend=LocalBackend())
results = fitter.fit(df, column="value", config=config)

Or create FitterConfig directly:

from spark_bestfit import FitterConfig

config = FitterConfig(
    bins=100,
    lazy_metrics=True,
)

Builder Methods

Method

Description

with_bins(bins, use_rice_rule)

Configure histogram binning (continuous only)

with_bounds(lower, upper, auto_detect)

Enable bounded/truncated distribution fitting

with_sampling(fraction, max_size, threshold, enabled)

Configure data sampling for large datasets

with_lazy_metrics(lazy)

Defer KS/AD computation until accessed

with_prefilter(mode)

Pre-filter incompatible distributions

with_support_at_zero(enabled)

Only fit non-negative distributions

with_max_distributions(n)

Limit number of distributions to fit

with_partitions(n)

Set parallel partition count

build()

Create immutable FitterConfig

Estimation Method

Added in version 2.5.0.

Configure the parameter estimation method:

config = (
    FitterConfigBuilder()
    .with_estimation_method('mse')  # or 'mle', 'auto'
    .build()
)

Options:

  • mle: Maximum Likelihood Estimation (default)

  • mse: Maximum Spacing Estimation (robust for heavy-tailed)

  • auto: Automatically select based on data characteristics

Config Attributes

Attribute

Default

Description

bins

50

Number of histogram bins or tuple of bin edges

use_rice_rule

True

Auto-determine bin count using Rice rule

support_at_zero

False

Only fit non-negative distributions

max_distributions

None

Limit distributions to fit (None = all)

prefilter

False

Pre-filter incompatible distributions

enable_sampling

True

Enable sampling for large datasets

sample_fraction

None

Explicit sample fraction (None = auto)

max_sample_size

1,000,000

Max rows when auto-determining sample

sample_threshold

10,000,000

Row count above which sampling applies

bounded

False

Enable truncated distribution fitting

lower_bound

None

Lower bound (scalar or per-column dict)

upper_bound

None

Upper bound (scalar or per-column dict)

num_partitions

None

Parallel partitions (None = auto)

lazy_metrics

False

Defer KS/AD computation

Reusing Configurations

A key benefit of FitterConfig is reusability across multiple fits:

from spark_bestfit import DistributionFitter, FitterConfigBuilder

# Create config once
config = (FitterConfigBuilder()
    .with_bins(100)
    .with_bounds(lower=0)
    .with_lazy_metrics()
    .build())

fitter = DistributionFitter(spark)

# Reuse for multiple columns
for col in ["price", "quantity", "revenue"]:
    results = fitter.fit(df, column=col, config=config)
    best = results.best(n=1, metric="aic")[0]
    print(f"{col}: {best.distribution}")

# Reuse for different DataFrames
for df in [df_train, df_test, df_validation]:
    results = fitter.fit(df, column="value", config=config)

Per-Column Bounds

For multi-column fitting with different bounds per column:

config = (FitterConfigBuilder()
    .with_bounds(
        lower={"price": 0.0, "temperature": -40.0},
        upper={"price": 10000.0, "temperature": 50.0}
    )
    .build())

results = fitter.fit(df, columns=["price", "temperature"], config=config)

Progress Callback Override

The progress_callback parameter can be passed directly to fit() even when using a config. This allows different callbacks for different fits while reusing the same config:

from spark_bestfit import console_progress

config = (FitterConfigBuilder()
    .with_lazy_metrics()
    .build())

# Different callback per fit
results1 = fitter.fit(df, column="col1", config=config, progress_callback=console_progress)
results2 = fitter.fit(df, column="col2", config=config)  # No callback

Or set the callback on the config itself:

config = (FitterConfigBuilder()
    .with_lazy_metrics()
    .build()
    .with_progress_callback(console_progress))

Config for Discrete Distributions

FitterConfig works with both continuous and discrete fitters. Continuous-only attributes (like bins, use_rice_rule, support_at_zero) are simply ignored by DiscreteDistributionFitter:

from spark_bestfit import DiscreteDistributionFitter, FitterConfigBuilder

# Same config works for both fitters
config = (FitterConfigBuilder()
    .with_bounds(lower=0, upper=100)
    .with_lazy_metrics()
    .build())

# Continuous fitter
continuous_fitter = DistributionFitter(spark)
continuous_results = continuous_fitter.fit(df, column="value", config=config)

# Discrete fitter (bins/support_at_zero ignored)
discrete_fitter = DiscreteDistributionFitter(spark)
discrete_results = discrete_fitter.fit(df, column="counts", config=config)

Backward Compatibility

Individual parameters continue to work as before. When both config and individual parameters are provided, config takes precedence:

config = FitterConfigBuilder().with_max_distributions(5).build()

# Config wins: max_distributions=5 is used, not 10
results = fitter.fit(
    df, column="value",
    config=config,
    max_distributions=10,  # Ignored when config is provided
)

Exception: progress_callback always overrides the config’s callback when passed directly to fit().

Immutability

FitterConfig is a frozen dataclass. Attempting to modify it raises an error:

config = FitterConfig(bins=100)
config.bins = 200  # Raises FrozenInstanceError!

To create a modified config, use dataclasses.replace():

from dataclasses import replace

config = FitterConfig(bins=100, lazy_metrics=True)
modified = replace(config, bins=200)  # New config with bins=200

Or use the with_progress_callback() convenience method:

config = FitterConfig(lazy_metrics=True)
with_callback = config.with_progress_callback(my_callback)