ADR-0004: Adaptive Sampling Strategy

Status:

Accepted

Date:

2026-01-11 (v3.0.0)

Context

Distribution fitting on large datasets (100M+ rows) requires sampling to achieve reasonable performance. The original approach used uniform random sampling, which works well for symmetric distributions but can miss important characteristics of skewed data:

  1. Tail underrepresentation: Heavy-tailed distributions (Pareto, lognormal) have rare but important extreme values that uniform sampling may miss

  2. Fitting failures: Undersampled tails lead to poor parameter estimates, especially for shape parameters

  3. One-size-fits-all cost: Always stratifying would address the problems above, but symmetric data doesn’t need stratified sampling’s overhead
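To put a rough number on the tail underrepresentation described above, the following sketch computes how many tail rows a uniform sample is expected to contain and how likely it is to contain none. It assumes independent draws from a population much larger than the sample (a binomial approximation). The function name and the example sample size are illustrative and do not come from the project.

```python
def tail_coverage(sample_size: int, tail_quantile: float) -> tuple[float, float]:
    """Expected tail rows in a uniform sample, and the chance of drawing none.

    Assumes each sampled row independently lands above `tail_quantile`
    with probability 1 - tail_quantile.
    """
    if not 0.0 < tail_quantile < 1.0:
        raise ValueError("tail_quantile must be strictly between 0 and 1")
    tail_prob = 1.0 - tail_quantile
    expected_rows = sample_size * tail_prob
    prob_no_tail_rows = tail_quantile ** sample_size
    return expected_rows, prob_no_tail_rows


# Illustrative numbers: a 10,000-row uniform sample and the top 0.01% of values.
expected, p_none = tail_coverage(10_000, 0.9999)
print(f"expected tail rows: {expected:.2f}, P(no tail rows): {p_none:.2f}")
```

Under these assumptions the sample is expected to hold about one row from the top 0.01% and has a better than one-in-three chance of holding none, which leaves very little information for estimating a shape parameter.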

We needed sampling that adapts to data characteristics.

Decision

We implemented adaptive sampling with three modes in config.py:

from enum import Enum

class SamplingMode(Enum):
    AUTO = "auto"              # Select based on skewness
    UNIFORM = "uniform"        # Force uniform random sampling
    STRATIFIED = "stratified"  # Force stratified sampling

Skewness-based selection (AUTO mode):

  • |skew| < 0.5 (mild): Uniform sampling - efficient for symmetric data

  • 0.5 <= |skew| < 2.0 (moderate): Stratified with 5 percentile bins

  • |skew| >= 2.0 (high): Stratified with 10 bins + tail oversampling
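The threshold rules above can be sketched as a small pure function. This is an illustration of the decision logic only, not the project's code: the `Strategy` enum and `select_strategy` name are hypothetical, and the defaults mirror the 0.5 and 2.0 thresholds listed here.

```python
from enum import Enum


class Strategy(Enum):
    """Hypothetical labels for the three AUTO-mode outcomes."""
    UNIFORM = "uniform"
    STRATIFIED_5_BINS = "stratified_5_bins"
    STRATIFIED_10_BINS_TAIL = "stratified_10_bins_tail"


def select_strategy(skew: float, mild: float = 0.5, high: float = 2.0) -> Strategy:
    """Map a skewness estimate to a sampling strategy using the thresholds above."""
    if not (mild < high):
        raise ValueError("mild threshold must be below the high threshold")
    magnitude = abs(skew)  # direction of the skew does not affect the choice
    if magnitude < mild:
        return Strategy.UNIFORM
    if magnitude < high:
        return Strategy.STRATIFIED_5_BINS
    return Strategy.STRATIFIED_10_BINS_TAIL


print(select_strategy(0.1))   # mild skew
print(select_strategy(-1.3))  # moderate skew, negative direction
print(select_strategy(4.0))   # high skew
```

Because the comparison uses the absolute value, left-skewed and right-skewed data of equal magnitude select the same strategy, and the boundary values 0.5 and 2.0 fall into the more conservative (stratified) side.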

Configuration:

config = (FitterConfigBuilder()
    .with_adaptive_sampling(
        enabled=True,
        mode=SamplingMode.AUTO,
        skew_threshold_mild=0.5,
        skew_threshold_high=2.0,
    )
    .build())

Implementation details:

  1. Skewness is computed on a small preliminary sample (10k rows)

  2. Stratified sampling uses percentile-based bins to ensure representation

  3. High-skew mode oversamples the 95th+ percentile tail

  4. Thresholds are configurable for domain-specific tuning

Consequences

Positive:

  • Better parameter estimates for skewed distributions

  • Automatic detection removes user guesswork

  • Configurable thresholds allow domain tuning

  • Backwards compatible: UNIFORM mode preserves old behavior

Negative:

  • Additional computation for skewness detection

  • Stratified sampling is slower than uniform

  • Two-stage sampling (detect then sample) adds latency

Neutral:

  • Default thresholds (0.5, 2.0) are based on the “moderate” and “high” skewness classifications used in statistical literature

References