ADR-0004: Adaptive Sampling Strategy
- Status: Accepted
- Date: 2026-01-11 (v3.0.0)
Context
Distribution fitting on large datasets (100M+ rows) requires sampling to achieve reasonable performance. The original approach used uniform random sampling, which works well for symmetric distributions but can miss important characteristics of skewed data:
Tail underrepresentation: Heavy-tailed distributions (Pareto, lognormal) have rare but important extreme values that uniform sampling may miss
Fitting failures: Undersampled tails lead to poor parameter estimates, especially for shape parameters
One-size-fits-all: Always stratifying is not the answer either, since symmetric data doesn’t need stratified sampling’s overhead
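To make the tail underrepresentation point concrete, here is a small back-of-the-envelope sketch. It assumes independent draws (a reasonable approximation when the dataset is much larger than the sample). The function names and the example sample size are illustrative, not taken from the project.

```python
def prob_tail_missed(sample_size: int, tail_fraction: float) -> float:
    """Probability that a uniform sample contains no point from the tail.

    Assumes independent draws, i.e. the dataset is much larger than the sample.
    """
    return (1.0 - tail_fraction) ** sample_size


def expected_tail_points(sample_size: int, tail_fraction: float) -> float:
    """Expected number of sampled points that fall in the tail."""
    return sample_size * tail_fraction


if __name__ == "__main__":
    # Hypothetical numbers: a 10,000-row sample and the top 0.01% of values.
    print(prob_tail_missed(10_000, 1e-4))      # chance the sample sees none of them
    print(expected_tail_points(10_000, 1e-4))  # expected count in the sample
```

With these illustrative numbers, the sample misses the top 0.01% entirely a little over a third of the time, and on average contains only one such point, which is too few to constrain a shape parameter.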
We needed sampling that adapts to data characteristics.
Decision
We implemented adaptive sampling with three modes in config.py:
from enum import Enum

class SamplingMode(Enum):
    AUTO = "auto"              # Select based on skewness
    UNIFORM = "uniform"        # Force uniform random sampling
    STRATIFIED = "stratified"  # Force stratified sampling
Skewness-based selection (AUTO mode):
|skew| < 0.5 (mild): Uniform sampling, efficient for symmetric data
0.5 <= |skew| < 2.0 (moderate): Stratified with 5 percentile bins
|skew| >= 2.0 (high): Stratified with 10 bins + tail oversampling
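The AUTO selection rule above can be sketched as a small pure function. This is only an illustration of the thresholds listed here: `SamplingPlan` and `plan_for_skew` are hypothetical names, and the project's actual selection code may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPlan:
    """Hypothetical container for the outcome of AUTO selection."""
    strategy: str          # "uniform" or "stratified"
    n_bins: int            # number of percentile bins (0 for uniform)
    oversample_tail: bool  # whether the upper tail gets extra samples


def plan_for_skew(skew: float, mild: float = 0.5, high: float = 2.0) -> SamplingPlan:
    """Map a skewness estimate to a sampling plan using the ADR thresholds."""
    magnitude = abs(skew)  # the rule depends on |skew|, not its sign
    if magnitude < mild:
        return SamplingPlan("uniform", 0, False)
    if magnitude < high:
        return SamplingPlan("stratified", 5, False)
    return SamplingPlan("stratified", 10, True)
```

Both boundaries are inclusive on the upper side: a skewness of exactly 0.5 selects the moderate plan, and exactly 2.0 selects the high-skew plan.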
Configuration:
config = (FitterConfigBuilder()
    .with_adaptive_sampling(
        enabled=True,
        mode=SamplingMode.AUTO,
        skew_threshold_mild=0.5,
        skew_threshold_high=2.0,
    )
    .build())
Implementation details:
Skewness is computed on a small preliminary sample (10k rows)
Stratified sampling uses percentile-based bins to ensure representation
High-skew mode oversamples the 95th+ percentile tail
Thresholds are configurable for domain-specific tuning
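The two stages described above (estimate skewness on a preliminary sample, then draw a stratified sample) might look roughly like the following standard-library sketch. It is not the project's implementation. The function names, the `tail_start` and `tail_boost` parameters, the proportional allocation, and the boost factor are all assumptions made for illustration, and an in-memory list stands in for the real dataset. Sorting the full data as done here would not be practical at 100M+ rows.

```python
import random
import statistics


def sample_skewness(values):
    """Biased (population) skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def stratified_sample(values, n_samples, n_bins, tail_start=None,
                      tail_boost=1.0, rng=None):
    """Draw roughly n_samples values from equal-count percentile bins.

    If tail_start is given (e.g. 0.95), the values above that quantile
    form their own stratum and receive tail_boost times their
    proportional share of the sample budget (an assumed scheme).
    """
    rng = rng or random.Random()
    ordered = sorted(values)
    n = len(ordered)
    # Stratum boundaries as quantile fractions: 0, 1/n_bins, ..., 1.
    cuts = [i / n_bins for i in range(n_bins + 1)]
    if tail_start is not None and tail_start not in cuts:
        cuts = sorted(cuts + [tail_start])  # split the tail off as a stratum
    strata = []
    for lo, hi in zip(cuts, cuts[1:]):
        weight = hi - lo  # proportional allocation by stratum width
        if tail_start is not None and lo >= tail_start:
            weight *= tail_boost  # oversample strata inside the tail
        strata.append((lo, hi, weight))
    total = sum(w for _, _, w in strata)
    sample = []
    for lo, hi, weight in strata:
        members = ordered[round(lo * n):round(hi * n)]
        k = min(len(members), round(n_samples * weight / total))
        sample.extend(rng.sample(members, k))  # without replacement
    return sample


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.lognormvariate(0.0, 1.5) for _ in range(50_000)]
    skew = sample_skewness(rng.sample(data, 10_000))  # stage 1: detect
    picked = stratified_sample(data, 2_000, 10, tail_start=0.95,
                               tail_boost=4.0, rng=rng)  # stage 2: sample
    print(f"skew={skew:.2f}, sample size={len(picked)}")
```

Because the tail stratum is deliberately overrepresented, any estimator fitted on such a sample would need to account for the unequal inclusion probabilities, for example by weighting each point by the inverse of its stratum's sampling rate. The ADR does not say how the project handles this.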
Consequences
Positive:
Better parameter estimates for skewed distributions
Automatic detection removes user guesswork
Configurable thresholds allow domain tuning
Backwards compatible: UNIFORM mode preserves old behavior
Negative:
Additional computation for skewness detection
Stratified sampling is slower than uniform
Two-stage sampling (detect then sample) adds latency
Neutral:
Default thresholds (0.5, 2.0) are based on statistical literature for “moderate” and “high” skewness classifications
References
PR #160: Adaptive sampling (v3.0.0)
Issue #70: Feature request
Related: ADR-0005: Estimation Methods (MSE for heavy tails)