ADR-0004: Adaptive Sampling Strategy ===================================== :Status: Accepted :Date: 2026-01-11 (v3.0.0) Context ------- Distribution fitting on large datasets (100M+ rows) requires sampling to achieve reasonable performance. The original approach used uniform random sampling, which works well for symmetric distributions but can miss important characteristics of skewed data: 1. **Tail underrepresentation**: Heavy-tailed distributions (Pareto, lognormal) have rare but important extreme values that uniform sampling may miss 2. **Fitting failures**: Undersampled tails lead to poor parameter estimates, especially for shape parameters 3. **One-size-fits-all**: Symmetric data doesn't need stratified sampling's overhead We needed sampling that adapts to data characteristics. Decision -------- We implemented adaptive sampling with three modes in ``config.py``:: class SamplingMode(Enum): AUTO = "auto" # Select based on skewness UNIFORM = "uniform" # Force uniform random sampling STRATIFIED = "stratified" # Force stratified sampling **Skewness-based selection** (AUTO mode): - ``|skew| < 0.5`` (mild): Uniform sampling - efficient for symmetric data - ``0.5 <= |skew| < 2.0`` (moderate): Stratified with 5 percentile bins - ``|skew| >= 2.0`` (high): Stratified with 10 bins + tail oversampling **Configuration**:: config = (FitterConfigBuilder() .with_adaptive_sampling( enabled=True, mode=SamplingMode.AUTO, skew_threshold_mild=0.5, skew_threshold_high=2.0, ) .build()) **Implementation details:** 1. Skewness is computed on a small preliminary sample (10k rows) 2. Stratified sampling uses percentile-based bins to ensure representation 3. High-skew mode oversamples the 95th+ percentile tail 4. Thresholds are configurable for domain-specific tuning Consequences ------------ **Positive:** - Better parameter estimates for skewed distributions - Automatic detection removes user guesswork - Configurable thresholds allow domain tuning - Backwards compatible: ``UNIFORM`` mode preserves old behavior **Negative:** - Additional computation for skewness detection - Stratified sampling is slower than uniform - Two-stage sampling (detect then sample) adds latency **Neutral:** - Default thresholds (0.5, 2.0) based on statistical literature for "moderate" and "high" skewness classifications References ---------- - `PR #160 `_: Adaptive sampling (v3.0.0) - `Issue #70 `_: Feature request - Related: :doc:`ADR-0005-estimation-methods` (MSE for heavy tails)