ADR-0001: Multi-Backend Architecture
- Status: Accepted
- Date: 2026-01-04 (v2.0.0)
Context
spark-bestfit was originally designed exclusively for Apache Spark, using Pandas UDFs for parallel distribution fitting. However, this created several limitations:
- Development friction: Local testing required a full Spark installation
- ML workflow gaps: Ray is increasingly popular for ML pipelines, but users had to convert data between formats
- Small dataset overhead: Spark's overhead isn't justified for datasets that fit in memory
We needed a way to support multiple execution backends while maintaining a consistent API and avoiding code duplication.
Decision
We introduced the ExecutionBackend protocol using Python’s structural
subtyping (PEP 544). Any class implementing the required methods is compatible
without explicit inheritance.
Protocol definition (protocols.py):
@runtime_checkable
class ExecutionBackend(Protocol):
    def broadcast(self, data: Any) -> Any: ...
    def destroy_broadcast(self, handle: Any) -> None: ...
    def parallel_fit(...) -> List[Dict[str, Any]]: ...
    def get_parallelism(self) -> int: ...
    def collect_column(self, df: Any, column: str) -> np.ndarray: ...
    # ... additional methods
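To illustrate the structural-subtyping point, the following minimal sketch uses a trimmed-down stand-in for the protocol (`MiniBackend`, with only two of the methods listed above) and a hypothetical `InMemoryBackend`; neither name is part of spark-bestfit. The class does not inherit from the protocol, yet it passes an `isinstance` check because `@runtime_checkable` verifies that the required methods are present. The check covers method names only, not signatures or return types.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class MiniBackend(Protocol):
    """Trimmed-down stand-in for ExecutionBackend (illustration only)."""

    def broadcast(self, data: Any) -> Any: ...
    def get_parallelism(self) -> int: ...


class InMemoryBackend:
    """Hypothetical backend; note there is no inheritance from MiniBackend."""

    def broadcast(self, data: Any) -> Any:
        # Nothing to distribute in a single process, so return the data itself.
        return data

    def get_parallelism(self) -> int:
        return 1


class NotABackend:
    """Lacks get_parallelism, so it does not satisfy the protocol."""

    def broadcast(self, data: Any) -> Any:
        return data


# Structural check: only the presence of the protocol's methods matters.
is_backend = isinstance(InMemoryBackend(), MiniBackend)
is_not_backend = isinstance(NotABackend(), MiniBackend)
```

Because the runtime check is shallow, a static type checker is still the more reliable way to catch a backend whose method signatures drift from the protocol.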
Backend implementations:
- SparkBackend: Apache Spark via Pandas UDFs (original behavior)
- LocalBackend: concurrent.futures.ProcessPoolExecutor for development
- RayBackend: Ray distributed computing for ML workflows
Factory pattern (backends/factory.py):
class BackendFactory:
    @classmethod
    def for_dataframe(cls, df: Any) -> ExecutionBackend:
        # Auto-detect: Ray Dataset -> RayBackend
        #              pandas DataFrame -> LocalBackend
        #              else -> SparkBackend
        ...

    @classmethod
    def create(cls, backend_type: str, **kwargs) -> ExecutionBackend:
        # Explicit creation by name
        ...
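One way the two factory methods could be filled in is sketched below. This is not the library's implementation: the placeholder backend classes, the `SketchFactory` name, the registry keys ("local", "ray", "spark"), and the module-name detection rule are all assumptions made for illustration. The detection order follows the comments in the outline above.

```python
from typing import Any, Callable, Dict


class LocalBackend:
    """Placeholder standing in for the real LocalBackend."""


class RayBackend:
    """Placeholder standing in for the real RayBackend."""


class SparkBackend:
    """Placeholder standing in for the real SparkBackend."""


class SketchFactory:
    """Hypothetical factory; names and detection rules are assumptions."""

    _registry: Dict[str, Callable[..., Any]] = {
        "local": LocalBackend,
        "ray": RayBackend,
        "spark": SparkBackend,
    }

    @classmethod
    def for_dataframe(cls, df: Any) -> Any:
        # Look at the top-level package of the object's class so that neither
        # pandas nor ray has to be imported just to run the check.
        root = type(df).__module__.split(".")[0]
        if root == "ray":
            return cls.create("ray")
        if root == "pandas":
            return cls.create("local")
        return cls.create("spark")  # fallback for everything else

    @classmethod
    def create(cls, backend_type: str, **kwargs: Any) -> Any:
        # Explicit creation by name, with a clear error for unknown names.
        try:
            backend_cls = cls._registry[backend_type]
        except KeyError:
            raise ValueError(f"Unknown backend type: {backend_type!r}") from None
        return backend_cls(**kwargs)
```

Checking the class's module name rather than using `isinstance` keeps detection free of imports, which fits the optional-dependency goal, at the cost of being less precise than a real type check.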
Lazy imports: Optional dependencies (PySpark, Ray) are imported only when the corresponding backend is instantiated, allowing installation without all backends.
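The lazy-import idea can be sketched as follows. The class names and the error message are hypothetical, and the standard-library `json` module stands in for an optional dependency such as PySpark or Ray so that the example runs anywhere. The point is that the import happens in `__init__`, so merely defining or importing the backend class never requires the dependency.

```python
import importlib
from types import ModuleType


class LazyBackend:
    """Hypothetical base class showing the deferred-import pattern."""

    # Stand-in for an optional dependency such as pyspark or ray.
    required_module = "json"

    def __init__(self) -> None:
        # The import happens at instantiation, not at module import time.
        self._module = self._load(self.required_module)

    @staticmethod
    def _load(name: str) -> ModuleType:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            raise ImportError(
                f"This backend requires the optional dependency {name!r}."
            ) from exc


class MissingDependencyBackend(LazyBackend):
    # A module name assumed not to be installed, to show the failure path.
    required_module = "surely_not_installed_backend_dep"
```

Defining `MissingDependencyBackend` succeeds even though its dependency is absent; the `ImportError` only surfaces when someone tries to instantiate it.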
Consequences
Positive:
- Users can develop locally with LocalBackend without Spark
- Ray users get native integration without data format conversion
- Consistent API regardless of backend choice
- Duck typing enables future backends without modifying core code
- Optional dependencies reduce installation size
Negative:
- Protocol methods must be implemented consistently across backends
- Testing matrix grows with each backend (currently 3x)
- Some backend-specific optimizations may not be portable
Neutral:
- Fitters accept an optional backend parameter; if omitted, auto-detection is used based on DataFrame type