ADR-0001: Multi-Backend Architecture

Status:

Accepted

Date:

2026-01-04 (v2.0.0)

Context

spark-bestfit was originally designed exclusively for Apache Spark, using Pandas UDFs for parallel distribution fitting. However, this created several limitations:

  1. Development friction: Local testing required a full Spark installation

  2. ML workflow gaps: Ray is increasingly popular for ML pipelines, but users had to convert data between formats

  3. Small dataset overhead: Spark’s overhead isn’t justified for datasets that fit in memory

We needed a way to support multiple execution backends while maintaining a consistent API and avoiding code duplication.

Decision

We introduced the ExecutionBackend protocol using Python’s structural subtyping (PEP 544). Any class implementing the required methods is compatible without explicit inheritance.

Protocol definition (protocols.py):

@runtime_checkable
class ExecutionBackend(Protocol):
    def broadcast(self, data: Any) -> Any: ...
    def destroy_broadcast(self, handle: Any) -> None: ...
    def parallel_fit(...) -> List[Dict[str, Any]]: ...
    def get_parallelism(self) -> int: ...
    def collect_column(self, df: Any, column: str) -> np.ndarray: ...
    # ... additional methods

Backend implementations:

  • SparkBackend: Apache Spark via Pandas UDFs (original behavior)

  • LocalBackend: concurrent.futures.ProcessPoolExecutor for development

  • RayBackend: Ray distributed computing for ML workflows

Factory pattern (backends/factory.py):

class BackendFactory:
    @classmethod
    def for_dataframe(cls, df: Any) -> ExecutionBackend:
        # Auto-detect: Ray Dataset -> RayBackend
        #              pandas DataFrame -> LocalBackend
        #              else -> SparkBackend

    @classmethod
    def create(cls, backend_type: str, **kwargs) -> ExecutionBackend:
        # Explicit creation by name

Lazy imports: Optional dependencies (PySpark, Ray) are imported only when the corresponding backend is instantiated, allowing installation without all backends.

Consequences

Positive:

  • Users can develop locally with LocalBackend without Spark

  • Ray users get native integration without data format conversion

  • Consistent API regardless of backend choice

  • Duck typing enables future backends without modifying core code

  • Optional dependencies reduce installation size

Negative:

  • Protocol methods must be implemented consistently across backends

  • Testing matrix grows with each backend (currently 3x)

  • Some backend-specific optimizations may not be portable

Neutral:

  • Fitters accept an optional backend parameter; if omitted, auto-detection is used based on DataFrame type

References

  • PR #86: V2.0.0 Multi-backend architecture

  • PR #93: BackendFactory addition (v2.2.0)

  • PEP 544: Structural Subtyping (Protocols)