A Comparative Study of Distributed Machine Learning Platforms: Design Methods and Evaluation
This article surveys design approaches for distributed machine learning platforms, classifies them into basic dataflow, parameter‑server, and advanced dataflow models, examines examples such as Spark, PMLS, TensorFlow and MXNet, and presents performance evaluations and future research directions.
Machine learning and deep learning have become core technologies in many domains, leading to a proliferation of distributed machine learning (ML) platforms. This paper surveys the design methods used in such platforms, categorizes them, and outlines future research directions.
The surveyed platforms are classified into three fundamental design approaches: (1) basic dataflow, (2) parameter‑server model, and (3) advanced dataflow.
Basic Dataflow – Spark : In Spark, computation is modeled as a directed acyclic graph (DAG) of resilient distributed datasets (RDDs): each vertex is an RDD and each edge is a transformation or action. The driver stores the model parameters, and workers communicate with the driver every iteration. Because RDDs are immutable, each parameter update materializes new RDDs, and the repeated shuffling between iterations limits scalability. Spark is first and foremost a general‑purpose data‑processing engine, not one optimized for iterative ML workloads.
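The driver/worker round trip described above can be sketched in plain Python (no Spark dependency; all names here are illustrative): the driver holds the parameters, broadcasts them to workers each iteration, gathers per-partition gradients, and produces a fresh immutable parameter snapshot, analogous to the new RDDs Spark must create.

```python
# Illustrative sketch of bulk-synchronous, driver-centric training.
# Each iteration: broadcast params -> per-partition gradients (map) -> sum (reduce).

def worker_gradient(params, partition):
    """Gradient of squared error for a 1-D linear model y = w*x on one partition."""
    w = params["w"]
    g = 0.0
    for x, y in partition:
        g += 2.0 * (w * x - y) * x
    return g

def driver_train(partitions, iterations=100, lr=0.01):
    params = {"w": 0.0}                     # driver-side model state
    n = sum(len(p) for p in partitions)
    for _ in range(iterations):
        grads = [worker_gradient(params, p) for p in partitions]
        # New immutable snapshot every iteration -- the source of Spark's overhead.
        params = {"w": params["w"] - lr * sum(grads) / n}
    return params

# Data drawn from y = 3x, split across two "workers"
data = [(x, 3.0 * x) for x in range(1, 9)]
model = driver_train([data[:4], data[4:]])
```

The per-iteration broadcast-and-collect pattern is cheap here, but on a cluster it becomes network traffic and object creation on every step, which is exactly where the iteration overhead discussed above comes from.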
Parameter‑Server Model – PMLS : PMLS introduces a dedicated parameter server (PS) that maintains model parameters in a distributed in‑memory key‑value store. The PS can be sharded and replicated, allowing easy scaling with the number of nodes. Workers fetch the latest parameters from their local PS replica and compute on assigned data partitions. PMLS also adopts the Stale‑Synchronous Parallel (SSP) consistency model, which relaxes strict bulk‑synchronous updates to reduce synchronization delays while preserving convergence guarantees.
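The SSP consistency rule can be illustrated with a toy clock gate in plain Python (this is a conceptual sketch, not the PMLS API): a worker may begin its next iteration only if it would end up at most `staleness` clocks ahead of the slowest worker.

```python
# Toy stale-synchronous-parallel (SSP) gate. Names are illustrative.

class SSPClock:
    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers   # per-worker iteration counters
        self.staleness = staleness        # max allowed lead over slowest worker

    def can_advance(self, worker):
        """True if `worker` may start the iteration after its current clock."""
        return (self.clocks[worker] + 1) - min(self.clocks) <= self.staleness

    def tick(self, worker):
        if not self.can_advance(worker):
            raise RuntimeError(f"worker {worker} blocked: too far ahead")
        self.clocks[worker] += 1

ssp = SSPClock(num_workers=3, staleness=2)
ssp.tick(0)                 # worker 0 runs ahead...
ssp.tick(0)                 # ...now 2 clocks ahead of workers 1 and 2
print(ssp.can_advance(0))   # False: a third tick would exceed the bound
```

With `staleness=0` this degenerates to bulk-synchronous parallel (all workers barrier every iteration); a positive bound lets fast workers proceed on slightly stale parameters, which is the synchronization relaxation PMLS exploits.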
Advanced Dataflow – TensorFlow (and MXNet) : TensorFlow uses a dataflow graph in which nodes represent operations (some holding mutable state, such as variables) and edges carry tensors. Users declare a static symbolic graph that the runtime can partition and rewrite for distributed execution, optionally combined with a parameter‑server abstraction for data parallelism. MXNet and DyNet extend this idea with dynamic graph construction, where the graph is defined as the program executes, which simplifies programming for models with data‑dependent structure.
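The declare-then-execute style can be mimicked in a few lines of plain Python (a conceptual sketch, not the TensorFlow API): the graph is built symbolically first, and a separate evaluation step walks it, with edges carrying values between operation nodes.

```python
# Minimal static dataflow graph: build first, execute later.

class Node:
    def __init__(self, op, inputs=()):
        self.op, self.inputs = op, list(inputs)

def const(v):  return Node(lambda: v)
def add(a, b): return Node(lambda x, y: x + y, [a, b])
def mul(a, b): return Node(lambda x, y: x * y, [a, b])

def run(node, cache=None):
    """Evaluate the graph rooted at `node`, memoizing shared subgraphs."""
    cache = {} if cache is None else cache
    if node not in cache:
        args = [run(i, cache) for i in node.inputs]
        cache[node] = node.op(*args)
    return cache[node]

# Graph for (2 + 3) * 4: declared symbolically, nothing computed yet...
graph = mul(add(const(2), const(3)), const(4))
print(run(graph))  # ...executed here: prints 20
```

Because the whole graph exists before execution, a runtime can partition it across devices and rewrite it (inserting send/receive nodes, fusing operations) -- the property that makes the static-graph approach attractive for distributed execution. A dynamic-graph system like DyNet instead builds such nodes on the fly as the host program runs.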
Evaluation : Experiments were conducted on Amazon EC2 m4.xlarge instances (4 vCPUs, 16 GB RAM, 750 Mbps dedicated EBS bandwidth). Two representative ML tasks were used: a two‑class logistic regression and an image‑classification neural network. The results show that Spark lags behind PMLS and MXNet on both tasks, especially for deeper networks, where per‑iteration overhead grows. CPU‑utilization graphs indicate higher overhead for Spark, largely due to serialization costs.
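For reference, the logistic-regression workload has the repeated-pass structure shown below (a single-machine sketch with illustrative toy data and hyperparameters); it is this iterative gradient loop whose per-iteration cost the platforms above handle very differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(data, epochs=200, lr=0.5):
    """Two-class logistic regression on 1-D inputs via gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):          # the iterative loop the platforms distribute
        for x, y in data:            # y in {0, 1}
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x    # gradient of the log loss w.r.t. w
            b -= lr * (p - y)        # gradient of the log loss w.r.t. b
    return w, b

# Separable toy data: negatives below 0, positives above
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train_logreg(data)
print(sigmoid(w * 2.0 + b) > 0.5)   # True: positive example classified correctly
```

In a distributed setting, each epoch becomes a round of parameter exchange, so a platform's fixed per-iteration cost (RDD creation and shuffling in Spark, parameter-server fetches in PMLS) is paid hundreds of times per training run.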
Conclusions and Future Directions : Distributed ML platforms still face bottlenecks in network bandwidth and CPU overhead. Improving data‑flow abstractions, providing first‑class support for models and parameters, and developing better monitoring and performance‑prediction tools (e.g., Ernest, CherryPick) are essential. Open challenges include elastic resource scheduling, runtime performance optimization, and defining suitable programming abstractions and testing methodologies for ML workloads.
High Availability Architecture