
Horovod Distributed Training Plugin: Design, Usage, and Deadlock Prevention

This article reviews Horovod, a popular third‑party distributed deep‑learning training plugin, explaining its simple three‑line integration, the challenges of deadlocks in all‑reduce operations, and the architectural components—including background threads, coordinators, and MPI/Gloo controllers—that enable scalable and efficient data‑parallel training.

DataFunSummit

Horovod is a widely used third‑party distributed training plugin for deep learning that focuses on simple data‑parallelism, offering easy integration and strong performance.

1. Quick Usage Example

Using TensorFlow gradients as an example, a Horovod program needs only three small insertions:

1. Call hvd.init() to initialize Horovod in each worker process.

2. Wrap the existing optimizer with hvd.DistributedOptimizer, which all-reduces the gradients before they are applied.

3. Add hvd.BroadcastGlobalVariablesHook(0) so that rank 0 broadcasts the initial variable values to all other workers.

These minimal additions do not break user code and make Horovod easy to adopt.
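The three insertions look roughly as follows in a TF1-style script (a minimal sketch following Horovod's documented TensorFlow API; the model, loss, and session setup are omitted, and the learning-rate scaling is a common convention rather than a requirement):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# 1. Initialize Horovod (typically one process per GPU).
hvd.init()

# 2. Wrap the existing optimizer so gradients are all-reduced across workers.
opt = tf.train.AdamOptimizer(0.001 * hvd.size())  # scale LR by worker count
opt = hvd.DistributedOptimizer(opt)

# 3. Broadcast rank 0's initial variables to all workers before training.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# The training loop then uses `opt` and passes `hooks` to the session.
```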

2. Avoiding Deadlocks

All‑reduce calls can be blocking, and TensorFlow’s dynamic scheduling may cause some workers to wait indefinitely if gradients are reduced before all workers have produced them. Horovod solves this by using a background thread loop that continuously polls and a coordinator (rank 0) that counts gradient availability across workers. Only when every worker has generated a particular gradient does the coordinator trigger the collective All‑Reduce, preventing deadlocks.
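The coordinator's counting logic can be sketched in a few lines of plain Python (a toy illustration; the class and method names are invented here, not Horovod's):

```python
from collections import defaultdict

class Coordinator:
    """Toy sketch of rank-0 coordination: track which ranks have
    produced each gradient and only release the all-reduce when
    every rank is ready, so no worker blocks on a missing peer."""

    def __init__(self, world_size):
        self.world_size = world_size
        self.ready = defaultdict(set)  # tensor name -> ranks that reported it

    def report_ready(self, tensor_name, rank):
        """A worker reports a produced gradient; returns True once all
        ranks have reported, i.e. the collective may be triggered."""
        self.ready[tensor_name].add(rank)
        return len(self.ready[tensor_name]) == self.world_size
```

Because the collective fires only after the last rank reports, a worker whose scheduler produced the gradient early simply keeps computing instead of blocking inside the all-reduce.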

3. Startup Process

Horovod's worker processes are launched externally (for example by horovodrun or mpirun); within each process, hvd.init() initializes Horovod and starts BackgroundThreadLoop() on a background thread, which repeatedly invokes RunLoopOnce() until the program ends.
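The loop structure can be mimicked with standard-library threading (a simplified stand-in; the real RunLoopOnce also performs coordination and fusion, and the 1 ms sleep here is only a placeholder for the tunable cycle time):

```python
import queue
import threading
import time

def run_loop_once(requests, processed):
    """One polling pass (stand-in for RunLoopOnce): drain whatever
    requests the framework ops have enqueued so far."""
    while True:
        try:
            processed.append(requests.get_nowait())
        except queue.Empty:
            return

def background_thread_loop(requests, processed, shutdown):
    """Stand-in for BackgroundThreadLoop: keep invoking the polling
    pass until shutdown is signalled at program exit."""
    while not shutdown.is_set():
        run_loop_once(requests, processed)
        time.sleep(0.001)  # polling interval between passes
```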

4. Communication and Controllers

Horovod separates negotiation (the controller) from the actual collective communication. Two controllers are provided: MPI and Gloo. MPI is a heavyweight dependency, so Gloo support was added later as a lighter-weight alternative that still handles the same coordination tasks.
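In practice the controller is chosen at launch time. The commands below use Horovod's horovodrun launcher (train.py is a placeholder script name):

```shell
# Launch 4 workers; horovodrun picks MPI when available.
horovodrun -np 4 python train.py

# Force the lighter Gloo controller, avoiding the MPI dependency.
horovodrun --gloo -np 4 python train.py
```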

5. Overall Code Architecture

The main components include:

GlobalState : Holds global runtime state used by the background thread.

OpManager : Executes collective operations (e.g., All‑Reduce) using the selected communication library.

Request/Response : Structures for messages exchanged between workers and the coordinator.

Queue : Bridges producers (gradient‑producing ops) and the consumer (background thread).

All Horovod ops are asynchronous; their ComputeAsync method enqueues gradients, and the actual communication occurs later when the background thread processes the queue.
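The producer/consumer handoff through the queue can be illustrated in miniature (a toy analogue; the function names are invented, and simple averaging stands in for a real all-reduce):

```python
import queue

pending = queue.Queue()  # bridge between framework ops and the background thread

def compute_async(tensor, callback):
    """Toy analogue of an op's ComputeAsync: enqueue the gradient and
    return immediately; no communication happens on the caller's thread."""
    pending.put((tensor, callback))

def background_pass(world_size):
    """One background-thread pass: consume queued requests and run the
    collective (here, dividing by world_size stands in for averaging)."""
    while not pending.empty():
        tensor, callback = pending.get()
        callback([x / world_size for x in tensor])
```

The key property is visible in usage: the callback fires only when the consumer runs, not when the op is issued.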

6. Additional Performance Considerations

TensorFusion : Merges small tensors into larger buffers to improve bandwidth utilization, while avoiding excessive fusion, which would delay communication and reduce the overlap of computation and communication.

Background Thread Polling Interval : Too frequent polling adds coordinator overhead; too infrequent delays communication.

AutoTuning : Adjusts parameters like fusion size and polling interval at runtime using Bayesian optimization.

Coordinator Cache : Mitigates central bottlenecks when scaling to thousands of workers.
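The TensorFusion idea can be sketched as a greedy packing pass (a simplification: real Horovod packs tensors into a fixed byte buffer whose size is controlled by HOROVOD_FUSION_THRESHOLD, whereas this toy version counts elements):

```python
def fuse(tensors, buffer_limit):
    """Greedy fusion sketch: pack consecutive tensors into batches whose
    total element count stays within buffer_limit, so many small
    all-reduces become a few larger, bandwidth-friendly ones."""
    batches, current, size = [], [], 0
    for t in tensors:
        if current and size + len(t) > buffer_limit:
            batches.append(current)   # flush: the next tensor would overflow
            current, size = [], 0
        current.append(t)
        size += len(t)
    if current:
        batches.append(current)
    return batches
```

Capping the batch size is what preserves overlap: an unbounded buffer would wait for every gradient before communicating anything.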

The article includes several diagrams illustrating Horovod’s scalability, the hvd.init() call stack, the communication‑controller separation, and the full process flow.

For more details, refer to the original Horovod source analysis on Zhihu.

Tags: Performance Optimization, deep learning, distributed training, Horovod, MPI, Data Parallel, Gloo
Written by DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.