Integrating Deep Learning with Apache Hadoop: Caffe-on-Spark on GPU‑Enhanced Clusters
This article describes how Yahoo integrated deep learning into its massive Hadoop ecosystem by adding GPU nodes, using YARN and Spark to run Caffe at scale, and presents performance results on AlexNet and GoogLeNet alongside open‑source contributions.
Introduction
Over the past decade Yahoo has invested heavily in Apache Hadoop, operating more than 40,000 servers and 600 PB of data across 19 clusters, and using the platform for large‑scale machine learning such as classification, ranking, and word‑vector computation.
Deep learning is a core technology for many Yahoo products; the Flickr team demonstrated its use for scene detection, object recognition, and aesthetic scoring, enabling automatic image tagging for end users.
To make deep learning available to more Yahoo services, the technology was migrated onto the Hadoop clusters, offering several advantages:
The deep‑learning workflow can run directly on the Hadoop cluster where the data resides, eliminating costly data movement.
It can be expressed as a first‑class Apache Oozie workflow, using Hadoop for data preprocessing and Spark pipelines for model training.
YARN supports deep learning, allowing multiple experiments to run concurrently on the same cluster, a significant improvement over manual notebook‑based GPU scheduling.
Enhancing the Hadoop Cluster
GPU nodes were added to the cluster, each equipped with four Nvidia Tesla K80 cards (two GK210 GPUs per card), delivering roughly ten times the compute power of traditional CPU nodes.
Each GPU node provides two network interfaces: Ethernet for external traffic and InfiniBand for high‑speed GPU‑to‑GPU communication across nodes, supporting RDMA access to GPU memory.
Using YARN’s node‑label feature (YARN‑796), jobs can declare whether they require CPU or GPU containers, enabling efficient scheduling of GPU‑intensive workloads.
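As an illustration, a job can steer its executors onto GPU nodes through Spark's node-label setting for YARN (`spark.yarn.executor.nodeLabelExpression`). A minimal sketch, assuming the GPU nodes carry a label named "gpu" (the label name and executor count are illustrative, not from the article):

```python
# Sketch: building the --conf settings that place Spark executors on
# GPU-labeled YARN nodes. The label "gpu" is an assumed cluster label.

def executor_confs(need_gpu: bool) -> dict:
    """Return Spark-on-YARN conf entries for a CPU- or GPU-bound job."""
    confs = {"spark.executor.instances": "4"}
    if need_gpu:
        # Spark forwards this expression to the YARN ResourceManager,
        # which then allocates containers only on matching labeled nodes.
        confs["spark.yarn.executor.nodeLabelExpression"] = "gpu"
    return confs

cpu_job = executor_confs(need_gpu=False)  # schedulable on any node
gpu_job = executor_confs(need_gpu=True)   # restricted to GPU nodes
```

Jobs that omit the label expression remain schedulable on the ordinary CPU nodes, so GPU capacity is reserved for the workloads that declare they need it.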
Distributed Deep Learning: Caffe‑on‑Spark
Yahoo built a distributed computing stack on top of Apache Spark and Caffe to run deep‑learning jobs on the enhanced Hadoop cluster. Users submit jobs with a command similar to the following:
spark-submit --master yarn --deploy-mode cluster \
--files solver.prototxt,net.prototxt \
--num-executors <# of EXECUTORS> \
--archives caffe_on_grid.tgz \
--conf spark.executorEnv.LD_LIBRARY_PATH="./caffe_on_grid.tgz/lib64" \
--class com.yahoo.ml.CaffeOnSpark caffe-on-spark-1.0-jar-with-dependencies.jar \
-devices <# of GPUs PER EXECUTOR> \
-conf solver.prototxt \
-input hdfs://<HDFS path to training data> \
-model hdfs://<HDFS path to model file>
The command lets users specify the number of Spark executors, the number of GPUs per executor, and HDFS locations for the training data and the resulting model file, while standard Caffe prototxt files define the network architecture and solver settings.
During execution, each executor processes a data partition from HDFS, spawns multiple Caffe training threads (one per GPU), and exchanges gradient updates across GPUs using MPI Allreduce over RDMA.
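The gradient exchange can be pictured with a small in-process sketch. The real system performs this step with MPI Allreduce over RDMA across GPUs; the toy `allreduce` below just sums plain Python lists, but it shows the invariant the step guarantees, namely that every worker ends up holding the same combined gradient:

```python
# Simulated Allreduce: each "GPU" computes a local gradient on its data
# partition; the exchange sums them and hands every worker the total.

def allreduce(local_grads):
    """Sum per-worker gradient vectors and return one copy per worker."""
    n = len(local_grads[0])
    total = [sum(g[i] for g in local_grads) for i in range(n)]
    # Every worker receives an identical copy of the summed gradient.
    return [total[:] for _ in local_grads]

# Four workers, each with a 3-element gradient from its HDFS partition.
local = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6],
         [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
synced = allreduce(local)
```

After the exchange each worker applies the same update, so the model replicas stay in lockstep between iterations.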
Performance Evaluation
Benchmarks training AlexNet on the ImageNet 2012 dataset show that training time decreases as more GPUs are used. With four GPUs, training reaches 50 % accuracy in only 35 % of the time required by a single GPU; using eight GPUs yields diminishing returns because the per‑GPU batch size becomes too small.
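The four-GPU figure translates directly into a wall-clock speed-up; a quick check of the arithmetic:

```python
# Reaching 50% accuracy in 35% of the single-GPU time means the
# 4-GPU run is 1 / 0.35 times faster in wall-clock terms.

time_fraction = 0.35                # multi-GPU time / single-GPU time
speedup = 1.0 / time_fraction
print(f"{speedup:.2f}x")            # prints 2.86x
```

So four GPUs buy roughly a 2.9× speed-up to that accuracy target, well short of the ideal 4×, which is consistent with the diminishing returns observed at eight GPUs.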
Further tests on the deeper GoogLeNet model across four servers (4 × 8 GPUs) achieve top‑5 accuracy above 80 % in under 10 hours, compared with 40 hours for a single GPU to reach 60 % top‑5 accuracy. Scaling to eight GPUs provides a 680 % speed‑up for reaching 60 % top‑5 accuracy.
Open‑Source Resources
Yahoo has contributed several patches to the Caffe codebase on GitHub (github.com/BVLC/caffe), including multi‑GPU support on a single machine, RDMA data transfer, improved data pipelines, timing information, optional I/O dependencies, and refactored solver code.
Future articles will detail the design and implementation of Caffe‑on‑Spark, and the community is invited to provide feedback to [email protected].
Conclusion
This article outlines an approach to integrate the Apache Hadoop ecosystem with deep learning on a heterogeneous GPU + CPU cluster, presents early performance results that are encouraging, and signals ongoing investment in Hadoop, Spark, and Caffe to make large‑scale deep learning more effective.