Artificial Intelligence · 13 min read

GraphLearn: An Industrial‑Scale Distributed Graph Learning Platform and Its System Optimizations

This article introduces GraphLearn, a large‑scale distributed graph learning platform designed for industrial GNN workloads, details its architecture, sampling implementation, training pipeline, system optimizations such as GPU‑accelerated sampling, and showcases real‑world applications in recommendation and risk control.

DataFunSummit

GraphLearn is a distributed graph learning platform built for industrial‑scale Graph Neural Network (GNN) workloads, supporting graphs with billions of edges and heterogeneous attributes. It combines a robust graph engine for storage, sampling, and fault‑tolerant scaling with deep‑learning engines (TensorFlow/PyTorch) to provide a unified programming model.

Platform Overview : The system stores large graphs in a distributed in‑memory engine, offers Python and Gremlin‑like GSL interfaces for subgraph extraction, and integrates feature handling modules that convert raw data into dense feature tables.

Topology and Feature Storage : Graph topology is partitioned across servers; each partition stores its adjacency lists locally, so one‑hop neighbor retrieval needs no network hop, and hotspot buffers cache frequently accessed multi‑hop neighborhoods. Features are indexed by source/destination node ID and stored as contiguous tables, while edge attributes (weights, labels) live directly in the adjacency lists. With adjacency lists pre‑sorted by weight, edge‑weight‑based top‑K sampling is served in O(1) time per request.
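The storage layout above can be sketched as follows; the modulo partitioner and per‑server adjacency dictionaries are illustrative simplifications, not GraphLearn's actual data structures:

```python
# Minimal sketch of hash-partitioned topology storage. A simple modulo
# partitioner decides which server owns each node's adjacency list, so a
# one-hop neighbor query is a purely local lookup on the owning server.

NUM_SERVERS = 4

def partition_of(node_id: int) -> int:
    # Map a node ID to the server that stores its outgoing adjacency list.
    return node_id % NUM_SERVERS

class GraphPartition:
    """One server's shard: adjacency lists for the nodes it owns."""
    def __init__(self):
        self.adj = {}  # node_id -> list of (dst, weight)

    def add_edge(self, src, dst, weight=1.0):
        self.adj.setdefault(src, []).append((dst, weight))

    def neighbors(self, src):
        # One-hop retrieval is a local dictionary lookup: no network hop.
        return self.adj.get(src, [])

servers = [GraphPartition() for _ in range(NUM_SERVERS)]

def add_edge(src, dst, weight=1.0):
    servers[partition_of(src)].add_edge(src, dst, weight)

add_edge(5, 9, 0.7)
add_edge(5, 2, 0.3)
print(servers[partition_of(5)].neighbors(5))  # [(9, 0.7), (2, 0.3)]
```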

Sampling Implementation : Sampling follows a distributed Partition → Alias (local) → Stitch pipeline: the Partition step routes each input node to the server holding its neighbors, the Alias step performs O(1) weighted sampling locally, and the Stitch step concatenates the per‑server results back into batch order. Each step executes as an operator (Op) that can be customized for tasks such as degree queries.

g.E("u2i").batch(64).alias('edge').outV().alias('src');
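The local Alias step can be sketched with the classic alias method, which answers each weighted draw in O(1) after linear‑time preprocessing; the class and method names here are illustrative, not GraphLearn's internals:

```python
import random

class AliasSampler:
    """O(1) weighted sampling over a neighbor list via the alias method.

    A sketch of the local 'Alias' step in the Partition -> Alias -> Stitch
    pipeline: build once per adjacency list, then each draw costs one
    uniform index plus one coin flip.
    """
    def __init__(self, weights):
        n = len(weights)
        total = sum(weights)
        prob = [w * n / total for w in weights]  # scaled so the mean is 1
        self.prob = [0.0] * n
        self.alias = [0] * n
        small = [i for i, p in enumerate(prob) if p < 1.0]
        large = [i for i, p in enumerate(prob) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s] = prob[s]      # keep s with probability prob[s]...
            self.alias[s] = l           # ...otherwise redirect to l
            prob[l] -= 1.0 - prob[s]    # l donates the leftover mass
            (small if prob[l] < 1.0 else large).append(l)
        for i in small + large:         # leftovers are exactly full cells
            self.prob[i] = 1.0

    def sample(self):
        i = random.randrange(len(self.prob))  # O(1): one uniform draw
        return i if random.random() < self.prob[i] else self.alias[i]

# Usage: sample neighbor indices proportionally to edge weights.
random.seed(0)
sampler = AliasSampler([8.0, 1.0, 1.0])
draws = [sampler.sample() for _ in range(5)]
```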

GNN Model Paradigm : GraphLearn adopts Mini‑Batch training for scalability. The pipeline includes subgraph sampling (EgoGraph or SubGraph), feature preprocessing (handling int, float, string attributes), and message passing (GraphSAGE‑style aggregation for EgoGraph or generic subgraph propagation).
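One GraphSAGE‑style mean‑aggregation step over an EgoGraph mini‑batch can be sketched in NumPy; the shapes and names are illustrative, not the platform's API:

```python
# Sketch of one GraphSAGE-style mean-aggregation layer on an EgoGraph batch:
# each ego node averages its K sampled neighbors' features, combines that
# with its own features through two weight matrices, and applies a ReLU.
import numpy as np

def sage_mean_layer(self_feats, nbr_feats, w_self, w_nbr):
    """self_feats: [B, D]; nbr_feats: [B, K, D] (K sampled neighbors/node)."""
    agg = nbr_feats.mean(axis=1)           # [B, D] neighbor mean
    h = self_feats @ w_self + agg @ w_nbr  # combine self and neighborhood
    return np.maximum(h, 0.0)              # ReLU

B, K, D, H = 2, 3, 4, 8
rng = np.random.default_rng(0)
out = sage_mean_layer(rng.normal(size=(B, D)),
                      rng.normal(size=(B, K, D)),
                      rng.normal(size=(D, H)),
                      rng.normal(size=(D, H)))
print(out.shape)  # (2, 8)
```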

Architecture Layers :
Graph Engine + NN Engine: distributed storage, RPC, and compatibility with TensorFlow/PyTorch.
Data Layer: FeatureColumn/FeatureHandler for preprocessing, Dataset for organizing sampled subgraphs.
Network Layer: built‑in graph convolution operators supporting heterogeneous graphs.
Model Layer: implementations of common GNNs (GCN, GraphSAGE, GAT, RGCN, UltraGCN) and extensibility for custom algorithms.
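A minimal sketch of FeatureColumn‑style preprocessing in the data layer, assuming strings are hashed into a fixed number of embedding buckets; the function names here are hypothetical, not GraphLearn's API:

```python
# Sketch of converting mixed raw attributes (int/float/string) into a dense
# row: numeric values pass through, strings are hashed into a fixed number
# of buckets whose IDs would feed an embedding lookup downstream.
import hashlib

NUM_BUCKETS = 1000

def hash_bucket(s: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Stable hash (unlike Python's per-process salted hash()) so bucket
    # assignments are reproducible across training and serving.
    digest = hashlib.md5(s.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def to_dense(attrs):
    """attrs: raw attribute values (int/float/str) for one node."""
    row = []
    for v in attrs:
        if isinstance(v, str):
            row.append(float(hash_bucket(v)))  # bucket ID -> embedding index
        else:
            row.append(float(v))
    return row

print(to_dense([3, 0.5, "electronics"]))
```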

System Optimizations :
Sampling Optimization: DAG‑based parallel query execution, lock‑free actor scheduling, and a Gremlin‑like language for expressive sampling.
Sparse‑Scenario GNN Optimizations: AdamAsync optimizer, string feature hashing, embedding fusion, top‑K edge‑weight sampling, and advanced negative‑sampling strategies.
gl_torch (GPU Acceleration for PyTorch): CSR‑based graph topology in GPU or pinned memory, CUDA sampling kernels (≈80× faster than CPU), UnifiedTensor for zero‑copy CPU‑GPU feature access, and flexible runtime modes that let topology, features, and sampling reside on CPU, GPU, or pinned memory.
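The top‑K edge‑weight sampling above can be sketched by sorting each adjacency list by weight once at build time, so serving a request is a constant‑time slice; the data layout is illustrative:

```python
# Sketch of top-K edge-weight sampling: pay an O(d log d) sort per adjacency
# list at build time, then each top-K request is a constant-time slice of
# the pre-sorted list.

def build_sorted_adj(adj):
    """adj: {src: [(dst, weight), ...]} -> weight-descending lists."""
    return {src: sorted(nbrs, key=lambda e: e[1], reverse=True)
            for src, nbrs in adj.items()}

def topk_neighbors(sorted_adj, src, k):
    # Serving path: slice the first k entries of the pre-sorted list.
    return [dst for dst, _ in sorted_adj.get(src, [])[:k]]

adj = {7: [(1, 0.2), (2, 0.9), (3, 0.5)]}
sadj = build_sorted_adj(adj)
print(topk_neighbors(sadj, 7, 2))  # [2, 3]
```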

Application Cases :
Recommendation Recall: converting user‑item interactions into edge‑prediction tasks, using GraphSAGE, UltraGCN, and sequence‑based SURGE models.
Security Risk Control: RGCN on heterogeneous user‑item‑comment graphs for spam‑registration detection, and GCN on homogeneous comment‑similarity graphs for spam‑review detection.

Online Inference : A sampling‑service architecture (Data Loader, Sample Builder, Collector, Store, Publisher) delivers subgraphs at low latency (2‑hop P99 ≈ 20 ms) and sustains 20k QPS per node with near‑linear scalability. The inference pipeline refreshes dynamic user data, performs sampling, and feeds the results to a GNN model service.
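The service stages can be sketched as a toy in‑process pipeline; the component names come from the article, but the internals are illustrative:

```python
# Toy sketch of the sampling-service flow (Data Loader -> Sample Builder ->
# Collector -> Store -> Publisher): build a small subgraph per user and
# publish it to a key-value store that the model service reads from.

class Store:
    """Key-value store of precomputed subgraphs, keyed by user ID."""
    def __init__(self):
        self._samples = {}

    def put(self, user_id, subgraph):
        self._samples[user_id] = subgraph

    def get(self, user_id):
        return self._samples.get(user_id)

def sample_builder(user_id, adj, fanout=2):
    # Build a 1-hop subgraph for the user (the real service does multi-hop
    # sampling with per-hop fanouts).
    return {user_id: adj.get(user_id, [])[:fanout]}

def publish(store, user_id, adj):
    # Publisher step: write the freshly built sample into the store so the
    # GNN model service can fetch it at inference time.
    store.put(user_id, sample_builder(user_id, adj))

store = Store()
publish(store, 42, {42: [7, 8, 9]})
print(store.get(42))  # {42: [7, 8]}
```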

The platform’s source code and documentation are publicly available, enabling practitioners to adopt large‑scale GNN training and inference in production environments.

Tags: GPU acceleration, recommendation systems, Graph Neural Networks, risk control, large-scale graphs, distributed computing, sampling optimization
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
