Kuaishou Graph Database Storage‑Compute Separation Architecture and Its Application in Real‑Time Recommendation
This article shares, from an engineering perspective, Kuaishou's graph database storage‑compute separation architecture and how it serves real‑time recommendation recall. It covers the core requirements of cost, performance, and usability; the layered service design; the compact memory model; edge structures; snapshot isolation; and key performance optimizations such as Share‑Nothing execution and columnar data flow.
Three typical application scenarios are described: (1) triangle diffusion, where two‑hop graph traversal is used to retrieve related users or videos for recommendation; (2) triangle common, which identifies common interests or connections between two nodes to enhance point‑to‑point recommendations; and (3) existence problems, which detect valuable comments or relationships by checking for specific edge types such as Follow, Friend, or Like.
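The two‑hop traversal behind triangle diffusion can be sketched as follows; the in‑memory adjacency map and function names are illustrative, not Kuaishou's actual API:

```python
from collections import defaultdict

# Hypothetical in-memory graph; in production the hops would be
# served by the graph database, not a local dict.
graph = defaultdict(set)

def add_edge(src, dst):
    graph[src].add(dst)

def two_hop_candidates(user, limit=100):
    """Triangle diffusion: collect second-hop neighbors (e.g. users who
    interacted with the same items), excluding the start node itself
    and anything already in the first hop."""
    first_hop = graph[user]
    candidates = set()
    for mid in first_hop:
        for item in graph[mid]:
            if item != user and item not in first_hop:
                candidates.add(item)
                if len(candidates) >= limit:
                    return candidates
    return candidates
```

In practice each hop is a batched lookup against the Graph Service, and the `limit` cap bounds fan-out on high-degree nodes.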
The system's core demands are cost, performance, and ease of use. Cost matters because the graph holds billions of edges and trillions of records overall, so resources must scale economically; performance must deliver sub‑millisecond latency for large‑scale queries; and usability requires fast rule deployment and flexible filtering.
The storage‑compute separation architecture consists of three layers: Graph Service (the query execution layer), Tree Service (the graph model layer with caching and memory options), and Storage layer (a hybrid of SSD and S3 cold storage that separates hot and cold data). This separation allows independent scaling of CPU, memory, and disk resources.
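A toy sketch of the three‑layer split, with hypothetical class names, shows how a read falls through from the query layer to cache, then to SSD and cold storage:

```python
# Illustrative layering only; the real services are separate,
# independently scalable processes, not in-process classes.
class StorageLayer:
    """Hybrid storage: hot data on SSD, cold data in S3-like storage."""
    def __init__(self):
        self.ssd, self.cold = {}, {}

    def get(self, key):
        return self.ssd.get(key, self.cold.get(key))

class TreeService:
    """Graph model layer with an in-memory cache in front of storage."""
    def __init__(self, storage):
        self.cache, self.storage = {}, storage

    def get(self, key):
        if key not in self.cache:            # hot path: memory
            value = self.storage.get(key)    # cold path: SSD, then S3
            if value is not None:
                self.cache[key] = value
        return self.cache.get(key)

class GraphService:
    """Query execution layer; delegates data access to TreeService."""
    def __init__(self, tree):
        self.tree = tree

    def neighbors(self, node):
        return self.tree.get(node) or []
```

Because each layer owns a distinct resource (CPU, memory, disk), each can be scaled out on its own.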
BWTree Service implements the memory layer with strong consistency. It uses a write‑ahead log (WAL) to commit changes, applies logs to in‑memory pages, handles conflicts via log replay, and employs leader election for multi‑replica environments.
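A minimal sketch of the commit‑then‑apply flow, assuming a log‑sequence‑number guard makes replay idempotent (the real service persists the WAL and runs leader election across replicas):

```python
class WalStore:
    """Write-ahead-logged in-memory store: commit to the log first,
    then apply to pages; recovery replays the log in order."""
    def __init__(self):
        self.wal = []          # durable log in the real system
        self.pages = {}        # in-memory pages: key -> value
        self.applied_lsn = 0   # last log sequence number applied

    def write(self, key, value):
        lsn = len(self.wal) + 1
        self.wal.append((lsn, key, value))  # commit point
        self._apply(lsn, key, value)

    def _apply(self, lsn, key, value):
        if lsn > self.applied_lsn:          # idempotent replay guard
            self.pages[key] = value
            self.applied_lsn = lsn

    def recover(self):
        """Rebuild pages after a crash; conflicts resolve by log order."""
        self.pages, self.applied_lsn = {}, 0
        for lsn, key, value in self.wal:
            self._apply(lsn, key, value)
```

Because conflicting writes are ordered by the log, replicas replaying the same WAL converge to the same page state.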
The compact memory model mimics an operating‑system page table using mmap‑allocated three‑level tables and a CLOCK eviction algorithm to reclaim pages with zero access count, minimizing memory waste.
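The CLOCK reclamation step can be illustrated with a small page pool; this is a generic CLOCK sketch, not the mmap‑backed three‑level table itself:

```python
class ClockPool:
    """CLOCK eviction over a fixed pool: each page carries a reference
    bit; the sweeping hand clears bits and reclaims the first page
    whose bit is already zero."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = []        # list of [page_id, ref_bit]
        self.hand = 0

    def access(self, page_id):
        for entry in self.pages:
            if entry[0] == page_id:
                entry[1] = 1   # mark recently used
                return page_id
        self._admit(page_id)
        return page_id

    def _admit(self, page_id):
        if len(self.pages) < self.capacity:
            self.pages.append([page_id, 1])
            return
        while True:            # sweep until a zero-bit victim is found
            entry = self.pages[self.hand]
            if entry[1] == 0:
                self.pages[self.hand] = [page_id, 1]
                self.hand = (self.hand + 1) % self.capacity
                return
            entry[1] = 0
            self.hand = (self.hand + 1) % self.capacity
```

CLOCK approximates LRU without maintaining a full recency list, which keeps per-page bookkeeping small.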
The edge model is built from four tree structures: Record Tree, Unique Index Tree, Num Neighbors Materialize Tree, and Bidirectional Materialize Tree. These structures can be generated or omitted based on user configuration to represent different edge relationships.
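The configurable materialization could be expressed as an edge schema like the following; all field names are hypothetical, standing in for Kuaishou's actual configuration format:

```python
from dataclasses import dataclass

# Hypothetical edge-schema config mirroring the four optional trees.
@dataclass
class EdgeSchema:
    name: str
    record_tree: bool = True          # base edge records (always needed)
    unique_index_tree: bool = False   # dedupe / point lookups
    num_neighbors_tree: bool = False  # materialized degree counts
    bidirectional_tree: bool = False  # reverse-edge materialization

# A Follow edge might need all four trees; a Like edge only counts.
follow = EdgeSchema("Follow", unique_index_tree=True,
                    num_neighbors_tree=True, bidirectional_tree=True)
like = EdgeSchema("Like", num_neighbors_tree=True)
```

Omitting unneeded trees saves both write amplification and memory for edge types that never need, say, reverse lookups.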
Snapshot isolation is achieved by versioning pages; each read request carries a version number, allowing the system to present a consistent snapshot and avoid phantom reads or uncommitted data.
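Versioned reads can be sketched with a per‑page version list, assuming each reader carries the snapshot version assigned at request start:

```python
class VersionedPage:
    """Each write appends a (version, value) pair; a read at snapshot
    version V sees the newest value committed at or before V, so
    concurrent writers never leak uncommitted or later data into it."""
    def __init__(self):
        self.versions = []   # appended in increasing version order

    def write(self, version, value):
        self.versions.append((version, value))

    def read(self, snapshot_version):
        for version, value in reversed(self.versions):
            if version <= snapshot_version:
                return value
        return None          # page did not exist at this snapshot
```

Old versions can be garbage-collected once no in-flight read holds a snapshot older than them.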
Key performance optimizations include the Share‑Nothing principle, which binds each connection and its worker to the same thread to eliminate cross‑thread hand‑off latency, and a data flow that stores data in row format on disk but converts it to columnar format after reading. The columnar data is transmitted via Apache Arrow, enabling vectorized computation and achieving query latencies of around 10 ms for 100k‑row workloads.
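The row‑to‑column pivot and a vectorized filter can be sketched in plain Python; the real system would ship Apache Arrow record batches rather than Python lists:

```python
def rows_to_columns(rows, fields):
    """Pivot row tuples (as read from disk) into per-field arrays,
    the layout that vectorized operators and Arrow batches expect."""
    return {f: [row[i] for row in rows] for i, f in enumerate(fields)}

def vectorized_filter(columns, field, predicate):
    """Evaluate a predicate over one whole column, then apply the
    resulting mask to every column so rows stay aligned."""
    mask = [predicate(v) for v in columns[field]]
    return {f: [v for v, keep in zip(col, mask) if keep]
            for f, col in columns.items()}
```

Operating on whole columns at a time is what lets the engine use tight loops (and, with Arrow, SIMD) instead of per-row interpretation.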
The article concludes by summarizing the presented content and thanking the audience.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.