Graph Database Selection and NebulaGraph Architecture for a Knowledge‑Graph Platform
This article explains how the cloud‑construction platform evaluated graph databases against five requirements – open‑source licensing, scalability, query latency, storage capacity and bulk‑import capability – and ultimately chose NebulaGraph. It then details NebulaGraph's distributed Meta, Storage and Query services, the multi‑layer architecture of the knowledge‑graph platform, and planned application scenarios.
1 Background
Graph data structures naturally model real‑world entities such as buyers, suppliers and contracts as vertices and their interactions as edges, making it easy to describe complex relationships. In the cloud‑construction platform these graphs are used for knowledge‑graph mining, security risk control, data governance, and search/recommendation services.
2 Graph Database Selection
The team considered five criteria: (A) open‑source only, (B) distributed architecture with good scalability, (C) millisecond‑level multi‑hop query latency, (D) ability to store billions of vertices/edges, and (E) bulk import from the data warehouse.
Based on these, candidates were grouped into three categories:
First category: Neo4j, ArangoDB, Virtuoso, TigerGraph, RedisGraph – high‑performance, single‑node databases, unsuitable for large‑scale distributed scenarios.
Second category: JanusGraph, HugeGraph – add a generic graph‑semantic layer on top of existing storage engines, but suffer from limited push‑down computation and poor multi‑hop performance.
Third category: DGraph, NebulaGraph – redesign the storage model, vertex/edge distribution and execution engine for deep multi‑hop optimization, meeting all of the selection requirements.
DGraph, created in 2016 by former Google engineer Manish Rai Jain, uses an RDF data model, is implemented in Go, builds its storage on BadgerDB, and relies on Raft for strong consistency.
NebulaGraph, launched in 2019 by former Facebook engineer Ye Xiaomeng, uses a property‑graph model, is implemented in C++, builds its storage on RocksDB, and relies on Raft for strong consistency.
Performance tests on the LDBC‑SNB benchmark showed NebulaGraph outperforming DGraph and HugeGraph in data import, real‑time writes and multi‑hop queries, and its active community led to its final selection.
3 NebulaGraph Architecture
A complete NebulaGraph cluster consists of three service types: Query Service, Storage Service and Meta Service, each with its own executable binary and deployable on the same or different nodes.
Meta Service: Implements a leader/follower model; the leader serves client requests while the followers replicate its updates. It stores the schema and partitioning metadata and manages long‑running jobs such as data migration, leader changes, compaction and index rebuilding.
Storage‑Compute Separation: The architecture separates the compute layer from the storage layer. This enables independent scaling of each layer and allows the storage service to serve multiple compute engines (the OLTP query service today, and OLAP frameworks in the future).
Stateless Compute Layer: Each compute node runs a stateless query engine that reads metadata from the Meta Service and interacts with the Storage Service, making the compute cluster easy to manage with Kubernetes or cloud deployments.
Shared‑Nothing Distributed Storage Layer: The Storage Service follows a shared‑nothing design with three layers – a local Store Engine (currently RocksDB‑based), a Consensus layer that runs one Raft group per partition (multi‑group Raft), and a Graph‑API layer that translates graph operations into KV requests, enabling true graph storage and efficient push‑down computation.
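As a rough illustration of how a Graph‑API layer can map vertices and edges onto a partitioned KV store, the sketch below hashes a vertex id to a partition and encodes vertices and outgoing edges as prefixed keys. The field layout, partition count, and helper names are invented for this example; NebulaGraph's real on‑disk format uses fixed‑width binary fields, but the principle is the same: a vertex and its outgoing edges share a partition, so 1‑hop expansion becomes a local prefix scan.

```python
import hashlib

NUM_PARTITIONS = 8  # illustrative; real clusters fix this at space creation


def partition_of(vertex_id: str, num_parts: int = NUM_PARTITIONS) -> int:
    """Route a vertex to a partition by hashing its id, so that the vertex
    and its outgoing edges always land in the same partition."""
    digest = hashlib.md5(vertex_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_parts


def vertex_key(part: int, vertex_id: str, tag: str) -> bytes:
    """Encode a vertex as a KV key: partition | 'v' | vertex id | tag."""
    return f"{part:04d}|v|{vertex_id}|{tag}".encode()


def edge_key(part: int, src: str, edge_type: str, dst: str) -> bytes:
    """Encode an outgoing edge, stored in the source vertex's partition."""
    return f"{part:04d}|e|{src}|{edge_type}|{dst}".encode()


# A 1-hop neighbour lookup becomes a prefix scan within one partition.
part = partition_of("buyer_42")
k_edge = edge_key(part, "buyer_42", "signed", "contract_7")
assert k_edge.startswith(f"{part:04d}|e|buyer_42|".encode())
```

Because edges are co-located with their source vertex, filters and simple aggregations can be pushed down into the storage node holding that partition instead of shipping all edges to the query engine.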
4 Knowledge‑Graph Platform Architecture
The platform is organized into four layers:
Data Application Layer: Business services integrate the graph SDK to perform real‑time CRUD operations on graph data.
Data Storage Layer: Deployed as a cluster with a replication factor ≥ 3; the service remains available as long as a majority of each partition's replicas are alive.
Data Production Layer: Graph data originates from two sources – batch ETL jobs that convert warehouse tables into vertex/edge Hive tables for offline import, and near‑real‑time streams (Spark/Flink) that write data via bulk online APIs.
Support Platform: Provides schema management, permission control, data quality checks, CRUD APIs, cluster scaling, graph profiling, export, monitoring, visualization, and package management.
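The availability rule in the Data Storage Layer – a partition stays writable only while a majority of its replicas is alive – follows directly from Raft's quorum requirement. A minimal sketch:

```python
def partition_available(replication_factor: int, alive_replicas: int) -> bool:
    """A Raft group can elect a leader and commit writes only while a
    strict majority of its replicas is alive."""
    return alive_replicas > replication_factor // 2


# With replication factor 3, a partition tolerates one failed replica,
# but not two.
for alive, expected in [(3, True), (2, True), (1, False)]:
    assert partition_available(3, alive) is expected
```

This is why the platform mandates a replication factor of at least 3: with only 2 replicas, losing either one already drops the group below quorum.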
5 Summary and Outlook
The knowledge‑graph platform will offer end‑to‑end self‑service management of graph data, allowing business units to create schemas, import data, configure import jobs and use the provided SDK for data operations.
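As a sketch of the kind of data operations a business unit might issue through such an SDK, the helpers below assemble nGQL statements for a hypothetical buyer/contract schema. The tag, edge, and property names are invented for the example; only the statement syntax follows nGQL.

```python
def insert_vertex(tag: str, vid: str, props: dict) -> str:
    """Build an nGQL INSERT VERTEX statement (hypothetical schema)."""
    cols = ", ".join(props)
    vals = ", ".join(repr(v) for v in props.values())
    return f'INSERT VERTEX {tag}({cols}) VALUES "{vid}":({vals});'


def go_one_hop(vid: str, edge: str) -> str:
    """Build a 1-hop traversal over a single edge type."""
    return f'GO FROM "{vid}" OVER {edge} YIELD dst(edge);'


def delete_vertex(vid: str) -> str:
    """Build a vertex deletion statement."""
    return f'DELETE VERTEX "{vid}";'


stmts = [
    insert_vertex("buyer", "b1", {"name": "Acme"}),
    go_one_hop("b1", "signed"),
    delete_vertex("b1"),
]
```

In practice these strings would be executed through a session obtained from the graph SDK's connection pool rather than printed, but the statement shapes are what a CRUD API surfaces to business code.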
Future integrations include search engines, recommendation systems, automated shop‑assistant Q&A, and data‑governance tools, aiming to support millions of suppliers and tens of millions of construction‑material items, and to generate significant revenue by 2023.
YunZhu Net Technology Team