
Evolution and Architecture of Graph Databases: From Early Designs to Modern Distributed Systems

This article surveys the development of graph databases. It describes their underlying data models, compares storage designs built on relational, native, document, and wide‑column systems, and reviews representative modern distributed graph databases. It closes with current challenges and future directions such as GQL standardization and graph‑AI integration.

DataFunTalk

Graph (vertex‑edge) structures are widely used to model relationships in domains such as knowledge graphs, social networks, finance, and bioinformatics. This has prompted the rise of graph database systems that store and query graph data efficiently.

Background

Traditional databases store graph data by converting it to relational, document, or column models, while native graph databases (e.g., Neo4j, TigerGraph) design storage formats specifically for vertices, edges, and their properties.

History

Starting from Google’s GFS, MapReduce, and BigTable papers (2003‑2006), the NoSQL ecosystem expanded to include key‑value, wide‑column, document, and graph stores, driven by the need for high‑throughput, low‑latency, and strong‑consistency data services.

Graph Database Storage Designs

Common representations include adjacency lists, adjacency matrices, and edge lists, each influencing query execution strategies and performance characteristics.
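
To make the trade-off concrete, here is a minimal sketch of the three representations over the same toy graph. The graph, the variable names, and the helper functions are all hypothetical and only illustrate how a "neighbors of v" lookup differs between layouts.

```python
from collections import defaultdict

# Hypothetical toy graph: directed edges between integer vertex IDs.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
num_vertices = 4

# Edge list: just the (source, target) pairs.
edge_list = list(edges)

# Adjacency list: each vertex maps to its outgoing neighbors.
adjacency_list = defaultdict(list)
for src, dst in edges:
    adjacency_list[src].append(dst)

# Adjacency matrix: cell [i][j] is 1 when an edge i -> j exists.
adjacency_matrix = [[0] * num_vertices for _ in range(num_vertices)]
for src, dst in edges:
    adjacency_matrix[src][dst] = 1


def neighbors_from_edge_list(v):
    # Must scan every edge: O(|E|) per lookup.
    return [dst for src, dst in edge_list if src == v]


def neighbors_from_adjacency_list(v):
    # Direct lookup: O(out-degree of v).
    return list(adjacency_list.get(v, []))


def neighbors_from_matrix(v):
    # Scans one row: O(|V|) per lookup, but edge existence checks are O(1).
    return [j for j, bit in enumerate(adjacency_matrix[v]) if bit]
```

All three helpers return the same neighbors; what changes is how much data each one has to touch, which is why the chosen layout shapes the query execution strategy.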

Native Graph Databases

Neo4j

Neo4j stores vertices, edges, and their properties as linked records. Each edge is stored once and threaded into doubly linked lists, one for each of its two endpoint vertices, so a traversal can follow edge pointers from either endpoint without loading full vertex records.

Figure 1 – Neo4j storage model.
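
The linked-record idea can be sketched in a few lines. This is a simplified illustration, not Neo4j's actual on-disk format: the record fields, the `LinkedRecordStore` class, and the use of Python lists as "files" are assumptions, and self-loops are rejected to keep the chain logic short.

```python
from dataclasses import dataclass

NIL = -1  # hypothetical "no record" marker


@dataclass
class NodeRecord:
    first_rel: int = NIL  # head of this node's relationship chain


@dataclass
class RelRecord:
    src: int
    dst: int
    src_prev: int = NIL  # neighbors in the source node's chain
    src_next: int = NIL
    dst_prev: int = NIL  # neighbors in the target node's chain
    dst_next: int = NIL


class LinkedRecordStore:
    def __init__(self, num_nodes):
        self.nodes = [NodeRecord() for _ in range(num_nodes)]
        self.rels = []

    def _link(self, rel_id, node_id):
        """Insert rel_id at the head of node_id's doubly linked chain."""
        rel = self.rels[rel_id]
        head = self.nodes[node_id].first_rel
        if rel.src == node_id:
            rel.src_next = head
        else:
            rel.dst_next = head
        if head != NIL:
            old = self.rels[head]
            if old.src == node_id:
                old.src_prev = rel_id
            else:
                old.dst_prev = rel_id
        self.nodes[node_id].first_rel = rel_id

    def add_rel(self, src, dst):
        if src == dst:
            raise ValueError("self-loops are not handled in this sketch")
        rel_id = len(self.rels)
        self.rels.append(RelRecord(src, dst))  # the edge is stored once
        self._link(rel_id, src)
        self._link(rel_id, dst)
        return rel_id

    def neighbors(self, node_id):
        """Walk the chain by pointer chasing; no index lookup is needed."""
        out = []
        rel_id = self.nodes[node_id].first_rel
        while rel_id != NIL:
            rel = self.rels[rel_id]
            if rel.src == node_id:
                out.append(rel.dst)
                rel_id = rel.src_next
            else:
                out.append(rel.src)
                rel_id = rel.dst_next
        return out
```

Each relationship record participates in two chains at once, which is how a single stored edge stays reachable from both of its endpoints.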

RedisGraph

Built on Redis, RedisGraph uses an adjacency‑matrix design stored in compressed sparse row (CSR) format; vertices and edges are kept in arrays of fixed‑size blocks, with label‑based containers to improve locality.

Figure 2 – RedisGraph storage layout.
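
As an illustration of the CSR idea, the sketch below builds the two CSR arrays for a small adjacency matrix and uses them for neighbor lookups. It is a plain-Python approximation under assumed names (`build_csr`, `out_neighbors`, `expand`); it does not reproduce RedisGraph's internal code or its block-based containers.

```python
def build_csr(num_vertices, edges):
    """Return (row_ptr, col_idx) for a directed graph given as (src, dst) pairs."""
    # Count the out-degree of each vertex, shifted by one position.
    row_ptr = [0] * (num_vertices + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1
    # Prefix sum: row_ptr[v] is where vertex v's neighbors start in col_idx.
    for i in range(num_vertices):
        row_ptr[i + 1] += row_ptr[i]
    # Fill the column indices row by row, in sorted order.
    col_idx = [0] * len(edges)
    next_slot = row_ptr[:-1].copy()
    for src, dst in sorted(edges):
        col_idx[next_slot[src]] = dst
        next_slot[src] += 1
    return row_ptr, col_idx


def out_neighbors(row_ptr, col_idx, v):
    """Neighbors of v are one contiguous slice of col_idx."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]


def expand(row_ptr, col_idx, frontier):
    """One traversal hop: union of the out-neighbors of every frontier vertex."""
    reached = set()
    for v in frontier:
        reached.update(out_neighbors(row_ptr, col_idx, v))
    return reached
```

Because only non-zero cells are stored, the memory cost grows with the number of edges rather than with the square of the number of vertices.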

Sparksee

Sparksee (formerly DEX) uses bitmap‑based indexes and bit‑wise operations to accelerate graph workloads, storing value sets for vertices/edges and representing topology with bitmaps split into 32‑bit sub‑sequences.

Figure 3 – Sparksee value set and bitmap.
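
A rough sketch of the bitmap approach follows, using Python integers as bitmaps. The function names are hypothetical, and skipping all-zero 32‑bit words in `split_words` is an assumption made for illustration rather than a documented Sparksee detail.

```python
def to_bitmap(object_ids):
    """Set bit i for every object ID i that has a given value."""
    bitmap = 0
    for oid in object_ids:
        bitmap |= 1 << oid
    return bitmap


def from_bitmap(bitmap):
    """Recover the sorted object IDs whose bits are set."""
    ids, pos = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(pos)
        bitmap >>= 1
        pos += 1
    return ids


def split_words(bitmap, word_bits=32):
    """Split a bitmap into (word_index, word) pairs, skipping all-zero words."""
    words, index = [], 0
    mask = (1 << word_bits) - 1
    while bitmap:
        word = bitmap & mask
        if word:
            words.append((index, word))
        bitmap >>= word_bits
        index += 1
    return words


# Hypothetical value sets: which objects carry which attribute value.
is_person = to_bitmap([1, 2, 40])
lives_in_hk = to_bitmap([2, 40, 70])

# A conjunctive filter becomes a single bit-wise AND.
persons_in_hk = from_bitmap(is_person & lives_in_hk)
```

Set intersection, union, and difference over vertices or edges all reduce to `&`, `|`, and `& ~` on these bitmaps, which is where the speed-up for graph workloads comes from.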

RDBMS‑Based Graph Databases

AgensGraph

Built on PostgreSQL, AgensGraph stores vertex and edge attributes as JSON in tables; each table page contains fixed‑size slots pointing to variable‑length records, enabling efficient on‑disk navigation.

Figure 4 – AgensGraph page layout.
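
The slot-and-record layout can be approximated as below. This is a heavily simplified sketch: the 256‑byte page size, the `(offset, length)` slot encoding, and the `SlottedPage` class are assumptions for illustration, and real PostgreSQL pages also carry headers and visibility metadata that are omitted here.

```python
import json
import struct

PAGE_SIZE = 256                # hypothetical; real pages are much larger
SLOT = struct.Struct("<HH")    # fixed-size slot: (record offset, record length)


class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.num_slots = 0
        self.free_end = PAGE_SIZE  # records grow backwards from the page end

    def insert(self, record: bytes) -> int:
        """Store a variable-length record and return its slot number."""
        slot_area_end = (self.num_slots + 1) * SLOT.size
        start = self.free_end - len(record)
        if start < slot_area_end:
            raise ValueError("page full")
        self.data[start:self.free_end] = record
        SLOT.pack_into(self.data, self.num_slots * SLOT.size, start, len(record))
        self.free_end = start
        self.num_slots += 1
        return self.num_slots - 1

    def read(self, slot_no: int) -> bytes:
        """Follow the slot pointer to the record bytes."""
        if not 0 <= slot_no < self.num_slots:
            raise IndexError("no such slot")
        offset, length = SLOT.unpack_from(self.data, slot_no * SLOT.size)
        return bytes(self.data[offset:offset + length])


def encode_properties(properties: dict) -> bytes:
    # Vertex or edge attributes serialized as JSON text.
    return json.dumps(properties, sort_keys=True).encode("utf-8")
```

A record is then addressed by (page, slot number): the slot array can be indexed in constant time even though the JSON records behind it differ in length.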

Document‑Based Graph Databases

OrientDB

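OrientDB separates vertices and edges into classes whose records are stored in clusters. Each record has a unique RID (record ID) composed of a cluster ID and a position within that cluster, written as #cluster:position. It supports both regular edges, which are records in their own right, and lightweight edges, which are stored as direct links in the vertices' edge lists, for flexible traversal.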

Figure 5 – OrientDB storage format.
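
The following sketch illustrates how a traversal might resolve both kinds of edge. It assumes the textual RID form `#<cluster-id>:<cluster-position>`; the `RID`, `Vertex`, and `Edge` classes and the `records` dictionary are hypothetical stand-ins, not OrientDB's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RID:
    cluster: int
    position: int

    @classmethod
    def parse(cls, text):
        if not text.startswith("#"):
            raise ValueError(f"not a record ID: {text!r}")
        cluster, position = text[1:].split(":")
        return cls(int(cluster), int(position))

    def __str__(self):
        return f"#{self.cluster}:{self.position}"


@dataclass
class Vertex:
    rid: RID
    # Each entry is the RID of an edge record (regular edge)
    # or of the neighbor vertex itself (lightweight edge).
    out_links: list = field(default_factory=list)


@dataclass
class Edge:
    rid: RID
    out_v: RID  # source vertex
    in_v: RID   # target vertex


def out_neighbors(vertex, records):
    """Resolve out-links to neighbor RIDs, whatever the edge flavor."""
    result = []
    for link in vertex.out_links:
        target = records[link]
        if isinstance(target, Edge):
            result.append(target.in_v)   # regular edge: one extra hop
        else:
            result.append(target.rid)    # lightweight edge: direct link
    return result
```

A regular edge costs one extra record lookup but can carry its own properties; a lightweight edge saves that lookup at the price of having nowhere to store edge attributes.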

Wide‑Column‑Based Graph Databases

JanusGraph

JanusGraph delegates storage to a wide‑column backend such as HBase. Each vertex is a row whose cells hold its properties and adjacent edges, and each edge cell is keyed by a composite of label, direction, neighbor ID, and edge ID, so that edges sort by those fields and can be read with sorted range scans.

Figure 6 – JanusGraph storage format.
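
Why the composite key matters can be shown with a small in-memory sketch. The `VertexRow` class and the tuple encoding are assumptions that stand in for a wide-column row with sorted column keys; they are not JanusGraph's binary format.

```python
import bisect

OUT, IN = 0, 1  # hypothetical direction codes


class VertexRow:
    """One row per vertex; edge columns are kept sorted by composite key."""

    def __init__(self):
        # Sorted list of (label, direction, neighbor_id, edge_id) tuples.
        self.columns = []

    def add_edge(self, label, direction, neighbor_id, edge_id):
        bisect.insort(self.columns, (label, direction, neighbor_id, edge_id))

    def scan(self, label, direction):
        """Range scan: all edges sharing the (label, direction) prefix."""
        lo = bisect.bisect_left(self.columns, (label, direction))
        hi = bisect.bisect_left(self.columns, (label, direction + 1))
        return self.columns[lo:hi]


row = VertexRow()
row.add_edge("knows", OUT, 7, 101)
row.add_edge("follows", OUT, 3, 102)
row.add_edge("knows", OUT, 2, 103)
row.add_edge("knows", IN, 9, 104)

outgoing_knows = row.scan("knows", OUT)
```

Because keys with the same label and direction are contiguous, a query such as "outgoing `knows` edges of this vertex" reads one slice of the row instead of filtering every edge.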

Current Landscape (Graph Database 2.0)

NebulaGraph

NebulaGraph is an open‑source distributed graph database with three services (Query, Storage, Meta). Meta uses Raft for consistency; Storage follows a shared‑nothing KV design, and queries are parsed into ASTs before execution.

ByteGraph

ByteGraph (by ByteDance) separates compute and storage, with an in‑memory layer caching KV pairs and a disk‑based KV store; it splits a vertex’s outgoing neighbors into multiple KV pairs forming a logical B‑Tree.
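
The neighbor-splitting idea can be sketched as a two-level structure over a plain dictionary that stands in for the KV store. The key naming scheme, the page capacity, and the function names are all hypothetical; ByteGraph's actual layout is not specified in this article beyond the description above.

```python
import bisect

PAGE_CAPACITY = 4  # hypothetical maximum number of neighbors per KV pair


def store_adjacency(kv, vertex_id, neighbors):
    """Split a vertex's sorted neighbors into leaf pages plus one root index."""
    neighbors = sorted(neighbors)
    index = []
    for page_no, start in enumerate(range(0, len(neighbors), PAGE_CAPACITY)):
        page = neighbors[start:start + PAGE_CAPACITY]
        page_key = f"{vertex_id}/page/{page_no}"
        kv[page_key] = page                 # leaf: one KV pair per page
        index.append((page[0], page_key))   # root entry: first key of the page
    kv[f"{vertex_id}/root"] = index


def has_edge(kv, vertex_id, neighbor):
    """Point lookup: read the root, then exactly one leaf page."""
    index = kv.get(f"{vertex_id}/root", [])
    first_keys = [first for first, _ in index]
    pos = bisect.bisect_right(first_keys, neighbor) - 1
    if pos < 0:
        return False
    page = kv[index[pos][1]]
    i = bisect.bisect_left(page, neighbor)
    return i < len(page) and page[i] == neighbor


def scan_neighbors(kv, vertex_id):
    """Full scan: visit the leaf pages in key order."""
    for _, page_key in kv.get(f"{vertex_id}/root", []):
        yield from kv[page_key]
```

The benefit shows up for high-degree vertices: an edge lookup or update touches one small page rather than rewriting a single huge value holding every neighbor.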

EasyGraph

EasyGraph (Tencent) adopts a storage‑compute separation using TiKV (Raft‑based KV store) and integrates a graph visualization engine and AngelGraph for graph algorithms, including community detection and GNNs.

Neo4j (latest)

Neo4j now supports a distributed mode with multi‑replica Raft‑based consistency; it offers Primary (strong consistency) and Secondary (read‑only) modes and includes the Graph Data Science library for analytics and GNNs.

TigerGraph

TigerGraph is a commercial distributed MPP graph database supporting OLTP and OLAP, with a Graph Storage Engine (GSE) that compresses data into a native format and a Graph Processing Engine (GPE) that executes queries; it uses Kafka for data sync and GSQL as its query language.

Future Directions

The community is moving toward a unified graph query language (GQL) to lower learning barriers and enable advanced cost‑based optimizations; integration with graph AI, graph‑ML, and streaming graph systems is expected to unlock deeper value from graph data.

Author Introduction

Prof. James Cheng leads the Husky Data Lab at CUHK, focusing on distributed systems, graph computing, and graph data management; he has collaborated with Alibaba, Huawei, Tencent, and ByteDance, and received awards such as the Hong Kong Young Scientist award and ATC'21 Best Paper.
