Applying Graph Database Technology to Baidu Chinese Dictionary Service
To meet Baidu Chinese’s need for sub‑200 ms responses on multi‑hop queries across millions of dictionary entities, the team replaced MySQL with the open‑source HugeGraph graph database backed by RocksDB. The resulting multi‑master, REST‑enabled deployment, with caching, bulk loading, and a data‑intervention platform, delivers fast, reliable traversal of semantic relationships.
The rapid development of various industries has increased data inter‑connectivity, making traditional relational databases inadequate for handling deep, heterogeneous relationships. This article introduces the use of a graph database to support the complex, multi‑type data of Baidu Chinese (a dictionary service covering characters, words, poems, idioms, etc.).
Baidu Chinese stores over ten categories of entities, amounting to more than ten million records. Each entity type (e.g., a poem) has dozens of attributes, and queries often involve multi‑hop relationships such as “Who is the author of the poem *Quiet Night Thoughts* and which dynasty does he belong to?”. Using a relational database like MySQL would require many tables and indexes, leading to costly joins and latency that cannot meet the 200 ms response requirement and the peak QPS of over a thousand.
Graph Database Overview
A graph database (GDB) stores data as nodes, edges, and properties, allowing direct traversal of relationships. Queries are fast because relationships are persisted within the graph structure, making it suitable for highly connected data.
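The node/edge/property model can be illustrated with a toy in‑memory graph in Python. This is a pedagogical sketch only, not HugeGraph's storage format; the entity names mirror the poem example used later in this article:

```python
# Toy property graph: nodes carry a label and properties, edges carry labels.
# A graph database persists exactly these shapes, so a "join" becomes a hop.

class Graph:
    def __init__(self):
        self.nodes = {}   # id -> {"label": ..., "props": {...}}
        self.edges = []   # (src_id, edge_label, dst_id)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, src, edge_label, dst):
        self.edges.append((src, edge_label, dst))

    def out(self, node_id, edge_label):
        # Follow outgoing edges with the given label -- one traversal hop.
        return [dst for s, lbl, dst in self.edges
                if s == node_id and lbl == edge_label]

g = Graph()
g.add_node("poem-1", "poem", title="Quiet Night Thoughts")
g.add_node("author-1", "author", name="Li Bai")
g.add_node("dynasty-1", "dynasty", name="Tang")
g.add_edge("poem-1", "author", "author-1")
g.add_edge("author-1", "dynasty", "dynasty-1")

# Two-hop traversal: poem -> author -> dynasty, with no table joins.
author = g.out("poem-1", "author")[0]
dynasty = g.out(author, "dynasty")[0]
print(g.nodes[dynasty]["props"]["name"])   # Tang
```

Because each hop is a direct edge lookup rather than an index scan plus join, traversal cost grows with the neighborhood visited, not with total table size.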
Selection Criteria
The team evaluated graph databases based on open‑source availability, maturity and scalability, low operational cost, rich documentation/community, and bulk import/export capabilities. Neo4j, while popular, lacks distributed support in its Community Edition and is licensed under GPLv3, which did not fit Baidu’s requirements.
Consequently, Baidu adopted the open‑source HugeGraph as the underlying graph engine. HugeGraph supports multiple storage back‑ends; RocksDB was chosen for its file‑based storage, SSD performance, and ease of migration.
HugeGraph also provides a RESTful API and a bulk‑loader tool, facilitating the import of billions of records. It is compatible with Apache TinkerPop3 and uses the Gremlin traversal language for graph queries.
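As a sketch, submitting a Gremlin traversal over the REST API might look like the following standard‑library Python. The `/apis/gremlin` path follows HugeGraph's documented Gremlin endpoint, but the host, port, and query are placeholders, not Baidu's production setup:

```python
import json
from urllib import request

# Host and port are placeholders; adjust for a real HugeGraph server.
GREMLIN_ENDPOINT = "http://127.0.0.1:8080/apis/gremlin"

def build_gremlin_request(query: str) -> request.Request:
    """Wrap a Gremlin traversal in the JSON body the Gremlin endpoint expects."""
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return request.Request(
        GREMLIN_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_gremlin_request("g.V().hasLabel('poem').limit(10)")
print(req.get_full_url())   # http://127.0.0.1:8080/apis/gremlin
# To execute against a live server: request.urlopen(req)
```

Keeping the query as data in a JSON payload is what lets the service layer template and cache traversals without regenerating client code.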
Deployment Architecture
HugeGraph is deployed on Baidu’s internal PaaS platform using virtualization. A multi‑master setup stores the full dataset on each instance to achieve low‑latency reads. Data consistency across instances is ensured by a unified intervention platform that coordinates updates.
To further reduce latency, an Nginx layer proxies requests to HugeGraph, handling timeout control and caching hot data (approximately 30% of queries) via the proxy_cache directive. Data files are shared between servers through AFS cloud disks.
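A minimal sketch of such a caching proxy layer is shown below; the cache zone, TTLs, timeout, and upstream addresses are illustrative assumptions, not Baidu's actual configuration:

```nginx
# Sketch only: zone sizes, TTLs and addresses are placeholder values.
proxy_cache_path /var/cache/nginx/hugegraph levels=1:2
                 keys_zone=hot_queries:64m max_size=2g inactive=10m;

upstream hugegraph {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

server {
    listen 80;
    location /apis/ {
        proxy_pass http://hugegraph;
        proxy_cache hot_queries;
        proxy_cache_valid 200 5m;    # briefly cache successful responses
        proxy_read_timeout 200ms;    # enforce the latency budget
    }
}
```

Serving the hot ~30% of queries from the proxy cache keeps repeated traversals off the graph engine entirely.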
Data Intervention Platform
The platform supports real‑time and batch data interventions, as well as export functions. It records each operation, verifies results, and provides rollback capabilities. A retry mechanism with alerting ensures transactional consistency across the HugeGraph cluster.
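The retry‑and‑alert loop described above can be sketched in Python. Here `apply_update` and the `alert` hook stand in for the platform's real update RPC and alerting channel; the function names and retry policy are assumptions for illustration:

```python
import time

def apply_with_retry(instances, apply_update, retries=3, backoff=0.5, alert=print):
    """Apply one intervention to every cluster instance.

    Each instance gets up to `retries` attempts with linear backoff; a final
    failure triggers `alert` and is reported so the batch can be rolled back.
    """
    failed = []
    for instance in instances:
        for attempt in range(1, retries + 1):
            try:
                apply_update(instance)
                break                      # this instance succeeded
            except Exception as exc:
                if attempt == retries:
                    alert(f"update failed on {instance}: {exc}")
                    failed.append(instance)
                else:
                    time.sleep(backoff * attempt)
    return failed   # non-empty -> mark the batch for rollback

# Simulated cluster: one healthy replica, one that always refuses connections.
attempts = {}
def flaky_update(instance):
    attempts[instance] = attempts.get(instance, 0) + 1
    if instance == "replica-2":
        raise RuntimeError("connection refused")

failed = apply_with_retry(["replica-1", "replica-2"], flaky_update,
                          retries=2, backoff=0)
print(failed)   # ['replica-2']
```

Returning the failed instances (rather than raising midway) lets the platform decide between rollback and re‑queueing while every healthy instance stays updated.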
Data preparation involves packaging the RocksDB files, uploading them to AFS, and loading them during service startup. In case of server failure, the data package is re‑uploaded and restored on a new machine, with the intervention platform marking the service as abnormal to prevent concurrent edits.
Query Example
The following Gremlin query retrieves the author’s dynasty for the poem “Quiet Night Thoughts”:
shici.traversal().V()
     .hasLabel('poem_name').hasId('p_name-静夜思')
     .outE('name_poem').inV().hasLabel('poem')
     .outE('type_poem_author').inV()
     .path();
A Data‑Dictionary (DA) module parses natural‑language queries into Gremlin templates, enabling flexible user queries to be translated into executable graph traversals.
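A hypothetical sketch of the template step follows. The intent name, slot format, and lookup table are invented for illustration and are not the real DA module; only the Gremlin shape mirrors the query shown above:

```python
# Hypothetical template table: a parsed intent maps to a Gremlin skeleton,
# and recognized entities fill the named slots.
TEMPLATES = {
    "poem_author_dynasty": (
        "g.V().hasLabel('poem_name').hasId('p_name-{poem}')"
        ".outE('name_poem').inV().hasLabel('poem')"
        ".outE('type_poem_author').inV().path()"
    ),
}

def to_gremlin(intent: str, slots: dict) -> str:
    """Fill a Gremlin template with the entities extracted from the question."""
    return TEMPLATES[intent].format(**slots)

# "Who wrote Quiet Night Thoughts, and in which dynasty?" ->
query = to_gremlin("poem_author_dynasty", {"poem": "静夜思"})
print(query)
```

Separating intent templates from entity extraction means new question types only require adding a template, not changing the traversal engine.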
Conclusion and Outlook
HugeGraph adds a semantic graph layer on top of existing storage, offering powerful traversal capabilities. However, multi‑hop traversals can suffer performance issues in OLTP scenarios, and the storage options are limited. Despite these drawbacks, the combination of HugeGraph and Lucene meets Baidu Chinese’s current requirements. Future work will focus on further optimization and deeper exploitation of graph database value.
Baidu Geek Talk