Applying Graph Database Technology to Baidu Chinese Dictionary Service
To meet Baidu Chinese’s need for sub‑200 ms responses on multi‑hop queries across millions of dictionary entities, the team replaced MySQL with the open‑source HugeGraph graph database backed by RocksDB. The resulting multi‑master, REST‑enabled deployment, with caching, bulk loading, and a data‑intervention platform, delivers fast, reliable traversal of semantic relationships.
The rapid development of various industries has increased data inter‑connectivity, making traditional relational databases inadequate for handling deep, heterogeneous relationships. This article introduces the use of a graph database to support the complex, multi‑type data of Baidu Chinese (a dictionary service covering characters, words, poems, idioms, etc.).
Baidu Chinese stores over ten categories of entities, amounting to more than ten million records. Each entity type (e.g., a poem) has dozens of attributes, and queries often involve multi‑hop relationships such as “Who is the author of the poem *Quiet Night Thoughts* and which dynasty does he belong to?”. Using a relational database like MySQL would require many tables and indexes, leading to costly joins and latency that cannot meet the 200 ms response requirement and the peak QPS of over a thousand.
Graph Database Overview
A graph database (GDB) stores data as nodes, edges, and properties, allowing direct traversal of relationships. Queries are fast because relationships are persisted within the graph structure, making it suitable for highly connected data.
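The node/edge/property model can be illustrated with a toy in‑memory graph in Python. This is a pedagogical sketch only, not HugeGraph's storage format; the entity names mirror the poem example used later in this article:

```python
# Toy property graph: nodes carry a label and properties, edges carry labels.
# A graph database persists exactly these shapes, so a "join" becomes a hop.

class Graph:
    def __init__(self):
        self.nodes = {}   # id -> {"label": ..., "props": {...}}
        self.edges = []   # (src_id, edge_label, dst_id)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, src, edge_label, dst):
        self.edges.append((src, edge_label, dst))

    def out(self, node_id, edge_label):
        # Follow outgoing edges with the given label -- one traversal hop.
        return [dst for s, lbl, dst in self.edges
                if s == node_id and lbl == edge_label]

g = Graph()
g.add_node("poem-1", "poem", title="Quiet Night Thoughts")
g.add_node("author-1", "author", name="Li Bai")
g.add_node("dynasty-1", "dynasty", name="Tang")
g.add_edge("poem-1", "author", "author-1")
g.add_edge("author-1", "dynasty", "dynasty-1")

# Two-hop traversal: poem -> author -> dynasty, with no table joins.
author = g.out("poem-1", "author")[0]
dynasty = g.out(author, "dynasty")[0]
print(g.nodes[dynasty]["props"]["name"])   # Tang
```

Because each hop is a direct edge lookup rather than an index scan plus join, traversal cost grows with the neighborhood visited, not with total table size.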
Selection Criteria
The team evaluated graph databases based on open‑source availability, maturity and scalability, low operational cost, rich documentation/community, and bulk import/export capabilities. Neo4j, while popular, lacks distributed support in its Community Edition and is licensed under GPLv3, which did not fit Baidu’s requirements.
Consequently, Baidu adopted the open‑source HugeGraph as the underlying graph engine. HugeGraph supports multiple storage back‑ends; RocksDB was chosen for its file‑based storage, SSD performance, and ease of migration.
HugeGraph also provides a RESTful API and a bulk‑loader tool, facilitating the import of billions of records. It is compatible with Apache TinkerPop3 and uses the Gremlin traversal language for graph queries.
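As a sketch, submitting a Gremlin traversal over the REST API might look like the following standard‑library Python. The `/apis/gremlin` path follows HugeGraph's documented Gremlin endpoint, but the host, port, and query are placeholders, not Baidu's production setup:

```python
import json
from urllib import request

# Host and port are placeholders; adjust for a real HugeGraph server.
GREMLIN_ENDPOINT = "http://127.0.0.1:8080/apis/gremlin"

def build_gremlin_request(query: str) -> request.Request:
    """Wrap a Gremlin traversal in the JSON body the Gremlin endpoint expects."""
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return request.Request(
        GREMLIN_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_gremlin_request("g.V().hasLabel('poem').limit(10)")
print(req.get_full_url())   # http://127.0.0.1:8080/apis/gremlin
# To execute against a live server: request.urlopen(req)
```

Keeping the query as data in a JSON payload is what lets the service layer template and cache traversals without regenerating client code.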
Deployment Architecture
HugeGraph is deployed on Baidu’s internal PaaS platform using virtualization. A multi‑master setup stores the full dataset on each instance to achieve low‑latency reads. Data consistency across instances is ensured by a unified intervention platform that coordinates updates.
To further reduce latency, an Nginx layer proxies requests to HugeGraph, handling timeout control and caching hot data (approximately 30% of queries) via the proxy_cache directive. Data files are shared between servers through AFS cloud disks.
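A minimal sketch of such a caching proxy layer is shown below; the cache zone, TTLs, timeout, and upstream addresses are illustrative assumptions, not Baidu's actual configuration:

```nginx
# Sketch only: zone sizes, TTLs and addresses are placeholder values.
proxy_cache_path /var/cache/nginx/hugegraph levels=1:2
                 keys_zone=hot_queries:64m max_size=2g inactive=10m;

upstream hugegraph {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

server {
    listen 80;
    location /apis/ {
        proxy_pass http://hugegraph;
        proxy_cache hot_queries;
        proxy_cache_valid 200 5m;    # briefly cache successful responses
        proxy_read_timeout 200ms;    # enforce the latency budget
    }
}
```

Serving the hot ~30% of queries from the proxy cache keeps repeated traversals off the graph engine entirely.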
Data Intervention Platform
The platform supports real‑time and batch data interventions, as well as export functions. It records each operation, verifies results, and provides rollback capabilities. A retry mechanism with alerting ensures transactional consistency across the HugeGraph cluster.
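The retry‑and‑alert loop described above can be sketched in Python. Here `apply_update` and the `alert` hook stand in for the platform's real update RPC and alerting channel; the function names and retry policy are assumptions for illustration:

```python
import time

def apply_with_retry(instances, apply_update, retries=3, backoff=0.5, alert=print):
    """Apply one intervention to every cluster instance.

    Each instance gets up to `retries` attempts with linear backoff; a final
    failure triggers `alert` and is reported so the batch can be rolled back.
    """
    failed = []
    for instance in instances:
        for attempt in range(1, retries + 1):
            try:
                apply_update(instance)
                break                      # this instance succeeded
            except Exception as exc:
                if attempt == retries:
                    alert(f"update failed on {instance}: {exc}")
                    failed.append(instance)
                else:
                    time.sleep(backoff * attempt)
    return failed   # non-empty -> mark the batch for rollback

# Simulated cluster: one healthy replica, one that always refuses connections.
attempts = {}
def flaky_update(instance):
    attempts[instance] = attempts.get(instance, 0) + 1
    if instance == "replica-2":
        raise RuntimeError("connection refused")

failed = apply_with_retry(["replica-1", "replica-2"], flaky_update,
                          retries=2, backoff=0)
print(failed)   # ['replica-2']
```

Returning the failed instances (rather than raising midway) lets the platform decide between rollback and re‑queueing while every healthy instance stays updated.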
Data preparation involves packaging the RocksDB files, uploading them to AFS, and loading them during service startup. In case of server failure, the data package is re‑uploaded and restored on a new machine, with the intervention platform marking the service as abnormal to prevent concurrent edits.
Query Example
The following Gremlin query retrieves the author’s dynasty for the poem “Quiet Night Thoughts”:
shici.traversal().V()
     .hasLabel('poem_name').hasId('p_name-静夜思')
     .outE('name_poem').inV().hasLabel('poem')
     .outE('type_poem_author').inV()
     .path();
A Data‑Dictionary (DA) module parses natural‑language queries into Gremlin templates, enabling flexible user queries to be translated into executable graph traversals.
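A hypothetical sketch of the template step follows. The intent name, slot format, and lookup table are invented for illustration and are not the real DA module; only the Gremlin shape mirrors the query shown above:

```python
# Hypothetical template table: a parsed intent maps to a Gremlin skeleton,
# and recognized entities fill the named slots.
TEMPLATES = {
    "poem_author_dynasty": (
        "g.V().hasLabel('poem_name').hasId('p_name-{poem}')"
        ".outE('name_poem').inV().hasLabel('poem')"
        ".outE('type_poem_author').inV().path()"
    ),
}

def to_gremlin(intent: str, slots: dict) -> str:
    """Fill a Gremlin template with the entities extracted from the question."""
    return TEMPLATES[intent].format(**slots)

# "Who wrote Quiet Night Thoughts, and in which dynasty?" ->
query = to_gremlin("poem_author_dynasty", {"poem": "静夜思"})
print(query)
```

Separating intent templates from entity extraction means new question types only require adding a template, not changing the traversal engine.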
Conclusion and Outlook
HugeGraph adds a semantic graph layer on top of existing storage, offering powerful traversal capabilities. However, multi‑hop traversals can suffer performance issues in OLTP scenarios, and the storage options are limited. Despite these drawbacks, the combination of HugeGraph and Lucene meets Baidu Chinese’s current requirements. Future work will focus on further optimization and deeper exploitation of graph database value.
Baidu Geek Talk