
iQIYI Basic Data Platform: Architecture, High Availability, and Service Practices

iQIYI's Basic Data Platform unifies data exchange across dozens of business lines, providing massive storage, distribution, online query, and offline analysis services. Its architecture combines an access layer, a unified management platform, fine-grained governance, dual-cluster ID generation, active-standby HBase with a MongoDB write-ahead log, RocketMQ messaging with server-side filtering, and horizontally scalable read replicas to ensure high availability and performance.

iQIYI Technical Product Team

iQIYI's Basic Data Platform was built to unify internal data exchange standards, solving problems such as inconsistent IDs across teams, divergent data definitions, and untimely data updates.

The platform now integrates data from most of iQIYI's business lines, including UGC video, full‑network film information, resource slots, live streaming, games, literature, and e‑commerce, providing massive data storage, distribution, online query, and offline analysis services.

Currently the platform manages nearly one hundred tables holding tens of billions of records in total, with daily growth of millions of records and tens of millions of messages, and serves dozens of business teams.

Service Capabilities

The platform provides massive data storage, distribution, online query, and offline analysis capabilities.

Overall Architecture

Access Layer: HTTP and RPC interfaces, unified message listening, and offline scan SDK.

Unified Management Platform: tools for viewing table definitions, data volumes, message volumes, and change logs; real‑time queries; and a one‑stop field‑definition management system.

Service Governance: fine‑grained permission control, traffic shaping, and other governance functions for data access.

Service Process

1. Define tables and field structures in the management platform and publish a Protobuf data‑definition package.

2. Production services write data via the ID service and write service; data is first stored in HBase, then an update notification message is sent. Downstream services subscribe to the message, obtain the changed ID and field information, and retrieve the latest data via the read service.

3. The platform records each change in HBase for troubleshooting. The change log includes the business, timestamp, and IP address of the modification.

4. A message merging service consolidates messages with the same ID to reduce downstream traffic. Different priority levels use different merging windows (e.g., live streaming uses a shorter window).
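The merging step in the process above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the class name, the window lengths, and the priority labels are all hypothetical, and a real merger would flush on a timer rather than on explicit calls.

```python
class MessageMerger:
    """Coalesces change messages by ID within a per-priority window.

    Messages for the same ID arriving inside an open window are merged
    into one message whose changed-field set is the union of the inputs.
    Window lengths here are illustrative; higher-priority traffic
    (e.g. live streaming) gets a shorter window so changes flush sooner.
    """
    WINDOWS = {"high": 1.0, "normal": 5.0}  # seconds, hypothetical values

    def __init__(self):
        self._pending = {}  # id -> [deadline, set of changed fields]

    def offer(self, msg_id, changed_fields, priority, now):
        if msg_id in self._pending:
            self._pending[msg_id][1].update(changed_fields)  # merge into open window
        else:
            deadline = now + self.WINDOWS[priority]
            self._pending[msg_id] = [deadline, set(changed_fields)]

    def flush(self, now):
        """Emit every merged message whose window has expired."""
        ready = {i: f for i, (d, f) in self._pending.items() if d <= now}
        for i in ready:
            del self._pending[i]
        return ready

merger = MessageMerger()
merger.offer("video:42", {"title"}, "normal", now=0.0)
merger.offer("video:42", {"duration"}, "normal", now=2.0)  # merged into the open window
merger.offer("live:7", {"status"}, "high", now=0.0)
first = merger.flush(now=1.5)    # only the short high-priority window has expired
second = merger.flush(now=5.0)   # the normal-priority window expires later
```

Two updates to `video:42` reach downstream consumers as a single message carrying both changed fields, which is how the platform cuts downstream traffic.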

Service Solutions

4.1 ID Service High Availability

The ID service uses two MySQL clusters: one generates odd IDs, the other generates even IDs. If one cluster fails, the other continues to provide IDs.
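The odd/even split is the classic offset-and-step scheme (in MySQL terms, `auto_increment_offset` plus `auto_increment_increment=2`). Below is a minimal sketch of the failover behavior, with the clusters modeled as in-process counters; the class names and the round-robin policy are assumptions, not the platform's actual code.

```python
import itertools

class IdCluster:
    """One MySQL cluster, modeled as a counter with a fixed offset and step.

    Odd cluster:  offset=1, step=2 -> 1, 3, 5, ...
    Even cluster: offset=2, step=2 -> 2, 4, 6, ...
    """
    def __init__(self, offset, step=2):
        self._counter = itertools.count(offset, step)
        self.healthy = True

    def next_id(self):
        if not self.healthy:
            raise ConnectionError("cluster unavailable")
        return next(self._counter)

class IdService:
    """Alternates between the two clusters; if one fails, the other
    keeps issuing IDs, so the sequence stays unique (just less dense)."""
    def __init__(self, odd, even):
        self._clusters = [odd, even]
        self._turn = 0

    def next_id(self):
        for _ in range(len(self._clusters)):
            cluster = self._clusters[self._turn]
            self._turn = (self._turn + 1) % len(self._clusters)
            try:
                return cluster.next_id()
            except ConnectionError:
                continue  # fall through to the surviving cluster
        raise RuntimeError("all ID clusters down")

odd, even = IdCluster(1), IdCluster(2)
svc = IdService(odd, even)
ids = [svc.next_id() for _ in range(4)]   # alternates odd and even IDs
even.healthy = False                      # simulate even-cluster outage
more = [svc.next_id() for _ in range(3)]  # served solely by the odd cluster
```

Because the two ranges never overlap, failover requires no coordination: the surviving cluster simply continues its own arithmetic progression.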

4.2 Message Distribution

Initially the platform used ActiveMQ VirtualTopic for coarse‑grained segregation, but it proved inflexible. A custom ActiveMQ plugin was developed to implement fine‑grained routing based on subscription rules, similar to an AOP mechanism.
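The idea behind the plugin can be sketched as broker-side rule matching: each consumer registers which table and which fields it cares about, and the broker delivers a message only to matching subscriptions. The rule shape (table plus changed-field set) and all names below are illustrative assumptions.

```python
class Subscription:
    """A fine-grained rule: which table and which changed fields a
    consumer cares about. fields=None means 'any field of this table'."""
    def __init__(self, name, table, fields=None):
        self.name = name
        self.table = table
        self.fields = set(fields) if fields else None

    def matches(self, msg):
        if msg["table"] != self.table:
            return False
        return self.fields is None or bool(self.fields & msg["changed"])

class RuleRouter:
    """Broker-side router: evaluates every subscription against each
    message, so consumers only receive the changes they asked for
    (the role the custom ActiveMQ plugin plays)."""
    def __init__(self):
        self.subs = []

    def subscribe(self, sub):
        self.subs.append(sub)

    def route(self, msg):
        return [s.name for s in self.subs if s.matches(msg)]

router = RuleRouter()
router.subscribe(Subscription("search-index", "video", {"title", "tags"}))
router.subscribe(Subscription("audit-log", "video"))  # any video change
router.subscribe(Subscription("game-feed", "game"))

targets = router.route({"table": "video", "changed": {"title"}})
```

Compared with VirtualTopic's topic-level segregation, matching on the changed-field set means a consumer interested only in `title` changes is never woken by `duration` updates.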

5.1 HBase Read Performance Issue

Because each write may trigger many reads, occasional RegionServer failures caused timeouts. The solution was to add a cache layer. After evaluating Redis, CouchBase, and MongoDB, MongoDB was chosen for its capacity and acceptable performance.

Each write generates a unique SessionID used as a version number. Cache invalidation and updates are performed asynchronously to reduce latency.
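The version check that makes asynchronous cache refresh safe can be sketched as below. This assumes SessionIDs are monotonically comparable (e.g. derived from a sequence); the class and method names are hypothetical.

```python
class VersionedCache:
    """Cache entries carry the write's SessionID as a version. An async
    refresh only installs data whose version is newer than what is
    cached, so a late-arriving refresh for an older write cannot
    clobber data from a newer write."""
    def __init__(self):
        self._store = {}  # key -> (session_id, value)

    def put_if_newer(self, key, session_id, value):
        cached = self._store.get(key)
        if cached is None or session_id > cached[0]:
            self._store[key] = (session_id, value)
            return True
        return False  # stale refresh, dropped

    def get(self, key):
        entry = self._store.get(key)
        return entry[1] if entry else None

cache = VersionedCache()
cache.put_if_newer("video:42", session_id=101, value={"title": "old"})
cache.put_if_newer("video:42", session_id=102, value={"title": "new"})
# A delayed refresh for the earlier write arrives out of order:
stale = cache.put_if_newer("video:42", session_id=101, value={"title": "old"})
current = cache.get("video:42")
```

The out-of-order refresh is rejected and the cache keeps the value from the newer write, which is the property that lets invalidation run asynchronously without risking stale reads winning.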

To keep HBase consistent with the cache, the SessionID column family is stored in‑memory (IN_MEMORY='true').

5.2 HBase Availability

Single‑node failures are handled by HBase itself, but whole‑cluster or data‑center failures required a cross‑region solution. A same‑city active‑standby architecture was built, with MongoDB acting as a write‑ahead log (WAL) replicated across three data centers.

A synchronizer service asynchronously replays the WAL into the primary HBase; reads combine data from MongoDB and HBase, with a Hystrix circuit breaker handling primary-HBase outages.
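A minimal sketch of this combined read path follows. The interfaces are assumptions: the WAL is modeled as a dict of recent not-yet-synced fields, HBase as a callable that may fail, and the circuit breaker is reduced to a plain try/except standing in for Hystrix.

```python
class ReadService:
    """Combines the MongoDB WAL (recent writes not yet replayed into
    HBase) with the primary HBase row. If primary HBase errors, fall
    back to WAL-only data instead of failing the read -- the role the
    Hystrix circuit breaker plays in the platform."""
    def __init__(self, wal, hbase):
        self.wal = wal      # dict-like: key -> recent field updates
        self.hbase = hbase  # callable: key -> full row, may raise

    def read(self, key):
        try:
            row = dict(self.hbase(key))
        except ConnectionError:
            row = {}  # primary down: degrade gracefully
        row.update(self.wal.get(key, {}))  # WAL holds the newest values
        return row

hbase_rows = {"video:42": {"title": "old", "duration": 3600}}

def hbase_get(key):
    return hbase_rows[key]

def hbase_down(key):
    raise ConnectionError("primary HBase unavailable")

wal = {"video:42": {"title": "new"}}
merged = ReadService(wal, hbase_get).read("video:42")    # newest title from WAL, rest from HBase
degraded = ReadService(wal, hbase_down).read("video:42")  # WAL-only during an outage
```

Because the WAL always holds the most recent writes, applying it last guarantees read-your-writes even before the synchronizer has caught up, and the degraded path still returns the freshest data available.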

5.3 ActiveMQ Problems and Migration to RocketMQ

ActiveMQ suffered from severe performance degradation under slow consumers and limited horizontal scalability. After evaluating Kafka and RocketMQ, RocketMQ was selected for its server‑side filtering capability.

The deployment consists of a three‑data‑center active‑standby cluster, ensuring that both message production and consumption remain unaffected by a single node or data‑center failure.

A custom RocketMQ client SDK pushes subscription rules to a FilterServer, enabling efficient server‑side filtering.

5.4 Expanding Read Capacity

To achieve horizontal scalability of reads, a business‑level replica (SlaveRead) was introduced. It synchronizes data from the primary store via messages, updates its own copy, and serves downstream read requests. Multiple replicas can be chained, reducing load on the primary database.
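The SlaveRead pattern can be sketched as below, with the upstream modeled as a simple lookup callable; the class name and chaining interface are illustrative assumptions, not the platform's API.

```python
class SlaveRead:
    """Business-level read replica: on each change notification it pulls
    the latest row from an upstream read service, updates its local
    copy, and serves downstream reads from that copy. Replicas chain by
    pointing one replica's upstream at another, so only the first
    replica ever loads the primary store."""
    def __init__(self, upstream):
        self.upstream = upstream  # callable: key -> latest row
        self.copy = {}

    def on_change(self, key):
        self.copy[key] = self.upstream(key)  # refresh local copy

    def read(self, key):
        return self.copy.get(key)

primary = {"video:42": {"title": "v1"}}
replica1 = SlaveRead(upstream=primary.__getitem__)
replica2 = SlaveRead(upstream=replica1.read)  # chained off replica1

primary["video:42"] = {"title": "v2"}
replica1.on_change("video:42")  # replica1 refreshes from the primary
replica2.on_change("video:42")  # replica2 refreshes from replica1
latest = replica2.read("video:42")
```

Each link in the chain absorbs the read traffic of everything downstream of it, which is what makes read capacity scale horizontally without adding load on the primary database.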

Conclusion

Overall, iQIYI's Basic Data Platform continuously improves its technology and service solutions to address real‑world business challenges. The platform has accumulated practical experience with HBase, RocketMQ, and high‑availability designs, and will keep exploring ways to enhance service capability, stability, and performance.
