Zhihu's Real-Time Computing Platform: From Skytree 1.0 to Mipha 2.0
Zhihu's real-time computing platform evolved from Skytree 1.0, built on Kubernetes, to Mipha 2.0, re-engineered around Flink SQL with unified metadata management, dynamic JAR loading, UDF support, a Protobuf format, CDC integration, and extensive operational optimizations. It now processes petabyte-scale data with high reliability.
Flink has become a dominant stream-processing framework due to its reliability and ease of use. Zhihu introduced Flink in 2018, evolving from version 1.6.x to 1.13.x. Today the platform handles petabyte-scale data across more than 700 real-time jobs and 500 batch jobs, consuming 50 TB of memory and 13,000 CPU cores, with peak traffic of 17 GB/s.
Skytree 1.0 was built on Kubernetes using Flink session mode, supporting Jar tasks, single‑cluster single‑job streaming, multi‑job batch, monitoring, alerts, and templates for common tasks such as Kafka‑to‑Hive. While well received, it suffered from complex Jar development, high maintenance cost, lack of native API access, and technical debt after Flink added native Kubernetes support.
To address these issues, Zhihu iterated to Mipha 2.0, redesigning the platform around Flink SQL.
Architecture
Mipha consists of three components:
mipha‑web‑api: user‑facing services for monitoring, job management, UDF and version control.
Flink SQL Gateway (customized Ververica gateway): supports SQL and Jar submissions on YARN or Kubernetes.
mipha‑metastore: unified metadata storage and proxy service.
Flink SQL Support
The gateway compiles user SQL into a job graph, provides catalog metadata (defaulting to Hive Metastore), and allows extensible UDF/Connector integration. Because the upstream SQL Gateway only supports Flink 1.12, Mipha adds compatibility patches for Flink 1.13.
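To make the submission flow concrete, here is a minimal Python sketch of how a client might talk to the gateway's session-based REST API. It assumes the v1 session/statement endpoints of the Ververica SQL gateway and a gateway address of `localhost:8083`; the helpers only build the requests, so an HTTP client of your choice would actually send them.

```python
import json

GATEWAY = "http://localhost:8083"  # assumed gateway address


def session_request(planner="blink", execution_type="streaming"):
    """Build the request for creating a gateway session (v1 API)."""
    body = json.dumps({"planner": planner, "execution_type": execution_type})
    return ("POST", f"{GATEWAY}/v1/sessions", body)


def statement_request(session_id, sql):
    """Build the request for submitting a SQL statement to an existing session."""
    body = json.dumps({"statement": sql})
    return ("POST", f"{GATEWAY}/v1/sessions/{session_id}/statements", body)
```

A client would first create a session, read the returned session ID, then submit statements against it; the gateway compiles each statement into a job graph and submits it to YARN or Kubernetes.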
Unified Metadata Management
mipha‑metastore validates and parses CREATE TABLE statements into structured data, stores them in a database, and uses Casbin for table‑level permissions, avoiding exposure of connection strings. It also proxies schema‑aware data sources (Hive, JDBC) and schema‑less sources (Kafka, Redis), and integrates Iceberg tables via a modified Hive catalog.
Dynamic Jar Loading
To keep the Flink base image stable, Mipha injects required JARs at container start via an entrypoint hook and environment variables, allowing hot‑swap of connectors, plugins, and fixes without rebuilding images.
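A minimal sketch of the entrypoint hook, assuming a colon-separated environment variable (here called `EXTRA_JARS`, a hypothetical name) listing jars to copy into Flink's lib directory before the real entrypoint starts:

```python
import os
import pathlib
import shutil


def inject_jars(flink_lib, env_var="EXTRA_JARS"):
    """Copy each jar listed (colon-separated) in env_var into Flink's lib
    directory; an entrypoint hook would run this before launching Flink."""
    lib = pathlib.Path(flink_lib)
    lib.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in filter(None, os.environ.get(env_var, "").split(":")):
        dest = lib / pathlib.Path(jar).name
        shutil.copy(jar, dest)
        copied.append(dest.name)
    return copied
```

Because the jars arrive at container start, connectors, plugins, and hotfixes can be swapped by changing the environment variable rather than rebuilding the base image.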
UDF Support
Users can register functions with syntax like CREATE FUNCTION [IF NOT EXISTS] function_name AS 'class_name'; registered functions are stored in the TableEnvironment. UDF JARs are loaded dynamically at runtime and versioned through Git-backed CI/CD pipelines.
Protobuf Format
To handle Protobuf messages, Mipha adopts a third‑party approach where users compile Protobuf classes into JARs and reference them in table DDL, e.g.:
CREATE TABLE `kafka`.`region1`.`test_table` (
  `member_id` BIGINT,
  `scene_code` STRING,
  `server_timestamp` BIGINT,
  `response_timestamp` BIGINT,
  `param` MAP<...>,          -- key/value types lost in extraction
  `items` ARRAY<ROW<...>>,   -- element row fields lost in extraction
  `ts` AS CURRENT_TIMESTAMP,
  WATERMARK FOR `ts` AS `ts` - INTERVAL '1' MINUTE
) WITH (
  'properties.bootstrap.servers' = 'xxx',
  'connector' = 'kafka',
  'format' = 'protobufV1',
  'topic' = 'xxxx',
  'protobufV1.class-name' = 'com.zhihu.ts.proto.Feed$Feed'
);
ADD JAR 'protobuf-java-3.19.1.jar';
ADD JAR 'ts-protos-0.0.6.jar';
SELECT * FROM `kafka`.`region1`.`test_table`;
Flink CDC Platform
Mipha integrates Flink CDC for change data capture, using a per-MySQL-instance CDC model and upsert-Kafka sinks to reduce binlog pressure. For large TiDB tables, a hybrid source combines TiKV snapshots (batch) with CDC streams (incremental) to build efficient real-time tables.
Operational Optimizations
Resource over‑provisioning is mitigated by configuring minimal CPU requests with high limits for JobManagers, saving ~0.9 CPU per job. Logs are collected and persisted to Hive/Elasticsearch for post‑mortem analysis. Dynamic checkpoint scheduling is exposed via REST APIs (startCheckpointScheduler, stopCheckpointScheduler, updateCheckpointConfig). Logical resource pools enforce per‑business quota and cost allocation.
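The dynamic checkpoint-scheduling API can be sketched as a small request builder. The text names the operations (startCheckpointScheduler, stopCheckpointScheduler, updateCheckpointConfig) but not their URLs, so the paths below are assumptions:

```python
# Endpoint paths are assumptions; only the operation names come from the text.
VALID_OPS = {"startCheckpointScheduler",
             "stopCheckpointScheduler",
             "updateCheckpointConfig"}


def checkpoint_request(job_id, operation, interval_ms=None):
    """Build (path, body) for a checkpoint-scheduling REST call."""
    if operation not in VALID_OPS:
        raise ValueError(f"unknown operation: {operation}")
    path = f"/jobs/{job_id}/checkpoints/{operation}"
    body = {"interval": interval_ms} if operation == "updateCheckpointConfig" else None
    return path, body
```

Pausing the scheduler before a planned restart, then resuming it with a longer interval during backfill, is the kind of workflow these endpoints enable.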
Future Plans
Zhihu aims to improve job debugging, enhance Flink SQL state‑schema evolution, provide feature‑service APIs, and further integrate with data‑lake technologies.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.