
Zhihu's Real-Time Computing Platform: From Skytree 1.0 to Mipha 2.0

Zhihu’s real‑time computing platform, initially built as Skytree 1.0 on Kubernetes and later re‑engineered as Mipha 2.0 with Flink SQL, unified metadata management, dynamic jar loading, UDF support, Protobuf format, CDC integration, and extensive operational optimizations, now processes petabyte‑scale data with high reliability.

DataFunTalk

Flink has become a dominant stream-processing framework thanks to its reliability and ease of use. Zhihu introduced Flink in 2018 and has since evolved from version 1.6.x to 1.13.x; today the platform handles petabyte-scale data with over 700 real-time jobs, 500 batch jobs, 50 TB of memory, 13k cores, and peak traffic of 17 GB/s.

Skytree 1.0 was built on Kubernetes using Flink session mode, supporting Jar tasks, single‑cluster single‑job streaming, multi‑job batch, monitoring, alerts, and templates for common tasks such as Kafka‑to‑Hive. While well received, it suffered from complex Jar development, high maintenance cost, lack of native API access, and technical debt after Flink added native Kubernetes support.

To address these issues, Zhihu iterated to Mipha 2.0, redesigning the platform around Flink SQL.

Architecture

Mipha consists of three components:

mipha‑web‑api: user‑facing services for monitoring, job management, UDF and version control.

Flink SQL Gateway (customized Ververica gateway): supports SQL and Jar submissions on YARN or Kubernetes.

mipha‑metastore: unified metadata storage and proxy service.

Flink SQL Support

The gateway compiles user SQL into a job graph, provides catalog metadata (defaulting to Hive Metastore), and allows extensible UDF/Connector integration. Because the upstream SQL Gateway only supports Flink 1.12, Mipha adds compatibility patches for Flink 1.13.
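Concretely, a job submitted through the gateway is plain Flink SQL whose tables are resolved from the catalog, so the statement itself carries no connection details. A minimal sketch (the table and column names here are illustrative, not from the original talk):

```sql
-- Hypothetical job text sent to the gateway: the kafka source and hive
-- sink are looked up in the catalog (Hive Metastore by default), and the
-- gateway compiles this statement into a job graph for submission.
INSERT INTO `hive`.`ods`.`click_events_hourly`
SELECT
  `user_id`,
  COUNT(*) AS `clicks`,
  TUMBLE_START(`ts`, INTERVAL '1' HOUR) AS `window_start`
FROM `kafka`.`region1`.`click_events`
GROUP BY `user_id`, TUMBLE(`ts`, INTERVAL '1' HOUR);
```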

Unified Metadata Management

mipha‑metastore validates and parses CREATE TABLE statements into structured data, stores them in a database, and uses Casbin for table‑level permissions, avoiding exposure of connection strings. It also proxies schema‑aware data sources (Hive, JDBC) and schema‑less sources (Kafka, Redis), and integrates Iceberg tables via a modified Hive catalog.
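The practical effect is that a table's DDL is registered once and its credentials never appear in user-visible SQL. A sketch of such a registration, with illustrative names and under the assumption that the proxy injects connection options at submission time:

```sql
-- Registered once in mipha-metastore; the WITH clause carries no
-- username/password or connection string -- those are filled in by the
-- metastore proxy, and Casbin enforces table-level permissions.
CREATE TABLE `jdbc`.`prod`.`user_profile` (
  `user_id` BIGINT,
  `city`    STRING
) WITH (
  'connector'  = 'jdbc',
  'table-name' = 'user_profile'
  -- 'url', 'username', 'password' injected by mipha-metastore
);
```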

Dynamic Jar Loading

To keep the Flink base image stable, Mipha injects required JARs at container start via an entrypoint hook and environment variables, allowing hot‑swap of connectors, plugins, and fixes without rebuilding images.

UDF Support

Users can register functions with syntax like CREATE FUNCTION [IF NOT EXISTS] &lt;function_name&gt; AS '&lt;class_name&gt;'; registered functions are stored in the TableEnvironment. UDF JARs are loaded dynamically at runtime and versioned through Git-backed CI/CD pipelines.
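Putting the pieces together, a registration-and-use round trip might look like the following (the jar, class, and table names are hypothetical):

```sql
-- The UDF jar is loaded dynamically at runtime, then the function is
-- registered into the TableEnvironment and used like a built-in.
ADD JAR 'zhihu-udfs-1.2.0.jar';
CREATE FUNCTION IF NOT EXISTS `mask_phone`
  AS 'com.zhihu.udf.MaskPhone';
SELECT mask_phone(`phone`) FROM `kafka`.`region1`.`signup_events`;
```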

Protobuf Format

To handle Protobuf messages, Mipha adopts a third‑party approach where users compile Protobuf classes into JARs and reference them in table DDL, e.g.:

CREATE TABLE `kafka`.`region1`.`test_table` (
  `member_id` BIGINT,
  `scene_code` STRING,
  `server_timestamp` BIGINT,
  `response_timestamp` BIGINT,
  `param` MAP<...>,
  `items` ARRAY<ROW<...>>,
  `ts` AS CURRENT_TIMESTAMP,
  WATERMARK FOR `ts` AS `ts` - INTERVAL '1' MINUTE
) WITH (
  'properties.bootstrap.servers' = 'xxx',
  'connector' = 'kafka',
  'format' = 'protobufV1',
  'topic' = 'xxxx',
  'protobufV1.class-name' = 'com.zhihu.ts.proto.Feed$Feed'
);
ADD JAR 'protobuf-java-3.19.1.jar';
ADD JAR 'ts-protos-0.0.6.jar';
SELECT * FROM `kafka`.`region1`.`test_table`;

Flink CDC Platform

Mipha integrates Flink CDC for change data capture, using a per-MySQL-instance CDC model and upsert-Kafka sinks to reduce binlog pressure. For large TiDB tables, a hybrid source combines TiKV snapshots (batch) with CDC streams (incremental) to build real-time tables efficiently.
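The per-instance fan-out can be sketched as follows: one CDC source reads the instance's binlog once, and changes are re-published to an upsert-Kafka table keyed by primary key, which downstream jobs consume instead of opening their own binlog connections. All table, database, and topic names below are illustrative:

```sql
-- mysql-cdc source: the single binlog reader for this MySQL instance.
CREATE TABLE `orders_cdc` (
  `order_id` BIGINT,
  `status`   STRING,
  PRIMARY KEY (`order_id`) NOT ENFORCED
) WITH (
  'connector'     = 'mysql-cdc',
  'hostname'      = 'xxx',
  'username'      = 'xxx',
  'password'      = 'xxx',
  'database-name' = 'shop',
  'table-name'    = 'orders'
);

-- upsert-kafka sink: changelog compacted by primary key, shared by all
-- downstream consumers, so binlog pressure stays constant.
CREATE TABLE `orders_upsert` (
  `order_id` BIGINT,
  `status`   STRING,
  PRIMARY KEY (`order_id`) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic'     = 'shop.orders',
  'properties.bootstrap.servers' = 'xxx',
  'key.format'   = 'json',
  'value.format' = 'json'
);

INSERT INTO `orders_upsert` SELECT * FROM `orders_cdc`;
```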

Operational Optimizations

Resource over‑provisioning is mitigated by configuring minimal CPU requests with high limits for JobManagers, saving ~0.9 CPU per job. Logs are collected and persisted to Hive/Elasticsearch for post‑mortem analysis. Dynamic checkpoint scheduling is exposed via REST APIs (startCheckpointScheduler, stopCheckpointScheduler, updateCheckpointConfig). Logical resource pools enforce per‑business quota and cost allocation.

Future Plans

Zhihu aims to improve job debugging, enhance Flink SQL state‑schema evolution, provide feature‑service APIs, and further integrate with data‑lake technologies.

Tags: big data, Flink, Kubernetes, real-time computing, metadata management, SQL Gateway
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
