SeaTunnel Multi‑Dimensional Practice at Vipshop: ClickHouse‑Hive Integration and Data Platform Integration
The article details Vipshop's multi‑dimensional use of SeaTunnel to integrate Hive and ClickHouse, describing data import/export challenges, tool selection among DataX, SeaTunnel and Spark, custom configurations, platform integration, and future improvements for high‑performance OLAP pipelines.
Guest Speaker: Wang Yu, Vipshop OLAP team leader; Editor: Zhang Detong; Platform: DataFunTalk.
Overview: This article presents Vipshop's multi‑dimensional practice of SeaTunnel, covering component selection, OLAP usage, integration with Vipshop's data platform, crowd‑analysis solutions, and custom enhancements.
01 ClickHouse Data Import Requirements and Pain Points
1. Vipshop OLAP Architecture – The diagram shows Vipshop's OLAP stack, where the data service and compute engine modules are responsible for data processing. Underlying warehouses include offline, real‑time, and lake warehouses. Compute engines use Presto, Kylin, and ClickHouse (treated as a compute engine to leverage its strong computation performance). SQL‑based services and custom non‑SQL analytics are provided for BI and business needs.
2. Requirements – By connecting Presto Connector and Spark components, Vipshop unifies Hive, Kudu, and Alluxio. Introducing ClickHouse created a data island; importing/exporting data between Hive and ClickHouse required substantial effort. The first requirement was to improve import/export efficiency and incorporate ClickHouse into the big‑data ecosystem. A second requirement was to accelerate slow Presto SQL queries by offloading OLAP calculations to ClickHouse. A third requirement was to combine real‑time Kafka/Flink data with Hive dimension tables for comprehensive analysis.
3. Pain Points – Three concerns drive the design: performance (ETL cycles as short as 5 minutes), data‑quality guarantees (row‑count consistency between Hive and ClickHouse after each load), and full support for complex data types such as HyperLogLog and Bitmap.
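The row‑count consistency check described above can be sketched in a few lines of Python. This is a minimal illustration only: the partition names and counts are hypothetical, and in a real pipeline the two dictionaries would be filled from `SELECT dt, count(*) ... GROUP BY dt` queries against Hive and ClickHouse respectively.

```python
def validate_row_counts(hive_counts: dict, ck_counts: dict) -> list:
    """Compare per-partition row counts from Hive (the source of truth)
    against ClickHouse (the import target); return mismatched partitions
    as (partition, expected, actual) tuples."""
    mismatches = []
    for partition, expected in hive_counts.items():
        actual = ck_counts.get(partition, 0)
        if actual != expected:
            mismatches.append((partition, expected, actual))
    return mismatches

# Hypothetical per-partition counts from both sides of an import run.
hive = {"2022-05-01": 1_000_000, "2022-05-02": 1_200_000}
ck   = {"2022-05-01": 1_000_000, "2022-05-02": 1_199_998}

print(validate_row_counts(hive, ck))  # [('2022-05-02', 1200000, 1199998)]
```

A mismatch would trigger the partition‑cleanup and re‑run path rather than letting a short‑loaded partition go live.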
02 ClickHouse and Hive Export/Import Tool Selection
Vipshop evaluated open‑source tools—DataX, SeaTunnel, and custom Spark‑JDBC solutions—without considering commercial products. SeaTunnel and Spark run on Vipshop's Yarn cluster, offering distributed reads/writes and several‑fold higher performance than the non‑distributed DataX.
SeaTunnel and Spark handle billions of rows smoothly, while DataX struggles beyond that scale. SeaTunnel also provides extensive plugin extensibility, supporting multiple data sources (including Redis) and offering better stability as a self‑contained tool.
Given the daily exposure‑table volume (tens of billions) and the 5‑minute processing SLA, Vipshop selected SeaTunnel as the primary data‑warehouse import/export tool.
03 Hive Data Import to ClickHouse
The Hive table shown is a three‑level product dimension table with a bitmap of user IDs. SeaTunnel configuration (env, source, sink) reads the Hive table via Spark, applies optional UDFs, and writes to ClickHouse using the RowBinary format for high‑speed bulk loading. Pre‑SQL and check‑SQL scripts handle data preparation, partition cleanup, and validation, enabling reliable re‑runs.
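The import job described above can be sketched as a SeaTunnel configuration along the following lines. This is a hedged sketch: the table names, field list, hosts, and credentials are hypothetical, and exact option names vary across SeaTunnel versions and should be checked against the deployed connector documentation.

```hocon
env {
  spark.app.name = "hive_to_clickhouse_product_dim"
  spark.executor.instances = 8          # sized to meet the 5-minute SLA
}

source {
  hive {
    # Read one partition of the three-level product dimension table
    pre_sql = "SELECT cat1, cat2, cat3, user_bitmap FROM dw.product_dim WHERE dt = '2022-05-01'"
    result_table_name = "product_dim"
  }
}

sink {
  clickhouse {
    host      = "ck01:8123"
    database  = "olap"
    table     = "product_dim_local"
    fields    = ["cat1", "cat2", "cat3", "user_bitmap"]
    username  = "writer"
    password  = "******"
    bulk_size = 20000                   # rows per RowBinary batch write
  }
}
```

The pre‑SQL (partition cleanup) and check‑SQL (validation) steps described above wrap this job in the scheduler rather than living in the SeaTunnel config itself.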
SeaTunnel’s sink can target either ClickHouse local tables or distributed tables. For distributed tables, Vipshop configures ClickHouse’s murmurHash64 sharding rule and lets SeaTunnel write directly to the appropriate hosts, reducing join costs and CPU load.
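For the distributed‑table case, the sink section gains sharding options. The option names below (`split_mode`, `sharding_key`) follow the SeaTunnel ClickHouse sink documentation in recent versions and are shown as an assumption; verify them against the version actually deployed.

```hocon
sink {
  clickhouse {
    host         = "ck01:8123,ck02:8123"   # hosts backing the local-table shards (hypothetical)
    database     = "olap"
    table        = "product_dim_local"     # write shard-local tables directly, not the distributed table
    split_mode   = true                    # shard rows on the SeaTunnel side before writing
    sharding_key = "user_id"               # must match the distributed table's murmurHash64 sharding key
  }
}
```

Writing shard‑locally this way keeps each user's rows co‑located, which is what reduces the later join cost and CPU load on the ClickHouse side.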
04 ClickHouse Data Export to Hive
Vipshop needs to export daily bitmap‑based audience segments (≈200 k users) from ClickHouse back to Hive for downstream ETL and external media delivery via PIKA. The SeaTunnel job reverses direction: ClickHouse is the source and Hive the sink, with data validation performed on the Hive side.
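The reverse job inverts source and sink. A sketch follows; since Vipshop implemented this path with custom plugins (described below), the plugin option names here are illustrative assumptions, as are the segment query and table names.

```hocon
source {
  clickhouse {
    host     = "ck01:8123"
    database = "olap"
    # Hypothetical query producing one user bitmap per audience segment
    sql = "SELECT crowd_id, groupBitmapState(user_id) AS user_bitmap FROM olap.crowd_detail GROUP BY crowd_id"
  }
}

sink {
  hive {
    # Target table feeding downstream ETL and PIKA delivery (names illustrative)
    table     = "dw.crowd_export"
    save_mode = "overwrite"
  }
}
```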
Custom plugins (plugin‑spark‑source‑ClickHouse, plugin‑spark‑sink‑Hive) were developed to map ClickHouse bitmap data to Hive DataFrames, handling schema conversion and bitmap type mapping.
05 Integration of SeaTunnel with Vipshop Data Platform
Vipshop’s scheduling system “ShuFang” orchestrates SeaTunnel jobs, assigning tasks to agents based on load, and running them in isolated containers similar to Kubernetes pods. The platform monitors job status, data quality, and triggers retries or alerts on failure.
SeaTunnel also supports crowd‑analysis workflows: bitmap audience data generated in ClickHouse is exported to Hive, combined with other ETL tasks, and finally pushed to external media stores.
06 Future Work
Upcoming improvements include addressing high CPU usage during ClickHouse writes by implementing a CK‑local read/write‑separation mode, adding more sinks (e.g., PIKA), and enhancing data‑type support.
Thank you for listening.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.