
Building a Unified Streaming‑Batch Lakehouse with Amoro Mixed Iceberg

This article describes how Shanghai Steel Union leveraged Amoro Mixed Iceberg on top of Apache Iceberg to create a unified streaming‑batch lakehouse, addressing small‑file and upsert challenges, simplifying architecture, improving data freshness, and providing a scalable solution for real‑time and batch analytics.

DataFunTalk

Amoro is a lakehouse management system built on open data‑lake tables such as Apache Iceberg, offering plug‑in data self‑optimisation mechanisms and management services for an out‑of‑the‑box lakehouse experience.

Authors: Xiong Jun, data development engineer at NetEase, focusing on Flink real‑time computing and lakehouse development with Apache Iceberg; Wang Tao, senior platform development engineer at NetEase, specializing in big data and lakehouse platform construction.

Business background: Shanghai Steel Union, a global commodity data service provider, faces high‑frequency real‑time reporting and quality‑inspection requirements, which call for both real‑time event data and CDC data from databases. The existing stack of Hive, Kafka, HBase, Kudu, and ClickHouse incurs high maintenance cost and data duplication.

To unify data management and reduce stack complexity, they decided to adopt a lakehouse architecture based on data‑lake technology.

Introducing Apache Iceberg: Iceberg provides upsert support and minute‑level latency, and integrates with Flink, Spark, Trino, and other engines. However, two pain points emerged: (1) excessive small files generated by frequent commits, leading to OOM errors and merge conflicts in compaction jobs; (2) no way to incrementally consume upserted Iceberg tables.
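On plain Iceberg, keeping small files under control typically means scheduling compaction by hand, e.g. with Spark's rewrite_data_files procedure; a minimal sketch (catalog, table, and the target file size are placeholders):

```sql
-- Manually scheduled compaction of a plain Iceberg table (Spark SQL).
-- With frequent upsert commits these jobs grow large and are prone to
-- OOM errors and commit conflicts with the ingestion job.
CALL my_catalog.system.rewrite_data_files(
  table   => 'ods.my_table',
  options => map('target-file-size-bytes', '134217728')  -- 128 MB
);
```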

Introducing Amoro Mixed Iceberg: Amoro extends Iceberg with enhanced upsert support, automatic file merging, and Merge‑On‑Read (MOR) capabilities. Benefits include:

- Enhanced upsert support via primary‑key tables, enabling ODS upserts and incremental CDC consumption, plus millisecond‑level data delivery via the Log Store.

- Out‑of‑the‑box data optimisation (automatic file merging, orphan‑file cleanup, data expiration), eliminating manual batch merge jobs and their OOM issues.

- Improved data timeliness: MOR queries via Trino achieve minute‑level latency, while continuous file merging keeps query performance stable.
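As an illustrative sketch (table name and columns are hypothetical), a Mixed Iceberg primary‑key table can be declared in Flink SQL roughly like this:

```sql
-- Hypothetical ODS table: rows sharing a primary key are merged as
-- upserts, and downstream jobs can consume its change data incrementally.
CREATE TABLE dw_amoro_catalog.ods.orders (
    `id`          BIGINT,
    `member_id`   BIGINT,
    `amount`      DECIMAL(18, 2),
    `update_time` TIMESTAMP(3),
    PRIMARY KEY (id) NOT ENFORCED
);
```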

Implementation architecture: The storage layer uses Amoro Mixed Iceberg tables to build a unified streaming‑batch lakehouse, replacing the previous Hive‑based offline and Kafka‑based real‑time pipelines. The ODS layer is built entirely on Amoro tables, the EDW layer is constructed via Flink double‑stream joins and dimension joins, and the ADS layer retains ClickHouse, Elasticsearch, etc., receiving data through both streaming and batch paths.

Computation primarily runs on Flink SQL, supplemented by Trino for ad‑hoc queries.

To date, more than 30 Mixed Iceberg tables have been onboarded, handling 20 million updates per day.

Table management: AMS (the Amoro Management Service) provides built‑in metadata management, allowing easy table creation, automatic file merging, and data expiration with minimal configuration. The current deployment runs Amoro 0.5.0 with a Flink Optimizer for continuous file merging.
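Optimisation behaviour is driven by table properties managed through AMS; a sketch on a hypothetical table (property names follow Amoro 0.5.x conventions and should be verified against your version's documentation):

```sql
-- Illustrative self-optimizing settings; AMS schedules the Flink
-- Optimizer to merge files continuously for tables where this is enabled.
ALTER TABLE dw_amoro_catalog.ods.orders SET (
  'self-optimizing.enabled' = 'true',
  'self-optimizing.group'   = 'default',
  'change.data.ttl.minutes' = '10080'   -- expire change data after 7 days
);
```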

Data ingestion: Two data sources feed the lakehouse: CDC streams from databases (via Flink CDC) and event logs from Kafka. Unique IDs serve as primary keys, so duplicate logs are resolved by upsert.
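The CDC path can be sketched in Flink SQL with the Flink CDC MySQL connector (connection details and table names are placeholders):

```sql
-- CDC source table; requires the flink-connector-mysql-cdc dependency.
CREATE TEMPORARY TABLE mysql_orders (
    `id`          BIGINT,
    `member_id`   BIGINT,
    `amount`      DECIMAL(18, 2),
    `update_time` TIMESTAMP(3),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector'     = 'mysql-cdc',
  'hostname'      = 'xxx',
  'port'          = '3306',
  'username'      = 'xx',
  'password'      = 'xx',
  'database-name' = 'ods',
  'table-name'    = 'orders'
);

-- Writing into a primary-key Mixed Iceberg table turns duplicate IDs
-- into upserts, which resolves repeated logs automatically.
INSERT INTO dw_amoro_catalog.ods.orders
SELECT id, member_id, amount, update_time FROM mysql_orders;
```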

Real‑time downstream construction: Incremental CDC consumption from Mixed Iceberg tables powers both the EDW and ADS layers. High‑frequency tables use the Log Store for millisecond latency; the rest use the Change Store with 1‑5 minute latency. Flink double‑stream joins and dimension look‑up joins (e.g., against ClickHouse) are employed.
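Incremental consumption can be expressed with a streaming read hint; a sketch on a hypothetical table (hint and option names vary across Arctic/Amoro releases, so verify against the documentation for your version):

```sql
-- Streaming read: the job keeps tailing the table's change data
-- instead of scanning a one-off snapshot.
SELECT id, member_id, amount, update_time
FROM dw_amoro_catalog.ods.orders
  /*+ OPTIONS('streaming' = 'true', 'monitor-interval' = '10s') */;
```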

Example of a Flink SQL look‑up join with ClickHouse:

CREATE CATALOG dw_amoro_catalog WITH (
  'type' = 'arctic',
  'metastore.url' = 'zookeeper://xxx:xxx/dw_amoro_cluster/dw_amoro_catalog',
  'properties.auth.type' = 'simple',
  'properties.auth.simple.hadoop_username' = 'hdfs'
);

CREATE TEMPORARY TABLE xxx (
    `id` BIGINT,
    `name` STRING,
    `update_time` TIMESTAMP(3),
    `_sign` TINYINT,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'clickhouse',
  'url' = 'clickhouse://xxx:xx',
  'username' = 'xx',
  'password' = 'xx',
  'database-name' = 'ODS',
  'table-name' = 'xx',
  'lookup.cache.max-rows' = '200000',
  'lookup.cache.ttl' = '100000',
  'sink.batch-size' = '1000',
  'sink.flush-interval' = '5000',
  'sink.max-retries' = '3'
);

-- INSERT INTO xxx
SELECT
  FROM_UNIXTIME(A.INCOMETIME / 1000, 'yyyy-MM') DATA_DATE,
  A.MEMBERID MEMBER_ID,
  mm.name MEMBER_NAME,
  FROM_UNIXTIME(A.INVOICETIME / 1000, 'yyyy-MM-dd HH:mm:ss') INVOICETIME,
  FROM_UNIXTIME(A.INCOMETIME / 1000, 'yyyy-MM-dd') INCOMETIME,
  LISTAGG(DISTINCT D.NAME) FNC_TYPE_NAME,
  A.AMOUNT, A.STATUS, A.INVOICECODE, A.ADMINID ADMIN_ID
FROM (
  SELECT *, PROCTIME() ts FROM dw_amoro_catalog.ods.oracle_new_finance_fnc_finance
) A
LEFT JOIN dw_amoro_catalog.ods.xxx B ON A.INVOICEID = B.INVOICEID
LEFT JOIN dw_amoro_catalog.ods.xxx C ON B.INVOICEID = C.INVOICEID
LEFT JOIN dw_amoro_catalog.ods.xxx D ON C.BUSINESSTYPEID = D.TYPEID
JOIN xxx FOR SYSTEM_TIME AS OF A.ts AS mm ON mm.id = A.MEMBERID
WHERE A.STATUS = 2
  AND FROM_UNIXTIME(A.INCOMETIME / 1000, 'yyyy-MM') = FROM_UNIXTIME(UNIX_TIMESTAMP(), 'yyyy-MM')
  AND mm._sign = 1
GROUP BY A.INVOICEID, A.MEMBERID, mm.name, A.INVOICETIME, A.INCOMETIME,
         A.AMOUNT, A.STATUS, A.INVOICECODE, A.ADMINID;

MOR queries: A Trino cluster (5 nodes × 24 GB) provides ad‑hoc queries with minute‑level latency, leveraging Amoro's Merge‑On‑Read support.
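From Trino's side, Merge‑On‑Read is transparent: an ordinary query sees base and change data already merged at read time (catalog and table names below are placeholders):

```sql
-- Ad-hoc Trino query; the Amoro connector merges base and change
-- files at read time, so the result reflects the latest upserts.
SELECT member_id, SUM(amount) AS total_amount
FROM amoro.ods.orders
GROUP BY member_id;
```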

Future plans:

- Use Flink MOR to read Amoro Mixed Format tables for downstream construction.

- Introduce Mixed Format look‑up joins to replace ClickHouse in certain dimension‑join scenarios.

- Replace Kudu and Impala with Mixed Format tables to eliminate data silos and reduce memory consumption.

Conclusion: By adopting Amoro's out‑of‑the‑box capabilities and Mixed Iceberg's enhanced upsert and MOR features, the team solved the production challenges of streaming‑batch integration and data freshness, and will continue contributing to the open‑source community.
