
How Uber Reduced Data Freshness from Hours to Minutes Using Flink Streaming

Uber rebuilt its data‑lake ingestion pipeline on Apache Flink, replacing batch jobs with a streaming architecture that improves data freshness from hours to minutes, lowers compute usage by 25%, and addresses challenges such as small‑file proliferation, partition skew, and checkpoint‑commit synchronization at petabyte scale.

Past Memory Big Data

Introduction

At Uber, the data lake underpins analytics and machine learning across the company. Historically, ingestion was driven by batch jobs, giving data freshness measured in hours. To meet the demand for near‑real‑time insights, Uber rebuilt the ingestion stack on Apache Flink®, achieving fresher data, lower cost, and petabyte‑scale scalability.

Why Streaming?

Two key drivers prompted the shift: data freshness and cost efficiency. Business units such as delivery, rides, finance, and marketing required data with minute‑level latency for real‑time experiments and model development. Batch ingestion introduced delays of several hours, or even days, hindering iteration and decision‑making. Flink‑based streaming improved freshness from hours to minutes, which shortens model release and experiment cycles and lets analyses run on more current data.

Batch jobs built on Apache Spark™ are resource‑intensive, scheduling large distributed computations at fixed intervals regardless of workload variability. At Uber’s scale—thousands of datasets and hundreds of petabytes—this meant tens of thousands of CPU cores running daily. Streaming eliminates the overhead of frequent batch scheduling, allowing resources to scale smoothly with traffic.

Architecture Overview

The IngestionNext system consists of multiple layers. In the data plane, events arrive in Apache Kafka® and are consumed by Flink jobs, which write to the lake in Apache Hudi™ format, providing transactional commits, rollbacks, and time‑travel capabilities. Freshness and integrity are measured end‑to‑end, from source to sink.
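One way to read "freshness measured end‑to‑end" is as the age of the newest source event that readers can currently see in the lake. The sketch below makes that definition concrete. The `Commit` record and its two fields are hypothetical names chosen for illustration; they are not Hudi's actual commit metadata or Uber's metric definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical summary of one lake commit (illustrative only)."""
    commit_time_s: float     # when the commit became visible to readers
    max_event_time_s: float  # newest source-event timestamp it contains


def freshness_seconds(commits: list[Commit], now_s: float) -> float:
    """Age of the newest event visible in the lake at time now_s."""
    visible = [c for c in commits if c.commit_time_s <= now_s]
    if not visible:
        raise ValueError("no commit is visible yet")
    newest_event = max(c.max_event_time_s for c in visible)
    return now_s - newest_event
```

Under this definition, freshness grows between commits and drops each time a commit lands, so commit frequency bounds the best achievable value.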

A control plane automates job lifecycle management (create, deploy, restart, stop, delete), configuration changes, and health checks, enabling consistent and safe ingestion across thousands of datasets. The system also incorporates regional failover and fallback strategies so that, during interruptions, ingestion can shift regions or temporarily run in batch mode without data loss.
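The lifecycle operations listed above can be pictured as a small state machine that rejects unsafe transitions. This is a minimal sketch under assumptions: the state names and the transition table are invented for illustration and are not taken from Uber's control plane.

```python
from enum import Enum


class JobState(Enum):
    CREATED = "created"
    RUNNING = "running"
    STOPPED = "stopped"
    DELETED = "deleted"


# Assumed (state, action) -> next-state table; a real system may differ.
TRANSITIONS = {
    (JobState.CREATED, "deploy"): JobState.RUNNING,
    (JobState.RUNNING, "restart"): JobState.RUNNING,
    (JobState.RUNNING, "stop"): JobState.STOPPED,
    (JobState.STOPPED, "deploy"): JobState.RUNNING,
    (JobState.CREATED, "delete"): JobState.DELETED,
    (JobState.STOPPED, "delete"): JobState.DELETED,
}


def apply_action(state: JobState, action: str) -> JobState:
    """Return the next state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"cannot {action} a job in state {state.value}"
        ) from None
```

In this sketch a running job must be stopped before it can be deleted, which is one simple way to keep automated operations safe across many datasets.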

Main Challenges and Solutions

Small Files

Streaming ingestion generates many small Apache Parquet™ files, degrading query performance and increasing metadata and storage overhead. The traditional record‑by‑record merge decompresses each Parquet file, converts columnar data to rows, merges, then recompresses—an expensive process.

IngestionNext introduces row‑group‑level compaction that operates directly on Parquet’s native columnar structure, avoiding costly recompression and achieving more than a ten‑fold speedup in compaction.
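The difference between the two merge strategies can be illustrated with a toy model. The sketch below does not use Parquet at all: `zlib` plus JSON stand in for Parquet's compression and encoding, and `RowGroup` is a hypothetical type. It only shows where the work goes, not how IngestionNext is implemented.

```python
import json
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroup:
    payload: bytes  # compressed, encoded data; opaque to the stitcher
    num_rows: int


def make_row_group(rows: list[dict]) -> RowGroup:
    """Encode and compress rows (a stand-in for Parquet encoding)."""
    return RowGroup(zlib.compress(json.dumps(rows).encode()), len(rows))


def read_rows(row_groups: list[RowGroup]) -> list[dict]:
    """Decompress and decode every row group back into rows."""
    rows: list[dict] = []
    for rg in row_groups:
        rows.extend(json.loads(zlib.decompress(rg.payload)))
    return rows


def merge_record_by_record(files: list[list[RowGroup]]) -> list[RowGroup]:
    """Traditional path: decode every row, then re-encode and recompress."""
    rows = [row for f in files for row in read_rows(f)]
    return [make_row_group(rows)]


def merge_row_groups(files: list[list[RowGroup]]) -> list[RowGroup]:
    """Row-group path: copy payloads untouched; only file metadata changes."""
    return [rg for f in files for rg in f]
```

Both functions produce files with the same rows, but `merge_row_groups` never calls `zlib`. In this model the stitched output keeps the original row-group boundaries, so it contains as many row groups as its inputs combined.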

The open‑source community (e.g., Apache Hudi PR #13365) explored schema‑evolution‑aware merges using padding and masks, but this added significant implementation complexity and maintenance risk.

Our approach enforces schema consistency, merging only files that share the same schema, eliminating the need for masks or low‑level code changes while delivering faster, more reliable compaction.
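A compaction planner that follows this rule might look like the sketch below. It assumes each small file is described by a hypothetical `(path, schema_fingerprint, size_bytes)` tuple and that batches are capped at a target size; the fingerprinting method and the greedy packing are illustrative choices, not details from the source.

```python
from collections import defaultdict


def plan_compaction(files, target_bytes):
    """Group files by schema, then pack each group into size-capped batches.

    files: iterable of (path, schema_fingerprint, size_bytes) tuples.
    Returns a list of (schema_fingerprint, [paths]) merge batches.
    """
    by_schema = defaultdict(list)
    for path, schema, size in files:
        by_schema[schema].append((path, size))

    plans = []
    for schema, members in by_schema.items():
        batch, batch_size = [], 0
        for path, size in sorted(members):
            # Close the current batch before it would exceed the target.
            if batch and batch_size + size > target_bytes:
                plans.append((schema, batch))
                batch, batch_size = [], 0
            batch.append(path)
            batch_size += size
        if batch:
            plans.append((schema, batch))

    # A batch with a single file has nothing to merge with.
    return [(schema, batch) for schema, batch in plans if len(batch) > 1]
```

Because files with different fingerprints never share a batch, the merge step never has to reconcile columns, which is what removes the need for padding or masks.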

Partition Skew

Transient downstream slowdowns (e.g., GC pauses) cause Kafka consumption imbalance among Flink subtasks, leading to skew, reduced compaction efficiency, and slower queries.

We mitigated this through operational tuning (aligning parallelism with partitions, adjusting pull parameters), connector‑level fairness (round‑robin, pausing/resuming overloaded partitions, per‑partition quotas), and enhanced observability (per‑partition lag metrics, auto‑scaling based on skew detection, targeted alerts).
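Two of the ideas above, skew detection from per‑partition lag and quota‑based pausing, can be sketched in a few lines. The skew metric (maximum lag divided by mean lag) and the meaning of the quota (records buffered but not yet written, per partition) are assumptions made for illustration; the source does not specify either.

```python
from statistics import mean


def skew_ratio(lag_by_partition: dict[int, int]) -> float:
    """Max lag divided by mean lag; 1.0 means perfectly balanced."""
    lags = list(lag_by_partition.values())
    if not lags:
        return 1.0
    avg = mean(lags)
    return max(lags) / avg if avg > 0 else 1.0


def partitions_to_pause(buffered_by_partition: dict[int, int],
                        quota: int) -> list[int]:
    """Partitions whose buffered record count exceeds the assumed quota."""
    return sorted(
        p for p, n in buffered_by_partition.items() if n > quota
    )
```

A monitoring loop could alert or trigger scaling when `skew_ratio` crosses a threshold, while the connector pauses fetching from the partitions returned by `partitions_to_pause` and resumes them once they fall back under the quota.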

Checkpoint and Commit Synchronization

Flink checkpoints track consumed offsets, while Hudi commits track write operations. Misalignment during failures can cause data loss or duplication.

To resolve this, we extended Hudi commit metadata to embed the Flink checkpoint ID, enabling deterministic recovery during rollbacks or failovers.
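With the checkpoint ID stored in each commit, a restarted job can compare the checkpoint it restored from against what the lake actually contains. The sketch below shows one plausible decision policy. The metadata key name, the `Commit` type, and the three actions are hypothetical and are not taken from Hudi's API or from Uber's implementation.

```python
from dataclasses import dataclass, field

CHECKPOINT_KEY = "flink.checkpoint.id"  # hypothetical metadata key


@dataclass
class Commit:
    instant: str                        # commit identifier on the timeline
    metadata: dict = field(default_factory=dict)


def plan_recovery(timeline: list[Commit], restored_checkpoint_id: int):
    """Decide what to do after restoring from the given Flink checkpoint.

    Returns (action, instants):
      "rollback": commits newer than the restored checkpoint would be
                  written again on replay, so they are undone first.
      "resume":   lake and checkpoint agree; continue consuming.
      "recommit": the checkpoint is ahead of the lake, so its writes
                  must be committed (or replayed) to avoid losing data.
    """
    tagged = [c for c in timeline if CHECKPOINT_KEY in c.metadata]
    ahead = [c.instant for c in tagged
             if c.metadata[CHECKPOINT_KEY] > restored_checkpoint_id]
    if ahead:
        return ("rollback", ahead)
    latest = max((c.metadata[CHECKPOINT_KEY] for c in tagged), default=None)
    if latest == restored_checkpoint_id:
        return ("resume", [])
    return ("recommit", [])
```

The point of the sketch is that the decision depends only on values stored durably in the commit timeline and the checkpoint, so the same failure always leads to the same recovery path.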

Results

Deploying the Flink‑based ingestion platform on several of Uber’s largest datasets demonstrated minute‑level freshness and a 25% reduction in compute usage compared with batch ingestion. The figure below shows a concrete improvement in data freshness.

Figure: Data freshness before and after streaming ingestion

Next Steps

IngestionNext has transformed ingestion from batch to streaming, dramatically lowering latency. However, downstream raw‑data transformation and analysis still lag behind. For consumers to see fresher data, the real‑time capability must extend end‑to‑end, from ingestion through transformation to live insights. This is critical because Uber’s data lake serves delivery, rides, ML, passenger, market, map, finance, and marketing analytics, making freshness a top priority across these domains.

Conclusion

The migration from batch to streaming marks a major milestone in Uber’s data platform evolution. Rebuilding ingestion on Apache Flink delivers fresher data, stronger reliability, and scalable efficiency at petabyte scale. The design emphasizes automated resilience and operational simplicity, allowing engineers to focus on data‑driven products rather than pipeline management, and positions the organization to pursue real‑time ETL and analytics for a complete data‑freshness loop.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Past Memory Big Data

A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.
