
Building a Large-Scale Near Real-Time Data Analytics Platform at Lyft Using Apache Flink

Lyft transformed its legacy data pipeline by designing a cloud‑native, Flink‑based near real‑time analytics platform that ingests billions of events, writes Parquet files to S3, leverages Presto for interactive queries, and implements multi‑stage non‑blocking ETL, fault‑tolerant back‑fill, and extensive performance optimizations.

DataFunTalk

Lyft, a North American ride‑sharing platform, needed a large‑scale near‑real‑time analytics system to process billions of events per day. The legacy platform suffered from high latency, small‑file overhead, and limited schema support.

To address these issues, Lyft built a new architecture based on Apache Flink running on AWS. Event streams from mobile apps and backend services are ingested via Kinesis, processed by Flink, and written directly to S3 in columnar Parquet format. Hive stores table metadata, and Presto provides interactive SQL queries for data engineers, analysts, and ML scientists.
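The output layout this implies is a Hive-style partitioned directory tree on S3, which is what lets Hive register partitions and Presto prune them. A minimal sketch of building such an object key, assuming illustrative partition columns (`event_name`, `ds`, `hr`) and file naming; the actual layout Lyft uses is not specified in the talk:

```python
from datetime import datetime, timezone

def parquet_key(event_name: str, event_time: datetime, part: int) -> str:
    """Build a hypothetical Hive-style partitioned S3 key for a Parquet file,
    partitioned by event name, date, and hour."""
    return (
        f"events/event_name={event_name}"
        f"/ds={event_time:%Y-%m-%d}/hr={event_time:%H}"
        f"/part-{part:05d}.parquet"
    )

ts = datetime(2019, 7, 1, 14, 30, tzinfo=timezone.utc)
key = parquet_key("ride_requested", ts, 3)
# → events/event_name=ride_requested/ds=2019-07-01/hr=14/part-00003.parquet
```

Encoding partition values into the path means Hive only needs metadata, not a file scan, to locate a partition, and Presto can skip entire date/hour subtrees at query planning time.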

The platform runs multi-stage, non-blocking ETL orchestrated by Apache Airflow, compressing and deduplicating data before persisting it. Flink checkpoints every three minutes, and a global state aggregator coordinates per-partition watermarks to ensure consistency.
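Two of those mechanisms can be sketched in a few lines: deduplication keyed on an event ID (needed because at-least-once delivery can replay events), and a global watermark taken as the minimum across partitions (an output partition is only complete up to the slowest input). This is an illustrative sketch; the field name `event_id` and the shard keys are assumptions, not Lyft's actual schema:

```python
def dedupe(events):
    """Drop repeated deliveries of the same event, keeping the first
    occurrence of each event_id (at-least-once delivery can replay)."""
    seen, out = set(), []
    for e in events:
        if e["event_id"] not in seen:
            seen.add(e["event_id"])
            out.append(e)
    return out

def global_watermark(partition_watermarks):
    """A partition of output is only complete up to the slowest input
    partition, so the global watermark is the minimum across shards."""
    return min(partition_watermarks.values())

deduped = dedupe([{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}])
wm = global_watermark({"shard-0": 300, "shard-1": 120, "shard-2": 290})
# deduped keeps ["a", "b"]; wm == 120 (the slowest shard)
```

Taking the minimum is the conservative choice: a partition is only declared ready for the next ETL stage once every upstream shard has advanced past it.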

Performance work reduced the required cluster size by a factor of ten. Further optimizations include leveraging Parquet statistics and partition pruning to speed up Presto queries, adding entropy prefixes and marker files to improve S3 I/O, and smart schema-evolution checks.
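Two of these optimizations are simple enough to sketch: an entropy prefix hashes part of the key so objects spread across S3's internal index partitions rather than hot-spotting one lexicographic range, and Parquet row-group min/max statistics let a reader skip row groups whose value range cannot match the predicate. The function names and the 4-character prefix length below are illustrative assumptions:

```python
import hashlib

def entropy_key(base_key: str) -> str:
    """Prepend a short hash prefix so object keys spread across S3's
    internal index partitions instead of one lexicographic hot spot."""
    return hashlib.md5(base_key.encode()).hexdigest()[:4] + "/" + base_key

def can_skip_row_group(stats: dict, lo, hi) -> bool:
    """Skip a Parquet row group whose [min, max] column statistics
    cannot overlap the query predicate range [lo, hi]."""
    return stats["max"] < lo or stats["min"] > hi

key = entropy_key("events/ds=2019-07-01/part-00001.parquet")
skip = can_skip_row_group({"min": 5, "max": 9}, lo=10, hi=20)  # True: 9 < 10
```

The trade-off of entropy prefixes is that keys are no longer listable in time order, which is one reason marker files are useful to signal that a partition's data is complete.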

Fault tolerance is achieved through data back‑fill streams that replay missed events after job failures, and through idempotent Airflow scheduling, atomic partition swaps, and self‑healing partition management.
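The atomic partition swap can be sketched as a single metadata update: a back-fill writes to a staging location, then the partition pointer in the metastore is flipped in one step, so readers see either all old or all new data, never a mix. Re-running the same swap is harmless, which is what makes Airflow retries idempotent. The dict-based metastore below is a stand-in for Hive's partition metadata:

```python
def publish_partition(metastore: dict, partition: str, new_location: str) -> str:
    """Repoint a partition at freshly back-filled files in one metadata
    update; re-running the same swap is a no-op-equivalent (idempotent)."""
    old_location = metastore.get(partition)
    metastore[partition] = new_location  # the single atomic step
    return old_location

metastore = {"ds=2019-07-01": "s3://bucket/events/ds=2019-07-01/v1"}
old = publish_partition(
    metastore, "ds=2019-07-01", "s3://bucket/events/ds=2019-07-01/v2"
)
# metastore now points at .../v2; old holds .../v1 for cleanup
```

Keeping the old location around until the swap succeeds is also what enables self-healing: a failed back-fill leaves the previous version intact and queryable.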

Future plans involve deploying Flink on Kubernetes, expanding the ingestion framework to support databases and logs, automating event‑driven ETL, and further query optimizations using Parquet metadata.

Tags: Flink, real-time analytics, streaming, AWS, ETL, data lake, Parquet
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
