Building a Large-Scale Near Real-Time Data Analytics Platform at Lyft Using Apache Flink
Lyft transformed its legacy data pipeline by designing a cloud‑native, Flink‑based near real‑time analytics platform that ingests billions of events, writes Parquet files to S3, leverages Presto for interactive queries, and implements multi‑stage non‑blocking ETL, fault‑tolerant back‑fill, and extensive performance optimizations.
Lyft, a North American ride‑sharing platform, needed a large‑scale near‑real‑time analytics system to process billions of events per day. The legacy platform suffered from high latency, small‑file overhead, and limited schema support.
To address these issues, Lyft built a new architecture based on Apache Flink running on AWS. Event streams from mobile apps and backend services are ingested via Kinesis, processed by Flink, and written directly to S3 in columnar Parquet format. Hive stores table metadata, and Presto provides interactive SQL queries for data engineers, analysts, and ML scientists.
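As a rough illustration of the write path, the sketch below groups events into time-based partitions before they would be written out as Parquet files. The hourly `ds=/hr=` layout, the `event_time_ms` field, and both function names are assumptions made for this example; the article does not describe Lyft's actual partition scheme.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_for(event_time_ms):
    """Map an event timestamp (epoch milliseconds, UTC) to an hourly partition path.

    The ds=/hr= layout is a hypothetical Hive-style convention, not Lyft's.
    """
    ts = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return ts.strftime("ds=%Y-%m-%d/hr=%H")


def bucket_events(events):
    """Group events by the partition their event time falls into."""
    buckets = defaultdict(list)
    for event in events:
        buckets[partition_for(event["event_time_ms"])].append(event)
    return dict(buckets)
```

Partitioning by event time rather than arrival time is what lets a query engine skip whole directories when a query filters on a time range.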
The platform runs multi-stage, non-blocking ETL orchestrated by Apache Airflow, which compresses and deduplicates data before persisting it. Checkpointing occurs every three minutes, and a global state aggregator coordinates partition watermarks across the job to ensure consistency.
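One way to picture the watermark coordination is as a minimum over per-subtask watermarks: a partition can be treated as complete only once every parallel writer has moved past its end time. The following is a minimal sketch under that assumption. The class `WatermarkAggregator` and its methods are hypothetical names for illustration and do not correspond to a Flink API or to Lyft's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkAggregator:
    """Tracks the latest watermark reported by each parallel subtask."""
    parallelism: int
    watermarks: dict = field(default_factory=dict)

    def report(self, subtask, watermark):
        # Watermarks only move forward; ignore stale reports.
        previous = self.watermarks.get(subtask, float("-inf"))
        self.watermarks[subtask] = max(previous, watermark)

    def global_watermark(self):
        # Undefined until every subtask has reported at least once.
        if len(self.watermarks) < self.parallelism:
            return None
        return min(self.watermarks.values())

    def closed_partitions(self, partition_end_times):
        """Return partitions whose end time is at or below the global watermark."""
        current = self.global_watermark()
        if current is None:
            return []
        return sorted(p for p, end in partition_end_times.items() if end <= current)
```

Taking the minimum is the conservative choice: a single slow subtask holds every partition open, which trades latency for the guarantee that no late file lands in a partition already marked complete.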
The redesign reduced cluster size by a factor of ten. Other performance optimizations include using Parquet statistics and partition pruning to speed up Presto queries, adding entropy prefixes and marker files to improve S3 I/O, and performing schema evolution checks.
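The entropy-prefix idea can be sketched as follows: a short hash is placed at the front of each object key so that writes spread across the S3 key space instead of piling onto one time-ordered prefix. The function name `entropy_key`, the choice of SHA-256, and the four-character length are assumptions for illustration, not details given in the article.

```python
import hashlib


def entropy_key(base_prefix, partition_path, filename, length=4):
    """Build an S3 object key with a short, deterministic hash prefix.

    The hash is derived from the logical path, so the same file always
    maps to the same key and can be located again without a lookup table.
    """
    logical_path = f"{base_prefix}/{partition_path}/{filename}"
    digest = hashlib.sha256(logical_path.encode("utf-8")).hexdigest()
    return f"{digest[:length]}/{logical_path}"
```

Because the prefix makes keys impossible to list by partition alone, this approach pairs naturally with a metadata store or marker files that record where each partition's data lives.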
Fault tolerance is achieved through data back‑fill streams that replay missed events after job failures, and through idempotent Airflow scheduling, atomic partition swaps, and self‑healing partition management.
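The combination of idempotent jobs and atomic partition swaps can be illustrated with a small in-memory model: a compaction job writes deduplicated output to a location derived from its content, then repoints the partition in a single step. `PartitionCatalog` and `compact_partition` are hypothetical stand-ins for a metastore and an ETL task, and the content-hash naming is an assumption of this sketch.

```python
import hashlib


class PartitionCatalog:
    """In-memory stand-in for a table metastore: partition name -> data location."""

    def __init__(self):
        self._locations = {}

    def swap(self, partition, new_location):
        """Repoint the partition in one step and return the previous location."""
        old = self._locations.get(partition)
        self._locations[partition] = new_location
        return old

    def location(self, partition):
        return self._locations.get(partition)


def compact_partition(catalog, partition, raw_events):
    """Deduplicate a partition's events and publish them via a pointer swap."""
    deduped = sorted(set(raw_events))
    # Deriving the output location from the content means a rerun over the
    # same input produces the same location, so retries are harmless.
    digest = hashlib.sha256("\n".join(deduped).encode("utf-8")).hexdigest()[:8]
    location = f"{partition}/compacted-{digest}"
    catalog.swap(partition, location)
    return location, deduped
```

Readers see either the old location or the new one, never a half-written directory, and a scheduler can retry a failed task without checking whether it partially succeeded.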
Future plans involve deploying Flink on Kubernetes, expanding the ingestion framework to support databases and logs, automating event‑driven ETL, and further query optimizations using Parquet metadata.
DataFunTalk