
Building a Large-Scale Near Real-Time Data Analytics Platform at Lyft Using Apache Flink

Lyft transformed its legacy data pipeline by designing a cloud‑native, Flink‑based near real‑time analytics platform that ingests billions of events, writes Parquet files to S3, leverages Presto for interactive queries, and implements multi‑stage non‑blocking ETL, fault‑tolerant back‑fill, and extensive performance optimizations.

DataFunTalk

Lyft, a North American ride‑sharing platform, needed a large‑scale near‑real‑time analytics system to process billions of events per day. The legacy platform suffered from high latency, small‑file overhead, and limited schema support.

To address these issues, Lyft built a new architecture based on Apache Flink running on AWS. Event streams from mobile apps and backend services are ingested via Kinesis, processed by Flink, and written directly to S3 in columnar Parquet format. Hive stores table metadata, and Presto provides interactive SQL queries for data engineers, analysts, and ML scientists.
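The output layout this implies is a Hive-style partitioned directory tree on S3, which is what lets Hive register partitions and Presto prune them. A minimal sketch of building such an object key, assuming illustrative partition columns (`event_name`, `ds`, `hr`) and file naming; the actual layout Lyft uses is not specified in the talk:

```python
from datetime import datetime, timezone

def parquet_key(event_name: str, event_time: datetime, part: int) -> str:
    """Build a hypothetical Hive-style partitioned S3 key for a Parquet file,
    partitioned by event name, date, and hour."""
    return (
        f"events/event_name={event_name}"
        f"/ds={event_time:%Y-%m-%d}/hr={event_time:%H}"
        f"/part-{part:05d}.parquet"
    )

ts = datetime(2019, 7, 1, 14, 30, tzinfo=timezone.utc)
key = parquet_key("ride_requested", ts, 3)
# → events/event_name=ride_requested/ds=2019-07-01/hr=14/part-00003.parquet
```

Encoding partition values into the path means Hive only needs metadata, not a file scan, to locate a partition, and Presto can skip entire date/hour subtrees at query planning time.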

The platform runs multi-stage, non-blocking ETL orchestrated by Apache Airflow, compressing and deduplicating data before persisting it. Flink checkpoints every three minutes, and a global state aggregator coordinates per-partition watermarks to ensure consistency.
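Two of those mechanisms can be sketched in a few lines: deduplication keyed on an event ID (needed because at-least-once delivery can replay events), and a global watermark taken as the minimum across partitions (an output partition is only complete up to the slowest input). This is an illustrative sketch; the field name `event_id` and the shard keys are assumptions, not Lyft's actual schema:

```python
def dedupe(events):
    """Drop repeated deliveries of the same event, keeping the first
    occurrence of each event_id (at-least-once delivery can replay)."""
    seen, out = set(), []
    for e in events:
        if e["event_id"] not in seen:
            seen.add(e["event_id"])
            out.append(e)
    return out

def global_watermark(partition_watermarks):
    """A partition of output is only complete up to the slowest input
    partition, so the global watermark is the minimum across shards."""
    return min(partition_watermarks.values())

deduped = dedupe([{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}])
wm = global_watermark({"shard-0": 300, "shard-1": 120, "shard-2": 290})
# deduped keeps ["a", "b"]; wm == 120 (the slowest shard)
```

Taking the minimum is the conservative choice: a partition is only declared ready for the next ETL stage once every upstream shard has advanced past it.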

Performance work reduced the required cluster size by a factor of ten. Further optimizations include leveraging Parquet statistics and partition pruning to speed up Presto queries, adding entropy prefixes and marker files to improve S3 I/O, and smart schema-evolution checks.
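Two of these optimizations are simple enough to sketch: an entropy prefix hashes part of the key so objects spread across S3's internal index partitions rather than hot-spotting one lexicographic range, and Parquet row-group min/max statistics let a reader skip row groups whose value range cannot match the predicate. The function names and the 4-character prefix length below are illustrative assumptions:

```python
import hashlib

def entropy_key(base_key: str) -> str:
    """Prepend a short hash prefix so object keys spread across S3's
    internal index partitions instead of one lexicographic hot spot."""
    return hashlib.md5(base_key.encode()).hexdigest()[:4] + "/" + base_key

def can_skip_row_group(stats: dict, lo, hi) -> bool:
    """Skip a Parquet row group whose [min, max] column statistics
    cannot overlap the query predicate range [lo, hi]."""
    return stats["max"] < lo or stats["min"] > hi

key = entropy_key("events/ds=2019-07-01/part-00001.parquet")
skip = can_skip_row_group({"min": 5, "max": 9}, lo=10, hi=20)  # True: 9 < 10
```

The trade-off of entropy prefixes is that keys are no longer listable in time order, which is one reason marker files are useful to signal that a partition's data is complete.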

Fault tolerance is achieved through data back‑fill streams that replay missed events after job failures, and through idempotent Airflow scheduling, atomic partition swaps, and self‑healing partition management.
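The atomic partition swap can be sketched as a single metadata update: a back-fill writes to a staging location, then the partition pointer in the metastore is flipped in one step, so readers see either all old or all new data, never a mix. Re-running the same swap is harmless, which is what makes Airflow retries idempotent. The dict-based metastore below is a stand-in for Hive's partition metadata:

```python
def publish_partition(metastore: dict, partition: str, new_location: str) -> str:
    """Repoint a partition at freshly back-filled files in one metadata
    update; re-running the same swap is a no-op-equivalent (idempotent)."""
    old_location = metastore.get(partition)
    metastore[partition] = new_location  # the single atomic step
    return old_location

metastore = {"ds=2019-07-01": "s3://bucket/events/ds=2019-07-01/v1"}
old = publish_partition(
    metastore, "ds=2019-07-01", "s3://bucket/events/ds=2019-07-01/v2"
)
# metastore now points at .../v2; old holds .../v1 for cleanup
```

Keeping the old location around until the swap succeeds is also what enables self-healing: a failed back-fill leaves the previous version intact and queryable.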

Future plans involve deploying Flink on Kubernetes, expanding the ingestion framework to support databases and logs, automating event‑driven ETL, and further query optimizations using Parquet metadata.

Tags: Flink, real-time analytics, streaming, AWS, ETL, data lake, Parquet
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
