Understanding Big Data Processing Architectures: Lambda, Kappa, and Lambda Plus
This article explains the technical challenges of large‑scale data processing, compares the classic Lambda and Kappa architectures, and introduces the cloud‑native Lambda Plus solution built on TableStore and Blink that simplifies batch‑stream integration for TB‑scale workloads.
Big data analysis creates significant economic and social value across industries, but most deployments face the same technical challenges: serving low‑latency real‑time data alongside petabyte‑scale historical data, ensuring reliability and scalability, managing separate streaming and batch stacks, and keeping a large‑scale architecture operable.
The Lambda architecture addresses these challenges by writing immutable data to both batch and streaming layers, pre‑computing batch views, and merging them with real‑time views at query time. However, it introduces four major difficulties: dual‑write consistency, storage cost and latency for massive historical data, duplicated development for batch and stream processing, and high operational complexity.
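The query-time merge described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the event tuples, the `cutoff` timestamp, and the function names are all assumptions made for the example, and in-memory counters stand in for the batch and real-time views.

```python
from collections import Counter

def build_batch_view(master_events, cutoff):
    """Batch layer: recompute per-key counts from all events before the cutoff."""
    view = Counter()
    for ts, key in master_events:
        if ts < cutoff:
            view[key] += 1
    return view

def build_realtime_view(recent_events, cutoff):
    """Speed layer: count only events the last batch run has not covered yet."""
    view = Counter()
    for ts, key in recent_events:
        if ts >= cutoff:
            view[key] += 1
    return view

def query(key, batch_view, realtime_view):
    """Serving layer: merge both views at query time."""
    return batch_view[key] + realtime_view[key]

# Hypothetical (timestamp, page) events; the batch job last ran at ts=4.
events = [(1, "page_a"), (2, "page_b"), (3, "page_a"), (5, "page_a"), (6, "page_b")]
cutoff = 4
batch_view = build_batch_view(events, cutoff)
realtime_view = build_realtime_view(events, cutoff)
total_a = query("page_a", batch_view, realtime_view)
```

Note that the counting logic appears twice, once per layer. That duplication is the "duplicated development" difficulty in miniature.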
The Kappa architecture simplifies the stack by using a single streaming pipeline that can also replay events for historical analysis, relying on an append‑only log (e.g., Kafka) for long‑term storage. While it reduces write‑side complexity, it still suffers from high storage cost and limited ad‑hoc query capabilities, and it still needs a robust indexing layer.
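The replay idea can be illustrated with a toy append-only log. The sketch below is an assumption-laden stand-in: `AppendOnlyLog` is a hypothetical in-memory class, not the Kafka client API, and the sensor records are invented for the example.

```python
class AppendOnlyLog:
    """Hypothetical in-memory stand-in for a durable, replayable log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset=0):
        # Records are never mutated, so any consumer can re-read from any offset.
        return iter(self._records[offset:])

def sum_by_sensor(records):
    """Version 1 of the streaming job: running total per sensor."""
    totals = {}
    for sensor, value in records:
        totals[sensor] = totals.get(sensor, 0) + value
    return totals

def max_by_sensor(records):
    """Version 2 of the job: the logic changed, so history must be reprocessed."""
    peaks = {}
    for sensor, value in records:
        peaks[sensor] = max(peaks.get(sensor, value), value)
    return peaks

log = AppendOnlyLog()
for record in [("s1", 10), ("s2", 4), ("s1", 7), ("s2", 9)]:
    log.append(record)

totals = sum_by_sensor(log.read_from(0))
# Deploying new logic means replaying the same log from offset 0;
# there is no separate batch code path.
peaks = max_by_sensor(log.read_from(0))
```

The cost of this simplicity is visible even here: the log must retain every record for as long as replay is needed, and it offers no way to look up a single sensor without scanning.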
To bridge the gap, the Kappa+ approach (used by Uber) unifies batch and stream processing with engines like Spark or Flink, supports exactly‑once semantics, and processes events based on event time, while leveraging tiered storage systems such as Apache Hudi for efficient updates.
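Event-time processing and exactly-once results are easier to see in a small example. The sketch below is only an illustration under stated assumptions: it uses plain Python rather than Spark or Flink, the `(event_id, event_time)` tuples are invented, and duplicate suppression by event ID stands in for the checkpoint-based mechanisms that real engines use.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per tumbling window, keyed by the window's start time.

    Windows are assigned from the event's own timestamp, not its arrival
    order, and redelivered events (same ID) are counted only once.
    """
    seen_ids = set()
    counts = defaultdict(int)
    for event_id, event_time in events:
        if event_id in seen_ids:
            continue  # duplicate delivery: skip so the result stays exact
        seen_ids.add(event_id)
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

# Hypothetical stream: "e2" arrives late (out of order), "e3" is delivered twice.
events = [("e1", 3), ("e3", 12), ("e2", 7), ("e3", 12), ("e4", 10)]
result = tumbling_window_counts(events, window_size=10)
```

Because the window is derived from the event timestamp, replaying the same events in a different arrival order produces the same counts, which is what lets one job serve both live and historical data.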
Lambda Plus, a cloud‑native solution from Alibaba Cloud, combines TableStore (a serverless NoSQL database) and Blink (an enhanced Flink‑based real‑time engine). TableStore serves as the master dataset, providing low‑latency reads, multi‑dimensional indexes, and a TunnelService for streaming data directly to Blink without a separate message queue.
The architecture consists of three layers: a batch layer where Blink reads TableStore to compute batch views; a streaming layer where Blink consumes real‑time data via TunnelService; and a serving layer that exposes both batch and stream views through TableStore’s global secondary and multi‑dimensional indexes for ad‑hoc queries.
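The three layers can be sketched with an in-memory stand-in. Everything named here is hypothetical: `MasterTable` and its `put`, `scan`, and `subscribe` methods are invented for illustration and do not reflect the TableStore SDK, TunnelService, or Blink APIs. The point is only the shape: one write path, one piece of aggregation logic, two ways of reading.

```python
class MasterTable:
    """Hypothetical master dataset with full scans and a change feed."""

    def __init__(self):
        self.rows = []
        self._subscribers = []

    def subscribe(self, callback):
        # Stand-in for a tunnel: new rows are pushed to each subscriber.
        self._subscribers.append(callback)

    def put(self, row):
        # Single write path: the row is stored once, then fanned out.
        self.rows.append(row)
        for callback in self._subscribers:
            callback(row)

    def scan(self):
        return list(self.rows)

def add_to_view(view, row):
    """Shared aggregation logic used by both the batch and streaming layers."""
    device = row["device"]
    view[device] = view.get(device, 0) + row["reading"]

table = MasterTable()
table.put({"device": "d1", "reading": 5})
table.put({"device": "d2", "reading": 7})

# Batch layer: scan the rows already stored in the table.
batch_view = {}
for row in table.scan():
    add_to_view(batch_view, row)

# Streaming layer: consume rows written after the batch run.
stream_view = {}
table.subscribe(lambda row: add_to_view(stream_view, row))
table.put({"device": "d1", "reading": 2})

def serve(device):
    """Serving layer: expose the combined batch and stream result."""
    return batch_view.get(device, 0) + stream_view.get(device, 0)
```

In this sketch the producer writes to one place only, and `add_to_view` is written once, which mirrors how the design avoids dual writes and duplicated batch and stream code.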
Lambda Plus addresses each of the four Lambda difficulties in turn: it eliminates dual writes by using TableStore as the single source of truth, offers scalable low‑latency storage with built‑in indexing for historical data, unifies batch and stream code in Blink, and reduces operational complexity by serving queries from TableStore's native indexes.
Typical use cases include IoT telemetry, time‑series logs, web‑crawler data, and user behavior tracking at the terabyte scale, where fast ingestion, real‑time analytics, and flexible ad‑hoc queries are required.
Overall, the combination of TableStore and Blink provides a fully integrated, serverless big‑data processing pipeline that reduces component count, operational overhead, and cost while expanding the analytical capabilities of the underlying data.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.