LinkedIn’s Scaling and Evolution of Kafka: Quotas, New Consumer, Reliability, Security, and Monitoring

The article details how LinkedIn has massively scaled Kafka usage over several years, addressing quotas, a new ZooKeeper‑free consumer, reliability enhancements, security features, monitoring frameworks, fault testing, and ecosystem integrations to support its massive data‑driven operations.

Qunar Tech Salon

Nov 8, 2015

LinkedIn’s Kafka Usage Growth

LinkedIn put Kafka into large-scale production in July 2011, processing about 1 billion messages per day. Volume grew to 20 billion messages per day in 2012 and about 200 billion by July 2013, and later surpassed 1 trillion messages per day, with peaks of more than 4.5 million messages per second and weekly throughput of 1.34 PB. That is roughly a 1,200-fold increase over four years.

Key Focus Areas

As the scale expanded, LinkedIn focused on reliability, cost, security, availability, and other core metrics, which led to work across the features and areas described in the following sections.

Quotas

Because many applications share the same Kafka cluster, one misbehaving application can degrade performance and SLAs for everyone else. Reprocessing an entire database, for example, can flood the cluster, saturating the network and putting pressure on disks. LinkedIn introduced a quota feature that throttles producers and consumers once they exceed a bytes-per-second threshold, with a whitelist that grants selected users higher bandwidth, so that brokers stay stable.
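The throttling idea can be illustrated with a minimal sketch. This is not Kafka's actual quota implementation: the class name `ByteRateQuota`, the fixed measurement window, the client names, and the delay formula are all assumptions made for illustration. The sketch counts bytes per client in one window and, when a client exceeds its limit, returns a delay long enough to bring its average rate back down to that limit.

```python
class ByteRateQuota:
    """Counts bytes per client in one fixed window and suggests a throttle delay.

    Hypothetical sketch: the names and the formula are illustrative, not Kafka's.
    """

    def __init__(self, default_limit, window_s=1.0, overrides=None):
        self.default_limit = default_limit  # bytes per second for ordinary clients
        self.window_s = window_s            # length of the measurement window
        self.overrides = overrides or {}    # whitelist: client id -> higher limit
        self.bytes_seen = {}                # client id -> bytes in the current window

    def limit_for(self, client_id):
        return self.overrides.get(client_id, self.default_limit)

    def record(self, client_id, nbytes):
        """Record traffic and return how many seconds to delay this client."""
        total = self.bytes_seen.get(client_id, 0) + nbytes
        self.bytes_seen[client_id] = total
        limit = self.limit_for(client_id)
        if total / self.window_s <= limit:
            return 0.0
        # Delay so that the average rate over (window + delay) equals the limit.
        return total / limit - self.window_s

    def roll_window(self):
        """Start a new measurement window."""
        self.bytes_seen.clear()


quota = ByteRateQuota(default_limit=1_000_000, overrides={"db-replay": 10_000_000})
print(quota.record("app-a", 500_000))      # under the default limit
print(quota.record("app-a", 1_500_000))    # now over the default limit
print(quota.record("db-replay", 2_000_000))  # whitelisted client, higher limit
```

Delaying the client rather than rejecting its request keeps the application working while capping the load it can place on the broker.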

Developing a New Consumer

The existing consumer relied on ZooKeeper, which introduced security and split-brain issues. In collaboration with Confluent and the open-source community, LinkedIn helped develop a new consumer that depends only on the Kafka brokers, eliminating the ZooKeeper dependency. The new consumer unifies the earlier low-level and high-level consumer APIs and takes care of tasks such as error handling and retries.

Reliability and Availability Improvements

LinkedIn enhanced Kafka reliability by:

MirrorMaker lossless transfer: Modified MirrorMaker to confirm successful delivery before acknowledging consumption.

Replica lag monitoring: Switched from byte-based to time-based lag thresholds for replica health.

New producer: Implemented a pipelined producer to boost performance, currently being refined.

Topic deletion: Fixed numerous bugs to enable safe topic deletion in upcoming Kafka releases.
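The first two items in the list above can be sketched in a few lines. This is an illustrative sketch, not MirrorMaker's or the broker's code: the function names are hypothetical, and the 10-second default is only an example value. The first function advances the committed offset only across a contiguous run of messages that the target cluster has acknowledged, so a crash can cause re-delivery but not loss. The second treats a replica as healthy when it last caught up with the leader within a time limit, the kind of threshold that Kafka's `replica.lag.time.max.ms` broker setting expresses.

```python
def safe_commit_offset(next_to_commit, acked_offsets):
    """Return the new commit position for one source partition.

    next_to_commit: first offset not yet committed.
    acked_offsets: offsets whose copies the target cluster has acknowledged.
    The position only moves past offsets that are acknowledged without a gap.
    """
    acked = set(acked_offsets)
    position = next_to_commit
    while position in acked:
        position += 1
    return position


def replica_in_sync(last_caught_up_s, now_s, max_lag_s=10.0):
    """Time-based health check: the follower caught up recently enough."""
    return (now_s - last_caught_up_s) <= max_lag_s


# Offset 12 is still in flight, so the commit position stops there even though 13 is acked.
print(safe_commit_offset(10, [10, 11, 13]))
print(replica_in_sync(last_caught_up_s=100.0, now_s=104.0))
print(replica_in_sync(last_caught_up_s=100.0, now_s=115.0))
```

A time-based threshold avoids one weakness of a byte or message count: a burst of traffic can push a perfectly healthy follower over a fixed size limit for a moment, while a slow trickle can hide a follower that has stopped fetching.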

Security

LinkedIn contributed to adding encryption, authentication, and authorization features to Kafka, aiming to enable encryption by 2015 and additional security capabilities by 2016.

Kafka Monitoring Framework

LinkedIn built a standardized monitoring framework that runs test applications publishing and consuming topics to verify basic functionality, end‑to‑end latency, and that new Kafka versions do not break existing clients.
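A probe-based check of this kind might look like the sketch below. It is an assumption-laden illustration, not LinkedIn's framework: the message layout, the function names `make_probe` and `check_probes`, and the latency threshold are hypothetical, and a plain list stands in for reading the probes back from a real topic. Each probe carries a sequence number and its send time, so the checker can report lost probes and the worst end-to-end latency.

```python
import json


def make_probe(seq, sent_at_s):
    """Build a probe message carrying a sequence number and its send timestamp."""
    return json.dumps({"seq": seq, "sent_at": sent_at_s}).encode("utf-8")


def check_probes(received, expected_count, max_latency_s):
    """Summarise a probe run.

    received: list of (payload_bytes, received_at_s) pairs read back from the topic.
    Returns the missing sequence numbers, the worst latency, and an overall verdict.
    """
    latency_by_seq = {}
    for payload, received_at in received:
        probe = json.loads(payload.decode("utf-8"))
        latency_by_seq[probe["seq"]] = received_at - probe["sent_at"]
    missing = [s for s in range(expected_count) if s not in latency_by_seq]
    worst = max(latency_by_seq.values(), default=None)
    healthy = not missing and worst is not None and worst <= max_latency_s
    return {"missing": missing, "worst_latency_s": worst, "healthy": healthy}


# Stand-in for a real topic: three probes were sent, only two were read back.
received = [(make_probe(0, 100.0), 100.5), (make_probe(2, 102.0), 102.25)]
report = check_probes(received, expected_count=3, max_latency_s=1.0)
print(report)
```

Running the same probes against a cluster before and after an upgrade gives a simple signal that a new version still works for existing clients.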

Fault Injection Testing

LinkedIn created the “Simoorg” fault‑injection framework to simulate low‑level failures such as disk write errors, shutdowns, and process kills, ensuring new Kafka releases handle failures gracefully.

Application Latency Monitoring

The “Burrow” tool monitors consumer lag to assess application health.
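Consumer lag is the distance between the newest offset in a partition and the offset the consumer has committed. The sketch below shows how a series of lag samples could be turned into a health status. The rules and status names here are simplified assumptions for illustration and should not be read as Burrow's actual evaluation logic.

```python
def consumer_lag(log_end_offset, committed_offset):
    """Messages the consumer still has to read in one partition."""
    return max(log_end_offset - committed_offset, 0)


def classify(samples):
    """Classify one partition from chronological (log_end_offset, committed_offset) samples.

    Simplified, hypothetical rules:
      - lag reached zero at some point in the window -> "OK"
      - commits did not advance while lag was non-zero -> "STALLED"
      - lag grew at every step                         -> "FALLING_BEHIND"
      - anything else                                  -> "OK"
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to judge a trend")
    lags = [consumer_lag(end, committed) for end, committed in samples]
    commits = [committed for _, committed in samples]
    if 0 in lags:
        return "OK"
    if commits[-1] == commits[0]:
        return "STALLED"
    if all(later > earlier for earlier, later in zip(lags, lags[1:])):
        return "FALLING_BEHIND"
    return "OK"


print(classify([(100, 100), (200, 150)]))
print(classify([(100, 50), (200, 50), (300, 50)]))
print(classify([(100, 90), (200, 150), (300, 200)]))
```

Judging a window of samples rather than a single lag value avoids fixed per-topic thresholds, which are hard to choose when topics differ widely in traffic.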

Maintaining Cluster Balance

LinkedIn ensures cluster balance by:

Avoiding placing partition leaders and replicas in the same rack.

Distributing topic partitions evenly across brokers.

Preventing disk and network saturation on a few nodes.

SRE engineers regularly rebalance partitions, and prototypes have been built to make the system smarter.
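The placement goals above can be illustrated with a small assignment sketch. It is not LinkedIn's tooling or Kafka's built-in assignment code; the function names, broker ids, and rack names are hypothetical. The idea is to order brokers so that neighbours in the list sit in different racks, then assign each partition's replicas to consecutive brokers in round-robin fashion. With racks of equal size and at least as many racks as replicas, this spreads leaders evenly and keeps a partition's replicas in separate racks; uneven racks would need extra handling.

```python
from itertools import zip_longest


def interleave_by_rack(broker_racks):
    """Order brokers so that neighbours in the list are in different racks where possible."""
    by_rack = {}
    for broker, rack in sorted(broker_racks.items()):
        by_rack.setdefault(rack, []).append(broker)
    ordered = []
    # Take one broker from each rack in turn.
    for group in zip_longest(*by_rack.values()):
        ordered.extend(b for b in group if b is not None)
    return ordered


def assign(num_partitions, replication_factor, broker_racks):
    """Return {partition: [leader, follower, ...]} using round-robin placement."""
    brokers = interleave_by_rack(broker_racks)
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds the number of brokers")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


racks = {1: "rack-a", 2: "rack-a", 3: "rack-b", 4: "rack-b"}
plan = assign(num_partitions=4, replication_factor=2, broker_racks=racks)
for partition, replicas in plan.items():
    print(partition, replicas, [racks[b] for b in replicas])
```

A static plan like this only balances partition counts. Balancing actual disk and network load requires per-partition traffic figures, which is why rebalancing remains an ongoing operational task.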

Kafka Ecosystem at LinkedIn

Beyond the core broker, client, and MirrorMaker, LinkedIn provides internal services such as:

REST interface support for non‑Java clients.

Schema registry for automatic deserialization.

Audit topics for cost accounting and usage tracking.

Large‑message handling that splits and reassembles messages exceeding the 1 MB limit.
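The large-message handling in the last item can be sketched as follows. This is an illustration under stated assumptions, not LinkedIn's implementation: the segment layout (`id`, `index`, `total`, `data`), the names `split_message` and `Reassembler`, and the chunk size are hypothetical. The producer side cuts a payload into segments that each fit under the size limit, and the consumer side buffers segments until all parts of a message have arrived.

```python
import uuid

MAX_CHUNK_BYTES = 1_000_000  # illustrative value chosen to stay under a 1 MB limit


def split_message(payload, chunk_size=MAX_CHUNK_BYTES):
    """Split a bytes payload into segments tagged with a shared id and their position."""
    message_id = uuid.uuid4().hex
    total = max(1, -(-len(payload) // chunk_size))  # ceiling division, at least one segment
    return [
        {
            "id": message_id,
            "index": i,
            "total": total,
            "data": payload[i * chunk_size:(i + 1) * chunk_size],
        }
        for i in range(total)
    ]


class Reassembler:
    """Buffers segments per message id and returns the payload once it is complete."""

    def __init__(self):
        self.pending = {}  # message id -> {segment index: segment bytes}

    def add(self, segment):
        parts = self.pending.setdefault(segment["id"], {})
        parts[segment["index"]] = segment["data"]
        if len(parts) < segment["total"]:
            return None  # still waiting for more segments
        del self.pending[segment["id"]]
        return b"".join(parts[i] for i in range(segment["total"]))


payload = bytes(range(256)) * 10  # 2,560 bytes
segments = split_message(payload, chunk_size=1000)
reassembler = Reassembler()
results = [reassembler.add(s) for s in reversed(segments)]  # deliver out of order
print(len(segments), results[-1] == payload)
```

In practice, segments of one message would typically share a message key so they land in the same partition, and the consumer would need a policy for discarding incomplete messages.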

These efforts have influenced many other companies (Yahoo!, Twitter, Netflix, Uber) that use Kafka for data analysis and stream processing.
