
Kafka Best Practices for High Throughput: 20 Recommendations

This article presents New Relic's 20 best‑practice recommendations for operating Apache Kafka at high throughput, covering partitions, consumers, producers, and brokers, and explains key concepts, configuration tuning, monitoring, and architectural considerations to ensure reliable, scalable streaming pipelines.


Apache Kafka is a popular distributed streaming platform used by large companies such as New Relic, Uber, and Square to build scalable, high‑throughput, and reliable real‑time data pipelines. In production, New Relic’s Kafka clusters handle over 15 million messages per second with an aggregate rate near 1 Tbps.

While Kafka simplifies stream processing, large‑scale deployments can become complex: consumers may fall behind, messages can be lost, retention limits can be hit before data is consumed, and publish‑subscribe fan‑out can strain performance. To mitigate these issues, New Relic shares 20 best‑practice recommendations organized into four areas: Partitions, Consumers, Producers, and Brokers.

Partitions

1. Understand the data rate of each partition to provision adequate storage space.

2. Unless you have a specific architectural need, use random partitioning when writing to topics to avoid hot‑partition bottlenecks.
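As a rough sizing aid for item 1, per‑partition disk needs can be estimated from the topic's data rate and its retention window. The function and figures below are illustrative assumptions, not numbers from the article:

```python
def partition_storage_bytes(msgs_per_sec, avg_msg_bytes, retention_hours, partitions):
    """Estimate disk needed per partition:
    (topic data rate / partition count) * retention window."""
    topic_bytes_per_sec = msgs_per_sec * avg_msg_bytes
    return topic_bytes_per_sec / partitions * retention_hours * 3600

# Example: 50k msgs/s of 1 KB messages, 24 h retention, 30 partitions.
# Each partition then holds roughly 147 GB, before compression and
# before multiplying by the replication factor.
est = partition_storage_bytes(50_000, 1024, 24, 30)
```

Remember that replication multiplies this figure: with a replication factor of 3, each broker hosting a replica needs the same headroom.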

Consumers

3. Upgrade consumers older than Kafka 0.10 to avoid ZooKeeper‑based coordination bugs that cause rebalance storms.

4. Tune socket buffers (e.g., receive.buffer.bytes) to handle high‑speed inbound traffic; for 10 Gbps+ networks consider 8–16 MB buffers.

5. Design consumers with back‑pressure mechanisms, preferably using fixed‑size off‑heap buffers.

6. Be aware of garbage‑collection pauses that can disrupt consumer groups and broker stability.
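Item 5's fixed‑size buffer idea can be sketched with a bounded queue between the poll loop and the processing stage: when the buffer fills, the poller blocks instead of exhausting memory. This is a minimal stdlib illustration, not New Relic's implementation (which the article describes as off‑heap):

```python
import queue
import threading

# Fixed-size buffer between the poll loop and the processing stage.
# put() blocks when the buffer is full, so the poller naturally slows
# down (back-pressure) instead of growing memory without bound.
buf = queue.Queue(maxsize=1000)

def poll_loop(records):
    for rec in records:
        buf.put(rec)            # blocks while the buffer is full

def process_loop(n, out):
    for _ in range(n):
        out.append(buf.get())   # frees a slot, unblocking the poller
        buf.task_done()

records = list(range(5000))
out = []
t1 = threading.Thread(target=poll_loop, args=(records,))
t2 = threading.Thread(target=process_loop, args=(len(records), out))
t1.start(); t2.start()
t1.join(); t2.join()
```

With a single producer and single consumer on a FIFO queue, ordering is preserved end to end, which mirrors Kafka's per‑partition ordering guarantee.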

Producers

7. Configure acknowledgments (acks) so producers know when messages are safely persisted.

8. Set retries appropriately; for zero‑tolerance loss workloads consider Integer.MAX_VALUE.

9. Tune buffer.memory and batch.size based on producer data rate, partition count, and available memory.
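Items 7–9 correspond to standard producer configuration keys. A hedged sketch of a producer properties file for a loss‑sensitive, high‑throughput workload; the specific values are examples, not recommendations from the article:

```properties
# Illustrative producer settings (values are assumptions)
acks=all                    # item 7: wait for all in-sync replicas
retries=2147483647          # item 8: Integer.MAX_VALUE for zero-tolerance loss
batch.size=65536            # item 9: 64 KB batches amortize request overhead
buffer.memory=67108864      # item 9: 64 MB; size to data rate x broker pauses
```

Undersizing buffer.memory causes send() to block (or time out) during broker slowdowns, so it should cover the producer's data rate over the longest pause you expect to ride out.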

Brokers

10. Enable topic compression and tune log.cleaner parameters to control disk usage.

11. Monitor network throughput, disk I/O, and CPU usage on each broker.

12. Distribute leader partitions evenly across brokers to avoid network‑I/O hotspots.

13. Watch for ISR shrinkage, under‑replicated partitions, and unpreferred leaders.

14. Adjust Log4j settings to retain useful logs without exhausting disk space.

15. Disable automatic topic creation and apply retention policies to unused topics.

16. Provide sufficient memory for high‑throughput brokers to keep data in OS cache.

17. Consider isolating high‑SLO topics onto dedicated broker subsets.

18. Be aware that serving older clients from topics written with a newer message format forces brokers to down‑convert messages on read, which costs broker CPU; plan capacity for the conversion or upgrade the clients.

19. Do not assume local‑host testing reflects production performance; replication factor and network latency differ.
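Several of the broker items above map directly onto server.properties settings. The fragment below is an illustrative sketch reflecting items 10 and 15; the values are example choices, not figures from the article:

```properties
# Illustrative broker settings (values are assumptions)
compression.type=producer          # item 10: keep producer-side compression on disk
log.cleaner.enable=true            # item 10: enable log compaction machinery
log.retention.hours=72             # retention policy; tune per topic
auto.create.topics.enable=false    # item 15: no implicit topic creation
```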

Additional resources include Kafka’s official documentation, Confluent webinars, and related articles on Elasticsearch node shutdown and disk‑based Kafka performance.

Tags: Kafka, best practices, high throughput, partitions, brokers, consumers, producers
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
