
Practical Practices for Enhancing Kafka Cluster Stability at 360

This article details 360's comprehensive approach to improving Apache Kafka cluster stability through proactive operations, capacity assessment, parameter tuning, monitoring, version upgrades, and traffic control, offering concrete guidelines and best‑practice recommendations for large‑scale message‑queue deployments.

360 Smart Cloud

1. Proactive Operations

During the iterative stability optimization of 360's online Kafka clusters, we distilled three practical stages—pre‑emptive prevention, runtime monitoring, and in‑process control—forming a "5‑10‑15" response standard (5 min rapid response, 10 min issue localization, 15 min emergency mitigation) that underpins SLA assurance.

1.1 Pre‑emptive Prevention

Capacity Assessment: Evaluate hardware limits and cluster bottlenecks.

Parameter Tuning: Optimize broker and client configurations for resource efficiency and lower latency.

Version Upgrade: Adopt newer Kafka releases to leverage added features and performance improvements.

User Profiling: Analyze traffic peaks, QPS, topic distribution, and client SDK usage to tailor services.

Cluster Segmentation: Separate online/offline and core/non‑core workloads into dedicated clusters for efficient resource management.

Admission Review: Verify expected connections, QPS, and storage needs before allowing new client access.

1.2 Runtime Monitoring

Observability: Collect key hardware and software metrics to accelerate fault diagnosis.

Alerting: Set thresholds on critical indicators (e.g., hardware failures, P99 latency spikes).

Daily Inspection: Conduct regular checks on capacity usage, network bandwidth, and node load balance.

1.3 In‑process Control

Emergency Drills: Regularly rehearse SOPs in simulated environments.

Active Defense: Enable broker‑side rate limiting and IP‑based connection throttling.

Rapid Mitigation ("stopping the bleeding"): Use prepared tooling to restore MQ service within 15 minutes.

Incident Reporting: Escalate after 10 minutes if unresolved, and broadcast progress after 15 minutes.

2. Capacity Assessment

Kafka's throughput is bounded by disk performance and network bandwidth. Matching network speed to disk I/O (e.g., 10 Gb/s for 10 × 125 MB/s HDDs) prevents bandwidth from becoming a bottleneck.

2.1 Network Bandwidth

Typical NIC speeds are 25 Gb/s or 10 Gb/s; ensure they align with aggregate disk write capacity.

2.2 Disk Performance

Use RAID 10 for redundancy and balanced performance, or JBOD for higher usable capacity and throughput at the cost of no disk-level redundancy and higher operational overhead.

Capacity formula: Estimated throughput = Disk count × Single‑disk throughput × Broker count ÷ Topic replication factor
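The capacity formula can be sketched as a quick back-of-the-envelope calculation; the disk, broker, and replication figures below are illustrative assumptions, not 360's actual hardware.

```python
# Rough cluster write-throughput estimate from the capacity formula above.
def estimated_throughput_mb_s(disk_count, disk_mb_s, broker_count, replication_factor):
    """Estimated aggregate producer throughput in MB/s.

    Each produced byte is written replication_factor times across the
    cluster, so usable throughput is raw disk bandwidth divided by it.
    """
    return disk_count * disk_mb_s * broker_count / replication_factor

# Example: 10 HDDs per broker at 125 MB/s each, 5 brokers, replication factor 3.
print(round(estimated_throughput_mb_s(10, 125, 5, 3)))  # ~2083 MB/s
```

Note this is an upper bound: replica fetch traffic also consumes NIC bandwidth, so the network check from section 2.1 should be applied alongside it.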

3. Parameter Tuning

Both broker and client settings must be iteratively adjusted based on testing and monitoring.

3.1 Broker Tuning

num.network.threads: Number of threads handling network requests; typically set to the CPU core count.

num.io.threads: Threads for disk I/O; align with disk performance and CPU.

socket.send.buffer.bytes / socket.receive.buffer.bytes: TCP socket buffers; start with 100 KB–1 MB and tune per network conditions.

num.replica.fetchers: Fetcher threads for replica sync; increase for large clusters while monitoring CPU/network usage.
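The broker settings above can be collected in server.properties. The values below are illustrative starting points for a hypothetical 16-core broker, not 360's production configuration; tune them against your own benchmarks.

```properties
# server.properties — illustrative starting points only
num.network.threads=16               # roughly one per CPU core
num.io.threads=16                    # scale with disk count and CPU headroom
socket.send.buffer.bytes=1048576     # 1 MB; tune per network conditions
socket.receive.buffer.bytes=1048576  # 1 MB
num.replica.fetchers=4               # raise for large clusters; watch CPU/network
```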

3.2 Client Tuning

Ensure replication.factor > 1, use acks=all, set retries > 3, and set enable.idempotence=true for reliable delivery.

Adjust batch.size, linger.ms, and buffer.memory to improve throughput.
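A minimal producer configuration reflecting these guidelines might look as follows; the values are illustrative, and note that replication.factor is a topic-level setting applied at topic creation rather than a producer property.

```properties
# producer.properties — reliability first, then throughput (illustrative values)
acks=all                       # wait for all in-sync replicas
retries=5                      # > 3, per the guideline above
enable.idempotence=true        # avoid duplicates introduced by retries
batch.size=65536               # 64 KB batches; larger batches raise throughput
linger.ms=10                   # wait up to 10 ms to fill a batch
buffer.memory=67108864         # 64 MB for buffered, unsent records
```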

3.3 Runtime Environment

Increase OS file descriptor limits and allocate JVM heap carefully to leave sufficient memory for the page cache.

4. Monitoring & Alerts

Track P99 production latency and its sub‑metrics (RequestQueueTime, LocalTime, RemoteTime, etc.) to quickly spot anomalies such as cold reads or I/O hotspots.

4.1 Production Latency

Breakdown of total time includes queueing, local processing, remote replication, response sending, message conversion, and throttling.

4.2 Cold Reads

Cold reads evict hot data from the page cache, increase fetcher latency, and can cause P99 spikes, especially when acks=-1 (equivalent to acks=all) is used.

5. Multi‑Version Management

Legacy clusters used Kafka v0.9 and v1.1; upgrading to v2.8.2 brings new features, bug fixes, and better compatibility with tools like MirrorMaker2.

5.1 Benefits of New Versions

Improved handling of DDoS-like short connections (KIP-306, KIP-402, KIP-612), richer JMX metrics, and broader client compatibility via ApiVersionsRequest.

5.2 Compatibility

Broker‑to‑broker communication uses inter.broker.protocol.version to maintain protocol compatibility across mixed‑version clusters.

5.3 Upgrade Process

Two‑phase rolling upgrade: first update binary packages while keeping the old protocol version, then raise inter.broker.protocol.version after validation.
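The two-phase rolling upgrade can be sketched in server.properties; the version numbers below match the article's v1.1-to-v2.8.2 scenario, but treat them as placeholders for your own source and target releases.

```properties
# Phase 1: roll out the v2.8.2 binaries broker by broker, while pinning
# the wire protocol to the old version so mixed-version brokers interoperate:
inter.broker.protocol.version=1.1

# Phase 2: once every broker runs the new binaries and has been validated,
# raise the protocol version and perform a second rolling restart:
# inter.broker.protocol.version=2.8
```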

6. Traffic Control

Implement connection-rate quotas (connection_creation_rate), bandwidth quotas (producer_byte_rate, consumer_byte_rate), and request-percentage limits to protect the cluster from short-connection storms and DDoS-like behavior.

6.1 Massive Short Connections

A client bug that opened thousands of connections per minute caused P99 latency to jump from tens of milliseconds to ~4 seconds, prompting IP‑based connection throttling and automatic iptables bans.

6.2 Network Bandwidth Quota

Configure per‑user or per‑client‑ID byte‑rate limits to prevent a single topic from exhausting cluster I/O.

6.3 Request Rate Quota

Limit the combined CPU time of network and I/O threads per user/client (request_percentage), though effectiveness is limited for short-connection workloads.

6.4 Connection Creation Rate Quota

Introduced in Kafka v2.7, this caps new connection creation per broker to mitigate connection storms.
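The quota types in sections 6.2–6.4 can all be applied with the stock kafka-configs.sh tool. The bootstrap address, entity names, and limits below are placeholders for illustration.

```shell
# Bandwidth quota for one user: 50 MB/s produce, 100 MB/s consume
bin/kafka-configs.sh --bootstrap-server broker:9092 --alter \
  --add-config 'producer_byte_rate=52428800,consumer_byte_rate=104857600' \
  --entity-type users --entity-name alice

# Request-rate quota: cap a client at 20% of network + I/O thread time
bin/kafka-configs.sh --bootstrap-server broker:9092 --alter \
  --add-config 'request_percentage=20' \
  --entity-type clients --entity-name analytics-app

# Connection creation rate quota (Kafka >= 2.7, KIP-612), per source IP
bin/kafka-configs.sh --bootstrap-server broker:9092 --alter \
  --add-config 'connection_creation_rate=50' \
  --entity-type ips --entity-name 10.0.0.12
```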

7. Conclusion

The practices described—from proactive ops and capacity planning to monitoring, version upgrades, and traffic control—significantly improved the stability and performance of 360's Kafka clusters, while highlighting remaining challenges such as cold‑read latency, scaling complexities, and hardware costs.

Written by 360 Smart Cloud

Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one-stop cloud service platform.