Design and Implementation of a Kafka‑Proxy for High Availability and Traffic Governance
This article presents Kafka-Proxy, a lightweight proxy layer that uses shared metadata to improve cluster availability and traffic governance. It supports seamless cluster switching, near-source production and consumption, non-disruptive offset resets, and message flow control.
Challenges of Existing Kafka Clusters
As the business grows, the existing Kafka clusters run into the bottlenecks of very large clusters, lack cluster-level disaster recovery, and suffer from operational pain points such as resetting consumer offsets.
Design Goals
1. Seamless cluster switching to isolate core services.
2. Cross-region traffic monitoring with alerts.
3. Online consumer offset reset without downtime.
4. Topic-level traffic circuit-breaker.
5. Dual-region support for near-source production and consumption.
System Design
A lightweight proxy layer sits between clients and brokers. It is fully compatible with the Kafka protocol, so it can be introduced without modifying existing clients or brokers. The proxy is co-located with the brokers to maximize throughput and minimize latency.
Architecture Overview
The design is built on shared metadata, which lets multiple backend clusters sit behind a single public address. Each cluster handles a subset of topics, and because the metadata is shared, failover and maintenance can be carried out seamlessly.
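The idea of a shared topic-to-cluster mapping can be illustrated with a small sketch. This is not the proxy's actual data model: the `SharedMetadata` class, its method names, and the cluster and broker addresses are all hypothetical, chosen only to show how one routing table could let several clusters answer behind one address and let a topic be moved between them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SharedMetadata:
    """Hypothetical routing table that every proxy instance would read."""

    clusters: dict[str, list[str]]  # cluster name -> broker addresses
    topic_owner: dict[str, str] = field(default_factory=dict)  # topic -> cluster

    def assign(self, topic: str, cluster: str) -> None:
        """Record which backend cluster currently serves a topic."""
        if cluster not in self.clusters:
            raise KeyError(f"unknown cluster: {cluster}")
        self.topic_owner[topic] = cluster

    def brokers_for(self, topic: str) -> list[str]:
        """Resolve a topic to the brokers of the cluster that owns it."""
        return self.clusters[self.topic_owner[topic]]

    def switch(self, topic: str, target: str) -> str:
        """Move a topic to another cluster and return the previous owner."""
        previous = self.topic_owner[topic]
        self.assign(topic, target)
        return previous


# Illustrative addresses only.
meta = SharedMetadata(
    clusters={
        "cluster-a": ["10.0.0.1:9092"],
        "cluster-b": ["10.0.1.1:9092"],
    }
)
meta.assign("orders", "cluster-a")
```

Calling `meta.switch("orders", "cluster-b")` changes what `brokers_for("orders")` resolves to, while clients keep talking to the same proxy address.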
Key Components
Netty Server/Client for request handling.
Key Queue and DataTable for channel and request management.
SendWorker and Acks0SendWorker for processing normal and acks=0 requests.
ChannelManager and Cache Manager for mapping and metadata caching.
Processor for request transformation before forwarding to brokers.
Workflow
The proxy starts and listens on port 19092.
It parses each incoming ByteBuf to extract the ApiKey and acks values.
It routes acks=0 requests to Acks0SendWorker and all other requests to a SendWorker chosen by channelId.
It matches responses to requests by requestId and forwards them back to the clients.
On a mismatch or exception, it resets the connection to maintain stability.
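The parsing and routing steps can be sketched in a few lines. The proxy itself works on Netty ByteBuf objects; the sketch below uses Python's `struct` on raw bytes purely for illustration. The field layout follows the public Kafka protocol (request header, then the Produce body), but the function names, the worker labels, and the modulo-based worker choice are assumptions, and only non-flexible Produce versions (v0 to v8) are handled.

```python
import struct
from typing import Optional

PRODUCE_API_KEY = 0  # In the Kafka protocol, ApiKey 0 is Produce.


def parse_request(frame: bytes) -> tuple[int, int, int, Optional[int]]:
    """Return (api_key, api_version, correlation_id, acks) for one request.

    `frame` is the request after its 4-byte size prefix has been stripped.
    `acks` is None for anything that is not a Produce request.
    """
    # Header: api_key INT16, api_version INT16, correlation_id INT32,
    # client_id NULLABLE_STRING (INT16 length, -1 means null).
    api_key, api_version, correlation_id, client_id_len = struct.unpack_from(
        ">hhih", frame, 0
    )
    pos = 10 + max(client_id_len, 0)
    if api_key != PRODUCE_API_KEY:
        return api_key, api_version, correlation_id, None
    if api_version >= 9:
        # Flexible versions use compact strings and tagged fields.
        raise NotImplementedError("Produce v9+ is outside this sketch")
    if api_version >= 3:
        # Produce v3+ puts a nullable transactional_id before acks.
        (txn_len,) = struct.unpack_from(">h", frame, pos)
        pos += 2 + max(txn_len, 0)
    (acks,) = struct.unpack_from(">h", frame, pos)
    return api_key, api_version, correlation_id, acks


def pick_worker(acks: Optional[int], channel_id: int, worker_count: int) -> str:
    """Choose a worker label; modulo on channel_id is an assumed policy."""
    if acks == 0:
        return "acks0"
    return f"send-{channel_id % worker_count}"
```

Keeping every request from one channel on the same worker preserves per-connection ordering, which is presumably why the original routes by channelId.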
Features
Seamless Cluster Switching enables topic migration across availability zones (AZs) with metadata synchronization, though in early versions data that has not yet been consumed may be lost during a switch.
Near‑Source Production/Consumption directs traffic to the closest region/AZ based on client IP, reducing latency.
Non‑Disruptive Offset Reset allows offset resets via the management platform while clients continue consuming, with safeguards for idempotency.
Production and Consumption Circuit‑Breaker blocks overloaded topics at the proxy level, returning default responses to protect the broker cluster.
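Two of the features above, near-source routing and the topic-level circuit-breaker, can be sketched as simple per-request checks. The article does not say how the proxy maps client IPs to regions or how it decides a topic is overloaded, so everything here is an assumption made for illustration: the CIDR table, the AZ names, the `TopicBreaker` class, and the fixed one-second byte-rate window.

```python
import ipaddress

# Hypothetical CIDR-to-AZ table; a real deployment would load this from config.
AZ_BY_NETWORK = {
    ipaddress.ip_network("10.1.0.0/16"): "az-1",
    ipaddress.ip_network("10.2.0.0/16"): "az-2",
}
DEFAULT_AZ = "az-1"


def nearest_az(client_ip: str) -> str:
    """Pick the AZ whose network contains the client, else a default."""
    addr = ipaddress.ip_address(client_ip)
    for network, az in AZ_BY_NETWORK.items():
        if addr in network:
            return az
    return DEFAULT_AZ


class TopicBreaker:
    """Rejects traffic for a topic once it exceeds a per-second byte limit."""

    def __init__(self, limit_bytes_per_sec: int) -> None:
        self.limit = limit_bytes_per_sec
        self.window: dict[str, tuple[int, int]] = {}  # topic -> (second, bytes)

    def allow(self, topic: str, size: int, now: float) -> bool:
        """Return True to forward the request, False to short-circuit it."""
        second = int(now)
        start, seen = self.window.get(topic, (second, 0))
        if start != second:
            # A new one-second window has begun; reset the counter.
            start, seen = second, 0
        if seen + size > self.limit:
            self.window[topic] = (start, seen)
            return False
        self.window[topic] = (start, seen + size)
        return True


breaker = TopicBreaker(limit_bytes_per_sec=100)
```

When `allow` returns False, the proxy would answer the client with a default response instead of forwarding the request, so the overload never reaches the brokers.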
Benefits
The proxy enables fine-grained management of massive Kafka clusters, better resource utilization, stronger fault isolation, traffic monitoring, and best-practice guidelines for client parameters, which together yield higher stability and cost savings.
Future Outlook
Automated failover scripts for dual‑center deployments.
Expanded testing across client versions.
Seamless Kafka version upgrades via the proxy.
Additionally, the open‑source CKibana project (Kibana + ClickHouse) is highlighted for high‑throughput log analytics.
Tongcheng Travel Technology Center