Efficient and Resilient Cloud Gateway at Scale: Architecture, Key Technologies, and Operational Practices of Tencent TGW

The article presents a comprehensive analysis of Tencent's TGW cloud gateway, detailing its modular architecture, high‑performance forwarding plane, lossless state migration, rapid fault recovery, multi‑level redundancy, operational best practices, and security mechanisms that enable ultra‑low latency and high availability for large‑scale internet services.

Tencent Cloud Developer

May 20, 2025

Background and Goals

Large‑scale cloud data centers are the backbone of modern internet services. Tencent TGW (Tencent Gateway) combines elastic public access with intelligent load balancing to keep pace with rapid traffic growth and the stringent latency requirements of online gaming, real‑time audio/video, and other latency‑sensitive workloads. The system targets very high forwarding performance, seamless state migration, rapid fault recovery, and microsecond‑level latency.

Architecture and Workflow

TGW follows a hierarchical modular design consisting of three parts:

Forwarding plane: stateless TGW‑EIP (elastic public access) and stateful TGW‑CLB (cloud load balancer).

Control plane: global orchestrator, per‑cluster operator, and distributed data plane (Load Distributor).

Auxiliary components: BGP+ECMP routing, probing agents, and log aggregation proxies.

Deployment places TGW‑EIP at regional entry points and TGW‑CLB inside each Availability Zone (AZ). Inbound traffic flows through BGP to TGW‑EIP, which performs NAT and tunnel encapsulation, then to TGW‑CLB for stateful load‑balancing based on service identifiers.
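As a rough illustration of the ECMP step in this path, the sketch below hashes a flow's 5‑tuple to choose one of several equal‑cost next hops, so every packet of a flow reaches the same gateway node. The node names, the `FiveTuple` type, and the use of SHA‑256 are assumptions made for the example; routers typically use their own hardware hash, and the source does not describe TGW's.

```python
import hashlib
from typing import NamedTuple, Sequence


class FiveTuple(NamedTuple):
    """Flow identity used as the ECMP hash input (hypothetical type)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str


def ecmp_pick(flow: FiveTuple, next_hops: Sequence[str]) -> str:
    """Deterministically map a flow to one of the equal-cost next hops."""
    key = "|".join(map(str, flow)).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    flow = FiveTuple("198.51.100.7", 40000, "203.0.113.10", 443, "tcp")
    print(ecmp_pick(flow, ["eip-a", "eip-b", "eip-c"]))
```

Because the choice depends only on the 5‑tuple and the hop list, a stateless tier such as TGW‑EIP can sit behind ECMP without sharing per‑flow state between nodes. With plain modulo hashing, changing the hop list remaps many flows.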

Key Technical Highlights

High‑performance forwarding plane: TGW‑EIP uses a Run‑to‑Completion (RTC) model with single‑core batch processing, hash prefetching, and conflict reduction, achieving 53% higher throughput and 66‑105 µs latency. TGW‑CLB adopts a hybrid Pipeline+RTC model with dynamic dispatch, lock‑free ring buffers, and a 1:2 ratio of dispatch cores to processing cores, delivering 2.9× the throughput of the Tripod baseline.
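The pipeline half of the TGW‑CLB model can be pictured as one dispatch stage feeding two processing rings, matching the 1:2 ratio mentioned above. The sketch below is a simplified, single‑threaded stand‑in: `Ring` is a bounded FIFO rather than a true lock‑free ring, the flow hash is static rather than dynamic, and all names and sizes are hypothetical.

```python
from collections import deque
from zlib import crc32


class Ring:
    """Bounded FIFO standing in for a lock-free ring buffer."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = deque()

    def push(self, pkt) -> bool:
        """Enqueue a packet; return False if the ring is full."""
        if len(self.items) >= self.capacity:
            return False
        self.items.append(pkt)
        return True

    def pop_batch(self, n: int) -> list:
        """Dequeue up to n packets so a worker can process them as a batch."""
        batch = []
        while self.items and len(batch) < n:
            batch.append(self.items.popleft())
        return batch


def dispatch(packets, rings) -> int:
    """Dispatch stage: send each (flow_key, payload) to the ring picked by
    the flow hash, so one flow always reaches the same processing core.
    Returns the number of packets dropped because a ring was full."""
    dropped = 0
    for flow_key, payload in packets:
        ring = rings[crc32(flow_key.encode()) % len(rings)]
        if not ring.push((flow_key, payload)):
            dropped += 1
    return dropped


if __name__ == "__main__":
    rings = [Ring(1024), Ring(1024)]  # one dispatcher, two processing rings
    dispatch([("flow-a", b"p1"), ("flow-b", b"p2"), ("flow-a", b"p3")], rings)
    for ring in rings:
        print(ring.pop_batch(32))
```

Pinning a flow to one processing core keeps its connection state local to that core, which is what lets the processing stage run to completion without locks in this sketch.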

State migration mechanism: TGW supports lossless hot migration. Configuration and connection state are copied to the new cluster before the new cluster is announced via BGP. Migration completes within 4 seconds, with a proxy fallback for unrecognized flows.
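The ordering described here (copy first, announce last, proxy anything unknown) can be sketched as follows. The `Cluster` class, its fields, and the reading of "proxy fallback" as forwarding an unknown flow back to the old cluster are assumptions made for illustration, not details confirmed by the source.

```python
class Cluster:
    """Toy model of a gateway cluster (hypothetical structure)."""

    def __init__(self, name: str):
        self.name = name
        self.config = {}
        self.connections = {}   # flow key -> chosen backend
        self.announced = False  # stands in for a BGP announcement

    def handle(self, flow_key: str, fallback: "Cluster | None" = None):
        """Return (serving cluster, backend) for a packet of this flow."""
        if flow_key in self.connections:
            return (self.name, self.connections[flow_key])
        # Unknown here but known to the old cluster: proxy it back.
        if fallback is not None and flow_key in fallback.connections:
            return fallback.handle(flow_key)
        # Unknown everywhere: treat as a new flow with no backend yet.
        return (self.name, None)


def migrate(old: Cluster, new: Cluster) -> None:
    """Copy configuration and connection state, then announce the new cluster."""
    new.config = dict(old.config)
    new.connections = dict(old.connections)
    new.announced = True  # traffic starts arriving only after state is in place


if __name__ == "__main__":
    old, new = Cluster("old"), Cluster("new")
    old.connections["flow-1"] = "rs-1"
    migrate(old, new)
    old.connections["flow-2"] = "rs-2"  # established after the snapshot
    print(new.handle("flow-1", fallback=old))
    print(new.handle("flow-2", fallback=old))
```

Announcing only after the copy means the new cluster never sees a packet for a flow it should already know. The fallback covers flows that were established on the old cluster after the snapshot was taken.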

Fault recovery: Multi‑level fault tolerance (AZ, rack, machine, link) with active‑active or active‑standby modes enables seconds‑level recovery. Connection synchronization within a cluster filters out short‑lived flows and batches updates to reduce overhead.
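The two overhead reductions named above, skipping short‑lived flows and batching updates, can be sketched in a few lines. The age threshold and batch size are hypothetical parameters; the source does not give the values TGW uses.

```python
def flows_to_sync(first_seen: dict, now: float, min_age_s: float) -> list:
    """Select flows that have lived at least min_age_s seconds.

    first_seen maps flow key -> timestamp of the flow's first packet.
    Younger flows are skipped, on the assumption that most of them will
    end before a peer would ever need their state.
    """
    return [key for key, t in first_seen.items() if now - t >= min_age_s]


def batches(items: list, size: int) -> list:
    """Group updates into chunks so each sync message carries many flows."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    first_seen = {"flow-a": 0.0, "flow-b": 9.5, "flow-c": 2.0}
    eligible = flows_to_sync(first_seen, now=10.0, min_age_s=3.0)
    for batch in batches(eligible, size=2):
        print(batch)
```

The trade‑off is that a flow younger than the threshold is lost if its node fails. In exchange, synchronization traffic scales with the number of long‑lived flows rather than with all flows.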

Fault detection and localization: Colored probes (TCP half‑handshake) are sent every 5 seconds. Trace Points (TP) and Drop Points (DP) recorded along the path pinpoint node failures within one minute.
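One way to read the TP/DP scheme is that each node on a probe's path reports a Trace Point when it forwards the probe and a Drop Point when it discards it. Under that assumption, localization reduces to walking the expected path, as in the sketch below. The function name, the node names, and the report format are hypothetical.

```python
def locate_fault(path, trace_points, drop_points):
    """Return the first suspect node on the probe's expected path, or None.

    path:         ordered node names the probe should traverse
    trace_points: set of nodes that reported seeing the probe
    drop_points:  set of nodes that reported dropping the probe
    """
    for node in path:
        if node in drop_points:
            return node  # the node itself says it dropped the probe
        if node not in trace_points:
            # The probe never showed up here: suspect this node or the
            # link leading into it.
            return node
    return None  # the probe was traced end to end


if __name__ == "__main__":
    path = ["eip-1", "clb-1", "rs-1"]
    print(locate_fault(path, trace_points={"eip-1"}, drop_points=set()))
```

A single missing probe can be ordinary packet loss, so a production system would presumably require the same node to show up across several consecutive probes before raising an alarm.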

Security and DDoS protection: Defenses include layered scrubbing, per‑VIP rate limiting, dynamic isolation of attacked VIPs to dedicated cleaning clusters, and protocol compliance checks (e.g., dropping malformed GRE or QUIC packets).
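Per‑VIP rate limiting is commonly built on a token bucket, with one bucket per VIP so that a flood against one address does not consume another's budget. The sketch below shows that general technique. The class names, rate, and burst values are hypothetical, and the source does not say which algorithm TGW actually uses.

```python
class TokenBucket:
    """Classic token bucket: refills at rate_pps, holds at most burst tokens."""

    def __init__(self, rate_pps: float, burst: float):
        self.rate = rate_pps
        self.burst = burst
        self.tokens = burst  # start full
        self.last = 0.0      # timestamp of the previous refill

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class VipLimiter:
    """Keeps an independent bucket for every VIP."""

    def __init__(self, rate_pps: float, burst: float):
        self.rate_pps = rate_pps
        self.burst = burst
        self.buckets = {}

    def allow(self, vip: str, now: float) -> bool:
        bucket = self.buckets.get(vip)
        if bucket is None:
            bucket = self.buckets[vip] = TokenBucket(self.rate_pps, self.burst)
        return bucket.allow(now)


if __name__ == "__main__":
    limiter = VipLimiter(rate_pps=1.0, burst=2.0)
    for t in (0.0, 0.0, 0.0, 1.0):
        print(t, limiter.allow("203.0.113.10", now=t))
```

Timestamps are passed in explicitly to keep the example deterministic. A real data path would read a monotonic clock instead.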

Operational Experience

TGW has run stably for eight years across Tencent Cloud’s global infrastructure, serving gaming, live‑streaming, and financial workloads. Operational practices include blast‑radius isolation via hierarchical design, 50% redundancy at each layer, hot‑standby clusters for sub‑second failover, gradual traffic ramp‑up, automated scaling triggers, gray‑release version rollout, and rapid rollback.

Future Directions

The roadmap envisions integration of hardware offload, programmable forwarding, and further performance‑reliability enhancements to keep TGW at the forefront of next‑generation intelligent network infrastructure.
