Horizontal Scaling in TDSQL: Design Principles, Practices, and Case Studies
This article explains the background, challenges, and design principles of horizontal scaling for TDSQL, detailing its architecture, scaling process, shard‑key selection, high availability, distributed transactions, performance optimization, practical case studies, and a Q&A section.
Horizontal Scaling Background and Challenges
The article begins by contrasting vertical scaling, which expands a single machine's resources and is limited by hardware, with horizontal scaling, which adds more machines for theoretically unlimited capacity. Horizontal scaling, however, introduces complexity: data sharding, hotspot management, routing changes, rollback handling, consistency guarantees, and keeping performance growth linear.
Vertical vs. Horizontal Scaling
Vertical scaling improves CPU, memory, or storage of a single instance, while horizontal scaling distributes data across multiple nodes, allowing unlimited growth but requiring careful handling of distributed concerns.
Issues in Horizontal Scaling
Key challenges include how to split data (sharding), avoiding hotspot nodes, migrating data and updating routing without business impact, ensuring rollback and high availability, maintaining strong consistency, and achieving linear performance growth.
TDSQL Horizontal Scaling Practice
Architecture
TDSQL consists of three layers: a SQL engine layer that abstracts storage details, a storage layer composed of multiple SETs (each can be a primary‑plus‑replicas unit), and a Scheduler module that monitors and controls the cluster, handling scaling and failover transparently to the business.
Scaling Process
Initially data resides on a single SET but is already split into 256 shards. Scaling involves adding new SETs via a UI, allocating resources, synchronizing data, freezing writes briefly, updating routing, and finally removing redundant data. The process can expand from 1 to 2, 3, or up to 256 nodes, with each step designed to be minimally invasive.
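The pre-splitting into 256 fixed shards is what makes this process cheap: scaling only reassigns shard ranges to SETs, it never re-hashes rows. A minimal sketch of such a routing table, with illustrative names rather than TDSQL's actual internals:

```python
# Hypothetical sketch of TDSQL-style shard-to-SET routing: data is pre-split
# into 256 shards, and scaling reassigns whole shards to new SETs without
# re-hashing any row. Names are illustrative, not TDSQL's real API.

NUM_SHARDS = 256

def build_routing(num_sets: int) -> dict[int, int]:
    """Assign the 256 fixed shards evenly across `num_sets` SETs."""
    per_set = NUM_SHARDS // num_sets
    return {shard: min(shard // per_set, num_sets - 1)
            for shard in range(NUM_SHARDS)}

# Scaling from 1 SET to 2 only changes the routing table; each row's
# shard number is stable, so only shards 128-255 need to migrate.
before = build_routing(1)
after = build_routing(2)
moved = [s for s in range(NUM_SHARDS) if before[s] != after[s]]
print(len(moved))  # 128 shards migrate to the new SET
```

Because a shard is the unit of migration, growing from 1 to 2, 3, or up to 256 nodes is always a bounded, incremental move of shard ranges.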
Design Principles Behind TDSQL Scaling
Shard‑Key Selection
Businesses are encouraged to define a shard key (e.g., user ID, device ID) when creating tables, ensuring balanced data distribution, co‑location of related rows, and efficient query execution. If no key is provided, TDSQL selects one randomly, which may degrade performance.
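The co-location property follows from deterministic hashing: every row with the same shard key hashes to the same shard, hence the same SET. A sketch under the assumption of a simple hash (TDSQL's real hash function is internal):

```python
# Illustrative shard-key hashing: rows sharing a shard key always land on
# the same shard, keeping related data co-located for joins and
# transactions. The hash choice here is an assumption, not TDSQL's.

import hashlib

NUM_SHARDS = 256

def shard_of(shard_key: str) -> int:
    """Map a business shard key (e.g. a user ID) to one of 256 shards."""
    digest = hashlib.md5(shard_key.encode("utf-8")).digest()
    return digest[0]  # first byte gives a value in 0..255

# Deterministic: the same user's orders, profile, and logs co-locate.
print(shard_of("user_10086") == shard_of("user_10086"))  # True
```

A well-chosen key (user ID, device ID) has high cardinality and even access patterns; a skewed key concentrates traffic on a few shards and recreates the hotspot problem.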
High Availability and Reliability
The scaling workflow includes data synchronization, continuous data verification, a brief write‑freeze during routing updates, and delayed deletion of redundant data, all of which preserve strong consistency and minimize impact on the application.
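The ordering of these steps is what preserves consistency: verification happens before the switch, the write freeze closes the window between verification and routing update, and deletion is deferred so rollback stays possible. A toy in-memory sketch of that cutover sequence (all names illustrative; in TDSQL the Scheduler drives this):

```python
# Toy sketch of the cutover described above: copy, verify, brief write
# freeze, routing switch, delayed delete. Dicts stand in for SETs.

def cutover(src_set: dict, dst_set: dict, routing: dict, shards: list, frozen: set):
    # 1. Copy shard data to the new SET while writes continue on src.
    for s in shards:
        dst_set[s] = dict(src_set[s])
    # 2. Continuously verify both copies match before switching.
    assert all(dst_set[s] == src_set[s] for s in shards)
    # 3. Briefly freeze writes so nothing slips between verify and switch.
    frozen.update(shards)
    # 4. Flip routing: traffic for these shards now goes to dst.
    for s in shards:
        routing[s] = "dst"
    frozen.clear()
    # 5. Keep the old copy as a rollback safety net; the caller deletes
    #    it only after the new routing has proven stable.
    return {s: src_set.pop(s) for s in shards}

routing = {s: "src" for s in range(4)}
src = {s: {"rows": s} for s in range(4)}
dst = {}
stale = cutover(src, dst, routing, [2, 3], set())
print(routing)  # shards 2 and 3 now route to "dst"
```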
Distributed Transactions
After scaling, data spans multiple nodes; TDSQL uses a two‑phase commit protocol to guarantee atomicity across nodes. The transaction manager is decentralized, allowing linear performance growth and robust fault tolerance.
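The protocol itself is standard two-phase commit: a prepare phase where every SET votes, then a commit or rollback decided by the outcome. A minimal sketch of the protocol (an illustration of 2PC in general, not TDSQL's implementation):

```python
# Minimal two-phase commit sketch for a cross-SET transaction, assuming
# each participant can durably "prepare" before the decision is made.

class Participant:
    """Stands in for one SET holding part of the transaction's data."""
    def __init__(self, name: str):
        self.name = name
        self.state = "init"

    def prepare(self) -> bool:
        # Real systems write redo/undo logs and hold locks here,
        # voting yes only once the prepared state is durable.
        self.state = "prepared"
        return True

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants) -> bool:
    # Phase 1: every node must vote yes.
    if all(p.prepare() for p in participants):
        # Phase 2: the decision is commit; all participants apply it.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote aborts the transaction everywhere.
    for p in participants:
        p.rollback()
    return False

sets = [Participant("set_1"), Participant("set_2")]
print(two_phase_commit(sets))  # True: both SETs committed atomically
```

Because the coordinator role lives in stateless SQL engine nodes rather than a single dedicated manager, adding engines scales transaction throughput and removes the coordinator as a single point of failure.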
Achieving Linear Performance Growth
Performance is enhanced by keeping related data on the same node, parallel computation on shards, push‑down of query predicates, data redundancy to reduce cross‑node traffic, and stream‑based aggregation that avoids large memory spikes.
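Push-down and streaming aggregation combine naturally: each shard filters and pre-aggregates locally, and the SQL engine merges small per-shard partials instead of pulling raw rows to one node. A sketch with made-up data and names:

```python
# Sketch of predicate push-down plus streaming aggregation: shards compute
# partial aggregates behind the pushed-down WHERE clause, and the engine
# merges them group by group. Data and names are illustrative.

from collections import defaultdict

def shard_partial_sum(rows, predicate):
    """Runs on a storage node: filter (pushed-down WHERE) and pre-aggregate."""
    partial = defaultdict(int)
    for key, value in rows:
        if predicate(key, value):
            partial[key] += value
    return partial

def merge_partials(partials):
    """Runs on the SQL engine: combine one small dict per shard, not raw rows."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)

shard_a = [("u1", 10), ("u2", 5)]
shard_b = [("u1", 7), ("u3", 1)]
pred = lambda k, v: v > 2  # WHERE value > 2, evaluated on each shard
result = merge_partials(shard_partial_sum(s, pred) for s in (shard_a, shard_b))
print(result)  # {'u1': 17, 'u2': 5}
```

Since cross-node traffic carries only compacted partials, adding shards adds parallel filter/aggregate capacity, which is what keeps throughput growth close to linear.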
Practical Cases
The article provides guidelines for choosing shard keys, criteria for when to scale (disk usage, CPU, QPS thresholds, upcoming traffic spikes), and real‑world cloud cluster examples: a 4‑SET cluster capable of expanding to 128 nodes and an 8‑SET cluster expandable to 64 nodes.
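The scaling criteria above can be reduced to a simple capacity check. The thresholds below are example values for illustration, not official TDSQL recommendations:

```python
# Hypothetical helper turning the scaling criteria (disk, CPU, QPS,
# expected traffic spikes) into a decision. Thresholds are examples only.

def should_scale(disk_pct: float, cpu_pct: float, qps: float,
                 qps_capacity: float, spike_expected: bool,
                 disk_limit: float = 70.0, cpu_limit: float = 70.0,
                 load_limit: float = 0.8) -> bool:
    return (
        disk_pct >= disk_limit                # running out of storage
        or cpu_pct >= cpu_limit               # compute-bound
        or qps / qps_capacity >= load_limit   # nearing throughput ceiling
        or spike_expected                     # e.g. an upcoming promotion
    )

# Low current load, but a known traffic spike is coming: scale ahead of it.
print(should_scale(45.0, 30.0, 40_000, 100_000, spike_expected=True))  # True
```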
Q&A
Answers clarify that pre‑scale tables are already partitioned, backup consistency is ensured by strong replication within each SET, and the two‑phase commit mechanism avoids single‑point failures by using stateless SQL engines and distributed logging.