
Horizontal Scaling in TDSQL: Design Principles, Practices, and Case Studies

This article explains the background, challenges, and design principles of horizontal scaling for TDSQL, detailing its architecture, scaling process, shard‑key selection, high availability, distributed transactions, performance optimization, practical case studies, and a Q&A section.

IT Architects Alliance

Horizontal Scaling Background and Challenges

The article opens by contrasting vertical scaling, which expands a single machine's resources and is ultimately bounded by hardware, with horizontal scaling, which adds machines for theoretically unlimited capacity but introduces new complexity: data sharding, hotspot management, routing changes, rollback handling, consistency guarantees, and linear performance scaling.

Vertical vs. Horizontal Scaling

Vertical scaling improves CPU, memory, or storage of a single instance, while horizontal scaling distributes data across multiple nodes, allowing unlimited growth but requiring careful handling of distributed concerns.

Issues in Horizontal Scaling

Key challenges include how to split the data (sharding), how to avoid hotspot nodes, how to migrate data and update routing without impacting the business, how to support rollback and high availability, how to maintain strong consistency, and how to achieve linear performance growth as nodes are added.

TDSQL Horizontal Scaling Practice

Architecture

TDSQL consists of three layers: a SQL engine layer that abstracts storage details, a storage layer composed of multiple SETs (each can be a primary‑plus‑replicas unit), and a Scheduler module that monitors and controls the cluster, handling scaling and failover transparently to the business.

Scaling Process

Initially all data resides on a single SET but is already split into 256 shards. Scaling adds new SETs through a UI, allocates resources, synchronizes data, briefly freezes writes, updates routing, and finally removes redundant copies. Because the shard count is fixed, the cluster can grow from 1 SET to 2, 3, and up to 256, with each step designed to be minimally invasive.
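Because the 256 shards are created up front, scaling out is a matter of reassigning shard ownership rather than re-splitting data. A minimal sketch of that idea (illustrative only, not TDSQL's actual assignment logic):

```python
# Illustrative sketch: rebalancing 256 fixed shards across a growing
# number of SETs. The shard count never changes; only ownership moves.
NUM_SHARDS = 256

def assign_shards(num_sets: int) -> dict[str, list[int]]:
    """Split the 256 shards into near-equal contiguous ranges per SET."""
    assignment: dict[str, list[int]] = {}
    base, extra = divmod(NUM_SHARDS, num_sets)
    start = 0
    for i in range(num_sets):
        count = base + (1 if i < extra else 0)
        assignment[f"SET-{i + 1}"] = list(range(start, start + count))
        start += count
    return assignment

# Scaling from 1 SET to 2: SET-2 takes over shards 128..255, so only
# those shards need to be copied -- each step moves the minimum data.
before = assign_shards(1)  # SET-1 owns all 256 shards
after = assign_shards(2)   # SET-1 keeps 0..127, SET-2 takes 128..255
moved = set(after["SET-2"]) & set(before["SET-1"])
```

The same function extends to any SET count up to 256, which is why the article describes growth to 2, 3, and eventually 256 SETs as a uniform, repeatable step.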

Design Principles Behind TDSQL Scaling

Shard‑Key Selection

Businesses are encouraged to define a shard key (e.g., user ID, device ID) when creating tables, ensuring balanced data distribution, co‑location of related rows, and efficient query execution. If no key is provided, TDSQL selects one randomly, which may degrade performance.
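The mechanics behind this can be sketched as a hash of the shard key onto one of the 256 shards, then a routing lookup from shard to SET. The function names, hash choice, and routing table below are illustrative assumptions, not TDSQL's actual routing code:

```python
# Hypothetical sketch: shard key -> shard -> owning SET.
import hashlib

NUM_SHARDS = 256  # TDSQL pre-splits each table into 256 shards

def shard_of(shard_key: str) -> int:
    """Deterministically map a shard key (e.g. a user ID) to a shard."""
    digest = hashlib.md5(shard_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Routing table for a 2-SET cluster: contiguous shard ranges per SET.
routing = {range(0, 128): "SET-1", range(128, 256): "SET-2"}

def set_for(shard_key: str) -> str:
    """Resolve which SET currently owns the row for this key."""
    shard = shard_of(shard_key)
    for shard_range, set_name in routing.items():
        if shard in shard_range:
            return set_name
    raise LookupError(f"no SET owns shard {shard}")
```

Because the mapping is deterministic, all rows sharing a shard key land on the same SET, which is what enables the co-location and single-node query execution the article highlights.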

High Availability and Reliability

The scaling workflow includes data synchronization, continuous data verification, a brief write‑freeze during routing updates, and delayed deletion of redundant data, all of which preserve strong consistency and minimize impact on the application.
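The ordering of those safeguards matters: verification happens before routing changes, the write-freeze covers only the cutover, and deletion comes last so a rollback path remains. A minimal in-memory sketch of that sequence, with all classes as illustrative stand-ins rather than TDSQL internals:

```python
# Hedged sketch of the migration safety sequence: copy, verify,
# brief write-freeze, routing switch, then deletion of the old copy.
import hashlib

class FakeSet:
    """Toy stand-in for a storage SET holding shard data in memory."""
    def __init__(self):
        self.shards: dict[int, list[str]] = {}
        self.frozen: set[int] = set()

    def write(self, shard: int, row: str) -> None:
        if shard in self.frozen:
            raise RuntimeError("shard is write-frozen during cutover")
        self.shards.setdefault(shard, []).append(row)

    def checksum(self, shard: int) -> str:
        data = "\n".join(self.shards.get(shard, []))
        return hashlib.sha256(data.encode()).hexdigest()

def migrate_shard(shard, source, target, routing):
    # 1. Copy existing data (in reality: snapshot plus binlog catch-up).
    target.shards[shard] = list(source.shards.get(shard, []))
    # 2. Verify the copy BEFORE touching routing.
    assert source.checksum(shard) == target.checksum(shard)
    # 3. Briefly freeze writes so no update is lost during the switch.
    source.frozen.add(shard)
    try:
        routing[shard] = target  # 4. atomically update routing
    finally:
        source.frozen.discard(shard)
    # 5. The article describes DELAYED deletion of the redundant copy;
    # it is done immediately here only for brevity.
    source.shards.pop(shard, None)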

Distributed Transactions

After scaling, data spans multiple nodes; TDSQL uses a two‑phase commit protocol to guarantee atomicity across nodes. The transaction manager is decentralized, allowing linear performance growth and robust fault tolerance.
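For readers unfamiliar with the protocol, a textbook two-phase-commit sketch follows; it shows the prepare/commit structure in general and is not TDSQL's implementation:

```python
# Textbook two-phase commit: the coordinator commits only if every
# participant votes yes in the prepare phase; otherwise all roll back.
class Participant:
    def __init__(self, name: str, will_fail: bool = False):
        self.name, self.will_fail = name, will_fail
        self.state = "init"

    def prepare(self) -> bool:
        """Phase 1: promise the local branch can commit (or refuse)."""
        self.state = "prepared" if not self.will_fail else "aborted"
        return not self.will_fail

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled_back"

def two_phase_commit(participants) -> bool:
    if all(p.prepare() for p in participants):
        for p in participants:   # Phase 2a: commit everywhere
            p.commit()
        return True
    for p in participants:       # Phase 2b: abort everywhere
        p.rollback()
    return False
```

In TDSQL's case, as the article notes, the coordinating role is decentralized across stateless SQL engines rather than a single transaction manager, avoiding a single point of failure.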

Achieving Linear Performance Growth

Performance is enhanced by keeping related data on the same node, parallel computation on shards, push‑down of query predicates, data redundancy to reduce cross‑node traffic, and stream‑based aggregation that avoids large memory spikes.
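Predicate and aggregate push-down can be illustrated with a small sketch: each shard pre-aggregates locally, and the SQL engine only merges the small partial results instead of pulling every row over the network. The shard data and function names here are illustrative:

```python
# Sketch of push-down plus streaming aggregation for a
# SELECT key, SUM(value) ... GROUP BY key across two SETs.
shard_rows = {
    "SET-1": [("alice", 10), ("bob", 5), ("alice", 7)],
    "SET-2": [("bob", 3), ("carol", 8)],
}

def local_partial(rows):
    """Runs on each shard: pre-aggregate locally (the push-down)."""
    partial: dict[str, int] = {}
    for key, value in rows:
        partial[key] = partial.get(key, 0) + value
    return partial

def merge_partials(partials):
    """Runs on the SQL engine: merge partials one shard at a time,
    so memory is bounded by the number of groups, not of rows."""
    total: dict[str, int] = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total

result = merge_partials(local_partial(r) for r in shard_rows.values())
# result == {'alice': 17, 'bob': 8, 'carol': 8}
```

Because each shard's work is independent, adding SETs adds aggregation capacity in parallel, which is the source of the near-linear scaling the article claims.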

Practical Cases

The article provides guidelines for choosing shard keys, criteria for when to scale (disk usage, CPU, QPS thresholds, upcoming traffic spikes), and real‑world cloud cluster examples: a 4‑SET cluster capable of expanding to 128 nodes and an 8‑SET cluster expandable to 64 nodes.
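Those scale-out triggers can be captured in a simple capacity check. The threshold values below are examples for illustration, not TDSQL defaults:

```python
# Illustrative capacity check mirroring the article's scale-out
# triggers: disk usage, CPU, QPS, or a known upcoming traffic spike.
THRESHOLDS = {"disk_pct": 70.0, "cpu_pct": 80.0, "qps_pct_of_max": 75.0}

def should_scale(metrics: dict, spike_expected: bool = False) -> bool:
    """Return True if any resource crosses its threshold, or if a
    planned traffic spike justifies scaling ahead of demand."""
    if spike_expected:
        return True
    return any(metrics.get(k, 0.0) >= v for k, v in THRESHOLDS.items())
```

In practice such a check would run against monitoring data per SET, so that hotspots on one SET trigger rebalancing even when cluster-wide averages look healthy.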

Q&A

Answers clarify that pre‑scale tables are already partitioned, backup consistency is ensured by strong replication within each SET, and the two‑phase commit mechanism avoids single‑point failures by using stateless SQL engines and distributed logging.

Tags: Performance Optimization, Sharding, High Availability, Horizontal Scaling, Distributed Databases, TDSQL
Written by

IT Architects Alliance

A community for discussing systems and internet architecture: large-scale distributed, high-availability, and high-performance designs, along with big data, machine learning, AI, and architecture evolution with internet technologies. Features real-world large-scale architecture case studies. Open to architects with ideas to share.
