Primary Key and Index Design for Distributed Databases
The article explains why auto‑increment primary keys are unsuitable for sharded databases, recommends globally unique ordered identifiers, and details index and global‑table strategies—including redundant tables, index tables, and embedding shard information in keys—to achieve efficient single‑shard queries.
In a distributed database the primary key must be globally unique across all shards; using an auto‑increment key violates this because the value is generated after insertion and can repeat on different shards.
For example, the orders table is defined as:
CREATE TABLE `orders` (
`O_ORDERKEY` int NOT NULL auto_increment,
`O_CUSTKEY` int NOT NULL,
`O_ORDERSTATUS` char(1) NOT NULL,
`O_TOTALPRICE` decimal(15,2) NOT NULL,
`O_ORDERDATE` date NOT NULL,
`O_ORDERPRIORITY` char(15) NOT NULL,
`O_CLERK` char(15) NOT NULL,
`O_SHIPPRIORITY` int NOT NULL,
`O_COMMENT` varchar(79) NOT NULL,
PRIMARY KEY (`O_ORDERKEY`),
KEY (`O_CUSTKEY`)
) ENGINE=InnoDB;If O_ORDERKEY is auto‑incremented, the same key value (e.g., 1) can appear in multiple shards, leading to duplicate primary keys.
Therefore, auto‑increment should be avoided in sharded environments; instead, use ordered globally unique identifiers such as MySQL’s UUID, business‑generated keys, or algorithms like Snowflake (with caution about time‑backward issues).
Index design : Queries are routed by the shard key, but many workloads also need to search by other columns. A simple query like SELECT * FROM orders WHERE o_orderkey = 1 would require scanning all shards because the shard key is o_custkey , not o_orderkey .
Two design options are presented:
Redundant table where the shard key itself is the primary key.
Include the shard key in an additional index.
A practical solution is to create an index table that stores only the primary key and its shard key:
CREATE TABLE idx_orderkey_custkey (
o_orderkey INT,
o_custkey INT,
PRIMARY KEY (o_orderkey)
);Querying then becomes a two‑step process:
# step 1
SELECT o_custkey FROM idx_orderkey_custkey WHERE o_orderkey = 1;
# step 2
SELECT * FROM orders WHERE o_custkey = ? AND o_orderkey = 1;This ensures each step can be routed to a single shard, dramatically improving performance regardless of the total number of shards.
An even more efficient approach embeds the shard information directly in the primary key string (e.g., o_orderkey = string(o_orderkey + o_custkey) ), allowing the system to determine the target shard from the key itself and execute a single‑shard query.
For global tables that lack a shard key but are small and rarely updated (e.g., a nation reference table), replicating the table on every shard avoids cross‑shard joins.
Unique index design : Like primary keys, unique indexes must also be globally unique in a sharded environment; using UUID‑based keys guarantees uniqueness across all shards.
Summary :
Use ordered UUIDs or other globally unique identifiers for primary keys in sharded databases.
Design unique indexes with the same global‑uniqueness principle.
If a unique index is not the shard key, store shard information in the indexed column to enable single‑shard queries.
Replicate small, infrequently‑changed tables on every shard to avoid cross‑shard lookups.
Architect's Guide
Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.