Storing 2 Billion Phone Numbers: Int vs String, Schema Design and Pitfalls
When designing a schema for two‑billion phone numbers, use a VARCHAR(20) column with utf8mb4 Unicode, a unique index, and proper validation/encryption rather than a 32‑bit INT, because strings preserve leading zeros, international prefixes, extensions, and support business extensibility, fault tolerance, and future changes.
In an interview scenario a candidate is asked how to store 2 billion phone numbers and whether to use INT or STRING, VARCHAR or CHAR.
Phone numbers are 11‑digit numeric strings; an INT (32‑bit) maxes at 2,147,483,647, which cannot hold 11‑digit numbers. A 64‑bit LONG (BIGINT) can store them, but it loses leading zeros, cannot represent international prefixes (+86) or hyphens, and makes range queries inefficient.
Long phoneNumber = 01324567890L; // compile error – leading zero not allowedUsing a character type (VARCHAR) preserves the exact representation, supports optional country codes, hyphens, extensions, and enables flexible pattern matching.
CREATE TABLE user_tab (
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT '用户ID',
phone_number VARCHAR(20) NOT NULL COMMENT '手机号',
PRIMARY KEY (id),
UNIQUE KEY idx_phone (phone_number)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='用户表';The interview expects the candidate to discuss three hidden assessment points: business extensibility, data fault‑tolerance, and comprehensive problem‑solving.
Choosing VARCHAR(20) instead of VARCHAR(11) leaves room for international numbers (+8613822223333), extensions (13822223333#123), and future business changes.
Typical pitfalls in daily development include:
Using a field length that is too short (VARCHAR(11) truncates longer numbers).
Wrong character set/collation – use utf8mb4 and utf8mb4_unicode_ci to store all Unicode symbols.
Missing unique index on phone numbers, leading to duplicates.
Skipping data cleaning – remove spaces, hyphens, keep only + and digits.
Storing phone numbers in plain text – apply encryption or masking.
// Strict validation (11‑digit pure numbers)
String regex = "^1(3[0-9]|4[579]|5[0-35-9]|6[2567]|7[0-8]|8[0-9]|9[0-35-9])\\d{8}$";
// Loose validation (allows +86 prefix)
String looseRegex = "^(\\+\\d{1,3})?1(3\\d|4[579]|5[0-35-9]|6[2567]|7[0-8]|8\\d|9[0-35-9])\\d{8}$";Overall, a reasonable column length equals the basic requirement plus international extension plus a tolerance buffer.
Java Tech Enthusiast
Sharing computer programming language knowledge, focusing on Java fundamentals, data structures, related tools, Spring Cloud, IntelliJ IDEA... Book giveaways, red‑packet rewards and other perks await!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.