Storing 2 Billion Phone Numbers: Why VARCHAR(20) Is Preferable to INT/BIGINT
The article explains why using a VARCHAR(20) column to store billions of phone numbers is safer and more flexible than INT or BIGINT, covering range limits, data integrity, query convenience, interview expectations, and common pitfalls such as insufficient field length, charset issues, missing indexes, and lack of encryption.
Preface
A fan reported an interview question from ByteDance: for storing 2 billion phone numbers, should you choose INT or STRING? VARCHAR or CHAR? Why?
The interviewer's reaction suggested the answer was unsatisfactory.
This article shares the author's thinking.
1. Problems with using INT or BIGINT for 2 billion phone numbers
1.1 Can INT hold an 11‑digit phone number?
Phone numbers are 11‑digit numbers, e.g., 13728199213 .
In Java, int is 32‑bit with a maximum value of 2^31‑1 = 2,147,483,647 (≈2×10⁹), which cannot store an 11‑digit phone number.
To store it you need a 64‑bit Long , which maps to a database BIGINT .
1.2 Data integrity
Storing 01324567890 as a Long loses the leading zero, breaking data integrity.
Long phoneNumber =01324567890L; // compilation error, Java does not allow leading‑zero Long literalsPhone numbers may also contain country codes ( +86 ) or hyphens ( 137-2819-9213 ), making integer types unsuitable.
1.3 Query inconvenience
Searching for numbers starting with 137 requires converting a BIGINT to a string before performing a fuzzy match, which is inefficient.
2. Advantages of using String (VARCHAR)
Fidelity : Stores digits, symbols, and leading zeros exactly.
Flexibility : Supports fuzzy queries, international numbers, and future extensions.
Convenience : No overflow or format‑conversion worries.
CREATE TABLE user_tab (
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT '用户ID',
phone_number VARCHAR(20) NOT NULL COMMENT '手机号',
PRIMARY KEY (id),
UNIQUE KEY idx_phone (phone_number)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci COMMENT='用户表';2.1 Why VARCHAR(20) instead of VARCHAR(11)
Phone numbers are 11 digits, but using VARCHAR(11) fails for international numbers (+86 13822223333, 14 digits), numbers with country code (008613822223333, 15 digits), or extensions (13822223333#123).
Future business may need to store landline numbers (e.g., 010-62223333 ), virtual numbers ( 17012341234-5678 ), or mixed login identifiers (email + phone).
Therefore, field length and type should leave room for business changes.
Data‑tolerance considerations include:
Uncontrollable input : Users may enter spaces or symbols; storing the raw string simplifies cleaning.
Design compromise : Enforcing VARCHAR(11) forces strict filtering, increasing complexity.
Storage cost comparison:
VARCHAR(11) : max 11 bytes (1 byte per digit in utf8mb4).
VARCHAR(20) : max 20 bytes.
For 2 billion rows, the extra 9 bytes per row adds ~18 GB, still acceptable compared with the ~16 GB saved by using BIGINT .
Interviewers expect an answer formula like:
合理长度 = 基础需求 + 国际扩展 + 容错缓冲Beyond the formula, demonstrate thinking about business extensibility, data tolerance, and comprehensive problem analysis.
2.2 Extreme scenarios
If the number is purely numeric and does not start with 0, BIGINT can be used, but never INT . Mentioning such edge cases shows thoroughness.
3. Common development pitfalls
3.1 Field length too short
VARCHAR(11) truncates international numbers like +8613822223333 .
Use VARCHAR(20) to accommodate international and extension numbers.
3.2 Character set and collation
Using utf8 cannot store emojis or special symbols.
Use utf8mb4 with utf8mb4_unicode_ci for full Unicode support.
3.3 Improper index design
Missing a unique index on phone numbers leads to duplicates.
ALTER TABLE user ADD UNIQUE INDEX idx_phone (phone);3.4 Lack of data cleaning and validation
Storing raw inputs like 138-2222-3333 or 138 222 23333 creates inconsistent formats.
Clean before insertion: remove spaces, hyphens, keep only + and digits. Regex validation, e.g., ^\+?\d{8,20}$ (allows optional + and 8‑20 digits).
3.5 Ignoring privacy and security
Storing phone numbers in plain text risks privacy leaks.
Encrypt storage using AES or built‑in DB encryption. Mask output, e.g., 138****3333 .
3.6 Risk control validation
// Strict validation (11‑digit pure number, no country code)
String regex = "^1(3[0-9]|4[579]|5[0-35-9]|6[2567]|7[0-8]|8[0-9]|9[0-35-9])\\d{8}$";
// Loose validation (allows country code, e.g., +86 13812345678)
String looseRegex = "^(\\+\\d{1,3})?1(3\\d|4[579]|5[0-35-9]|6[2567]|7[0-8]|8\\d|9[0-35-9])\\d{8}$";Overall, the article guides developers on choosing appropriate data types for massive phone‑number storage, emphasizing extensibility, data integrity, query efficiency, and security.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.