Understanding Characters, Character Sets, and Encoding: From ASCII to Unicode
This article explains the concepts of characters, character sets, and character encoding, describes how computers store and render text using methods like ASCII, GB2312, Unicode, and UTF‑8/16/32, and discusses why garbled text occurs across different languages and systems.
Writing is one of humanity's great inventions; in the computer world, making it work means understanding how text is stored and transmitted.
Common garbled-text issues, such as the infamous "锟斤拷" symbols, arise when bytes written with one encoding are decoded with another.
Characters are abstract symbols that can be letters, numbers, punctuation, emojis, or images. In computers, characters must be converted to binary for storage.
A character set is a collection of characters, like a codebook that maps characters to specific patterns.
Character encoding is the rule that converts characters into binary according to a character set, enabling the computer to render them on screen.
Encoding methods start by assigning a unique number to each character in a set and then storing that number as binary. A byte (8 bits) can represent 256 different values, allowing a direct mapping for up to 256 characters.
The first widely used character set was ASCII, which defines only 128 characters (7 bits), leaving the high bit of each byte unused.
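A minimal Python sketch of this mapping: each character corresponds to a number (its code point), and that number is what gets stored as binary. The characters chosen here are illustrative.

```python
# Each character maps to a number (its code point); the computer stores
# that number as binary.
for ch in ["A", "a", "0", "中"]:
    print(ch, ord(ch), bin(ord(ch)))

# 'A' is 65: only 7 bits, so every ASCII character fits in one byte with
# the high bit left at 0 -- half of the byte's 256 values go unused.
assert ord("A") == 65 and ord("A") < 128
# '中' (U+4E2D) lies outside ASCII and needs a multi-byte encoding.
assert ord("中") == 0x4E2D
```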
Because many languages have far more characters, extended sets like GB2312 (6,763 Chinese characters) were created, requiring two bytes per Chinese character and leading to compatibility problems between regions.
To unify global text, the Unicode standard was introduced, offering 17 planes of 65,536 code points each, for a total of 1,114,112 code points; the latest version (15.0) assigns 149,186 characters.
Unicode uses several encoding schemes:
UTF‑32 : fixed‑length 4‑byte encoding, simple but wasteful.
UTF‑16 : variable‑length 2‑ or 4‑byte encoding, commonly used in JavaScript.
UTF‑8 : variable‑length 1‑ to 4‑byte encoding, efficient and now the dominant web standard.
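The trade-offs among the three schemes show up directly in byte counts. A quick Python comparison (sample characters are illustrative; little-endian variants are used to omit the byte-order mark):

```python
# Bytes needed per character under each Unicode encoding scheme.
for ch in "A中😀":  # ASCII letter, CJK character, emoji
    print(ch,
          len(ch.encode("utf-8")),      # 1-4 bytes, ASCII stays 1 byte
          len(ch.encode("utf-16-le")),  # 2 bytes, or 4 (surrogate pair)
          len(ch.encode("utf-32-le")))  # always 4 bytes
```

UTF-32 spends 4 bytes even on "A", which is why it is simple but wasteful; UTF-8 matches ASCII byte-for-byte, which is a large part of why it won on the web.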
In UTF‑8, the leading bits of the first byte indicate the sequence length: a 1‑byte character starts with 0, an n‑byte sequence starts with n ones followed by a zero (110xxxxx, 1110xxxx, 11110xxx), and every continuation byte starts with 10.
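These bit patterns can be implemented directly. A sketch of a UTF‑8 encoder for a single code point (ignoring surrogate-range validation for brevity), checked against Python's built-in encoder:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the UTF-8 bit patterns:
    1 byte : 0xxxxxxx                            (U+0000..U+007F)
    2 bytes: 110xxxxx 10xxxxxx                   (U+0080..U+07FF)
    3 bytes: 1110xxxx 10xxxxxx 10xxxxxx          (U+0800..U+FFFF)
    4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (U+10000..U+10FFFF)
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:
        return bytes([0xE0 | cp >> 12,
                      0x80 | (cp >> 6) & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,
                  0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

assert utf8_encode(ord("A")) == "A".encode("utf-8")       # 1 byte
assert utf8_encode(ord("中")) == "中".encode("utf-8")      # 3 bytes
assert utf8_encode(0x1F600) == "😀".encode("utf-8")       # 4 bytes
```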
When text encoded in one scheme (e.g., GBK) is opened with another (e.g., UTF‑8), garbled characters appear, illustrating the importance of matching encoding and decoding methods.
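Both failure modes are easy to reproduce in Python, including the "锟斤拷" symbols mentioned earlier, which come from U+FFFD replacement characters (EF BF BD in UTF‑8) being re-read as GBK byte pairs:

```python
# GBK bytes read as UTF-8: the byte sequences are invalid, so the decoder
# substitutes U+FFFD replacement characters.
data = "中文".encode("gbk")                 # D6 D0 CE C4
print(data.decode("utf-8", errors="replace"))  # mojibake: ����

# The classic "锟斤拷": two U+FFFD characters encoded in UTF-8 give the
# bytes EF BF BD EF BF BD, which decoded as GBK pair up into three
# Chinese characters.
garbled = ("\ufffd" * 2).encode("utf-8")
print(garbled.decode("gbk"))                # → 锟斤拷
```

The fix is always the same: decode bytes with the same encoding that produced them.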
360 Tech Engineering