Understanding Character Encoding: ASCII, GB2312, Unicode, and UTF-8
This article explains the history, purpose, and differences of major character encodings—including ASCII, GB2312, Unicode, and UTF-8—while showing how they are used and converted in modern computing environments.
Character Encoding Types
Many developers wonder why there are so many different encodings such as ASCII, GBK, GB2312, Unicode, and UTF‑8. To grasp these issues, we first need to look at the history of character encoding, which is a legacy "debt" inherited from early computers.
What Is Character Encoding
Computers fundamentally understand only binary 0s and 1s. Data is stored as bits, grouped into bytes (1 byte = 8 bits). To let humans read and write text, each character is assigned a numeric code that can be represented in bytes. The earliest such system is the ASCII code.
ASCII Encoding
ASCII is the earliest character set, using one byte per character. It defines control characters (codes 0‑31, plus 127) for device control and printable characters (codes 32‑126) for letters, digits, punctuation, and some symbols.
Control characters: ASCII codes 0‑31 (plus 127, DEL) are control characters, e.g., null, bell, backspace, tab, line feed, carriage return.
0: null (NUL)
7‑8: bell (BEL) and backspace (BS)
9, 10, 13: horizontal tab (HT), line feed (LF), and carriage return (CR)
Printable characters: ASCII codes 32‑126 are printable characters, including the space, digits, uppercase and lowercase letters, punctuation, and special symbols.
Digits: '0'‑'9'
Uppercase: 'A'‑'Z'
Lowercase: 'a'‑'z'
Punctuation: '.', ',', '!', '?', ';', ':', '/' etc.
Special symbols: '~', '@', '#', '$' etc.
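The ASCII mapping above can be explored directly in Python, whose built-in ord() and chr() convert between characters and their numeric codes:

```python
# ord() returns a character's numeric code; chr() does the reverse.
print(ord('A'))   # 65
print(ord('0'))   # 48
print(chr(97))    # 'a'

# An ASCII string encodes to exactly one byte per character.
encoded = 'Hello'.encode('ascii')
assert encoded == b'Hello'
assert len(encoded) == 5
```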
GB2312 and Other Chinese Encodings
ASCII cannot represent Chinese characters, which number in the thousands. China introduced a series of national standards called GB (Guóbiāo) encodings. The most influential is GB 2312‑1980, which extends ASCII to include common Chinese characters and is widely supported in mainland China and Singapore.
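Python ships a gb2312 codec, so the two-bytes-per-Chinese-character layout is easy to verify (the sample string here is just an illustration):

```python
# GB2312 keeps ASCII characters at one byte each and encodes
# each Chinese character as two bytes.
text = '中文ABC'
data = text.encode('gb2312')
print(len(data))  # 7: 2 + 2 bytes for the two Chinese characters, 3 for 'ABC'
assert data.decode('gb2312') == text
```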
Unicode Standard Encoding
Because many languages have their own encodings (e.g., Shift_JIS for Japanese, EUC‑KR for Korean), text mixing often leads to garbled output. Unicode unifies all characters into a single code space, eliminating such conflicts.
The most common in-memory Unicode representation is UTF‑16, which uses two bytes per character (four bytes for characters outside the Basic Multilingual Plane). Modern operating systems and most programming languages support Unicode directly.
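Every character, whatever the language, has a single Unicode code point, which Python exposes via ord():

```python
# Unicode assigns one code point per character across all scripts.
print(hex(ord('A')))   # 0x41 — same value as in ASCII
print(hex(ord('中')))  # 0x4e2d
```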
UTF‑8 Encoding
While Unicode solves compatibility, using a fixed two‑byte representation for every character doubles storage for pure English text. UTF‑8 solves this with a variable‑length encoding: 1 byte for ASCII characters, 3 bytes for most Chinese characters, and up to 4 bytes for characters outside the Basic Multilingual Plane (the original design allowed up to 6 bytes, but RFC 3629 limits UTF‑8 to 4).
Because UTF‑8 is backward compatible with ASCII, legacy software that only understands ASCII can still process UTF‑8 data.
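Both properties, variable length and ASCII compatibility, can be checked in a few lines:

```python
# UTF-8 byte lengths vary by character:
assert len('A'.encode('utf-8')) == 1   # ASCII: 1 byte
assert len('中'.encode('utf-8')) == 3  # common CJK: 3 bytes
assert len('𝄞'.encode('utf-8')) == 4  # outside the BMP: 4 bytes

# Backward compatibility: pure-ASCII text is byte-identical in UTF-8.
assert 'Hello'.encode('utf-8') == 'Hello'.encode('ascii')
```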
Conversion Between Encodings
To convert between encodings, a byte string in a non‑Unicode format can be decode()'d into a Unicode string, and a Unicode string can be encode()'d into another format. Understanding when to encode and when to decode is essential for handling text correctly.
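In Python, converting GB2312 bytes to UTF‑8 therefore always goes through a Unicode str in the middle:

```python
# Produce some GB2312 bytes to convert (in practice these would
# come from a legacy file or network source).
gb_bytes = '中文'.encode('gb2312')

text = gb_bytes.decode('gb2312')    # bytes -> Unicode str
utf8_bytes = text.encode('utf-8')   # Unicode str -> UTF-8 bytes

assert utf8_bytes.decode('utf-8') == '中文'
```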
Summary
In summary:
ASCII was created to handle English characters.
GB2312 was created to handle Chinese characters.
Unicode was created to handle characters from all languages.
UTF‑8 was created to store Unicode efficiently, using variable‑length encoding.
When working with files, text is typically read as UTF‑8, converted to Unicode in memory, edited, and then saved back as UTF‑8. Web pages often declare their encoding with <meta charset="UTF-8" /> .
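The read‑edit‑save cycle above is what Python's open() does when given an encoding argument; the file name here is just a throwaway example:

```python
import os
import tempfile

# Text is encoded to UTF-8 bytes on write and decoded back to a
# Unicode str on read; in memory we only ever handle the str.
path = os.path.join(tempfile.gettempdir(), 'encoding_demo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('中文, hello')
with open(path, encoding='utf-8') as f:
    content = f.read()
assert content == '中文, hello'
os.remove(path)
```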