Understanding Character Encoding: ASCII, GB2312, Unicode, and UTF-8
This article explains the history, purpose, and differences of major character encodings—including ASCII, GB2312, Unicode, and UTF-8—while showing how they are used and converted in modern computing environments.
Character Encoding Types
Many developers wonder why there are so many different encodings such as ASCII, GBK, GB2312, Unicode, and UTF‑8. To grasp these issues, we first need to look at the history of character encoding, which is a legacy "debt" inherited from early computers.
What Is Character Encoding
Computers fundamentally understand only binary 0s and 1s. Data is stored as bits, grouped into bytes (1 byte = 8 bits). To let humans read and write text, each character is assigned a numeric code that can be represented in bytes. The earliest such system is the ASCII code.
ASCII Encoding
ASCII is the earliest character set, using one byte per character. It defines control characters (codes 0‑31, plus 127) for device control and printable characters (codes 32‑126) for letters, digits, punctuation, and some symbols.
Control characters: ASCII codes 0‑31 (plus 127, DEL) are control characters, e.g., null, bell, backspace, tab, line feed, carriage return.
0: null (NUL)
7‑8: bell (BEL) and backspace (BS)
9, 10, 13: horizontal tab (HT), line feed (LF), and carriage return (CR)
Printable characters: ASCII codes 32‑126 are printable characters, including the space, digits, uppercase and lowercase letters, punctuation, and special symbols.
Digits: '0'‑'9'
Uppercase: 'A'‑'Z'
Lowercase: 'a'‑'z'
Punctuation: '.', ',', '!', '?', ';', ':', '/' etc.
Special symbols: '~', '@', '#', '$' etc.
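The ASCII mapping above can be explored directly in Python, whose built-in ord() and chr() convert between characters and their numeric codes:

```python
# ord() returns a character's numeric code; chr() does the reverse.
print(ord('A'))   # 65
print(ord('0'))   # 48
print(chr(97))    # 'a'

# An ASCII string encodes to exactly one byte per character.
encoded = 'Hello'.encode('ascii')
assert encoded == b'Hello'
assert len(encoded) == 5
```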
GB2312 and Other Chinese Encodings
ASCII cannot represent Chinese characters, which number in the thousands. China introduced a series of national standards called GB (Guóbiāo) encodings. The most influential is GB 2312‑1980, which extends ASCII to include common Chinese characters and is widely supported in mainland China and Singapore.
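Python ships a gb2312 codec, so the two-bytes-per-Chinese-character layout is easy to verify (the sample string here is just an illustration):

```python
# GB2312 keeps ASCII characters at one byte each and encodes
# each Chinese character as two bytes.
text = '中文ABC'
data = text.encode('gb2312')
print(len(data))  # 7: 2 + 2 bytes for the two Chinese characters, 3 for 'ABC'
assert data.decode('gb2312') == text
```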
Unicode Standard Encoding
Because many languages have their own encodings (e.g., Shift_JIS for Japanese, EUC‑KR for Korean), text mixing often leads to garbled output. Unicode unifies all characters into a single code space, eliminating such conflicts.
The most common in-memory Unicode representation is UTF‑16, which uses two bytes per character (four bytes for characters outside the Basic Multilingual Plane). Modern operating systems and most programming languages support Unicode directly.
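Every character, whatever the language, has a single Unicode code point, which Python exposes via ord():

```python
# Unicode assigns one code point per character across all scripts.
print(hex(ord('A')))   # 0x41 — same value as in ASCII
print(hex(ord('中')))  # 0x4e2d
```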
UTF‑8 Encoding
While Unicode solves compatibility, using a fixed two‑byte representation for every character doubles storage for pure English text. UTF‑8 solves this with a variable‑length encoding: 1 byte for ASCII characters, 3 bytes for most Chinese characters, and up to 4 bytes for characters outside the Basic Multilingual Plane (the original design allowed up to 6 bytes, but RFC 3629 limits UTF‑8 to 4).
Because UTF‑8 is backward compatible with ASCII, legacy software that only understands ASCII can still process UTF‑8 data.
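Both properties, variable length and ASCII compatibility, can be checked in a few lines:

```python
# UTF-8 byte lengths vary by character:
assert len('A'.encode('utf-8')) == 1   # ASCII: 1 byte
assert len('中'.encode('utf-8')) == 3  # common CJK: 3 bytes
assert len('𝄞'.encode('utf-8')) == 4  # outside the BMP: 4 bytes

# Backward compatibility: pure-ASCII text is byte-identical in UTF-8.
assert 'Hello'.encode('utf-8') == 'Hello'.encode('ascii')
```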
Conversion Between Encodings
To convert between encodings, a byte string in a non‑Unicode format can be decode()'d into a Unicode string, and a Unicode string can be encode()'d into another format. Understanding when to encode and when to decode is essential for handling text correctly.
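In Python, converting GB2312 bytes to UTF‑8 therefore always goes through a Unicode str in the middle:

```python
# Produce some GB2312 bytes to convert (in practice these would
# come from a legacy file or network source).
gb_bytes = '中文'.encode('gb2312')

text = gb_bytes.decode('gb2312')    # bytes -> Unicode str
utf8_bytes = text.encode('utf-8')   # Unicode str -> UTF-8 bytes

assert utf8_bytes.decode('utf-8') == '中文'
```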
Summary
In summary:
ASCII was created to handle English characters.
GB2312 was created to handle Chinese characters.
Unicode was created to handle characters from all languages.
UTF‑8 was created to store Unicode efficiently, using variable‑length encoding.
When working with files, text is typically read as UTF‑8, converted to Unicode in memory, edited, and then saved back as UTF‑8. Web pages often declare their encoding with <meta charset="UTF-8" /> .
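The read‑edit‑save cycle above is what Python's open() does when given an encoding argument; the file name here is just a throwaway example:

```python
import os
import tempfile

# Text is encoded to UTF-8 bytes on write and decoded back to a
# Unicode str on read; in memory we only ever handle the str.
path = os.path.join(tempfile.gettempdir(), 'encoding_demo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('中文, hello')
with open(path, encoding='utf-8') as f:
    content = f.read()
assert content == '中文, hello'
os.remove(path)
```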