Understanding Characters, Character Sets, and Encoding: From ASCII to Unicode
This article explains the concepts of characters, character sets, and character encoding, describes how computers store and render text using methods like ASCII, GB2312, Unicode, and UTF‑8/16/32, and discusses why garbled text occurs across different languages and systems.
Writing is one of humanity's great inventions; in the computer world, making it work means understanding how text is stored and transmitted.
Common garbled-text issues, such as the infamous "锟斤拷" symbols, arise when bytes written with one encoding are decoded with another.
Characters are abstract symbols that can be letters, numbers, punctuation, emojis, or images. In computers, characters must be converted to binary for storage.
A character set is a collection of characters, like a codebook that maps characters to specific patterns.
Character encoding is the rule that converts characters into binary according to a character set, enabling the computer to render them on screen.
Encoding methods start by assigning a unique number to each character in a set and then storing that number as binary. A byte (8 bits) can represent 256 different values, allowing a direct mapping for up to 256 characters.
The first widely used character set was ASCII, which defines only 128 characters (7 bits), leaving the high bit of each byte unused.
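A minimal Python sketch of this mapping: each character corresponds to a number (its code point), and that number is what gets stored as binary. The characters chosen here are illustrative.

```python
# Each character maps to a number (its code point); the computer stores
# that number as binary.
for ch in ["A", "a", "0", "中"]:
    print(ch, ord(ch), bin(ord(ch)))

# 'A' is 65: only 7 bits, so every ASCII character fits in one byte with
# the high bit left at 0 -- half of the byte's 256 values go unused.
assert ord("A") == 65 and ord("A") < 128
# '中' (U+4E2D) lies outside ASCII and needs a multi-byte encoding.
assert ord("中") == 0x4E2D
```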
Because many languages have far more characters, extended sets like GB2312 (6,763 Chinese characters) were created, requiring two bytes per Chinese character and leading to compatibility problems between regions.
To unify global text, the Unicode standard was introduced, offering 17 planes of 65,536 code points each, for a total of 1,114,112 code points; the latest version (15.0) assigns 149,186 characters.
Unicode uses several encoding schemes:
UTF‑32 : fixed‑length 4‑byte encoding, simple but wasteful.
UTF‑16 : variable‑length 2‑ or 4‑byte encoding, commonly used in JavaScript.
UTF‑8 : variable‑length 1‑ to 4‑byte encoding, efficient and now the dominant web standard.
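The trade-offs among the three schemes show up directly in byte counts. A quick Python comparison (sample characters are illustrative; little-endian variants are used to omit the byte-order mark):

```python
# Bytes needed per character under each Unicode encoding scheme.
for ch in "A中😀":  # ASCII letter, CJK character, emoji
    print(ch,
          len(ch.encode("utf-8")),      # 1-4 bytes, ASCII stays 1 byte
          len(ch.encode("utf-16-le")),  # 2 bytes, or 4 (surrogate pair)
          len(ch.encode("utf-32-le")))  # always 4 bytes
```

UTF-32 spends 4 bytes even on "A", which is why it is simple but wasteful; UTF-8 matches ASCII byte-for-byte, which is a large part of why it won on the web.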
In UTF‑8, the leading bits of the first byte indicate the sequence length: a 1‑byte character starts with 0, an n‑byte sequence starts with n ones followed by a zero (110xxxxx, 1110xxxx, 11110xxx), and every continuation byte starts with 10.
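These bit patterns can be implemented directly. A sketch of a UTF‑8 encoder for a single code point (ignoring surrogate-range validation for brevity), checked against Python's built-in encoder:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the UTF-8 bit patterns:
    1 byte : 0xxxxxxx                            (U+0000..U+007F)
    2 bytes: 110xxxxx 10xxxxxx                   (U+0080..U+07FF)
    3 bytes: 1110xxxx 10xxxxxx 10xxxxxx          (U+0800..U+FFFF)
    4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (U+10000..U+10FFFF)
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:
        return bytes([0xE0 | cp >> 12,
                      0x80 | (cp >> 6) & 0x3F,
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,
                  0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

assert utf8_encode(ord("A")) == "A".encode("utf-8")       # 1 byte
assert utf8_encode(ord("中")) == "中".encode("utf-8")      # 3 bytes
assert utf8_encode(0x1F600) == "😀".encode("utf-8")       # 4 bytes
```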
When text encoded in one scheme (e.g., GBK) is opened with another (e.g., UTF‑8), garbled characters appear, illustrating the importance of matching encoding and decoding methods.
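Both failure modes are easy to reproduce in Python, including the "锟斤拷" symbols mentioned earlier, which come from U+FFFD replacement characters (EF BF BD in UTF‑8) being re-read as GBK byte pairs:

```python
# GBK bytes read as UTF-8: the byte sequences are invalid, so the decoder
# substitutes U+FFFD replacement characters.
data = "中文".encode("gbk")                 # D6 D0 CE C4
print(data.decode("utf-8", errors="replace"))  # mojibake: ����

# The classic "锟斤拷": two U+FFFD characters encoded in UTF-8 give the
# bytes EF BF BD EF BF BD, which decoded as GBK pair up into three
# Chinese characters.
garbled = ("\ufffd" * 2).encode("utf-8")
print(garbled.decode("gbk"))                # → 锟斤拷
```

The fix is always the same: decode bytes with the same encoding that produced them.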
360 Tech Engineering