Understanding Unicode Encoding and Implementing Emoji Detection in Java
This article explains Unicode's structure, encoding ranges, UTF-8/16/32 representations, byte order considerations, and provides Java code to detect emojis in strings, illustrating practical usage of Unicode concepts for text processing.
Unicode is an industry standard, kept in sync with ISO/IEC 10646 (the Universal Coded Character Set), that defines a character set and its encoding schemes. It assigns a unique number, called a code point, to every character across languages, which enables cross‑language and cross‑platform text handling. Work on it began in the late 1980s, and version 1.0 of the standard was published in 1991.
The code space spans 0x0000‑0x10FFFF and is divided into 17 planes of 65,536 code points each: Plane 0 (Basic Multilingual Plane) contains common letters, digits, and frequently used characters; Plane 1 (Supplementary Multilingual Plane) includes additional scripts, symbols, and most emojis; Plane 2 (Supplementary Ideographic Plane) holds CJK extension ideographs for Chinese, Japanese, and Korean; Plane 14 (Supplementary Special‑purpose Plane) is for special use; Planes 15‑16 are private‑use areas; the remaining planes are largely unassigned.
Within Plane 0, the Private Use Area (0xE000‑0xF8FF) provides 6,400 code points for custom encoding, while the Surrogate Area (0xD800‑0xDFFF) reserves 2,048 code points for UTF‑16 encoding.
Examples: the Chinese character ‘润’ is at 0x6DA6, and the emoji ‘😆’ is at 0x1F606.
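As a quick check of the two examples above, the following sketch reads each character's code point in Java and derives its plane by dividing by 0x10000. The class name `CodePointDemo` and the helper `planeOf` are illustrative names, not part of any library.

```java
class CodePointDemo {
    // The plane index is the code point divided by 0x10000,
    // i.e. whatever remains after dropping the low 16 bits.
    static int planeOf(int codePoint) {
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        String run = "\u6DA6";          // the Chinese character at 0x6DA6
        String grin = "\uD83D\uDE06";   // the emoji at 0x1F606, written as two UTF-16 units

        int cp1 = run.codePointAt(0);
        int cp2 = grin.codePointAt(0);

        System.out.printf("U+%04X plane %d%n", cp1, planeOf(cp1)); // U+6DA6 plane 0
        System.out.printf("U+%04X plane %d%n", cp2, planeOf(cp2)); // U+1F606 plane 1
    }
}
```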
Unicode Transformation Format (UTF) defines three concrete encodings: UTF‑8, UTF‑16, and UTF‑32.
UTF‑8
UTF‑8 encodes each code point using 1 to 4 bytes. The following table shows the mapping from Unicode ranges to byte patterns:
| Unicode code point (hex) | UTF‑8 byte pattern (binary) |
| --- | --- |
| 000000‑00007F | 0xxxxxxx |
| 000080‑0007FF | 110xxxxx 10xxxxxx |
| 000800‑00FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 010000‑10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
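The four byte patterns above can be turned into code almost mechanically. The sketch below is a hand-written encoder for a single code point, intended only to illustrate the bit layout; in practice `String.getBytes(StandardCharsets.UTF_8)` does this work. The class name `Utf8Sketch` is an illustrative choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one code point by filling the x bits of the matching pattern.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };                      // 0xxxxxxx
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                        // 110xxxxx
                (byte) (0x80 | (cp & 0x3F)) };                    // 10xxxxxx
        } else if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                       // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),                       // 11110xxx
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        // Cross-check the sketch against the JDK's own encoder.
        byte[] mine = encode(0x1F606);
        byte[] jdk = new String(Character.toChars(0x1F606)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(mine, jdk)); // true
    }
}
```

For 0x1F606 this yields the bytes F0 9F 98 86, and for 0x6DA6 the bytes E6 B6 A6.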
UTF‑16
UTF‑16 uses 2‑byte units; characters below 0x10000 are stored directly, while those from 0x10000 to 0x10FFFF are represented with a surrogate pair (high surrogate 0xD800‑0xDBFF, low surrogate 0xDC00‑0xDFFF). The conversion algorithm is:
1. Subtract 0x10000 from the code point (U' = U‑0x10000).
2. Split the 20‑bit result into high ten bits (yyyy yyyy yy) and low ten bits (xx xxxx xxxx).
3. Form the surrogate pair: high = 0xD800 + high ten bits, low = 0xDC00 + low ten bits.
Example: code point 0x1F606 becomes the surrogate pair D83D DE06.
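The conversion steps above can be sketched directly in Java and compared with the JDK's `Character.highSurrogate` and `Character.lowSurrogate`. The class name `SurrogateSketch` and the method `toSurrogates` are illustrative names.

```java
class SurrogateSketch {
    // Follows the conversion steps: subtract 0x10000, split into two 10-bit halves,
    // then add the halves to the surrogate base values.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("only supplementary code points need a pair");
        }
        int u = cp - 0x10000;                       // 20-bit value
        char high = (char) (0xD800 + (u >> 10));    // top ten bits
        char low = (char) (0xDC00 + (u & 0x3FF));   // bottom ten bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F606);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D83D DE06

        // The JDK exposes the same computation.
        System.out.println(pair[0] == Character.highSurrogate(0x1F606)); // true
        System.out.println(pair[1] == Character.lowSurrogate(0x1F606));  // true
    }
}
```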
UTF‑32
UTF‑32 stores each code point in a fixed 4‑byte unit, making conversion trivial but wasteful for most texts.
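To make the fixed-width layout concrete, here is a minimal sketch that writes each code point of a string as one big-endian 4-byte integer. It assumes well-formed input (no unpaired surrogates) and writes no BOM; `Utf32Sketch` and `toUtf32BE` are illustrative names.

```java
import java.nio.ByteBuffer;

class Utf32Sketch {
    // One 32-bit integer per code point; ByteBuffer is big-endian by default.
    static byte[] toUtf32BE(String s) {
        int[] codePoints = s.codePoints().toArray();
        ByteBuffer buf = ByteBuffer.allocate(4 * codePoints.length);
        for (int cp : codePoints) {
            buf.putInt(cp);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        // 'a', the character at 0x6DA6, and the emoji at 0x1F606: three code points.
        byte[] bytes = toUtf32BE("a\u6DA6\uD83D\uDE06");
        System.out.println(bytes.length); // 12
    }
}
```

Even the ASCII letter costs four bytes here, which is the waste the paragraph above refers to.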
UTF‑8 vs UTF‑16 Comparison
Saving a file with 1,000 common Chinese characters (range 0x4E00‑0x9FFF) as UTF‑8 yields about 3 KB (3 bytes per character), while UTF‑16 yields about 2 KB (2 bytes per character). For 1,000 ASCII letters or digits, UTF‑8 uses 1 KB and UTF‑16 still uses 2 KB, illustrating the space trade‑offs.
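The size figures above are easy to reproduce. The sketch below builds two 1,000-character strings and measures their encoded lengths; it uses `UTF_16BE` rather than `UTF_16` because the latter prepends a 2-byte BOM when encoding. `SizeComparison` and `repeat` are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class SizeComparison {
    // Builds a string consisting of `count` copies of one code point.
    static String repeat(int codePoint, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String han = repeat(0x6DA6, 1000);   // a common Chinese character
        String ascii = repeat('a', 1000);    // an ASCII letter

        System.out.println(han.getBytes(StandardCharsets.UTF_8).length);      // 3000
        System.out.println(han.getBytes(StandardCharsets.UTF_16BE).length);   // 2000
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 1000
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length); // 2000
    }
}
```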
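Because most frequently used characters fall within 0x0000‑0xFFFF, UTF‑16 is often a practical in‑memory default. Java is one example: its char type is a 16‑bit UTF‑16 code unit, so a character outside that range, such as an emoji, occupies two char values in a String.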
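One practical consequence in Java is that `String.length()` counts UTF-16 code units rather than characters. The following sketch shows the difference for a string containing an emoji; `CharVsCodePoint` and `codePointLength` are illustrative names.

```java
class CharVsCodePoint {
    // Counts code points instead of 16-bit char units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE06";                        // 'a' followed by the emoji at 0x1F606
        System.out.println(s.length());                    // 3 (one unit for 'a', two for the emoji)
        System.out.println(codePointLength(s));            // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```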
Byte Order and BOM
When saving as UTF‑16, the file may be UTF‑16LE (little‑endian) or UTF‑16BE (big‑endian). The following table lists the Byte Order Mark (BOM) for each encoding:
| Encoding | BOM (hex) |
| --- | --- |
| UTF‑8 without BOM | None |
| UTF‑8 with BOM | EF BB BF |
| UTF‑16LE | FF FE |
| UTF‑16BE | FE FF |
| UTF‑32LE | FF FE 00 00 |
| UTF‑32BE | 00 00 FE FF |
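Microsoft tools such as Windows Notepad have often prefixed UTF‑8 files with the BOM (EF BB BF) so that the encoding can be detected when the file is reopened. UTF‑8 has no byte‑order ambiguity, so the BOM is not required, and other platforms usually omit it.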
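A reader can use the BOM values listed above to guess a file's encoding from its first bytes. The sketch below is one possible way to do that, not a complete detector: it returns null when no BOM is present, and it checks the UTF-32 marks first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM. `BomSniffer` and `detect` are illustrative names.

```java
class BomSniffer {
    // Returns the encoding name implied by a leading BOM, or null if none is found.
    static String detect(byte[] data) {
        // Longest marks first: FF FE 00 00 would otherwise match FF FE.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        System.out.println(detect(utf8WithBom)); // UTF-8
    }
}
```

One caveat of this ordering: a UTF-16LE file whose first character is U+0000 is indistinguishable from UTF-32LE by BOM alone, so the result should be treated as a hint.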
Practical Example: Detecting Emoji in a String (Java)
Requirement: consider a string invalid for association if it contains spaces, any emoji, or exceeds ten characters.
Implementation:
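    public static boolean containsEmoji(String str) {
        int len = str.length();
        // Step by whole code points so that a surrogate pair is examined once,
        // instead of re-reading its low surrogate on the next iteration.
        for (int i = 0; i < len; ) {
            int codePoint = Character.codePointAt(str, i);
            if (isEmojiCharacterByWiki(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

Helper method that checks whether a code point falls within known emoji ranges (based on Wikipedia and Unicode blocks):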
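    private static boolean isEmojiCharacterByWiki(int codePoint) {
        // These are broad block-level ranges, so they also match some
        // non-emoji characters (for example arrows, math symbols, and kana).
        return (codePoint >= 0x2070 && codePoint <= 0x2BFF)       // superscripts through miscellaneous symbols and arrows
            || (codePoint >= 0x3000 && codePoint <= 0x30FF)       // CJK symbols and punctuation, hiragana, katakana
            || (codePoint >= 0x3200 && codePoint <= 0x32FF)       // enclosed CJK letters and months
            || (codePoint >= 0x1F000 && codePoint <= 0x1FA6F);    // supplementary-plane symbol and pictograph blocks
    }

References include Java language specifications, Oracle documentation on Unicode handling, and various Unicode block tables.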
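To tie the pieces back to the stated requirement (no spaces, no emoji, at most ten characters), here is one possible validator. It is a sketch under assumptions the article does not state: "characters" are counted as code points rather than char units, and "spaces" means anything `Character.isWhitespace` accepts. The class `AssociationValidator` and its methods are hypothetical names, and the ranges are copied from the helper above.

```java
class AssociationValidator {
    static final int MAX_CODE_POINTS = 10; // assumption: the limit counts code points

    static boolean isValid(String s) {
        if (s.codePointCount(0, s.length()) > MAX_CODE_POINTS) {
            return false;
        }
        // Reject the string if any code point is whitespace or in an emoji range.
        return s.codePoints()
                .noneMatch(cp -> Character.isWhitespace(cp) || inEmojiRanges(cp));
    }

    // Same broad ranges as the article's helper; they also match kana and some symbols.
    static boolean inEmojiRanges(int cp) {
        return (cp >= 0x2070 && cp <= 0x2BFF)
            || (cp >= 0x3000 && cp <= 0x30FF)
            || (cp >= 0x3200 && cp <= 0x32FF)
            || (cp >= 0x1F000 && cp <= 0x1FA6F);
    }

    public static void main(String[] args) {
        System.out.println(isValid("hello"));             // true
        System.out.println(isValid("hello world"));       // false (space, and 11 code points)
        System.out.println(isValid("hi\uD83D\uDE06"));    // false (emoji at 0x1F606)
    }
}
```

Counting code points means an emoji-free string of ten supplementary characters would still pass, whereas `String.length()` would report twenty; which behavior is wanted depends on how the limit is defined.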
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.