
Understanding Unicode Encoding and Implementing Emoji Detection in Java

This article explains Unicode's structure, encoding ranges, UTF-8/16/32 representations, byte order considerations, and provides Java code to detect emojis in strings, illustrating practical usage of Unicode concepts for text processing.

360 Tech Engineering

Unicode is an industry standard, kept in step with ISO/IEC 10646 (the Universal Coded Character Set), that defines a character set and its encoding schemes. It assigns a unique code point to every character across languages, which enables cross‑language and cross‑platform text handling. Version 1.0 of the standard was published in 1991.

The code space spans 0x0000‑0x10FFFF and is divided into 17 planes of 65,536 code points each: Plane 0 (Basic Multilingual Plane, BMP) contains common letters, digits, and frequently used characters; Plane 1 (Supplementary Multilingual Plane) includes additional scripts, symbols, and most emojis; Plane 2 (Supplementary Ideographic Plane) is dedicated to Chinese‑Japanese‑Korean extension characters; Plane 14 (Supplementary Special‑purpose Plane) is for special use; Planes 15‑16 are private‑use areas; the remaining planes are largely unassigned.

Within Plane 0, the Private Use Area (0xE000‑0xF8FF) provides 6,400 code points for custom encoding, while the Surrogate Area (0xD800‑0xDFFF) reserves 2,048 code points for UTF‑16 encoding.

Examples: the Chinese character ‘润’ is at 0x6DA6, and the emoji ‘😆’ is at 0x1F606.

Unicode Transformation Format (UTF) defines three concrete encodings: UTF‑8, UTF‑16, and UTF‑32.

UTF‑8

UTF‑8 encodes each code point using 1 to 4 bytes. The following table shows the mapping from Unicode ranges to byte patterns:

| Unicode code point (hex) | UTF‑8 byte pattern (binary) |
| --- | --- |
| 000000‑00007F | 0xxxxxxx |
| 000080‑0007FF | 110xxxxx 10xxxxxx |
| 000800‑00FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 010000‑10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
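The byte patterns in the table can be applied by hand with a few shifts and masks. The following is a minimal sketch, not a replacement for the JDK encoder; the class name `Utf8Sketch` and the method `encode` are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    /** Encodes one code point into UTF-8 bytes, one branch per row of the table. */
    static byte[] encode(int cp) {
        // Surrogates and values above 0x10FFFF cannot be encoded in UTF-8.
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 0x6DA6, 0x1F606 };
        for (int cp : samples) {
            byte[] mine = encode(cp);
            // Cross-check against the JDK's own UTF-8 encoder.
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```

For the two example characters mentioned earlier, 0x6DA6 falls in the three-byte row and 0x1F606 in the four-byte row.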

UTF‑16

UTF‑16 uses 2‑byte units; characters below 0x10000 are stored directly, while those from 0x10000 to 0x10FFFF are represented with a surrogate pair (high surrogate 0xD800‑0xDBFF, low surrogate 0xDC00‑0xDFFF). The conversion algorithm is:

Subtract 0x10000 from the code point (U' = U‑0x10000).

Split the 20‑bit result into high ten bits (yyyy yyyy yy) and low ten bits (xx xxxx xxxx).

Form the surrogate pair: high = 0xD800 + high ten bits, low = 0xDC00 + low ten bits.

Example: code point 0x1F606 becomes the surrogate pair D83D DE06.
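The three conversion steps translate almost directly into code. This is a small sketch for illustration; `SurrogateSketch`, `toSurrogates`, and `fromSurrogates` are hypothetical names, and in real code the JDK's `Character.toChars` and `Character.toCodePoint` already do this work.

```java
import java.util.Arrays;

class SurrogateSketch {
    /** Splits a supplementary code point (0x10000..0x10FFFF) into a surrogate pair. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int u = codePoint - 0x10000;                  // step 1: 20-bit value
        char high = (char) (0xD800 + (u >> 10));      // steps 2-3: high ten bits
        char low = (char) (0xDC00 + (u & 0x3FF));     // steps 2-3: low ten bits
        return new char[] { high, low };
    }

    /** Inverse operation: rebuilds the code point from a surrogate pair. */
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F606);
        // Cast to int because %X does not accept char arguments.
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // prints "D83D DE06"
        // The JDK produces the same pair.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F606)));
    }
}
```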

UTF‑32

UTF‑32 stores each code point in a fixed 4‑byte unit, making conversion trivial but wasteful for most texts.

UTF‑8 vs UTF‑16 Comparison

Saving a file with 1,000 common Chinese characters (range 0x4E00‑0x9FFF) as UTF‑8 yields about 3 KB (3 bytes per character), while UTF‑16 yields about 2 KB (2 bytes per character). For 1,000 ASCII letters or digits, UTF‑8 uses 1 KB and UTF‑16 still uses 2 KB, illustrating the space trade‑offs.
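These figures are easy to reproduce. The sketch below measures encoded sizes with the JDK's standard charsets; the class and method names are hypothetical, and `UTF_16BE` is used on the assumption that the BOM should not be counted (the plain `UTF_16` charset prepends one when encoding).

```java
import java.nio.charset.StandardCharsets;

class EncodingSizeSketch {
    /** Repeats a string n times (String.repeat is avoided so this runs on Java 8). */
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder(s.length() * n);
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Size(String s) {
        // UTF_16BE writes no byte order mark, so only the text itself is counted.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String chinese = repeat("\u6DA6", 1000); // 1,000 copies of a character in 0x4E00-0x9FFF
        String ascii = repeat("a", 1000);
        System.out.println(utf8Size(chinese));   // 3000
        System.out.println(utf16Size(chinese));  // 2000
        System.out.println(utf8Size(ascii));     // 1000
        System.out.println(utf16Size(ascii));    // 2000
    }
}
```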

Because most frequently used characters fall within 0x0000‑0xFFFF, UTF‑16 is often a practical default. Java's char type, for example, is a 16‑bit UTF‑16 code unit, so a character above 0xFFFF occupies two chars in a String.
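One practical consequence of the 16‑bit char is that `String.length()` counts UTF‑16 code units rather than characters. A short sketch of the difference follows (the class and method names are hypothetical):

```java
class CodeUnitSketch {
    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE06"; // the letter 'a' followed by U+1F606
        System.out.println(codeUnits(s));   // 3
        System.out.println(codePoints(s));  // 2
        // The char at index 1 is only the first half of the emoji.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```

This is why code that walks a String one char at a time needs care whenever the text may contain characters outside Plane 0.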

Byte Order and BOM

When saving as UTF‑16, the file may be UTF‑16LE (little‑endian) or UTF‑16BE (big‑endian). The following table lists the Byte Order Mark (BOM) for each encoding:

| Encoding | BOM (hex) |
| --- | --- |
| UTF‑8 without BOM | None |
| UTF‑8 with BOM | EF BB BF |
| UTF‑16LE | FF FE |
| UTF‑16BE | FE FF |
| UTF‑32LE | FF FE 00 00 |
| UTF‑32BE | 00 00 FE FF |
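A reader of such files can compare the leading bytes against the BOM values listed above. The following is a rough sketch under stated assumptions: `BomSketch` and `detect` are hypothetical names, and the returned labels are plain strings rather than JDK charset objects. The UTF‑32 checks come first because the UTF‑32LE BOM begins with the same two bytes as the UTF‑16LE BOM.

```java
class BomSketch {
    /**
     * Returns a label for the encoding indicated by a leading BOM,
     * or null when the data does not start with a known BOM.
     */
    static String detect(byte[] data) {
        // Longer BOMs first: FF FE 00 00 would otherwise be taken for UTF-16LE.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        System.out.println(detect(utf8WithBom));                 // UTF-8
        System.out.println(detect(new byte[] { 'h', 'i' }));     // null
    }
}
```

One known ambiguity remains: a UTF‑16LE file whose first character after the BOM is U+0000 would be reported as UTF‑32LE by this sketch.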

Microsoft often prefixes UTF‑8 files with the BOM (EF BB BF) to help Windows Notepad detect the encoding, though this is not required on other platforms.

Practical Example: Detecting Emoji in a String (Java)

Requirement: consider a string invalid for association if it contains spaces, any emoji, or exceeds ten characters.

Implementation:

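public static boolean containsEmoji(String str) {
    int len = str.length();
    // Step by code point rather than by char: a supplementary
    // character (such as most emojis) occupies two chars.
    for (int i = 0; i < len; ) {
        int codePoint = Character.codePointAt(str, i);
        if (isEmojiCharacterByWiki(codePoint)) {
            return true;
        }
        i += Character.charCount(codePoint);
    }
    return false;
}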

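Helper method that checks whether a code point falls within known emoji ranges (based on Wikipedia and Unicode blocks). The ranges are deliberately coarse: they cover whole blocks, so some non‑emoji characters also match (for example, 0x3000‑0x30FF includes hiragana and katakana).

private static boolean isEmojiCharacterByWiki(int codePoint) {
    // Coarse block ranges; each also contains characters that are not emojis.
    return ((codePoint >= 0x2070) && (codePoint <= 0x2BFF)) ||   // symbols, arrows, dingbats
           ((codePoint >= 0x3000) && (codePoint <= 0x30FF)) ||   // CJK symbols, hiragana, katakana
           ((codePoint >= 0x3200) && (codePoint <= 0x32FF)) ||   // enclosed CJK letters
           ((codePoint >= 0x1F000) && (codePoint <= 0x1FA6F));   // Plane 1 pictographs
}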
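The full requirement (no spaces, no emoji, at most ten characters) can be put together along the following lines. This is only a sketch under several assumptions that the original requirement does not spell out: the names `KeywordCheckSketch`, `isValid`, and `isEmojiRange` are hypothetical, "spaces" is read as any character for which `Character.isWhitespace` returns true, null is treated as invalid, and the length limit is counted in code points rather than chars. The emoji ranges are the same coarse ones used above.

```java
class KeywordCheckSketch {
    /** Same coarse emoji ranges as the helper in the article. */
    static boolean isEmojiRange(int cp) {
        return (cp >= 0x2070 && cp <= 0x2BFF)
            || (cp >= 0x3000 && cp <= 0x30FF)
            || (cp >= 0x3200 && cp <= 0x32FF)
            || (cp >= 0x1F000 && cp <= 0x1FA6F);
    }

    /** Returns true when the string has no whitespace, no emoji, and at most ten code points. */
    static boolean isValid(String s) {
        if (s == null) {
            return false; // assumption: null is treated as invalid
        }
        // Assumption: the limit applies to code points, so an emoji would count once.
        if (s.codePointCount(0, s.length()) > 10) {
            return false;
        }
        // codePoints() iterates by code point, so surrogate pairs are handled correctly.
        return s.codePoints()
                .noneMatch(cp -> Character.isWhitespace(cp) || isEmojiRange(cp));
    }

    public static void main(String[] args) {
        System.out.println(isValid("hello"));             // true
        System.out.println(isValid("hi there"));          // false (contains a space)
        System.out.println(isValid("ok\uD83D\uDE06"));    // false (contains U+1F606)
        System.out.println(isValid("abcdefghijk"));       // false (11 characters)
    }
}
```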

References include Java language specifications, Oracle documentation on Unicode handling, and various Unicode block tables.
