How to Accurately Substring Chinese Text in PHP Across GBK and UTF-8
This guide explains how to correctly truncate strings containing Chinese characters in PHP by detecting character byte length for GBK and UTF‑8 encodings, and provides a reusable my_substr function with example usage.
Key points to know:
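In GBK encoding, a Chinese character occupies 2 bytes; in UTF‑8, common Chinese characters occupy 3 bytes (rare ones outside the Basic Multilingual Plane occupy 4).
The ord() function returns the value (0 to 255) of the first byte of a string; for ASCII characters this is the ASCII code.
The first byte of a Chinese character is greater than 0xA0 in GB2312 and in UTF‑8. GBK's extended range also uses lead bytes from 0x81 to 0xA0, which this check does not catch.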
The essential technique is to walk the string byte by byte and decide whether the character at the current byte offset $pos is Chinese or English by checking whether ord(substr($str, $pos, 1)) > 0xA0. If true, it is a Chinese character and occupies several bytes; otherwise, it is a single-byte English character.
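As a minimal sketch of that check in isolation: the helper name char_width is hypothetical and not part of the original function, and the test strings are written as byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper: byte width of the character that starts at byte
// offset $pos, using the same "> 0xA0" test described above.
function char_width(string $str, int $pos, int $bite = 2): int {
    return ord($str[$pos]) > 0xA0 ? $bite : 1;
}

$gbk  = "a\xD6\xD0";      // "a中" encoded as GBK (中 is 2 bytes)
$utf8 = "a\xE4\xB8\xAD";  // "a中" encoded as UTF-8 (中 is 3 bytes)

echo char_width($gbk, 0), "\n";     // 1: 'a' is a single byte
echo char_width($gbk, 1), "\n";     // 2: lead byte 0xD6 > 0xA0
echo char_width($utf8, 1, 3), "\n"; // 3: lead byte 0xE4 > 0xA0
```

The check is only meaningful when $pos sits on the first byte of a character, which is why the caller has to advance by the full width each time.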
The following PHP function my_substr implements this logic, allowing you to specify the start position, length, and byte size (2 for GBK, 3 for UTF‑8):
<code><?php
/**
 * @param string $str    The string to be truncated.
 * @param int    $start  Starting position in characters, 0 for the first character.
 * @param mixed  $length Number of characters to extract; if omitted, extract to the end.
 * @param int    $bite   Byte length of a Chinese character: 2 (default) for GBK, 3 for UTF‑8.
 */
function my_substr($str, $start, $length = "", $bite = 2) {
    $pos = 0; // byte position in the string
    // Calculate the byte offset of the start position
    for ($i = 0; $i < $start; $i++) {
        // Test the byte at $pos, not at the character index $i
        if (ord(substr($str, $pos, 1)) > 0xA0) {
            $pos += $bite; // Chinese character
        } else {
            $pos += 1; // English character
        }
    }
    // Strict comparison, so that a length of 0 is not mistaken for "omitted"
    if ($length === "" || $length === null) {
        return substr($str, $pos); // to the end
    }
    if ($length < 0) {
        $length = 0;
    }
    $string = "";
    for ($i = 1; $i <= $length; $i++) {
        if (ord(substr($str, $pos, 1)) > 0xA0) {
            $string .= substr($str, $pos, $bite);
            $pos += $bite;
        } else {
            $string .= substr($str, $pos, 1);
            $pos += 1;
        }
    }
    return $string;
}
// With the default $bite of 2, this script file must be saved as GBK.
$str = "a这是一段中文";
echo my_substr($str, 0); // outputs the whole string
echo "\n";
echo my_substr($str, 0, 1); // outputs 'a'
echo "\n";
echo my_substr($str, 1, 2); // outputs '这是'
?>
</code>
Pass 3 as the $bite argument when working with UTF‑8 encoded strings. This assumes every non-ASCII character is 3 bytes wide, which holds for common Chinese characters.
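UTF‑8 characters are not all 3 bytes wide: accented Latin letters take 2 bytes, and emoji and rare ideographs take 4, so a fixed $bite can cut a character in half. The sketch below is one possible variation, not the article's function: it reads the width from the lead byte instead. The names utf8_char_width and utf8_substr are hypothetical, and the input is assumed to be valid UTF‑8.

```php
<?php
// Hypothetical helper: byte width of the UTF-8 character whose lead byte
// is at offset $pos, derived from the lead byte's high bits.
function utf8_char_width(string $str, int $pos): int {
    $byte = ord($str[$pos]);
    if ($byte < 0x80) return 1;            // 0xxxxxxx: ASCII
    if (($byte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 3; // 1110xxxx: most Chinese characters
    if (($byte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 1; // stray continuation or invalid byte: step over it
}

// Hypothetical character-based substring built on the helper above.
function utf8_substr(string $str, int $start, ?int $length = null): string {
    $len = strlen($str);
    $pos = 0;
    // Skip $start characters to find the starting byte offset.
    for ($i = 0; $i < $start && $pos < $len; $i++) {
        $pos += utf8_char_width($str, $pos);
    }
    $begin = $pos;
    if ($length === null) {
        return (string) substr($str, $begin); // to the end
    }
    // Advance over $length characters, stopping at the end of the string.
    for ($i = 0; $i < $length && $pos < $len; $i++) {
        $pos += utf8_char_width($str, $pos);
    }
    return (string) substr($str, $begin, $pos - $begin);
}

$s = "a\xE8\xBF\x99\xE6\x98\xAF"; // "a这是" as UTF-8 bytes
echo utf8_substr($s, 1, 2), "\n"; // prints 这是 on a UTF-8 terminal
```

Where the mbstring extension is installed, the built-in mb_substr($str, $start, $length, 'UTF-8') covers the same need without custom code.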