How to Fix HTML Entity Bugs That Break Rich Text Rendering
This article explains why text written with HTML entities such as "&lt;" and "&gt;" can disappear in rich‑text fields, analyzes the underlying tokenizer state machine, and provides a lightweight hack that inserts empty comment nodes to preserve the original text without touching the legacy rendering logic.
Starting from a bug
One day a product team reported that a rich‑text field on a page displayed incorrectly. The backend stored the string a<b<c (with each "<" encoded as &lt;), but the page only showed a. The issue stemmed from the rich‑text renderer decoding HTML entities, turning the sequence into a malformed tag that the browser then omitted.
HTML input → Entity decode → dangerouslySetInnerHTML → DOM → UI
When the input is <p>a&lt;b&lt;c</p>, decoding yields <p>a<b<c</p>. The fragment <b<c is interpreted as a tag, leaving only a in the DOM.
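To make the failure concrete, here is a minimal sketch of the decode step. The legacy decoder itself is not shown in this article, so `decodeEntities` is a hypothetical stand-in that handles only three named entities.

```javascript
// Hypothetical stand-in for the legacy "Entity decode" step.
// It only handles &lt;, &gt; and &amp;; a real decoder covers far more.
function decodeEntities(html) {
  return html
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&'); // decoded last so "&amp;lt;" is not decoded twice
}

const stored = '<p>a&lt;b&lt;c</p>';
const decoded = decodeEntities(stored);
console.log(decoded); // <p>a<b<c</p>
```

The decoded string is what reaches dangerouslySetInnerHTML, and by then nothing distinguishes the text "<" from real markup.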
The most straightforward fix would be to remove the Entity decode step, but that logic is deeply embedded in legacy code and referenced throughout the project, making a direct change risky.
Instead, a safer approach is to add a compensating mechanism that neutralises the effect of Entity decode for the problematic characters, limiting the change to the affected module.
Theoretical Learning
The HTML parser builds a DOM tree by tokenising the input stream. According to the WHATWG specification, the tokenizer is a finite‑state machine. When it encounters a "<" character, it enters the tag open state, and the next character determines the parsing path:
1. "!" – enters the markup declaration open state (e.g., a comment such as <!-- --> or a DOCTYPE).
2. "/" – enters the end tag open state (e.g., </p>).
3. ASCII letters – enters the tag name state, treating subsequent characters as a tag name.
4. "?" – reports a parse error, creates a comment token, and continues in the bogus comment state.
5. EOF – outputs the "<" character.
6. Other characters – outputs "<" as text and reports an invalid first character of a tag name.
As long as we ensure that after a "<" the next character is not one of the first four cases, the tokenizer will not switch to tag parsing, and the original text will be displayed unchanged.
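The branching described above can be sketched as a small lookup. This is only an illustration: `tagOpenTransition` and `staysText` are hypothetical names, and the returned strings are labels for this example rather than any parser's API.

```javascript
// Illustrative sketch of the branch taken in the tokenizer's "tag open" state.
// `next` is the character right after "<"; pass undefined for end of input.
function tagOpenTransition(next) {
  if (next === undefined) return 'emit "<" (EOF)';
  if (next === '!') return 'markup declaration open';
  if (next === '/') return 'end tag open';
  if (/^[a-zA-Z]$/.test(next)) return 'tag name';
  if (next === '?') return 'bogus comment';
  return 'emit "<" (data)'; // any other character: "<" stays plain text
}

// True when the "<" survives as text, i.e. none of the first four cases applies.
function staysText(next) {
  return tagOpenTransition(next).startsWith('emit');
}

console.log(staysText('b')); // false
console.log(staysText('<')); // true
console.log(staysText(' ')); // true
```

Note that a second "<" counts as an "other character", which is why placing a comment directly after the first "<" keeps it as text.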
Hack Practice
To verify the idea quickly, we can use DevTools' "Edit as HTML" or an external parser like parse5 via the AST Explorer site to inspect the generated DOM tree.
<code><!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Test</title>
</head>
<body>
<h1>My first heading</h1>
<p>My first paragraph.</p>
</body>
</html></code>
Parsing <p>a<b<c</p> with parse5 shows a node named b<c< with a p attribute, matching the spec.
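The odd node name follows from the tag name state: once a letter follows "<", only whitespace, "/" or ">" ends the name, so a further "<" is simply appended to it. A minimal sketch of that rule, using a hypothetical `readTagName` helper that ignores NULL and EOF handling:

```javascript
// Illustrative only: mimics the "tag name" state for a start tag.
// `input` is the text right after a "<" that is followed by an ASCII letter.
function readTagName(input) {
  let name = '';
  for (const ch of input) {
    // Whitespace, "/" and ">" are the only characters that end a tag name.
    if (ch === '\t' || ch === '\n' || ch === '\f' || ch === ' ' || ch === '/' || ch === '>') {
      break;
    }
    // ASCII upper-case letters are lower-cased; everything else is kept as is.
    name += /[A-Z]/.test(ch) ? ch.toLowerCase() : ch;
  }
  return name;
}

console.log(readTagName('b<c</p>')); // b<c<
```

The name stops at the "/" of </p>, and the remaining p is then read as an attribute name, which is where the p attribute comes from.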
The hack inserts an empty comment <!-- --> between the "<" and any character that would trigger states 1‑4:
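<code>function replaceHTMLEntityTagStr(input = '') {
  // "<" may be encoded as a named (&lt;), hex (&#x3C;), or decimal (&#60;) entity.
  // If the next character would start tag parsing (!, /, ?, or an ASCII letter),
  // insert an empty comment between them so the decoded "<" stays plain text.
  return input.replace(
    /(&lt;|&#x3C;|&#60;)([!\/?a-zA-Z])/gi,
    (match, entity, next) => `${entity}<!-- -->${next}`
  );
}</code>
This transformation prevents the tokenizer from entering the tag‑parsing states while leaving the UI unchanged. After decoding, each "<" is followed by the "<" of the comment, which falls under the "other characters" case, and comment nodes are never rendered.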
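As a quick check, the function can be run against the string from the bug report. The function is reproduced here in a compact form only so that the snippet runs on its own. The sample inputs other than the bug report string are made up for illustration.

```javascript
// Same behaviour as the function in the article, repeated for a standalone run.
function replaceHTMLEntityTagStr(input = '') {
  return input.replace(
    /(&lt;|&#x3C;|&#60;)([!\/?a-zA-Z])/gi,
    (match, entity, next) => `${entity}<!-- -->${next}`
  );
}

// The string from the bug report: both encoded "<" are followed by a letter.
const patched = replaceHTMLEntityTagStr('<p>a&lt;b&lt;c</p>');
console.log(patched); // <p>a&lt;<!-- -->b&lt;<!-- -->c</p>

// An encoded "<" followed by a space is already safe, so it is left alone.
console.log(replaceHTMLEntityTagStr('1 &lt; 2')); // 1 &lt; 2

// Decimal and hex references are covered as well.
console.log(replaceHTMLEntityTagStr('&#60;/div')); // &#60;<!-- -->/div
```

Once the legacy decode step runs on the patched string, the markup becomes <p>a<<!-- -->b<<!-- -->c</p>, which renders as a<b<c.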
Hack Insights
Key takeaways:
Focus on the final UI outcome rather than the underlying DOM implementation.
Fundamental compiler‑theory concepts like tokenisers and state machines can solve practical front‑end bugs.
Adopt incremental, controllable fixes before attempting large‑scale refactors.
Compatibility: The hack works in modern WebKit/Chromium browsers, as well as IE 7+, iOS 11 Safari, Chromium 44, and Firefox 91.01.
Tencent IMWeb Frontend Team