How to Fix HTML Entity Bugs That Break Rich Text Rendering

This article explains why text stored as HTML entities such as "&lt;" and "&gt;" can disappear in rich‑text fields, analyzes the underlying tokenizer state machine, and provides a lightweight hack that inserts empty comment nodes to preserve the original text without touching the legacy rendering logic.

Tencent IMWeb Frontend Team

Starting from a bug

One day a product team reported that a rich‑text field on a page displayed incorrectly. The backend stored the string "a&lt;b&lt;c", but the page only showed "a". The issue stemmed from the rich‑text renderer decoding HTML entities: the decoded "<" characters turned the sequence into a malformed tag, which the browser then did not display.

HTML input → Entity decode → dangerouslySetInnerHTML → DOM → UI

When the input is "<p>a&lt;b&lt;c</p>", decoding yields "<p>a<b<c</p>". Everything from "<b" onward is then interpreted as a tag, so only "a" remains as visible text.

The most straightforward fix would be to remove the Entity decode step, but that logic is deeply embedded in legacy code and referenced throughout the project, making a direct change risky.

Instead, a safer approach is to add a compensating mechanism that neutralises the effect of Entity decode for the problematic characters, limiting the change to the affected module.

Theoretical Learning

The HTML parser builds a DOM tree by tokenising the input stream. According to the WHATWG specification, the tokenizer is a finite‑state machine. When it encounters a "<" character in ordinary text, it enters the tag open state, and the next character determines the parsing path:

1. "!": switches to the markup declaration open state (for example, a comment such as <!-- --> or a DOCTYPE such as <!DOCTYPE html>).

2. "/": switches to the end tag open state (for example, </p>).

3. ASCII letter: switches to the tag name state, treating the following characters as a tag name.

4. "?": creates a comment token and continues in the bogus comment state.

5. EOF: outputs the "<" as a character.

6. Any other character: outputs the "<" as a character and reports an invalid first character of a tag name.

As long as we ensure that after a "<" the next character is not one of the first four cases, the tokenizer will not switch to tag parsing, and the original text will be displayed unchanged.
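The decision described above can be sketched as a small lookup. This is only an illustration of the cases listed earlier, not code from the spec or from any library; `classifyAfterLessThan` and `lessThanStaysText` are hypothetical names, and the returned labels are made up for this example.

```javascript
// Sketch of the tag open decision: given the character that follows "<",
// return a label for the branch the tokenizer would take.
// Hypothetical helper, not a spec or library API.
function classifyAfterLessThan(next) {
  if (next === undefined || next === '') return 'text'; // case 5: EOF
  if (next === '!') return 'markup-declaration-open';    // case 1
  if (next === '/') return 'end-tag-open';               // case 2
  if (/^[A-Za-z]$/.test(next)) return 'tag-name';        // case 3
  if (next === '?') return 'bogus-comment';              // case 4
  return 'text';                                         // case 6: any other character
}

// True when the "<" at `index` would be emitted as plain text.
function lessThanStaysText(text, index) {
  return classifyAfterLessThan(text[index + 1]) === 'text';
}

for (const sample of ['a<b', 'a</b', 'a<!b', 'a<?b', 'a< b', '1<2', 'a<']) {
  console.log(JSON.stringify(sample), lessThanStaysText(sample, 1));
}
```

Only the last three samples keep their "<" as text, which is the property the fix relies on.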

Hack Practice

To verify the idea quickly, we can use DevTools' "Edit as HTML" or an external parser such as parse5 (via the AST Explorer site) to inspect the generated DOM tree.

<code><!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Test</title>
  </head>
  <body>
    <h1>My first heading</h1>
    <p>My first paragraph.</p>
  </body>
</html></code>

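Parsing "<p>a<b<c</p>" with parse5 shows a node named "b<c<" with a "p" attribute, matching the spec: once the tokenizer is in the tag name state, a further "<" is simply appended to the tag name, and the "/" that follows ends the name, so the "p" after it is read as an attribute.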
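To make that result easier to follow, here is a heavily simplified sketch of the few tokenizer states involved. It is not parse5 and not a conforming tokenizer: `tokenizeSketch` is a hypothetical name, and comments, DOCTYPEs, attribute values, and character references are deliberately left out, so it is only suitable for tracing small inputs like the one above.

```javascript
// Simplified trace of a few WHATWG tokenizer states (hypothetical sketch).
// Unsupported constructs throw instead of being tokenized incorrectly.
function tokenizeSketch(html) {
  const tokens = [];
  let state = 'data';
  let tag = null;  // tag token under construction
  let attr = null; // attribute under construction
  const isLetter = (c) => /[A-Za-z]/.test(c);
  const isSpace = (c) => /[\t\n\f ]/.test(c);
  const unsupported = (c) => { throw new Error(`not modelled: "${c}" in ${state}`); };
  const emitTag = () => { tokens.push(tag); tag = null; state = 'data'; };
  const emitChar = (c) => {
    const last = tokens[tokens.length - 1];
    if (last && last.type === 'text') last.data += c;
    else tokens.push({ type: 'text', data: c });
  };
  const newTag = (type) => { tag = { type, name: '', attrs: [], selfClosing: false }; };
  const newAttr = () => { attr = { name: '', value: '' }; tag.attrs.push(attr); };

  for (let i = 0; i < html.length; i++) {
    const c = html[i];
    switch (state) {
      case 'data':
        if (c === '<') state = 'tagOpen'; else emitChar(c);
        break;
      case 'tagOpen':
        if (isLetter(c)) { newTag('startTag'); state = 'tagName'; i--; }
        else if (c === '/') state = 'endTagOpen';
        else if (c === '!' || c === '?') unsupported(c);
        else { emitChar('<'); state = 'data'; i--; } // "<" stays text
        break;
      case 'endTagOpen':
        if (isLetter(c)) { newTag('endTag'); state = 'tagName'; i--; }
        else unsupported(c);
        break;
      case 'tagName':
        if (isSpace(c)) state = 'beforeAttrName';
        else if (c === '/') state = 'selfClosingStartTag';
        else if (c === '>') emitTag();
        else tag.name += c.toLowerCase(); // note: "<" is appended too
        break;
      case 'selfClosingStartTag':
        if (c === '>') { tag.selfClosing = true; emitTag(); }
        else { state = 'beforeAttrName'; i--; }
        break;
      case 'beforeAttrName':
        if (isSpace(c)) break;
        if (c === '/' || c === '>') { state = 'afterAttrName'; i--; }
        else { newAttr(); state = 'attrName'; i--; }
        break;
      case 'attrName':
        if (isSpace(c) || c === '/' || c === '>') { state = 'afterAttrName'; i--; }
        else if (c === '=') unsupported(c);
        else attr.name += c.toLowerCase();
        break;
      case 'afterAttrName':
        if (isSpace(c)) break;
        if (c === '/') state = 'selfClosingStartTag';
        else if (c === '>') emitTag();
        else if (c === '=') unsupported(c);
        else { newAttr(); state = 'attrName'; i--; }
        break;
    }
  }
  if (state === 'tagOpen') emitChar('<'); // EOF right after "<"
  return tokens; // a tag still open at EOF is dropped
}

console.log(JSON.stringify(tokenizeSketch('<p>a<b<c</p>'), null, 2));
```

For "<p>a<b<c</p>" the sketch produces a start tag "p", the text "a", and a start tag named "b<c<" carrying a "p" attribute. For the bare string "a<b<c" the unfinished tag is dropped at the end of input, leaving only the text "a".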

The hack inserts an empty comment, <!-- -->, between the encoded "<" and any following character that would trigger cases 1‑4:

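<code>function replaceHTMLEntityTagStr(input = '') {
  // "<" may arrive as a named (&lt;), hex (&#x3C;), or decimal (&#60;) entity.
  // If the next character would start a tag, an end tag, or a comment,
  // insert an empty comment so that the decoded "<" stays plain text.
  return input.replace(
    /(&lt;|&#x3C;|&#60;)(!|\/|\?|[a-zA-Z])/gi,
    (match, entity, next) => `${entity}<!-- -->${next}`
  );
}</code>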

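This transformation prevents the tokenizer from entering the tag‑parsing states while leaving the UI unchanged. After decoding, each protected "<" is followed by the "<" of the inserted comment, which falls under the "any other character" case, so the first "<" is output as text. The comment itself renders nothing.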
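A quick way to see the effect is to run the stored string from the bug report through the fix and then through a decode step. In the sketch below the replacement function is repeated so the snippet runs on its own, and `decodeLessThan` is a hypothetical stand-in for the legacy Entity decode logic that only handles the three spellings of "<" covered by the regex.

```javascript
// Copy of the article's function so that this snippet is self-contained.
function replaceHTMLEntityTagStr(input = '') {
  return input.replace(
    /(&lt;|&#x3C;|&#60;)(!|\/|\?|[a-zA-Z])/gi,
    (match, entity, next) => `${entity}<!-- -->${next}`
  );
}

// Hypothetical stand-in for the legacy "Entity decode" step.
// The real decoder handles far more entities than this.
function decodeLessThan(html) {
  return html.replace(/&lt;|&#x3C;|&#60;/gi, '<');
}

const stored = '<p>a&lt;b&lt;c</p>';
const patched = replaceHTMLEntityTagStr(stored);
const decoded = decodeLessThan(patched);

console.log(patched); // <p>a&lt;<!-- -->b&lt;<!-- -->c</p>
console.log(decoded); // <p>a<<!-- -->b<<!-- -->c</p>
```

In the decoded markup every "<" that came from an entity is immediately followed by another "<", so it is kept as text, and the paragraph reads "a<b<c" once the comment nodes are ignored.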

Hack Insights

Key takeaways:

Focus on the final UI outcome rather than the underlying DOM implementation.

Fundamental compiler‑theory concepts like tokenisers and state machines can solve practical front‑end bugs.

Adopt incremental, controllable fixes before attempting large‑scale refactors.

Compatibility: The hack works in modern WebKit/Chromium browsers, as well as IE 7+, iOS 11 Safari, Chromium 44, and Firefox 91.01.
