How to Fix HTML Entity Bugs That Break Rich Text Rendering

This article explains why text containing HTML entities such as "&lt;" and "&gt;" can disappear in rich‑text fields, analyzes the underlying tokenizer state machine, and provides a lightweight hack that inserts empty comment nodes to preserve the original text without breaking legacy rendering logic.

Tencent IMWeb Frontend Team

Aug 26, 2021

Starting from a bug

One day a product team reported that a rich‑text field on a page displayed incorrectly. The backend stored the string a&lt;b&lt;c, but the page only showed a. The issue stemmed from the rich‑text renderer decoding HTML entities before rendering, which turned the sequence into a malformed tag that the browser then omitted from the visible text.

HTML input → Entity decode → dangerouslySetInnerHTML → DOM → UI

When the input is a&lt;b&lt;c, decoding yields a<b<c. The fragment <b<c is then interpreted as the start of a tag, leaving only a as visible text in the DOM.
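
The decode step can be reproduced in isolation. The sketch below uses a hypothetical decodeLessThan helper as a stand-in for the legacy Entity decode step; the real decoder is not shown in this article and presumably handles many more entities. It only illustrates that the stored string is harmless until it has been decoded.

```javascript
// Hypothetical stand-in for the legacy "Entity decode" step.
// Assumption: the real decoder handles many more entities; only "<" matters here.
function decodeLessThan(input = '') {
  return input.replace(/&lt;|&#x3C;|&#60;/gi, '<');
}

const stored = 'a&lt;b&lt;c';           // what the backend stores
const decoded = decodeLessThan(stored); // what reaches dangerouslySetInnerHTML

console.log(decoded); // a<b<c
```

Once the decoded string is handed to the HTML parser, everything from the first "<" onward is read as markup rather than text.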

The most straightforward fix would be to remove the Entity decode step, but that logic is deeply embedded in legacy code and referenced throughout the project, making a direct change risky.

Instead, a safer approach is to add a compensating mechanism that neutralises the effect of Entity decode for the problematic characters, limiting the change to the affected module.

Theoretical Learning

The HTML parser builds a DOM tree by tokenising the input stream. According to the WHATWG specification, the tokenizer maintains a finite‑state machine. When it encounters a "<" character, it enters the Tag open state, and the next character determines the parsing path:

1. "!": enters the markup declaration open state (comments, DOCTYPE declarations, CDATA sections).
2. "/": enters the end tag open state (closing tags).
3. ASCII letters: enters the tag name state, treating subsequent characters as a tag name.
4. "?": creates a bogus comment node and re‑parses.
5. EOF: outputs the "<" character.
6. Other characters: outputs "<" and signals an invalid first character of a tag name.

As long as we ensure that after a "<" the next character is not one of the first four cases, the tokenizer will not switch to tag parsing, and the original text will be displayed unchanged.
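
The branching described above can be sketched as a small lookup. This is a simplified illustration of the tag open state only, not a tokenizer; the function names and the returned labels are made up for this example and are not part of any parser's API.

```javascript
// Simplified sketch of the decision made in the tag open state.
// `next` is the single character following "<", or undefined at end of input.
// The returned labels are illustrative strings, not an API of any parser.
function afterLessThan(next) {
  if (next === undefined) return 'emit "<" (EOF)';
  if (next === '!') return 'markup declaration open state';
  if (next === '/') return 'end tag open state';
  if (/[a-zA-Z]/.test(next)) return 'tag name state';
  if (next === '?') return 'bogus comment state';
  return 'emit "<" as text';
}

// True when a "<" followed by `next` stays plain text.
function staysText(next) {
  return afterLessThan(next).startsWith('emit');
}

console.log(staysText('b')); // false
console.log(staysText('<')); // true
console.log(staysText(' ')); // true
console.log(staysText('1')); // true
```

Only the four tag-related branches make the "<" disappear from the visible text, which is why the fix only has to target those characters.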

Hack Practice

To verify the idea quickly, we can use DevTools' "Edit as HTML" or an external parser like parse5 via the AST Explorer site to inspect the generated DOM tree.

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Test</title>
  </head>
  <body>
    <h1>My first heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>

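Parsing a<b<c inside a paragraph with parse5 shows an element named b<c< with a p attribute, which matches the spec: once the tokenizer is in the tag name state, a further "<" is just another name character, the "/" of the closing </p> ends the name, and the trailing p is read as an attribute name.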

The hack inserts an empty comment <!-- --> between the encoded "<" and any following character that would trigger cases 1‑4:

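function replaceHTMLEntityTagStr(input = '') {
  // Match an encoded "<" (named, hex, or decimal entity) followed by a character
  // that would start tag parsing after decoding: "!", "/", "?", or an ASCII letter.
  const pattern = /(&lt;|&#x3C;|&#60;)([!\/?a-zA-Z])/gi;
  // Put an empty comment between the two so the decoded "<" stays plain text.
  return input.replace(pattern, (match, entity, next) => `${entity}<!-- -->${next}`);
}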

This transformation prevents the tokenizer from entering the tag‑parsing states while leaving the UI unchanged.
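
A short usage example makes the effect visible. The snippet repeats the function from this article so that it runs on its own, and again uses a hypothetical decodeLessThan helper as a stand-in for the legacy decode step (an assumption, not the real decoder).

```javascript
// The article's function, repeated so the snippet runs on its own.
function replaceHTMLEntityTagStr(input = '') {
  return input.replace(/((&lt;)|(&#x3C;)|(&#60;))(!|\/|\?|[a-zA-Z])/gi,
    (all, p1, p2, p3, p4, p5) => `${p1}<!-- -->${p5}`);
}

// Hypothetical stand-in for the legacy decode step (assumption, not the real decoder).
function decodeLessThan(input = '') {
  return input.replace(/&lt;|&#x3C;|&#60;/gi, '<');
}

const patched = replaceHTMLEntityTagStr('a&lt;b&lt;c');
console.log(patched);                 // a&lt;<!-- -->b&lt;<!-- -->c
console.log(decodeLessThan(patched)); // a<<!-- -->b<<!-- -->c

// Entities not followed by "!", "/", "?" or a letter are left alone.
console.log(replaceHTMLEntityTagStr('1 &lt; 2')); // 1 &lt; 2
```

After decoding, each text "<" is followed by another "<", which falls into the "other characters" case and is emitted as text, while each <!-- --> becomes an empty comment node that renders nothing.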

Hack Insights

Key takeaways:

Focus on the final UI outcome rather than the underlying DOM implementation.

Fundamental compiler‑theory concepts like tokenisers and state machines can solve practical front‑end bugs.

Adopt incremental, controllable fixes before attempting large‑scale refactors.

Compatibility: The hack works in modern WebKit/Chromium browsers, as well as IE 7+, iOS 11 Safari, Chromium 44, and Firefox 91.01.
