How to Build a Simple HTML AST Parser in JavaScript
This article explains how to transform raw HTML strings into a structured abstract syntax tree (AST) using JavaScript regular expressions and a step‑by‑step parsing algorithm, covering tag, attribute, and text node handling, with a complete example implementation of a lightweight AST parser.
AST parsers are frequently used in frameworks like Vue.js, whose template compiler parses templates into an AST before generating the render functions that produce VNodes. The same idea can be applied to convert unstructured HTML strings into structured objects for analysis, processing, and rendering.
How to Parse into an AST?
HTML source code is just a text string, but browsers, Babel, or Vue need a structured representation to understand each node's purpose. Regular expressions are used to transform the string into a hierarchical data structure.
This article describes an AST parser implementation with all core details in under a hundred lines of code.
Goal
The goal is to convert the following HTML snippet into an AST:
<div class="classAttr" data-type="dataType" data-id="dataId" style="color:red">I am the outer div<span>I am the inner span</span></div>
The resulting AST (properties can be defined as needed) looks like:
{
  "node": "root",
  "child": [{
    "node": "element",
    "tag": "div",
    "class": "classAttr",
    "dataset": {"type": "dataType", "id": "dataId"},
    "attrs": [{"name": "style", "value": "color:red"}],
    "child": [{
      "node": "text",
      "text": "I am the outer div"
    }, {
      "node": "element",
      "tag": "span",
      "dataset": {},
      "attrs": [],
      "child": [{
        "node": "text",
        "text": "I am the inner span"
      }]
    }]
  }]
}
The root node contains a child array; each element node records its tag, class, data attributes, other attributes, and nested children.
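To make the node shape concrete, here is a minimal sketch of walking such a tree. The helper name collectTags and the shortened text values are assumptions made for illustration; only the node, tag, and child properties come from the structure shown above.

```javascript
// Hypothetical helper: collect the tag names of all element nodes, depth-first.
function collectTags(node, out = []) {
  if (node.node === 'element') out.push(node.tag);
  // Text nodes and childless elements have no "child" array.
  for (const child of node.child || []) collectTags(child, out);
  return out;
}

// A hand-written AST in the shape described above.
const sampleAst = {
  node: 'root',
  child: [{
    node: 'element',
    tag: 'div',
    class: 'classAttr',
    dataset: {type: 'dataType', id: 'dataId'},
    attrs: [{name: 'style', value: 'color:red'}],
    child: [
      {node: 'text', text: 'outer text'},
      {
        node: 'element',
        tag: 'span',
        dataset: {},
        attrs: [],
        child: [{node: 'text', text: 'inner text'}]
      }
    ]
  }]
};

console.log(collectTags(sampleAst)); // → [ 'div', 'span' ]
```

Because every node carries a node discriminator, the same recursive pattern can serve analysis, transformation, or rendering passes.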
Review of Regular Expressions
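^ matches the start of the input, or the start of each line when the m flag is set (e.g., /^a/ matches "ab" but not "ba").
$ matches the end of the input, or the end of each line when the m flag is set.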
* matches zero or more occurrences of the preceding token.
+ matches one or more occurrences.
[ab] matches either "a" or "b".
\w matches word characters (letters, digits, underscore).
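The tokens above can be checked quickly in Node.js or a browser console. This is only an illustrative sketch; the object name checks and the sample strings are made up for the demonstration.

```javascript
// Each entry exercises one of the tokens reviewed above.
const checks = {
  caretHit: /^a/.test('ab'),           // true: "a" is at the start
  caretMiss: /^a/.test('ba'),          // false: "a" is not at the start
  dollarHit: /b$/.test('ab'),          // true: "b" is at the end
  star: 'xaaa'.match(/xa*/)[0],        // "xaaa": zero or more "a"
  starZero: 'x'.match(/xa*/)[0],       // "x": zero occurrences still match
  plusMiss: /xa+/.test('x'),           // false: "+" needs at least one "a"
  charClass: 'cab'.match(/[ab]+/)[0],  // "ab": only "a" or "b" are accepted
  word: 'data-id'.match(/\w+/)[0]      // "data": the hyphen is not a word character
};

console.log(checks);
```

The last entry shows why patterns for tag and attribute names have to list the hyphen explicitly in addition to \w.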
Matching Tag Elements
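Consider the HTML string <div>I am a div</div>. A regular expression for it has to cover the opening tag, the content, and the closing tag.
A tag name starts with a letter or underscore, followed by any number of word characters, hyphens, or dots. Because the pattern is written as a string (to be compiled later with new RegExp), the backslashes must be doubled:
const ncname = '[a-zA-Z_][\\w\\-.]*'
Combine it into an opening tag:
`<${ncname}>`
Full tag pattern:
`<${ncname}></${ncname}>`
To allow any characters between the opening and closing tags:
`<${ncname}>[\\s\\S]*</${ncname}>`
Matching Tag Attributes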
Attribute names start with a letter, underscore, or colon, followed by any number of letters, digits, underscores, hyphens, colons, or dots:
const attrKey = /[a-zA-Z_:][-a-zA-Z0-9_:.]*/
Supported attribute syntaxes:
class="title"
class='title'
class=title
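A first attempt captures the name and the value, where the value may be double-quoted, single-quoted, or unquoted:
const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)=("([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/
Testing the regex:
"class=abc".match(attr); // → ["class=abc", "class", "abc", undefined, undefined, "abc"]
"class='abc'".match(attr); // → ["class='abc'", "class", "'abc'", undefined, "abc", undefined]
In the second result, group 2 includes the surrounding quotes because the outer group captures the whole alternation. Turning that outer group into a non-capturing group (?:) leaves only the inner groups, which hold the value without quotes. A non-capturing group takes part in the match but is left out of the result:
"abcde".match(/a(?:b)c(.*)/); // → ["abcde", "de"]
Final attribute regex, with optional spaces around = and a non-capturing outer group:
const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g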
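As a rough sketch of how the final pattern can be applied, the snippet below extracts name/value pairs from an attribute string with String.prototype.matchAll. The names attrPattern and parseAttributes are hypothetical and not part of the article's parser.

```javascript
// Same pattern as the final attribute regex above (global flag required by matchAll).
const attrPattern = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g;

// Hypothetical helper: return every attribute found in the source string.
function parseAttributes(source) {
  const pairs = [];
  for (const m of source.matchAll(attrPattern)) {
    // m[2], m[3], m[4] hold the double-quoted, single-quoted, and unquoted value.
    // Exactly one of them is defined; ?? keeps an empty string such as class="".
    pairs.push({name: m[1], value: m[2] ?? m[3] ?? m[4]});
  }
  return pairs;
}

console.log(parseAttributes(`class="title" id='main' data-id = 42`));
// → [ { name: 'class', value: 'title' },
//     { name: 'id', value: 'main' },
//     { name: 'data-id', value: '42' } ]
```

Using ?? rather than || is a deliberate choice here so that an explicitly empty quoted value is not mistaken for a missing one.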
Matching Nodes
Combine tag and attribute patterns to match a complete node:
/<[a-zA-Z_][\w\-\.]*(?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*>[\s\S]*<\/[a-zA-Z_][\w\-\.]*>/
AST Parsing in Practice
Using the above regular expressions, the parser processes the HTML in three stages: start tag, inner content, and end tag.
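const DOM = /<[a-zA-Z_][\w\-\.]*(?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*>[\s\S]*<\/[a-zA-Z_][\w\-\.]*>/;
const startTag = /^<([a-zA-Z_][\w\-\.]*)((?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*)\s*(\/?)>/;
const endTag = /^<\/([a-zA-Z_][\w\-\.]*)>/;
const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g;
The startTag and endTag patterns are anchored with ^ so that they only match at the beginning of the remaining input. Without the anchor, match could find a tag further along in the string, and the parser would then strip the wrong characters from the front.
The parser maintains a buffer array (bufArray) that acts as a stack of unmatched start tags, and a result object (results) representing the AST root.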
const bufArray = [];
const results = {node: 'root', child: []};
let chars, match, last;
while (html && last != html) {
  last = html; // remember the input so the loop stops if nothing is consumed
  chars = true; // assume text until a tag is matched
  // parsing logic goes here
}
When an end tag (</...) is encountered, the parser removes the matched portion and calls parseEndTag to pop the corresponding start tag from the buffer.
if (html.indexOf("</") == 0) {
  match = html.match(endTag);
  if (match) {
    chars = false;
    html = html.substring(match[0].length);
    match[0].replace(endTag, parseEndTag);
  }
}
If a start tag is found, parseStartTag extracts the tag name, attributes, and dataset, then pushes the node onto bufArray (unless it is a unary/self‑closing tag).
else if (html.indexOf("<") == 0) {
  match = html.match(startTag);
  if (match) {
    chars = false;
    html = html.substring(match[0].length);
    match[0].replace(startTag, parseStartTag);
  }
}
Text that is not part of a tag is treated as a text node and added to the current parent.
if (chars) {
  let index = html.indexOf('<');
  let text;
  if (index < 0) {
    text = html;
    html = '';
  } else {
    text = html.substring(0, index);
    html = html.substring(index);
  }
  const node = {node: 'text', text};
  pushChild(node);
}
The helper pushChild adds a node to the root when the buffer is empty, or to the most recent unmatched start tag otherwise.
function pushChild(node) {
  if (bufArray.length === 0) {
    results.child.push(node);
  } else {
    const parent = bufArray[bufArray.length - 1];
    if (!parent.child) parent.child = [];
    parent.child.push(node);
  }
}
parseStartTag builds an element node, separates data- attributes into dataset, stores other attributes in attrs, and pushes the node onto the buffer unless it is unary.
function parseStartTag(tag, tagName, rest) {
  tagName = tagName.toLowerCase();
  const ds = {};
  const attrs = [];
  const node = {node: 'element', tag: tagName};
  rest.replace(attr, function(match, name) {
    // arguments[2..4]: double-quoted, single-quoted, or unquoted value
    const value = arguments[2] || arguments[3] || arguments[4] || '';
    if (name && name.indexOf('data-') == 0) {
      ds[name.replace('data-', '')] = value;
    } else if (name == 'class') {
      node.class = value;
    } else {
      attrs.push({name, value});
    }
  });
  node.dataset = ds;
  node.attrs = attrs;
  // arguments[7] is capture group 7 of startTag: the optional "/" of a unary tag
  if (!arguments[7]) {
    bufArray.push(node);
  } else {
    pushChild(node);
  }
}
parseEndTag finds the matching start tag in the buffer (searching from the end) and attaches it to the AST.
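function parseEndTag(tag, tagName) {
  tagName = tagName.toLowerCase(); // start tags are stored in lowercase
  let pos;
  for (pos = bufArray.length - 1; pos >= 0; pos--) {
    if (bufArray[pos].tag == tagName) break;
  }
  if (pos >= 0) {
    // Close the matching tag together with any unclosed tags nested inside it.
    while (bufArray.length > pos) pushChild(bufArray.pop());
  }
}
With these pieces, a simple yet functional HTML‑to‑AST parser is complete. The full implementation can be found in the referenced repositories.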
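For readers who want to run the whole thing in one piece, the following is a minimal sketch that assembles the steps above into a single function. It is not the referenced implementation. The function name parseHtml, the local name stack (standing in for bufArray), the startTag variant with only three capture groups, the ^ anchors, and the final flush of unclosed elements are all assumptions made for this sketch.

```javascript
// Groups of startTag: 1 = tag name, 2 = raw attribute string, 3 = "/" for unary tags.
const startTag = /^<([a-zA-Z_][\w\-\.]*)((?:\s+[a-zA-Z_:][-a-zA-Z0-9_:.]*\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'=<>`]+))*)\s*(\/?)>/;
const endTag = /^<\/([a-zA-Z_][\w\-\.]*)>/;
const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g;

function parseHtml(html) {
  const root = {node: 'root', child: []};
  const stack = []; // open elements still waiting for their end tag

  function pushChild(node) {
    const parent = stack.length ? stack[stack.length - 1] : root;
    if (!parent.child) parent.child = [];
    parent.child.push(node);
  }

  function closeTop() {
    const node = stack.pop(); // pop first so the node is attached to its parent
    pushChild(node);
  }

  function parseStartTag(tagName, rest, unary) {
    const node = {node: 'element', tag: tagName.toLowerCase()};
    const dataset = {};
    const attrs = [];
    for (const m of rest.matchAll(attr)) {
      const name = m[1];
      const value = m[2] ?? m[3] ?? m[4] ?? '';
      if (name.startsWith('data-')) dataset[name.slice(5)] = value;
      else if (name === 'class') node.class = value;
      else attrs.push({name, value});
    }
    node.dataset = dataset;
    node.attrs = attrs;
    if (unary) pushChild(node);
    else stack.push(node);
  }

  function parseEndTag(tagName) {
    const wanted = tagName.toLowerCase();
    let pos = stack.length - 1;
    while (pos >= 0 && stack[pos].tag !== wanted) pos--;
    if (pos < 0) return; // stray end tag: ignore it
    while (stack.length > pos) closeTop(); // also closes unclosed descendants
  }

  while (html) {
    let match;
    if ((match = html.match(endTag))) {
      html = html.slice(match[0].length);
      parseEndTag(match[1]);
    } else if ((match = html.match(startTag))) {
      html = html.slice(match[0].length);
      parseStartTag(match[1], match[2], match[3]);
    } else {
      // Text runs to the next "<". Searching from index 1 means a "<" that
      // does not start a valid tag is consumed as text, so the loop always advances.
      const index = html.indexOf('<', 1);
      const text = index < 0 ? html : html.slice(0, index);
      html = index < 0 ? '' : html.slice(index);
      pushChild({node: 'text', text});
    }
  }

  while (stack.length) closeTop(); // attach elements that were never closed
  return root;
}

const tree = parseHtml('<div class="classAttr" data-type="dataType" data-id="dataId" style="color:red">outer text<span>inner text</span></div>');
console.log(JSON.stringify(tree, null, 2));
```

The sketch still ignores comments, doctype declarations, attributes without a value, and void elements such as br written without a trailing slash, so it is a learning aid rather than a replacement for a full HTML parser.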
References
Vue HTML parser source: https://github.com/vuejs/vue/blob/dev/src/compiler/parser/html-parser.js
Easy‑AST example: https://github.com/antiter/blogs/tree/master/code-mark/easy-ast.js
WecTeam
WecTeam (维C团) is the front‑end technology team of JD.com’s Jingxi business unit, focusing on front‑end engineering, web performance optimization, mini‑program and app development, serverless, multi‑platform reuse, and visual building.