How to Build a Simple HTML AST Parser in JavaScript

This article explains how to transform raw HTML strings into a structured abstract syntax tree (AST) using JavaScript regular expressions and a step‑by‑step parsing algorithm, covering tag, attribute, and text node handling, with a complete example implementation of a lightweight AST parser.

WecTeam

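AST parsers are frequently used in frameworks such as Vue.js, whose template compiler parses templates into an AST before generating the render functions that produce VNodes. The same idea can be applied to convert unstructured HTML strings into structured objects for analysis, processing, and rendering.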

How to Parse into an AST?

HTML source code is just a text string, but tools such as browsers and Vue's template compiler need a structured representation to understand each node's purpose (Babel does the same for JavaScript source). This article uses regular expressions to transform the string into a hierarchical data structure.

This article describes an AST parser implementation with all core details in under a hundred lines of code.

Goal

The goal is to convert the following HTML snippet into an AST:

<div class="classAttr" data-type="dataType" data-id="dataId" style="color:red">I am the outer div<span>I am the inner span</span></div>

The resulting AST (properties can be defined as needed) looks like:

{
  "node": "root",
  "child": [{
    "node": "element",
    "tag": "div",
    "class": "classAttr",
    "dataset": {"type": "dataType", "id": "dataId"},
    "attrs": [{"name": "style", "value": "color:red"}],
    "child": [{
      "node": "text",
      "text": "I am the outer div"
    }, {
      "node": "element",
      "tag": "span",
      "dataset": {},
      "attrs": [],
      "child": [{
        "node": "text",
        "text": "I am the inner span"
      }]
    }]
  }]
}

The root node contains a child array; each element node records its tag, class, data attributes, other attributes, and nested children.
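Because every node keeps its children in a child array, the tree can be traversed with a short recursive function. The following is a minimal sketch, assuming the node shape shown above; collectTags and sampleAst are hypothetical names introduced only for illustration.

```javascript
// Hypothetical helper (not part of the parser itself): collect the tag
// names of all element nodes in document order with a depth-first walk.
function collectTags(node, out = []) {
  if (node.node === 'element') out.push(node.tag);
  // Text nodes and childless elements have no child array.
  for (const child of node.child || []) collectTags(child, out);
  return out;
}

// A hand-written AST in the same shape as the example above.
const sampleAst = {
  node: 'root',
  child: [{
    node: 'element', tag: 'div', dataset: {}, attrs: [],
    child: [
      { node: 'text', text: 'outer' },
      { node: 'element', tag: 'span', dataset: {}, attrs: [],
        child: [{ node: 'text', text: 'inner' }] }
    ]
  }]
};

console.log(collectTags(sampleAst)); // ["div", "span"]
```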

Review of Regular Expressions

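^ matches the start of the input, or the start of each line when the m flag is set (e.g., /^a/ matches "ab" but not "ba").

$ matches the end of the input, or the end of each line when the m flag is set.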

* matches zero or more occurrences of the preceding token.

+ matches one or more occurrences.

[ab] matches either "a" or "b".

\w matches word characters (letters, digits, underscore).
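The tokens reviewed above can be checked directly in a JavaScript console. This is a small illustrative sketch; the samples object and its property names are made up for the demonstration.

```javascript
// One entry per token from the review above.
const samples = {
  startHit: /^a/.test('ab'),               // true: "a" is at the start
  startMiss: /^a/.test('ba'),              // false: "a" is not at the start
  endHit: /b$/.test('ab'),                 // true: "b" is at the end
  starMany: 'aaa'.match(/a*/)[0],          // "aaa": as many as possible
  starNone: 'bbb'.match(/a*/)[0],          // "": zero occurrences still match
  plusNone: 'bbb'.match(/a+/),             // null: at least one "a" is required
  charClass: 'cab'.match(/[ab]/)[0],       // "a": first "a" or "b" found
  wordChars: 'x_1-y'.match(/\w+/)[0]       // "x_1": the hyphen is not a word character
};

console.log(samples);
```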

Matching Tag Elements

Consider the HTML string <div>I am a div</div>. The goal is a regular expression that matches the opening tag, the content, and the closing tag.

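const ncname = '[a-zA-Z_][\\w\\-\\.]*'

The pattern is kept in a string so that it can be interpolated and passed to new RegExp(), which is why every backslash is doubled. An opening tag then becomes:

`<${ncname}>`

An opening tag followed immediately by its closing tag:

`<${ncname}></${ncname}>`

To allow any characters, including line breaks, between the two tags:

`<${ncname}>[\\s\\S]*</${ncname}>`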
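To see the pattern at work, the string has to be turned into a RegExp object. The sketch below is one way to do that; simpleTag is a hypothetical name, and the capture groups and the ^ and $ anchors are added here only to make the result easy to inspect. Note that backslashes are doubled because the pattern is written inside string and template literals.

```javascript
// Tag-name pattern as a string; "\\w" in the literal becomes "\w" in the regex.
const ncname = '[a-zA-Z_][\\w\\-\\.]*';

// Opening tag, any content (including newlines), closing tag.
const simpleTag = new RegExp(`^<(${ncname})>([\\s\\S]*)</(${ncname})>$`);

const m = '<div>hello\nworld</div>'.match(simpleTag);
console.log(m[1]); // "div"
console.log(m[2]); // "hello\nworld"
console.log(m[3]); // "div"

// A tag name may not start with a digit.
console.log(simpleTag.test('<1div>x</1div>')); // false
```

This pattern does not check that the opening and closing names are the same, which is one reason the parser later tracks open tags in a buffer instead of relying on a single expression.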

Matching Tag Attributes

An attribute name starts with a letter, underscore, or colon, followed by any number of letters, digits, underscores, hyphens, colons, or dots:

const attrKey = /[a-zA-Z_:][-a-zA-Z0-9_:.]*/

Supported attribute syntaxes:

class="title"

class='title'

class=title

const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)=("([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/

Testing the regex:

"class=abc".match(attr); // → ["class=abc", "class", "abc", undefined, undefined, "abc"]
"class='abc'".match(attr); // → ["class='abc'", "class", "'abc'", undefined, "abc", undefined]

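In the second result, the outer value group (index 2) captures the value together with its surrounding quotes, duplicating what the inner groups already capture. Turning the outer group into a non‑capturing group (?:) drops it from the result, so only the unquoted values remain.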

"abcde".match(/a(?:b)c(.*)/); // → ["abcde", "de"]

Final attribute regex with optional spaces and non‑capturing groups:

const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g
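With the g flag set, the expression can be applied repeatedly to collect every attribute in a piece of text. The following sketch uses String.prototype.matchAll for that; listAttrs is a hypothetical helper name, not something defined in the article.

```javascript
const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g;

// Hypothetical helper: list every name/value pair found in the given text.
function listAttrs(source) {
  const pairs = [];
  for (const m of source.matchAll(attr)) {
    // m[2], m[3], m[4] are the double-quoted, single-quoted, and unquoted
    // value; exactly one of them is defined for each match.
    pairs.push({ name: m[1], value: m[2] ?? m[3] ?? m[4] ?? '' });
  }
  return pairs;
}

console.log(listAttrs(`class="title" id='main' data-id = 42`));
// [ { name: 'class', value: 'title' },
//   { name: 'id', value: 'main' },
//   { name: 'data-id', value: '42' } ]
```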

Matching Nodes

Combine tag and attribute patterns to match a complete node:

/<[a-zA-Z_][\w\-\.]*(?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*>[\s\S]*<\/[a-zA-Z_][\w\-\.]*>/

AST Parsing in Practice

Using the above regular expressions, the parser processes the HTML in three stages: start tag, inner content, and end tag.

const DOM = /<[a-zA-Z_][\w\-\.]*(?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*>[\s\S]*<\/[a-zA-Z_][\w\-\.]*>/;
const startTag = /<([a-zA-Z_][\w\-\.]*)((?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*)\s*(\/?)>/;
const endTag = /<\/([a-zA-Z_][\w\-\.]*)>/;
const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g;
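It helps to look at what the capture groups of startTag and endTag contain before wiring them into the parser. The sketch below is illustrative only; openMatch and closeMatch are hypothetical variable names.

```javascript
const startTag = /<([a-zA-Z_][\w\-\.]*)((?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*)\s*(\/?)>/;
const endTag = /<\/([a-zA-Z_][\w\-\.]*)>/;

const openMatch = '<img src="a.png" alt="logo" />'.match(startTag);
console.log(openMatch[1]); // "img": the tag name
console.log(openMatch[2]); // ' src="a.png" alt="logo"': the raw attribute text
console.log(openMatch[7]); // "/": present only for a self-closing tag

const closeMatch = '</div>'.match(endTag);
console.log(closeMatch[1]); // "div"
```

Groups 3 to 6 sit inside a repeated group, so they only hold the values from the last attribute; this is why the attribute text in group 2 is scanned again with the separate attr expression.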

The parser maintains a buffer array (bufArray) for unmatched start tags and a result object (results) representing the AST root.

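const bufArray = [];
const results = {node: 'root', child: []};
// last remembers the input from the previous pass; html is the string to parse.
let chars, match, last;
// Stop when the input is consumed or when a pass makes no progress.
while (html && last != html) {
  last = html;
  chars = true;
  // parsing logic goes here
}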

When an end tag (</...) is encountered, the parser removes the matched portion and calls parseEndTag to pop the corresponding start tag from the buffer.

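if (html.indexOf("</") == 0) {
  match = html.match(endTag);
  // endTag is not anchored, so make sure the match starts at position 0.
  if (match && match.index === 0) {
    chars = false;
    html = html.substring(match[0].length);
    // replace() is used only to call parseEndTag with the capture groups.
    match[0].replace(endTag, parseEndTag);
  }
}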

If a start tag is found, parseStartTag extracts the tag name, attributes, and dataset, then pushes the node onto bufArray (unless it is a unary/self‑closing tag).

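else if (html.indexOf("<") == 0) {
  match = html.match(startTag);
  // startTag is not anchored, so make sure the match starts at position 0.
  if (match && match.index === 0) {
    chars = false;
    html = html.substring(match[0].length);
    // replace() is used only to call parseStartTag with the capture groups.
    match[0].replace(startTag, parseStartTag);
  }
}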

Text that is not part of a tag is treated as a text node and added to the current parent.

if (chars) {
  let index = html.indexOf('<');
  let text;
  if (index < 0) {
    text = html;
    html = '';
  } else {
    text = html.substring(0, index);
    html = html.substring(index);
  }
  const node = {node: 'text', text};
  pushChild(node);
}

The helper pushChild adds a node to the root when the buffer is empty, or to the most recent unmatched start tag otherwise.

function pushChild(node) {
  if (bufArray.length === 0) {
    results.child.push(node);
  } else {
    const parent = bufArray[bufArray.length - 1];
    if (!parent.child) parent.child = [];
    parent.child.push(node);
  }
}

parseStartTag builds an element node, separates data- attributes into dataset, stores other attributes in attrs, and pushes the node onto the buffer unless it is unary.

function parseStartTag(tag, tagName, rest) {
  tagName = tagName.toLowerCase();
  const ds = {};
  const attrs = [];
  const node = {node: 'element', tag: tagName};
  rest.replace(attr, function(match, name) {
    // arguments[2..4]: the double-quoted, single-quoted, or unquoted value.
    const value = arguments[2] || arguments[3] || arguments[4] || '';
    if (name && name.indexOf('data-') == 0) {
      ds[name.replace('data-', '')] = value;
    } else if (name == 'class') {
      node.class = value;
    } else {
      attrs.push({name, value});
    }
  });
  node.dataset = ds;
  node.attrs = attrs;
  // arguments[7] is the last capture group of startTag: "/" for a unary tag.
  if (!arguments[7]) {
    // Keep the element open until its end tag arrives.
    bufArray.push(node);
  } else {
    // A unary tag has no children, so attach it to its parent right away.
    pushChild(node);
  }
}

parseEndTag finds the matching start tag in the buffer (searching from the end) and attaches it to the AST.

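function parseEndTag(tag, tagName) {
  // Start tags are stored in lower case, so normalize before comparing.
  tagName = tagName.toLowerCase();
  // Search the buffer from the end for the nearest matching start tag.
  let pos;
  for (pos = bufArray.length - 1; pos >= 0; pos--) {
    if (bufArray[pos].tag === tagName) break;
  }
  // Close the matching element and any unclosed elements opened after it,
  // attaching each one to its parent. A stray end tag (pos < 0) is ignored.
  if (pos >= 0) {
    while (bufArray.length > pos) {
      pushChild(bufArray.pop());
    }
  }
}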
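The fragments above can be assembled into a single function. What follows is a minimal sketch under stated assumptions rather than the authors' full implementation: parseHTML is a hypothetical wrapper name, the helpers are called directly with the capture groups instead of through replace(), each match is required to start at index 0 because the expressions are not anchored, a "<" that starts no tag is kept as text, and elements still open at the end of the input are attached to the tree.

```javascript
// Regular expressions as defined earlier in the article.
const startTag = /<([a-zA-Z_][\w\-\.]*)((?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*)\s*(\/?)>/;
const endTag = /<\/([a-zA-Z_][\w\-\.]*)>/;
const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g;

// parseHTML is a hypothetical wrapper name, not taken from the article.
function parseHTML(html) {
  const bufArray = [];                          // stack of open elements
  const results = { node: 'root', child: [] };  // AST root
  let last, match, chars;

  function pushChild(node) {
    const parent = bufArray.length ? bufArray[bufArray.length - 1] : results;
    if (!parent.child) parent.child = [];
    parent.child.push(node);
  }

  function parseStartTag(tagName, rest, unary) {
    const node = { node: 'element', tag: tagName.toLowerCase() };
    const dataset = {};
    const attrs = [];
    rest.replace(attr, (m, name, dq, sq, uq) => {
      const value = dq || sq || uq || '';
      if (name.indexOf('data-') === 0) dataset[name.slice(5)] = value;
      else if (name === 'class') node.class = value;
      else attrs.push({ name, value });
      return m;
    });
    node.dataset = dataset;
    node.attrs = attrs;
    if (unary) pushChild(node);
    else bufArray.push(node);
  }

  function parseEndTag(tagName) {
    const name = tagName.toLowerCase();
    let pos = bufArray.length - 1;
    while (pos >= 0 && bufArray[pos].tag !== name) pos--;
    // Close the match and anything left open above it; ignore stray end tags.
    while (pos >= 0 && bufArray.length > pos) pushChild(bufArray.pop());
  }

  while (html && last !== html) {
    last = html;
    chars = true;
    if (html.startsWith('</')) {
      match = html.match(endTag);
      if (match && match.index === 0) {
        chars = false;
        html = html.substring(match[0].length);
        parseEndTag(match[1]);
      }
    } else if (html.startsWith('<')) {
      match = html.match(startTag);
      if (match && match.index === 0) {
        chars = false;
        html = html.substring(match[0].length);
        parseStartTag(match[1], match[2], match[7] === '/');
      }
    }
    if (chars) {
      // Search from index 1 so that a "<" which starts no tag is kept as text.
      const index = html.indexOf('<', 1);
      const text = index < 0 ? html : html.substring(0, index);
      html = index < 0 ? '' : html.substring(index);
      pushChild({ node: 'text', text });
    }
  }
  // Attach elements that were never closed.
  while (bufArray.length) pushChild(bufArray.pop());
  return results;
}

const ast = parseHTML('<div class="box" data-id="7" style="color:red">outer<span>inner</span></div>');
console.log(JSON.stringify(ast, null, 2));
```

Comments, doctype declarations, and the contents of script or style elements are not handled by this sketch; a production parser such as the Vue one listed in the references deals with those cases separately.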

With these pieces, a simple yet functional HTML‑to‑AST parser is complete. The full implementation can be found in the referenced repositories.

References

Vue HTML parser source: https://github.com/vuejs/vue/blob/dev/src/compiler/parser/html-parser.js

Easy‑AST example: https://github.com/antiter/blogs/tree/master/code-mark/easy-ast.js

Written by

WecTeam

WecTeam (维C团) is the front‑end technology team of JD.com’s Jingxi business unit, focusing on front‑end engineering, web performance optimization, mini‑program and app development, serverless, multi‑platform reuse, and visual building.
