How to Build a Simple HTML AST Parser in JavaScript

This article explains how to transform raw HTML strings into a structured abstract syntax tree (AST) using JavaScript regular expressions and a step‑by‑step parsing algorithm, covering tag, attribute, and text node handling, with a complete example implementation of a lightweight AST parser.

WecTeam

Oct 17, 2019

AST parsers are common in front-end frameworks. Vue.js, for example, parses templates into an AST before generating the render functions that produce VNodes. The same idea can be applied to convert unstructured HTML strings into structured objects for analysis, processing, and rendering.

How to Parse into an AST?

HTML source code is just a text string, but browsers, Babel, or Vue need a structured representation of their input to understand each node's purpose. In this article, regular expressions recognize tags, attributes, and text, and a stack of open tags turns the flat string into a hierarchical data structure.

This article describes an AST parser implementation with all core details in under a hundred lines of code.

Goal

The goal is to convert the following HTML snippet into an AST:

<div class="classAttr" data-type="dataType" data-id="dataId" style="color:red">I am the outer div<span>I am the inner span</span></div>

The resulting AST (properties can be defined as needed) looks like:

{
  "node": "root",
  "child": [{
    "node": "element",
    "tag": "div",
    "class": "classAttr",
    "dataset": {"type": "dataType", "id": "dataId"},
    "attrs": [{"name": "style", "value": "color:red"}],
    "child": [{
      "node": "text",
      "text": "I am the outer div"
    }, {
      "node": "element",
      "tag": "span",
      "dataset": {},
      "attrs": [],
      "child": [{
        "node": "text",
        "text": "I am the inner span"
      }]
    }]
  }]
}

The root node contains a child array; each element node records its tag, class, data attributes, other attributes, and nested children.
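
To make the shape of this tree concrete, here is a minimal sketch of how such an AST could be consumed. The helper name collectText and the shortened sample strings are assumptions made for this illustration; they are not part of the article's parser.

```javascript
// Hypothetical helper: gathers the text of every text node, depth first.
function collectText(node, out = []) {
  if (node.node === 'text') {
    out.push(node.text);
  }
  // Text nodes carry no `child` array, so fall back to an empty list.
  for (const child of node.child || []) {
    collectText(child, out);
  }
  return out;
}

// A hand-written AST in the same shape as the one shown above.
const ast = {
  node: 'root',
  child: [{
    node: 'element',
    tag: 'div',
    class: 'classAttr',
    dataset: {type: 'dataType', id: 'dataId'},
    attrs: [{name: 'style', value: 'color:red'}],
    child: [
      {node: 'text', text: 'outer '},
      {
        node: 'element',
        tag: 'span',
        dataset: {},
        attrs: [],
        child: [{node: 'text', text: 'inner'}]
      }
    ]
  }]
};

console.log(collectText(ast).join('')); // → outer inner
```

Because every node uses the same child property, a single recursive function is enough to visit the whole tree.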

Review of Regular Expressions

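^ matches the start of the input, or the start of each line when the m flag is set (e.g., /^a/ matches "ab" but not "ba").

$ matches the end of the input, or the end of each line when the m flag is set.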

* matches zero or more occurrences of the preceding token.

+ matches one or more occurrences.

[ab] matches either "a" or "b".

\w matches word characters (letters, digits, underscore).
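
The tokens listed above can be checked quickly in a console. The sample strings below are made up for illustration.

```javascript
// Each entry exercises one token from the review above.
const checks = [
  /^a/.test('ab'),        // true: "a" is at the start
  /^a/.test('ba'),        // false: "a" is not at the start
  /b$/.test('ab'),        // true: "b" is at the end
  /^ab*$/.test('a'),      // true: * allows zero "b"s
  /^ab+$/.test('a'),      // false: + needs at least one "b"
  /^[ab]$/.test('b'),     // true: "b" is in the set
  /^\w+$/.test('data_1'), // true: letters, underscore, digit
  /^\w+$/.test('data-1')  // false: "-" is not a word character
];

console.log(checks);
```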

Matching Tag Elements

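Start with the HTML string <div>I am a div</div> and build a regular expression that matches the opening tag, the content, and the closing tag. A tag name starts with a letter or underscore, followed by word characters, hyphens, or dots. Because the pattern is stored in a string and later passed to new RegExp, each backslash must be doubled: in a plain string literal, '\w' collapses to 'w'.

const ncname = '[a-zA-Z_][\\w\\-\\.]*'

Opening tag:

`<${ncname}>`

Opening and closing tag:

`<${ncname}></${ncname}>`

To allow any characters between the tags:

`<${ncname}>[\\s\\S]*</${ncname}>`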
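
As a rough check, the pattern can be compiled with new RegExp and run against a sample string. The capture groups, the ^ and $ anchors, and the name simpleTag are additions made for this sketch. Backslashes are doubled because the pattern is written as a string rather than as a regex literal.

```javascript
// Tag name: a letter or underscore, then word characters, hyphens, or dots.
const ncname = '[a-zA-Z_][\\w\\-\\.]*';

// Groups: 1 = opening tag name, 2 = content, 3 = closing tag name.
const simpleTag = new RegExp(`^<(${ncname})>([\\s\\S]*)</(${ncname})>$`);

const m = '<div>hello</div>'.match(simpleTag);
console.log(m[1], m[2], m[3]); // → div hello div

// [\s\S] also matches line breaks, which "." does not do by default.
const multi = '<p>line one\nline two</p>'.match(simpleTag);
console.log(multi[2].split('\n').length); // → 2
```

Note that this pattern does not require the closing tag name to equal the opening one; it only checks that both look like tag names.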

Matching Tag Attributes

Attribute names start with a letter, underscore, or colon, followed by any number of letters, digits, underscores, hyphens, colons, or dots:

const attrKey = /[a-zA-Z_:][-a-zA-Z0-9_:.]*/

Supported attribute syntaxes:

class="title"

class='title'

class=title

const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)=("([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/

Testing the regex:

"class=abc".match(attr); // → ["class=abc", "class", "abc", undefined, undefined, "abc"]
"class='abc'".match(attr); // → ["class='abc'", "class", "'abc'", undefined, "abc", undefined]

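In the second result, the group at index 2 captures the whole value including its surrounding quotes. Marking that outer group as non‑capturing with (?:) removes it from the result, so only the unquoted value is captured.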

"abcde".match(/a(?:b)c(.*)/); // → ["abcde", "de"]

Final attribute regex, with optional whitespace around =, a non‑capturing outer group, and the g flag so that every attribute in a tag can be found:

const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g
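
One way to see what the final pattern captures is to iterate over all matches with String.prototype.matchAll, which requires the g flag. The helper name readAttrs and the sample attribute string are assumptions made for this sketch.

```javascript
const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g;

// Hypothetical helper: returns [name, value] pairs found in attribute text.
function readAttrs(source) {
  const pairs = [];
  for (const m of source.matchAll(attr)) {
    // Exactly one of the three value groups is defined per match:
    // m[2] double-quoted, m[3] single-quoted, m[4] unquoted.
    const value = m[2] ?? m[3] ?? m[4];
    pairs.push([m[1], value]);
  }
  return pairs;
}

console.log(readAttrs(`class="title" id='main' data-n = 3`));
// → [["class", "title"], ["id", "main"], ["data-n", "3"]]
```

Using ?? instead of || keeps an explicitly empty value such as class="" as an empty string from the matching group rather than falling through to the next one.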

Matching Nodes

Combine tag and attribute patterns to match a complete node:

/<[a-zA-Z_][\w\-\.]*(?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*>[\s\S]*<\/[a-zA-Z_][\w\-\.]*>/
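
A quick test of the combined pattern, with sample strings chosen for this sketch, shows both what it accepts and where it stops being useful. The name DOM matches the constant used later in the article.

```javascript
const DOM = /<[a-zA-Z_][\w\-\.]*(?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*>[\s\S]*<\/[a-zA-Z_][\w\-\.]*>/;

// A well-formed element with attributes and a nested child.
const wellFormed = DOM.test('<div class="a" data-id=1>hi<span>there</span></div>');
console.log(wellFormed); // → true

// Plain text contains no tags at all.
const plainText = DOM.test('just text');
console.log(plainText); // → false

// The pattern does not check that the closing tag matches the opening tag.
const mismatched = DOM.test('<div>text</span>');
console.log(mismatched); // → true
```

The last case is why a single regular expression is not enough to build a tree: matching pairs of tags requires keeping track of which tags are still open, which is what the stack in the next section does.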

AST Parsing in Practice

Using the above regular expressions, the parser processes the HTML in three stages: start tag, inner content, and end tag.

const DOM = /<[a-zA-Z_][\w\-\.]*(?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*>[\s\S]*<\/[a-zA-Z_][\w\-\.]*>/;
// ^ anchors both tag patterns to the start of the remaining input, so a tag
// further along the string cannot be mistaken for the current token.
const startTag = /^<([a-zA-Z_][\w\-\.]*)((?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*)\s*(\/?)>/;
const endTag = /^<\/([a-zA-Z_][\w\-\.]*)>/;
const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g;
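
The capture groups of the start-tag pattern are easier to follow with a concrete match. The sketch below uses the start-tag pattern anchored with ^ so that it only matches at the beginning of the input; the sample strings are made up.

```javascript
const startTag = /^<([a-zA-Z_][\w\-\.]*)((?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*)\s*(\/?)>/;

const open = '<div class="box" data-id="7">text'.match(startTag);
console.log(open[0]); // → <div class="box" data-id="7">
console.log(open[1]); // → div
console.log(open[2]); // raw attribute text, including the leading space
console.log(open[7] === ''); // → true (no "/" before ">", so not self-closing)

const unary = '<img src="a.png" />'.match(startTag);
console.log(unary[1], unary[7]); // → img /

// Anchored: a tag that is not at the start of the input is not matched.
console.log('text <b>'.match(startTag)); // → null
```

Group 1 is the tag name, group 2 is the whole attribute text (parsed later with the attribute pattern), and group 7 is the optional slash of a self-closing tag. Groups 3 to 6 belong to the last attribute matched inside the repeated group and are not used directly.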

The parser maintains a buffer array (bufArray) that acts as a stack of start tags still waiting for their end tags, and a result object (results) representing the AST root.

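const bufArray = [];
const results = {node: 'root', child: []};
let chars, match, last;
// Stop when the input is consumed or when an iteration makes no progress.
while (html && last !== html) {
  last = html;
  chars = true; // assume text until a tag is matched
  // parsing logic goes here
}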
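
The loop condition compares the input with its value from the previous iteration, so the loop ends either when the string is empty or when nothing could be consumed. The following stand-alone sketch shows that guard in isolation. The tokenize function and its simplified tag pattern are hypothetical and much cruder than the article's parser.

```javascript
// Hypothetical mini-tokenizer that reuses the same "no progress" guard.
function tokenize(html) {
  const tokens = [];
  let last;
  while (html && last !== html) {
    last = html;
    // Simplified pattern: a start or end tag at the beginning of the input.
    const tag = html.match(/^<\/?[a-zA-Z_][\w\-\.]*[^<>]*>/);
    if (tag) {
      tokens.push({type: 'tag', raw: tag[0]});
      html = html.substring(tag[0].length);
      continue;
    }
    const index = html.indexOf('<');
    if (index !== 0) {
      // Text runs up to the next "<", or to the end if there is none.
      const text = index < 0 ? html : html.substring(0, index);
      tokens.push({type: 'text', raw: text});
      html = html.substring(text.length);
    }
    // If html starts with "<" but no tag matched, nothing was consumed,
    // so `last === html` on the next check and the loop ends.
  }
  return {tokens, rest: html};
}

console.log(tokenize('<p>hi</p>').tokens.map(t => t.raw)); // → ["<p>", "hi", "</p>"]
console.log(tokenize('a<<b').rest); // → <<b
```

Without the guard, input such as a stray "<" would leave the string unchanged on every pass and the loop would never terminate.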

When the remaining input starts with an end tag (</...), the parser removes the matched portion and calls parseEndTag to pop the corresponding start tag from the buffer.

if (html.indexOf("</") == 0) {
  match = html.match(endTag);
  if (match) {
    chars = false;
    html = html.substring(match[0].length);
    match[0].replace(endTag, parseEndTag);
  }
}

If a start tag is found, parseStartTag extracts the tag name, attributes, and dataset, then pushes the node onto bufArray (unless it is a unary/self‑closing tag).

else if (html.indexOf("<") == 0) {
  match = html.match(startTag);
  if (match) {
    chars = false;
    html = html.substring(match[0].length);
    match[0].replace(startTag, parseStartTag);
  }
}

Text that is not part of a tag is treated as a text node and added to the current parent.

if (chars) {
  let index = html.indexOf('<');
  let text;
  if (index < 0) {
    text = html;
    html = '';
  } else {
    text = html.substring(0, index);
    html = html.substring(index);
  }
  // Skip empty text, e.g. when a "<" starts neither a start tag nor an end tag.
  if (text) {
    pushChild({node: 'text', text});
  }
}

The helper pushChild adds a node to the root when the buffer is empty, or to the most recent unmatched start tag otherwise.

function pushChild(node) {
  if (bufArray.length === 0) {
    results.child.push(node);
  } else {
    const parent = bufArray[bufArray.length - 1];
    if (!parent.child) parent.child = [];
    parent.child.push(node);
  }
}

parseStartTag builds an element node, separates data- attributes into dataset, stores other attributes in attrs, and pushes the node onto the buffer unless it is unary.

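function parseStartTag(tag, tagName, rest) {
  tagName = tagName.toLowerCase();
  const ds = {};
  const attrs = [];
  const node = {node: 'element', tag: tagName};
  // rest is the raw attribute text; the global attr pattern visits each pair.
  rest.replace(attr, function(match, name) {
    // Only one of the value groups (double-quoted, single-quoted, unquoted) is set.
    const value = arguments[2] || arguments[3] || arguments[4] || '';
    if (name && name.indexOf('data-') == 0) {
      ds[name.replace('data-', '')] = value;
    } else if (name == 'class') {
      node.class = value;
    } else {
      attrs.push({name, value});
    }
  });
  node.dataset = ds;
  node.attrs = attrs;
  // arguments[7] is startTag's last capture group: "/" for a self-closing tag.
  if (!arguments[7]) {
    bufArray.push(node); // wait for the matching end tag
  } else {
    pushChild(node); // unary tags are attached immediately
  }
}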

parseEndTag finds the matching start tag in the buffer, searching from the end, and attaches the closed element to its parent in the AST.

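function parseEndTag(tag, tagName) {
  tagName = tagName.toLowerCase(); // start tags are stored in lower case
  let pos;
  for (pos = bufArray.length - 1; pos >= 0; pos--) {
    if (bufArray[pos].tag === tagName) break;
  }
  // No matching start tag: ignore the stray end tag.
  if (pos < 0) return;
  // Close the matched element and any unclosed elements nested inside it.
  while (bufArray.length > pos) {
    pushChild(bufArray.pop());
  }
}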

With these pieces, a simple yet functional HTML‑to‑AST parser is complete. The full implementation can be found in the referenced repositories.
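
For readers who want to run the pieces together, the following is one possible assembly of the snippets above into a single function. It is a sketch, not the referenced implementation. The function name parseHTML is chosen here, and the sketch makes a few adjustments of its own: the tag patterns are anchored with ^, attributes are read with matchAll, an end tag also closes any elements left open inside it, and elements that are never closed are attached at the end.

```javascript
const startTag = /^<([a-zA-Z_][\w\-\.]*)((?:\s+([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+)))*)\s*(\/?)>/;
const endTag = /^<\/([a-zA-Z_][\w\-\.]*)>/;
const attr = /([a-zA-Z_:][-a-zA-Z0-9_:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'=<>`]+))/g;

function parseHTML(html) {
  const bufArray = [];                       // stack of unclosed elements
  const results = {node: 'root', child: []}; // AST root
  let last;

  // Attach a node to the innermost open element, or to the root.
  function pushChild(node) {
    const parent = bufArray.length ? bufArray[bufArray.length - 1] : results;
    if (!parent.child) parent.child = [];
    parent.child.push(node);
  }

  function parseStartTag(tagName, rest, unary) {
    const node = {node: 'element', tag: tagName.toLowerCase()};
    const dataset = {};
    const attrs = [];
    for (const m of rest.matchAll(attr)) {
      const name = m[1];
      const value = m[2] ?? m[3] ?? m[4] ?? '';
      if (name.startsWith('data-')) dataset[name.slice(5)] = value;
      else if (name === 'class') node.class = value;
      else attrs.push({name, value});
    }
    node.dataset = dataset;
    node.attrs = attrs;
    if (unary) pushChild(node);
    else bufArray.push(node);
  }

  function parseEndTag(tagName) {
    tagName = tagName.toLowerCase();
    let pos = bufArray.length - 1;
    while (pos >= 0 && bufArray[pos].tag !== tagName) pos--;
    if (pos < 0) return; // stray end tag: ignore it
    // Close the matching element and anything left open inside it.
    while (bufArray.length > pos) pushChild(bufArray.pop());
  }

  while (html && last !== html) {
    last = html;
    let match;
    if ((match = html.match(endTag))) {
      html = html.substring(match[0].length);
      parseEndTag(match[1]);
    } else if ((match = html.match(startTag))) {
      html = html.substring(match[0].length);
      parseStartTag(match[1], match[2], match[7]);
    } else {
      const index = html.indexOf('<');
      if (index === 0) break; // a "<" that starts no tag: stop parsing
      const text = index < 0 ? html : html.substring(0, index);
      html = html.substring(text.length);
      pushChild({node: 'text', text});
    }
  }
  // Attach elements that were never closed.
  while (bufArray.length) pushChild(bufArray.pop());
  return results;
}

const ast = parseHTML('<div class="classAttr" data-type="dataType" data-id="dataId" style="color:red">outer<span>inner</span></div>');
console.log(JSON.stringify(ast, null, 2));
```

Comments, doctype declarations, and script or style content are not handled by this sketch; the Vue parser listed in the references shows how a production parser deals with them.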

References

Vue HTML parser source: https://github.com/vuejs/vue/blob/dev/src/compiler/parser/html-parser.js

Easy‑AST example: https://github.com/antiter/blogs/tree/master/code-mark/easy-ast.js

Written by

WecTeam

WecTeam (维C团) is the front‑end technology team of JD.com’s Jingxi business unit, focusing on front‑end engineering, web performance optimization, mini‑program and app development, serverless, multi‑platform reuse, and visual building.
