Dynamic Web Crawling Techniques for Vulnerability Scanning with Pyppeteer
This article details the practical implementation of a dynamic web crawler for vulnerability scanning, covering Chrome headless setup, browser initialization, JavaScript hook injection for DOM events, navigation locking, form handling, link collection, deduplication, and task scheduling using pyppeteer.
Dynamic crawling is a prerequisite for web vulnerability discovery; this guide explains the key issues and solutions when building a dynamic crawler using Python's pyppeteer (an unofficial Puppeteer port).
1. Introduction
Static crawlers fail on modern SPA frameworks (Vue, React) and ES6 code. Headless Chromium now provides the necessary rendering capabilities, so the article adopts Chromium's headless mode as the content engine.
2. Initialization Settings
To avoid XSS Auditor and other interference, the browser is launched with a set of flags and an incognito context is created. Example launch code:
from pyppeteer import launch

browser = await launch({
    "executablePath": chrome_executable_path,
    "args": [
        "--disable-gpu",
        "--disable-web-security",
        "--disable-xss-auditor",            # disable the XSS Auditor
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--allow-running-insecure-content", # allow loading insecure content
        "--disable-webgl",
        "--disable-popup-blocking"
    ],
    "ignoreHTTPSErrors": True               # ignore certificate errors
})

Then an incognito context and a page are created with custom settings such as a normal User-Agent, request interception, JavaScript enabled, caching disabled, and a fixed viewport size.
context = await browser.createIncognitoBrowserContext()
page = await context.newPage()
tasks = [
    asyncio.ensure_future(page.setUserAgent("...")),
    asyncio.ensure_future(page.evaluateOnNewDocument("...")),
    asyncio.ensure_future(page.setRequestInterception(True)),
    asyncio.ensure_future(page.setJavaScriptEnabled(True)),
    asyncio.ensure_future(page.setCacheEnabled(False)),
    asyncio.ensure_future(page.setViewport({"width": 1920, "height": 1080}))
]
await asyncio.wait(tasks)

3. Code Injection
Before the page loads, JavaScript hooks are injected to capture new URLs. The article shows how to override window.history.pushState and replaceState, listen for hashchange, and hook window.open and window.close. It also mentions hooking WebSocket, EventSource, and fetch (code omitted for brevity).
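The omitted hooks follow the same pattern as the pushState override below: wrap or replace the original function, report the URL, and neutralise anything that would tear the page down. A minimal sketch, where `collected_urls` is a hypothetical sink the crawler reads back, not the article's code:

```javascript
// Collect URLs that scripts try to open or fetch, instead of letting the
// page navigate away. collected_urls is a hypothetical sink.
const collected_urls = [];

const old_open = globalThis.open;
globalThis.open = function (url) {
    collected_urls.push(String(url));
    return null; // suppress the pop-up; the crawler schedules the URL itself
};

globalThis.close = function () {}; // ignore window.close() so the tab survives

const old_fetch = globalThis.fetch;
globalThis.fetch = function (resource, init) {
    collected_urls.push(String(resource)); // record, then still perform the request
    return old_fetch.apply(globalThis, arguments);
};

// hashchange targets are captured the same way (guarded for non-browser hosts)
globalThis.addEventListener && globalThis.addEventListener("hashchange", function () {
    collected_urls.push(String(globalThis.location.href));
});
```

WebSocket and EventSource can be wrapped identically by replacing their constructors with functions that record the first argument before delegating.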
window.history.pushState = function(a, b, url) { console.log(url); }
window.history.replaceState = function(a, b, url) { console.log(url); }
Object.defineProperty(window.history, "pushState", {"writable": false, "configurable": false});
Object.defineProperty(window.history, "replaceState", {"writable": false, "configurable": false});

4. Navigation Locking
Unwanted navigation can interrupt crawling. The article discusses three strategies: cancel front‑end redirects while recording the target, follow back‑end redirects with empty bodies, and render bodies with content while recording the Location header. Because aborting a navigation request throws an exception, the solution is to intercept the request and respond with HTTP 204, which tells the browser to stay on the current document.
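For the back-end redirect strategies, a response listener along these lines can record the Location target while the request interceptor keeps the browser parked on the current document; `task_queue` is a hypothetical collection owned by the scheduler:

```python
from urllib.parse import urljoin

task_queue = []  # hypothetical: URLs waiting to be crawled

def record_redirect(response):
    """Queue the target of a 3xx response instead of following it."""
    if 300 <= response.status < 400:
        location = response.headers.get("location")
        if location:
            # Location may be relative; resolve it against the response URL
            task_queue.append(urljoin(response.url, location))

# Registered on the pyppeteer page alongside the request interceptor:
# page.on("response", record_redirect)
```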
async def intercept_request(request: Request):
    if request.isNavigationRequest() and not request.frame.parentFrame:
        # respond with 204 so the browser stays on the current document
        await request.respond({"status": 204})
        # save the request to the task queue

page.on('request', lambda r: asyncio.ensure_future(intercept_request(r)))

5. Form Handling
Static form reconstruction is insufficient; the crawler must fill and submit forms exactly as a real user would. The guide categorises input types (text, email, tel, radio, checkbox, select, file, hidden) and shows how to populate them, remove restrictive attributes (accept, required) for file uploads, and submit forms without causing page reloads (e.g., using a hidden iframe or invoking form.submit() directly).
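As a sketch of the per-type handling, a fill value can be chosen for each input type before it is typed in, and restrictive attributes stripped first. The mapping below is an assumption for illustration, not the article's exact values:

```python
# Hypothetical per-type fill values; anything unrecognised falls back to text.
FILL_VALUES = {
    "text": "sec_auto_text",
    "email": "sec_auto@example.com",
    "tel": "13800000000",
    "password": "SecAuto123!",
    "number": "1",
}

def value_for_input(input_type):
    """Pick a plausible value so every field passes basic validation."""
    return FILL_VALUES.get(input_type, FILL_VALUES["text"])

# Before filling file inputs, restrictive attributes are removed in-page:
# await page_handler.querySelectorAllEval(
#     "input[type=file]",
#     "nodes => nodes.forEach(n => { n.removeAttribute('accept'); n.removeAttribute('required'); })")
```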
# Example: fill a multi-select element
select_elements = await page_handler.querySelectorAll("select")
for each in select_elements:
    # tag the element with a unique attribute so it can be re-selected
    random_str = get_random_str()
    await page_handler.evaluate("(ele, value) => ele.setAttribute('sec_auto_select', value)", each, random_str)
    attr_selector = f"select[sec_auto_select={random_str}]"
    value_list = await page_handler.querySelectorEval(attr_selector, get_all_options_values_js())
    if len(value_list) > 0:
        await page_handler.select(attr_selector, value_list[0])

6. Event Triggering
All registered events should be triggered. For inline events, the script collects elements with attributes like onclick, onblur, etc., and dispatches a CustomEvent. For DOM0 events, property setters are overridden via Object.defineProperties on HTMLElement.prototype. For DOM2 events, addEventListener is wrapped to log and then call the original handler.
// Hook DOM0 events
Object.defineProperties(HTMLElement.prototype, {
    onclick: {set: function(newValue){ onclick = newValue; dom0_listener_hook(this, "click"); }},
    onchange: {set: function(newValue){ onchange = newValue; dom0_listener_hook(this, "change"); }},
    // ... other events
});
Object.defineProperty(HTMLElement.prototype, "onclick", {"configurable": false});

// Hook DOM2 events
let old_event_handle = Element.prototype.addEventListener;
Element.prototype.addEventListener = function(event_name, event_func, useCapture) {
    let name = `<${this.tagName}>` + this.id + this.name + this.getAttribute("class") + "|" + event_name;
    console.log(name);
    old_event_handle.apply(this, arguments);
};

7. Link Collection
Beyond href and src, the crawler extracts URLs from attributes such as data-url, longDesc, and lowsrc, and even from HTML comments. It uses DOM queries or a TreeWalker to gather these values and resolves relative URLs using the <base> tag when present.
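The in-page version walks comment nodes with a TreeWalker and NodeFilter.SHOW_COMMENT; the same comment mining can be sketched over raw HTML in Python (the regexes are assumptions, kept deliberately loose):

```python
import re

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.S)
ATTR_URL_RE = re.compile(r"""(?:href|src)\s*=\s*["']?([^"'\s>]+)""", re.I)

def urls_from_comments(html):
    """Pull href/src values out of commented-out markup, a common
    source of forgotten back-end paths."""
    found = []
    for comment in COMMENT_RE.findall(html):
        found.extend(ATTR_URL_RE.findall(comment))
    return found
```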
function get_src_or_href_sec_auto(nodes) {
    let result = [];
    for (let node of nodes) {
        let src = node.getAttribute("src");
        if (src) { result.push(src) }
    }
    return result;
}

links = await page_handler.querySelectorAllEval("[src]", get_src_or_href_sec_auto);

8. Deduplication
URL deduplication is complex; the article suggests combining parameter analysis, RESTful patterns, and structural similarity. It proposes a fuzzy feature vector derived from DOM structure, compressed via modulo and discretisation, stored in Elasticsearch with a whitespace analyzer. Matching vectors with a minimum‑should‑match threshold quickly yields similar pages.
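As one way to realise the fuzzy vector, tag frequencies can be bucketed with a modulo and serialised in the position:value form the stored field uses. The tag basis and modulus below are assumptions for illustration:

```python
from collections import Counter

# Assumed feature basis: one vector dimension per structurally interesting tag.
FEATURE_TAGS = ["a", "form", "input", "img", "script", "div", "li", "table"]

def fuzz_vector(tag_names, mod=8):
    """Discretise tag counts into small buckets so structurally similar
    pages produce mostly identical position:value terms."""
    counts = Counter(tag_names)
    return " ".join(f"{i}:{counts[tag] % mod}" for i, tag in enumerate(FEATURE_TAGS))
```

The resulting string is what gets indexed with a whitespace analyzer, so each position:value pair becomes one matchable term.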
"query": {
"match": {
"fuzz_vector": {
"query": "0:6 1:5 2:3 3:7 ...",
"operator": "or",
"minimum_should_match": 30
}
}
}9. Task Scheduling
Because launching a Chromium instance is expensive, the crawler keeps a single browser alive and opens multiple tabs (tasks) within it. An asynchronous scheduler limits the number of concurrent tabs based on CPU usage and adds new tasks to the event loop as slots become free.
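The concurrency cap can also be expressed with an asyncio.Semaphore; a minimal sketch, where the placeholder sleep stands in for opening a tab and crawling it:

```python
import asyncio

MAX_TABS = 10  # assumed cap; in practice derived from CPU load
visited = []

async def crawl_tab(url, sem):
    async with sem:  # at most MAX_TABS coroutines hold a "tab" at once
        await asyncio.sleep(0)  # placeholder for context.newPage() + crawl
        visited.append(url)

async def run(urls):
    sem = asyncio.Semaphore(MAX_TABS)
    await asyncio.gather(*(crawl_tab(u, sem) for u in urls))

asyncio.run(run([f"http://target/{i}" for i in range(25)]))
```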
class Scheduler(object):
    def __init__(self, task_queue):
        self.loop = asyncio.get_event_loop()
        self.max_task_count = 10
        self.finish_count = 0
        self.task_queue = task_queue
        self.task_count = len(task_queue)

    async def tab_task(self, num):
        print(f"task {num} start run ... ")
        await asyncio.sleep(1)  # placeholder for crawling one tab
        print(f"task {num} finish ... ")
        self.finish_count += 1

    async def manager_task(self):
        while len(self.task_queue) != 0 or self.finish_count != self.task_count:
            # minus one for the manager task itself
            if len(asyncio.all_tasks(self.loop)) - 1 < self.max_task_count and len(self.task_queue) != 0:
                param = self.task_queue.pop(0)
                self.loop.create_task(self.tab_task(param))
            await asyncio.sleep(0.5)

10. Conclusion
The article emphasizes that dynamic crawling requires continuous refinement, extensive event handling, and careful resource management. Sharing practical tips and code snippets helps practitioners build more effective scanners that discover deeper links and vulnerabilities.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.