Dynamic Web Crawling Techniques for Vulnerability Scanning with Pyppeteer
This article details the practical implementation of a dynamic web crawler for vulnerability scanning, covering Chrome headless setup, browser initialization, JavaScript hook injection for DOM events, navigation locking, form handling, link collection, deduplication, and task scheduling using pyppeteer.
Dynamic crawling is a prerequisite for web vulnerability discovery; this guide explains the key issues and solutions when building a dynamic crawler using Python's pyppeteer (an unofficial Puppeteer port).
1. Introduction
Static crawlers fail on modern SPA frameworks (Vue, React) and ES6 code. Headless Chromium now provides the necessary rendering capabilities, so the article adopts Chromium's headless mode as the content engine.
2. Initialization Settings
To avoid XSS Auditor and other interference, the browser is launched with a set of flags and an incognito context is created. Example launch code:
from pyppeteer import launch

browser = await launch({
    "executablePath": chrome_executable_path,
    "args": [
        "--disable-gpu",
        "--disable-web-security",
        "--disable-xss-auditor",            # disable the XSS Auditor
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--allow-running-insecure-content", # allow loading insecure content
        "--disable-webgl",
        "--disable-popup-blocking"
    ],
    "ignoreHTTPSErrors": True               # ignore certificate errors
})

Then an incognito context and a page are created with custom settings such as a normal User-Agent, request interception, JavaScript enabled, caching disabled, and a fixed viewport size.
context = await browser.createIncognitoBrowserContext()
page = await context.newPage()
tasks = [
    asyncio.ensure_future(page.setUserAgent("...")),
    asyncio.ensure_future(page.evaluateOnNewDocument("...")),
    asyncio.ensure_future(page.setRequestInterception(True)),
    asyncio.ensure_future(page.setJavaScriptEnabled(True)),
    asyncio.ensure_future(page.setCacheEnabled(False)),
    asyncio.ensure_future(page.setViewport({"width": 1920, "height": 1080}))
]
await asyncio.wait(tasks)

3. Code Injection
Before the page loads, JavaScript hooks are injected to capture new URLs. The article shows how to override window.history.pushState and replaceState, listen for hashchange, and hook window.open and window.close. It also mentions hooking WebSocket, EventSource, and fetch (code omitted for brevity).
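The omitted hooks follow the same pattern as the pushState override below: wrap or replace the original function, report the URL, and neutralise anything that would tear the page down. A minimal sketch, where `collected_urls` is a hypothetical sink the crawler reads back, not the article's code:

```javascript
// Collect URLs that scripts try to open or fetch, instead of letting the
// page navigate away. collected_urls is a hypothetical sink.
const collected_urls = [];

const old_open = globalThis.open;
globalThis.open = function (url) {
    collected_urls.push(String(url));
    return null; // suppress the pop-up; the crawler schedules the URL itself
};

globalThis.close = function () {}; // ignore window.close() so the tab survives

const old_fetch = globalThis.fetch;
globalThis.fetch = function (resource, init) {
    collected_urls.push(String(resource)); // record, then still perform the request
    return old_fetch.apply(globalThis, arguments);
};

// hashchange targets are captured the same way (guarded for non-browser hosts)
globalThis.addEventListener && globalThis.addEventListener("hashchange", function () {
    collected_urls.push(String(globalThis.location.href));
});
```

WebSocket and EventSource can be wrapped identically by replacing their constructors with functions that record the first argument before delegating.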
window.history.pushState = function(a, b, url) { console.log(url); }
window.history.replaceState = function(a, b, url) { console.log(url); }
Object.defineProperty(window.history, "pushState", {"writable": false, "configurable": false});
Object.defineProperty(window.history, "replaceState", {"writable": false, "configurable": false});

4. Navigation Locking
Unwanted navigation can interrupt crawling. The article discusses three strategies: cancel front‑end redirects while recording the target, follow back‑end redirects with empty bodies, and render bodies with content while recording the Location header. Because aborting a navigation request throws an exception, the solution is to intercept the request and respond with HTTP 204, which tells the browser to stay on the current document.
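For the back-end redirect strategies, a response listener along these lines can record the Location target while the request interceptor keeps the browser parked on the current document; `task_queue` is a hypothetical collection owned by the scheduler:

```python
from urllib.parse import urljoin

task_queue = []  # hypothetical: URLs waiting to be crawled

def record_redirect(response):
    """Queue the target of a 3xx response instead of following it."""
    if 300 <= response.status < 400:
        location = response.headers.get("location")
        if location:
            # Location may be relative; resolve it against the response URL
            task_queue.append(urljoin(response.url, location))

# Registered on the pyppeteer page alongside the request interceptor:
# page.on("response", record_redirect)
```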
async def intercept_request(request: Request):
    if request.isNavigationRequest() and not request.frame.parentFrame:
        # respond with 204 so the browser stays on the current document
        await request.respond({"status": 204})
        # save the request to the task queue

page.on('request', lambda r: asyncio.ensure_future(intercept_request(r)))

5. Form Handling
Static form reconstruction is insufficient; the crawler must fill and submit forms exactly as a real user would. The guide categorises input types (text, email, tel, radio, checkbox, select, file, hidden) and shows how to populate them, remove restrictive attributes (accept, required) for file uploads, and submit forms without causing page reloads (e.g., using a hidden iframe or invoking form.submit() directly).
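As a sketch of the per-type handling, a fill value can be chosen for each input type before it is typed in, and restrictive attributes stripped first. The mapping below is an assumption for illustration, not the article's exact values:

```python
# Hypothetical per-type fill values; anything unrecognised falls back to text.
FILL_VALUES = {
    "text": "sec_auto_text",
    "email": "sec_auto@example.com",
    "tel": "13800000000",
    "password": "SecAuto123!",
    "number": "1",
}

def value_for_input(input_type):
    """Pick a plausible value so every field passes basic validation."""
    return FILL_VALUES.get(input_type, FILL_VALUES["text"])

# Before filling file inputs, restrictive attributes are removed in-page:
# await page_handler.querySelectorAllEval(
#     "input[type=file]",
#     "nodes => nodes.forEach(n => { n.removeAttribute('accept'); n.removeAttribute('required'); })")
```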
# Example: fill a multi-select element
select_elements = await page_handler.querySelectorAll("select")
for each in select_elements:
    # tag the element with a unique attribute so it can be re-selected
    random_str = get_random_str()
    await page_handler.evaluate("(ele, value) => ele.setAttribute('sec_auto_select', value)", each, random_str)
    attr_selector = f"select[sec_auto_select={random_str}]"
    value_list = await page_handler.querySelectorEval(attr_selector, get_all_options_values_js())
    if len(value_list) > 0:
        await page_handler.select(attr_selector, value_list[0])

6. Event Triggering
All registered events should be triggered. For inline events, the script collects elements with attributes like onclick, onblur, etc., and dispatches a CustomEvent. For DOM0 events, property setters are overridden via Object.defineProperties on HTMLElement.prototype. For DOM2 events, addEventListener is wrapped to log and then call the original handler.
// Hook DOM0 events
Object.defineProperties(HTMLElement.prototype, {
    onclick: {set: function(newValue){ onclick = newValue; dom0_listener_hook(this, "click"); }},
    onchange: {set: function(newValue){ onchange = newValue; dom0_listener_hook(this, "change"); }},
    // ... other events
});
Object.defineProperty(HTMLElement.prototype, "onclick", {"configurable": false});

// Hook DOM2 events
let old_event_handle = Element.prototype.addEventListener;
Element.prototype.addEventListener = function(event_name, event_func, useCapture) {
    let name = `<${this.tagName}>` + this.id + this.name + this.getAttribute("class") + "|" + event_name;
    console.log(name);
    old_event_handle.apply(this, arguments);
};

7. Link Collection
Beyond href and src, the crawler extracts URLs from attributes such as data-url, longDesc, and lowsrc, and even from HTML comments. It uses DOM queries or a TreeWalker to gather these values and resolves relative URLs using the <base> tag when present.
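The in-page version walks comment nodes with a TreeWalker and NodeFilter.SHOW_COMMENT; the same comment mining can be sketched over raw HTML in Python (the regexes are assumptions, kept deliberately loose):

```python
import re

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.S)
ATTR_URL_RE = re.compile(r"""(?:href|src)\s*=\s*["']?([^"'\s>]+)""", re.I)

def urls_from_comments(html):
    """Pull href/src values out of commented-out markup, a common
    source of forgotten back-end paths."""
    found = []
    for comment in COMMENT_RE.findall(html):
        found.extend(ATTR_URL_RE.findall(comment))
    return found
```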
function get_src_or_href_sec_auto(nodes) {
    let result = [];
    for (let node of nodes) {
        let src = node.getAttribute("src");
        if (src) { result.push(src) }
    }
    return result;
}

links = await page_handler.querySelectorAllEval("[src]", get_src_or_href_sec_auto);

8. Deduplication
URL deduplication is complex; the article suggests combining parameter analysis, RESTful patterns, and structural similarity. It proposes a fuzzy feature vector derived from DOM structure, compressed via modulo and discretisation, stored in Elasticsearch with a whitespace analyzer. Matching vectors with a minimum‑should‑match threshold quickly yields similar pages.
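As one way to realise the fuzzy vector, tag frequencies can be bucketed with a modulo and serialised in the position:value form the stored field uses. The tag basis and modulus below are assumptions for illustration:

```python
from collections import Counter

# Assumed feature basis: one vector dimension per structurally interesting tag.
FEATURE_TAGS = ["a", "form", "input", "img", "script", "div", "li", "table"]

def fuzz_vector(tag_names, mod=8):
    """Discretise tag counts into small buckets so structurally similar
    pages produce mostly identical position:value terms."""
    counts = Counter(tag_names)
    return " ".join(f"{i}:{counts[tag] % mod}" for i, tag in enumerate(FEATURE_TAGS))
```

The resulting string is what gets indexed with a whitespace analyzer, so each position:value pair becomes one matchable term.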
"query": {
"match": {
"fuzz_vector": {
"query": "0:6 1:5 2:3 3:7 ...",
"operator": "or",
"minimum_should_match": 30
}
}
}9. Task Scheduling
Because launching a Chromium instance is expensive, the crawler keeps a single browser alive and opens multiple tabs (tasks) within it. An asynchronous scheduler limits the number of concurrent tabs based on CPU usage and adds new tasks to the event loop as slots become free.
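The concurrency cap can also be expressed with an asyncio.Semaphore; a minimal sketch, where the placeholder sleep stands in for opening a tab and crawling it:

```python
import asyncio

MAX_TABS = 10  # assumed cap; in practice derived from CPU load
visited = []

async def crawl_tab(url, sem):
    async with sem:  # at most MAX_TABS coroutines hold a "tab" at once
        await asyncio.sleep(0)  # placeholder for context.newPage() + crawl
        visited.append(url)

async def run(urls):
    sem = asyncio.Semaphore(MAX_TABS)
    await asyncio.gather(*(crawl_tab(u, sem) for u in urls))

asyncio.run(run([f"http://target/{i}" for i in range(25)]))
```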
class Scheduler(object):
    def __init__(self, task_queue):
        self.loop = asyncio.get_event_loop()
        self.max_task_count = 10
        self.finish_count = 0
        self.task_queue = task_queue
        self.task_count = len(task_queue)

    async def tab_task(self, num):
        print(f"task {num} start run ... ")
        await asyncio.sleep(1)  # placeholder for crawling one tab
        print(f"task {num} finish ... ")
        self.finish_count += 1

    async def manager_task(self):
        while len(self.task_queue) != 0 or self.finish_count != self.task_count:
            # minus one for the manager task itself
            if len(asyncio.all_tasks(self.loop)) - 1 < self.max_task_count and len(self.task_queue) != 0:
                param = self.task_queue.pop(0)
                self.loop.create_task(self.tab_task(param))
            await asyncio.sleep(0.5)

10. Conclusion
The article emphasizes that dynamic crawling requires continuous refinement, extensive event handling, and careful resource management. Sharing practical tips and code snippets helps practitioners build more effective scanners that discover deeper links and vulnerabilities.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.