Graph-Based Detection of Malicious Webpages: Methods, Experiments, and Future Work

This article presents a comprehensive study on detecting malicious webpages using heterogeneous graph structures and Graph Convolutional Networks, detailing background challenges, technical approaches, model iterations, optimization techniques for large‑scale deployment, experimental results, and directions for future research.

ByteDance Terminal Technology

Jan 11, 2022

As internet technologies have developed, malicious webpages that spread harmful content such as gambling, pornography, and phishing have become increasingly prevalent, especially on mobile platforms. This creates a need for effective detection methods.

The proposed solution builds a heterogeneous graph in which URLs are nodes and the order in which they load defines the edges. Nodes come in several types (download, redirect, etc.); optionally, the node type information is folded into the node features to reduce memory usage.
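The graph construction can be sketched in a few lines. This is only an illustration under stated assumptions: the trace format, the `build_url_graph` and `kind_feature` helpers, the example URLs, and the "page" kind are all hypothetical, and plain dictionaries stand in for whatever graph library the production system uses.

```python
from collections import defaultdict

# Illustrative node kinds. The article names "download" and "redirect";
# "page" is an assumed label for an ordinary page load.
KINDS = ["page", "redirect", "download"]


def build_url_graph(traces):
    """Build a directed URL graph from ordered load traces.

    Each trace is a list of (url, kind) pairs in loading order.
    Returns (node_kind, edge_count): the kind recorded for each URL and
    how many times one URL was immediately followed by another.
    """
    node_kind = {}
    edge_count = defaultdict(int)
    for trace in traces:
        for url, kind in trace:
            node_kind.setdefault(url, kind)  # first observed kind wins
        for (src, _), (dst, _) in zip(trace, trace[1:]):
            edge_count[(src, dst)] += 1
    return node_kind, dict(edge_count)


def kind_feature(kind):
    """One-hot encode the node kind so it can live in the feature vector."""
    return [1.0 if kind == k else 0.0 for k in KINDS]


# Hypothetical traces, made up for this example.
traces = [
    [("a.example/index", "page"), ("b.example/r", "redirect"),
     ("c.example/app.apk", "download")],
    [("d.example/home", "page"), ("b.example/r", "redirect")],
]

node_kind, edge_count = build_url_graph(traces)
features = {url: kind_feature(kind) for url, kind in node_kind.items()}
```

Encoding the kind as a feature keeps a single node table and a single edge table, instead of one structure per node type, which is one way the memory saving described above could arise.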

Textual information from webpages is also modeled using a Text‑GCN approach, and the two graphs (URL‑based and text‑based) are fused to capture both behavioral and content cues.
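For readers unfamiliar with Text-GCN, the sketch below shows how its text graph is typically weighted in the original Text-GCN paper: document-word edges carry TF-IDF and word-word edges carry positive pointwise mutual information (PMI) computed over sliding windows. The function name, window size, node naming (`doc0`, `doc1`, ...), and toy documents are assumptions for illustration; the article does not describe its own tokenization or weighting details.

```python
import math
from collections import Counter
from itertools import combinations


def text_graph_edges(docs, window=3):
    """Return a dict mapping (node, node) pairs to edge weights.

    docs is a list of token lists. Document nodes are named "doc<i>".
    """
    edges = {}
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))

    # Document-word edges weighted by TF-IDF.
    for i, doc in enumerate(docs):
        for word, count in Counter(doc).items():
            weight = (count / len(doc)) * math.log(n_docs / doc_freq[word])
            if weight > 0:
                edges[(f"doc{i}", word)] = weight

    # Collect sliding windows over every document.
    windows = []
    for doc in docs:
        if len(doc) <= window:
            windows.append(set(doc))
        else:
            windows.extend(set(doc[j:j + window])
                           for j in range(len(doc) - window + 1))

    # Word-word edges weighted by PMI, kept only when positive.
    n_win = len(windows)
    single = Counter(w for win in windows for w in win)
    pair = Counter(p for win in windows
                   for p in combinations(sorted(win), 2))
    for (a, b), count in pair.items():
        pmi = math.log(count * n_win / (single[a] * single[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges


# Toy, already-tokenized documents (hypothetical).
docs = [
    ["free", "casino", "bonus"],
    ["casino", "bonus", "win"],
    ["news", "weather", "report"],
]
edges = text_graph_edges(docs)
```

Fusing this with the URL graph then amounts to sharing node identifiers between the two edge sets, for example by linking each document node to the URL it was fetched from; how the article performs that fusion is not specified.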

The detection pipeline evolved through three model versions: (1) a GCN on a cross‑site redirect graph, (2) Text‑GCN on a heterogeneous text graph, and (3) a fused heterogeneous graph combined with Cluster‑Text‑GCN. The third version performed best, with F1 scores above 0.95 and strong interpretability.
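All three versions rest on the same graph convolution step. As a reminder of what one GCN layer computes, here is a dependency-free sketch of the standard propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 X W). It uses dense Python lists purely for readability; the function name and the toy inputs are assumptions, and a real deployment would use a sparse tensor library.

```python
import math


def gcn_layer(adj, feats, weight):
    """Apply one GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adj:    n x n adjacency matrix (list of lists, symmetric)
    feats:  n x f node feature matrix
    weight: f x h weight matrix
    """
    n = len(adj)
    n_feat = len(feats[0])
    n_hidden = len(weight[0])

    # Add self-loops, then normalize symmetrically by node degree.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

    # Aggregate neighbor features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n))
            for f in range(n_feat)] for i in range(n)]
    return [[max(0.0, sum(agg[i][f] * weight[f][h] for f in range(n_feat)))
             for h in range(n_hidden)] for i in range(n)]


# Toy path graph 0 - 1 - 2 with two features per node (made-up numbers).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[0.5], [-0.25]]
hidden = gcn_layer(adj, feats, weight)
```

Stacking two such layers and ending with a softmax over node classes gives the usual semi-supervised GCN classifier.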

To fit production deployment within a 32 GB memory limit, several optimizations were applied: reducing the node count, using sparse matrices and DataFrames, halving the storage used by the covariance matrix, filtering low‑frequency terms, pruning weak edges, employing Cluster‑GCN to cut the graph size, and customizing NetworkX data structures.
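Three of these optimizations are simple enough to sketch without any library. The helpers below are hypothetical and the thresholds are arbitrary. In particular, storing only one triangle of a symmetric matrix is an assumed reading of how the matrix storage was halved; the article does not say how it was done.

```python
from collections import Counter


def filter_rare_terms(docs, min_count=2):
    """Drop tokens whose corpus-wide count is below min_count."""
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in docs]


def prune_weak_edges(edges, min_weight):
    """Keep only edges whose weight is at least min_weight."""
    return {pair: w for pair, w in edges.items() if w >= min_weight}


class SymmetricSparse:
    """Sparse symmetric matrix that stores each nonzero pair once (i <= j)."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _key(i, j):
        return (i, j) if i <= j else (j, i)

    def set(self, i, j, value):
        key = self._key(i, j)
        if value:
            self._data[key] = value
        else:
            self._data.pop(key, None)  # zeros are not stored

    def get(self, i, j):
        return self._data.get(self._key(i, j), 0.0)

    def __len__(self):
        return len(self._data)


# Usage on made-up data.
docs = filter_rare_terms([["a", "b", "a"], ["b", "c"]], min_count=2)
strong = prune_weak_edges({("x", "y"): 0.1, ("y", "z"): 0.9}, min_weight=0.5)
m = SymmetricSparse()
m.set(3, 1, 2.5)
```

Because `get(1, 3)` and `get(3, 1)` read the same stored entry, the structure holds roughly half as many entries as a full sparse representation of the same symmetric matrix.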

These optimizations allow training and inference on graphs with millions of nodes and billions of edges within the memory constraints.
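The Cluster-GCN idea is what keeps any single training step small: nodes are partitioned into clusters, and each mini-batch trains on the subgraph induced by one or a few clusters, dropping edges that cross the batch boundary. The sketch below assumes the partition is already given (the Cluster-GCN paper obtains it with a graph clustering algorithm such as METIS); the function name and toy data are hypothetical.

```python
def induced_subgraph(edges, cluster_of, cluster_ids):
    """Return the nodes and edges of the subgraph for one mini-batch.

    edges:       dict mapping (u, v) to an edge weight
    cluster_of:  dict mapping each node to its cluster id
    cluster_ids: the cluster ids that make up this batch
    """
    keep = set(cluster_ids)
    nodes = {node for node, cid in cluster_of.items() if cid in keep}
    # Edges with an endpoint outside the batch are dropped.
    sub_edges = {(u, v): w for (u, v), w in edges.items()
                 if u in nodes and v in nodes}
    return nodes, sub_edges


# Toy graph with a precomputed two-cluster partition (made-up data).
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}

batch_nodes, batch_edges = induced_subgraph(edges, cluster_of, [0])
```

Only the batch subgraph needs to be materialized at once, so peak memory is governed by the largest batch rather than by the full graph.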

Since deployment, the system has identified roughly 3 million malicious domains, with about 100,000 new entries added each day.

Future work includes enriching the heterogeneous graph features, scaling the graph to cover broader time windows, and enhancing the model with higher‑order information. Planned applications extend to anomalous JavaScript detection and broader investigations of black‑ and gray‑market (underground industry) activity.

References: 1) Detecting Mobile Malicious Webpages in Real Time; 2) Malicious URL Detection using Machine Learning: A Survey; 3) Survey of Malicious Webpage Detection; 4) Graph Convolutional Networks for Text Classification; 5) Cluster‑GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks.
