Graph-Based Detection of Malicious Webpages: Methods, Experiments, and Future Directions
This article presents a comprehensive study on detecting malicious webpages by constructing heterogeneous graphs from URL redirection and textual features, applying Graph Convolutional Networks and Cluster‑Text‑GCN models, detailing optimization techniques for large‑scale deployment, and outlining future research directions.
As internet technologies have evolved, websites and web pages have proliferated, giving black- and gray-market operators new channels for spreading harmful content such as gambling, pornography, and illicit drugs. Numerous malicious-webpage detection methods have emerged in response.
The rise of mobile internet, widespread 5G coverage, and the growing number of mobile devices have introduced new characteristics for malicious webpages, which differ from PC‑based attacks and require dedicated detection approaches.
Recent reports from security firms indicate that the lifecycle of malicious pages, especially phishing sites, has dramatically shortened from dozens of hours to just a few hours, intensifying the need for timely detection techniques.
Open‑source community projects like PhishTank demonstrate the value of collaborative efforts in combating malicious web activities.
In the context of large‑scale mobile applications, massive traffic can be hijacked by malicious pages to disseminate prohibited information; therefore, effective identification methods are essential to protect user security and improve app experience.
Technical Approach
Malicious sites often exhibit characteristic behaviors: frequent redirects, prompts to download malicious apps, and attempts to invoke other applications. To capture these behaviors, we construct a heterogeneous graph in which the visited URLs form the nodes and the loading order defines the edges. Node types (e.g., download, redirect) are encoded as node features rather than as extra nodes, which reduces memory consumption.
Two graph-construction strategies were considered: creating a separate node for each node type, or representing each domain as a single node and storing the type as a feature. The latter was adopted because it yields fewer nodes and is therefore more efficient.
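The adopted strategy can be sketched as follows. This is a minimal illustration, not the production implementation: the `NODE_TYPES` tuple, the `build_domain_graph` helper, the input format (chains of `(url, node_type)` pairs in loading order), and the use of the URL hostname as the "domain" are all assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical node types; the article names "download" and "redirect" as examples.
NODE_TYPES = ("redirect", "download", "app_invoke")


def build_domain_graph(chains):
    """Build a domain-level graph from URL loading chains.

    chains: list of chains, each a list of (url, node_type) pairs in loading order.
    Returns (edges, features):
      edges maps (src_domain, dst_domain) -> number of observed hops,
      features maps domain -> list of per-type counts (one slot per NODE_TYPES entry).
    """
    edges = defaultdict(int)
    features = defaultdict(lambda: [0] * len(NODE_TYPES))
    for chain in chains:
        prev = None
        for url, node_type in chain:
            domain = urlsplit(url).hostname  # assumption: hostname stands in for the domain
            if domain is None:
                continue  # skip URLs without a parsable host
            # The type is stored as a feature of the domain, not as a new node.
            features[domain][NODE_TYPES.index(node_type)] += 1
            if prev is not None and prev != domain:
                edges[(prev, domain)] += 1  # record cross-site hops only
            prev = domain
    return dict(edges), dict(features)


chains = [
    [("http://a.example/start", "redirect"),
     ("http://b.example/jump", "redirect"),
     ("http://c.example/app.apk", "download")],
    [("http://a.example/other", "redirect"),
     ("http://b.example/jump2", "redirect")],
]
edges, features = build_domain_graph(chains)
```

With this layout the node count grows with the number of distinct domains only, while the type information survives as a fixed-length count vector per node.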
Malicious sites often belong to organized groups and share similar text and content themes (e.g., gambling, pornography). To exploit this, we incorporate textual information using the Text-GCN model and fuse the resulting text graph with the domain-graph structure.
Edge weights follow Text-GCN: document-word edges are weighted by TF-IDF, word-word edges by pointwise mutual information (PMI), and each node's connection to itself has weight 1.
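For reference, the weighting scheme defined in the Text-GCN paper (listed in the references) is reproduced below. It is assumed here that the article uses this definition unchanged. In the formula, \#W is the total number of sliding windows over the corpus, \#W(i) is the number of windows containing word i, and \#W(i, j) is the number of windows containing both words.

```latex
A_{ij} =
\begin{cases}
\mathrm{PMI}(i, j) & \text{if } i, j \text{ are words and } \mathrm{PMI}(i, j) > 0 \\
\mathrm{TF\text{-}IDF}_{ij} & \text{if } i \text{ is a document and } j \text{ is a word} \\
1 & \text{if } i = j \\
0 & \text{otherwise}
\end{cases}
\qquad
\mathrm{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\, p(j)}, \quad
p(i, j) = \frac{\#W(i, j)}{\#W}, \quad
p(i) = \frac{\#W(i)}{\#W}
```

Only positive PMI values produce an edge, so word pairs that co-occur less often than chance are left unconnected.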
With the weighted heterogeneous graph and node features, a GCN is trained on labeled domain nodes and then used to predict unknown domains.
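The prediction step can be illustrated with a two-layer GCN forward pass and a loss restricted to labeled nodes. This is a sketch under stated assumptions: the tiny four-node graph, the random features and weights, the hidden size of 8, and the helper names are all hypothetical, and the gradient updates that a real training loop performs are omitted.

```python
import numpy as np


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the propagation matrix used by GCN."""
    a_hat = adj + np.eye(adj.shape[0], dtype=adj.dtype)  # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(a_norm, x, w1, w2):
    """Two-layer GCN: softmax(A_norm @ ReLU(A_norm @ X @ W1) @ W2)."""
    hidden = np.maximum(a_norm @ x @ w1, 0.0)
    logits = a_norm @ hidden @ w2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def masked_cross_entropy(probs, labels, mask):
    """Average cross-entropy over labeled nodes only."""
    idx = np.flatnonzero(mask)
    return float(-np.log(probs[idx, labels[idx]]).mean())


rng = np.random.default_rng(0)
# Hypothetical 4-domain chain graph; nodes 0 and 3 are labeled, 1 and 2 are unknown.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=np.float32)
x = rng.normal(size=(4, 3)).astype(np.float32)            # node features
w1 = (0.1 * rng.normal(size=(3, 8))).astype(np.float32)   # untrained weights
w2 = (0.1 * rng.normal(size=(8, 2))).astype(np.float32)
labels = np.array([0, -1, -1, 1])                          # -1 marks unlabeled nodes
mask = labels >= 0

a_norm = normalize_adjacency(adj)
probs = gcn_forward(a_norm, x, w1, w2)
loss = masked_cross_entropy(probs, labels, mask)
predicted = probs.argmax(axis=1)                           # class per node, labeled or not
```

Because every node passes through the same propagation matrix, the unlabeled domains receive class probabilities in the same forward pass that is used to compute the loss on the labeled ones.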
Models and Experiments
Version 1: Cross‑site redirect graph + GCN
Version 2: Text‑GCN on heterogeneous text graph
Version 3: Fusion of redirect graph and text graph using Cluster‑Text‑GCN (best performance, F1 > 0.95 and strong interpretability)
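Version 3 builds on Cluster-GCN, which partitions the graph into clusters and trains on the induced subgraphs, so that edges between clusters are not used within a batch. The sketch below shows only that edge-selection idea. The cluster assignment is written by hand here as an assumption; the Cluster-GCN paper obtains it from a graph-clustering algorithm such as METIS, and the helper name is hypothetical.

```python
def intra_cluster_edges(edges, cluster_of):
    """Group edges by cluster, keeping only edges whose endpoints share a cluster."""
    batches = {}
    for u, v in edges:
        if cluster_of[u] == cluster_of[v]:
            batches.setdefault(cluster_of[u], []).append((u, v))
    return batches


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
cluster_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # hypothetical partition

batches = intra_cluster_edges(edges, cluster_of)
kept = sum(len(batch) for batch in batches.values())
retained_fraction = kept / len(edges)
```

Each value in `batches` is a self-contained subgraph that can be loaded and trained on independently, which is what keeps peak memory bounded by the largest cluster rather than by the whole graph.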
During deployment, constructing the heterogeneous graph caused significant memory pressure, often exceeding the 32 GB limit of the Dorado platform. To enable production use, we applied several optimizations:
Reduced the number of domain nodes by encoding node types as features.
Utilized sparse matrices for adjacency and data‑frame operations.
Stored only the upper-triangular part of the symmetric word co-occurrence matrices when calculating Text-GCN edge weights.
Raised TF‑IDF frequency thresholds, expanded stop‑word lists, limited the vocabulary to ~10 000 terms, and switched to float32.
Discarded low‑weight word‑word edges (threshold 0.03–0.08).
Adopted Cluster-GCN, which trains on only 60-70% of the edges while preserving accuracy.
Reimplemented NetworkX’s Graph class with a lightweight data structure.
These measures allow training and inference on graphs with millions of nodes and billions of edges within 32 GB of memory.
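Two of the measures above, storing only the upper triangle and discarding low-weight word-word edges, can be sketched with the standard library alone. This is an illustration under assumptions: the `pmi_edges` helper and the toy windows are hypothetical, the threshold of 0.05 is simply a value inside the stated 0.03-0.08 range, and the article does not say whether its threshold is applied to raw PMI as done here.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(windows, min_weight=0.05):
    """Compute word-word PMI edges from sliding windows.

    Only pairs (a, b) with a < b are stored, i.e. the upper triangle of the
    symmetric co-occurrence matrix. Edges with PMI below min_weight are dropped.
    """
    n = len(windows)
    word_count = Counter()
    pair_count = Counter()
    for window in windows:
        words = sorted(set(window))                 # each word counted once per window
        word_count.update(words)
        pair_count.update(combinations(words, 2))   # sorted input guarantees a < b
    edges = {}
    for (a, b), c_ab in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ) with probabilities as window fractions
        pmi = math.log(c_ab * n / (word_count[a] * word_count[b]))
        if pmi >= min_weight:
            edges[(a, b)] = pmi
    return edges


windows = [
    ["bet", "casino", "bonus"],
    ["bet", "casino"],
    ["casino", "bonus"],
    ["news", "weather"],
    ["news", "bet"],
]
edges = pmi_edges(windows)
```

Keeping one entry per unordered pair halves the storage for the symmetric matrix, and the threshold removes weak edges before they ever reach the graph. At production scale a sparse-matrix library would replace the dictionaries, but the bookkeeping is the same.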
Since launch, the cross-site redirect GCN model has identified roughly 3 million malicious domains, and about 100,000 new domains are added each day.
Future Work
Planned directions include enriching heterogeneous‑graph features to detect a broader range of malicious sites, scaling the graph to cover longer time spans, and leveraging higher‑order information for model improvement.
We also aim to extend the graph‑based approach to other scenarios such as anomalous JavaScript sequences and website hijacking analysis, and to use the discovered black‑gray‑market groups for offline attribution and takedown.
References
Detecting Mobile Malicious Webpages in Real Time
Malicious URL Detection using Machine Learning: A Survey
A Survey of Malicious Webpage Identification Research (恶意网页识别研究综述)
Graph Convolutional Networks for Text Classification
Cluster‑GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks
ByteDance Terminal Technology
Official account of ByteDance Terminal Technology, sharing technical insights and team updates.