
Graph-Based Detection of Malicious Webpages: Methods, Experiments, and Future Work

This article presents a comprehensive study on detecting malicious webpages using heterogeneous graph structures and Graph Convolutional Networks, detailing background challenges, technical approaches, model iterations, optimization techniques for large‑scale deployment, experimental results, and directions for future research.

ByteDance Terminal Technology

As internet technologies have developed, malicious webpages that spread harmful content such as gambling, pornography, and phishing have become increasingly prevalent, especially on mobile platforms. This creates a need for effective detection methods.

The proposed solution builds a heterogeneous graph in which URLs are nodes and the order in which they load defines the edges. Nodes fall into several types (download, redirect, etc.); optionally, the node type can be merged into the node features instead of being kept as a separate node class, which reduces memory usage.
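The graph construction described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_url_graph`, the `NODE_TYPES` list (including the "page" type), the input format, and the example URLs are all assumptions made for the example. It shows load order becoming directed edges and node type being folded into a one-hot feature vector.

```python
from collections import defaultdict

# Hypothetical type list: the article names only "download" and "redirect".
NODE_TYPES = ["page", "redirect", "download"]

def build_url_graph(load_chains):
    """Build a directed graph from URL load chains.

    load_chains: list of chains, each a list of (url, node_type) in load order.
    Returns (adjacency, features):
      adjacency maps url -> set of URLs loaded directly after it,
      features maps url -> indicator vector over NODE_TYPES.
    """
    adjacency = defaultdict(set)
    features = {}
    for chain in load_chains:
        for url, node_type in chain:
            # Merge the node type into the feature vector (assumed encoding).
            # A URL seen with several types ends up with several bits set.
            vec = features.setdefault(url, [0] * len(NODE_TYPES))
            vec[NODE_TYPES.index(node_type)] = 1
            adjacency[url]  # make sure nodes without successors still exist
        # Consecutive URLs in a chain are linked by a load-order edge.
        for (src, _), (dst, _) in zip(chain, chain[1:]):
            adjacency[src].add(dst)
    return dict(adjacency), features

chains = [
    [("a.example/landing", "page"),
     ("b.example/r", "redirect"),
     ("c.example/app.apk", "download")],
    [("d.example/home", "page"),
     ("b.example/r", "redirect")],
]
adjacency, features = build_url_graph(chains)
```

Keeping the type as a feature means a single adjacency structure is stored for all nodes, which is one plausible reading of the memory saving mentioned above.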

Textual information from webpages is also modeled using a Text‑GCN approach, and the two graphs (URL‑based and text‑based) are fused to capture both behavioral and content cues.
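For readers unfamiliar with Text-GCN: in the original Text-GCN paper, documents and words are both nodes, document-word edges are weighted by TF-IDF, and word-word edges are weighted by positive pointwise mutual information (PMI) computed over sliding windows. The sketch below covers only the word-word part. The function name `pmi_edges`, the window size, and the toy tokens are assumptions for illustration, and the article does not say how its own text graph is weighted.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window=3):
    """Word-word edges weighted by positive PMI over sliding windows.

    docs: list of token lists. Returns {(word_a, word_b): pmi} with
    word_a < word_b, keeping only pairs whose PMI is positive.
    """
    windows = []
    for tokens in docs:
        if len(tokens) <= window:
            windows.append(tokens)
        else:
            windows.extend(tokens[i:i + window]
                           for i in range(len(tokens) - window + 1))

    word_count = Counter()   # number of windows containing each word
    pair_count = Counter()   # number of windows containing each word pair
    for w in windows:
        uniq = sorted(set(w))
        word_count.update(uniq)
        pair_count.update(combinations(uniq, 2))

    n = len(windows)
    edges = {}
    for (a, b), c in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ) with window-based estimates.
        pmi = math.log(c * n / (word_count[a] * word_count[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges

docs = [
    ["free", "casino", "bonus"],
    ["casino", "bonus", "win"],
    ["weather", "news", "today"],
]
word_edges = pmi_edges(docs)
```

One way to fuse this with the URL graph would be to let each page act as both a URL node and a document node, so that behavioral and textual edges meet at the page; that linkage is an assumption here, not a detail given in the article.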

The detection pipeline evolved through three model versions: (1) GCN on cross‑site redirect graph, (2) Text‑GCN on heterogeneous text graph, and (3) a fused heterogeneous graph combined with Cluster‑Text‑GCN, which achieved the best performance with F1 scores above 0.95 and strong interpretability.
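All three versions rest on the same graph-convolution step. As a reminder, the standard GCN layer (Kipf and Welling) computes ReLU(D^-1/2 (A + I) D^-1/2 X W), where A is the adjacency matrix, I adds self-loops, D is the degree matrix of A + I, X holds node features, and W is a learned weight matrix. The pure-Python sketch below implements one such step on a tiny graph. The function name `gcn_layer`, the dict-based data layout, and the choice to treat edges as undirected are assumptions for illustration; a real system would use sparse matrix libraries.

```python
import math

def gcn_layer(adj, feats, weight):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adj:    dict node -> set of neighbour nodes (treated as undirected here)
    feats:  dict node -> list of floats (every node in adj must appear here)
    weight: list of rows, shape in_dim x out_dim
    """
    nodes = list(feats)
    # Symmetrize the edges and add a self-loop to every node.
    nbrs = {v: {v} for v in nodes}
    for u, vs in adj.items():
        for v in vs:
            nbrs[u].add(v)
            nbrs[v].add(u)
    deg = {v: len(nbrs[v]) for v in nodes}

    out_dim = len(weight[0])
    out = {}
    for v in nodes:
        in_dim = len(feats[v])
        # Normalized sum of neighbour features (including the node itself).
        agg = [0.0] * in_dim
        for u in nbrs[v]:
            norm = 1.0 / math.sqrt(deg[v] * deg[u])
            for k in range(in_dim):
                agg[k] += norm * feats[u][k]
        # Linear transform followed by ReLU.
        out[v] = [max(0.0, sum(agg[k] * weight[k][j] for k in range(in_dim)))
                  for j in range(out_dim)]
    return out

toy_adj = {"a": {"b"}, "b": set()}
toy_feats = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(toy_adj, toy_feats, identity)
```

With identity weights, each of the two connected nodes ends up with the average of both feature vectors, which shows how a label signal on one URL can spread to its neighbours.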

To enable production deployment under a 32 GB memory limit, several optimizations were applied: reducing node count, using sparse matrices and DataFrames, halving covariance matrix storage, filtering low‑frequency terms, pruning weak edges, employing Cluster‑GCN to cut graph size, and customizing NetworkX data structures.
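Three of these optimizations are simple enough to sketch: filtering low-frequency terms, pruning weak edges, and storing only half of a symmetric matrix. The helper names, thresholds, and dict-based storage below are assumptions for illustration, and reading the halved matrix storage as "keep one triangle of a symmetric matrix" is an interpretation rather than something the article states.

```python
from collections import Counter

def filter_vocab(docs, min_count=2):
    """Keep only terms that occur at least min_count times overall."""
    counts = Counter(t for doc in docs for t in doc)
    return {t for t, c in counts.items() if c >= min_count}

def prune_edges(edges, min_weight):
    """Drop edges whose weight is below min_weight."""
    return {pair: w for pair, w in edges.items() if w >= min_weight}

def to_upper_triangle(edges):
    """Store each undirected edge once, under the key (smaller, larger)."""
    upper = {}
    for (a, b), w in edges.items():
        key = (a, b) if a <= b else (b, a)
        upper[key] = w
    return upper

def lookup(upper, a, b):
    """Read a weight from half-storage regardless of argument order."""
    key = (a, b) if a <= b else (b, a)
    return upper.get(key, 0.0)

sample_docs = [["casino", "bonus"], ["casino", "win"], ["bonus", "casino"]]
vocab = filter_vocab(sample_docs, min_count=2)

sample_edges = {
    ("casino", "bonus"): 0.8, ("bonus", "casino"): 0.8,
    ("casino", "win"): 0.1, ("win", "casino"): 0.1,
}
strong = prune_edges(sample_edges, min_weight=0.5)
half = to_upper_triangle(strong)
```

Each step shrinks either the node set or the edge set before the graph is handed to training, which is where the memory pressure arises.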

These optimizations allow training and inference on graphs with millions of nodes and billions of edges within the memory constraints.

Since deployment, the system has identified roughly 3 million malicious domains, with about 100,000 new entries added each day.

Future work includes enriching the heterogeneous graph's features, scaling the graph to cover broader time windows, and enhancing the model with higher‑order information. The approach may also be extended to anomalous JavaScript detection and to broader investigations of black‑ and gray‑market activity.

References: 1) Detecting Mobile Malicious Webpages in Real Time; 2) Malicious URL Detection using Machine Learning: A Survey; 3) Survey of Malicious Webpage Detection; 4) Graph Convolutional Networks for Text Classification; 5) Cluster‑GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks.
