GEE Graph Embedding Algorithm for Business Security Anomaly Detection
The article presents the GEE (Graph Encoder Embedding) algorithm for business security anomaly detection, explains its label‑propagation foundation, evaluates it on ten‑million‑edge real data, identifies inefficiencies in the original implementation, and demonstrates that vectorized NumPy/Pandas optimizations reduce runtime from 55 seconds to about 4 seconds while preserving meaningful TSNE‑visualized embeddings.
In security monitoring and anti-fraud business scenarios, detecting anomalies in traffic data and user behavior is essential. Traditional statistical methods struggle with complex, evolving attack patterns, especially for small-scale attacks lacking significant clustering features.
This article explores graph embedding techniques for anomaly detection, introducing the GEE (Graph Encoder Embedding) algorithm, which is built on one-hot encoding. Its core principle is label propagation: each node's features are expressed as the weighted aggregation of its neighbors' label vectors.
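As a toy illustration of that principle (not the paper's code; the graph, labels, and two-class setup below are invented for the example), each node's embedding is the sum of its neighbors' one-hot label vectors, scaled by an inverse class-size weight:

```python
import numpy as np

# Hypothetical 4-node graph with two classes, for illustration only
labels = np.array([0, 0, 1, 1])
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # toy adjacency matrix

one_hot = np.eye(2)[labels]                  # one-hot encode node labels
class_w = 1.0 / (one_hot.sum(axis=0) + 1)    # per-class weight; +1 avoids /0
Z = A @ (one_hot * class_w)                  # propagate weighted labels
print(Z)                                     # each row is a node embedding
```

Each row of `Z` summarizes how strongly a node is connected to each labeled class, which is the signal the anomaly detection builds on.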
The author validates the algorithm, which is described in two papers, on real business data containing approximately 10 million edges. The graph's nodes include user IDs, IPs, device IDs, and browser IDs; node labels cover geographic regions (provinces) and operating systems, spanning 50 dimensions in total.
Performance testing revealed unexpected results: the original algorithm took ~55 seconds, while the supposedly improved sparse-matrix version took ~158 seconds (including ~90 seconds for data conversion). Analysis identified three key issues: redundancy in the weight matrix, lack of vectorization, and the memory-bound nature of the sparse-matrix computation.
The author then implemented several optimizations: a simplified weight matrix (35 s), a NumPy+Pandas vectorized batch version (6.3 s), a Pandas-only version (7.4 s), and a pure NumPy version (4.1 s), achieving nearly an order-of-magnitude speedup over the original algorithm.
TSNE visualization of the embedding results shows clear clustering, indicating that the embeddings are meaningful. The code implementations are preserved below:
The simplified weight matrix version:
def graph_encode_embedding(X, Y, n_K, show_prog=False):
    """
    Compute the node embedding matrix Z and the class weight matrix W.
    Original implementation following the reference papers.
    :param X: edge list, list of tuple, [(src, dst, weight), ...]
    :param Y: node labels, array of int, [node_label, ...]
    :param n_K: number of classes
    :return: Z, W
    """
    # Initialize the weight vector W
    W = np.zeros(n_K)
    # Count the number of nodes in each class
    for k in range(n_K):
        W[k] = (Y == k).sum()
    # Each class weight is the reciprocal of its node count;
    # add 1 to the denominator to avoid division by zero
    W = 1 / (W + 1)
    # Initialize the node embedding matrix Z
    Z = np.zeros((Y.shape[0], n_K))
    # Iterate over every edge
    for src, dst, edg_w in X:
        src = int(src)
        dst = int(dst)
        label_src = Y[src]
        label_dst = Y[dst]
        if label_dst >= 0:
            Z[src, label_dst] = Z[src, label_dst] + W[label_dst] * edg_w
        if (label_src >= 0) and (src != dst):
            Z[dst, label_src] = Z[dst, label_src] + W[label_src] * edg_w
    return Z, W

The numpy-only vectorized version with groupby functionality:
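The per-edge Python loop above is the bottleneck on 10 million edges. As a minimal self-contained sketch (toy data; this is not the author's exact batch code), the same accumulation can be vectorized with `np.add.at`, which handles repeated target indices correctly:

```python
import numpy as np

# Toy edge list: (src, dst, weight); label -1 means "unlabeled"
edges = np.array([[0, 1, 1.0],
                  [1, 2, 2.0],
                  [0, 2, 1.0]])
Y = np.array([0, 1, -1])
n_K = 2

# Per-class weight: reciprocal of class size, +1 avoids division by zero
W = 1.0 / (np.bincount(Y[Y >= 0], minlength=n_K) + 1)

src = edges[:, 0].astype(int)
dst = edges[:, 1].astype(int)
w = edges[:, 2]

Z = np.zeros((Y.shape[0], n_K))
# src accumulates dst's label weight, where dst is labeled
m = Y[dst] >= 0
np.add.at(Z, (src[m], Y[dst][m]), W[Y[dst][m]] * w[m])
# dst accumulates src's label weight (skipping self-loops)
m = (Y[src] >= 0) & (src != dst)
np.add.at(Z, (dst[m], Y[src][m]), W[Y[src][m]] * w[m])
```

`np.add.at` is an unbuffered scatter-add, so edges pointing at the same `(node, class)` cell all contribute, unlike plain fancy-index assignment.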
def group_sum(indexes, values):
    """
    Sum values by index, equivalent to: values.groupby(indexes).sum()
    :param indexes: array of int, [[index1, index2, ...], ...]
    :param values: array of float, [value1, value2, ...]
    :return: grp_indexes, grp_sums
    """
    if indexes.ndim == 1:
        reindex = indexes
    else:
        # Combine multi-column indexes into a single 1-D key
        reindex = np.zeros(indexes.shape[0], dtype=indexes.dtype)
        for axis in reversed(range(indexes.shape[-1])):
            reindex = indexes[:, axis] * (reindex.max() + 1) + reindex
    order = np.argsort(reindex)
    sorted_reindex = reindex[order]
    sorted_indexes = indexes[order]
    sorted_values = values[order]
    # Start position of each group within the sorted arrays
    _, grp_idx = np.unique(sorted_reindex, return_index=True)
    # Segment sums over each group
    grp_sums = np.add.reduceat(sorted_values, grp_idx, axis=0)
    grp_indexes = sorted_indexes[grp_idx]
    return grp_indexes, grp_sums

Baidu Geek Talk
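A quick self-contained check of the sort / `np.unique` / `np.add.reduceat` pattern that `group_sum` relies on (toy data invented for the example):

```python
import numpy as np

idx = np.array([2, 0, 2, 1, 0])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

order = np.argsort(idx)                # bring equal indices together
s_idx, s_vals = idx[order], vals[order]
# Start offset of each run of equal indices
uniq, starts = np.unique(s_idx, return_index=True)
# Segment sums between consecutive start offsets
sums = np.add.reduceat(s_vals, starts)
print(uniq, sums)                      # groups 0, 1, 2 sum to 7, 4, 4
```

This reproduces `pd.Series(vals).groupby(idx).sum()` without leaving NumPy, which is what lets the pure-NumPy version of the embedding avoid the Pandas overhead.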