GEE Graph Embedding Algorithm for Business Security Anomaly Detection
The article presents the GEE (Graph Encoder Embedding) algorithm for business security anomaly detection, explains its label‑propagation foundation, evaluates it on ten‑million‑edge real data, identifies inefficiencies in the original implementation, and demonstrates that vectorized NumPy/Pandas optimizations reduce runtime from 55 seconds to about 4 seconds while preserving meaningful TSNE‑visualized embeddings.
In security monitoring and anti-fraud business scenarios, detecting anomalies in traffic data and user behavior is essential. Traditional statistical methods struggle with complex, evolving attack patterns, especially for small-scale attacks lacking significant clustering features.
This article explores graph embedding techniques for anomaly detection, introducing the GEE (Graph Encoder Embedding) algorithm, which is built on one-hot encoding. Its core principle is label propagation: each node's features are expressed as the weighted aggregation of its neighbors' label vectors.
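As a toy illustration of that principle (not the paper's code; the graph, labels, and two-class setup below are invented for the example), each node's embedding is the sum of its neighbors' one-hot label vectors, scaled by an inverse class-size weight:

```python
import numpy as np

# Hypothetical 4-node graph with two classes, for illustration only
labels = np.array([0, 0, 1, 1])
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # toy adjacency matrix

one_hot = np.eye(2)[labels]                  # one-hot encode node labels
class_w = 1.0 / (one_hot.sum(axis=0) + 1)    # per-class weight; +1 avoids /0
Z = A @ (one_hot * class_w)                  # propagate weighted labels
print(Z)                                     # each row is a node embedding
```

Each row of `Z` summarizes how strongly a node is connected to each labeled class, which is the signal the anomaly detection builds on.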
The author validates the algorithm, which is described in two papers, on real business data containing approximately 10 million edges. The graph's nodes include user IDs, IPs, device IDs, and browser IDs; node labels cover geographic regions (provinces) and operating systems, spanning 50 dimensions in total.
Performance testing revealed unexpected results: the original algorithm took ~55 seconds, while the supposedly improved sparse-matrix version took ~158 seconds (including ~90 seconds for data conversion). Analysis identified three key issues: redundancy in the weight matrix, lack of vectorization, and the memory-bound nature of the sparse-matrix computation.
The author then implemented several optimizations: a simplified weight matrix (35 s), a NumPy+Pandas vectorized batch version (6.3 s), a Pandas-only version (7.4 s), and a pure NumPy version (4.1 s), achieving nearly an order-of-magnitude speedup over the original algorithm.
TSNE visualization of the embedding results shows clear clustering, indicating that the embeddings are meaningful. The code implementations are preserved below:
The simplified weight matrix version:
def graph_encode_embedding(X, Y, n_K, show_prog=False):
    """
    Compute the node embedding matrix Z and the class weight matrix W.
    Original implementation following the reference papers.
    :param X: edge list, list of tuple, [(src, dst, weight), ...]
    :param Y: node labels, array of int, [node_label, ...]
    :param n_K: number of classes
    :return: Z, W
    """
    # Initialize the weight vector W
    W = np.zeros(n_K)
    # Count the number of nodes in each class
    for k in range(n_K):
        W[k] = (Y == k).sum()
    # Each class weight is the reciprocal of its node count;
    # add 1 to the denominator to avoid division by zero
    W = 1 / (W + 1)
    # Initialize the node embedding matrix Z
    Z = np.zeros((Y.shape[0], n_K))
    # Iterate over every edge
    for src, dst, edg_w in X:
        src = int(src)
        dst = int(dst)
        label_src = Y[src]
        label_dst = Y[dst]
        if label_dst >= 0:
            Z[src, label_dst] = Z[src, label_dst] + W[label_dst] * edg_w
        if (label_src >= 0) and (src != dst):
            Z[dst, label_src] = Z[dst, label_src] + W[label_src] * edg_w
    return Z, W

The numpy-only vectorized version with groupby functionality:
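The per-edge Python loop above is the bottleneck on 10 million edges. As a minimal self-contained sketch (toy data; this is not the author's exact batch code), the same accumulation can be vectorized with `np.add.at`, which handles repeated target indices correctly:

```python
import numpy as np

# Toy edge list: (src, dst, weight); label -1 means "unlabeled"
edges = np.array([[0, 1, 1.0],
                  [1, 2, 2.0],
                  [0, 2, 1.0]])
Y = np.array([0, 1, -1])
n_K = 2

# Per-class weight: reciprocal of class size, +1 avoids division by zero
W = 1.0 / (np.bincount(Y[Y >= 0], minlength=n_K) + 1)

src = edges[:, 0].astype(int)
dst = edges[:, 1].astype(int)
w = edges[:, 2]

Z = np.zeros((Y.shape[0], n_K))
# src accumulates dst's label weight, where dst is labeled
m = Y[dst] >= 0
np.add.at(Z, (src[m], Y[dst][m]), W[Y[dst][m]] * w[m])
# dst accumulates src's label weight (skipping self-loops)
m = (Y[src] >= 0) & (src != dst)
np.add.at(Z, (dst[m], Y[src][m]), W[Y[src][m]] * w[m])
```

`np.add.at` is an unbuffered scatter-add, so edges pointing at the same `(node, class)` cell all contribute, unlike plain fancy-index assignment.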
def group_sum(indexes, values):
    """
    Sum values by index, equivalent to: values.groupby(indexes).sum()
    :param indexes: array of int, [[index1, index2, ...], ...]
    :param values: array of float, [value1, value2, ...]
    :return: grp_indexes, grp_sums
    """
    if indexes.ndim == 1:
        reindex = indexes
    else:
        # Combine multi-column indexes into a single 1-D key
        reindex = np.zeros(indexes.shape[0], dtype=indexes.dtype)
        for axis in reversed(range(indexes.shape[-1])):
            reindex = indexes[:, axis] * (reindex.max() + 1) + reindex
    order = np.argsort(reindex)
    sorted_reindex = reindex[order]
    sorted_indexes = indexes[order]
    sorted_values = values[order]
    # Start position of each group within the sorted arrays
    _, grp_idx = np.unique(sorted_reindex, return_index=True)
    # Segment sums over each group
    grp_sums = np.add.reduceat(sorted_values, grp_idx, axis=0)
    grp_indexes = sorted_indexes[grp_idx]
    return grp_indexes, grp_sums

Baidu Geek Talk
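A quick self-contained check of the sort / `np.unique` / `np.add.reduceat` pattern that `group_sum` relies on (toy data invented for the example):

```python
import numpy as np

idx = np.array([2, 0, 2, 1, 0])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

order = np.argsort(idx)                # bring equal indices together
s_idx, s_vals = idx[order], vals[order]
# Start offset of each run of equal indices
uniq, starts = np.unique(s_idx, return_index=True)
# Segment sums between consecutive start offsets
sums = np.add.reduceat(s_vals, starts)
print(uniq, sums)                      # groups 0, 1, 2 sum to 7, 4, 4
```

This reproduces `pd.Series(vals).groupby(idx).sum()` without leaving NumPy, which is what lets the pure-NumPy version of the embedding avoid the Pandas overhead.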