Artificial Intelligence · 8 min read

Multiscale PU Learning for Detecting AI‑Generated Text

Researchers from Peking University and Huawei present a multiscale positive‑unlabeled learning framework that significantly improves detection of AI‑generated short and long texts, addressing the difficulty of distinguishing AI‑written content from human writing and outperforming existing baselines on multiple benchmarks.


With the rapid advancement of generative large language models, AI‑generated text is becoming indistinguishable from human writing, leading to misuse and social problems; reliable detection methods are urgently needed.

Researchers from Peking University and Huawei propose a robust AI‑generated text detector that leverages positive‑unlabeled (PU) learning and a novel multiscale PU (MPU) loss to handle both short and long inputs.

Short AI‑generated sentences often overlap with human language, making binary classification ineffective; the authors treat human text as positive samples and AI‑generated text as unlabeled, redesigning the loss function to improve discrimination.

The traditional PU loss consists of three components: the loss on positive samples, the loss computed by treating all unlabeled samples as negative, and a correction term that subtracts the loss of positives treated as negative, weighted by the class prior (the probability that an unlabeled sample is actually positive). The MPU loss extends this by making the prior length‑sensitive, allowing the loss to adapt to varying text lengths.
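The three-term structure above can be sketched as a standard non-negative PU risk estimator (in the style of Kiryo et al.); this is a generic illustration, not the authors' exact implementation, and the sigmoid surrogate loss and clamping are our assumptions:

```python
import numpy as np

def sigmoid_loss(scores, label):
    # Sigmoid surrogate loss: l(s, y) = sigmoid(-y * s)
    return 1.0 / (1.0 + np.exp(label * scores))

def pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk estimator (illustrative sketch).

    pos_scores: classifier scores on positive (human-written) texts
    unl_scores: classifier scores on unlabeled (possibly AI) texts
    prior:      estimated probability that an unlabeled sample is positive
    """
    r_p_pos = sigmoid_loss(pos_scores, +1).mean()  # loss on positives
    r_p_neg = sigmoid_loss(pos_scores, -1).mean()  # positives treated as negative
    r_u_neg = sigmoid_loss(unl_scores, -1).mean()  # unlabeled treated as negative
    # Clamp the corrected negative risk at zero so it cannot go negative
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)
```

The MPU loss keeps this skeleton but replaces the fixed `prior` with a value that depends on the length of each input text.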

An abstract recurrent model is introduced to estimate the prior probability: each token’s contribution to an aggregate confidence is inversely proportional to sentence length, and the model outputs a confidence that a sentence is human‑written. When these token contributions are aggregated, the resulting prior probability increases with text length, reflecting the reduced uncertainty of longer texts.
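A toy version of this idea (our own simplification, not the paper's exact recurrence): suppose each token independently votes "human" with some probability `p_token` and contributes ±1/length to an aggregate confidence, and take the prior to be the probability that the aggregate ends up positive. With `p_token` above 0.5, that probability rises toward 1 as the text gets longer:

```python
from math import comb

def length_sensitive_prior(length, p_token=0.6):
    """Toy length-sensitive prior (illustrative assumption, not the
    paper's formula). Each of `length` tokens votes 'human' with
    probability p_token; the prior is the probability that strictly
    more than half the votes are 'human'. Longer texts accumulate
    more evidence, so the prior grows with length."""
    return sum(
        comb(length, k) * p_token**k * (1 - p_token)**(length - k)
        for k in range(length // 2 + 1, length + 1)
    )
```

For example, the prior for a 20-token text comes out higher than for a 5-token one, matching the intuition that short texts are genuinely ambiguous.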

To enrich the training data, a multiscale module randomly masks out whole sentences and concatenates the remaining ones in their original order, creating training texts with diverse length distributions that better suit PU learning.

Experiments on the short‑text Tweep‑Fake dataset and on ChatGPT‑generated corpora show that replacing the standard binary loss with MPU loss yields higher F1 scores, surpassing baselines such as OpenAI’s detector and DetectGPT, with up to a 1% improvement on full‑length texts.

Ablation studies confirm that each component—length‑sensitive prior, MPU loss, and the multiscale data augmentation—contributes positively, with MPU consistently outperforming traditional PU.

In summary, the proposed multiscale PU learning framework effectively addresses the challenge of detecting short AI‑generated texts and represents a solid step toward controlling the misuse of AIGC content.

Tags: large language models · text classification · AI detection · multiscale · PU learning · short text detection
Sohu Tech Products
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
