
Utilizing Negative Samples for Knowledge Distillation of Large Language Models

This paper presents a novel framework that leverages negative samples during large language model distillation through three stages—Negative Assistive Training, Negative Calibration Enhancement, and Adaptive Self‑Consistency—demonstrating significant accuracy gains on challenging mathematical reasoning benchmarks and improved generalization to out‑of‑distribution tasks.

DataFunTalk

Large language models (LLMs) achieve strong reasoning performance, but their black-box nature and massive parameter counts hinder practical deployment, especially on complex mathematical problems, where erroneous reasoning chains frequently appear.

At AAAI 2024, the Xiaohongshu search algorithm team introduced an innovative framework that fully exploits negative samples—data that lead to incorrect answers—during the distillation of LLMs into smaller, specialized models. The framework consists of three sequential steps: Negative Assistive Training (NAT), Negative Calibration Enhancement (NCE), and Adaptive Self‑Consistency (ASC).

Negative Assistive Training (NAT) employs a dual‑LoRA architecture: a negative knowledge absorption phase distills knowledge from negative data into a dedicated LoRA module, and a dynamic integration unit then combines it with the positive LoRA module, so that useful information contained in failures is not forgotten.
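The dual‑LoRA combination can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the `LoRA` class, the `nat_forward` helper, and the per‑example sigmoid gate standing in for the dynamic integration unit are all assumptions made for clarity.

```python
import numpy as np

class LoRA:
    """Low-rank adapter: the weight update is (alpha/r) * B @ A."""
    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.02, size=(r, d_in))  # down-projection
        self.B = np.zeros((d_out, r))                    # up-projection, zero-init
        self.scale = alpha / r

    def delta(self, x):
        """Adapter contribution for a batch of inputs x (shape [n, d_in])."""
        return self.scale * (x @ self.A.T @ self.B.T)

def nat_forward(x, W, pos_lora, neg_lora, gate_w):
    """Frozen base output plus gated positive/negative LoRA deltas.
    gate_w parameterizes a simple per-example sigmoid gate (a stand-in
    for the paper's dynamic integration unit)."""
    base = x @ W.T
    g = 1.0 / (1.0 + np.exp(-(x @ gate_w)))   # gate in (0, 1), one per example
    g = g[:, None]
    return base + g * pos_lora.delta(x) + (1.0 - g) * neg_lora.delta(x)
```

Because `B` is zero-initialized (standard LoRA practice), both adapters start as no-ops and the combined model initially matches the frozen base.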

Negative Calibration Enhancement (NCE) treats the outputs on negative samples as a baseline to calibrate the self‑enhancement process. By measuring the inconsistency between positive and negative reasoning chains with KL divergence and weighting samples via a β‑scaled loss, NCE selectively reinforces critical reasoning steps.
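The calibration idea—using positive/negative disagreement to weight training samples—might look like the simplified form below. The `1 + beta * KL` weighting shape and the function names are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def nce_weighted_loss(pos_probs, neg_probs, ce_losses, beta=0.5):
    """Scale each sample's cross-entropy by 1 + beta * KL(pos || neg):
    samples where the positive and negative models disagree most are
    treated as carrying the most critical reasoning knowledge."""
    weights = np.array([1.0 + beta * kl_div(p, q)
                        for p, q in zip(pos_probs, neg_probs)])
    return float(np.mean(weights * np.asarray(ce_losses, dtype=float)))
```

When the two distributions agree, the KL term vanishes and the loss reduces to plain cross-entropy; disagreement inflates the weight, selectively reinforcing those steps.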

Adaptive Self‑Consistency (ASC) improves the voting stage by training a ranking model on both positive and negative data, allowing adaptive re‑weighting of candidate reasoning paths based on their quality rather than assigning equal or probability‑based weights.
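The re-weighted voting step can be sketched as below; `rank_score` stands in for the trained ranking model (hypothetical here), and the score-summing aggregation is an assumed simplification of the paper's method.

```python
from collections import defaultdict

def adaptive_self_consistency(candidates, rank_score):
    """candidates: list of (reasoning_path, answer) pairs sampled from the
    student model. rank_score: scoring function from a trained ranking
    model (hypothetical). Returns the answer with the highest total score,
    rather than the answer with the most equal-weight votes."""
    totals = defaultdict(float)
    for path, answer in candidates:
        totals[answer] += rank_score(path)
    return max(totals, key=totals.get)
```

With a constant scoring function this degenerates to ordinary majority-vote self-consistency; a ranker trained on both positive and negative data lets one high-quality reasoning path outvote several low-quality ones.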

The authors evaluated the framework on the challenging MATH benchmark (12,500 problems) and four out‑of‑distribution datasets (GSM8K, ASDiv, MultiArith, SVAMP). The teacher models were GPT‑3.5‑turbo and GPT‑4, and the student model was LLaMA‑7B. NAT consistently raised accuracy across all baselines, NCE added roughly a 10% improvement over standard knowledge distillation, and ASC outperformed both traditional self‑consistency and weighted self‑consistency strategies.

Overall, the study demonstrates that negative samples contain valuable knowledge that, when properly harnessed, can significantly enhance the reasoning capabilities of compact models through a comprehensive distillation pipeline.

machine learning · Chain-of-Thought · model specialization · negative samples · knowledge transfer · LLM distillation
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
