
Uncovering the ‘Sandwich’ Bottleneck in Residual Quantized Semantic IDs for Generative Search

This study investigates the “sandwich” bottleneck observed in residual‑quantized semantic identifiers (RQ‑SID) used in generative search and recommendation systems. Token concentration in the intermediate codebook, driven by path sparsity and long‑tail distributions, degrades performance; two effective mitigation strategies are proposed that improve efficiency and generalization in e‑commerce applications.

JD Cloud Developers

Abstract

Generative search and recommendation constitute an emerging paradigm that uses numeric identifiers to improve efficiency and generalization. In e‑commerce, methods such as TIGER employ Residual‑Quantized Semantic Identifiers (RQ‑SID) but suffer from a “sandwich” bottleneck in which tokens in the intermediate codebook become overly concentrated. Experiments show that path sparsity and long‑tail distribution are the main causes and that they significantly harm performance. Two optimization strategies are proposed, demonstrating notable gains in a real‑world e‑commerce search‑and‑recommendation system.

Background

Numeric identifier representations are widely adopted in industry for their simplicity, speed, and strong generalization, especially in long‑sequence recommendation. Prominent methods include DSI, NCI, TIGER, GDR, and GenRet. TIGER generates semantic identifiers (SID) via Residual Quantization (RQ), capturing hierarchical semantic information and excelling in product‑centric e‑commerce scenarios.

Task Definition

Given a user profile (age, gender, membership status) and historical interaction logs, along with a current search query, the task is to predict the most likely purchased product using SID‑based models.

RQ‑VAE SID Generation

The TIGER approach leverages RQ to produce SIDs that effectively reflect complex hierarchical relationships in product data, thereby boosting recommendation performance.
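As a rough illustration of the mechanism, residual quantization assigns each embedding a sequence of codebook indices: each layer encodes the residual left over by the previous one. The minimal NumPy sketch below uses random codebooks purely for illustration; in the paper the codebooks are learned with an RQ‑VAE.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantize embedding x layer by layer: each codebook encodes the
    residual left over by the previous layer; the chosen indices form
    the semantic ID (SID)."""
    sid = []
    residual = np.asarray(x, dtype=float).copy()
    for cb in codebooks:                       # cb: (K, d) code vectors
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        sid.append(idx)
        residual = residual - cb[idx]          # pass the residual down
    return tuple(sid), residual                # residual = reconstruction error

# Toy usage: 3 layers, codebook size 4, embedding dim 2.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(4, 2)) for _ in range(3)]
x = rng.normal(size=2)
sid, err = residual_quantize(x, codebooks)
```

By construction the chosen code vectors plus the final residual reconstruct the original embedding, which is why the index sequence can serve as a hierarchical identifier.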

Sandwich Phenomenon

Analysis of RQ‑generated SIDs reveals a “sandwich” effect: the middle layer’s codebook tokens are highly concentrated, producing many‑to‑one and one‑to‑many mappings. This creates path sparsity (only a small fraction of the possible token paths are actually used) and a long‑tail distribution in which most items collapse onto a few head tokens, severely limiting representational capacity.

Figure showing token concentration in middle layer

4.1 Visualization

Using billions of search logs, a dual‑tower model (e.g., DSSM, BERT) generates product embeddings, which are then quantized by RQ to obtain SIDs. Aggregated token distributions across three layers consistently exhibit a dense middle layer, confirming the sandwich effect across various parameter settings.

Figure of token distribution statistics

4.2 Phenomenon Analysis

Two synthetic embedding distributions (uniform and non‑uniform) are quantized by RQ. The first layer distributes tokens evenly, but the second layer’s residuals become non‑uniform, causing a few tokens to dominate (long‑tail). The third layer re‑uniformizes. This progressive behavior creates the sandwich structure.
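One way to see the sandwich structure numerically is to compute, for a batch of SIDs, the normalized entropy of token usage at each layer (low entropy in the middle layer indicates concentration) together with the fraction of possible paths actually used (path sparsity). The diagnostic below is a sketch with hypothetical names, not code from the paper:

```python
import math
from collections import Counter

def sid_diagnostics(sids, codebook_size):
    """Per-layer normalized token-usage entropy (1.0 = perfectly uniform,
    0.0 = fully collapsed) plus the path-usage ratio
    (distinct SIDs / all possible paths)."""
    n_layers = len(sids[0])
    entropies = []
    for layer in range(n_layers):
        counts = Counter(sid[layer] for sid in sids)
        total = sum(counts.values())
        h = -sum((c / total) * math.log(c / total) for c in counts.values())
        entropies.append(h / math.log(codebook_size))  # normalize to [0, 1]
    path_ratio = len(set(sids)) / codebook_size ** n_layers
    return entropies, path_ratio

# Toy "sandwich": outer layers use all 4 tokens, middle layer collapses to 0.
sids = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (3, 0, 3)]
entropies, path_ratio = sid_diagnostics(sids, codebook_size=4)
```

On this toy batch the outer layers reach entropy 1.0 while the middle layer drops to 0.0, mirroring the dense‑middle pattern reported in the visualizations above.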

4.3 Practical Impact

Experiments split test sets into head‑token and tail‑token groups. Models perform markedly better on head‑token sets and worse on tail‑token sets, a pattern observed across LLaMA2, Baichuan2, and Qwen1.5 with various RQ configurations. Additional experiments swapping first‑ and second‑layer tokens demonstrate that the sandwich effect directly degrades model accuracy, while providing the first token as input mitigates the issue.
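The head/tail split described above can be sketched as follows: rank middle‑layer tokens by frequency, then assign each test item to the head or tail group depending on its token. The function name, layer index, and head fraction are illustrative assumptions, not the paper’s exact protocol:

```python
from collections import Counter

def split_head_tail(sids, layer=1, head_frac=0.2):
    """Split items by how frequent their token at `layer` is:
    the top `head_frac` of tokens by count form the head group."""
    counts = Counter(sid[layer] for sid in sids)
    ranked = [tok for tok, _ in counts.most_common()]   # most frequent first
    n_head = max(1, int(len(ranked) * head_frac))
    head_tokens = set(ranked[:n_head])
    head = [s for s in sids if s[layer] in head_tokens]
    tail = [s for s in sids if s[layer] not in head_tokens]
    return head, tail

# Toy data: six items share the frequent middle token 5, one item is rare.
sids = [(0, 5, 0)] * 3 + [(1, 5, 1)] * 3 + [(2, 7, 2)]
head, tail = split_head_tail(sids)
```

Evaluating accuracy separately on the two groups is what exposes the head‑vs‑tail performance gap reported in the paper.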

Solution

Two simple distribution‑based remedies are proposed:

Heuristic removal of the entire second layer, which eliminates the long tail but risks losing representational capacity.

Adaptive top‑K token pruning in the second layer (a top@K strategy with threshold p), yielding a variable‑length SID that preserves the overall distribution while reducing the bottleneck’s impact.
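One plausible reading of the adaptive top@K rule is: drop the second‑layer token for items whose token falls in the head set covering cumulative frequency mass p, since those over‑used head tokens carry little information, which yields variable‑length SIDs. The sketch below encodes that reading; the paper’s exact pruning criterion may differ.

```python
from collections import Counter

def prune_second_layer(sids, p=0.8):
    """Drop the layer-2 token for items whose token lies in the 'head'
    set covering cumulative frequency mass p; other items keep all
    three tokens, so SIDs become variable-length."""
    counts = Counter(sid[1] for sid in sids)
    total = sum(counts.values())
    head, mass = set(), 0.0
    for tok, c in counts.most_common():      # most frequent first
        if mass >= p:
            break
        head.add(tok)
        mass += c / total
    return [(s[0], s[2]) if s[1] in head else s for s in sids]

# Toy data: token 5 covers 80% of items, token 6 the remaining 20%.
sids = [(0, 5, 0)] * 8 + [(1, 6, 1)] * 2
pruned = prune_second_layer(sids, p=0.8)
```

Items carrying the dominant middle token shrink to two tokens, while rare (more informative) middle tokens are retained.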

Evaluations on LLaMA models show that adaptive pruning improves performance with negligible extra computation and outperforms the baseline even when combined with focal or margin losses.

Figure of performance gains after token pruning

Conclusion

The paper systematically explores the limitations of RQ‑SID in generative search/recommendation, identifying the sandwich bottleneck caused by token concentration and long‑tail distribution. Extensive experiments validate the phenomenon and its root cause in residual quantization. Two mitigation strategies—second‑layer removal and adaptive token pruning—are shown to effectively alleviate the bottleneck, with the latter achieving the best results, thereby providing a solid foundation for future model optimizations.

Future Work

Enhance SID generation by incorporating temporal and statistical features for ranking‑critical tasks.

Unify sparse (SID) and dense representations to enable LLMs to model dense feature trends.

Achieve lossless end‑to‑end search pipelines.

Tags: residual quantization, e-commerce recommendation, semantic identifiers, generative search, long-tail distribution
Written by

JD Cloud Developers

JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
