Overcoming the Hourglass Effect in Residual Quantization for Generative Retrieval
This paper investigates the “hourglass” phenomenon in residual‑quantized semantic identifiers for generative search and recommendation, revealing that token concentration in intermediate codebooks causes path sparsity and long‑tail distributions, and proposes heuristic layer removal and adaptive token‑pruning strategies that markedly improve model performance.
0 Abstract
Generative search/recommendation has become an innovative paradigm that uses numeric identifiers to improve efficiency and generalization, especially in e‑commerce where methods like TIGER employ residual‑quantized semantic identifiers (RQ‑SID). However, RQ‑SID suffers from an “hourglass” phenomenon: intermediate codebook tokens become overly concentrated, limiting the full potential of generative methods. Through extensive experiments and ablations we identify path sparsity and long‑tail distribution as the main causes, demonstrate their impact on codebook utilization and data distribution, and propose effective solutions that improve performance in real‑world e‑commerce tasks.
1 Background
Numeric identifier representations are widely adopted in industry for their simplicity, efficiency, and strong generalization, particularly for long‑behavior sequence recommendation. Notable methods include DSI, NCI, TIGER, GDR, and GenRet. TIGER generates semantic identifiers (SID) via residual quantization (RQ), capturing semantic and hierarchical information, which is especially advantageous in product‑centric e‑commerce scenarios.
The performance ceiling of RQ‑based methods heavily depends on SID generation, which is the core focus of this work.
2 Task Definition
Given a user profile (e.g., age, gender, membership status) and historical interaction sequence, along with a current search query, the task is to predict the most likely next purchased product using SID‑based models.
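To make the task concrete, the sketch below serializes the three inputs (profile, SID-coded history, query) into a single generation prompt. The field names, separators, and SID tuple values are illustrative assumptions, not taken from the paper.

```python
def build_input(profile, history_sids, query):
    """Serialize user profile, SID-coded interaction history, and the
    current query into one prompt for a generative model.
    Field names and separators are illustrative, not the paper's format."""
    hist = " ".join("-".join(str(t) for t in sid) for sid in history_sids)
    return f"user: {profile} | history: {hist} | query: {query} -> next:"

prompt = build_input("age=30 gender=F member=plus",
                     [(12, 7, 201), (12, 44, 9)],
                     "wireless earbuds")
print(prompt)
```

The model is then trained to emit the next product's SID tokens after the `next:` marker.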
3 RQ‑VAE SID Generation
SID generation via residual quantization (RQ) works layer by layer: the first codebook quantizes the item embedding, and each subsequent codebook quantizes the residual left by the previous layer, so an item's SID is a tuple of tokens ordered coarse to fine. This semantic, hierarchical structure greatly enhances recommendation performance in e‑commerce.
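A minimal sketch of the layer-by-layer assignment step, assuming fixed codebooks (a real RQ-VAE learns the codebooks jointly with an encoder; the sizes here are arbitrary):

```python
import numpy as np

def residual_quantize(embeddings, codebooks):
    """Assign each embedding a SID: one token per codebook layer.
    Each layer picks the nearest codeword and subtracts it, so later
    layers quantize the remaining residual."""
    residual = embeddings.astype(np.float64).copy()
    tokens = []
    for codebook in codebooks:  # codebook shape: (num_codes, dim)
        dists = np.linalg.norm(residual[:, None, :] - codebook[None, :, :], axis=-1)
        idx = dists.argmin(axis=1)          # nearest codeword per item
        tokens.append(idx)
        residual = residual - codebook[idx]  # pass the residual downward
    return np.stack(tokens, axis=1)          # (num_items, num_layers)

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 16))                       # stand-in embeddings
codebooks = [rng.normal(size=(256, 16)) for _ in range(3)]  # 3 layers of 256 codes
sids = residual_quantize(items, codebooks)
print(sids.shape)  # (1000, 3): a 3-token SID per item
```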
4 Hourglass Phenomenon
In RQ‑generated SIDs, intermediate‑codebook tokens become overly concentrated, creating many‑to‑one and one‑to‑many mappings between layers. This leads to path sparsity (only a small fraction of possible token paths are ever used) and a long‑tail distribution in which most items map to a few head tokens, severely limiting representational capacity.
4.1 Visualization
Using billions of query‑product logs, we trained dual‑tower models (e.g., DSSM, BERT) to obtain product embeddings, then applied RQ to generate semantic IDs for all items.
Visualization across multiple parameter settings shows a pronounced hourglass shape, with the second layer’s tokens highly concentrated.
Statistical metrics (entropy, Gini coefficient, standard deviation) confirm low entropy, high Gini, and large variance for the second‑layer token distribution, indicating strong imbalance.
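The three metrics are straightforward to compute from per-token counts. The sketch below contrasts a balanced layer with a simulated "hourglass" middle layer; the 80/20 head concentration is an illustrative assumption, not the paper's measured numbers.

```python
import numpy as np

def token_stats(tokens, num_codes):
    """Entropy, Gini coefficient, and std-dev of one layer's token counts."""
    counts = np.bincount(tokens, minlength=num_codes).astype(np.float64)
    p = counts / counts.sum()
    nz = p[p > 0]
    entropy = -(nz * np.log(nz)).sum()
    s = np.sort(counts)                      # Gini via the sorted-cumulative formula
    n = len(counts)
    gini = 2 * (np.arange(1, n + 1) * s).sum() / (n * counts.sum()) - (n + 1) / n
    return entropy, gini, counts.std()

rng = np.random.default_rng(1)
uniform_layer = rng.integers(0, 256, size=100_000)
# Simulated hourglass layer: 4 head tokens absorb 80% of items.
head_p = np.r_[[0.2] * 4, [0.2 / 252] * 252]
skewed_layer = rng.choice(256, size=100_000, p=head_p)

e_u, g_u, s_u = token_stats(uniform_layer, 256)
e_s, g_s, s_s = token_stats(skewed_layer, 256)
print(f"uniform: H={e_u:.2f} G={g_u:.2f}  skewed: H={e_s:.2f} G={g_s:.2f}")
```

The skewed layer shows exactly the signature the paper reports for the second codebook: lower entropy, higher Gini, larger standard deviation.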
Overall, the hourglass effect manifests as path sparsity (low codebook utilization) and long‑tail token concentration in the middle layer.
4.2 Phenomenon Analysis
We analyze RQ’s mechanics by comparing uniform vs. non‑uniform input embeddings. After the first quantization layer, residuals become non‑uniform, causing the second layer to focus on outliers and produce a long‑tail token distribution. Subsequent layers gradually return to uniformity, forming the hourglass shape.
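This mechanism can be reproduced with a toy simulation: quantizing roughly uniform inputs leaves residuals clustered around the origin, so at the second layer the few codewords nearest that cluster absorb most items. Dimensions, codebook sizes, and the entropy comparison below are our own illustrative choices, not the paper's setup.

```python
import numpy as np

def nearest(x, codebook):
    # Index of the nearest codeword for each row of x.
    d = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def entropy(tokens, num_codes):
    p = np.bincount(tokens, minlength=num_codes) / len(tokens)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
n_codes = 64
x = rng.uniform(-1, 1, size=(20_000, 2))          # roughly uniform inputs
cb1 = rng.uniform(-1, 1, size=(n_codes, 2))
cb2 = rng.uniform(-1, 1, size=(n_codes, 2))

t1 = nearest(x, cb1)
residual = x - cb1[t1]         # residuals cluster tightly around the origin
t2 = nearest(residual, cb2)    # so a few codewords near 0 absorb most items

print(entropy(t1, n_codes), entropy(t2, n_codes))
```

The second layer's token entropy comes out markedly lower than the first layer's, matching the concentrated waist of the hourglass.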
4.3 Practical Impact
Experiments split test sets into head‑token and tail‑token groups. Models perform significantly better on head‑token subsets and worse on tail‑token subsets, a pattern observed across LLaMA2, Baichuan2, Qwen1.5 and various RQ configurations.
Additional experiments swapping first and second layer tokens, or providing the first token as input, demonstrate that the hourglass effect directly degrades model performance, while mitigating it restores accuracy.
5 Solutions
We propose two simple distribution‑based remedies: (1) heuristically remove the second layer entirely, eliminating the long‑tail effect (at the risk of reduced capacity), and (2) adaptively prune top‑K tokens from the second layer using a threshold p, yielding a variable‑length SID while preserving overall distribution.
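One plausible reading of the adaptive strategy is sketched below: treat the most frequent second-layer tokens whose combined coverage reaches the threshold p as the head set, and drop that token from the SID of any item falling in it, yielding variable-length SIDs. The exact pruning rule is an assumption of ours; this summary does not spell out the paper's definition.

```python
import numpy as np

def prune_head_tokens(sids, layer=1, p=0.5):
    """Drop the token at `layer` for items whose token belongs to the
    head set: the most frequent tokens jointly covering >= p of items.
    Returns variable-length SIDs (our reading, not the paper's spec)."""
    tokens = sids[:, layer]
    counts = np.bincount(tokens)
    order = np.argsort(counts)[::-1]                 # tokens by frequency
    coverage = np.cumsum(counts[order]) / len(tokens)
    k = int(np.searchsorted(coverage, p)) + 1        # smallest head covering >= p
    head = set(order[:k].tolist())
    pruned = []
    for sid in sids:
        sid = sid.tolist()
        if sid[layer] in head:
            del sid[layer]                           # shorter SID for head items
        pruned.append(sid)
    return pruned, head

sids = np.array([[1, 0, 5], [2, 0, 6], [3, 7, 8]])
pruned, head = prune_head_tokens(sids, layer=1, p=0.5)
print(pruned, head)  # [[1, 5], [2, 6], [3, 7, 8]] {0}
```

Because only over-concentrated head tokens are removed, tail items keep their full-length SIDs and the overall token distribution is preserved.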
Experiments on LLaMA models show that adaptive token removal improves performance at comparable computational cost, and top‑400 pruning consistently outperforms baselines. Gains plateau as more tokens are removed, and removing the layer entirely harms recall due to the loss of informative tokens.
6 Conclusion
This study systematically examines the limitations of RQ‑SID in generative search/recommendation, identifying the hourglass phenomenon caused by token concentration in intermediate layers. Through extensive ablations we confirm its root in residual quantization and demonstrate two effective mitigation strategies—layer removal and adaptive token pruning—both of which substantially boost model performance.
7 Future Work
1. Optimize SID production and representation by incorporating temporal and statistical features for finer‑grained ranking.
2. Unify sparse (SID) and dense representations so that LLMs can model dense feature trends directly.
3. Achieve lossless end‑to‑end search pipelines.
JD Cloud Developers
JD Cloud Developers is JD Technology Group's developer platform for technical sharing and communication among AI, cloud computing, IoT, and related developers. It publishes JD product and technology information, industry content, and tech‑event news, embracing technology and partnering with developers to envision the future.