Overcoming the Hourglass Effect in Residual Quantization for Generative Retrieval
This paper investigates the “hourglass” phenomenon in residual‑quantized semantic identifiers for generative search and recommendation, revealing that token concentration in intermediate codebooks causes path sparsity and long‑tail distributions, and proposes heuristic layer removal and adaptive token‑pruning strategies that markedly improve model performance.
0 Abstract
Generative search/recommendation has become an innovative paradigm that uses numeric identifiers to improve efficiency and generalization, especially in e‑commerce where methods like TIGER employ residual‑quantized semantic identifiers (RQ‑SID). However, RQ‑SID suffers from an “hourglass” phenomenon: intermediate codebook tokens become overly concentrated, limiting the full potential of generative methods. Through extensive experiments and ablations we identify path sparsity and long‑tail distribution as the main causes, demonstrate their impact on codebook utilization and data distribution, and propose effective solutions that improve performance in real‑world e‑commerce tasks.
1 Background
Numeric identifier representations are widely adopted in industry for their simplicity, efficiency, and strong generalization, particularly for long‑behavior sequence recommendation. Notable methods include DSI, NCI, TIGER, GDR, and GenRet. TIGER generates semantic identifiers (SID) via residual quantization (RQ), capturing semantic and hierarchical information, which is especially advantageous in product‑centric e‑commerce scenarios.
The performance ceiling of RQ‑based methods heavily depends on SID generation, which is the core focus of this work.
2 Task Definition
Given a user profile (e.g., age, gender, membership status) and historical interaction sequence, along with a current search query, the task is to predict the most likely next purchased product using SID‑based models.
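To make the task concrete, the sketch below serializes the three inputs (profile, SID-coded history, query) into a single generation prompt. The field names, separators, and SID tuple values are illustrative assumptions, not taken from the paper.

```python
def build_input(profile, history_sids, query):
    """Serialize user profile, SID-coded interaction history, and the
    current query into one prompt for a generative model.
    Field names and separators are illustrative, not the paper's format."""
    hist = " ".join("-".join(str(t) for t in sid) for sid in history_sids)
    return f"user: {profile} | history: {hist} | query: {query} -> next:"

prompt = build_input("age=30 gender=F member=plus",
                     [(12, 7, 201), (12, 44, 9)],
                     "wireless earbuds")
print(prompt)
```

The model is then trained to emit the next product's SID tokens after the `next:` marker.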
3 RQ‑VAE SID Generation
SID generation via residual quantization (RQ) works layer by layer: the first codebook quantizes the item embedding, and each subsequent codebook quantizes the residual left by the previous layer, so an item's SID is a tuple of tokens ordered coarse to fine. This semantic, hierarchical structure greatly enhances recommendation performance in e‑commerce.
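A minimal sketch of the layer-by-layer assignment step, assuming fixed codebooks (a real RQ-VAE learns the codebooks jointly with an encoder; the sizes here are arbitrary):

```python
import numpy as np

def residual_quantize(embeddings, codebooks):
    """Assign each embedding a SID: one token per codebook layer.
    Each layer picks the nearest codeword and subtracts it, so later
    layers quantize the remaining residual."""
    residual = embeddings.astype(np.float64).copy()
    tokens = []
    for codebook in codebooks:  # codebook shape: (num_codes, dim)
        dists = np.linalg.norm(residual[:, None, :] - codebook[None, :, :], axis=-1)
        idx = dists.argmin(axis=1)          # nearest codeword per item
        tokens.append(idx)
        residual = residual - codebook[idx]  # pass the residual downward
    return np.stack(tokens, axis=1)          # (num_items, num_layers)

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 16))                       # stand-in embeddings
codebooks = [rng.normal(size=(256, 16)) for _ in range(3)]  # 3 layers of 256 codes
sids = residual_quantize(items, codebooks)
print(sids.shape)  # (1000, 3): a 3-token SID per item
```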
4 Hourglass Phenomenon
In RQ‑generated SIDs, intermediate‑codebook tokens become overly concentrated, creating many‑to‑one and one‑to‑many mappings between layers. This leads to path sparsity (only a small fraction of possible token paths are ever used) and a long‑tail distribution in which most items map to a few head tokens, severely limiting representational capacity.
4.1 Visualization
Using billions of query‑product logs, we trained dual‑tower models (e.g., DSSM, BERT) to obtain product embeddings, then applied RQ to generate semantic IDs for all items.
Visualization across multiple parameter settings shows a pronounced hourglass shape, with the second layer’s tokens highly concentrated.
Statistical metrics (entropy, Gini coefficient, standard deviation) confirm low entropy, high Gini, and large variance for the second‑layer token distribution, indicating strong imbalance.
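The three metrics are straightforward to compute from per-token counts. The sketch below contrasts a balanced layer with a simulated "hourglass" middle layer; the 80/20 head concentration is an illustrative assumption, not the paper's measured numbers.

```python
import numpy as np

def token_stats(tokens, num_codes):
    """Entropy, Gini coefficient, and std-dev of one layer's token counts."""
    counts = np.bincount(tokens, minlength=num_codes).astype(np.float64)
    p = counts / counts.sum()
    nz = p[p > 0]
    entropy = -(nz * np.log(nz)).sum()
    s = np.sort(counts)                      # Gini via the sorted-cumulative formula
    n = len(counts)
    gini = 2 * (np.arange(1, n + 1) * s).sum() / (n * counts.sum()) - (n + 1) / n
    return entropy, gini, counts.std()

rng = np.random.default_rng(1)
uniform_layer = rng.integers(0, 256, size=100_000)
# Simulated hourglass layer: 4 head tokens absorb 80% of items.
head_p = np.r_[[0.2] * 4, [0.2 / 252] * 252]
skewed_layer = rng.choice(256, size=100_000, p=head_p)

e_u, g_u, s_u = token_stats(uniform_layer, 256)
e_s, g_s, s_s = token_stats(skewed_layer, 256)
print(f"uniform: H={e_u:.2f} G={g_u:.2f}  skewed: H={e_s:.2f} G={g_s:.2f}")
```

The skewed layer shows exactly the signature the paper reports for the second codebook: lower entropy, higher Gini, larger standard deviation.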
Overall, the hourglass effect manifests as path sparsity (low codebook utilization) and long‑tail token concentration in the middle layer.
4.2 Phenomenon Analysis
We analyze RQ’s mechanics by comparing uniform vs. non‑uniform input embeddings. After the first quantization layer, residuals become non‑uniform, causing the second layer to focus on outliers and produce a long‑tail token distribution. Subsequent layers gradually return to uniformity, forming the hourglass shape.
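This mechanism can be reproduced with a toy simulation: quantizing roughly uniform inputs leaves residuals clustered around the origin, so at the second layer the few codewords nearest that cluster absorb most items. Dimensions, codebook sizes, and the entropy comparison below are our own illustrative choices, not the paper's setup.

```python
import numpy as np

def nearest(x, codebook):
    # Index of the nearest codeword for each row of x.
    d = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def entropy(tokens, num_codes):
    p = np.bincount(tokens, minlength=num_codes) / len(tokens)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
n_codes = 64
x = rng.uniform(-1, 1, size=(20_000, 2))          # roughly uniform inputs
cb1 = rng.uniform(-1, 1, size=(n_codes, 2))
cb2 = rng.uniform(-1, 1, size=(n_codes, 2))

t1 = nearest(x, cb1)
residual = x - cb1[t1]         # residuals cluster tightly around the origin
t2 = nearest(residual, cb2)    # so a few codewords near 0 absorb most items

print(entropy(t1, n_codes), entropy(t2, n_codes))
```

The second layer's token entropy comes out markedly lower than the first layer's, matching the concentrated waist of the hourglass.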
4.3 Practical Impact
Experiments split test sets into head‑token and tail‑token groups. Models perform significantly better on head‑token subsets and worse on tail‑token subsets, a pattern observed across LLaMA2, Baichuan2, Qwen1.5 and various RQ configurations.
Additional experiments swapping first and second layer tokens, or providing the first token as input, demonstrate that the hourglass effect directly degrades model performance, while mitigating it restores accuracy.
5 Solutions
We propose two simple distribution‑based remedies: (1) heuristically remove the second layer entirely, eliminating the long‑tail effect (at the risk of reduced capacity), and (2) adaptively prune top‑K tokens from the second layer using a threshold p, yielding a variable‑length SID while preserving overall distribution.
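One plausible reading of the adaptive strategy is sketched below: treat the most frequent second-layer tokens whose combined coverage reaches the threshold p as the head set, and drop that token from the SID of any item falling in it, yielding variable-length SIDs. The exact pruning rule is an assumption of ours; this summary does not spell out the paper's definition.

```python
import numpy as np

def prune_head_tokens(sids, layer=1, p=0.5):
    """Drop the token at `layer` for items whose token belongs to the
    head set: the most frequent tokens jointly covering >= p of items.
    Returns variable-length SIDs (our reading, not the paper's spec)."""
    tokens = sids[:, layer]
    counts = np.bincount(tokens)
    order = np.argsort(counts)[::-1]                 # tokens by frequency
    coverage = np.cumsum(counts[order]) / len(tokens)
    k = int(np.searchsorted(coverage, p)) + 1        # smallest head covering >= p
    head = set(order[:k].tolist())
    pruned = []
    for sid in sids:
        sid = sid.tolist()
        if sid[layer] in head:
            del sid[layer]                           # shorter SID for head items
        pruned.append(sid)
    return pruned, head

sids = np.array([[1, 0, 5], [2, 0, 6], [3, 7, 8]])
pruned, head = prune_head_tokens(sids, layer=1, p=0.5)
print(pruned, head)  # [[1, 5], [2, 6], [3, 7, 8]] {0}
```

Because only over-concentrated head tokens are removed, tail items keep their full-length SIDs and the overall token distribution is preserved.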
Experiments on LLaMA models show that adaptive token removal improves performance at comparable computational cost, and top‑400 pruning consistently outperforms baselines. Gains plateau as more tokens are removed, and removing the layer entirely harms recall due to the loss of informative tokens.
6 Conclusion
This study systematically examines the limitations of RQ‑SID in generative search/recommendation, identifying the hourglass phenomenon caused by token concentration in intermediate layers. Through extensive ablations we confirm its root in residual quantization and demonstrate two effective mitigation strategies—layer removal and adaptive token pruning—both of which substantially boost model performance.
7 Future Work
1. Optimize SID production and representation by incorporating temporal and statistical features for finer‑grained ranking.
2. Unify sparse (SID) and dense representations so that LLMs can model dense feature trends directly.
3. Achieve lossless end‑to‑end search pipelines.
JD Cloud Developers
JD Cloud Developers is JD Technology Group's developer platform for technical sharing and communication among AI, cloud computing, IoT, and related developers. It publishes JD product and technology information, industry content, and tech‑event news, embracing technology and partnering with developers to envision the future.