Evaluating Large Language Model Item Encoders for Textual Collaborative Filtering in Recommendation Systems
This article investigates whether replacing traditional ID-based item encoders with massive LLMs such as GPT‑3 improves recommendation performance. The authors conduct extensive experiments on three real-world datasets, analyzing performance limits and the generality of item representations, and comparing against both ID-based and prompt-based methods.
TL;DR: Unlike prior LLM-for-Rec work that relies on OpenAI APIs for prompting, this study replaces the item encoder with a 175-billion-parameter GPT‑3 model, fine-tunes LLM encoders of up to 66 billion parameters, and probes the performance ceiling of the text-based recommendation paradigm.
Research Motivation: The classic ID-based recommendation paradigm has dominated for a decade; with the rise of large language models (LLMs) in NLP, the authors ask whether encoding items with an LLM can surpass ID-based methods. They conduct a study of the "TCF paradigm" (text-based collaborative filtering) using GPT‑3 as the item encoder, including costly experiments such as fine-tuning a 66-billion-parameter model.
Model Architecture: Two representative recommendation backbones are evaluated: a dual-tower DSSM model (a simplified CTR setting) and the sequential SASRec model. These serve as the downstream recommendation architectures on top of the LLM item encoder.
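To make the dual-tower setup concrete, here is a minimal sketch of DSSM-style scoring on top of precomputed LLM item embeddings. All names, dimensions, and the averaging user tower are illustrative assumptions, not the paper's actual implementation; the random vectors stand in for frozen-LLM outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for item embeddings precomputed by a frozen LLM encoder
# (e.g. pooled hidden states); random here, real text embeddings in practice.
n_items, llm_dim, rec_dim = 100, 64, 16
item_llm_emb = rng.normal(size=(n_items, llm_dim))

# Item tower: a small learned projection from LLM space to the recommendation space.
W_item = rng.normal(size=(llm_dim, rec_dim)) * 0.1

def item_tower(item_ids):
    return item_llm_emb[item_ids] @ W_item

def user_tower(history_ids):
    # Simplest possible user tower: average the projected embeddings
    # of the user's past interactions.
    return item_tower(history_ids).mean(axis=0)

def score(history_ids, candidate_ids):
    # Dual-tower score: inner product of the user vector and each candidate item.
    u = user_tower(history_ids)
    return item_tower(candidate_ids) @ u

s = score([1, 5, 9], np.arange(n_items))
top5 = np.argsort(-s)[:5]
print(top5)
```

In the frozen-LLM regime, only `W_item` (and the user tower) would receive gradients; fine-tuning the LLM itself updates `item_llm_emb` as well.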
Datasets: Experiments use three real-world text-rich datasets: Microsoft MIND news clicks, H&M fashion purchases, and Bili video recommendations. Items are represented by titles or descriptions, and user-item interactions are implicit feedback (click, purchase, comment).
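Implicit feedback like the above is typically converted into training triples via negative sampling. A small sketch with made-up users and items (the grouping and one-negative-per-positive scheme are common practice, not details from the paper):

```python
import random

# Toy implicit-feedback log: (user, item) pairs such as clicks or purchases.
interactions = [("u1", "i1"), ("u1", "i3"), ("u2", "i2"), ("u2", "i3"), ("u3", "i1")]
catalog = {"i1", "i2", "i3", "i4", "i5"}

# Group each user's positive items.
positives = {}
for u, i in interactions:
    positives.setdefault(u, set()).add(i)

def sample_triples(seed=0):
    # For every positive, sample one item the user has NOT interacted with
    # as the negative (standard negative sampling for implicit feedback).
    rnd = random.Random(seed)
    triples = []
    for u, pos in positives.items():
        for i in pos:
            neg = rnd.choice(sorted(catalog - pos))
            triples.append((u, i, neg))
    return triples

for t in sample_triples():
    print(t)
```

Each resulting (user, positive, negative) triple can then feed a pairwise loss such as BPR in either the DSSM or SASRec backbone.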
Experimental Observations:
Q1: Scaling the text encoder from 125M to 175B parameters generally improves TCF performance, though the relationship is not strictly monotonic; the 350M models perform worst.
Q2: Even extremely large LLMs do not yield universally transferable item embeddings; fine-tuning still outperforms frozen representations, and fine-tuning a 66B model is prohibitively expensive.
Q3: With the DSSM backbone, TCF using 175B LLMs still lags behind ID-based collaborative filtering; with the SASRec backbone, however, frozen-LLM TCF matches or exceeds ID-CF in warm-item scenarios.
Q4: LLM-based TCF shows limited transfer-learning ability; pre-training on a large user-item corpus improves performance but does not close the gap to models trained directly on each target recommendation dataset.
Q5: Prompt-based ChatGPT4Rec performs significantly worse than TCF across typical recommendation tasks, highlighting the current limitations of pure prompting for large-scale recommendation.
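The contrast in Q3 between ID-based and frozen-LLM item representations can be sketched schematically. Everything below is a toy stand-in (random matrices, a last-item "sequence model"), not the paper's SASRec implementation; it only illustrates which parameters would be trainable in each regime.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, d = 50, 8

# (a) ID-based CF: a freely learned embedding per item ID.
id_emb = rng.normal(size=(n_items, d))

# (b) TCF with a frozen LLM: fixed text embeddings plus a small trainable adapter.
frozen_text_emb = rng.normal(size=(n_items, 32))  # stand-in for frozen LLM outputs
adapter = rng.normal(size=(32, d)) * 0.1

def item_vectors(mode):
    if mode == "id":
        return id_emb                    # gradients would flow into the full table
    return frozen_text_emb @ adapter     # only the adapter would be trained

def next_item_scores(history, mode):
    # SASRec-style scoring reduced to its simplest form: represent the user by
    # the last item's vector and score it against the whole catalog.
    vecs = item_vectors(mode)
    return vecs @ vecs[history[-1]]

print(next_item_scores([3, 7, 12], "id").shape)    # (50,)
print(next_item_scores([3, 7, 12], "text").shape)  # (50,)
```

The practical difference: regime (a) learns one vector per item and cannot handle unseen items, while regime (b) can embed any item with text but, per Q2, still benefits from fine-tuning the encoder itself.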
Conclusion: The study proposes no new algorithm; instead it offers a thorough empirical analysis of the TCF paradigm with massive LLM item encoders. The results suggest that TCF has not yet reached its performance ceiling and that larger LLMs may yield further gains, but substantial training costs and limited transferability remain major obstacles to building a universal recommendation foundation model.
DataFunTalk