Exploring Large Language Models for Recommendation Systems: Experiments and Insights
This article investigates how large language models can be applied to recommendation tasks, presenting two usage strategies, experimental evaluations on multiple datasets, comparisons with traditional baselines, and analyses of prompting methods, cost, and cold‑start performance.
The recent surge of large language models (LLMs) has prompted research into their applicability for recommendation systems. Two main strategies are discussed: using an LLM as the backbone of a recommender (e.g., BERT4Rec, UniSRec, P5) and using an LLM as a supplemental component that generates richer user, item, or context embeddings or textual explanations.
Three ranking paradigms are examined for top‑K item selection: point‑wise scoring, pair‑wise comparison, and list‑wise ordering, each with distinct interaction costs.
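The interaction-cost difference between the three paradigms can be made concrete with prompt builders: point-wise needs one query per candidate, pair-wise one per candidate pair, and list-wise a single query for the whole ranking. The function names and prompt wording below are illustrative sketches, not templates from the article.

```python
def pointwise_prompts(candidates, history):
    """One query per candidate: the LLM scores each item independently (K queries)."""
    return [
        f"User history: {', '.join(history)}.\n"
        f"On a scale of 1-10, how likely is the user to enjoy '{c}'? Answer with a number."
        for c in candidates
    ]

def pairwise_prompts(candidates, history):
    """One query per unordered pair: K*(K-1)/2 comparisons for K candidates."""
    return [
        f"User history: {', '.join(history)}.\n"
        f"Which would the user prefer: (A) '{a}' or (B) '{b}'? Answer A or B."
        for i, a in enumerate(candidates)
        for b in candidates[i + 1:]
    ]

def listwise_prompt(candidates, history):
    """A single query that asks for the full top-K ordering at once."""
    items = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        f"User history: {', '.join(history)}.\n"
        f"Rank the following candidates from most to least relevant:\n{items}\n"
        "Answer with the reordered list of numbers."
    )

candidates = ["Inception", "Toy Story", "Heat"]
history = ["The Matrix", "Interstellar"]
print(len(pointwise_prompts(candidates, history)))  # 3 queries
print(len(pairwise_prompts(candidates, history)))   # 3 pairs = K*(K-1)/2
```

The query counts explain why list-wise mode is the cheapest per ranked list, a point the experiments below return to.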
An experimental pipeline is built around a unified prompt template comprising a task description, demonstration examples, and a new input query. Both zero‑shot and few‑shot prompting are evaluated.
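A unified template of this shape can be assembled mechanically; with an empty demonstration list it degenerates to zero-shot prompting, and adding examples gives few-shot. The builder below is a minimal sketch under that assumption; the field labels (`Input:`/`Output:`) are illustrative.

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble the unified template: task description, optional
    demonstration examples (few-shot), then the new input query.
    `demonstrations` is a list of (input, output) pairs; an empty
    list yields a zero-shot prompt."""
    parts = [task_description]
    for demo_in, demo_out in demonstrations:
        parts.append(f"Input: {demo_in}\nOutput: {demo_out}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Zero-shot: no demonstrations.
zero_shot = build_prompt("Rank the candidate movies for this user.", [], "History: Heat; Candidates: ...")

# Few-shot: the same template with worked examples prepended.
few_shot = build_prompt(
    "Rank the candidate movies for this user.",
    [("History: Alien; Candidates: ...", "1. Aliens 2. E.T.")],
    "History: Heat; Candidates: ...",
)
```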
Experiments are conducted on four datasets—MovieLens (movies), Amazon Books, Amazon Music, and MIND (news)—with baselines including Random, Pop (popularity), Matrix Factorization (MF), and Neural Collaborative Filtering (NCF). Evaluation metrics are NDCG and MRR.
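For reference, both metrics are straightforward to compute from a ranked list and a set of relevant items; the snippet below is a standard binary-relevance implementation, not code from the article.

```python
import math

def mrr(ranked_items, relevant):
    """Reciprocal rank of the first relevant item (0 if none appears)."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_items, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the
    DCG of an ideal ranking with all relevant items at the top."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked_items[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# A perfect ranking scores 1.0; pushing the relevant item down discounts it.
print(ndcg_at_k(["a", "b"], {"a"}, 2))  # 1.0
print(mrr(["x", "a", "b"], {"a"}))      # 0.5
```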
Key findings: LLMs substantially outperform random and popularity baselines across domains; ChatGPT achieves the best overall performance, especially in list-wise mode, where a single query yields the full ranking. Traditional models still dominate when abundant interaction data are available, but LLMs excel in cold-start scenarios thanks to their world knowledge.
Further analyses reveal that increasing the number of prompt examples or historical items does not guarantee better results, as additional context introduces noise. Zero‑shot prompting already beats random/pop, while few‑shot prompting yields the strongest performance.
Case studies illustrate successful ranking with explanations, failures where the model refuses to answer or provides incorrect rankings, and the need for post‑processing to handle such outcomes.
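A post-processing step of the kind described can be sketched as a tolerant parser: detect refusals, match the model's free-text output back to the known candidate set, and fall back to a deterministic order when nothing usable is returned. The refusal markers and fallback policy below are illustrative assumptions, not the article's exact rules.

```python
import re

def parse_ranking(response, candidates):
    """Extract a ranking over `candidates` from raw LLM text.
    Falls back to the original candidate order when the model
    refuses to answer or its output cannot be matched."""
    refusal_markers = ("i cannot", "i'm sorry", "as an ai")  # assumed markers
    if any(m in response.lower() for m in refusal_markers):
        return list(candidates)  # fallback: keep the input order

    ranked = []
    for line in response.splitlines():
        # Strip a leading "1." or "2)" style index, then match candidates.
        text = re.sub(r"^\s*\d+[\.\)]\s*", "", line).strip()
        for c in candidates:
            if c.lower() in text.lower() and c not in ranked:
                ranked.append(c)

    # Re-append any candidates the model dropped, so the output is
    # always a complete permutation of the input.
    ranked.extend(c for c in candidates if c not in ranked)
    return ranked
```

Guaranteeing a complete permutation keeps the downstream metric computation (NDCG, MRR) well defined even on malformed model outputs.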
The discussion highlights open questions such as how to combine LLMs with ID‑based embeddings, the necessity of natural‑language interfaces, and the importance of fine‑tuning LLMs for specific domains.
DataFunSummit