Skill‑RM Shows More Resources Can Harm LLM Scoring – A Deep Dive into Alibaba’s New Evaluation Framework

The Skill‑RM paper reveals that simply appending evaluation resources can degrade large‑model scoring, while structuring those resources into a Reward‑Evaluation Skill boosts performance across benchmarks, best‑of‑N selection, and RL‑based instruction following.

Alibaba Qwen · Evaluation Framework · RLHF

