PaperAgent
Jun 11, 2026 · Artificial Intelligence
Skill‑RM Shows More Resources Can Harm LLM Scoring – A Deep Dive into Alibaba’s New Evaluation Framework
The Skill‑RM paper reveals that simply appending evaluation resources can degrade large‑model scoring, while structuring those resources into a Reward‑Evaluation Skill boosts performance across benchmarks, best‑of‑N selection, and RL‑based instruction following.
Alibaba Qwen · Evaluation Framework · RLHF
