From Gut Feelings to Measurable Metrics: Practicing the Rubrics‑Based Expert Knowledge Extraction and Annotation System CRAFT
The article analyzes the growing difficulty of evaluating large AI models, critiques traditional RLVR and RLHF approaches, introduces a Rubrics‑based evaluation paradigm, describes the design and three‑stage workflow of the CRAFT system, reports math‑domain experiments showing up to 6.2 percentage‑point gains, and outlines future extensions to other domains.
Why Traditional Model Evaluation Falls Short
As large language models become more capable, the community faces a core question: how can they be assessed objectively? Early ChatGPT‑era evaluations relied on subjective judgments such as usefulness or fluency. By 2024‑2025, reasoning models such as OpenAI's o1 and DeepSeek's R1 had shifted attention to closed‑ended tasks with verifiable outcomes, which makes an outcome‑based reward possible (reinforcement learning with verifiable rewards, RLVR). Open‑ended tasks such as creative writing or medical advice still lack a single correct answer, so they are typically evaluated with human preference pairs (A vs. B), the basis of reinforcement learning from human feedback (RLHF).
Fundamental Problems with RLVR and RLHF
RLVR judges only the final answer and ignores the reasoning process, which can lead to reward leakage: a model that reaches the correct answer through a flawed chain of reasoning still receives the full reward. For example, a model claims that the maximum of a polynomial on an interval must lie at the right endpoint, substitutes x=2, and outputs 2. The number is correct, but the monotonicity argument behind it is invalid. The reasoning is wrong, the answer is right, and the model collects a reward it has not earned.
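The gap can be made concrete with a minimal sketch. The function names, the `steps_valid` field, and the two sample solutions are hypothetical illustrations, not part of any RLVR implementation described in the talk; the per‑step validity flags are assumed to come from some external checker.

```python
def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_aware_reward(solution: dict, reference: str) -> float:
    """Reward only when the answer is right AND every step was judged valid."""
    if not all(solution["steps_valid"]):
        return 0.0
    return outcome_reward(solution["final_answer"], reference)


# Two hypothetical solutions that both end with the answer "2".
sound = {"steps_valid": [True, True, True], "final_answer": "2"}
# Second step is the invalid "maximum must be at the right endpoint" claim.
flawed = {"steps_valid": [True, False, True], "final_answer": "2"}

for name, sol in [("sound", sound), ("flawed", flawed)]:
    print(name,
          outcome_reward(sol["final_answer"], "2"),
          process_aware_reward(sol, "2"))
```

Under the outcome‑only reward both solutions score the same; only the process‑aware variant separates them.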
RLHF requires large amounts of human‑generated preference data and is vulnerable to surface‑feature bias. A model may produce a LaTeX‑formatted solution that looks clean but contains a fundamental multiplication‑rule error, while a terse, poorly formatted answer could be penalized despite being mathematically correct.
Rubrics as Rewards
To address these issues, the authors adopt the "Rubrics as Rewards" paradigm originally proposed by Scale AI for medical data. Rubrics turn expert intuition into a structured scoring sheet with explicit dimensions, weights, and criteria, which reduces the variance between evaluations.
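As a rough illustration of what "dimensions, weights, and criteria" can look like in practice, the sketch below scores a response as a weighted fraction of satisfied criteria. The `Criterion` class, the criterion names, and the weights are assumptions made for this example; the source does not specify CRAFT's actual schema or aggregation rule.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str          # dimension being judged
    description: str   # what the grader should look for
    weight: int        # relative importance (hypothetical values)


def rubric_score(criteria, passed):
    """Weighted share of criteria that were satisfied, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if passed.get(c.name, False))
    return earned / total


criteria = [
    Criterion("final_answer", "Final answer matches the reference.", 5),
    Criterion("valid_reasoning", "Every inference step is justified.", 3),
    Criterion("clear_notation", "Symbols are defined before use.", 2),
]

# A response with the right answer and clean notation but a flawed argument.
verdicts = {"final_answer": True, "valid_reasoning": False, "clear_notation": True}
print(rubric_score(criteria, verdicts))
```

Because each criterion is judged separately, two graders who disagree do so on a named dimension rather than on an overall impression.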
CRAFT System Design
CRAFT (Collaborative Rubrics‑based Annotation and Feedback Tool) follows a human‑in‑the‑loop philosophy rather than full automation. Its technical implementation uses a three‑level hierarchy: sub‑domains → primary dimensions → secondary dimensions. In mathematics, for instance, seven sub‑domains (analysis, number theory, etc.) each define 4‑5 primary dimensions, each of which defines 4‑6 secondary dimensions. Every secondary dimension includes a description and an example to ensure consistent expert interpretation.
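One way to picture the three‑level hierarchy is as a nested mapping whose leaves carry the required description and example. The structure and every name in it are hypothetical; the sketch only shows how the "each secondary dimension has a description and an example" rule could be checked mechanically.

```python
# Hypothetical fragment of a framework; all names and texts are illustrative.
framework = {
    "analysis": {                            # sub-domain
        "logical rigor": {                   # primary dimension
            "monotonicity justified": {      # secondary dimension
                "description": "Monotonicity claims are proved, not asserted.",
                "example": "Checks the derivative's sign before using an endpoint.",
            },
            "domain conditions checked": {
                "description": "Endpoints and excluded points are handled.",
                "example": "",  # deliberately empty to trigger the check below
            },
        },
    },
}


def iter_secondary(framework):
    """Yield (sub_domain, primary, secondary, spec) for every leaf."""
    for sub, primaries in framework.items():
        for primary, secondaries in primaries.items():
            for secondary, spec in secondaries.items():
                yield sub, primary, secondary, spec


def missing_fields(framework):
    """Paths of secondary dimensions lacking a description or an example."""
    return [
        f"{sub} / {primary} / {secondary}"
        for sub, primary, secondary, spec in iter_secondary(framework)
        if not spec.get("description") or not spec.get("example")
    ]


print(missing_fields(framework))
```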
Three‑Stage Workflow
Framework Construction: Domain experts define the sub‑domain split, dimension hierarchy, and write descriptions and examples. This knowledge base forms the "knowledge root" and cannot be generated by the model.
Automated Rubrics Production: When a user submits a problem (e.g., the fraction addition 2/20 + 1/10), the system identifies the sub‑domain (arithmetic‑fraction), then runs a three‑step pipeline: Generate (draft answer), Check (diagnostic verification), and Regen (refinement). The interface shows real‑time status for each stage.
Human‑Machine Collaborative Annotation: The UI presents model scores on the left and expert scoring inputs on the right. Each Rubric dimension displays the score, rationale, and evidence. Discrepancies are highlighted for experts to correct, and all divergence data are logged for later analysis and model fine‑tuning.
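The second and third stages can be sketched roughly as follows. This is a toy under stated assumptions, not CRAFT's code: the three stage functions are deterministic stand‑ins for model calls, Regen is assumed to run only when Check reports an issue, and `find_discrepancies` is a hypothetical helper that mimics the highlighting of model/expert disagreements.

```python
from fractions import Fraction


def run_pipeline(problem, generate, check, regen, on_status=print):
    """Run Generate -> Check -> Regen, reporting each stage via on_status."""
    on_status("generate")
    draft = generate(problem)
    on_status("check")
    issues = check(problem, draft)
    if issues:  # assumption: Regen is only needed when Check finds issues
        on_status("regen")
        draft = regen(problem, draft, issues)
    return draft


# Stub stages for the example 2/20 + 1/10 (stand-ins for model calls).
def generate(problem):
    a, b = problem
    # Deliberately wrong draft: adds numerators and denominators separately.
    return f"{a[0] + b[0]}/{a[1] + b[1]}"


def check(problem, draft):
    a, b = problem
    expected = Fraction(*a) + Fraction(*b)
    return [] if Fraction(draft) == expected else [f"expected {expected}"]


def regen(problem, draft, issues):
    a, b = problem
    return str(Fraction(*a) + Fraction(*b))


def find_discrepancies(model_scores, expert_scores, tolerance=0):
    """List dimensions where model and expert scores differ beyond tolerance."""
    return [
        {"dimension": dim, "model": m, "expert": expert_scores[dim]}
        for dim, m in model_scores.items()
        if dim in expert_scores and abs(m - expert_scores[dim]) > tolerance
    ]


statuses = []
answer = run_pipeline(((2, 20), (1, 10)), generate, check, regen,
                      on_status=statuses.append)
print(statuses, answer)

diffs = find_discrepancies({"correctness": 1, "rigor": 1},
                           {"correctness": 1, "rigor": 0})
print(diffs)
```

Passing the status callback in as a parameter keeps the pipeline independent of any particular interface, and the discrepancy records are plain dictionaries so they can be logged as they are.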
Math‑Domain Practice and Results
The authors evaluated CRAFT on the ProcessBench benchmark, which requires a model to pinpoint the exact step where reasoning fails. A base model alone reached 45.8% accuracy. Adding a simple Likert‑scale process‑evaluation layer raised accuracy to 48.4%. When Rubrics were split into finer‑grained score levels (the "Five Levels" setting), performance dropped because the judge model could not reliably distinguish the narrow bands. The simplest setting, "Binary Rubrics", yielded the largest gain: approximately a 6.2 percentage‑point improvement over the baseline.
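To illustrate the task format, the sketch below turns per‑step binary verdicts into a predicted error location and computes accuracy over a few samples. It assumes the earliest failing step is the one to report and uses -1 to mean "no error found"; the verdicts and labels are made‑up toy data, not the experiment's results.

```python
def first_error_step(step_verdicts):
    """Index of the earliest step judged invalid, or -1 if all steps pass."""
    for index, ok in enumerate(step_verdicts):
        if not ok:
            return index
    return -1


def accuracy(predictions, labels):
    """Share of samples whose predicted error step equals the label."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)


# Toy data: binary verdicts per reasoning step for three solutions.
verdicts = [
    [True, True, False, True],   # first error at step 2
    [True, True, True],          # no error
    [False, True],               # first error at step 0
]
labels = [2, -1, 1]              # the last label disagrees on purpose

predictions = [first_error_step(v) for v in verdicts]
print(predictions, accuracy(predictions, labels))
```

With binary verdicts the judge answers only "valid or not" for each step, which avoids the narrow score bands that hurt the multi‑level setting.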
Future Outlook
Having validated the approach in mathematics, the team plans to extend CRAFT to domains such as finance (long, structured reports with hard and soft constraints) and medicine (high‑stakes clinical reasoning). The team has also announced an open‑source roadmap covering the Rubrics data, the CRAFT post‑training framework, and the annotation platform, so that other verticals can embed expert know‑how into LLMs.
Selected Q&A
Q1: Is human intervention limited to the annotation stage? A: No. Experts are essential from the initial framework construction to encode domain know‑how.
Q2: How is expert cost evaluated? A: Early pilots involve a few hundred to a few thousand items; once the framework is solid, scaling to tens of thousands becomes largely automated, making it easier to scale than RLHF.
Q3: How does this relate to the "Bitter Lesson" about the limits of human knowledge injection? A: In specialized domains (e.g., medicine) current models still lack sufficient know‑how, so explicit expert knowledge remains crucial. Mathematics serves as a low‑cost testbed before tackling higher‑risk domains.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
