The shift, in one paragraph
For three years the canonical answer to "how do we align this model on our domain" was a pipeline that started with supervised fine-tuning (SFT) on human demonstrations and finished with reinforcement learning from human feedback (RLHF) on preference labels. That pipeline is still in production, but it stopped being the frontier sometime in early 2026. The new shape is: RLVR (Reinforcement Learning with Verifiable Rewards) for any domain where the outcome can be checked — code that compiles, math that's right, retrieval that returns the canonical document, a function that passes a test. Rubrics-as-Rewards for most of the rest — domains where there is no programmatic verifier but a senior expert can write a rubric the model learns to satisfy. RLHF is now the third option, used where neither verifiable rewards nor rubrics are tractable.
The headline framing is "new training methods." The substance is what each shift implies for who does the work: the labeler has been demoted; the senior domain expert has been promoted; and the rubric — versioned, code-reviewed, treated as a production artifact — is the new load-bearing object in the post-training stack.
Why RLVR ate the verifiable-domains tier first
RLVR's claim is simple and, in the domains where it applies, devastating: if the reward signal is deterministic and rule-based — "this code passes 47 of 50 unit tests", "this math derivation matches the canonical answer", "this query retrieved the authoritative document" — you don't need humans grading thousands of preference comparisons to teach a model to be better at the task. You need a verifier.
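Concretely, the verifier is nothing more exotic than a function from a sampled completion to a scalar reward. A minimal sketch, assuming the task is "generate a function that passes these unit tests" — the names, data shapes, and reward shaping here are ours for illustration, not any particular RL framework's:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TestCase:
    args: tuple        # inputs to the candidate function
    expected: object   # ground-truth output


def code_verifier(candidate_src: str, fn_name: str, tests: list[TestCase]) -> float:
    """Deterministic reward: the fraction of unit tests the candidate passes."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # load the model's generated code
        fn: Callable = namespace[fn_name]
    except Exception:
        return 0.0                       # code that doesn't even load earns zero reward
    passed = 0
    for t in tests:
        try:
            if fn(*t.args) == t.expected:
                passed += 1
        except Exception:
            pass                         # a crash counts as a failed test
    return passed / len(tests)           # e.g. 47 of 50 -> 0.94


# Reward for one sampled completion.
tests = [TestCase(args=(2, 3), expected=5), TestCase(args=(-1, 1), expected=0)]
reward = code_verifier("def add(a, b):\n    return a + b", "add", tests)
assert reward == 1.0
```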
Three consequences fall out of that:
The cost structure flipped. A traditional RLHF run for a 7B-class model required tens of thousands of preference labels at $1–$5 each. An RLVR run on the same model requires zero preference labels and a verifier — usually code your team already has (test suites, formal validators, retrieval ground-truth files). For verifiable domains, post-training cost dropped by an order of magnitude, and the bottleneck became engineering the verifier, not recruiting the labelers.
Generalization improved on the workloads where RLVR applies. RLVR-trained models tend to match SFT on in-distribution tasks and outperform SFT meaningfully on out-of-distribution tasks. The intuition is that a verifiable reward forces the model to learn generalizable problem-solving rather than memorize the idiosyncrasies of its training demonstrations. For coding and math agents specifically, this translates into models that handle edge cases more robustly.
The honesty floor rose. RLVR is harder to game than RLHF in the verifiable domains it touches — you can't sweet-talk a unit test. That doesn't fix every failure mode (sycophancy, refusal patterns, and tool overuse remain), but it removes a class of issues that came directly from the preference-label distribution.
The limit, of course, is that most enterprise workloads are not cleanly verifiable. Drafting a customer email, summarizing a clinical note, reviewing a legal contract, evaluating a deal memo — none of these have a deterministic checker. RLVR was the easy half of the post-training problem.
Why Rubrics-as-Rewards is the actually-new thing
Rubrics-as-Rewards is the harder, more recent move: extend the structure of RLVR — explicit, decomposable reward signals — into domains where there is no programmatic verifier, by replacing the verifier with a rubric authored by a domain expert.
A rubric for an investment-banking pitch deck might say: "5 points for correctly identifying the deal driver in the first slide; 3 points for using the firm's standard valuation framework; 2 points for the comparables being from the right sector and the right time window; -5 points for any factual claim about the target that isn't traceable to the data room." A rubric for a clinical note summary might list twelve specific information elements that must appear, ten formatting rules, and three patterns that count as hallucinations. A rubric for a code review comment might enumerate the categories of feedback that count as substantive versus surface.
The rubric is what a senior practitioner would say if you asked them to grade a junior's work. The practical move is to encode those grading criteria explicitly enough that an LLM grader can apply them consistently — and then use the LLM grader's score as the reward signal for the model being trained.
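A hedged sketch of those mechanics — the rubric as structured data, an LLM grader applying it criterion by criterion, and the aggregate score used as the reward. `grade_with_llm` is a stand-in for one call to whatever grader model your stack exposes; the criteria and weights below are illustrative, not a clinical standard:

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str       # what the senior expert is checking for
    points: float   # negative values encode penalties, e.g. for hallucinations
    prompt: str     # the yes/no question the LLM grader answers


# Illustrative criteria for a clinical-note summary; a real rubric comes from the clinician.
RUBRIC = [
    Criterion("chief_complaint_present", 3.0,
              "Does the summary state the patient's chief complaint?"),
    Criterion("medications_listed", 2.0,
              "Are all medications from the source note listed with dosages?"),
    Criterion("unsupported_claim", -5.0,
              "Does the summary contain any claim not supported by the source note?"),
]


def grade_with_llm(question: str, source: str, output: str) -> bool:
    """Placeholder for a single call to whatever grader model your stack exposes."""
    raise NotImplementedError


def rubric_reward(source: str, output: str) -> float:
    """Reward = sum of points for every criterion the grader answers 'yes' to,
    normalized so the RL loop sees a bounded signal."""
    score = sum(c.points for c in RUBRIC if grade_with_llm(c.prompt, source, output))
    max_positive = sum(c.points for c in RUBRIC if c.points > 0)
    return score / max_positive
```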
This combines RLVR's clean economics with RLHF's broad applicability. The cost structure looks like RLVR's (an LLM grader is fast and cheap); the domain coverage looks like RLHF's (any rubric-able task); the operational discipline sits somewhere in between.
Where the bottleneck actually lives now
The industry conversation about "AI training" still defaults to images of labelers in spreadsheet rows producing preference comparisons. That picture is a year out of date. The bottleneck moved.
The new bottleneck is the rubric author. A rubric is only as good as the senior practitioner who wrote it. Hire a generalist labeler to write a rubric for clinical note summarization and you get a rubric that misses the things a doctor would catch. Hire a doctor to write the same rubric and you get something the model can actually be trained against. The unit of value moved from cheap labor at scale to expensive expertise at the point of need.
The second bottleneck is the verifier engineer. RLVR works only as well as the verifier. A verifier that's too lenient lets the model learn to game it; a verifier that's too strict starves the model of reward signal. Engineering the verifier — writing the test harness, the canonical-answer corpus, the retrieval ground truth, the formal validator — is now a real job, distinct from training the model and distinct from labeling.
The third bottleneck is the eval suite owner. Whatever method you use to train, the question "did this run make the model better on the work my team actually does?" is answered by a workload-specific eval suite, replayed against gold-standard outputs, graded by either a verifier or a rubric. Most teams don't have this. Without it, every training run is a vibe.
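A minimal sketch of that suite, assuming a JSONL file of gold-standard cases and a grader like the verifier or `rubric_reward` function above — the field names, path, and regression threshold are ours, not a standard:

```python
import json
from statistics import mean
from typing import Callable


def run_eval_suite(cases_path: str,
                   generate: Callable[[str], str],
                   grade: Callable[[dict, str], float],
                   regression_floor: float = 0.80) -> dict:
    """Replay the team's actual workload against a candidate model and report.

    Each line of `cases_path` is a JSON record with an "input" field plus
    whatever gold-standard fields the grader needs.
    """
    scores = []
    with open(cases_path) as f:
        for line in f:
            case = json.loads(line)
            output = generate(case["input"])   # the post-trained model under test
            scores.append(grade(case, output))
    return {
        "n_cases": len(scores),
        "mean_score": mean(scores),
        "passed": mean(scores) >= regression_floor,  # gate the training run on this
    }
```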
In that order: rubrics, verifiers, evals. Three roles, all of them senior, all of them under-staffed at the typical enterprise.
What this means for buyers of AI training services
Stop buying labels by the thousand. That market still exists for legacy SFT runs and for the long tail of workloads where there's nothing better, but it's no longer the procurement question. The procurement question is who can author rubrics for our domain, who can engineer verifiers for our verifiable workloads, and who can stand up the eval suite that tells us whether the post-training run actually moved the needle.
Senior practitioners are the lever, not the volume. A team of fifty labelers will not produce a rubric the model can learn from. A team of three doctors, three lawyers, or three senior engineers — depending on the domain — will. The right buying motion is hours of senior expertise, not cents per label.
Vertical specialization beats horizontal scale. The labeling-platform vendors that won 2022 won by being horizontal — same workflow for every domain. The training-services vendors that win 2026 will be the ones with named domain expertise in finance, healthcare, legal, regulated software — because that's the only way the rubric is actually authoritative. Horizontal at scale is now a commodity; vertical at depth is the moat.
Rubrics are a versioned product. A rubric should live in a repo, with a CHANGELOG, with code review, with an eval suite that catches when a rubric change breaks model behavior. Treating rubrics as one-off artifacts written into a Google Doc is the same mistake teams used to make with prompts: under-investing in the artifact that turns out to compound.
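One concrete shape for that discipline is a regression suite over expert-approved and expert-rejected golden examples, run in CI on every rubric change. A sketch with pytest — the module name, file paths, and score thresholds are assumptions, not a prescribed layout:

```python
# test_rubric_regression.py — runs in CI on every rubric change.
import json
import pytest

from rubric import rubric_reward   # the versioned rubric under review (hypothetical module)

GOLDEN = "evals/clinical_summary/golden.jsonl"      # outputs the domain expert approved
REJECTED = "evals/clinical_summary/rejected.jsonl"  # outputs the domain expert rejected


def _load(path):
    with open(path) as f:
        return [json.loads(line) for line in f]


@pytest.mark.parametrize("case", _load(GOLDEN))
def test_golden_outputs_still_score_high(case):
    # A rubric edit that tanks an expert-approved output is a regression.
    assert rubric_reward(case["source"], case["output"]) >= 0.8


@pytest.mark.parametrize("case", _load(REJECTED))
def test_rejected_outputs_still_score_low(case):
    # A rubric edit that starts rewarding known-bad outputs is also a regression.
    assert rubric_reward(case["source"], case["output"]) <= 0.4
```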
Where we'd push back on the framing
Two honest caveats.
RLVR is not magic outside its domain. Recent research suggests that RLVR's reasoning gains may be primarily search compression — making the model find correct answers faster — rather than expanded reasoning capability. For workloads where the underlying problem is genuinely outside the model's reasoning frontier, RLVR is not a free upgrade. Calibrate the expectation: RLVR makes capable models more reliably correct on verifiable tasks; it does not turn an incapable model into a capable one.
Rubrics inherit their author's blind spots. A rubric written by one senior expert is a rubric calibrated to that expert's idiosyncratic preferences. The teams that get this right cross-author rubrics across multiple practitioners and arbitrate the disagreements explicitly, the same way a peer-reviewed style guide gets written. A rubric authored by one person is a hypothesis, not a standard.
What we'd build differently this quarter for an AI training program
- Inventory which workloads are RLVR-tractable. Anything where the reward can be checked against a deterministic signal — code agents, math agents, retrieval agents, structured-output agents. Stop buying preference labels for those. Engineer verifiers instead.
- For non-verifiable workloads, hire (or contract) the senior expert before the labeler. A rubric authored by a domain practitioner with twenty years in the role is worth more than ten thousand preference labels from generalists. The cost-per-rubric is high; the leverage is much higher.
- Stand up the eval suite as the first artifact, not the last. Whatever your training method, the question "did this run improve outcomes on the work my team actually does" needs an answer that isn't "the loss went down." Replay actual workload, grade against gold standard, report.
- Treat rubrics, verifiers, and eval suites as production code. Repo, code review, CHANGELOG, regression tests. The teams that already do this for prompts will not need to be told twice; the teams that don't will discover, painfully, that an undocumented rubric drifts faster than an undocumented prompt.
- Cross-author the rubrics. Two or three senior practitioners, explicit arbitration on disagreements, a written rationale for each rubric criterion. The rubric that survives that process is durable; the one that doesn't is one expert's opinion in a trench coat.
Sonnet Code's take
The RLHF era trained a generation of buyers to think of AI training as labels purchased in bulk. The RLVR + Rubrics-as-Rewards era is going to train a different generation of buyers to think of AI training as expertise encoded as artifact — verifiers, rubrics, gold-standard examples, eval suites — authored by senior practitioners and treated as production code. We staff that work directly. AI training at Sonnet Code is senior domain reviewers — engineers, doctors, financial analysts, lawyers, depending on the workload — authoring the rubrics, golden examples, red-team prompts, and eval suites that calibrate frontier models against the standards your team would actually defend in an audit. We pair that with AI development engagements that build the verifiers, the eval-suite runners, and the routing logic that turns those artifacts into a feedback loop a model can actually be improved against. If your team is rebuilding its training pipeline off the back of the RLVR shift, the next hire isn't another labeler. It's the senior practitioner who can author the rubric that makes everything downstream work.

