Sonnet Code
AI Training · April 20, 2026 · 8 min read

Verifiable Rewards Didn't Kill the Expert-in-the-Loop. They Sharpened It.

The headline number that misleads

The most repeated statistic in the RLHF discourse right now is that RLAIF costs less than $0.01 per data point versus $1+ per human-produced data point. The implied conclusion — humans are a transition technology, the AI will replace them on the training loop — is half right and half lazy. The frontier labs that could most easily cut their human-feedback budgets are not cutting them. OpenAI, Anthropic, Google, and Meta are each spending on the order of hundreds of millions per year on human-collected training data, with the top spenders clearing a billion. That is the revealed preference of people who understand their own cost curves better than anyone.

The actual story of 2026 is not that humans got replaced. It is that the humans who still get paid to train frontier models changed jobs.

The shift from preferences to verification

The early RLHF era — roughly 2022 to 2024 — was built on preference data. A contractor compared two model responses and picked the better one. Aggregate enough of those picks and you had a signal strong enough to align a base model into something useful. The work was rote, commodified, and often offshored to the cheapest available labor pool.

That model has mostly been automated away. Reinforcement Learning from AI Feedback (RLAIF), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and the DeepSeek-era verifiable-reward approach have all eaten the preference-labeling job. For tasks where correctness can be mechanically checked — unit tests passing, math problems with known answers, SQL queries returning the right rows — the reward function is a function, not a person.
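
To make that concrete, here is a minimal sketch of a verifiable reward for the unit-test case, assuming a toy harness (the function name, binary scoring, and subprocess setup are illustrative; real pipelines sandbox execution far more carefully):

```python
import os
import subprocess
import tempfile

def verifiable_reward(solution_code: str, test_code: str, timeout: int = 10) -> float:
    """Binary reward: 1.0 if the candidate solution passes its unit tests, else 0.0.

    The judge is the test suite itself; no human rates anything.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w") as f:
            f.write(solution_code + "\n\n" + test_code)
        try:
            # Run the solution plus its asserts as a plain script.
            result = subprocess.run(
                ["python", path], capture_output=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return 0.0  # a looping or stalling solution earns nothing
        return 1.0 if result.returncode == 0 else 0.0
```

This is where the sub-cent economics in the opening paragraph come from: the marginal cost of a reward is a process spawn.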

What survived the automation is the work that cannot be turned into a function: demonstrations, rubric design, red-teaming, and expert review of outputs where the right answer is itself a judgment call. And that work is more valuable than ever, because the models are now good enough that distinguishing a correct but unremarkable answer from a genuinely expert answer requires someone who could have written the expert answer themselves.

Where the money is actually going

The 2026 human-feedback spend is concentrated in four buckets:

  • Domain-expert demonstrations (SFT). A senior radiologist writing out how she reads a specific class of CT scan, including the reasoning. A structural engineer narrating a load-calculation review. A corporate lawyer showing the chain of logic behind a diligence memo. This is the highest-leverage human data being collected today.
  • Rubric design and evaluation. Writing the standard by which an AI output will be judged is itself expert work. The rubric is the moat — it encodes the taste the model is supposed to internalize (a sketch of what an encoded rubric can look like follows this list).
  • Red-teaming. Finding the prompts that break the model, with domain knowledge of why a specific break matters. Generic jailbreak testing is commodified. Domain-specific adversarial testing — a nurse probing clinical advice, a tax attorney probing compliance output — is not.
  • Agent/tool-use training data. Humans demonstrating how to chain tools, recover from errors, and handle ambiguous state. The scale-up of agentic products in 2026 has made this category the fastest-growing line item in the frontier-lab human-feedback budget.
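
On the rubric point: one way to encode "the standard itself" is as a structured artifact that any future evaluator, human or model, can reapply. A minimal sketch, with an invented clinical domain and invented anchor text standing in for the real expert work:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float            # relative importance within the rubric
    anchors: dict[int, str]  # score -> what that score looks like in practice

# Illustrative slice of a clinical-answer rubric. The expert work is
# writing anchors precise enough that two evaluators converge.
CLINICAL_RUBRIC = [
    Criterion(
        name="diagnostic_reasoning",
        weight=0.5,
        anchors={
            1: "Restates findings without interpretation.",
            3: "Correct differential but no prioritization.",
            5: "Prioritized differential with explicit rule-in/rule-out logic.",
        },
    ),
    Criterion(
        name="uncertainty_handling",
        weight=0.3,
        anchors={
            1: "Asserts a single diagnosis with false confidence.",
            5: "Flags what imaging or labs would change the conclusion.",
        },
    ),
]

def score(ratings: dict[str, int]) -> float:
    """Weighted rubric score a future evaluator, human or model, can reapply."""
    total = sum(c.weight for c in CLINICAL_RUBRIC)
    return sum(c.weight * ratings[c.name] for c in CLINICAL_RUBRIC) / total
```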

Notice what is not in the list: rating two responses for helpfulness. That job is gone.

The reward-hacking problem is a human-review problem

Anthropic published a finding earlier in the cycle that penalizing reward hacking during training — either with an HHH preference model reward or a dedicated reward-hacking classifier — can reduce misaligned generalization by over 75%. Both of those penalties are defined by humans. The reason the lab bill for human experts keeps going up is that every new reward-hacking pattern the community discovers requires a new human-authored counter-signal. Verifiable rewards are fast but brittle. They reward the thing they can measure, which is never quite the thing the model should actually be doing.
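
In reward-shaping terms the pattern is simple to state, even though authoring the penalty signal is not. A schematic sketch, not Anthropic's actual setup; the additive form and the penalty_weight knob are assumptions for illustration:

```python
def shaped_reward(
    task_reward: float,       # verifiable signal, e.g. fraction of tests passed
    hack_probability: float,  # output of a human-trained reward-hacking classifier
    penalty_weight: float = 1.0,
) -> float:
    """Verifiable reward corrected by a human-authored penalty term.

    The classifier behind hack_probability only exists because humans
    labeled examples of reward hacking; each newly discovered hacking
    pattern means new labels and a retrained penalty.
    """
    return task_reward - penalty_weight * hack_probability
```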

Human-designed preference and penalty signals are the slow, expensive, necessary correction to the optimization pressure that verifiable rewards create. This is not going to change until we have models that can reliably author their own alignment signals — a goal that has receded, not advanced, as capability has climbed.

What this means for buyers of AI training services

If you are a product team thinking about commissioning human-feedback work — for a domain-specific model, a fine-tune of an existing base, or a high-stakes evaluation pipeline — the landscape looks different than it did 18 months ago. The advice:

  • Do not pay for generic preference data. The labs have automated it. If a vendor is selling it, the vendor is running a markup on cheap labor that your model will barely benefit from.
  • Pay for demonstrations from people who could not be hired as full-time headcount. A senior oncologist giving you four hours of walkthrough data is worth more than forty hours of generalist labeling. The cost per hour is higher; the cost per useful training example is lower.
  • Pay for the rubric, not just the ratings. If you are buying evaluation work, what you actually need is the standard itself — the definition of good — encoded in a form a future evaluator can reapply. The ratings are a byproduct.
  • Treat red-teaming as continuous, not one-shot. The threat surface of an AI product changes every time the underlying model changes. A one-time red-team exercise has a half-life measured in weeks.
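
What "continuous" means in practice: the adversarial prompts your experts have accumulated become a regression suite, rerun on every model change. A minimal sketch; call_model and violates_policy are placeholders for your own client and your experts' judgment criteria:

```python
from typing import Callable

def red_team_regression(
    model_version: str,
    adversarial_prompts: list[str],
    call_model: Callable[[str, str], str],   # (model_version, prompt) -> response
    violates_policy: Callable[[str], bool],  # expert-derived check, hypothetical
) -> list[str]:
    """Return the known-bad prompts that still break this model version.

    Run on every model update; a green suite today says nothing about
    next month's checkpoint.
    """
    return [
        prompt
        for prompt in adversarial_prompts
        if violates_policy(call_model(model_version, prompt))
    ]
```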

The quiet advantage of small, senior training teams

The last trend worth naming: the bigger labs and the more sophisticated enterprise buyers have moved away from the massive-crowd-of-low-cost-labelers model toward small, senior, domain-expert training teams. A hundred people who can credibly demonstrate expert reasoning in a specific domain are worth thousands of commodity raters. The cost per head is higher. The signal-to-noise ratio is incomparable.

This is the version of human-in-the-loop training that scales in 2026. Not cheaper. Sharper. The companies that understand the shift are the ones producing the models everyone else is benchmarking against.