Eval-as-a-Service Just Crossed $1B — Domain-Expert RLHF Is the New Procurement Line, Not the Research Budget

The release, in one paragraph

On May 20, 2026, IDC and Forrester jointly released a projection that the enterprise eval-as-a-service market will cross $1 billion in 2026, up from a combined $340M in 2025 and roughly $90M in 2024. The 2026 estimate covers spend on domain-expert evaluation, RLHF data, red-teaming engagements, and ongoing eval-suite maintenance. It is counted separately from the model license, the inference bill, and the engineering build. The named vendors are the obvious suspects (Scale AI, Surge AI, Mercor, Snorkel AI, Labelbox) plus a wave of vertical specialists. Two of those specialists launched publicly this week: health.eval, positioned as the credentialed marketplace for MD-authored medical AI evals, and legal.eval, its counterpart for JD-authored legal AI evals. Billing rates for credentialed domain experts authoring eval data have roughly doubled in twelve months: a board-certified radiologist authoring image-eval rubrics now bills in the $400-600/hour range; a senior corporate attorney authoring contract-review rubrics, $500-800/hour; a principal staff engineer authoring coding-agent rubrics, $350-500/hour.
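For readers who want the growth implied by those three figures, the arithmetic can be sketched in a few lines. The dollar amounts are the ones quoted in the projection; the function names are illustrative, and the 2024 figure is described only as "roughly $90M", so the results are approximate.

```python
# Market-size figures quoted in the projection, in millions of USD.
# The 2024 value is "roughly $90M", so treat the outputs as approximate.
market_size_musd = {2024: 90, 2025: 340, 2026: 1000}


def yoy_multiple(sizes, year):
    """Return the market size in `year` divided by the size one year earlier."""
    return sizes[year] / sizes[year - 1]


def cagr(sizes, start, end):
    """Compound annual growth rate between two years, as a fraction."""
    years = end - start
    return (sizes[end] / sizes[start]) ** (1 / years) - 1


for year in (2025, 2026):
    print(f"{year}: {yoy_multiple(market_size_musd, year):.2f}x prior year")
print(f"2024-2026 CAGR: {cagr(market_size_musd, 2024, 2026):.0%}")
```

On these numbers the year-over-year multiple falls from about 3.8x (2024 to 2025) to about 2.9x (2025 to 2026): still steep growth, but already decelerating.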

The surprising line in the projection isn't the dollar amount. AI training data has been a multi-hundred-million-dollar market for two years. The surprising line is the shift in who's doing the work. The anonymous-gig-worker model that built Scale AI's first chapter is moving up-market — toward credentialed, named, professionally-licensed experts authoring rubrics for the workloads where the cost of a wrong answer is high enough that the buyer wants the author's name in the audit log. Enterprise AI procurement in 2026 now carries a line item for domain-expert review that didn't exist on the budget twelve months ago. The teams shipping production AI reliably aren't the ones with the best models. They're the ones whose evals were authored by senior practitioners with credentials the executive committee recognizes.

Why domain-expert RLHF is the procurement decision, not the data-labeling line item

For three years, the production-AI conversation has assumed a one-way curve: better models → better outputs → fewer evals needed. The last twelve months have flipped that framing. As frontier base models have commoditized — Opus 4.7, GPT-5.5, Gemini 3.5, Llama 4.5, Composer 2.5 all sitting within a couple of benchmark points of each other on most workloads — the differentiator has shifted from which model you call to which evals you measure it against. And the evals that matter are the ones authored by someone with the credentials and judgment to grade what a correct answer looks like on a workload the model wasn't generically trained for.

Frontier models are commoditizing; workload-specific evals aren't. Five frontier-tier coding models cluster within two benchmark points on SWE-Bench. Five frontier-tier general models cluster within three points on MMLU. That's a real convergence, and it's accelerating. The thing that doesn't converge — that gets harder, not easier, as the market matures — is the eval suite for your workload: your codebase, your contract template, your radiology workflow, your customer-support style guide, your incident-response runbook. The competitive moat in production AI is moving from model choice to eval ownership.

"Senior practitioner authored the rubric" is the new procurement requirement. Enterprise AI buyers in 2026 — the ones with security review, compliance review, executive committee signoff — are asking a question they didn't ask in 2024: who authored the eval suite, what are their credentials, and what's their audit trail? That question is impossible to answer well if the eval data came from anonymous crowd-workers grading against an unspecified rubric. It's straightforward to answer if the rubric was authored by a named, credentialed practitioner whose review survived the executive committee. The procurement-side pressure is what's driving the up-market shift in the labeling/eval market — and it's not going away.

The vendor's general-purpose model is necessary but not sufficient. Every frontier lab now ships a model that scores in the high-90s on the eval they chose to publish. None of them ship a model that's been measured against your workload on your rubric. The gap between the vendor's published eval and your production eval is the gap that turns into incident reports six months after deploy. The teams shipping reliably are the ones who close the gap themselves; the teams that trust the vendor's published score are the ones explaining the incident to the executive committee at quarter end.

What the credentialed marketplace shift actually changes

The most operationally interesting piece of this week's news isn't the market projection — it's the public launch of health.eval and legal.eval, two vertical platforms positioned around credentialed-expert marketplaces rather than general-purpose labeling. They're the leading indicator for a pattern the market is going to see across every regulated domain through 2026 and 2027.

Healthcare AI now requires MD-authored eval suites — and procurement is enforcing it. Hospital systems, payer organizations, and pharma R&D groups deploying clinical-decision-support AI in 2026 are not signing off on eval suites authored by anonymous labelers. The procurement requirement reads "board-certified specialist in the relevant clinical domain, with documented professional standing, named in the eval-suite audit trail." That requirement makes a marketplace like health.eval — explicitly credentialing radiologists, oncologists, cardiologists, and pathologists, with verified board certifications and active practice credentials — a procurement enabler, not a labeling vendor.

Legal AI follows the same pattern. Law firms, corporate legal departments, and legal-tech buyers deploying contract-review, e-discovery, and litigation-support AI are demanding eval suites authored by JD-credentialed attorneys with active bar standing in the relevant jurisdiction. legal.eval's pitch — "every rubric in our marketplace is authored by an attorney with documented bar admission, every annotation timestamped and signed" — is a direct response to that procurement reality.

The billing-rate inflation is the leading indicator that the shift is real. Credentialed domain experts authoring AI eval data are billing at rates that approximate their professional consulting rates, not crowd-labeling rates. A board-certified radiologist who would charge $500/hour for a consulting engagement now charges $400-600/hour for AI eval work. That rate exists because the market is willing to pay it: the buyer-side procurement requirement is real, and the supply of credentialed experts willing to do this work part-time is genuinely constrained.

The "anonymous gig worker" tier is not going away — it's specializing. The bulk-volume, low-stakes annotation work that built Scale AI's first chapter still has a market. But the high-stakes eval work is moving up-market into the credentialed marketplace, and the procurement budgets are following. The right frame for 2026 isn't "crowd labeling is dead." It's "crowd labeling is now a different product than credentialed-expert eval — different price, different procurement, different audit trail, different buyers."

What it doesn't change

Eval suites still age. A rubric authored by a senior practitioner in Q1 reflects the workload, the model behavior, and the failure modes of Q1; by Q4, the workload has shifted, the model has been updated, and the failure modes have evolved. The team that buys a one-shot eval engagement and treats the resulting rubric as a permanent artifact is buying a stale measurement by mid-year. The credentialed-marketplace shift makes this more important to budget for, not less — the experts you hire to author the rubric in Q1 should also be on retainer to refresh it through the year.

The fine-tune still needs engineering plumbing. A great rubric, a great set of golden examples, and a great senior practitioner authoring the work are necessary inputs to a successful RLHF or SFT run. They are not sufficient. The engineering that ingests the eval data, runs it through the training pipeline, evaluates the resulting model, and ships the updated checkpoint to production is a separate body of work — usually owned by an ML-engineering team, sometimes outsourced — that the credentialed marketplace doesn't address. Budget for both halves.
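One small piece of that plumbing is a release gate that ties every evaluation to a specific rubric version. The sketch below is a minimal illustration under stated assumptions, not a description of any vendor's pipeline: the `Rubric` class, the `release_gate` function, the content-hash versioning scheme, and the 0.9 threshold are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical rubric record; the fields are illustrative assumptions."""
    name: str
    author: str      # the named, credentialed owner
    criteria: tuple  # plain-text grading criteria

    def version(self) -> str:
        """Short content hash, so any edit to the rubric changes its version."""
        payload = json.dumps(
            {"name": self.name, "author": self.author,
             "criteria": list(self.criteria)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def release_gate(rubric, pinned_version, scores, threshold):
    """Return (ok, reason). `scores` holds one pass/fail boolean per eval case."""
    if rubric.version() != pinned_version:
        return False, "rubric changed since it was pinned; re-review required"
    if not scores:
        return False, "no eval results to gate on"
    pass_rate = sum(1 for passed in scores if passed) / len(scores)
    if pass_rate < threshold:
        return False, f"pass rate {pass_rate:.2f} below threshold {threshold:.2f}"
    return True, f"pass rate {pass_rate:.2f} on rubric {pinned_version}"


# Usage with made-up data: three of four cases pass, so the gate blocks.
rubric = Rubric(
    name="contract-review",
    author="Named Attorney (hypothetical)",
    criteria=("flags indemnity clauses", "identifies governing law"),
)
pinned = rubric.version()
ok, reason = release_gate(rubric, pinned, [True, True, False, True], threshold=0.9)
print(ok, reason)
```

Pinning the hash means a rubric edited without review fails the gate even when the model's scores look fine, which is the behavior an audit trail needs.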

The senior practitioner is still the bottleneck. Even with a credentialed marketplace surface, the number of board-certified radiologists willing to do AI eval work, the number of senior attorneys willing to author contract-review rubrics, the number of principal staff engineers willing to grade coding-agent outputs — those populations are finite. The marketplace makes them easier to find and easier to engage. It does not multiply them. Plan for credentialed-expert capacity as a real constraint on your AI roadmap, the same way you plan for senior-engineer capacity on your software roadmap.

Where we'd push back on the framing

"$1B market" includes a lot of demo budget that won't survive Q1 2027 budget reviews. The eval-as-a-service spend in 2026 includes engagements funded out of "AI transformation" budgets that are still climbing fast. Some of that spend is real, durable, and tied to specific production workloads; some of it is exploratory, the kind that gets cut the first time a CFO asks for ROI. Read the $1B as the upper bound on demand, not the steady state.

Many "vertical eval platforms" are wrappers over generic labeling with a credentialed-expert badge. The credentialing layer is the differentiator; the underlying annotation infrastructure is increasingly commoditized. A vertical platform that ships a strong credentialing process on top of a thin annotation wrapper is doing real procurement-enablement work. A vertical platform that ships marketing copy about credentialing on top of the same crowd-labeling backend as a generic vendor is doing procurement theater. The buyer-side diligence — who actually does the annotations, what's the audit trail, who reviews the rubric — separates the two.

The procurement question should be "who reviewed the rubric," not "which platform did the data come from." A rubric authored by a senior practitioner, reviewed by a second senior practitioner, signed by a named clinical lead or principal engineer, with the review trail in the audit log, is the artifact that survives the executive committee. Whether that artifact was produced through Scale AI, Surge, Mercor, health.eval, legal.eval, or an in-house credentialed-expert team is a procurement-mechanics question, not a quality question. Don't let the marketplace logo distract from the rubric review.

"Credentialed expert" is not a license to skip the engineering work. A great rubric authored by a great expert, applied to a great set of golden examples, still needs the engineering pipeline that ingests, runs, evaluates, and ships. Some of the loudest "we adopted AI training and it didn't work" stories from 2025 come from teams that bought the rubric and skipped the plumbing. The credentialed marketplace makes the inputs better; it does not make the engineering invisible.

What we'd build differently this week

Catalog your eval suites by domain. For each AI workload in production, answer: what's the rubric, who authored it, what are their credentials, when was it last refreshed? The workloads where the answer is "the model card and a vibe" are the workloads carrying procurement risk; bring them in scope for a credentialed-expert review this quarter.
Hire (or contract) the senior practitioner who owns each rubric. Not a labeler — an owner. A named person whose job description includes "defends the rubric at quarterly review." In regulated domains, the credentials matter — board certification, bar admission, principal-engineer standing. Outside regulated domains, the seniority and the named accountability matter. Either way, the rubric needs an owner.
Instrument the rubric's decay rate over time. Track how often the rubric needs updating, how often a previously-passing model run starts to fail against an unchanged rubric, how often new failure modes surface from production traffic that weren't covered in the original eval. The decay rate tells you what you need to budget for ongoing rubric maintenance — which is almost always more than the team initially scoped.
Treat eval data as a versioned, audited asset. Same code-review process you use for code; same audit trail you use for compliance artifacts. Every rubric change goes through a PR; every annotation set is timestamped and signed; every model evaluation is reproducible against a specific rubric version. The teams that treat eval data as scratch material are the teams who can't answer the executive committee's question about reproducibility six months later.
Plan for credentialed-expert capacity as a constrained resource. The board-certified radiologist who's willing to author your AI eval rubric is also the radiologist who's reading actual studies during the day. The senior attorney authoring your contract-review rubric is also handling actual matters. Capacity is real, scheduling is real, retention matters. Build the relationship as if you're recruiting senior staff, not as if you're posting a labeling job.
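
The catalog and decay-rate items above can be prototyped before any tooling is bought. The sketch below is one possible shape, with loudly stated assumptions: the record fields, the 90-day refresh window, the sample workloads, and the definition of decay rate (share of previously passing cases that now fail against an unchanged rubric) are all illustrative choices, not an established standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class EvalSuiteRecord:
    """One catalog row per production AI workload (hypothetical schema)."""
    workload: str
    rubric_author: Optional[str]
    credentials: Optional[str]
    last_refreshed: Optional[date]


def needs_review(record, today, max_age_days=90):
    """List the reasons a workload belongs in scope for expert review."""
    reasons = []
    if not record.rubric_author:
        reasons.append("no named rubric author")
    if not record.credentials:
        reasons.append("no documented credentials")
    if record.last_refreshed is None:
        reasons.append("never refreshed")
    elif (today - record.last_refreshed).days > max_age_days:
        reasons.append("refresh overdue")
    return reasons


def decay_rate(previously_passing, now_failing):
    """Share of previously passing cases that now fail on an unchanged rubric."""
    if previously_passing == 0:
        return 0.0
    return now_failing / previously_passing


# Usage with made-up catalog entries.
today = date(2026, 5, 20)
catalog = [
    EvalSuiteRecord("radiology-triage", "Dr. A (hypothetical)",
                    "board-certified radiologist", date(2026, 4, 1)),
    EvalSuiteRecord("support-chatbot", None, None, None),
    EvalSuiteRecord("contract-review", "Attorney B (hypothetical)",
                    "active bar admission", date(2026, 1, 15)),
]
for record in catalog:
    reasons = needs_review(record, today)
    if reasons:
        print(f"{record.workload}: {', '.join(reasons)}")
print(f"decay rate: {decay_rate(previously_passing=200, now_failing=14):.1%}")
```

In this sample the chatbot row is the "model card and a vibe" case: no author, no credentials, no refresh date. Tracking the decay rate per quarter gives a concrete number to bring to the maintenance-budget conversation.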

Sonnet Code's take

The eval-as-a-service market crossing $1B is the moment domain-expert RLHF stopped being a research-team curiosity and started being an enterprise procurement line. The right read isn't "AI training is having a moment." It's that production AI reliability in 2026 is fundamentally an eval problem — and the evals that matter are the ones authored by senior practitioners whose credentials the procurement team, the security team, and the executive committee all recognize. Teams that treat eval as a one-shot labeling engagement will discover their rubric has decayed past the point where it catches the failure mode that lands in the incident report; teams that treat eval as an ongoing relationship with named, credentialed practitioners will ship AI features that survive both the executive committee and the next model release.

We staff both halves of that work. AI training at Sonnet Code is the credentialed-practitioner side of the eval engagement — staff engineers, security architects, principal reviewers, and (through partner networks in regulated domains) credentialed clinicians and attorneys — who author the rubrics, the golden examples, the failure-mode catalogs, and the calibration sets that grade what your AI actually does on your workloads, with named accountability and an audit trail your procurement team can defend. We pair it with AI development engagements that build the engineering plumbing — the eval harness, the routing layer, the observability stack, the deploy gate — that turns the rubric from a document into a release-blocking artifact. If your team is reading the IDC/Forrester projection this week and wondering whether your AI roadmap survives a procurement review that now asks "who authored your evals," the next conversation isn't about which platform to buy from. It's about which workloads need a credentialed-expert rubric, who owns it inside your org, and the senior practitioner whose signoff makes the rubric defensible at the next executive committee meeting.