The number, in one paragraph
Surge AI — a five-year-old, bootstrapped, profitable data-labeling company doing roughly $1.2B in annual revenue with about 130 full-time employees and 50,000 expert contractors — opened its first-ever capital raise this cycle, with reports of a $1B round at a valuation up to $25B. The customer roster is OpenAI, Google, Microsoft, Meta, and Anthropic. The pitch is not data labeling in the spreadsheets-of-bounding-boxes sense; it's RLHF preference data, evaluation environments, red-team prompts, and SFT demonstrations written by domain experts paid premium per-minute rates.
The number is interesting on its own. Read against Anthropic's $30B run rate, OpenAI's $25B, and Cursor's $13B post-Composer-2 valuation, it points at the part of the AI economy that is least talked about and most structurally durable: the human training-data layer.
Why the human-in-the-loop layer doesn't commoditize
Three forces work against commoditization here, and all three are already visible:
1. The work is non-substitutable by the model. RLHF training data has to be produced by someone who knows more than the model; by definition, you can't have the model write it. The cheaper labelers of a generation ago, Mechanical Turk-style microtask workers, are no longer qualified to grade frontier-model output, because that output is now good enough that judging it requires real expertise. The natural floor on what an RLHF labeler costs is "how much does it cost to hire someone qualified to know whether the model is correct?", and that floor rises every time the models improve.
2. The buyer has every reason to pay a premium. A $5M training run is cheap to ruin with bad RLHF data. A $50M training run is unbearably expensive to ruin with bad RLHF data. The labs paying Surge are not optimizing for the cheapest annotator — they're optimizing for the annotator whose mistakes don't appear in the model six months later. That is a profile that almost guarantees premium pricing, because the alternative is catastrophic at training-run scale and invisible until production.
3. Specialization compounds. Surge isn't selling the same product to OpenAI as it sells to a mid-market enterprise. Each customer relationship spawns a workflow, a rubric, a vetted pool of annotators, and an internal QA layer that gets harder for a competitor to replicate over time. The moat isn't "we have annotators." The moat is "we have the right annotators trained against the right rubric for this customer, with the QA process to keep them calibrated."
This is the same pattern that made strategic outsourcing more durable than the body-shopping it eventually replaced: domain depth + customer-specific process + a workforce qualified enough to be hard to retrain elsewhere.
What it means for the next layer down
Surge's pricing tells you what RLHF labor costs at the frontier. The next layer down — enterprises that want to fine-tune or RLHF a model on their own domain — is where the price-performance curve gets interesting.
A regional bank that wants its underwriting agent to behave like its best underwriter cannot send its underwriting data to Surge. The data is regulated, the underwriters are domain-internal, and the rubric is something only one or two people in the building actually understand. What that bank needs is the Surge model (a managed pipeline of vetted senior reviewers, a calibrated rubric, a QA layer) applied to its own internal experts. That is a fundamentally different procurement from buying a labeling platform and pointing it at a CSV.
The shape of demand for the next two years:
- Frontier labs keep paying Surge / Scale / Labelbox-tier providers the premium they already pay. That market is durable but largely allocated.
- Enterprises with domain-specific workflows want a smaller-scale version of the same thing — a managed RLHF / SFT / red-team pipeline run by senior people who can be embedded with their internal experts. This market is just opening.
- Mid-market and SMB mostly cannot afford either, and will live on prompt engineering plus general-purpose evals until the price curve bends.
What buyers should actually ask for
If your enterprise is staring at a fine-tuning or RLHF project right now, these are the procurement questions worth pressing on:
- Who actually writes the data? If the answer is "our annotation platform's labeler pool," that is not a frontier-quality answer. The right answer is named senior people, vetted against your rubric, calibrated against gold-standard examples your internal experts wrote.
- What is the rubric, and who owns it? A rubric written by a vendor is a vendor's rubric. A rubric owned by your domain leads, with the vendor as the operator, is yours.
- Where does the QA layer live? Inter-annotator agreement, calibration drift, and reviewer fatigue are real failure modes. The vendor that doesn't talk about them is the vendor that doesn't measure them. (A minimal sketch of what measuring them can look like follows this list.)
- What does the data deliverable look like? Spreadsheets are the wrong unit. The right unit is a versioned, hash-stable, repository-managed dataset with a license and a chain of custody.
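To make the QA question concrete, here is a minimal sketch, in Python, of the two measurements a vendor should be able to hand over on request: pairwise inter-annotator agreement (Cohen's kappa) and per-reviewer calibration against gold examples your own experts graded. Everything in it is illustrative; the label set, window cadence, and thresholds are assumptions, not anyone's production pipeline.

```python
"""Minimal QA metrics for an annotation pipeline: agreement and calibration drift.

Illustrative only -- label sets, thresholds, and cadences are assumptions.
"""
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators grading the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, given each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:  # degenerate case: both annotators always give the same label
        return 1.0
    return (observed - expected) / (1 - expected)


def gold_accuracy(reviewer: Sequence[str], gold: Sequence[str]) -> float:
    """Share of seeded gold-standard items the reviewer graded the way your experts did."""
    assert len(reviewer) == len(gold) and len(gold) > 0
    return sum(r == g for r, g in zip(reviewer, gold)) / len(gold)


def drifting(window_accuracies: Sequence[float], floor: float = 0.85) -> bool:
    """Flag a reviewer whose recent gold accuracy has slipped below the calibration floor."""
    recent = window_accuracies[-5:]  # last five calibration windows (assumed cadence)
    return len(recent) > 0 and sum(recent) / len(recent) < floor


if __name__ == "__main__":
    # Two reviewers grading the same eight model responses on a three-point rubric.
    r1 = ["pass", "pass", "fail", "borderline", "pass", "fail", "pass", "pass"]
    r2 = ["pass", "fail", "fail", "borderline", "pass", "fail", "pass", "borderline"]
    print(f"kappa = {cohen_kappa(r1, r2):.2f}")

    # One reviewer against gold items quietly seeded into their queue.
    gold = ["pass", "pass", "fail", "borderline", "pass", "fail", "fail", "pass"]
    print(f"gold accuracy = {gold_accuracy(r1, gold):.2f}")

    # Calibration trend over successive windows -- the "reviewer fatigue" signal.
    print("drift alert:", drifting([0.95, 0.93, 0.90, 0.84, 0.82, 0.80]))
```

None of this is sophisticated, which is the point: agreement, gold accuracy, and drift are cheap to compute per reviewer per week, so a vendor who can't produce those numbers on request probably isn't tracking them.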
The vendors that answer these cleanly have learned from Surge's playbook. The ones that dodge them will be undercut by a model in eighteen months.
Where we'd push back on the narrative
Two honest gaps worth naming, because the Surge story has been told in glossy form everywhere.
Profitable bootstrapped companies have valuations that reflect supply scarcity, not just product depth. Surge's $25B is partly a function of how few credible alternatives there are at frontier scale right now. Scale AI lost talent and customers after the Meta deal; Labelbox is mid-market; the rest of the field is fragmented. That premium is not guaranteed to hold once Anthropic, OpenAI, or Google verticalize their own annotator networks, and at least one frontier lab's posture this year suggests it is quietly doing exactly that.
Premium per-minute rates aren't the whole of the annotator economics. The 30–40 cents per working minute figure gets cited everywhere as evidence the labor is well-priced. The under-cited part is utilization: annotators aren't paid when there's no work, calibration drift removes them from the pool, and the vetting funnel is brutal. Reading the labor as uniformly premium misses how variable the actual annual income is.
Neither of these undermines the broader point. They just keep the narrative honest.
What we would do with this today
- Treat your eval data as an asset class. If your team has been writing one-off prompts and one-off graders, stop. The data your senior people produce when they grade model output is the most valuable durable asset you'll create this year. Version it, license it internally, keep ownership; the sketch after this list shows a minimal version of what that can look like.
- Run one fine-tune or RLHF pilot before procurement asks you to. The shape of the work — recruiting reviewers, calibrating a rubric, running a QA pass, measuring drift — is unfamiliar to most engineering orgs. Better to learn it on a small, low-stakes project than on the one that ships next quarter.
- Don't conflate the labeling vendor with the training vendor. A platform that gives you annotators is a tool. A team that designs the rubric, runs the pipeline, and stands behind the data quality is an outcome. The latter is what frontier labs buy. It's also what enterprises increasingly need to buy.
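To make the first item concrete (and the hash-stable deliverable from the procurement list earlier), here is a minimal sketch, again in Python, of what versioning that data can look like: a manifest that records a content hash, a version, a license, and a chain of custody for each dataset release. The schema and field names are assumptions for illustration, not a standard; tools like DVC or git-lfs can carry the same information, but the essential part is that the hash and the provenance travel with the data.

```python
"""Write a versioned, hash-stable manifest for an eval / SFT dataset release.

The manifest schema is illustrative -- field names are assumptions, not a standard.
"""
import hashlib
import json
from datetime import date
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Content hash of the dataset file, so any later edit is detectable."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(dataset: Path, version: str, owner: str,
                   license_name: str, reviewers: list[str]) -> Path:
    """Emit a manifest next to the dataset: hash, version, license, chain of custody."""
    manifest = {
        "dataset": dataset.name,
        "version": version,            # bump on every change; never overwrite a release
        "sha256": sha256_of(dataset),
        "license": license_name,       # an internal-use license still counts
        "owner": owner,                # the domain lead who owns the rubric
        "reviewers": reviewers,        # who wrote and graded the data
        "released": date.today().isoformat(),
    }
    out = dataset.parent / f"{dataset.name}.{version}.manifest.json"
    out.write_text(json.dumps(manifest, indent=2))
    return out


if __name__ == "__main__":
    # Hypothetical example: an underwriting eval set graded by two senior underwriters.
    path = Path("underwriting_evals.jsonl")
    path.write_text('{"prompt": "example case", "grade": "pass"}\n')  # stand-in content
    print(write_manifest(path, version="v3", owner="underwriting-lead",
                         license_name="internal-use-only",
                         reviewers=["senior_underwriter_1", "senior_underwriter_2"]))
```

A manifest like this is what makes "keep ownership" enforceable: six months from now you can say exactly which version of the data went into a fine-tune and who signed off on it.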
Sonnet Code's take
The training-data layer is the part of the AI stack least visible to product teams and most expensive to outsource at scale, which is why it commands the valuations that it does. Sonnet Code runs AI training engagements for clients who don't want to pay frontier-lab pricing for frontier-lab problems but who do need senior reviewers, calibrated rubrics, and managed pipelines applied to their own domain — RLHF data, SFT demonstrations, red-team prompts, and evaluations written by people who actually know the workflow. We staff the same pattern Surge sells, scaled to the kinds of problems an enterprise actually has: not "make Claude better at general coding," but "make our underwriting agent stop missing the edge cases our most senior underwriter would catch." If you're starting to feel the pull to fine-tune, RLHF, or stand up a domain-specific eval suite, the next conversation is about who owns the rubric and who writes the data — and that's the one we run.

