The numbers, in one paragraph
The May 2026 cycle of enterprise-AI surveys lands with a single, uncomfortable shape. Deloitte's State of AI in the Enterprise has 97% of executives saying their company deployed AI agents in the past year and 52% of employees actively using them. Only 29% report significant ROI from generative AI, and only 23% report significant ROI from agents specifically. 79% of organizations face challenges adopting AI — a double-digit increase year over year — and 54% of C-suite executives admit AI adoption is "tearing their company apart." A separate read of the agent landscape puts mature governance for autonomous agents at one in five organizations. AI investment is on track to clear $650B annually.
Compress those numbers and a single fact falls out: deployment is essentially universal, ROI is a quarter of the field, governance is a fifth, and spend is straight up. The interesting question stopped being "should we deploy agents?" two quarters ago. The interesting question is "why are three out of four enterprises that deployed agents not getting paid for it?"
Why the gap is structural, not temporary
There are three readings of the gap, and all three are partially right.
Reading 1: Bad procurement. A meaningful share of the deployment count is pilots dressed up as production — an agent shipped behind a feature flag, used by twelve people, never rolled out, never measured. That inflates the deployment number without producing the ROI to match. This reading is true, and it is the explanation enterprises tell themselves first because it's flattering: "we just need to ship more."
Reading 2: Wrong workload selection. Plenty of agents got deployed against workflows where the model wasn't the bottleneck. A team automates a workflow that was already 90% automated, and the marginal improvement is invisible. Or the team picks a high-visibility workflow — exec-level summarization, customer-facing content — that the agent does adequately but not reliably enough to remove the human from the loop. ROI requires either removing labor or expanding output, and many deployed agents do neither cleanly.
Reading 3: Missing scaffold. The agent works on a happy-path demo. In production, it hits a long tail of edge cases the integration layer can't handle, the eval suite never caught, and the governance model can't audit. The fix is not a better model — Opus 4.7, GPT-5.5, and DeepSeek V4 are all very good — it is the scaffolding around the model: routing, observability, the eval suite tied to actual workloads, the governance model that lets a regulated business actually ship.
The third reading is the one that explains why ROI is stuck at 23% even as model quality improved meaningfully across the last six months. The model is no longer the bottleneck. The scaffold is.
What "missing scaffold" actually costs
A working scaffold around an agent does five things — every one of them invisible in a procurement deck and every one of them load-bearing in production:
- Routing decides which workload goes to which model and falls back when the primary is degraded. Without it, the team is paying frontier prices for tasks Haiku could handle and is brittle on the day Opus has a regional outage (a minimal sketch follows this list).
- Observability logs prompts, model versions, tool calls, costs, and outcomes per request. Without it, the team can't tell which prompt change caused the regression they're seeing in week three.
- Eval suites replay actual workloads against the agent and grade the outputs against a workload-specific rubric. Without them, every model swap is a vibes-based decision and every prompt change ships unblessed.
- Governance answers: who can deploy a prompt change, who reviews a new tool, who approves a new model, what gets audited, what's logged for SOC 2 / SOX / HIPAA / whatever the regulator cares about. Without it, the program either freezes (compliance vetoes everything) or yolos (compliance hasn't caught up yet, but will).
- Cost ceiling caps per-agent and per-task spend so a single stuck task doesn't burn four figures on a stubborn loop. Without it, finance becomes the surprise stakeholder in month two.
All five are engineering work. None of them ship in the agent vendor's pitch deck. All five are where the gap between "deployed" and "earning back the deployment" actually lives.
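To make the routing and cost-ceiling items concrete, here is a minimal sketch in Python. The model names, prices, and the `agent_step` / `is_degraded` callables are assumptions standing in for your own client code, not any specific vendor's API; the point is that routing, fallback, and the spend cap are ordinary, testable code.

```python
# Minimal routing + cost-ceiling sketch. Model names, prices, and the
# agent_step()/is_degraded() callables are placeholders, not a vendor API.
from dataclasses import dataclass

@dataclass
class Route:
    primary: str             # model for this workload
    fallback: str            # cheaper or alternate model when the primary is degraded
    max_usd_per_task: float  # cost ceiling enforced per task

ROUTES = {
    "ticket_triage":      Route("frontier-large", "small-fast", max_usd_per_task=0.50),
    "exec_summary":       Route("frontier-large", "frontier-large", max_usd_per_task=2.00),
    "log_classification": Route("small-fast", "small-fast", max_usd_per_task=0.05),
}

def run_task(workload: str, task: str, agent_step, is_degraded, max_steps: int = 20) -> str:
    """Route one task, fall back if the primary model is degraded, and enforce
    the per-task cost ceiling across the agent's step loop."""
    route = ROUTES[workload]
    model = route.fallback if is_degraded(route.primary) else route.primary
    spent_usd, state = 0.0, task
    for _ in range(max_steps):
        state, cost_usd, done = agent_step(model=model, state=state)  # hypothetical step fn
        spent_usd += cost_usd
        if spent_usd > route.max_usd_per_task:
            # A stuck loop fails here, not in next month's invoice.
            raise RuntimeError(f"cost ceiling hit for {workload}: ${spent_usd:.2f}")
        if done:
            return state
    raise RuntimeError(f"step limit hit for {workload} after {max_steps} steps")
```

Because the routes and ceilings are plain code, a model swap or a price change becomes a reviewable diff with tests, which is the property the "routing layer as code" recommendation later in this piece depends on.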
What buyers should ask before approving the next agent project
If your CFO is asking why last year's AI budget hasn't shown up in operating leverage, these are the questions worth pressing on:
- Which workloads, specifically, are agents handling end-to-end without a human in the loop? If the answer is "none," the agent is augmenting a human and the ROI math is labor-hour deflation, not labor replacement. Both are valid; only one matches the slide deck.
- What's the per-workload eval suite, and when was it last run? If the answer is silence or "we run SWE-bench-style benchmarks," the team isn't measuring its own production reality. The eval that matters is replay of your tasks against the agent.
- What are the unit economics? Per-request cost, per-task cost, per-resolved-ticket cost. If the team can't answer these inside ten seconds, the program isn't measured (a minimal calculation is sketched after this list).
- Who owns the governance model? If it's "the AI team," the answer is incomplete — security, legal, finance, and the regulator all have legitimate stakes. The mature shops have a cross-functional review board that meets monthly.
- What's the rollback procedure for a bad prompt deploy? If the answer is "we don't deploy bad prompts," the team isn't being honest about the failure mode.
The vendors that can answer these cleanly have built the scaffold. The ones that dodge them are selling the model and outsourcing the rest to you.
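For the unit-economics question, here is a minimal sketch of the arithmetic, assuming the per-request log described in the scaffold section exists; the field names (`task_id`, `cost_usd`, `resolved`) are illustrative placeholders, not a standard schema.

```python
# Unit-economics sketch: derive per-request, per-task, and per-resolved-ticket
# cost from the per-request observability log. Field names are placeholders.
from collections import defaultdict

def unit_economics(request_log: list[dict]) -> dict:
    """Each row is one model or tool call, e.g.
    {"task_id": "T-17", "cost_usd": 0.04, "resolved": True}
    where 'resolved' marks the call that closed the ticket."""
    total_cost = sum(row["cost_usd"] for row in request_log)
    per_task = defaultdict(float)
    resolved = set()
    for row in request_log:
        per_task[row["task_id"]] += row["cost_usd"]
        if row.get("resolved"):
            resolved.add(row["task_id"])
    return {
        "cost_per_request": total_cost / len(request_log),
        "cost_per_task": total_cost / len(per_task),
        "cost_per_resolved_ticket": total_cost / len(resolved) if resolved else float("inf"),
    }
```

A team that can answer the question inside ten seconds usually has something like this running over every request log already.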
Where we'd push back on the doom narrative
Two gaps in the "AI is tearing companies apart" reading.
The denominator behind the 23% ROI number includes organizations that bought the wrong thing for the wrong reason. Some of those deployments were vanity projects, executive-mandated rollouts with no specific workload, or proofs of concept that nobody intended to monetize. The interesting number is conditional: of the enterprises that deployed an agent against a measurable workload with a defined success metric, what share saw ROI? That number — anecdotally, from inside the boutique shops we talk to — is much closer to 60–70%. The 23% headline tells you what enterprises bought; it does not tell you what works when bought correctly.
Mature governance in one in five organizations is fast, not slow, by historical standards. Cloud governance was below 20% mature five years into broad enterprise cloud adoption. Containerization governance was below 20% mature five years into Kubernetes adoption. Two years into broad agent deployment, 20% is roughly on track, not a crisis. The shops moving faster than that are not waiting for the industry to figure it out — they're hiring or contracting the scaffold work directly.
What we'd build differently this quarter
- Cut the agent portfolio. Pick three. A hundred deployed agents with no ROI is worse than three deployed agents with measurable revenue or labor leverage. Kill everything without a business case.
- Stand up the routing layer as code in your repo. Not in a vendor UI. The routing logic is where ROI compounds, and it needs to be yours, versioned, code-reviewed, with tests.
- Build a workload-specific eval suite and run it monthly. "The last 200 tickets my support team resolved, replayed through the agent, graded against the resolution we shipped." That's the eval. It tells you month over month whether the program is improving (a minimal replay harness is sketched after this list).
- Stand up the governance committee before the auditor asks. Cross-functional, monthly, with a documented prompt-change approval flow, a deprecation process, and an incident-review checklist. The committee you stand up now is cheaper than the consent decree you get later for not having one.
- Set unit economics targets per agent. Per-request cost, per-task cost, per-outcome cost. Deploy with a kill criterion. Most agent programs would benefit from a "kill if cost-per-resolved-ticket > $X by Q3" rule built in from day one.
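Here is a minimal replay harness for the eval-suite and kill-criterion items above. `run_agent` and `grade` stand in for your agent entry point and your workload-specific rubric; both are assumptions, not a particular framework.

```python
# Eval-replay sketch: re-run the last N resolved tickets through the agent,
# grade each output against the resolution that actually shipped, and apply
# the pre-agreed kill criterion. run_agent() and grade() are placeholders.

def monthly_eval(tickets: list[dict], run_agent, grade,
                 max_cost_per_resolved_usd: float) -> dict:
    """tickets: [{"id": "T-17", "input": "...", "shipped_resolution": "..."}, ...]"""
    passed, total_cost = 0, 0.0
    for t in tickets:
        output, cost_usd = run_agent(t["input"])       # replay the real workload
        total_cost += cost_usd
        if grade(output, t["shipped_resolution"]):     # rubric: does it match what shipped?
            passed += 1
    cost_per_resolved = total_cost / passed if passed else float("inf")
    return {
        "pass_rate": passed / len(tickets),
        "cost_per_resolved_usd": cost_per_resolved,
        # Kill criterion from the unit-economics item: a number agreed up front, not a debate.
        "kill": cost_per_resolved > max_cost_per_resolved_usd,
    }
```

Run monthly and charted month over month, this is the trend line the CFO conversation actually needs.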
Sonnet Code's take
The agent ROI gap isn't a model problem and it isn't really a procurement problem — it's a scaffolding problem, and most enterprises bought the part of the stack that vendors will sell them while leaving the load-bearing part (routing, eval suites, governance, observability, unit economics) on the wishlist. We staff that work on two sides: AI development, where we build the routing layer, the observability plumbing, the integration glue, and the cost-control logic that turns a deployed agent into a measurable line in the operating budget; and AI training, where senior domain reviewers author the workload-specific eval suites, golden patches, and red-team prompts that tell you — every month — whether your agents are still earning back their cost. If your CFO is sharpening pencils for an AI ROI conversation this quarter, the next investment is rarely the next model. It's the scaffolding that turns the models you already bought into ROI you can actually defend.

