MIT Project NANDA's State-of-AI-in-Business 2025 Review Lands the Vendor-vs-Internal-Build Pilot-Success Gap at 2x — 67% vs 33% Across 300+ Real Deployments — While IDC Pins AI-Agent POCs Below the Production Line at 88% and Deloitte's Tech Trends 2026 Replicates the Failure Band at 89%: Three Independent Measurement Programs Converged on the Same Number Inside a Year, and the Procurement Question for Every Engineering Org Whose 2025 Plan Said 'We'll Build It Ourselves' Just Changed Shape — the Default FY27 Operating Model is No Longer 'Assemble an Internal Squad, Pick a Frontier API, Build the Agent in-House' But 'Buy and Customize Through a Specialist Build Partner, Preserve the Senior-Engineering-Attention Budget for the Workload-Specific Exceptions, and Grade the Build-vs-Buy Decision Against the 2x Measured Survival Gap, Not Against the Headline Software Cost.'

What the June numbers actually say and the operating pattern that lands with them

Three measurement sources converged on the same finding inside the last 12 months, and the procurement conversation has not caught up yet. MIT Project NANDA's State of AI in Business 2025 review of more than 300 publicly disclosed enterprise AI deployments and 150 executive interviews found that 95% of enterprise generative-AI pilots delivered zero measurable return, and that vendor-built solutions succeed roughly 67% of the time while internal builds succeed roughly 33%: a 2x success gap that holds across industries, company sizes, and pilot budgets. IDC's parallel survey found that 88% of AI-agent proofs-of-concept never graduate to production. Deloitte's Tech Trends 2026 review pegged the pilot-to-production failure rate at 89% across enterprise environments. The three datasets tell the same story from three angles: the default 2025 operating model (assemble an internal squad, pick a frontier API, build the agent in-house against a custom workflow) is a pattern in which roughly nine out of ten pilots die, and the configuration that doubles the survival rate is a specialist vendor partner on the build side, not on the model side.

The operationally important pieces:

The 2x vendor-vs-internal gap is the structurally interesting number, not the 95% failure rate by itself. A 95% failure rate read in isolation says "AI agents don't work yet"; a 95% failure rate read against the 67%/33% split says "the agents work; the build approach is what fails." The procurement question for FY27 is no longer "should we invest in AI agents at all?" but "should we invest in the operating model that doubles our pilot's survival rate?" The same engineering team, with the same budget and the same workflow, has a measurably different production outcome when it works with a specialist build partner than when it runs as a pure-DIY squad.
The 88-95% pilot-failure-rate band is consistent across the three datasets (MIT NANDA, IDC, Deloitte), which is what makes the finding hard to dismiss as one consultancy's lens. The rate lands in the same band when MIT walks 300 deployments, when IDC surveys agent-specific POCs, and when Deloitte aggregates across enterprise environments. The cross-source agreement says the failure rate is the base rate of the default operating model, not a statistical artifact of any one survey methodology.
The failure mode the three datasets converge on is not model quality or compute budget. The failure-mode distribution surfaced by the interviews and post-mortems is workflow-fit drift, verification-contract absence, integration debt against the legacy stack, governance ambiguity at the prompt-and-data layer, and a senior-engineering team whose attention got consumed by the build instead of the business problem the build was supposed to solve. The frontier model is rarely the bottleneck; the operating model around the model is.
The "specialist partner" advantage in the 67% bucket is the attention-and-experience compounding the internal squad cannot replicate from a standing start. A partner that has shipped twenty enterprise agent integrations against three industries has twenty failure-modes-seen-and-fixed in muscle memory before the kickoff call; the internal squad has zero. The 2x gap is what twenty against zero looks like at the end of the pilot. The advantage is not vendor branding; it is the compounded learning curve a specialist team has paid for that the internal squad has to pay for from scratch, on the buyer's timeline, with the buyer's budget.

The structural read isn't "build-versus-buy is back as a procurement debate." It's that the June 2026 measurement consensus says the buy-side default for enterprise AI agents has measurably better odds than the build-side default, and that the buy-side advantage is not about the model the vendor sells but about the operating discipline the specialist build partner brings to the integration. The procurement spreadsheet that still has a single line item labeled "AI agent build team (internal)" rests on an assumption that the evidence base has measurably hardened against inside a year.

What the NANDA, IDC, and Deloitte numbers restructure about FY27 procurement

Four concrete shifts follow when the measured 2x vendor advantage becomes the base rate the FY27 plan grades against.

The build-vs-buy decision moves from capability to survival rate. Twelve months ago, the build-vs-buy debate was framed as "can the internal team build it?", the implicit assumption being that capability was the binding constraint. The MIT/IDC/Deloitte numbers reframe the debate as "what is the per-dollar survival rate of each path?": the binding constraint is the probability that the pilot reaches production with a measurable outcome, not the probability that the team can implement the prototype. The procurement question becomes the survival-rate question; the survival-rate question has a measured answer; the measured answer favors the specialist-partner path by 2x.

The "we have a strong internal team" defense stops grading well against the 33% number. The 33% internal-build success rate is not a number that improves materially with team seniority; the NANDA breakdown by team experience shows the gap holds across mid-tier and frontier-engineering organizations. The argument "our team is too senior to fail like the average team" is the same argument the 33% bucket's failed teams made before the pilot kicked off. The honest read is that internal team quality is necessary but not sufficient; the specialist-partner discipline is the additional variable that lifts the success rate from 33% to 67%. The teams that read the data honestly stop using team seniority as a justification for the DIY path.

The procurement-cycle-length conversation shifts from "build is faster" to "build burns more pilot cycles". The 2025 DIY default was often justified by "we don't want the procurement cycle for a specialist partner; the internal squad can start Monday." The 88% never-reach-production number reframes the calculation: starting Monday and dying in eight weeks consumes the same calendar quarter as starting four weeks late with a specialist partner and reaching production at the end of the quarter. The pilot cycle is the recurring cost; the procurement cycle is a one-time, front-loaded cost that the specialist-partner path amortizes against the 67%/33% odds.
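
The calendar argument above can be put into rough numbers. The sketch below is an illustration only: it assumes each pilot attempt is an independent trial whose success probability equals the quoted base rate (so the expected number of attempts is 1 / success rate), that a failed pilot is simply rerun, and that both paths use the eight-week pilot and four-week procurement delay mentioned in the text. The function name and parameters are hypothetical.

```python
def expected_weeks_to_production(success_rate, pilot_weeks, procurement_weeks=0.0):
    """Expected calendar weeks until one pilot reaches production.

    Assumption: repeated pilot attempts are independent trials with the same
    success probability (a geometric model), so the expected number of
    attempts is 1 / success_rate. Procurement is paid once, up front.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    expected_attempts = 1.0 / success_rate
    return procurement_weeks + expected_attempts * pilot_weeks


if __name__ == "__main__":
    # Base rates quoted in the article; the week counts are illustrative.
    internal = expected_weeks_to_production(0.33, pilot_weeks=8)
    vendor = expected_weeks_to_production(0.67, pilot_weeks=8, procurement_weeks=4)
    print(f"internal build: {internal:.1f} expected weeks")
    print(f"specialist partner: {vendor:.1f} expected weeks")
```

Under these assumptions the internal path averages roughly 24 weeks to a production outcome and the partner path roughly 16, even after the four-week procurement delay. Real pilots are not independent retries, so treat the output as a framing device rather than a forecast.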

The senior-engineering-attention bill becomes a first-class FY27 line item. The default DIY operating model assumes senior engineering attention is free internal capacity that does not need a budget line. The MIT NANDA interviews surface the opposite finding: the engineering organizations that ran DIY pilots and failed reported the same dominant cause — the senior engineers' attention got consumed by the build for two quarters, the business problem the build was supposed to solve drifted, and the rest of the engineering roadmap stalled. The specialist-partner path's hidden advantage is the senior-engineering-attention budget it preserves for the work the partner cannot do for the team. The FY27 plan that grades this honestly puts a number on the senior-attention bill and decides build-versus-buy against that number, not against the headline software cost.

Where the data is signal and where it is noise

Four honest reads on what the June 2026 measurement consensus tells the buyer.

Signal: the cross-source 88-95% agreement is operating-model evidence, not pilot-cohort variance. When three independent measurement programs — academic, industry-analyst, consultancy — converge on the same failure-rate band, the convergence is evidence that the rate measures the default operating model rather than the cohort of pilots that each program happened to sample. The signal the buyer should treat as load-bearing is the convergence, not any one source's number.

Signal: the 67%/33% vendor-vs-internal split is the procurement-decision-grade signal, even if the absolute success rate is contested. A buyer who disputes whether the success rate is 67% rather than 60% or 75% is missing the procurement-decision point: the 2x relative ratio between the two paths is the number the FY27 build-vs-buy decision should grade against, and the 2x ratio is robust across the sources. The decision the data supports is "prefer the specialist-partner path as the default unless there is a workload-specific reason internal build is structurally favored."

Noise: the failure-rate data does not say every internal build will fail. The 33% internal-build success rate is a base rate, not a deterministic outcome. Specific internal builds (workloads with deep domain specificity, workloads where the build IP is itself the competitive moat, workloads where the integration depth crosses systems no partner can practically learn) succeed at much higher rates inside the right team. The honest read of the data is "the specialist-partner path is the default; the internal-build path is the exception that needs an explicit workload-specific justification," not "internal build is impossible."

Noise: the data does not pick which specialist partner the team should hire. A 67% average vendor-success rate is a band that contains 90%-success specialists and 40%-success ones; the buyer's procurement diligence still has to grade the specific partner against the specific workload. The NANDA/IDC/Deloitte data shifts the default operating model; it does not replace the vendor diligence cycle.

What the FY27 planner should do inside the next quarter

Four concrete actions that close the gap between the June measurement consensus and the FY27 operating-model decision the data supports.

Run a base-rate-adjusted survival audit on every AI-agent pilot currently in the plan. For each pilot, mark it as internal-build or vendor-built and apply the 33% / 67% base-rate prior to the planning forecast. The audit's output is the expected-yield-adjusted FY27 AI-agent pipeline, against which the team can prioritize attention and budget. The audit is not the decision; it calibrates a decision the team is otherwise about to make against an uncalibrated prior.
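
A minimal sketch of what such an audit could look like in practice. It assumes the 33% / 67% figures can be used directly as per-pilot priors and that the planner already has an unadjusted value forecast for each pilot; the pilot names, values, and helper names are all hypothetical.

```python
from dataclasses import dataclass

# Base-rate priors taken from the success rates quoted in the article.
# Using them as per-pilot probabilities is an assumption of this sketch.
BASE_RATE = {"internal": 0.33, "vendor": 0.67}


@dataclass
class Pilot:
    name: str              # hypothetical pilot label
    path: str              # "internal" or "vendor"
    forecast_value: float  # planner's unadjusted forecast, in any consistent unit


def survival_audit(pilots):
    """Return (name, path, prior, expected_value) rows, highest expected value first."""
    rows = []
    for pilot in pilots:
        prior = BASE_RATE[pilot.path]
        rows.append((pilot.name, pilot.path, prior, prior * pilot.forecast_value))
    return sorted(rows, key=lambda row: row[3], reverse=True)


def expected_pipeline_value(pilots):
    """Sum of base-rate-adjusted values across the whole plan."""
    return sum(row[3] for row in survival_audit(pilots))


if __name__ == "__main__":
    # Illustrative plan; none of these pilots or figures come from the source data.
    plan = [
        Pilot("claims-triage-agent", "internal", 1_000_000),
        Pilot("support-deflection-agent", "vendor", 600_000),
        Pilot("contract-review-agent", "internal", 400_000),
    ]
    for name, path, prior, value in survival_audit(plan):
        print(f"{name:26s} {path:9s} prior={prior:.2f} expected={value:,.0f}")
    print(f"pipeline expected value: {expected_pipeline_value(plan):,.0f}")
```

The point of the ranking is that a smaller vendor-built pilot can outrank a larger internal one once the prior is applied. A workload-specific exception would justify overriding the prior for that row, which is the written justification the next action asks for.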

Identify the workload-specific exceptions where the internal-build path is structurally favored and write the justification down. The exceptions exist; the planner's job is to make them explicit rather than implicit. The written justification per exception forces the team to answer "why is this workload structurally different from the 33% base rate?"; the discipline of writing it down is what catches the exceptions that are really just default DIY under a different name.

Stand up the partner-vetting cycle as a first-class FY27 procurement workstream, not an end-of-quarter side project. The specialist-partner advantage requires the right specialist partner; the right partner is selected through a diligence cycle that grades the partner's workload-shaped track record, not the partner's pitch deck. The vetting cycle's deliverable is a shortlist of two-to-three partners per workload class, each with a reference engagement the team has walked end-to-end, each with a per-workload-class trial agreement the team can grade against.

Renegotiate the senior-engineering-attention budget against the build-vs-buy decision per workload. For each AI-agent workload in the FY27 plan, decide explicitly how much senior-engineering attention the workload should consume — and grade the build-vs-buy decision against that number alongside the software cost. The honest accounting of the senior-attention bill is what makes the specialist-partner path's preserved-attention-budget visible as a real advantage on the FY27 spreadsheet, not a hand-waved soft benefit.
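
One way to make the senior-attention bill visible is to price it and add it to the software cost before comparing paths. The sketch below is an assumption-laden illustration: the weekly cost of a senior engineer, the week counts, and the software figures are invented for the example, and dividing by the base rate to get a cost per expected success is a simplification rather than anything the cited reports compute.

```python
from dataclasses import dataclass


@dataclass
class PathEstimate:
    label: str
    software_cost: float     # licence, API, or partner fees (hypothetical figures)
    senior_eng_weeks: float  # senior-engineer weeks the path is expected to consume
    success_rate: float      # base-rate prior applied to this path


def total_cost(path, senior_week_cost):
    """Headline software cost plus the priced senior-attention bill."""
    return path.software_cost + path.senior_eng_weeks * senior_week_cost


def cost_per_expected_success(path, senior_week_cost):
    """Total cost divided by the path's success prior."""
    return total_cost(path, senior_week_cost) / path.success_rate


if __name__ == "__main__":
    senior_week_cost = 5_000  # hypothetical loaded weekly cost of one senior engineer
    paths = [
        PathEstimate("internal build", 100_000, 52, 0.33),
        PathEstimate("specialist partner", 300_000, 12, 0.67),
    ]
    for path in paths:
        print(
            f"{path.label:20s} "
            f"software={path.software_cost:,.0f} "
            f"total={total_cost(path, senior_week_cost):,.0f} "
            f"per_expected_success={cost_per_expected_success(path, senior_week_cost):,.0f}"
        )
```

With these invented inputs the partner path looks three times more expensive on software cost alone, costs the same once senior attention is priced in, and is cheaper per expected production outcome. Different inputs can reverse the result, which is why the comparison is worth running per workload.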

The senior-judgment work the specialist partner makes operationally cheap but does not replace

The specialist-partner path compresses the cost of learning the failure modes the partner has already paid for on twenty other engagements. It does not compress the senior-judgment work of choosing which agent workloads to invest in, writing the per-workload success criteria the team will grade the partner's work against, owning the integration into the production stack the team continues to operate, and deciding which workloads are the workload-specific exception where the internal-build path is structurally favored. The engineering organizations that confuse the cheapened learning curve with cheapened judgment will, six months from now, be reading post-mortems on pilots whose root cause is "we let the partner choose the workload, and the workload turned out to be the wrong battle." The organizations that keep the senior judgment at the center of the workload-selection decision will, six months from now, be in the 67% bucket, on the FY27 production-deployment side of the line rather than the FY27 prototype-graveyard side. The data is the leverage; the senior judgment is the load-bearing wall.

The procurement question is no longer "build or buy?"; it is which workloads belong in the specialist-partner default, which workloads belong in the workload-specific internal-build exception, and what senior-attention budget the team is willing to spend on either path. The teams that ask the right question this quarter buy themselves the 2x odds the data measures; the teams that ask the wrong one buy themselves another year of 88%-graveyard pilots.