Gemini 3.5 Flash Lands at Google I/O: 76% Terminal-Bench, $1.50/M

What Google actually shipped at I/O 2026 and why the Flash tier is doing the load-bearing work

The headline at Google I/O 2026 on May 19 was not that Google shipped a new Pro-tier model — it was that the Flash tier of the next Gemini generation outscored last cycle's Pro-tier flagship on the coding-agent benchmarks the team actually grades against. Gemini 3.5 Flash is a 1-million-token agentic coding model that posted 76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas, and an Elo of 1656 on GDPval-AA — every Google-announced number that the independent eval community has re-run inside the first five days matches to the decimal. The model runs at 289 tokens per second, roughly four times the throughput of the prior frontier flash tier, and ships at $1.50 per million input tokens and $9.00 per million output tokens — about 25% cheaper than Gemini 3.1 Pro, the model it beats on the coding-and-agent benchmarks.

The operationally important pieces:

The Flash tier is now the coding-frontier tier, not the cost-tier-cheap-and-fast tier. The historical reading of the Flash class was the small model for high-throughput non-frontier workloads. Gemini 3.5 Flash inverts the reading: the Flash class is the agentic-coding-frontier substrate, the throughput-per-second profile is the load-bearing operational property the long-running coding loop runs against, and the per-token cost is the secondary procurement input — not the primary one. The team that wires the Flash tier into the coding-agent route on the assumption it is the cheap-and-fast fallback misreads the substrate; the team that wires it in as the default agentic-coding execution surface with a 4x throughput envelope reads it correctly.
Terminal-Bench 2.1 at 76.2% is the substrate-grade signal for command-line coding workloads. Terminal-Bench is the closest published benchmark to the real engineering loop the coding agent runs — a shell, a filesystem, a multi-step task, a deterministic verifier. A 76.2% pass-rate at the Flash tier — historically a tier below the per-cycle Pro flagship — means the per-workload cost-curve for the coding-agent shape just stepped down a tier. The procurement-grade read is the workload that the team has been routing to a Pro-tier model is now routable to a Flash-tier model at 25% lower per-token cost and 4x throughput, not the workload still routes to the Pro-tier model because the team's per-prompt routing policy was written against last cycle's tier map.
The 289-tokens-per-second throughput is the substrate property the worktree-per-agent pattern runs against. The eight-parallel-worktree-agent pattern that Cursor 3 and Devin Local both standardized on is throughput-bound at the per-agent generation rate, not at the per-token cost. A 4x throughput envelope at the same accuracy band collapses the per-worktree latency tail the eight-parallel-agent pattern runs into; the team that grades the per-agent dispatch decision against the per-agent latency surface re-shapes the routing matrix against the throughput envelope, not against the per-token cost.
The independent-eval-on-day-five signal closed the diligence window faster than any prior frontier release. The first independent re-run of the Google-announced benchmarks landed inside five days of the keynote and matched to the decimal on every re-tested number — Terminal-Bench 2.1, MCP Atlas, CharXiv. The diligence window that historically took the procurement function six weeks to close (independent-eval, hallucination-rate audit, intelligence-index re-rank) compressed to under a week. The FY27 standing-contract clock starts faster than the customary cadence; the procurement function that defers the routing-matrix update by a quarter against a six-week diligence-window assumption is two quarters late on the substrate-shift the team can already feel inside the coding-agent loop.

The structural read isn't Google shipped a faster cheaper model. It is that the agentic-coding-frontier substrate's Pro-tier-to-Flash-tier compression collapsed two cycles into one, the 4x throughput envelope re-shapes the worktree-per-agent pattern's per-agent dispatch decision, and the per-prompt model-routing policy the FY27 plan was written against six months ago needs the per-workload-class re-routing the new Flash-tier substrate makes operationally cheap to land inside the next sprint, not the next quarter.

What the Flash-tier coding frontier restructures about the FY27 model-routing matrix

Four concrete shifts that follow when the Flash tier becomes the agentic-coding-frontier substrate and the procurement function has the per-prompt routing matrix already drafted against the prior tier map.

The per-prompt model-routing policy bifurcates the coding-agent surface into a Flash-throughput slot and a Pro-accuracy slot, with the default flipped against the prior cycle. Twelve months ago, the per-prompt routing policy for the coding-agent loop had the Pro tier as the default and the Flash tier as the cost-tier escape hatch. The Flash-tier-as-coding-frontier compression flips the default: the Flash tier is the default agentic-coding execution surface, and the Pro tier is the per-task escalation path for the workload class whose verifier coverage gap measured against the Flash tier the team cannot underwrite. The routing-policy artifact in the team's repo is the artifact that needs the per-cycle update; the team that ships the substrate without updating the routing policy ships the Pro-tier-cost-overhead the FY27 budget did not provision.

The worktree-per-agent dispatch decision becomes throughput-bound, not cost-bound. The eight-parallel-worktree-agent pattern Cursor 3 and Devin Local both standardized on is the substrate the team's coding-throughput grading runs against. The 4x throughput envelope at the Flash tier collapses the per-worktree latency tail the eight-parallel-agent pattern previously paid. The per-agent dispatch decision re-shapes from route to the per-token-cheapest model the verifier coverage gap survives to route to the per-second-fastest model the verifier coverage gap survives — and the per-second-fastest model on the agentic-coding benchmarks for the next two quarters is the Flash tier, not the Pro tier. The team's coding-agent throughput surface grades against the new tier-default, not the prior one.

The per-vendor portability envelope on the model-routing matrix is the FY27 standing-contract anchor, not the per-vendor lock-in. The four-vendor agentic-coding-frontier map now reads — Anthropic's Claude Opus 4.8 and Fable 5 / Mythos 5, OpenAI's GPT-5.6 Sol partner-preview, Google's Gemini 3.5 Flash and Gemini 3 Deep Think, and the open-weights frontier track DeepSeek V4 anchors at 1/20 the per-token cost. The substrate is portable across the four standing surfaces if the team's per-prompt routing policy is written against the workload class, not against the vendor name. The team that anchors the FY27 standing contract single-vendor against any one of the four pays the per-vendor cadence-slip risk the prior two cycles surfaced; the team that anchors dual-vendor with the per-workload routing policy underwrites the optionality the FY27 plan needs against the per-vendor cadence the substrate is shipping inside.

The independent-eval-on-day-five diligence-window compression becomes the FY27 procurement-function cadence-default. The procurement function that grades the per-cycle model release on a six-week diligence-window assumption is the function that ships the FY27 routing matrix two quarters late against the substrate-shift the engineering team can already feel. The new diligence-window cadence the procurement function has to run against is the five-day independent-eval window — the cadence the engineering team's per-prompt routing-policy code review runs against. The FY27 procurement playbook has to encode the compressed cadence explicitly; the teams that leave the cadence implicit pay the per-cycle substrate-shift overhead the playbook did not provision.

Where the Flash-tier substrate is signal and where it is noise

Four honest reads on what Gemini 3.5 Flash actually tells the buyer at the FY27 model-routing diligence review.

Signal: the per-workload throughput envelope is the substrate's load-bearing operational property, not the per-token cost. The 4x throughput envelope at the Flash tier is the property the worktree-per-agent pattern's per-second latency surface grades against; the per-token cost is the procurement-spreadsheet signal, not the operational one. The team that wires the substrate against the throughput envelope ships the coding-agent loop's per-second latency-tail compression; the team that wires it against the per-token cost only ships the cost-tier savings without the latency-tail compression and reads the per-cycle substrate-shift miss inside one quarter.

Signal: the independent-eval-day-five matching the announced numbers to the decimal is the trust-signal the procurement function can underwrite against. The historical procurement-function gap on per-cycle model releases was the gap between the announced benchmark and the independent-eval re-run — the gap the diligence window closed against. The five-day-eval-match-to-the-decimal is the trust signal the procurement function can compress the diligence window against; the team that runs against the new cadence-default ships the per-cycle routing-matrix update inside the substrate cycle, not after it.

Noise: the 61% hallucination-rate the independent-eval community measured is not a per-workload disqualifier. Independent eval also measured a 61% hallucination rate on free-form generation — a high number against the per-workload free-form-generation surface. The honest read is the substrate is not the right routing target for the per-workload free-form-generation slot, not the substrate is not the right routing target for any slot. The agentic-coding surface the substrate excels against runs against deterministic verifiers (test pass-rate, schema compliance, command-output match) — the surface the verifier coverage closes the hallucination tail against. The team that reads the hallucination number as a substrate-wide disqualifier misreads the per-workload routing question; the team that reads it as a per-workload routing input ships the per-prompt routing policy against the substrate's coding-strength workload classes.

Noise: the Intelligence-Index #8 rank Google did not disclose is not the procurement-decision input. The independent-eval community ranked Gemini 3.5 Flash at #8 on Intelligence Index, behind GPT-5.5 and Claude Opus 4.7. The Intelligence Index is a cross-workload aggregate; the procurement decision the FY27 plan runs against is the per-workload-class routing decision, not the cross-workload aggregate rank. The substrate is #1 on Terminal-Bench 2.1 at the Flash tier; the per-workload routing decision against the coding-agent surface grades against Terminal-Bench, not Intelligence Index. The team that procures against the aggregate index ships the wrong routing decision against the per-workload surface; the team that procures against the per-workload benchmark ships the routing decision against the substrate's coding-strength workload class.

What the engineering team should do inside the next sprint

Four concrete actions that close the gap between the Flash-tier coding-frontier substrate and the FY27 model-routing matrix the substrate requires.

Update the per-prompt routing policy in the coding-agent loop against the Flash-tier-as-default flip inside the next sprint. The routing-policy artifact in the team's repo is the load-bearing artifact the per-cycle substrate-shift lands inside. Update the default-route against the agentic-coding workload class from the Pro tier to the Flash tier, write the per-workload-class escalation path against the Pro tier for the verifier-coverage-gap workloads, and ship the policy with the per-cycle review cadence the next substrate-cycle re-validates against. The team that ships the policy update inside the next sprint takes the per-cycle cost-and-throughput translation against the substrate; the team that defers ships the per-cycle gap to the FY28 procurement review.

Re-grade the eight-parallel-worktree-agent pattern's per-agent dispatch decision against the new throughput envelope. The per-agent dispatch decision the worktree-per-agent pattern runs against is the artifact the per-cycle throughput envelope re-shapes. Re-grade the per-agent latency surface against the 4x Flash-tier throughput envelope, re-shape the per-worktree concurrency-cap against the per-agent generation rate the new substrate provides, and ship the per-dispatch decision artifact the team's coding-throughput grading runs against. The team that re-grades the pattern this sprint translates the throughput envelope into the per-agent-pipeline-cycle latency tail compression; the team that defers ships the per-cycle latency overhead the prior tier-map did not absorb.

Run a per-workload-class shootout on the Flash tier against the Pro tier across the team's top-three coding workload classes inside two weeks. The procurement-decision-grade artifact the FY27 plan needs is the per-workload-class shootout — for each of the team's top-three coding workload classes (multi-file refactor with explicit test contracts, dependency upgrade against the explicit-version-pin map, structured-extraction against the deterministic verifier), the per-workload-class pass-rate, per-workload-class time-to-completion, per-workload-class per-token-cost, and per-workload-class verifier-coverage-gap against the Flash tier compared to the Pro tier. The shootout is the artifact the per-prompt routing policy update grades against; the team that runs the shootout this sprint ships the routing-matrix update against the per-workload-class signal, not against the cross-workload aggregate rank.

Write the per-vendor portability envelope on the routing policy against the four-vendor frontier map inside the FY27 standing-contract negotiation. The standing-contract anchor the per-cycle substrate-shift surface grades against is the per-vendor portability envelope — the per-vendor cadence-slip-budget the team will tolerate to keep the per-workload routing policy portable across the four vendor surfaces. Write the envelope against the four-vendor frontier map (Anthropic, OpenAI, Google, the open-weights track), the per-vendor cadence-cycle-cap against the per-cycle substrate-shift the team has already measured against two prior cycles, and the per-quarter substrate-switch budget the standing contract underwrites against the per-vendor cadence-slip risk. The team that writes the envelope explicitly into the FY27 negotiation buys the per-vendor portability the per-cycle substrate-shift makes operationally cheap; the team that does not write the envelope explicitly pays the per-cycle substrate-switch cost the first time one of the four vendors slips a cadence quarter.

The senior-judgment work the per-cycle substrate-shift makes operationally cheap but does not replace

The Flash-tier coding-frontier substrate compresses the cost of running the per-prompt routing policy against last cycle's tier map, paying the per-second latency tail the prior throughput envelope did not absorb, and grading the per-cycle substrate-shift on the six-week diligence-window cadence the procurement playbook was written against. It does not compress the senior judgment of deciding which workload classes are Flash-tier-shape and which are Pro-tier-shape, writing the per-workload-class verifier the per-prompt routing policy grades against, owning the per-vendor portability envelope the FY27 standing contract underwrites, and running the per-cycle substrate-shift code review against the team's per-prompt routing policy. The teams that confuse the cheapened per-token cost for the cheapened judgment are the teams that route the per-workload free-form-generation surface against the substrate whose hallucination rate the verifier coverage gap does not close, and read the per-cycle production-reliability post-mortem on the routing-policy gap the per-workload-class shootout would have surfaced. The teams that keep the senior judgment at the center of the per-prompt routing-policy decision are the teams that translate the per-cycle substrate-shift into the per-week throughput surface the prior tier map could not produce. The substrate is the leverage; the senior judgment is the load-bearing wall.

The model-routing question is no longer which model is the cycle's flagship; it is which workload classes the Flash-tier substrate is the default-route for, which workload classes the Pro-tier substrate is the escalation-path for, which per-vendor portability envelope the FY27 standing contract underwrites against the four-vendor frontier map, and which per-cycle substrate-shift cadence the per-prompt routing policy grades against. The teams that ask the right question this sprint translate the per-cycle substrate-shift into the per-week throughput surface; the teams that ask the wrong one ship the per-cycle gap to the FY28 procurement review.

At SONNET CODE we run the AI Development engagement against the per-prompt model-routing policy artifact — per-workload-class shootouts against the four-vendor frontier map, per-vendor portability envelopes on the FY27 standing contract, and per-cycle substrate-shift code reviews against the team's coding-agent loop. If your team's per-prompt routing policy is still written against last cycle's tier map, schedule a call — we'll walk you through the per-workload-class routing-matrix update we ship inside one sprint.