Sonnet Code
← Volver a todos los artículos
AI Development23 de junio de 2026·11 min read

Google Made Gemini 2.5 Pro Deep Think Generally Available on the API, AI Studio, and Vertex AI on June 22 — the 2-Million-Token Context Window and the Extended-Reasoning Deep Think Mode That Doubles Compute on Hard Problems Are Now Broadly Accessible at Roughly $2.50/$15 per Million Tokens for the Standard Mode and Roughly 4x That for Deep Think, with 89.8% on MMLU-Pro, 82.4% on GPQA Diamond, 76.4% on SWE-bench Verified, and 94.1% on HumanEval+ — the Frontier-Reasoning Slot of the Routing Matrix Every Buyer Maintains for the FY27 Plan Just Gained a Third Credible US-Lab Entry Beside Anthropic Opus 4.7 and OpenAI GPT-5.5, and a Fourth That Is No Longer Reachable With Claude Fable 5 and Claude Mythos 5 Export-Suspended as of June 12.

What Google shipped on June 22 and the routing-matrix slot it fills

On June 22, 2026, Google made Gemini 2.5 Pro Deep Think generally available on the Gemini API, AI Studio, and Vertex AI — the 2-million-token context window and the extended-reasoning Deep Think mode that doubles compute on hard problems are now broadly accessible to every buyer with a Google Cloud contract, every developer with an AI Studio key, and every integrator that already built the routing-layer plumbing into the Vertex AI control plane. Pricing lands at roughly $2.50 input / $15 output per million tokens for the standard mode and roughly 4x that for Deep Think — the multimodal text/code/image/audio/video/structured-data surface is the same one Gemini 2.5 Pro already shipped on, and Deep Think is the toggle on top, not a separate SKU.

The benchmark set Google published is the one the buyer's routing matrix has to grade:

  • MMLU-Pro: 89.8% — the highest of any publicly available model at GA on this date.
  • GPQA Diamond (graduate-level science): 82.4% — the regulated-industry signal worth grading honestly when the per-vertical failure tail is dense in technical reasoning the commodity-tier model can't follow.
  • SWE-bench Verified: 76.4% — the agentic-coding signal the procurement spreadsheet now reads alongside Anthropic's Opus 4.7 and OpenAI's GPT-5.5 entries.
  • HumanEval+: 94.1% — the highest HumanEval+ figure ever recorded at the time of the GA.

The frontier-reasoning slot of the procurement-grade routing matrix every buyer maintains for the FY27 plan just gained a third credible US-lab entry beside Anthropic Opus 4.7 and OpenAI GPT-5.5 — and a fourth that is no longer reachable, with Claude Fable 5 and Claude Mythos 5 export-suspended as of June 12, 2026 by the US government export-control directive that landed three days after Fable 5's June 9 launch. The routing-matrix question for the next release cycle is not which lab leads the leaderboard headline; it is which lab fills the buyer's per-vertical failure-tail with a model the buyer can still get tomorrow — and Google just signaled the answer for the reasoning slot.

Why Deep Think changes the buyer's routing-matrix math

Deep Think is not a separate model. It is an extended-inference toggle on Gemini 2.5 Pro that allows the model to spend significantly more compute on complex problems before generating a response — the same shape Anthropic shipped as adaptive thinking on Fable 5 and OpenAI shipped as the "reasoning effort" knob on GPT-5.5. The pricing premium — roughly 4x standard rate — encodes the compute multiplier directly into the buyer's per-token line item, and the toggle becomes a routing-matrix decision the buyer's senior-judgment overlay grades per-vertical:

  • For the easy-to-grade vertical tail (high-volume customer-support classification, structured-extraction pipelines, deterministic-tool-use orchestration), the standard-mode 2.5 Pro at $2.50/$15 is what the routing layer should serve. Deep Think on the easy tail is the 4x premium on top of the work the model already finishes at the standard rate — the buyer that routes every request through Deep Think because "the benchmark headline is higher with the toggle on" is the buyer that lights the cloud bill on fire for no per-vertical accuracy improvement the senior-judgment overlay can detect.
  • For the hard-to-grade vertical tail (graduate-level technical reasoning, multi-step regulated-domain decisions, agentic-coding tasks where the failure mode is silent-wrong-completion rather than visibly-stuck), the Deep Think toggle is the routing-matrix entry the per-vertical senior-judgment overlay must grade against Opus 4.7 with extended thinking and GPT-5.5 with reasoning-effort dialed up. The 76.4% SWE-bench Verified figure with Deep Think is the headline that gets the slot into the diligence sprint; the per-vertical pass rate on the buyer's own gold set is what wins it.
  • For the 2-million-token slot, the routing-matrix call is whether the long-context tail of the buyer's vertical is one the cheaper-per-token Gemini 2.5 Pro should serve in standard mode (large-codebase comprehension, multi-document regulated-vertical retrieval, long-trace agentic workflow audit) versus paying the Opus 4.7 / GPT-5.5 premium for the same context-window envelope. The 2M window is not a frontier-only feature anymore; it is a routing-matrix lever for the buyer that has the senior-judgment overlay to grade the long-context tail honestly.

The vendors that the buyer's routing matrix can still call tomorrow at the frontier-reasoning + long-context + agentic-coding trifecta as of this GA: Anthropic Opus 4.7 with extended thinking (the model-quality default the routing matrix grades everything else against), OpenAI GPT-5.5 with reasoning-effort (the Terminal-Bench 2.0 leader at 82.7% and the lowest-hallucination-rate frontier model in the public benchmark set), and Google Gemini 2.5 Pro with Deep Think (the newest entry, the cheapest per token at the standard rate, the only one with the 2M context-window envelope at this price point). The buyer that runs the FY27 routing-matrix update with all three grading honestly against the per-vertical gold set wins the per-vertical cost curve; the buyer that anchors on a single lab because the procurement contract is already signed loses the cost curve and pays the Deep-Think premium on the wrong vertical tail because the routing-layer fallback was never tuned.

What the AI Studio + Vertex AI distribution choice signals

Google did not gate Deep Think behind a separate enterprise SKU or a closed Vertex AI preview. The GA shipped across all three of the Gemini API, AI Studio, and Vertex AI surfaces simultaneously — the developer-grade key path, the consumer-prototyping path, and the enterprise-control-plane path. That is a distribution signal the procurement spreadsheet has to read carefully:

  • Vertex AI is the enterprise-grade surface where the routing matrix gets enforced inside the buyer's existing Google Cloud envelope: VPC service controls, customer-managed encryption keys, audit logging, IAM-bound model access, residency-region pinning. The Deep Think toggle is now a Vertex AI routing-layer feature flag the buyer's central platform team owns, not a developer-team escape hatch.
  • AI Studio is the rapid-prototyping surface where a vertical product team can prove the per-vertical pass-rate hypothesis against the standard mode and Deep Think mode in the same afternoon, without filing a Vertex AI tenant request. The downstream operational handoff to the Vertex AI envelope is the buyer's responsibility — and the buyer that does not have the Vertex-AI-to-AI-Studio prompt-and-routing-config-portability path well-rehearsed by the end of Q3 is the buyer that ships the per-vertical prototype on AI Studio and then cannot promote it to the Vertex AI envelope without re-grading the per-vertical senior-judgment overlay from scratch because the routing-matrix config is encoded against the wrong surface.
  • The Gemini API is the model-vendor-direct path the integrator owns when the buyer's preferred control-plane envelope is not Vertex AI — the multi-cloud routing layer the buyer maintains because the procurement function does not single-sign on a single hyperscaler. The Gemini API surface gets the same pricing and the same GA at the same time; the difference is that the audit-logging, VPC-service-controls, IAM-binding, and residency-region-pinning are the buyer's responsibility to wire into the integrator's substrate, not Google's.

The three-surface simultaneous GA tells the buyer the model is not a Vertex-only product anymore; it is a routing-matrix entry the buyer can plumb through whichever control-plane envelope the FY27 procurement contract has already encoded.

The per-vertical diligence sprint to run this quarter

The standard buyer-side failure mode after a frontier-reasoning GA looks like this: the platform team reads the benchmark headline, files a routing-matrix ticket to try the new model on the high-cost workload, the routing-layer feature flag flips for one vertical, the per-vertical pass rate on the gold set does not visibly improve, the platform team concludes the new model is no better and reverts. The vertical that did benefit from Deep Think — the hard-to-grade tail where the senior-judgment overlay was the only thing catching the silent-wrong-completions — never gets the routing-matrix update, and the buyer down the road who did run the per-vertical diligence sprint honestly ships the per-vertical cost curve 6 months before the buyer who flipped the feature flag once.

The diligence sprint the buyer who reads this announcement carefully runs this quarter:

  1. Re-grade the per-vertical gold set against Gemini 2.5 Pro standard mode, Gemini 2.5 Pro Deep Think mode, Opus 4.7 with extended thinking, and GPT-5.5 with reasoning-effort dialed to the matching premium. Pass-rate parity at the same effective compute-per-request is the headline number; the per-vertical failure-tail composition (silent-wrong-completion rate, instruction-drift rate, hallucination rate, tool-call-correctness rate) is the slot-level grade the routing matrix actually encodes.
  2. Compute the per-vertical effective cost per accepted output — not the per-token rack rate, the per-vertical cost net of the senior-judgment-overlay rework rate. The vertical where Gemini 2.5 Pro standard mode beats Opus 4.7 on cost-per-accepted-output is the routing-matrix entry the buyer's FY27 plan should encode; the vertical where Deep Think pays for itself against the senior-judgment-overlay rework rate is the one the routing matrix should toggle on; the vertical where Opus 4.7 still pays for itself is the one the buyer should not rebalance just because the new entry shipped.
  3. Plumb the 2M-context envelope into the long-context tail of the vertical workload — large-codebase comprehension, multi-document regulated-vertical retrieval, long-trace agentic workflow audit. The per-vertical pass-rate at the 2M boundary is the slot-level grade the buyer's routing matrix has been missing because the previous routing matrix did not have a frontier-quality 2M model at this price point to grade against.
  4. Refresh the senior-judgment-overlay calibration for each vertical where the routing-matrix entry changed. The overlay's prior calibration was against the previous routing-matrix incumbent; the new entry needs new gold sets, new failure-mode rubrics, and a refreshed senior-reviewer queue cadence. The buyer that flips the feature flag without refreshing the overlay is the buyer that loses the per-vertical accuracy gain because the overlay is now catching the wrong failure modes.

What the export-suspended Fable 5 slot means for the routing matrix

The routing-matrix slot that the GA of Gemini 2.5 Pro Deep Think also changes is the one that Fable 5 was supposed to fill for the buyer that pre-graded the FY27 plan against the model that shipped on June 9 and got export-suspended on June 12. The 95% SWE-bench score Fable 5 hit was the headline; the export-suspension is the procurement-reality the buyer has to encode in the FY27 plan: the model exists, the buyer cannot acquire it on a forward-flowing contract today, and the routing-matrix slot the buyer pre-graded against it has to be backfilled by a model the buyer can still get tomorrow.

Gemini 2.5 Pro Deep Think is one of the three candidates the buyer can still get for the frontier-reasoning slot — alongside Opus 4.7 and GPT-5.5, both of which also remain reachable on forward-flowing contracts. The buyer that ran the FY27 plan against Fable 5 in the what we wish we had column and against Opus 4.7 in the what we can still get column now has a third entry in the second column the diligence sprint can re-grade against. The buyer that did not pre-grade the what we can still get column and is now staring at a routing-matrix gap is the one that has to run the full per-vertical diligence sprint in Q3 instead of Q4 — and pays the procurement-timing penalty of running the sprint against the FY27-plan deadline instead of ahead of it.

What it means for the integrator and the AI-training overlay

The model-vendor announcement reshuffles the routing-matrix headline; the per-vertical diligence sprint is the work that translates the headline into a procurement decision the buyer can defend in the FY27 committee. That sprint has three components the buyer almost always under-resources:

  • Per-vertical gold-set curation and rubric refresh — the assets the senior-judgment overlay actually grades against. The headline benchmark is generic; the per-vertical gold set is the buyer's own failure-tail composition, encoded by domain experts who can tell the silent-wrong-completion from the visibly-correct-completion at the same surface representation. The integrator that does not maintain the per-vertical gold set as a first-class asset is the integrator that re-grades the routing matrix against a stale rubric and loses the diligence-sprint conclusion to the buyer down the road that did.
  • Senior-judgment overlay calibration — the human-in-the-loop pool the buyer routes the per-vertical hard tail to, and whose calibrated decisions feed the next gold-set refresh. The overlay is not a backstop; it is the calibration layer for the routing matrix itself, and the buyer that runs the overlay against a generic crowd-worker bench loses the per-vertical accuracy gain because the overlay's senior judgment does not match the per-vertical failure mode the new entry would have caught.
  • Routing-matrix plumbing across the AI Studio / Vertex AI / Gemini API / Bedrock / Foundry / direct-vendor-API surfaces — the multi-surface, multi-vendor control-plane envelope the buyer's FY27 plan actually executes against. The integrator that built the routing matrix as a Vertex-only feature flag does not have the plumbing to grade the Opus 4.7 entry against the Gemini 2.5 Pro Deep Think entry against the GPT-5.5 entry on the same per-vertical workload; the integrator that built the routing matrix as a vendor-agnostic substrate is the one whose buyer can run the diligence sprint in days, not quarters.

The buyer that walks into Q3 with the per-vertical gold set, the senior-judgment overlay calibrated for the new entry, and the routing-matrix plumbing portable across all three surfaces is the buyer that turns the June 22 announcement into a compounding per-vertical cost-curve advantage. The buyer that flips the feature flag on the high-cost workload, observes no headline improvement, and reverts is the buyer that loses the per-vertical accuracy gain the slot was designed for — and discovers it six months later when the buyer down the road ships the per-vertical pass-rate report that closes the procurement decision the FY27 plan was supposed to encode.

The frontier-reasoning slot just gained a credible third entry that the buyer can still get tomorrow. The per-vertical diligence sprint is the work that decides which entry the FY27 plan should route the per-vertical workload through. The buyer that runs the sprint honestly wins the cost curve; the buyer that anchors on the headline benchmark loses it.