Sonnet Code
AI & Machine Learning · May 25, 2026 · 9 min read

Gemini 3.5 Flash Made Speed the Frontier — Why 289 Tokens a Second Changes Which Agents You Can Actually Ship

The release, in one paragraph

At Google I/O 2026 on May 19, Google launched Gemini 3.5 Flash and pointed almost the entire keynote at one property: speed. The model runs at roughly 289 output tokens per second, about four times faster than Opus 4.7 or GPT-5.5, while, by Google's account, outperforming last quarter's Gemini 3.1 Pro across most benchmarks. It's the engine underneath Antigravity 2.0 and the new Managed Agents in the Gemini API, and Google was explicit about why: this is the model built for action, not chat. TechCrunch's read of the launch was that Google is betting its next AI wave on agents rather than chatbots. Flash is the foundation of that bet, because a model that is both fast enough and cheap enough is what makes a per-call managed agent viable on cost and on latency.

The surprising line isn't "Google made a faster model." Everyone ships a faster small model eventually. The surprising line is that speed, not raw capability, was the headline — and that a Flash-tier model is now claimed to beat the previous quarter's Pro-tier model while running four times faster. That combination quietly retires a planning assumption a lot of teams have been carrying since 2024: that you trade quality for speed, that the fast model is the dumb model. When the fast model is also the smart-enough model, the entire calculus of which model goes in which loop changes — and the workloads that were latency-bound demos last quarter become things you can actually put in front of users.

Why a tokens-per-second number is an architecture decision in disguise

A single chatbot reply is forgiving about latency. A user asks one question, waits a couple of seconds, reads the answer. The model's speed barely matters because there's exactly one inference in the loop. Agents break that forgiveness, because an agent is a latency multiplier.

A real agentic task is a loop: the model reasons, decides to call a tool, waits for the tool, observes the result, reasons again, maybe calls another tool, and only then produces an answer. A task that takes eight model turns runs eight inferences end to end, and the user waits for the sum. At Pro-tier speed, a single turn feels instant in a demo, but eight of them in sequence become a thirty-second stare at a spinner in production. Below a certain tokens-per-second threshold, multi-step agents aren't just slow; they're unshippable, because no user tolerates a thirty-second wait and no finance team tolerates paying premium per-token rates for eight turns of a long-running loop.

This is why a 4x speedup is not a 4x convenience. It's a threshold crossing for a whole class of workloads. The agent that needed twelve seconds per response at the old speed needs about three at the new one, provided model generation rather than tool wait time dominates the loop, and three seconds is the difference between "we shipped it" and "we shelved it." Speed doesn't make existing agents marginally nicer; it moves specific agents from the demo pile to the production pile. And because Flash is also cheap, it moves them there at a unit cost that survives contact with a real usage bill, which is the other half of why those loops were shelved.
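The arithmetic behind that threshold can be sketched in a few lines. This is a back-of-the-envelope model, not a benchmark: the turn count, tokens per turn, and tool wait time are illustrative assumptions, and the "old" speed is simply 289 divided by four, following the "about four times faster" claim. The function name and parameters are hypothetical.

```python
def loop_latency_seconds(turns, tokens_per_turn, tokens_per_second,
                         tool_seconds_per_turn=0.0):
    """Rough wall-clock time for a sequential agent loop.

    Assumes every turn generates the same number of output tokens and
    waits the same time on tools; ignores network and prompt-processing time.
    """
    generation = turns * tokens_per_turn / tokens_per_second
    tools = turns * tool_seconds_per_turn
    return generation + tools


FAST_TPS = 289.0          # figure quoted in the article
SLOW_TPS = FAST_TPS / 4   # assumption: "about four times" slower

# Hypothetical workload: 8 turns, 108 output tokens each, no tool wait.
slow = loop_latency_seconds(8, 108, SLOW_TPS)
fast = loop_latency_seconds(8, 108, FAST_TPS)
print(f"slow: {slow:.1f}s, fast: {fast:.1f}s")

# Same workload with one second of tool wait per turn (also an assumption).
slow_tools = loop_latency_seconds(8, 108, SLOW_TPS, tool_seconds_per_turn=1.0)
fast_tools = loop_latency_seconds(8, 108, FAST_TPS, tool_seconds_per_turn=1.0)
print(f"with tools: slow {slow_tools:.1f}s, fast {fast_tools:.1f}s")
```

Only the generation term shrinks with a faster model. Once tool wait is added, the end-to-end speedup drops below 4x, so it is worth measuring where a given agent actually spends its time before re-costing it.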

The "fast and good enough" trap

Here's the part the benchmark table won't tell you. "Beats 3.1 Pro on most benchmarks while running 4x faster" is a genuinely strong claim, and it's also exactly the kind of claim that gets teams in trouble if they read it as "use Flash for everything now."

Benchmarks mostly measure single-turn correctness; agents fail on the long tail. A model that wins the average benchmark can still degrade on the specific multi-step, tool-heavy, edge-case-laden tasks your product actually runs, and an agent compounds small per-turn error rates across a loop. A 2% chance of a wrong turn is invisible in a one-shot benchmark, but if turns fail independently it compounds to roughly a 15% failure rate over an eight-turn task. The honest test of "good enough" is your own trajectory eval on your own workload, not the launch slide.
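The compounding is easy to check. The sketch below assumes each turn fails independently with the same probability and that any single wrong turn sinks the task; real agents can sometimes recover mid-loop, so treat the result as a rough upper bound rather than a prediction. The function name is hypothetical.

```python
def task_failure_rate(per_turn_error, turns):
    """Probability that at least one turn goes wrong, assuming independent turns."""
    return 1.0 - (1.0 - per_turn_error) ** turns


one_shot = task_failure_rate(0.02, 1)
eight_turn = task_failure_rate(0.02, 8)
print(f"1 turn: {one_shot:.1%}, 8 turns: {eight_turn:.1%}")
```

The same function is a quick way to work backwards: pick the task-level failure rate you can tolerate, then see what per-turn error rate your trajectory eval would need to demonstrate.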

Speed changes the failure mode, not just the latency. A faster model means the agent makes more attempts in the same wall-clock window. If your verification gate is weak, that's more wrong attempts landing faster — speed amplifies whatever your correctness discipline already is, in both directions. Teams that adopt a fast model without a strong eval layer don't get faster good outcomes; they get faster bad ones.

The cheap-and-fast model invites scope creep. When inference is fast and cheap, the temptation is to let the agent loop more, retry more, fan out more — and the per-call savings evaporate into a higher call count. Fast and cheap is a budget you can blow precisely because it feels free. The cost discipline that mattered when tokens were expensive matters more, not less, when they're cheap, because the volume goes up.
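A minimal cost sketch makes the break-even point concrete. The prices below are made-up placeholders, not any vendor's real pricing, and the model counts output tokens only; the function names are hypothetical.

```python
def cost_per_task(calls, tokens_per_call, price_per_million):
    """Output-token cost of one task, in the same currency as the price."""
    return calls * tokens_per_call * price_per_million / 1_000_000


def break_even_calls(baseline_calls, baseline_price, cheap_price):
    """Call count at which the cheaper model costs as much as the baseline."""
    return baseline_calls * baseline_price / cheap_price


BASELINE_PRICE = 10.0  # hypothetical placeholder, per million output tokens
CHEAP_PRICE = 2.5      # hypothetical placeholder, per million output tokens

limit = break_even_calls(8, BASELINE_PRICE, CHEAP_PRICE)
print(f"savings are gone at {limit:.0f} calls per task")

baseline_cost = cost_per_task(8, 500, BASELINE_PRICE)
cheap_cost_at_limit = cost_per_task(int(limit), 500, CHEAP_PRICE)
print(baseline_cost, cheap_cost_at_limit)
```

With a 4x price gap, an eight-call task can grow to 32 calls before the cheap model costs as much as the expensive one did. Retries and fan-out can reach that multiple quickly, which is why a per-task call budget is worth setting explicitly.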

The lock-in question the launch raises

There's a strategic footnote worth flagging. In the same window, Google said consumer access to Gemini CLI and the Gemini Code Assist IDE extensions ends on June 18, 2026 for AI Pro, AI Ultra, and free-tier users — a consolidation toward the Antigravity/Managed-Agents surface. That's a useful reminder for anyone about to wire a production agent to a single vendor's fast model: the speed is real, and so is the platform risk. Betting your agent's economics on one vendor's Flash-tier model is reasonable; betting your architecture on the assumption that the surrounding surfaces won't be deprecated under you is the part to design against. The defensible posture is a model-agnostic abstraction where the fast model is a swappable component, not a load-bearing wall.
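One way to keep the fast model swappable is to have agent code depend on a small interface of your own rather than on a vendor client. The sketch below uses Python's `typing.Protocol`; `TextModel`, `EchoModel`, and `run_step` are hypothetical names, and `EchoModel` is a stand-in where a real adapter would wrap a vendor SDK.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface the agent code is allowed to depend on."""

    def generate(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend for tests; a real adapter would wrap a vendor SDK here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def run_step(model: TextModel, prompt: str) -> str:
    # Agent logic sees only TextModel, never a vendor-specific client.
    return model.generate(prompt)


fast = EchoModel("fast-tier")
fallback = EchoModel("fallback")
print(run_step(fast, "plan next action"))
print(run_step(fallback, "plan next action"))
```

Because `run_step` takes any object with a matching `generate` method, replacing the backend is a change at the call site, not a rewrite of the agent.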

What to actually do about it

  • Re-run your shelved agent backlog against the new latency budget. Workloads you killed for being too slow or too expensive last quarter deserve a second look — some of them just crossed the threshold. Re-cost them at Flash-tier speed and price before assuming they're still infeasible.
  • Validate "good enough" on your trajectory, not the benchmark. Build or reuse a multi-turn eval on your real tasks and edge cases. Adopt the fast model where your eval says it holds, not where the launch slide does.
  • Tighten the verification gate before you raise the loop count. A faster model with a weak eval is a faster way to ship wrong answers. Make the gate strong first; let the agent loop more second.
  • Keep the model swappable. Abstract the model behind your own interface so Flash is a component you can replace — both because a better/cheaper option will land next quarter and because vendor surfaces get deprecated, as the CLI shutdown just demonstrated.
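The third recommendation above, gate first and loop count second, can be sketched as a bounded retry loop in which nothing leaves the loop unverified. This is an illustrative shape under stated assumptions, not a full agent framework: `run_with_gate`, `attempt`, and `verify` are hypothetical names, and the toy lambdas stand in for a model call and a real correctness check.

```python
from typing import Callable, Optional


def run_with_gate(
    attempt: Callable[[int], str],
    verify: Callable[[str], bool],
    max_attempts: int,
) -> Optional[str]:
    """Return the first candidate that passes verification, or None.

    max_attempts is a hard cap, so a fast, cheap model cannot loop forever.
    """
    for i in range(max_attempts):
        candidate = attempt(i)
        if verify(candidate):
            return candidate
    return None


# Toy stand-ins: the "agent" only gets it right on its third try.
answers = ["41", "forty-two", "42"]
result = run_with_gate(lambda i: answers[i], lambda s: s == "42", max_attempts=3)
capped = run_with_gate(lambda i: answers[i], lambda s: s == "42", max_attempts=2)
print(result, capped)
```

Raising `max_attempts` is only safe to the degree that `verify` is trustworthy: with a weak check, a higher cap just lets more wrong candidates through faster.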

Sonnet Code's take

Gemini 3.5 Flash is the moment speed stopped being the thing you trade away for quality and became the thing that decides which agents are shippable at all. That's a real unlock: a class of multi-step, tool-using workloads that were too slow and too expensive to run last quarter just became viable, and teams should go re-examine the backlog they shelved for exactly those reasons. The trap is reading a strong benchmark-and-speed claim as permission to skip the part that actually decides whether the agent is safe to ship — the eval on your own tasks and the gate on your own pipeline.

That's the work we do. AI development at Sonnet Code is the engineering that turns a fast model into a production agent — the latency budgeting, the cost-attribution layer, the model-agnostic abstraction that keeps you from getting locked to one vendor's surface, and the loop design that doesn't blow the cheap-token budget on runaway retries. AI training is the senior-practitioner side: the engineers and domain experts who build the trajectory evals that tell you whether "beats Pro on most benchmarks" is true for your workload, and who author the correctness criteria a fast agent runs against. If I/O made your team wonder whether the agents you shelved for being too slow are suddenly back on the table, the next conversation is about which ones genuinely crossed the threshold — and what verification has to exist before a fast, cheap agent runs against your real systems.