Sonnet Code
The Sonnet Code Blog · Page 14

Engineering notes from the field.

Essays and field notes on AI, software engineering, design, and the craft of building product teams that ship. Written by the engineers doing the work.

AI Development · 9 min read

Google Antigravity 2.0 and Managed Agents Land at I/O 2026 — Agent Execution Is Now an API Primitive, and the Build-vs-Buy Line Just Moved

At I/O 2026 on May 19, Google shipped Antigravity 2.0 — a standalone agent-first desktop app, plus a CLI, an SDK for hosting custom agents on your own infrastructure, and Managed Agents in the Gemini API that spin up a reasoning, tool-using, code-executing agent in an isolated Linux sandbox with a single API call, all powered by the Gemini 3.5 Flash agent harness. The headline framing is "Google goes agentic." The substance is one tier deeper: agent execution — the sandbox, the harness, the tool loop, the isolation — just became a managed primitive you rent by the API call instead of infrastructure you build and operate. That collapses weeks of platform work to a single call, and it quietly moves the build-vs-buy line under every internal agent platform a team started in 2025. The interesting question is no longer "can we run an agent." It's "what do we still own when the runtime is rented."

Sonnet Code Editorial Team · May 24, 2026
AI & Machine Learning · 8 min read

The Frontier Took a Breath in May 2026 — When Models Converge, Your Eval Suite and Routing Layer Are the Moat, Not the Model

May 2026's most important AI story is the one that didn't happen: no frontier-scale capability jump from Anthropic, Google, Meta, Mistral, or the Chinese labs. The April Intelligence Index ceiling held, and the action moved to architecture, efficiency, and product defaults — Gemini 3.5 Flash running 4x faster while beating last quarter's Pro, Qwen 3.7 Max undercutting the leaders on price-per-quality, the top coding models clustered within a couple of benchmark points. The headline framing is "the frontier is plateauing." The substance is one tier deeper: when models converge, picking the "best" one stops being a strategy, because they're all good enough. The durable advantage moves to the two things that don't commoditize — the eval suite that proves a model works on your workload, and the routing layer that sends each request to the cheapest model that passes. The teams still shopping for the smartest model are optimizing the variable that just stopped mattering.

Sonnet Code Editorial Team · May 24, 2026
AI Development · 9 min read

Codex Goal Mode Hits GA and Locked Computer Use Lands — Autonomous, Objective-Driven Agents Are Now an Enterprise SKU

On May 22 OpenAI graduated Goal Mode to general availability across the Codex app, IDE extension, and CLI; shipped Locked Computer Use that lets Codex keep operating after the user's Mac locks; opened a Secure MCP Tunnel for on-prem MCP servers; and rolled out 90+ new plugins (Atlassian Rovo, GitLab Issues, CircleCI, Microsoft Suite, Neon). The headline framing is "Codex gets more autonomous." The substance is one tier deeper: the agent that's expected to run for hours or days against a stated objective — over locked machines, through enterprise-tunneled MCP servers, with audit-grade safeguards — just stopped being a research demo and started being a procurement-defensible product. For teams whose AI roadmap has been quietly assuming "someone will let an agent run overnight against a goal," the conversation about who owns the objective, who reviews the trajectory, and who signs off on the rollback path is the one to schedule this week.

Sonnet Code Editorial Team · May 23, 2026
AI Training · 8 min read

ICLR 2026's "Reasoning Trap" Paper: Training Models to Reason Harder Made Tool Hallucination Worse — The Eval Rubric Is the Fix, Not the Next Model

A paper presented at ICLR 2026 in Rio, "The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination," lands at the exact moment 96% of enterprises are running AI agents in production. The finding is counter-intuitive: post-training a model to reason longer and more thoroughly increases the rate at which it invents tool calls, fabricates parameters, and references functions that don't exist. Frontier hallucination on citation-accuracy tasks still averages 12.4% with extended thinking enabled. The headline framing is "reasoning has a tradeoff." The substance is one tier deeper: the eval rubric that grades tool-call correctness — not the choice of model — is the lever that brings hallucination under control. Teams treating reasoning as a free upgrade are about to learn it isn't.

Sonnet Code Editorial Team · May 22, 2026
AI Development · 8 min read

MCP's 2026-07-28 Release Candidate Drops the Sticky-Session Requirement — The Protocol Just Became Cloud-Native, and Your Architecture Has a Decision to Make

On May 21 the Model Context Protocol team published the release candidate for the 2026-07-28 spec, the largest revision since launch. It introduces a stateless core that scales on ordinary HTTP load balancers, an MCP Apps extension for server-rendered UIs, a Tasks extension for long-running work, and OAuth/OpenID Connect-aligned authorization. The final spec ships July 28, with a ten-week window for Tier 1 SDK maintainers to ship support. The headline framing is "MCP grows up." The substance is one tier deeper: every remote MCP server architected around sticky sessions, shared session stores, and gateway-level deep packet inspection just got an upgrade path that lets it run behind a plain round-robin load balancer. That's a real cost reduction. It's also a real migration project, and the teams that wait until July 28 to start planning it will spend the back half of the year carrying two stacks.

Sonnet Code Editorial Team · May 22, 2026
AI & Machine Learning · 9 min read

Eval-as-a-Service Just Crossed $1B — Domain-Expert RLHF Is the New Procurement Line, Not the Research Budget

A joint IDC/Forrester projection released this week puts enterprise eval-as-a-service spending past $1B for 2026. The category spans Scale AI, Surge, Mercor, Snorkel, and Labelbox, plus a wave of vertical specialists (health.eval and legal.eval both launched this week with credentialed-expert RLHF marketplaces). The pattern across all of them: enterprise AI procurement now carries a dedicated line item for "domain-expert review," separate from the model license, separate from the engineering build, owned by someone whose job description includes "defends the rubric." The teams shipping production AI reliably in 2026 aren't the ones with the best models. They're the ones whose evals were authored by senior practitioners with credentials the executive committee recognizes.

Sonnet Code Editorial Team · May 21, 2026