Sonnet Code
The Sonnet Code Blog · Page 10

Engineering notes from the field.

Essays and field notes on AI, software engineering, design, and the craft of building product teams that ship. Written by the engineers doing the work.

AI Development · 9 min read

Google Shipped Gemini Spark at I/O 2026 — a 24/7 Personal AI Agent That Lives on a Cloud VM, Has Its Own Email Address, and Executes Tasks Inside Third-Party Apps via MCP. The Always-On Agent Just Became a Product Category — and Every SaaS Surface You Own Is Now an Agent Surface.

At Google I/O 2026 on May 19, Google announced Gemini Spark — a personal AI agent that runs continuously on dedicated Google Cloud virtual machines, gets its own dedicated email address inside Workspace, drafts and replies on behalf of the user, monitors inboxes against playbook rules, and executes multi-step tasks inside third-party apps via the Model Context Protocol. Unlike the request-response Gemini chatbot, Spark persists when the user closes the tab and runs on schedules, conditions, and standing instructions — a 7 AM briefing assembled from Gmail and Calendar context, a continuous monitor on customer-inquiry inboxes, an unattended follow-up sequence for sales touches. Beta access opened the week after I/O on the newly reduced $100/mo AI Ultra tier, with deeper MCP integrations rolling through the coming months. The structural read isn't 'Google shipped an agent.' It's that the always-on personal agent — an agent with its own network identity, that acts asynchronously, and that consumes integration surface from every SaaS product it touches — just became a real consumer product category. Here's what changes about the MCP surface every B2B SaaS team needs to expose, and why the 'who is the user?' question on your audit logs is about to mean something different than it has.

Sonnet Code Editorial Team · June 1, 2026
Developer Tools · 9 min read

Cursor 3 Shipped Composer 2.5 — an In-House Long-Horizon Coding Model That Matches Opus 4.7 and GPT-5.5 at 1/10 the Cost. The IDE-vs-Model-Lab Boundary Just Collapsed, and the Vendor Stack Got One Layer Shorter.

On May 18, 2026, Cursor shipped Composer 2.5 inside Cursor 3: its own production-grade long-horizon coding model, built on Moonshot's open-source Kimi K2.5 checkpoint with 25× more synthetic training tasks than Composer 2, retrained for the behavioral qualities standard benchmarks miss (effort calibration, sustained long-horizon work, communication style). The headline numbers: 79.8% on SWE-Bench Multilingual and 63.2% on CursorBench v3.1, matching Claude Opus 4.7 and GPT-5.5 on coding evals at roughly one-tenth the cost per token ($0.50/M input and $2.50/M output on the standard tier; $3.00/M and $15.00/M on Fast). The structural read isn't 'a cheaper model.' It's that an IDE vendor just produced a frontier-class coding model in-house, the editor-vs-lab boundary collapsed in a single product release, and the dependency relationship between editor companies and frontier labs just inverted for an entire product surface. Here's what changes when your code editor stops being a thin client to someone else's model API and starts owning the model itself, and why every multi-vendor portability conversation you had three months ago now needs an additional axis.

Sonnet Code Editorial Team · June 1, 2026
AI Development · 9 min read

GPT-5.5 Hit Codex With a 400K Context Window and an In-App Browser. The Foundation Layer Is Now Three Frontier Models Deep — and Stack Portability Just Became Strategy.

Between April 23 and May 7, 2026, OpenAI rolled GPT-5.5 across the ChatGPT and Codex line, brought Codex up to a 400K context window with an in-app browser the agent can drive, made GPT-5.5 Instant the new default for free-tier users, and shipped voice-reasoning API models with real-time translation and transcription. Across the surrounding six weeks, Anthropic pre-announced Claude Mythos and ran the Opus 4.8 → Mythos sequence in front of enterprise buyers, and Google shipped Gemini 3.5 Flash with a 4× speed advantage on agentic benchmarks. Three frontier-model vendors, three flagship releases, three rough-parity stories on the workloads most enterprises deploy. The 'which one wins?' question is now functionally obsolete, and the strategic question has moved to 'how much of your stack would break if you swapped?' For most teams the honest answer is more than they'd like, because the integration was written against vendor-specific tool calling, the prompts were tuned to one model's behavior, and the evals were calibrated on one vendor's output distribution. Here's what stack portability looks like as engineering in mid-2026, and why the teams that build the portability layer this quarter will ride every capability bump three weeks faster than the teams that don't.

Sonnet Code Editorial Team · May 31, 2026
Developer Tools · 8 min read

Windsurf 2.0 Put Cloud Agents and Local Agents in the Same Kanban View. The 'Where Do I Run This?' Question Just Became a UI Toggle — and the Orchestration Layer Moved Into the Editor.

Windsurf 2.0 (April 15, 2026) shipped two surfaces that change the daily shape of agentic engineering work. The Agent Command Center is a Kanban view of every agent you have running — local Cascade sessions and remote Devin VMs side by side — grouped into per-project Spaces that bundle agent sessions, open PRs, files, and context. Devin, the Cognition autonomous cloud engineer, is now bundled into every paid Windsurf plan (Pro $20/mo, Max $200/mo, Teams). The pricing move tells the story: Windsurf is no longer competing on price-per-seat against Copilot's $10 plan — the $200 Max tier is what an organization pays to give a developer both in-IDE agents and a cloud autonomous engineer in one workspace, on one bill, with one audit trail. The structural consequence isn't the editor. It's that the local-vs-cloud agent decision — which used to be a developer skill buried in a CLI — just became a UI toggle, and the use of cloud autonomous agents is about to climb sharply because the activation energy collapsed. Here's what 'agent fleet management' looks like as a default discipline now that it ships as a default surface.

Sonnet Code Editorial Team · May 31, 2026
Developer Tools · 9 min read

Google I/O 2026 Shipped Antigravity 2.0, Gemini 3.5 Flash, and a Managed Agents Tier in the Same Week. The Agent Runtime Just Stopped Being a Build-Your-Own Decision.

At Google I/O 2026 on May 19, Google released Gemini 3.5 Flash (76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas, 4× faster output than other frontier models) and shipped Antigravity 2.0 in the same keynote — what was a single AI-powered IDE a year ago is now a full platform with a desktop app, a Go-based CLI replacing Gemini CLI on June 18, an SDK that exposes the same agent harness powering Google's own products, and a Managed Agents tier in the Gemini API. The Managed Agents tier lets a developer hand the API a multi-step task and have it executed in a Google-hosted sandboxed Linux container with shell, browser, and computer-use tools attached. The structural read isn't that Google shipped a faster model. It's that the agent runtime — the sandboxed VM, the tool harness, the parallel-subagent scheduler — just stopped being something serious teams build themselves. It became a vendor primitive you rent, and the rent option just became vastly more credible. Here's what the platform shift means for AI development and AI training roadmaps, and where the lock-in surface actually moved.

Sonnet Code Editorial Team · May 31, 2026
AI Training · 9 min read

Claude Mythos Is Too Capable to Ship — So Anthropic Stood Up Project Glasswing. Vetted Human Experts Just Became the Gating Layer for Frontier AI Deployment.

Claude Mythos Preview can identify and exploit zero-day vulnerabilities in every major operating system and web browser when directed to — many of them ten or twenty years old, the oldest a now-patched 27-year-old OpenBSD bug. It can chain minor vulnerabilities into full system control on its own. Central banks have held emergency briefings over the implications for legacy financial-system code. So Anthropic isn't shipping it through a standard API rollout. Instead, the company stood up Project Glasswing — an industry consortium of 40+ vetted organizations that maintain critical software (CrowdStrike is a founding member), granted monitored access to run Mythos against their own systems before any wider release. The structural story isn't that one model is dangerous. It's that the deployment model for frontier AI just changed shape. The gating layer is no longer 'ship to API, see what happens' — it's vetted human experts using the model under monitoring, with access tiered by the reviewer's domain credentials. This is what the operational reality of 'responsible AI' looks like in 2026 — and what every enterprise pretending to have that function on the org chart is now going to have to actually staff.

Sonnet Code Editorial Team · May 30, 2026