Essays and field notes on AI, software engineering, design, and the craft of building product teams that ship. Written by the engineers doing the work.

A new VentureBeat-cited survey puts a number on something every engineering leader has been quietly tracking: 43% of AI-generated code changes still require manual debugging in production after passing QA and staging. Amazon's March outages (1.6M errors on March 2, then a 99% drop in US order volume on March 5, both traced to AI-assisted code merged without proper approval) are the executive-visible version of the same gap. Amazon's response was a 90-day code safety reset across 335 critical systems. The pattern isn't "AI coding tools are unsafe." It's that the eval, review, and gating layer that catches AI-generated regressions before production is the binding constraint on whether the productivity gains turn into shippable software or into incident reports.

On May 18 Cursor shipped Composer 2.5, a specialized coding agent built on Moonshot's open-source Kimi K2.5 checkpoint with 25× more synthetic training data than its predecessor. It scores 79.8% on SWE-Bench Multilingual and 63.2% on CursorBench v3.1 — within a point of Claude Opus 4.7 and ahead of GPT-5.5 on some axes — at $0.50 per million input and $2.50 per million output tokens. That's roughly one-tenth the per-token cost of the frontier tier. For teams that have spent twelve months sending every coding workload to the most expensive Opus or GPT slot they could justify, the question isn't whether the specialized tier is good enough. It's which workloads should have moved there last quarter, which ones still need the frontier, and who owns the routing call.
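To make the pricing gap concrete, here is a minimal Python sketch of the per-token arithmetic. The Composer 2.5 prices are the ones quoted above; the frontier prices are an assumption, set to exactly 10× only because the text says "roughly one-tenth", and the workload sizes are hypothetical. Substitute your own rate card and traffic numbers before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """Price in USD per million tokens, split by direction."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Return the USD cost of the given token volumes."""
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000


# Prices quoted in the text for Composer 2.5.
COMPOSER_2_5 = TokenPrice(input_per_million=0.50, output_per_million=2.50)

# ASSUMPTION: the text only says "roughly one-tenth", so this 10x rate card
# is illustrative and is not a published price for any model.
FRONTIER_ASSUMED = TokenPrice(input_per_million=5.00, output_per_million=25.00)


def monthly_cost(price: TokenPrice, requests: int,
                 input_tokens_per_request: int,
                 output_tokens_per_request: int) -> float:
    """USD cost of a month of traffic at a fixed per-request token shape."""
    return price.cost(requests * input_tokens_per_request,
                      requests * output_tokens_per_request)


if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests, 20k input and 2k output tokens each.
    workload = dict(requests=10_000,
                    input_tokens_per_request=20_000,
                    output_tokens_per_request=2_000)
    specialized = monthly_cost(COMPOSER_2_5, **workload)
    frontier = monthly_cost(FRONTIER_ASSUMED, **workload)
    print(f"specialized tier: ${specialized:,.2f}")
    print(f"assumed frontier: ${frontier:,.2f}")
```

The useful output is less the absolute figures than the per-workload difference: running the same function over each workload's real token shape shows which ones are worth a routing decision.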

Surge AI quietly passed $1B in annual revenue while bootstrapped and profitable, and has only now turned to external capital for the first time, at a $25B+ valuation. Mercor employs 30,000+ expert contractors at an average rate of $95/hour and disburses more than $1.5M per day to evaluators. Scale AI's partial Meta acquisition triggered a client exodus that reshaped the entire labeling market in twelve months. The headline framing is "data-labeling firms are growing fast." The substance is one tier deeper: the senior-domain-expert layer of the AI stack (the people who write rubrics, grade trajectories, author golden examples, and red-team the agents) has become the binding constraint on model and product quality. The vendor that owns that bench owns the bottleneck.

On May 14 Anthropic announced that, effective June 15, the Claude Agent SDK and every third-party harness running on top of it — Claude Code GitHub Actions, OpenClaw, Conductor, Zed, Jean, the lot — will leave the standard Claude subscription pool and move to a separate "Agent SDK Credit pool" billed independently. Pro plans get $20 of agent credits a month; Max, $100; Team, $150 per seat; Enterprise, $200 per seat. The headline framing is a billing change. The substance is one tier deeper: the agent runtime layer just got carved out as its own procurement surface, with its own meter, its own SKU, and its own competitive battlefield. Every team building Claude-powered agents in production has roughly four weeks to decide whether their agent cost model still works.
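One way to start checking whether an agent cost model still works is to work out how many agent runs the included credit covers. The sketch below uses the per-plan credit amounts quoted above. Two things in it are assumptions: the per-run cost is a hypothetical figure you would measure from your own usage, and Pro and Max are treated as single-user plans. It says nothing about what happens once credits run out, which the announcement summary above does not cover.

```python
# Monthly Agent SDK credit in USD, as quoted in the text.
AGENT_CREDIT_USD = {"pro": 20, "max": 100, "team": 150, "enterprise": 200}

# Team and Enterprise are quoted per seat.
# ASSUMPTION: Pro and Max are treated as single-user plans.
PER_SEAT_PLANS = {"team", "enterprise"}


def monthly_credit_usd(plan: str, seats: int = 1) -> int:
    """Total included agent credit per month for a plan, in USD."""
    if plan not in AGENT_CREDIT_USD:
        raise ValueError(f"unknown plan: {plan!r}")
    if seats < 1:
        raise ValueError("seats must be at least 1")
    multiplier = seats if plan in PER_SEAT_PLANS else 1
    return AGENT_CREDIT_USD[plan] * multiplier


def runs_covered(credit_usd: int, cost_per_run_cents: int) -> int:
    """Whole agent runs the credit pays for (integer cents avoid float drift)."""
    if cost_per_run_cents <= 0:
        raise ValueError("cost_per_run_cents must be positive")
    return (credit_usd * 100) // cost_per_run_cents


if __name__ == "__main__":
    # Hypothetical: a 10-seat Team plan and an agent run costing 40 cents.
    credit = monthly_credit_usd("team", seats=10)
    print(f"included credit: ${credit}")
    print(f"runs covered: {runs_covered(credit, cost_per_run_cents=40)}")
```

Comparing the result with last month's actual run count gives a first read on whether the included credit is a rounding error or the bulk of the budget.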

On April 16 Anthropic released Claude Opus 4.7 at Opus pricing, with materially better software engineering, sharper vision, and a built-in safeguard layer that blocks high-risk cybersecurity prompts by default. The line in the press cycle most production teams skimmed past is the one that matters: Anthropic conceded on the record that Opus 4.7 is not as broadly capable as its unreleased Mythos preview. For teams that have spent twelve months arguing "use the most capable model" versus "use the safest model that does the job," the gap between what shipped and what's being held back just stopped being theoretical. It's a routing decision you can either make on purpose or have made for you by your procurement chain.

On April 30 Microsoft launched the Legal Agent inside Word — purpose-built for contract review and negotiation, playbook-aware, citation-backed, with Claude as the underlying model running through Microsoft 365's subprocessor stack. It's not Copilot reskinned for lawyers. It's a vertical agent with workflow-specific UX, structured tracked-changes output, and a playbook concept that turns every law firm's internal standards into an enforceable review pattern. Three weeks later, the takeaway every product team should be holding on to: the template for "vertical agent inside a horizontal tool, powered by a frontier model, with a domain-expert-authored rubric underneath" just got shipped at the scale of Microsoft Word, and the playbook is now legible.