Essays and field notes on AI, software engineering, design, and the craft of building product teams that deliver. Written by the engineers who do the work. Posts in English.

At Google I/O on May 19, 2026, Google shipped Antigravity 2.0, a standalone desktop app and CLI built for orchestrating multiple agents in parallel, alongside Managed Agents in the Gemini API, which gives every agent an isolated Linux environment to do its work in, with Gemini 3.5 Flash running at 4× the speed of comparable frontier models. The product page reads like a better IDE. It isn't. What Google actually shipped is a fleet console: one developer supervising a roster of agents on a queue of scheduled work, in parallel, in isolated sandboxes. That's a different job than "writing code with autocomplete," and the skills it demands are operations skills: specifying work crisply, observing parallel execution, designing review that scales, and scoping permissions so eight agents in your repo aren't eight breaches waiting to happen. The leverage is real. The failure modes are operational, plural, and confident. Here's what changes, and what you have to build above the tool to get value out of it.

Anthropic shipped Claude Opus 4.7 on April 16, 2026, with SWE-bench Verified jumping from 80.8% to 87.6% and SWE-bench Pro climbing 10.9 points — the largest single-model coding gap of the year. The score is genuine, and the model is the strongest agentic coder in production right now. But the inference everyone is rushing to — that shipping AI-built software got 7 points better — is quietly wrong. The benchmark assumes the issue is specified, the repo is intelligible, the tests are the right tests, the review is competent, and the deploy is safe. Every one of those is work *your team* still has to do. The model jumped. The scaffolding around it didn't, unless someone built it. This is what the gap between leaderboard gains and production velocity actually looks like — and why the bottleneck just moved one step up the stack.

At Google I/O on May 19, 2026, Google introduced Gemini Omni: a natively multimodal model that takes images, audio, video, and text into a single core engine and generates high-quality video grounded in real-world knowledge — with conversational editing, improved physical world understanding, and a non-optional SynthID watermark baked into every clip. The leap isn't that AI can make video; it's that multimodal generation has become a primitive you can build a product on instead of a research demo. But the moment generation gets easy, the bottleneck moves: judging whether the output is correct, on-brand, and safe is a human-judgment problem, and provenance — proving what was machine-made — becomes a product requirement, not a nice-to-have. For teams building multimodal into a real product, the generation is the easy part. The evaluation and the provenance layer are where the work now lives.

On May 21, 2026, xAI shipped Grok Build, a terminal-native agentic coding CLI that spawns up to 8 concurrent agents on Grok 4.3 with a 2-million-token context window and a plan-review gate before edits apply. It joins Anthropic's Claude Code and OpenAI's Codex CLI, and the moment there are three near-identical tools, the tool itself stops being the decision. The interesting question is no longer "which CLI" but "which model for which task, and how do you wire any of them into a workflow your team trusts." The agentic terminal just became a commodity layer. The durable engineering work moved up a level: routing, evaluation, and the review discipline that keeps autonomous edits from becoming autonomous mistakes.

On May 14, 2026, PwC and Anthropic expanded their alliance: 30,000 PwC professionals certified on Claude Code and Cowork, a joint Center of Excellence, and a new Claude-native finance group, with rollout aimed at a global workforce of hundreds of thousands. The results they're quoting are real: insurance underwriting cycles compressed from ten weeks to ten days, delivery improvements of up to 70%. But the part worth studying isn't the headcount or the metrics. It's what the size of the training program tells you: the frontier model was the easy part. The hard, expensive, year-long part is teaching tens of thousands of people to actually work with agents, and rebuilding the workflows around them. For any company trying to put AI to work at scale, the constraint stopped being which model and became how fast your people and your processes can absorb it.

Microsoft shipped Copilot Studio computer-use agents to general availability on May 13, 2026: agents that operate real software by clicking and typing, now deployable against a production SLA with audit logging and credential isolation. The capability is genuinely here. So is the ceiling: the best models top out around 72–78% on OSWorld-Verified, which means roughly one in four benchmark tasks still fails, and the odds get worse when a process chains several such tasks without scaffolding. The gap between a compelling demo and a process you'd trust unattended isn't closed by a better model. It's closed by evals, governance, and a human-review surface for the cases the agent shouldn't decide alone, which is exactly the engineering most teams skip.
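
To make the 72–78% ceiling concrete, here is a rough back-of-the-envelope sketch of how a success rate in that range compounds when a process chains several agent tasks. It rests on two loud assumptions that are not from the benchmark itself: that each chained task succeeds independently, and that the benchmark rate is a fair stand-in for the per-task rate in your process. The function name `chained_success` is hypothetical.

```python
def chained_success(per_task_success: float, tasks: int) -> float:
    """Probability that `tasks` chained tasks all succeed.

    Assumes every task succeeds independently with the same probability,
    which is a simplification: real failures are often correlated.
    """
    if not 0.0 <= per_task_success <= 1.0:
        raise ValueError("per_task_success must be between 0 and 1")
    if tasks < 0:
        raise ValueError("tasks must be non-negative")
    return per_task_success ** tasks


if __name__ == "__main__":
    # Illustrative rates taken from the 72-78% range quoted above.
    for rate in (0.72, 0.78):
        for tasks in (1, 3, 5, 10):
            end_to_end = chained_success(rate, tasks)
            print(f"rate={rate:.2f} tasks={tasks:2d} end-to-end={end_to_end:.1%}")
```

Under these assumptions, even the top of the range leaves a five-task chain succeeding well under half the time, which is the case for putting evals and a human-review surface between the steps rather than only at the end.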