GPT-5.5 Hit Codex With a 400K Context Window and an In-App Browser. The Foundation Layer Is Now Three Frontier Models Deep — and Stack Portability Just Became Strategy.

The release, compressed

In the six weeks between April 23 and May 7, 2026, OpenAI shipped three things that, taken together, finish a chapter most enterprise AI roadmaps had been treating as an open question.

GPT-5.5 went generally available in the API and in ChatGPT for Plus, Pro, Business, Enterprise, Edu, and Go plans on April 23, with GPT-5.5 Pro added to Pro, Business, and Enterprise. GPT-5.5 Instant rolled to free-tier ChatGPT users on May 5, replacing GPT-5.3 Instant as the default model for the entire installed base. Codex received GPT-5.5 with a 400K context window and shipped an in-app browser the agent can drive to reproduce visual bugs and verify local fixes — click through a rendered UI, watch the agent reason about what it sees, and have the local fix written and tested before the developer touches the keyboard. Voice-reasoning API models dropped on May 7 with real-time translation and transcription.

In the same six-week window, Anthropic pre-announced Claude Mythos as a full capability tier above Opus 4.7 and ran the Opus 4.8 → Mythos sequence in front of the enterprise market. Google shipped Gemini 3.5 Flash with a 4× speed advantage on agentic benchmarks and 76.2% on Terminal-Bench 2.1. Three frontier-model vendors, three flagship releases, three roughly-parity capability stories on the workloads most enterprises actually deploy. The 'which one wins?' question is now functionally obsolete.

What the 400K + browser combo unlocks in Codex

The Codex headline that will get the most demo time is the in-app browser. The headline that matters more for production teams is the 400K context window.

A 400K window in a coding agent changes what's worth attempting in a single session. Before the bump, Codex sessions had to be heavily context-trimmed — load the relevant files, summarize the rest, pray the agent doesn't need the bit you cut. Past 400K, the entire service — types, tests, migrations, recent PR history, the auth module, the relevant SDK docs — is in the agent's working memory at once. The kind of refactor that previously required a careful manual prompt-engineering pass on what to include now just runs, and the model can correctly reason about a change in service A that depends on a behavior of service B because it has both in front of it.

The browser tool is the second half of that change. Codex sessions can now navigate a local dev server, click through the UI, observe the rendered DOM, and reproduce a visual bug by seeing it instead of being told about it. The 'browser as an agent surface' theme that began with Chrome auto-browse and crossed into Anthropic's computer-use API has now reached every major coding agent. For frontend bugs, regression hunts, and end-to-end verification, the agent loop closes inside one tool — the developer files a ticket, the agent loads the page, sees the bug, writes the fix, verifies the fix by re-rendering, opens a PR.

Both unlocks are real. Both also push the review tax upward, because what gets generated grows more than what gets reviewed, and the work that previously sat on a senior developer's screen for half a day to verify is now a 90-second auto-loaded PR with a video of the agent reproducing the bug attached.

Three vendors at parity is the strategic story

For most of 2024 and 2025, enterprise AI strategy had a single dominant question buried under all the others: which frontier model do we build the stack around? The question was real because the capability gap between vendors was real — there were workloads where Claude was decisively the only reasonable answer, workloads where GPT was the only one, workloads where Gemini's pricing was the only thing that made the unit economics work. Picking was a meaningful bet.

The capability gap has narrowed. As of May 2026:

Claude Opus 4.8 / Mythos leads on the agentic-coding axis — SWE-bench Verified at 93.9%, Terminal-Bench 2.0 at 82.0%, Dynamic Workflows shipping 1,000-subagent runs as a primitive.
GPT-5.5 / Codex leads on multi-modal end-to-end task completion — voice models, in-app browser, 400K context, broadest enterprise distribution through ChatGPT.
Gemini 3.5 Flash leads on speed-and-cost-per-token at frontier capability — 4× output speed, Managed Agents in the API, deep integration with Google Workspace and Cloud.

Each has a workload where it's the marginal best choice. None has a workload where the others are unusable. And the rankings reshuffle every six to eight weeks. The model your team picked as 'decisively best' in March is the second-best model by May. The strategic question stopped being which? and became how much of your stack would break if you swapped?

For most teams, the honest answer is more than we'd like — because the integration was written against vendor-specific tool calling, the prompts were tuned to a particular model's behavior, the evals were calibrated on one vendor's output distribution. Swapping isn't a config change; it's an engineering project. And the team that did the work to make it a config change is the team that gets to ride the next capability bump three weeks faster than everyone else.

What stack-portability actually looks like as engineering

This is not abstract. Portability between Claude, GPT-5.5, and Gemini 3.5 in mid-2026 is a specific set of design choices.

Tool calls go through MCP, not vendor-specific formats. Every major vendor now speaks MCP. Code that wraps vendor-specific tool-call JSON is code that has to be re-written when the model swaps. Code that calls through an MCP layer doesn't.

Prompts are templated, not hand-tuned. A prompt that has been hand-tuned to GPT-5.5's preferences is a prompt that quietly under-performs on Claude. Treat prompts as template + per-vendor adapter: the template is the intent and the structure; the adapter is the per-vendor phrasing the eval harness shows actually works.

Evals run on every model in your matrix, every release week. A regression on Codex when GPT-5.6 ships should be caught the same week, not the same quarter. That requires the eval harness to be runnable across vendors, with gold sets that are not specific to any one model's output style, and a continuous-eval cadence that is on a calendar instead of being a manual exercise after a release.

Cost-tracking is per-vendor at the workload level, not per-call. A workload that costs 30% less on Gemini 3.5 Flash for the same eval score is a workload that should run on Gemini 3.5 Flash. That decision is visible only if your cost-tracking groups calls by workload and reports cost-per-successful-task by vendor — not the per-call dashboard most teams currently rely on.

None of this is novel. All of it is the difference between a team that gets the next capability bump as a free win and a team that spends a quarter integrating it.

Where Sonnet Code fits

Three frontier-model vendors at roughly parity is the easy half of the story. The hard half is the engineering above the model that turns the multi-vendor environment from a risk into a leverage point — the MCP-native integration layer, the templated prompts with per-vendor adapters, the eval harness that runs across the matrix every release week, the cost-tracking that surfaces which workload belongs on which vendor. AI development at Sonnet Code is exactly that engineering: building the portability layer that lets your team adopt GPT-5.5's 400K + browser combo this month, Mythos when it ships, and the next Gemini bump after that, without rewriting the integration each time. AI training is the human-judgment half: senior engineers and domain experts who design the cross-vendor eval rubrics that make the comparison meaningful in your domain, calibrate gold sets that don't accidentally favor one vendor's output style, and run the adversarial review on the workloads where the right vendor is genuinely a question.

The frontier-model layer is now firmly multi-vendor. The portability layer above it is where the next year of leverage lives. That's the layer worth building deliberately.