Sonnet Code
AI Development · May 23, 2026 · 9 min read

Codex Goal Mode Hits GA and Locked Computer Use Lands — Autonomous, Objective-Driven Agents Are Now an Enterprise SKU

The release, in one paragraph

On May 22, 2026, OpenAI shipped the largest single Codex release since the mobile launch a week earlier. Goal Mode — previously an experimental flag — is now generally available across the Codex app, IDE extension, and CLI, with the published positioning that an engineer "can have Codex drive toward a specific objective for hours or even days." Locked Computer Use ships in the same release: Codex can now keep operating desktop apps after the user's Mac locks, including remotely via Codex Mobile, scoped to active trusted turns with short-lived authorization, covered displays, automatic relock on local input, and a manual-unlock fallback. The enterprise tier picks up Secure MCP Tunnel — a customer-hosted tunnel-client that lets ChatGPT web, Codex, the Responses API, and AgentKit reach private or on-prem MCP servers without exposing them to the public internet — plus 90+ new plugins combining skills, app integrations, and MCP servers (Atlassian Rovo, CircleCI, CodeRabbit, GitLab Issues, Microsoft Suite, Neon by Databricks, Remotion, Render, Superpowers). HIPAA support for Codex inside ChatGPT Enterprise lands in the same window.

The surprising line in the release isn't "Codex got more autonomous." The autonomous-coding-loop pattern has been credible since GitHub's Copilot Coding Agent went GA two days earlier and Claude Code's autonomous mode shipped last quarter. The surprising line is the scaffolding around the autonomy — locked computer use with relock-on-local-input, tunneled MCP for on-prem services, HIPAA-bound deployment, plugin marketplace at enterprise scale, and a goal-mode posture that's specifically marketed as "drive for hours or days." Each one of those is a procurement requirement that the public preview couldn't satisfy. Together, they turn an objective-driven agent from a research-team curiosity into a SKU a Fortune 500 procurement office can sign off on. The conversation isn't whether the autonomous loop works. It's whether your organization has the supervision, eval, and rollback plumbing to let one run unattended against a stated objective on production-adjacent systems.

Why objective-driven autonomy moves the eval problem, not the writing problem

For twelve months, the AI-coding conversation has been dominated by assistive autonomy — a developer in the loop, reviewing each suggestion, accepting or rejecting on a per-change basis. Goal Mode GA is a different posture: the developer states an objective, the agent runs (sometimes for hours, sometimes across days), and the developer reviews what the agent did rather than what the agent proposed. That posture change moves the binding constraint from "how good is the suggestion" to "how good is the eval that catches a wrong trajectory before it lands."

"Run for hours against a goal" is a different supervision contract. A 30-second autocomplete suggestion can be reviewed by reading the diff. A 6-hour Goal Mode run that touched 47 files, opened 12 PRs, ran the test suite 38 times, and reverted itself twice cannot. It needs a trajectory summary, a confidence-tagged change log, a list of side effects the agent took outside its sandbox, and an eval rubric that grades the outcome against the stated objective — not the quality of each step in isolation. The teams that ship Goal Mode successfully will be the ones that built the supervision UI before they enabled the flag. The teams that enable the flag first will discover the supervision gap during the first overnight run that touched something they didn't expect.

Locked computer use moves the trust boundary, again. When the Mac is unlocked and the developer is present, the safeguard is "the user can stop it." When the Mac is locked, the safeguard has to be inside the agent's contract: short-lived authorization tokens, scoped tool permissions, an explicit list of apps the agent is allowed to drive, an audit trail of every UI action, and relock semantics that revoke ongoing capability the moment local input resumes. OpenAI shipped those safeguards. But the policy (which agents are allowed to run locked, against which apps, under which corporate-account scope) is a decision the customer's security team owns. That decision has to be made before the first overnight Goal Mode run, not after.
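
The customer-owned half of that contract can be written down as data. What follows is a minimal Python sketch, assuming a hypothetical in-house policy layer: `LockedUsePolicy`, `LockedGrant`, and `may_drive` are illustrative names, and nothing here reflects how OpenAI implements or exposes its own safeguards.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LockedUsePolicy:
    """Security-team decision: which apps may be driven while locked (hypothetical)."""
    allowed_apps: frozenset[str]
    max_grant: timedelta  # longest authorization the policy tolerates


@dataclass
class LockedGrant:
    """A short-lived authorization for one locked session (hypothetical)."""
    issued_at: datetime
    ttl: timedelta
    revoked: bool = False

    def revoke(self) -> None:
        # Intended to be called when local keyboard or mouse input resumes.
        self.revoked = True

    def is_active(self, now: datetime) -> bool:
        return not self.revoked and now < self.issued_at + self.ttl


def may_drive(policy: LockedUsePolicy, grant: LockedGrant, app: str, now: datetime) -> bool:
    """Allow a UI action only if the grant is in policy, still live, and the app is listed."""
    if grant.ttl > policy.max_grant:
        return False  # grant is longer-lived than the policy permits
    return grant.is_active(now) and app in policy.allowed_apps
```

The design choice worth noting is that the check is deny-by-default: an app missing from the list, an expired grant, or a revoked grant all produce the same answer.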

Secure MCP Tunnel changes the MCP procurement conversation, full stop. Until this week, exposing an internal MCP server to a hosted AI product meant either (a) putting the MCP server on the public internet behind an auth gate, or (b) running a self-hosted client inside the customer's VPC. Both options had real security review costs. Secure MCP Tunnel is a third path: the customer hosts a tunnel-client that brokers calls into the private MCP server, the AI product talks to the tunnel rather than the server, and the MCP server never touches a public address. That collapses the security review for the largest single class of enterprise MCP deployment — internal tooling reachable through OpenAI's hosted surface — from "net-new architecture proposal" to "plumbing a vendor tunnel." The MCP integration roadmap for the back half of 2026 just got materially shorter inside OpenAI shops.

What objective-driven Codex actually changes for production teams

The objective itself becomes the eval surface. Goal Mode asks the developer to state what "done" looks like before the run starts. That statement is the contract the agent grades itself against, the contract the reviewer evaluates the result against, and (when wired correctly) the contract the eval suite scores future runs against. Teams that have spent the last year writing eval rubrics for individual model outputs now need a parallel discipline for writing objective specifications — what counts as the goal being met, what counts as the goal being met correctly, what counts as the goal being met safely. That's a new artifact category that didn't exist in the assistive-autonomy era.

Trajectory review is the new code review. A traditional code review reads a diff and judges intent from the change. A trajectory review reads a sequence of tool calls, intermediate decisions, partial failures, retries, and side effects, then judges whether the agent's process was reasonable. The skill is different, the tooling is different, and the seniority required to do it well is different — a junior reviewer can spot a typo in a diff, but only a senior reviewer can read a 6-hour trajectory and notice that the agent was solving the wrong problem from step 12 onward. Allocate senior-engineer time accordingly.
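
As a rough illustration of what a trajectory reviewer might be handed, the Python sketch below models a trajectory summary with a confidence-tagged change log and a list of out-of-sandbox side effects. Every name here (`TrajectorySummary`, `Change`, `SideEffect`, `needs_senior_review`) and the 0.7 confidence threshold are assumptions made for the example, not part of any Codex API or export format.

```python
from dataclasses import dataclass, field


@dataclass
class SideEffect:
    """An action the agent took outside its sandbox (hypothetical shape)."""
    description: str
    reversible: bool


@dataclass
class Change:
    """One entry in a confidence-tagged change log."""
    path: str
    summary: str
    confidence: float  # agent's self-reported confidence, 0.0 to 1.0


@dataclass
class TrajectorySummary:
    objective: str
    changes: list[Change] = field(default_factory=list)
    side_effects: list[SideEffect] = field(default_factory=list)
    self_reverts: int = 0

    def low_confidence_changes(self, threshold: float = 0.7) -> list[Change]:
        """Changes the agent itself was unsure about."""
        return [c for c in self.changes if c.confidence < threshold]

    def needs_senior_review(self) -> bool:
        # Escalate on any irreversible side effect, any self-revert,
        # or any low-confidence change.
        irreversible = any(not s.reversible for s in self.side_effects)
        return irreversible or self.self_reverts > 0 or bool(self.low_confidence_changes())
```

A record like this does not replace reading the trajectory; it only decides which runs get senior attention first.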

The MCP plugin surface stops being optional. Ninety new plugins in one release, on top of the marketplace surface that's been growing for two quarters, means the generic answer to "can Codex do X in our environment" is now almost always yes, through a plugin or MCP server. The interesting questions are which plugins your organization has approved, which MCP servers it operates, what the trust contract is for each, and who owns the catalog. Without that ownership, the plugin tier turns into shadow integration: every developer wires up whatever Atlassian or GitLab plugin they want, and the security team finds out at the next pen test.
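
A catalog with named owners can start as something very small. The sketch below is one possible shape in Python, under the assumption that plugins are identified by a string id; `CatalogEntry`, `Status`, `plugin_status`, and `shadow_plugins` are hypothetical names, not part of any plugin marketplace API.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    APPROVED = "approved"
    DENIED = "denied"
    UNREVIEWED = "unreviewed"


@dataclass(frozen=True)
class CatalogEntry:
    plugin: str
    status: Status
    owner: str  # named person or team accountable for this plugin


def plugin_status(catalog: dict[str, CatalogEntry], plugin: str) -> Status:
    """Anything absent from the catalog is treated as unreviewed, never as approved."""
    entry = catalog.get(plugin)
    return entry.status if entry else Status.UNREVIEWED


def shadow_plugins(catalog: dict[str, CatalogEntry], installed: list[str]) -> list[str]:
    """Installed plugins that are denied or were never reviewed, sorted by id."""
    return sorted(p for p in installed if plugin_status(catalog, p) is not Status.APPROVED)
```

Running `shadow_plugins` against each developer's installed list is one cheap way to surface shadow integrations before a pen test does.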

HIPAA-bound Codex inside ChatGPT Enterprise is a real unlock for regulated buyers. The HIPAA boundary inside ChatGPT Enterprise extends the BAA to cover Codex tasks, which means healthcare-adjacent engineering teams (provider IT, payer engineering, pharma R&D platforms) can route engineering work through Codex without a separate BAA negotiation. That moves Codex from "interesting but unevaluated" to "a tool we can pilot inside a covered entity" for an entire class of organization that's been on the sidelines.

What it doesn't change

Goal Mode doesn't write the objective for you. A vague goal — "clean up the auth module" — produces a vague trajectory and a vague result. A precise goal — "reduce auth latency p95 below 80ms without changing the public API surface, with regression tests covering every public method" — produces a trajectory the reviewer can actually grade. The skill of writing a goal specification that's both bounded enough to be checkable and broad enough to be useful is non-trivial, and very few engineering teams have institutionalized it. The first quarter after Goal Mode GA will be dominated by the discovery that goal-writing is its own craft.
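
To make the contrast concrete, here is a minimal Python sketch of a goal specification as a checkable record. The `GoalSpec` type, its field names, and the `gaps` helper are hypothetical, and the `pytest tests/auth` command and rollback wording are placeholders rather than anything Goal Mode requires.

```python
from dataclasses import dataclass, field


@dataclass
class GoalSpec:
    objective: str
    acceptance_criteria: list[str] = field(default_factory=list)
    regression_suite: str = ""  # command that must keep passing
    out_of_scope: list[str] = field(default_factory=list)
    rollback_condition: str = ""

    def gaps(self) -> list[str]:
        """Names of the fields a reviewer would still need filled in."""
        checks = {
            "acceptance_criteria": bool(self.acceptance_criteria),
            "regression_suite": bool(self.regression_suite.strip()),
            "out_of_scope": bool(self.out_of_scope),
            "rollback_condition": bool(self.rollback_condition.strip()),
        }
        return [name for name, filled in checks.items() if not filled]


# The vague goal: an objective and nothing a reviewer can grade against.
vague = GoalSpec(objective="clean up the auth module")

# The precise goal, with illustrative values for the remaining fields.
precise = GoalSpec(
    objective="reduce auth latency p95 below 80ms",
    acceptance_criteria=[
        "auth latency p95 below 80ms",
        "public API surface unchanged",
        "regression tests cover every public method",
    ],
    regression_suite="pytest tests/auth",  # placeholder command
    out_of_scope=["public API signatures"],
    rollback_condition="revert the branch if p95 regresses",  # illustrative
)
```

Here `vague.gaps()` lists all four unfilled fields while `precise.gaps()` is empty. A check like this cannot tell whether a goal is well chosen, only whether it is specified enough to be graded.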

Locked computer use is still computer use. The safeguards in the release are good, the trust contract is real, and the relock semantics are well-designed. None of that changes the fact that a desktop-driving agent has access to anything the desktop has access to — production consoles, payment dashboards, the corporate calendar. Scope it deliberately, audit it routinely, and don't let the safeguard list lull the security team into approving uses the policy never anticipated.

The plugin marketplace is a supply-chain surface. Every plugin is a third-party integration with its own auth scope, its own update cadence, its own risk profile. Ninety new plugins in one drop is ninety new supply-chain dependencies for any team that adopts the full catalog. The teams that ship the plugin surface responsibly will run it through the same review process as any other vendor dependency. The teams that don't will discover the gap when one of the plugins is compromised.

Secure MCP Tunnel only collapses the procurement step for OpenAI-hosted surfaces. It doesn't address the parallel question of how the other AI surfaces in your stack (Claude, Cursor, Gemini) reach the same internal MCP servers. If your roadmap depends on multiple frontier providers — which most enterprise roadmaps do — you still need a story for cross-vendor MCP reach. The tunnel is a leverage point inside OpenAI shops; it's not the universal answer.

Where we'd push back on the framing

"Drive for hours or days" is a marketing line, not an operational guarantee. In a controlled environment, with a well-specified goal, a clean codebase, and a forgiving CI pipeline, the multi-hour autonomous run is real. In a real enterprise environment — with a 12-year-old monorepo, brittle CI, flaky integration tests, and three teams who didn't know the agent was running — the multi-hour run will hit something the demo never showed. Budget for that, plan for that, and don't let the marketing line set the expectation the security review will have to walk back.

The supervision UI that shipped is necessary but not sufficient. OpenAI's trajectory review, command approval, and mobile steering surfaces are real improvements over the v1 supervision posture. None of them replace the senior engineer who reads the diff at the end. The right read is "the tooling makes the supervisor faster," not "the tooling makes the supervisor optional." Teams that take the second reading will learn the difference during the first incident.

Per-task billing makes cost forecastable, not cheap. A multi-hour Goal Mode run that solves a real problem is enormously valuable; a multi-hour Goal Mode run that thrashes against a malformed objective and accomplishes nothing is enormously expensive. The cost model rewards goal precision; it punishes goal sloppiness. Engineering teams will need to learn to budget agent capacity the same way they budget compute — with a default cap, a per-team allocation, and a review process when a team exceeds its envelope.
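
The budgeting discipline described here (a default cap, a per-team allocation, a review step when the envelope is exceeded) can be sketched in a few lines of Python. The names `TeamBudget` and `authorize_run` are hypothetical, and the amounts are abstract billing units, since no actual pricing is assumed.

```python
from dataclasses import dataclass


@dataclass
class TeamBudget:
    allocation: float  # agent capacity granted for the period, in billing units
    spent: float = 0.0

    @property
    def remaining(self) -> float:
        return self.allocation - self.spent


def authorize_run(budget: TeamBudget, estimate: float, default_cap: float) -> str:
    """Approve a run, or route it to review when it exceeds the cap or the allocation."""
    if estimate > default_cap:
        return "review: exceeds default per-run cap"
    if estimate > budget.remaining:
        return "review: exceeds team allocation"
    budget.spent += estimate  # reserve the capacity up front
    return "approved"
```

Reserving the estimate before the run starts is a deliberate choice in this sketch: a run that thrashes still counts against the team's envelope.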

What we'd build differently this week

  • Stand up a goal-specification template. Pre-define the fields any Goal Mode invocation has to fill in: the bounded objective, the acceptance criteria, the regression suite that has to keep passing, the explicit out-of-scope list, the rollback condition. Treat it as the agentic equivalent of a PR template — required, reviewable, versioned in the repo.
  • Define a trajectory-review checklist. Pull request templates already exist; trajectory review templates don't. Build one that covers: was the agent solving the stated objective, did it touch anything out-of-scope, did it skip a test it shouldn't have skipped, did it leave the environment in a state the rollback can reverse. Senior engineer time, deliberately allocated.
  • Decide locked-use policy before the first locked run. Which projects allow it, which environments forbid it, which apps the agent can drive while the screen is locked, what the audit cadence looks like. Get the security team's signoff in writing. Backfilling this after the first incident is more expensive than designing it up front.
  • Catalog the MCP plugins the organization will support. Approved list, denied list, review process for new entries, named owner per plugin. The plugin surface is going to grow faster than the security team's bandwidth to evaluate it; the catalog is the choke point that keeps shadow plugins out of production.
  • Wire the eval suite to the objective, not just the output. A traditional eval grades a model's response to a prompt. A Goal Mode eval grades the trajectory's adherence to the objective. Build the rubric, score the trajectories, retain the failure cases as regression scenarios for the next agent revision.
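
The trajectory-review checklist and the objective-level eval in the list above can share one mechanical first pass. The Python sketch below is one way to do that, assuming the run has already been reduced to a simple record; `RunRecord`, `review`, and the glob-style out-of-scope patterns are assumptions for illustration, not a Codex export format.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class RunRecord:
    touched_paths: list[str]
    skipped_tests: list[str]
    criteria_met: dict[str, bool]  # acceptance criterion -> satisfied by this run?
    rollback_verified: bool


def review(run: RunRecord, out_of_scope: list[str]) -> list[str]:
    """Return checklist findings; an empty list means the mechanical checks passed."""
    findings = []
    for path in run.touched_paths:
        # Flag any touched path that matches an out-of-scope glob pattern.
        if any(fnmatchcase(path, pattern) for pattern in out_of_scope):
            findings.append(f"out-of-scope change: {path}")
    for test in run.skipped_tests:
        findings.append(f"skipped test: {test}")
    for criterion, met in run.criteria_met.items():
        if not met:
            findings.append(f"unmet criterion: {criterion}")
    if not run.rollback_verified:
        findings.append("rollback not verified")
    return findings
```

Findings from failed runs can be kept as regression scenarios. The judgment call of whether the agent was solving the right problem still belongs to the senior reviewer.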

Sonnet Code's take

The Goal Mode + Locked Computer Use + Secure MCP Tunnel + 90-plugin release is the moment OpenAI's coding agent stopped looking like a developer-productivity tool and started looking like an autonomous operator that needs supervision plumbing the same way a junior engineer needs supervision plumbing — escalation rules, scope limits, a manager who reads the trajectory at the end of the run. The teams that win the next quarter aren't the ones who turn the flag on fastest. They're the ones who built the goal-specification template, the trajectory-review checklist, the locked-use policy, the plugin catalog, and the objective-grading eval before enabling Goal Mode in production.

That's the work we do. AI development at Sonnet Code is the engineering plumbing — the goal specs, trajectory review surfaces, MCP plugin catalogs, eval harnesses, audit trails, rollback paths — that turns an autonomous coding agent from a demo into a production tool with a supervisor on the other end. AI training is the senior-practitioner side: the engineers, security architects, and (through partner networks in regulated domains) credentialed clinicians and attorneys who author the objective-grading rubrics, the failure-mode catalogs, and the trajectory-review criteria the supervision layer scores against. If your organization is reading the Goal Mode release this week and wondering whether your team is ready to let an agent run for six hours against a stated objective on production-adjacent infrastructure, the next conversation isn't about whether to flip the flag. It's about who owns the objective, who reviews the trajectory, and the supervision contract that lets the agent operate safely while the screen is locked.