AI & Machine Learning · May 15, 2026 · 8 min read

Codex on ChatGPT Mobile: OpenAI Just Made Coding-Agent Supervision a Phone Workflow

The release, in one paragraph

On May 14, 2026, OpenAI shipped Codex inside the ChatGPT mobile app for iOS and Android — in preview, available across all ChatGPT plans including the free tier. The phone connects to a Codex desktop instance (macOS at launch; Windows pending) and pulls in the live state of that machine: active task threads, terminal output, file diffs, and pending approvals. From the phone, a developer can switch between threads, approve commands, redirect a running task, switch models, or kick off new ones. Alongside the mobile launch, OpenAI made Hooks generally available, rolled out programmatic access tokens for Business and Enterprise plans for CI pipeline use, and added HIPAA-compliant Codex for eligible ChatGPT Enterprise workspaces running in local environments.

The headline framing is "Codex on your phone." The substance is one tier deeper: OpenAI just productionized the asynchronous-coding-agent workflow — long-running agent on the workstation, supervising human on the move — and shipped the CI integration plumbing in the same release. That's a deliberate, coordinated push at the workflow Anthropic's Claude Code currently owns, and the response cycle is going to define how senior engineers actually work with coding agents for the next year.

Why mobile-as-the-approval-surface is the right shape (and why it isn't what most teams thought they wanted)

The early Cursor / Claude Code / Codex experience was synchronous: engineer at the desk, IDE open, agent running in a tab, human in the loop at every turn. That worked when the agent was junior — small edits, narrow scope, fast feedback. It does not work as the agent's task length scales.

The shape of senior engineering work, when the agent is doing the writing, is asynchronous review. The engineer kicks off a long-running task (refactor this directory, write tests for that module, investigate this bug), the agent goes off and does it, and the engineer comes back when there's something to look at. That mental model is the same one a tech lead applies to a junior engineer: assign the work, check in periodically, review the PR when it's ready.

The mobile interface formalizes the "check in periodically" piece. A senior engineer with five concurrent Codex tasks running on their workstation can, from anywhere, see which ones are blocked waiting for approval, glance at the diffs, approve or redirect, and walk away. That is genuinely the shape of the work — not because mobile is the best surface for reviewing diffs (it isn't), but because the approval of an agent's checkpoint decision is a different cognitive task than the careful review of the underlying change, and the approval can happen anywhere.

Three things follow from that, and most of them haven't landed for the teams using Codex synchronously today:

The agent's checkpointing strategy matters more than its raw capability. If the agent stops to ask for approval every five minutes, mobile is annoying. If the agent stops to ask for approval at meaningful decision points — "I'm about to delete a directory," "I'm about to change the public API," "I'm about to write a test that mocks behavior I don't understand" — mobile is liberating. The quality of the checkpoint prompts is now a product surface, not an implementation detail.
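
If checkpoint prompts are a product surface, the policy behind them deserves to be explicit rather than emergent from prompt wording. A minimal sketch of what an action-classification policy could look like; every action name and category here is an invented illustration, not anything from the Codex API:

```python
from dataclasses import dataclass
from enum import Enum

class Checkpoint(Enum):
    AUTO = "auto"            # proceed without asking anyone
    ASK_MOBILE = "mobile"    # meaningful decision point: ping the phone
    ASK_DESKTOP = "desktop"  # too consequential for a phone tap

@dataclass
class ProposedAction:
    kind: str    # e.g. "run_tests", "delete_path", "edit_public_api"
    target: str  # the file, directory, or PR the action touches

# Hypothetical policy: checkpoint only where the blast radius changes.
def checkpoint_for(action: ProposedAction) -> Checkpoint:
    if action.kind in {"read_file", "run_tests", "edit_private_code"}:
        return Checkpoint.AUTO
    if action.kind in {"delete_path", "edit_public_api", "add_dependency"}:
        return Checkpoint.ASK_MOBILE
    if action.kind in {"merge_pr", "push_to_main"}:
        return Checkpoint.ASK_DESKTOP
    return Checkpoint.ASK_MOBILE  # unknown actions default to a human look
```

The point of the sketch is the shape: a small, reviewable mapping from action types to interruption levels, so "when does the agent stop and ask" is a decision the team made, not one the agent improvised.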

The desktop-to-mobile latency is a UX requirement with real engineering behind it. When the engineer is on the train and approves a step, the agent should resume in seconds, not minutes. That requires the desktop instance to be running, the connection to be live, and the state-sync to be reliable. On a normal corporate laptop with sleep policies, this is non-trivial. Teams that deploy this seriously will end up with dedicated workstations (Mac minis at home, sometimes cloud-hosted dev environments) instead of using the engineer's primary laptop.

The review on the phone is not the deep review. A senior engineer who approves an agent's checkpoint on the train still needs to do a careful diff review on the desktop before merging. The mobile approval is workflow advancement, not quality gate. Confusing the two — assuming that "I approved it on my phone" means "I reviewed it on my phone" — is the failure mode every team is going to hit in their first quarter of mobile-Codex usage.

What the Hooks GA and the programmatic tokens actually unlock

Buried in the same announcement are two pieces that are more consequential for production teams than the mobile surface itself.

Hooks went GA. Hooks let a team plug custom code into the Codex agent's loop — pre-prompt hooks, post-tool hooks, on-completion hooks. The patterns are familiar to anyone who has used Claude Code's hook ecosystem; the shift here is that OpenAI's hooks are now production-grade with stability guarantees, not a beta surface. Teams can now wire Codex into their internal tools — issue trackers, deploy systems, code-review automation — with confidence the integration won't break on the next Codex update.
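
The announcement doesn't publish the hook contract, so treat the following as a shape sketch rather than the real interface: a post-tool hook that mirrors every command the agent ran into an internal audit log, assuming a JSON-event-on-stdin convention like other agent hook systems use. The field names and log path are assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical post-tool Codex hook: mirror agent commands to an audit log.

Assumes the hook receives one JSON event on stdin; every field name below is
an assumption for illustration, not the documented Codex contract.
"""
import json
import sys
from datetime import datetime, timezone

AUDIT_LOG = "/var/log/codex/agent-audit.jsonl"  # hypothetical path

def main() -> None:
    event = json.load(sys.stdin)
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "thread_id": event.get("thread_id"),
        "tool": event.get("tool"),
        "command": event.get("command"),
        "exit_code": event.get("exit_code"),
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    main()
```

Small on purpose: the value of a hook like this is that it wires the agent into infrastructure the team already trusts, and a GA surface means the wiring survives Codex updates.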

Programmatic access tokens for CI use. Business and Enterprise plans can now provision tokens that Codex uses in CI pipelines and release workflows — no interactive ChatGPT session needed. That means a CI workflow can spawn a Codex agent to investigate a flaky test, propose a fix, run the test suite against the fix, and either auto-merge or open a PR. It's the same primitive Anthropic shipped for Claude Code earlier in the year, now reaching feature parity from OpenAI.
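
What the CI pattern could look like in practice, heavily hedged: the endpoint URL, payload shape, and response fields below are invented for illustration; only the pattern itself (token injected by the CI system, no interactive session, a human-reviewed PR as the output) comes from the announcement.

```python
"""Sketch of a CI step that asks a Codex agent to investigate a flaky test.

The API endpoint, payload, and response fields are assumptions, not the
documented interface.
"""
import json
import os
import urllib.request

CODEX_API = "https://api.example.com/v1/codex/tasks"  # hypothetical endpoint

def investigate_flaky_test(repo: str, test_id: str) -> dict:
    token = os.environ["CODEX_CI_TOKEN"]  # provisioned per-pipeline, not per-user
    payload = {
        "repo": repo,
        "task": f"Investigate flaky test {test_id}; propose a fix and open a PR.",
        "output": "pull_request",  # never auto-merge from CI without a policy
    }
    req = urllib.request.Request(
        CODEX_API,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    result = investigate_flaky_test("org/service", "tests/test_billing.py::test_retry")
    print(result.get("pr_url"))
```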

Together, Hooks GA + programmatic tokens are the announcement for production use. Mobile Codex is the announcement for the engineer's day-to-day workflow. The first is the bigger investment to roll out; the second is what makes engineers actually use it.

HIPAA-compliant Codex and what it does (and doesn't) signal

The mention of HIPAA-compliant Codex "for eligible ChatGPT Enterprise workspaces, in local environments only" is the part healthcare buyers should be reading carefully — and the part everyone else should understand for the precedent it sets.

"In local environments only" is the load-bearing phrase. Codex's cloud-sandbox mode — where the agent runs in OpenAI-managed sandboxes against a copy of the customer's repo — is not the HIPAA path. The HIPAA path runs Codex against a local workstation, with the repo on the local disk, and the OpenAI service still processes the prompts and the diffs but under a BAA. That's a meaningful capability for healthcare ISVs and provider IT teams that have been blocked on Codex for compliance reasons. It is not a substitute for the Coder-Agents-style in-VPC deployment for the regulated buyers who can't have any code leave their perimeter at all.

The signal beyond healthcare is that OpenAI is now willing to spin up vertical-compliance variants of Codex (HIPAA today, more BAAs and DPAs likely over the next quarter), which closes one of the structural gaps between Codex and the rest of the AI dev tooling stack.

Where we'd push back on the launch narrative

macOS-only at launch is a real limitation for the Windows-dominant enterprise. Most large enterprises run mostly Windows workstations. Codex Mobile's "connect to your laptop" workflow is unavailable to those teams until OpenAI ships Windows support. The right posture for those teams is: watch the launch, don't rebuild the workflow around it yet.

Approving on mobile is not reviewing on mobile. The risk is that engineers conflate "I tapped approve" with "I evaluated whether this change is correct." A team rolling out mobile Codex should be explicit, in its workflow doc, that mobile approvals are workflow advancement and that every diff still gets a desktop review before merge. Without that discipline, the convenience of mobile becomes the source of the next quality regression.

The desktop-availability problem is real and expensive. A Mac mini at home that stays awake to run the desktop Codex instance is fine for a senior engineer who can expense it. A 5,000-engineer enterprise that wants to provide this workflow as a platform service has to think about cloud-hosted developer environments, persistent connections, and the operational story for "my Codex is down because the workstation rebooted." That's a platform-engineering project, not a click-to-install.

What we'd build differently this week

  • Pilot mobile Codex with three senior engineers on long-running tasks. Pick people who are already comfortable with desktop Codex and have at least one task per day that runs >10 minutes. Measure how often mobile approvals happen, what the latency feels like, whether the engineers report it as useful or as a notification stream they have to manage.
  • Document the "approval vs review" boundary in your engineering workflow. Write it down. What counts as a mobile-acceptable approval (e.g. "agent wants to run a test"), what requires a desktop review (e.g. "agent wants to merge a PR"). Without the boundary, the boundary will be set by accident. A policy-as-code sketch of the boundary follows this list.
  • Pilot Hooks for one CI integration. A Codex hook that posts to your internal incident-response channel when an agent's task fails. Or a hook that writes audit-trail records to your central logging when an agent takes a destructive action. Small, useful, and the integration pattern teaches the team how to wire bigger ones.
  • Stand up programmatic Codex in one CI workflow. Flaky-test investigation is the canonical example; refactor-after-dependency-update is another. Measure the success rate, the cost per task, and how often a human had to intervene to redirect.
  • Decide which compliance posture you need before piloting HIPAA-Codex. A BAA with OpenAI is not the same as "Codex inside our VPC." If the workflow can tolerate OpenAI as the data processor under a BAA, HIPAA-Codex is fine. If it can't, the right pilot is a Coder-Agents-style in-VPC deployment instead.
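
As promised in the second bullet, a policy-as-code sketch of the approval-vs-review boundary. The action names are invented; the point is that the boundary lives in version control, next to the workflow doc, instead of in individual engineers' heads.

```python
# Hypothetical approval-boundary policy, versioned alongside the workflow doc.
# Action names are illustrative, not part of any Codex API.

MOBILE_OK = {
    "run_tests",              # advancing the workflow, nothing merges
    "retry_failed_step",
    "continue_after_summary",
}

DESKTOP_REQUIRED = {
    "merge_pr",               # the actual quality gate
    "change_public_api",
    "delete_directory",
    "modify_ci_config",
}

def approval_surface(action: str) -> str:
    """Return where a human is allowed to approve this action."""
    if action in DESKTOP_REQUIRED:
        return "desktop"  # full diff review before approval
    if action in MOBILE_OK:
        return "mobile"   # workflow advancement only; diff review still owed
    return "desktop"      # default closed: unknown actions escalate
```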

Sonnet Code's take

Codex on ChatGPT Mobile is the moment senior engineers got the right surface to actually supervise coding agents — and the same release that shipped that surface also shipped the CI plumbing and the vertical-compliance variant that turn Codex from "a tool I use at my desk" into "a system my team runs in production." The teams that win this cycle aren't the ones who turn on mobile-approve and call it done; they're the ones who design the agent's checkpointing for asynchronous supervision, who treat mobile approval and desktop review as distinct workflow steps, who wire Hooks into their existing incident-response and audit pipelines, and who staff the senior engineers whose review is what the workflow is actually advancing toward.

We staff that work directly: AI development at Sonnet Code is the engineering that designs the agent checkpointing for asynchronous use, builds the Hooks integrations into the org's existing CI and observability stack, stands up the programmatic-Codex CI workflows, and lays the foundation for the BAA-or-VPC choice the compliance team is about to raise. We pair it with AI training engagements where senior engineers author the rubrics that grade agent diffs, build the golden examples the agent is calibrated against, and grade the trajectory traces that come out of mobile-approved tasks. If your team is going to flip mobile Codex on for engineers next week, the next conversation isn't about the app — it's about the checkpointing policy you haven't written and the senior reviewer whose desktop review is still the actual quality gate.