Sonnet Code
AI & Machine Learning · May 21, 2026 · 9 min read

GitHub's Copilot Coding Agent Just Hit GA — Autonomous PRs Move the Bottleneck From Writing Code to Reviewing It

The release, in one paragraph

On May 20, 2026, GitHub announced the general availability of the Copilot Coding Agent, the autonomous loop that closes the gap between an assigned issue and a merge-ready PR without a human at the keyboard. GA brings four things the public preview didn't have: multi-model routing across Claude Opus 4.7, GPT-5.5, and Cursor's Composer 2.5 (with the choice exposed per-task and overridable per-repo); a dedicated agent-PR review surface that highlights the agent's reasoning trace, tool calls, and confidence-tagged change ranges; per-task billing (a single "task" meter rather than per-token line items) that lets engineering leaders forecast cost the way they forecast headcount; and enterprise governance hooks including a deny-list, an audit log, and a policy-driven "do not assign" flag at the repo level. The agent is available to Copilot Business and Copilot Enterprise subscribers immediately, with a free-tier preview for individual contributors capped at five agent tasks per month.

The surprising line in the GA announcement isn't "autonomous coding agents work." The autonomous-PR pattern has been credible for nine months — Cursor's Background Agents, Claude Code's autonomous mode, Devin's relaunch, the various OSS frameworks. The surprising line is that GitHub — the platform that owns the issue, the code, the PR, the review, and the merge — is now shipping the agent as a native, default-available product across every paid Copilot tier. That changes the deployment friction from "adopt a new vendor" to "flip a flag." And it puts the entire review-and-gating surface for AI-authored code into the same UI engineers already use for human-authored code. The bottleneck in software delivery isn't writing code anymore. It's reviewing it, gating it, and deciding when to roll back.

Why review-as-the-bottleneck is the architecture decision, not the productivity claim

For two years, the AI-coding conversation has been framed as a productivity story: AI assistants reduce the time to write a change, so engineers ship more. The Coding Agent GA reframes that conversation, because once the writing step is largely automated, the time-to-merge is dominated by review, CI, and the decision to gate or deploy. That shifts where the engineering investment has to land.

Review surface area scales with agent throughput. A senior engineer who previously reviewed five human-authored PRs per day will, in a team that adopts agent-authored PRs aggressively, be asked to review twenty or thirty. Without a deliberate redesign of the review process, the quality bar drops, the false-merge rate climbs, and the team's escape rate to production rises sharply. The teams that ship Coding Agent successfully aren't the ones who turn on the most agents; they're the ones who simultaneously redesign the review pipeline — diff-summarization tooling, confidence-tagged change ranges, agent-PR-specific CI checks, gated merges for high-risk paths — to keep the senior reviewer's signal-to-noise ratio above the threshold where review is a real gate and not a rubber stamp.
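The capacity argument above can be made concrete with a small calculation. The sketch below is illustrative only: the `RepoReviewLoad` fields, the repo name, and the threshold of 1.0 are assumptions made for this example, not anything GitHub exposes or recommends.

```python
from dataclasses import dataclass


@dataclass
class RepoReviewLoad:
    """Hypothetical per-repo review numbers, gathered by the team itself."""
    repo: str
    agent_prs_per_day: float         # agent-authored PRs opened per day
    human_prs_per_day: float         # human-authored PRs opened per day
    reviewers: int                   # engineers qualified to review this repo
    prs_per_reviewer_per_day: float  # sustainable review rate per reviewer


def review_utilization(load: RepoReviewLoad) -> float:
    """Return demanded reviews divided by sustainable review capacity."""
    capacity = load.reviewers * load.prs_per_reviewer_per_day
    if capacity <= 0:
        # No reviewers at all: treat the repo as infinitely over capacity.
        return float("inf")
    return (load.agent_prs_per_day + load.human_prs_per_day) / capacity


def should_gate_agent(load: RepoReviewLoad, max_utilization: float = 1.0) -> bool:
    """True when review demand already exceeds the allowed utilization."""
    return review_utilization(load) > max_utilization


if __name__ == "__main__":
    payments = RepoReviewLoad("payments", agent_prs_per_day=25, human_prs_per_day=5,
                              reviewers=2, prs_per_reviewer_per_day=5)
    print(payments.repo, review_utilization(payments), should_gate_agent(payments))
```

A utilization well above 1.0 is the rubber-stamp zone: the reviewers on that repo are being asked for more reviews than they can sustainably give.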

The reviewer's rubric becomes the new release gate. When the agent is writing the PR, the meta-question of "what does an acceptable change look like" has to be encoded somewhere the agent can read and the reviewer can hold the agent to. That's the rubric — the same artifact that grades a fine-tuning run, the same artifact a senior practitioner authors during an AI-training engagement, the same artifact that catches regressions in an eval suite. Teams shipping Coding Agent in production without a documented review rubric are flying blind; teams shipping with a rubric that a senior engineer maintains can scale agent throughput while holding the merge-quality bar constant.

The CI pipeline is now load-bearing in a way it wasn't. Every agent-authored PR runs through CI before it reaches the reviewer's eye; if CI is slow, flaky, or thin on coverage, the agent's throughput overwhelms the gate. The teams that adopt Coding Agent successfully invest in CI as a precondition — fast-running unit tests, deterministic builds, type-checking on every diff, security scanning surfaced inline in the PR — because the CI pipeline is the cheapest place in the loop to catch a regression. A weak CI pipeline + an autonomous agent = a flood of "looks fine, didn't compile, merged anyway" PRs that surface only at deploy time. Strengthen CI before raising the agent's throughput.
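As a rough illustration of a gated merge for high-risk paths, here is a minimal sketch of the decision a CI step could make. The path patterns, the three outcome labels, and the `merge_gate` function are hypothetical; a real setup would wire the result into whatever branch-protection or required-review mechanism the team already uses.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Hypothetical high-risk patterns, anchored at the repo root.
# fnmatch's "*" also matches "/", so "auth/*" covers nested files too.
HIGH_RISK_PATTERNS = ("auth/*", "payments/*", "migrations/*")


def touches_high_risk(changed_paths: Iterable[str]) -> bool:
    """True if any changed path matches a high-risk pattern."""
    return any(fnmatchcase(path, pattern)
               for path in changed_paths
               for pattern in HIGH_RISK_PATTERNS)


def merge_gate(agent_authored: bool, ci_passed: bool, changed_paths: list[str]) -> str:
    """Classify a PR as 'block', 'senior-review', or 'standard-review'."""
    if not ci_passed:
        # A failing build never reaches a reviewer, whoever wrote it.
        return "block"
    if agent_authored and touches_high_risk(changed_paths):
        return "senior-review"
    return "standard-review"


if __name__ == "__main__":
    print(merge_gate(True, True, ["auth/session.py", "README.md"]))
```

The ordering is the design choice here: CI failure short-circuits everything, so reviewer time is only spent on changes that at least build and pass tests.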

What the multi-model routing actually changes

The most under-reported line in the GA release is the routing layer. The Coding Agent now exposes the model choice per-task — Claude Opus 4.7 for the highest-stakes work, GPT-5.5 for broad tool ecosystem support, Cursor Composer 2.5 for cost-efficient bulk workloads — with policy hooks at the repo level to constrain the choice.

The routing decision is the customer's, not GitHub's. A year ago, choosing a coding-agent vendor was a one-shot decision; the vendor picked the model under the hood, and the customer got what they got. Today, the routing decision lives in the customer's hands — and that's a feature only if the customer instruments the workloads well enough to make the decision. The teams that benefit most from multi-model routing are the ones that grade per-workload outcomes against a rubric; the teams that simply pick a default and never revisit it are paying for optionality they don't use.

Per-task billing makes cost forecastable for the first time. Per-token billing was a budget-review nightmare: "how much will this cost" required predicting the agent's token consumption per task, the variance per task class, and the model price changes per quarter. Per-task billing collapses all of that into a single line item: "how many tasks did the team complete this month, at the contracted per-task rate." That makes Coding Agent a procurement line item engineering leaders can defend to the CFO with the same confidence they defend a headcount line. The trade-off is real, because under a flat rate the cheap tasks subsidize the expensive ones, but the predictability is worth that cross-subsidy for most enterprise buyers.
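One way to keep that trade-off visible is to log token counts per task yourself and work out what the flat rate implies per token. The sketch below assumes the team can record input and output token counts for each task by its own means; the article does not say whether GitHub reports them. The `TaskRecord` shape, the model labels, and the per-task rate of 2.0 are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One completed agent task, as logged by the team (hypothetical shape)."""
    model: str
    input_tokens: int
    output_tokens: int


def implied_price_per_million_tokens(records: list[TaskRecord],
                                     per_task_rate: float) -> dict[str, float]:
    """For each model, convert flat per-task spend into a per-million-token price."""
    tokens: dict[str, int] = defaultdict(int)
    tasks: dict[str, int] = defaultdict(int)
    for record in records:
        tokens[record.model] += record.input_tokens + record.output_tokens
        tasks[record.model] += 1
    return {
        model: tasks[model] * per_task_rate * 1_000_000 / tokens[model]
        for model in tokens
        if tokens[model] > 0  # skip models with no recorded tokens
    }


if __name__ == "__main__":
    log = [
        TaskRecord("model-a", 300_000, 100_000),
        TaskRecord("model-a", 500_000, 100_000),
        TaskRecord("model-b", 150_000, 50_000),
    ]
    # The rate is an illustrative number, not a real GitHub price.
    print(implied_price_per_million_tokens(log, per_task_rate=2.0))
```

With numbers like these in hand, a per-task quote from one vendor can be compared against a per-token quote from another without re-running the workload.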

Routing policy is now an engineering artifact, not a procurement opinion. The right routing policy lives in code, gets reviewed in PRs, ages with the codebase, and is graded by the eval suite. "Opus 4.7 for refactors touching auth/ or payments/; Composer 2.5 for everything in tests/ and tooling/; GPT-5.5 for tasks tagged external-api-integration" is a real routing policy a senior engineer can write, maintain, and defend at a review meeting. The team that doesn't write the policy down is letting the default routing make the decision; the team that writes it down is shipping the decision deliberately.
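The quoted policy is small enough to write as code. What follows is a sketch under stated assumptions, not the format GitHub's repo-level policy hooks use: the model label strings, the `Task` shape, the `"repo-default"` fallback, and the rule ordering (path rules before tag rules) are all choices made for this example.

```python
from dataclasses import dataclass, field

# Placeholder labels; the identifiers a real policy file expects are not
# given in the article.
OPUS = "claude-opus-4.7"
GPT = "gpt-5.5"
COMPOSER = "composer-2.5"
REPO_DEFAULT = "repo-default"  # hypothetical "no rule matched" fallback


@dataclass
class Task:
    kind: str                 # e.g. "refactor", "bugfix", "feature"
    paths: list[str]          # repo-relative paths the task is expected to touch
    tags: set[str] = field(default_factory=set)


def _under(path: str, prefixes: tuple[str, ...]) -> bool:
    """True if path sits under any of the given directory prefixes."""
    return path.startswith(prefixes)


def route(task: Task) -> str:
    """Return the model label for a task; the first matching rule wins."""
    # Refactors touching auth/ or payments/ go to the highest-stakes model.
    if task.kind == "refactor" and any(
            _under(p, ("auth/", "payments/")) for p in task.paths):
        return OPUS
    # Tasks tagged external-api-integration.
    if "external-api-integration" in task.tags:
        return GPT
    # Tasks confined entirely to tests/ and tooling/.
    if task.paths and all(_under(p, ("tests/", "tooling/")) for p in task.paths):
        return COMPOSER
    return REPO_DEFAULT


if __name__ == "__main__":
    print(route(Task("refactor", ["payments/ledger.py"])))
```

Because the policy is an ordinary function, it can be unit-tested, reviewed in a PR, and changed with the same discipline as the code it routes.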

What it doesn't change

The senior reviewer is still the bottleneck. Coding Agent shifts the volume of code-writing work to the machine, but it does not — yet — replace the senior reviewer's judgment on cross-system impact, security-critical paths, architectural tradeoffs, or the ambiguous-requirements cases that the agent has no priors for. The right framing is that the senior reviewer's time gets leveraged — they arrive at the PR with the obvious items already flagged by CI, the confidence-tagged change ranges already highlighted by the agent, and the routine review steps already complete. It doesn't reduce the senior reviewer headcount the org needs. If anything, the bar on each senior reviewer's judgment goes up.

Agentic regressions still happen in production. The March Amazon incident — 6.3 million dropped orders attributed to an AI-assisted deploy that passed staging — was the executive-visible version of a quieter pattern: agent-authored changes that look correct, pass CI, and break in production at scale. The Coding Agent GA does not eliminate that pattern; it makes it cheaper to ship the regression. The teams that invest in deploy gates, canary releases, and feature flags ship Coding Agent safely; the teams that adopt the agent without re-investing in production safety ship the next Amazon-class incident.
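A canary gate is one of the cheaper safety nets to add. The sketch below shows only the shape of the decision: the thresholds are arbitrary placeholders, the function name is hypothetical, and a plain rate comparison like this is not a statistical test, so a production gate would want something more careful.

```python
def canary_decision(baseline_errors: int, baseline_requests: int,
                    canary_errors: int, canary_requests: int,
                    min_requests: int = 1000,
                    max_relative_increase: float = 0.10) -> str:
    """Return 'wait', 'promote', or 'rollback' for a canary release.

    The defaults (1000 requests, 10% relative increase) are illustrative
    assumptions, not recommended values.
    """
    if canary_requests < min_requests or baseline_requests <= 0:
        # Not enough traffic yet to compare the two error rates.
        return "wait"
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    # With a zero-error baseline, any canary error triggers a rollback.
    allowed_rate = baseline_rate * (1 + max_relative_increase)
    return "promote" if canary_rate <= allowed_rate else "rollback"


if __name__ == "__main__":
    print(canary_decision(baseline_errors=100, baseline_requests=100_000,
                          canary_errors=5, canary_requests=1_000))
```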

The agent's quality still depends on the model. Coding Agent on Claude Opus 4.7 produces different code than Coding Agent on Composer 2.5. The benchmark gap is small per-PR but accumulates across thousands of PRs; the routing decision matters, the eval matters, and the per-workload data matters. "Pick a default and move on" is a procurement shortcut that costs you the optionality you paid for.

Where we'd push back on the framing

"Autonomous" is the right word for the marketing and the wrong word for the production conversation. The Coding Agent is autonomous in the sense that it closes the loop from issue to PR without a human in the keyboard. It is not autonomous in the sense of "can be trusted to merge without review" — and GitHub's own messaging on the GA release explicitly emphasizes human review at merge. The right framing for the production conversation is "agent writes, human reviews, system gates" — not "agent ships." Be precise about this; the misunderstanding is the source of every "we adopted Coding Agent and it caused an incident" story.

"Per-task billing" is convenient and partial. A predictable per-task cost lets the procurement team forecast spend, which is genuinely valuable. It also abstracts away the underlying token economics, which means the team can't directly compare cost-per-task across models or vendors without re-pricing the work. The right defensive posture is to keep your own token-level instrumentation — even when GitHub is billing per task — so that the day you want to switch vendors or move workloads to a self-hosted model, you have the data to make the decision.

"GA across every Copilot tier" is convenient and concentrating. GitHub now owns the issue, the code, the PR, the review, the merge, the agent, and increasingly the billing tier behind the agent. That is convenient for buyers who already standardized on GitHub and a structural concern for buyers thinking about multi-vendor optionality. The right defense is the same as at the model layer: keep your CI, your eval, your audit trail, and your routing-policy artifacts in a form that's portable to a different platform if the procurement conversation shifts.

What we'd build differently this week

  • Audit your review pipeline for capacity. For each repo, answer: if agent throughput on this repo doubles next quarter, who reviews the PRs, in what UI, with what tooling, and to what rubric? The repos where the answer is "no one yet" are the ones where the agent should be gated until the review pipeline catches up. Don't raise agent throughput without raising review capacity.
  • Write your routing policy down. Per-repo defaults, per-task-class overrides, escalation rules for the hardest workloads. "Opus for security-critical paths; Composer for bounded utility work; GPT-5.5 for external-API integrations." The policy doesn't have to be perfect on day one; it has to exist, be reviewed quarterly, and be owned by a named engineer.
  • Stand up an agent-PR-specific eval harness. Same prompts, same harness, same rubric, run against a representative sample of issues, scored on time-to-merge, review iterations, escape rate to production. Two weeks of structured data beats a quarter of "the agent feels OK" and gives you the ammunition to defend or refine the routing policy at the next review meeting.
  • Re-invest in CI as a precondition for raising agent throughput. Fast tests, deterministic builds, type-checking, security scanning, license compliance, secrets handling — all surfaced inline in the PR before the senior reviewer sees the diff. The CI pipeline is the cheapest place to catch an agent regression; the production deploy is the most expensive.
  • Update your PR templates and review tooling. Mark agent-authored PRs distinctly. Surface the agent's reasoning trace, the tool calls it made, and the confidence-tagged change ranges in a place the reviewer actually looks. The reviewer's signal-to-noise ratio is the binding constraint on review quality; the UI is what determines that ratio.
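For the eval-harness item in the list above, the scoring side can start very small. This sketch aggregates the three metrics named there (time-to-merge, review iterations, escape rate) per model. The `AgentPrOutcome` record and its field names are assumptions; how each outcome is collected, and how an escape is attributed to a PR, is left to the team.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class AgentPrOutcome:
    """One agent-authored PR from the evaluation sample (hypothetical shape)."""
    model: str
    hours_to_merge: float
    review_iterations: int
    escaped_to_production: bool  # a defect traced to this PR reached production


def summarize(outcomes: list[AgentPrOutcome]) -> dict[str, dict[str, float]]:
    """Group outcomes by model and compute the three headline metrics."""
    by_model: dict[str, list[AgentPrOutcome]] = {}
    for outcome in outcomes:
        by_model.setdefault(outcome.model, []).append(outcome)
    return {
        model: {
            "prs": len(rows),
            "median_hours_to_merge": median(r.hours_to_merge for r in rows),
            "mean_review_iterations": sum(r.review_iterations for r in rows) / len(rows),
            # Booleans sum as 0/1, so this is the fraction of PRs that escaped.
            "escape_rate": sum(r.escaped_to_production for r in rows) / len(rows),
        }
        for model, rows in by_model.items()
    }


if __name__ == "__main__":
    sample = [
        AgentPrOutcome("model-a", 2.0, 1, False),
        AgentPrOutcome("model-a", 6.0, 3, True),
        AgentPrOutcome("model-b", 4.0, 2, False),
    ]
    for model, stats in summarize(sample).items():
        print(model, stats)
```

Running the same sample of issues through each model and comparing these summaries is what turns "the agent feels OK" into something a routing policy can be defended with.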

Sonnet Code's take

The Copilot Coding Agent GA is the moment autonomous coding stopped being a category-defining feature and became default infrastructure. The right read isn't "engineering productivity just doubled." It's that the bottleneck in software delivery just moved from writing to reviewing, and the engineering teams whose review process, CI pipeline, and rubric maintenance were already brittle are about to learn it under load. The teams that ship Coding Agent successfully through this cycle are the ones who invest as much in the review side of the loop (the rubric, the senior-engineer-authored eval, the CI gating, the agent-PR review tooling, the deploy safety net) as in adopting the agent itself.

We staff that work directly. AI development at Sonnet Code is the engineering that designs the routing layer, the CI gating, the agent-PR review surface, the deploy-safety plumbing, and the observability stack that turns autonomous PRs into shippable software. We pair it with AI training engagements where senior practitioners (staff engineers, security architects, principal reviewers) author the rubrics and golden examples that grade what the agent actually does on your codebase, against your style guide, with your failure modes encoded. If your team is reading the GA announcement this week and wondering whether your review process survives a 5× increase in PR volume, the next conversation isn't about whether to turn on the agent. It's about who owns the rubric the agent is graded against, where the deploy gate lives, and which senior practitioner's review keeps the merge-quality bar honest as the agent's throughput keeps climbing.