SONNET CODE
AI Tools · June 30, 2026 · 8 min read

Cognition Ships FrontierCode: Mergeability, Not Correctness

What shipped on June 8 and why this benchmark is different from every coding bench before it

On June 8, 2026 Cognition shipped FrontierCode, a coding-agent benchmark designed against a single question: would an open-source maintainer actually merge this pull request? The benchmark is built from real PRs against real OSS repositories, scored against maintainer-authored rubrics that grade five dimensions in parallel — correctness, tests, scope, style, and maintainability — and structured into three nested splits: Diamond (50 hardest tasks), Main (100 tasks), Extended (150 tasks). More than 20 open-source maintainers helped design the tasks; each task took over 40 hours to build, review, attack, and calibrate against adversarial submissions.

The headline scores on the Diamond split tell the operationally important story:

  • Claude Opus 4.8 — 13.4% (the best score on the benchmark)
  • GPT-5.5 — 6.3%
  • Gemini 3.1 Pro — 4.7%
  • Kimi K2.6 — 3.8% (the best open-weight model)

The gap between the top frontier model's headline SWE-bench Verified score (88.6%) and its FrontierCode Diamond score (13.4%) is not noise. It is the distance between what SWE-bench-style benchmarks grade (does the patch compile and pass the test suite) and what a maintainer grades (does the patch fit the codebase well enough to merge). FrontierCode is the first coding benchmark designed against the second question rather than the first.

The structural read is not that the models are bad. It is that per-PR mergeability is the production constraint earlier coding benchmarks had been silently ignoring for two years, and the Q3 procurement question for any team running coding agents against production code now has to be graded against the right metric.

Four shifts in how the AI-integrated product team grades coding agents in the next quarter

Four concrete things change about the team's diligence cycle the week FrontierCode lands inside the standing routing matrix.

The eval surface changes from per-task pass rate to per-PR mergeability rate. A team that has been grading coding agents on per-task SWE-bench-style benchmarks has been grading production reliability against the wrong metric. The per-PR mergeability rate is the metric the team's senior code reviewer already grades against in the standing review queue; FrontierCode is the bench that aligns the external vendor-comparison surface with the internal review-queue surface. The diligence artifact for the FY27 routing matrix becomes the per-vendor per-PR mergeability rate against the team's own code review rubric, not the vendor-published benchmark headline.

The human-in-the-loop code review workstream becomes a load-bearing line item, not an overhead line. A top-of-bench 13.4% Diamond score is operationally honest: read as a merge rate, it means the autonomous coding agent ships a mergeable PR roughly one time in eight against the hardest tasks. The other seven need senior-engineer code review, scope adjustment, test refactoring, or style-and-maintainability rewrites before the PR is mergeable. The team that under-budgets the per-PR review workstream is the team that posts the per-feature velocity regression two quarters into the agent rollout. The team that budgets the review workstream as a first-class engineering line item is the team that converts the agent rollout into the throughput translation the FY27 plan promised.

The five-dimension rubric becomes the team's standing code-review artifact. FrontierCode's correctness/tests/scope/style/maintainability rubric is the artifact the team's senior code reviewer applies in the standing review queue, written down explicitly for the first time. The team that adopts the same five-dimension rubric against its own PR-review surface buys itself the per-PR-grade instrumentation that converts the agent rollout from a per-task throughput conversation to a per-PR-grade throughput conversation — the conversation the team actually needs against the production-reliability surface. The rubric is a senior-engineering-function artifact, not a tooling artifact; the team that ships the rubric inside the code review workflow is the team that operates against the production constraint.

The per-PR mergeability rate becomes the per-vendor diligence delta inside the routing matrix. Each vendor in the four-vendor coding-tool routing matrix the FY27 procurement plan has to grade (Cursor, Claude Code, Codex, Grok Build) gets a per-vendor mergeability delta the team measures on its own codebase. The routing-matrix decision is no longer which vendor's bench score is higher; it is which vendor's per-PR mergeability rate is higher against the team's own code review rubric on a representative slice of the team's own backlog. The team that runs the per-vendor mergeability eval against ten representative PRs from its own backlog buys itself the diligence-grade portability artifact the FY27 standing contract underwrites against.

Where this lands in the AI-integrated product team's next sprint

The product team that already ships coding agents inside the production engineering loop has three concrete pieces of work that drop into the sprint backlog this week.

Adopt the five-dimension mergeability rubric inside the standing code review workflow. Translate FrontierCode's correctness/tests/scope/style/maintainability rubric into the team's own per-PR code review template; require the senior code reviewer to grade every agent-authored PR against all five dimensions on the standing pull-request comment surface. The artifact lives inside the team's PR template, not inside the vendor's evaluation dashboard, and the per-PR-grade history becomes the diligence asset the FY27 routing matrix grades against in the standing engineering review.
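One way to make the per-PR grade machine-readable is to record it as a small typed structure. The sketch below is illustrative only: the five dimension names come from the rubric described above, but the `RubricGrade` class, the 1-5 scale, and the `MERGE_THRESHOLD` cut-off are assumptions made for this example, not FrontierCode's actual scoring scheme.

```python
from dataclasses import dataclass, fields

# Assumption for illustration: each dimension is graded 1-5 and a PR is
# considered mergeable only if every dimension scores at least 4.
MERGE_THRESHOLD = 4


@dataclass(frozen=True)
class RubricGrade:
    """One reviewer's grade for one agent-authored PR (hypothetical schema)."""
    correctness: int
    tests: int
    scope: int
    style: int
    maintainability: int

    def __post_init__(self):
        # Reject grades outside the assumed 1-5 scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be between 1 and 5, got {value}")

    def is_mergeable(self) -> bool:
        """True only if every dimension meets the threshold."""
        return all(getattr(self, f.name) >= MERGE_THRESHOLD for f in fields(self))

    def weakest_dimensions(self) -> list[str]:
        """Names of the dimensions that block the merge, in rubric order."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) < MERGE_THRESHOLD]


if __name__ == "__main__":
    grade = RubricGrade(correctness=5, tests=2, scope=4, style=5, maintainability=3)
    print(grade.is_mergeable(), grade.weakest_dimensions())
```

Requiring every dimension to clear the bar, rather than averaging, mirrors the idea that a patch that passes tests but sprawls in scope is still not mergeable; a team could reasonably choose a different aggregation.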

Instrument the per-PR mergeability rate per vendor against the team's own backlog. Pick ten representative PRs from the team's last sprint that the senior reviewer judges as the right size and shape for an agent attempt; run every coding-agent vendor in the routing matrix against the same ten PRs; grade the output PRs against the five-dimension rubric; record the per-vendor mergeability rate as the standing diligence artifact for the FY27 procurement conversation. The artifact updates quarterly against a refreshed PR sample so the routing matrix tracks the per-vendor capability slope the standing contract has to underwrite against.
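The bookkeeping for that eval is small enough to sketch. The following is a minimal illustration, assuming the reviewer's verdicts are recorded as `(vendor, pr_id, mergeable)` tuples; the function names and the vendor labels are hypothetical. It also computes a Wilson score interval, because a ten-PR sample leaves a wide margin of uncertainty around any single rate.

```python
import math
from collections import defaultdict


def mergeability_rates(results):
    """Map each vendor to (mergeable_count, total_count).

    `results` is an iterable of (vendor, pr_id, mergeable) tuples, where
    `mergeable` is the reviewer's yes/no verdict for that vendor's PR.
    """
    merged = defaultdict(int)
    total = defaultdict(int)
    for vendor, _pr_id, mergeable in results:
        total[vendor] += 1
        if mergeable:
            merged[vendor] += 1
    return {vendor: (merged[vendor], total[vendor]) for vendor in total}


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Made-up verdicts for two hypothetical vendors on the same ten PRs.
    sample = [("vendor_a", i, i < 6) for i in range(10)]
    sample += [("vendor_b", i, i < 4) for i in range(10)]
    for vendor, (k, n) in mergeability_rates(sample).items():
        low, high = wilson_interval(k, n)
        print(f"{vendor}: {k}/{n} mergeable, interval {low:.2f} to {high:.2f}")
```

With only ten PRs per vendor the intervals for two vendors will often overlap, which is one argument for refreshing and accumulating the sample each quarter rather than reading a single run as decisive.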

Re-grade the per-feature velocity budget against the per-PR mergeability rate. The team's per-feature velocity budget was built against a one-PR-per-task assumption the per-task bench score implicitly underwrote. The honest per-PR mergeability rate likely sits somewhere between 13.4% (top frontier model on Diamond) and the 50-70% band a team may measure on its own representative-size PRs; the budget that did not account for the per-PR review-and-rewrite loop is the budget the team has to refresh before the next sprint planning conversation. The per-PR review workstream lands inside the velocity budget as a first-class line item the senior-engineering function owns, not as overhead the rollout absorbed.
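A back-of-the-envelope model shows how sensitive the review budget is to that rate. This sketch assumes a deliberately simple cost model that is not from any benchmark: every agent-authored PR gets a light review, and every PR that is not mergeable as submitted also costs a fixed rewrite effort. The function name and the hour figures are placeholders to be replaced with the team's own measurements.

```python
def expected_review_hours(mergeable_rate: float,
                          light_review_hours: float,
                          rewrite_hours: float) -> float:
    """Expected senior-review hours per agent-authored PR.

    Assumed model: every PR gets a light review; PRs that are not
    mergeable as submitted additionally incur the rewrite cost.
    """
    if not 0.0 <= mergeable_rate <= 1.0:
        raise ValueError("mergeable_rate must be between 0 and 1")
    return light_review_hours + (1.0 - mergeable_rate) * rewrite_hours


if __name__ == "__main__":
    # Placeholder hours, NOT measurements: substitute the team's own data.
    LIGHT_REVIEW_HOURS = 0.5
    REWRITE_HOURS = 3.0
    for rate in (0.134, 0.50, 0.70):
        hours = expected_review_hours(rate, LIGHT_REVIEW_HOURS, REWRITE_HOURS)
        print(f"mergeable rate {rate:.1%}: {hours:.2f} review hours per PR")
```

Multiplying the per-PR figure by the number of agent PRs planned for a sprint gives a first estimate of the review line item; a more faithful model would let rewrite effort vary by which rubric dimension failed.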

The senior judgment FrontierCode makes visible

FrontierCode makes one thing operationally explicit: the coding-agent surface is not the per-task throughput surface the bench score advertised. The surface is the per-PR mergeability surface the team's senior code reviewer has always graded against, and the senior-engineering function is the function that owns the per-PR-grade rubric, the per-vendor mergeability instrumentation, the per-feature velocity budget against the honest review-and-rewrite loop, and the per-quarter portability commitment against the standing contract.

The procurement question is no longer which coding-agent vendor has the highest bench score. It is four questions: which vendor's per-PR mergeability rate is highest against the team's own five-dimension code review rubric on a representative slice of the team's own backlog; how much per-PR senior-review headcount the agent rollout costs against the standing engineering capacity; how the per-feature velocity budget refreshes against the honest per-PR review-and-rewrite loop; and where the FY27 routing-matrix decision lands inside the standing contract built against the wrong-metric bench scores six months ago. The teams that ask the right questions this quarter buy themselves the production-grade throughput translation the FY27 plan promised; the teams that ask the wrong one buy themselves the post-mortem on the per-feature velocity regression the wrong-metric bench score quietly underwrote.


At SONNET CODE we run the per-vendor mergeability eval for every AI-integration engagement we ship — the five-dimension rubric inside the standing PR template, the per-vendor diligence artifact, the per-feature velocity budget refreshed against the honest per-PR review-and-rewrite loop. If your team is re-grading the FY27 coding-agent routing matrix against the right metric, schedule a call — we'll walk you through the per-PR mergeability instrumentation we run against the production code surface.