AI & Machine Learning · May 4, 2026 · 7 min read

Live-SWE-agent at 79.2%: The Scaffold Just Caught Up to the Lab

The number, then the implication

OpenAutoCoder's Live-SWE-agent, an open-source scaffold paired with Claude Opus 4.5, scored 79.2% on SWE-bench Verified in the leaderboard refresh that landed three weeks ago — within 1.7 percentage points of Anthropic's manually engineered internal harness. On the harder SWE-Bench Pro benchmark — designed explicitly to resist contamination and to look more like enterprise codebases than the curated open-source repos in Verified — Live-SWE-agent posted 45.8%, the leading open-source result and within striking distance of the top proprietary entries.

Headline: an open-source scaffold caught up to a lab harness. Substance: scaffold quality has officially overtaken raw model selection as the largest lever on agentic-coding outcomes.

Why "the scaffold caught up" is the bigger story than "the model improved"

For two years the conventional ordering on a coding-agent project has been:

  1. Pick the best model.
  2. Wrap it in a thin loop and ship.
  3. Treat the scaffolding work as undifferentiated plumbing.

That ordering was always slightly wrong, and it is now visibly wrong. The same Opus 4.5 weights can swing more than 13 points on SWE-bench Verified depending on the scaffold wrapped around them. Read the leaderboard methodology pages and the pattern repeats: the same model moves 5–15 points when nothing changes but the agent loop. The scaffold owns:

  • Tool definitions — which shell commands, file ops, and search primitives the agent has, and how aggressively they are scoped.
  • Context construction — what gets pulled into the prompt, in what order, and at what compression ratio.
  • Self-evaluation hooks — when the agent runs the tests, when it reads the tests' output, and how it interprets failure.
  • Loop control — when to keep going, when to ask for help, when to back out and try a different decomposition.

These are engineering decisions, not model-selection decisions. They are also where most teams under-invest because they are unglamorous and don't show up on a model card.
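
To make those four bullets concrete, here is a minimal sketch of an agent loop in Python. Nothing in it is the Live-SWE-agent implementation; the tool table, the context builder, the test hook, and the step budget are illustrative assumptions. The point is that every one of those decisions is a line of code somebody on your team writes and owns.

```python
import subprocess

# Hypothetical minimal scaffold. Each named piece below is a scaffold
# decision, not a model decision.

TOOLS = {
    # Tool definitions: what the agent may do, and how tightly it is scoped.
    "read_file": lambda path: open(path).read()[:20_000],
    "run_tests": lambda _: subprocess.run(
        ["pytest", "-x", "-q"], capture_output=True, text=True
    ).stdout[-4_000:],
}

def build_context(task: str, history: list[str]) -> str:
    # Context construction: what goes into the prompt, in what order,
    # and how aggressively older turns are dropped.
    recent = history[-6:]
    return "\n\n".join([f"TASK:\n{task}", *recent])

def run_agent(task: str, call_model, max_steps: int = 30) -> str | None:
    # call_model: callable that takes the prompt and returns an action dict,
    # e.g. {"type": "tool", "tool": "run_tests", "arg": ""} or
    #      {"type": "submit", "patch": "diff --git ..."}.
    history: list[str] = []
    for _ in range(max_steps):          # Loop control: a hard step budget.
        action = call_model(build_context(task, history))
        if action["type"] == "submit":
            return action["patch"]
        tool = TOOLS.get(action["tool"])
        if tool is None:
            history.append(f"ERROR: unknown tool {action['tool']!r}")
            continue
        output = tool(action["arg"])
        history.append(f"{action['tool']}({action['arg']!r}):\n{output}")
        if action["tool"] == "run_tests":
            # Self-evaluation hook: force the model to read and interpret
            # the test output before its next edit.
            history.append("Interpret the failures above before editing.")
    return None                         # Loop control: back out rather than thrash.
```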

What "self-evolving" actually means in the Live-SWE-agent claim

The OpenAutoCoder paper describes Live-SWE-agent as "live, runtime self-evolving." Translated out of researchspeak: the scaffold can rewrite parts of itself between problems, observing what worked on the previous instance and adjusting its tool surface and prompts before the next attempt. This is more interesting in theory than it is yet in practice — the observed gains over a static-scaffold baseline are real but modest, and most of the 79.2% number comes from a scaffold that was well-engineered by humans who understood the failure modes, not from one that evolved its way to that number live.
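
To make the claim concrete, the simplest version of that loop looks something like the sketch below: between instances, the scaffold asks the model to critique the last transcript and folds the answer into its own system prompt. This is our illustrative reading of "runtime self-evolving," not the OpenAutoCoder implementation; attempt and ask_model are hypothetical callables you would supply, and a real implementation would mutate the tool surface as well as the prompt.

```python
def self_evolving_run(instances, attempt, ask_model, base_prompt: str):
    # Hypothetical sketch of runtime self-evolution.
    #   attempt:   (instance, system_prompt) -> (patch, transcript), one agent run.
    #   ask_model: plain text-in / text-out completion call.
    # The scaffold's only mutable state here is its own system prompt,
    # updated between problems from what the previous attempt revealed.
    prompt = base_prompt
    patches = []
    for instance in instances:
        patch, transcript = attempt(instance, prompt)
        patches.append(patch)
        lesson = ask_model(
            "Here is the transcript of the last agent attempt:\n"
            + transcript[-4_000:]
            + "\nIn two sentences, what should the next attempt do differently?"
        )
        # Fold the lesson back into the scaffold's own instructions.
        prompt += "\n# Lesson from previous instance:\n" + lesson
    return patches
```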

The honest read: live self-modification is a research direction with a year or two left before it becomes the default. The boring read: the scaffold won by being well-designed. Both reads point in the same direction for a buyer — the scaffold is where the engineering work that compounds actually lives.

What this changes for product teams shipping coding agents

1. Stop benchmarking models in isolation. A score on an LLM leaderboard is a property of model + scaffold + prompt + tools, not of the model alone. If the comparison you ran last quarter to pick a backend was "Opus 4.6 vs. GPT-5.4 in the same agent loop you wrote in a weekend," you may have picked the model your weekend loop happened to suit, not the model that's actually best for your workload.
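
The fix is to benchmark the cross product rather than the diagonal: every candidate model inside every candidate scaffold, on the same task set. A small sketch of that matrix, assuming each scaffold is a callable that takes a model identifier and a task and returns a candidate patch; the parameter names are placeholders.

```python
from itertools import product

def score_matrix(models, scaffolds, tasks, check):
    # models:    list of model identifier strings, e.g. ["opus-4.5", "gpt-5.x"].
    # scaffolds: dict of name -> callable (model_id, task) -> candidate patch.
    # check:     callable (task, patch) -> bool, True if the patch passes
    #            the task's tests.
    scores = {}
    for model_id, (name, scaffold) in product(models, scaffolds.items()):
        passed = sum(check(task, scaffold(model_id, task)) for task in tasks)
        scores[(model_id, name)] = passed / len(tasks)
    return scores
```

If the ranking of models changes when you change the scaffold column, the weekend loop was choosing for you.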

2. The harness has to be a first-class artifact, with tests. It is normal in 2026 to find an agent in production with no unit tests on the scaffold, no regression suite on the prompts, no eval harness on the tool definitions. That is also where most agentic regressions come from — a scaffold change ships in a Friday PR, the model behaves differently against the real workflow, and nobody notices for two weeks because there is no harness on the harness.
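
A harness on the harness does not need to be exotic: ordinary unit tests pinned to scaffold behavior, run on every scaffold PR, with a stub model so they are deterministic. A sketch with pytest, importing the hypothetical scaffold module from the earlier loop; swap in your own entry points.

```python
# test_scaffold.py -- regression tests on the scaffold itself, not the model.
from scaffold import build_context, run_agent   # hypothetical module path

def test_context_keeps_task_and_drops_old_history():
    history = [f"turn {i}" for i in range(50)]
    ctx = build_context("fix the flaky login test", history)
    assert "fix the flaky login test" in ctx          # task is never truncated away
    assert "turn 49" in ctx and "turn 0" not in ctx   # old turns are dropped

def test_agent_respects_step_budget_with_unhelpful_model():
    calls = []
    def stub_model(prompt):
        calls.append(prompt)
        return {"type": "tool", "tool": "no_such_tool", "arg": ""}
    result = run_agent("do nothing useful", stub_model, max_steps=5)
    assert result is None        # backs out instead of looping forever
    assert len(calls) == 5       # the budget is enforced exactly
```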

3. Open-source scaffolds are now a credible starting point. A year ago the operational choice was "build your own loop or buy Cursor / Claude Code wholesale." With Live-SWE-agent, mini-SWE-agent, and a healthy crop of smaller harnesses on GitHub, there is now a third option: fork an open-source scaffold, adapt it to your tools, and own the result. The scaffold layer is where most of the value lives, but you no longer have to build it from zero to capture that value.
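
What "adapt it to your tools" tends to mean in practice is small and surgical: keep the loop, replace the tool surface. A hypothetical sketch, assuming the forked scaffold registers tools in a plain dict as in the earlier loop; the Bazel target and ripgrep paths are stand-ins for your own build system and repo layout.

```python
import subprocess
from scaffold import TOOLS   # the forked scaffold's tool registry (hypothetical path)

def run_bazel_tests(target: str) -> str:
    # Replace the generic pytest runner with the build system the codebase
    # actually uses.
    out = subprocess.run(
        ["bazel", "test", target, "--test_output=errors"],
        capture_output=True, text=True,
    )
    return (out.stdout + out.stderr)[-4_000:]

def search_internal(pattern: str) -> str:
    # Add a codebase-specific search primitive scoped to the directories
    # the agent is allowed to touch.
    out = subprocess.run(
        ["rg", "--max-count", "20", pattern, "services/", "libs/"],
        capture_output=True, text=True,
    )
    return out.stdout[:4_000]

TOOLS["run_tests"] = run_bazel_tests      # swap the tool, keep the loop
TOOLS["search_code"] = search_internal    # extend the tool surface
```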

Where we'd push back on the open-source-wins narrative

Two honest gaps.

Open scaffolds optimize for SWE-bench, not for your codebase. Verified and Pro are both still public, contained, and benchmark-shaped. Your codebase has 800K lines of internal Java, a build system that takes nine minutes to warm, conventions that don't appear in any open repo, and an ops surface that the agent has to respect. The scaffold that wins on Verified is a starting point for your scaffold, not your scaffold.

A scaffold without an eval suite is just opinion. Claiming your scaffold is good without a workload-specific eval to back it up is the same posture as claiming your prompt is good without an eval. The teams that get the most out of any scaffold — open or proprietary — are the ones that pair it with a regression suite tied to the actual tasks they need agents to do.
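
At its smallest, that workload-specific suite is a handful of real incidents from your tracker, each with the failing test that exposed it and the patch that actually fixed it, scored by whether the agent's patch makes that test pass. A minimal sketch; the JSONL task format and the choice to treat an unappliable patch as a failure are our assumptions, and a real harness would pin the buggy commit and sandbox the run.

```python
import json, pathlib, subprocess, tempfile

def eval_workload(tasks_file: str, run_scaffold) -> float:
    # tasks_file: JSONL, one real incident per line, e.g.
    #   {"repo": "git@github.com:acme/payments.git",
    #    "bug_description": "...",
    #    "failing_test": "tests/test_auth.py::test_refresh",
    #    "golden_patch": "diff --git ..."}   # kept for side-by-side review
    # run_scaffold: callable (checkout_dir, bug_description) -> patch text.
    tasks = [json.loads(line) for line in open(tasks_file)]
    passed = 0
    for task in tasks:
        workdir = pathlib.Path(tempfile.mkdtemp())
        subprocess.run(["git", "clone", task["repo"], str(workdir)], check=True)
        patch = run_scaffold(workdir, task["bug_description"])
        applied = subprocess.run(["git", "-C", str(workdir), "apply", "-"],
                                 input=patch, text=True)
        if applied.returncode != 0:
            continue                        # an unappliable patch is a failure
        test = subprocess.run(["pytest", task["failing_test"], "-q"],
                              cwd=workdir, capture_output=True)
        passed += test.returncode == 0      # the same bar the golden patch clears
    return passed / len(tasks)
```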

What we'd build differently this week

If our team were standing up a coding-agent project today, the architectural defaults we'd reach for after this leaderboard cycle:

  • Scaffold as a versioned artifact. Treat the agent loop, tool definitions, and prompts the same way you treat a service: versioned, code-reviewed, with a CHANGELOG. When something regresses, you want to know which commit on the scaffold caused it.
  • Workload-specific eval suite owned by humans who know the workload. SWE-bench is a useful sanity check; it is not a substitute for a hand-written suite of failures from your own bug tracker, with golden-patch comparisons.
  • Two scaffolds, not one. A small fast scaffold for the 90% of tasks that don't need long-horizon planning; a heavier scaffold (Live-SWE-agent-style or your own) for the long tail. Routing between them is a much smaller engineering project than people assume (a sketch follows this list), and it usually pays for itself in a quarter.
  • Model-agnostic from day one. The interface between scaffold and model should take a model identifier. Today's Opus 4.5 is tomorrow's Opus 4.7 is next quarter's something else.
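
A sketch of the last two defaults together: the scaffold sees a model only through a narrow interface keyed by identifier, and a cheap routing function decides which of the two scaffolds handles a task. The Task fields, the planning heuristic, and the Scaffold signature are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

class Model(Protocol):
    # Everything the scaffold is allowed to know about a model: an identifier
    # and a completion call. Swapping opus-4.5 for whatever ships next quarter
    # is then a config change, not a scaffold change.
    id: str
    def complete(self, prompt: str) -> str: ...

@dataclass
class Task:
    # Hypothetical task shape; replace with whatever your tracker or CI emits.
    description: str
    files_touched_estimate: int
    labels: tuple[str, ...] = ()

Scaffold = Callable[[Task, Model], str]       # returns a candidate patch

def route(task: Task, fast: Scaffold, heavy: Scaffold, model: Model) -> str:
    # Cheap heuristics up front: the heavy long-horizon scaffold only for the
    # tail that needs planning; everything else goes through the fast loop.
    needs_planning = task.files_touched_estimate > 3 or "migration" in task.labels
    return (heavy if needs_planning else fast)(task, model)
```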

Sonnet Code's take

The piece of the agentic-coding stack that compounds is the scaffold and the eval suite around it — not the model. We staff that work for clients on two sides of the same problem: AI development, where we build the harness, the tool layer, and the routing logic that turns a frontier model into a usable in-product agent; and AI training, where senior engineers who actually know your domain author the regression suites, red-team prompts, and preference data that tell you whether a scaffold change made things better or only different. If your team has been comparing models in isolation and wondering why the production agent feels worse than the demo, the next investment is rarely "the next model." It is the harness around the model you have, evaluated against the work your team actually does.