Sonnet Code
AI & Machine Learning · May 1, 2026 · 6 min read

Cursor Composer 2 Just Reset the Floor on Agentic Coding Costs

The numbers, in one paragraph

Cursor's Composer 2 — built on Kimi K2.5 with Cursor's own continued pretraining and reinforcement learning on top — scores 73.7 on SWE-bench Multilingual, 61.7 on Terminal-Bench 2.0, and 61.3 on CursorBench. That's frontier territory. The pricing is what makes the announcement structurally interesting: $0.50 per million input tokens and $2.50 per million output tokens for the standard tier, $1.50/$7.50 for a faster variant. That's an 86% reduction from Composer 1.5, which launched a month earlier at $3.50/$17.50.

The headline number is the price. The underlying number is the curve.

A model class is becoming a commodity in real time

Competent agentic coding cost roughly $20 per million output tokens at the start of the year. It now costs $2.50 at the cheap end of the frontier. That isn't a normal price-performance improvement; it's a market structure event. Three forces are pushing on it at once:

  • Open-weight foundations. Composer 2 is built on an open Moonshot model. The base model didn't cost Cursor a billion dollars to train.
  • Specialization beating scale. A coding-specific RL pipeline on a smaller base now competes with general-purpose frontier models that are 5-10x larger. Coding is a domain where verifiable rewards work — you can run the test, check the diff, ground the loop in something real — so RL-on-a-smaller-base is unusually effective here.
  • Distribution as the moat. Cursor doesn't need Composer 2 to print API margin. They need it to keep developers inside the Cursor IDE. That economic structure lets them price aggressively in a way that pure-play model labs can't.

The consequence: any product roadmap that treats "the cost of LLM-driven code generation" as a fixed input is wrong. It's a quantity that's halving every couple of months and is unlikely to stop until something breaks the curve.

What this does to AI-integrated product economics

Three shifts worth designing around:

1. Features you shelved because the inference cost didn't pencil out are probably back on the table. A year ago, "the AI rewrites the entire ticket queue every night to flag the ones likely to be misrouted" cost real money at scale. Today the same workload is a rounding error on the cloud bill; the back-of-envelope sketch after this list shows why. If your team has a list of "someday, when LLMs get cheap enough" backlog items, walk through it this quarter. Most of them have crossed the line.

2. The product question is no longer "can we afford to call the model?" but "can we afford the latency budget?" Frontier coding models still take 10-60 seconds for non-trivial tasks. For a developer-facing product that's fine. For a user-facing product where the user is waiting, it's not. The next eighteen months of product engineering are largely going to be about caching, streaming, and pre-computation patterns that hide that latency (the streaming sketch after this list shows one shape), not about whether to use the model.

3. The team that knows when not to use a frontier model is more valuable than the team that knows how to use one. When the price floor on frontier coding drops 86% in a month, the temptation is to route everything to it. Don't. A regex still beats a $0.50/M model on the workloads that need to run a billion times a day; the deterministic-first sketch after this list shows the shape. The senior engineering judgment that knows which problems deserve which tier (deterministic code, embeddings, small model, large model, agent loop) is becoming the binding constraint on AI-native product teams.
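
To make "rounding error" concrete, here's the back-of-envelope math for the nightly ticket-queue example. The queue size and per-ticket token counts are illustrative assumptions, not measured numbers; only the per-token prices come from the announcement.

```typescript
// Back-of-envelope: nightly reclassification of a ticket queue.
// Workload numbers are assumptions for illustration; only the
// per-token prices come from the Composer 2 announcement.
const tickets = 10_000;              // assumed nightly queue size
const inputTokensPerTicket = 1_500;  // assumed: ticket text plus instructions
const outputTokensPerTicket = 200;   // assumed: label plus short rationale

const inputPricePerM = 0.5;   // Composer 2 standard tier, $ per 1M input tokens
const outputPricePerM = 2.5;  // $ per 1M output tokens

const nightly =
  (tickets * inputTokensPerTicket / 1e6) * inputPricePerM +
  (tickets * outputTokensPerTicket / 1e6) * outputPricePerM;

console.log(nightly.toFixed(2)); // 12.50 per night, roughly $375/month
// The same job at Composer 1.5's $3.50/$17.50 pricing: $87.50 per night.
```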
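
Here's the shape of the latency answer from point 2, as a minimal sketch. `streamModel` is a hypothetical stand-in for whatever client you actually use; the pattern is what matters: serve cached answers instantly, and stream tokens as they arrive when you can't, so the user never stares at a 10-60 second spinner.

```typescript
// Sketch: hide model latency behind a cache and token streaming.
// `streamModel` is a hypothetical client that yields tokens as they arrive.
declare function streamModel(prompt: string): AsyncIterable<string>;

const cache = new Map<string, string>(); // swap for Redis or similar in production

async function respond(prompt: string, onToken: (t: string) => void): Promise<string> {
  const hit = cache.get(prompt);
  if (hit !== undefined) {
    onToken(hit);          // instant path: no model call at all
    return hit;
  }
  let full = "";
  for await (const token of streamModel(prompt)) {
    full += token;
    onToken(token);        // user sees progress from the first token
  }
  cache.set(prompt, full); // the next identical request is free and instant
  return full;
}
```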
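
And the deterministic-first discipline from point 3, sketched. `callSmallModel` is hypothetical; the structure is the point: the regex tier answers most requests for free, and only the ambiguous remainder ever pays for inference.

```typescript
// Sketch: route to a model only when deterministic code can't answer.
// `callSmallModel` is a hypothetical client; the routing logic is the point.
declare function callSmallModel(prompt: string): Promise<string>;

// Deterministic tier: runs in microseconds, costs nothing per call.
const EMAIL_RE = /[^\s@]+@[^\s@]+\.[^\s@]+/;

async function extractContactEmail(text: string): Promise<string | null> {
  const match = text.match(EMAIL_RE);
  if (match) return match[0];   // tier 0: regex handles the common case

  // Only the ambiguous remainder pays for inference.
  const answer = await callSmallModel(
    `Extract the contact email from the text below, or reply "none".\n\n${text}`
  );
  return answer.trim().toLowerCase() === "none" ? null : answer.trim();
}
```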

What we'd build differently this week

If we were starting an AI-integrated product today, the architectural defaults we'd reach for are different from the defaults we reached for in January:

  • Model-agnostic core. Wrap the LLM call behind an interface that takes a model identifier; a minimal version is sketched after this list. Don't bake Anthropic or OpenAI into the type signature. The cost-quality frontier is moving too fast to commit.
  • Eval suite first, prompt second. When models are this cheap to swap, the eval suite is the asset; a toy harness follows this list. The prompts will keep changing; the eval is what tells you whether the swap was a good idea.
  • Two-tier routing as the default. A small-fast model for the 90% of requests that don't need frontier reasoning, a frontier model for the long tail; a router sketch follows this list. Composer 2's pricing is what makes this routing pattern obvious; a year ago it felt like premature optimization.
  • Owned tool definitions and owned context-window management. The thing that's not commoditizing is the integration glue: which tools the agent can call, what it knows about your domain, how you compress your codebase into a working context. That's where the engineering investment compounds; a tool-definition sketch closes out the examples below.
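
First, the model-agnostic core. This is a minimal sketch under assumptions: `CompletionClient`, `StubClient`, and the model identifiers are hypothetical names, not any vendor's actual SDK. Real adapters would wrap each provider's client behind the same interface.

```typescript
// Sketch: one interface, many providers. Nothing vendor-specific leaks
// into application code; swapping models becomes a config change.
interface CompletionRequest {
  model: string;        // opaque identifier, e.g. "cheap-fast" or "frontier"
  prompt: string;
  maxTokens?: number;
}

interface CompletionClient {
  complete(req: CompletionRequest): Promise<string>;
}

// Hypothetical adapter; a real one wraps a vendor SDK behind the interface.
class StubClient implements CompletionClient {
  constructor(private label: string) {}
  async complete(req: CompletionRequest): Promise<string> {
    return `[${this.label}] response to: ${req.prompt.slice(0, 40)}`;
  }
}

// Application code sees only the interface and a model id.
const clients: Record<string, CompletionClient> = {
  "cheap-fast": new StubClient("small model"),
  "frontier": new StubClient("frontier model"),
};

export async function complete(model: string, prompt: string): Promise<string> {
  return clients[model].complete({ model, prompt });
}
```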
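
Next, the eval suite. A toy harness with illustrative cases; real coding evals grade diffs or test runs, but the asset has the same shape: a fixed set of graded cases you can rerun against any candidate model in an afternoon.

```typescript
// Sketch: a tiny eval harness. The suite, not the prompt, is the durable asset.
type Grader = (output: string) => boolean;

interface EvalCase {
  name: string;
  prompt: string;
  grade: Grader; // deterministic check: does the output pass?
}

// Illustrative cases; real ones would run tests or inspect diffs.
const suite: EvalCase[] = [
  { name: "extracts year", prompt: "What year is in 'shipped in 2019'?",
    grade: (o) => o.includes("2019") },
  { name: "declines to invent", prompt: "Quote the 2031 spec verbatim.",
    grade: (o) => /don't|cannot|no such/i.test(o) },
];

async function runSuite(
  complete: (prompt: string) => Promise<string>
): Promise<number> {
  let passed = 0;
  for (const c of suite) {
    const ok = c.grade(await complete(c.prompt));
    console.log(`${ok ? "PASS" : "FAIL"} ${c.name}`);
    if (ok) passed++;
  }
  return passed / suite.length; // compare this score across candidate models
}
```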
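
Then the two-tier router. The `needsFrontier` heuristic below is a placeholder assumption; in practice the signal comes from task type, input size, or a small classifier, and the tiers map to whatever model identifiers your interface uses.

```typescript
// Sketch: two-tier routing. Cheap model by default, frontier for the long tail.
interface CompletionClient {
  complete(req: { model: string; prompt: string }): Promise<string>;
}

// Placeholder heuristic; replace with a real difficulty signal.
function needsFrontier(prompt: string): boolean {
  return prompt.length > 2_000 || /refactor|migrate|multi-file/i.test(prompt);
}

async function route(
  clients: Record<"cheap" | "frontier", CompletionClient>,
  prompt: string
): Promise<string> {
  const tier = needsFrontier(prompt) ? "frontier" : "cheap";
  return clients[tier].complete({ model: tier, prompt });
}
```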
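
Finally, the layer that doesn't commoditize: owned tool definitions. The schema here is a hypothetical shape rather than any vendor's tool-calling format; the point is that the registry, like the eval suite, outlives whichever model happens to be in the loop.

```typescript
// Sketch: owned tool definitions. The registry survives model swaps.
interface ToolDef {
  name: string;
  description: string; // what the model reads to decide when to call the tool
  run: (args: Record<string, string>) => Promise<string>;
}

// Domain-specific tools are the moat; the model is a substitutable runtime.
const tools: ToolDef[] = [
  {
    name: "run_tests",
    description: "Run the test suite for a package and return failures.",
    run: async ({ pkg }) => `stub: tests for ${pkg}`, // wire to CI in production
  },
  {
    name: "search_codebase",
    description: "Full-text search the repo; returns file paths and snippets.",
    run: async ({ query }) => `stub: hits for ${query}`,
  },
];

// Whatever model is in the loop, it is handed the same owned definitions.
export function toolManifest(): string {
  return tools.map((t) => `${t.name}: ${t.description}`).join("\n");
}
```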

Sonnet Code's take

The model layer is becoming a substitutable runtime. The work that compounds is the layer above and the layer below it: evals, tool definitions, context management, routing logic, observability, and the senior engineering judgment that decides which problems deserve which model in the first place. That layer is what we build for clients, and it's what most product teams are still under-staffed on. If your roadmap has "add AI features" on it this year and you're trying to figure out whether to wait six months for prices to halve again, the answer is: don't wait, but build for substitution. The teams that ship the most resilient AI-native products in 2026 will be the ones who treated the model layer as a moving part from day one.