Sonnet Code
AI & Machine Learning · May 21, 2026 · 9 min read

Meta Just Open-Sourced a Coding Model That Matches Opus 4.7 — Self-Hosting Is a Procurement Line Now, Not a Hobby

The release, in one paragraph

On May 20, 2026, Meta released Llama 4.5, a frontier-tier general-purpose model with native multimodal capability, and a specialized fine-tune called Llama Code 4.5 aimed squarely at production coding workloads. The headline benchmark numbers landed exactly where the market was watching: 79.2% on SWE-Bench Multilingual (versus Claude Opus 4.7 at 80.5%, Cursor's Composer 2.5 at 79.8%, and GPT-5.5 at 77.8%), 69.0% on Terminal-Bench 2.0, and 64.0% on CursorBench v3.1. On SWE-Bench Multilingual, the one benchmark quoted here with rival scores, that is 1.3 points behind Opus 4.7, 0.6 points behind Composer 2.5, and 1.4 points ahead of GPT-5.5. The weights and tokenizer are on Hugging Face under an Apache 2.0-style commercial license (no MAU clause, no compete-with-Meta restriction, fine-tuning rights explicit). Hosted inference is already live on Together AI, Groq, and Fireworks at roughly $0.30 per million input tokens and $1.20 per million output tokens, under half the price of Composer 2.5 and roughly a fifteenth of the Opus tier. Meta also shipped a reference agent harness, Code Composer, with MCP client support baked in by default.

The surprising line in the announcement isn't the benchmark numbers. Open-weight coding models have been creeping toward the frontier for eighteen months, Kimi K2.5 and DeepSeek-V4 among them, and Llama Code 4.5 finally crosses the line where the benchmarks are at parity within margin of error. The surprising line is the license. Apache 2.0-style commercial use, no revenue triggers, no MAU ceiling, explicit fine-tuning rights, and a model card that documents the training data sources at a level of detail no frontier lab has matched yet. For the first time, self-hosting a frontier-grade coding agent inside a regulated customer's VPC isn't a multi-quarter risk-tolerance conversation. It's a procurement decision the architecture review can sign off on this week.

Why open weights at the frontier is the architecture decision, not the price line

For three years, the production-coding-AI conversation has assumed a one-way curve: closed frontier models at the top, open-weight models a tier behind, with the gap closing slowly and the open tier serving cost-sensitive workloads where the quality compromise was tolerable. The Llama Code 4.5 release breaks that framing — and the consequences run far past the headline pricing line.

Self-hosting is now economic at frontier quality. A team running a high-volume coding workload — every PR, every CI run, every internal code-review pass — pays a recurring per-token bill that scales linearly with engineering velocity. At the open-weight tier, that same workload runs on the customer's own GPUs (or a partner's, in a single-tenant VPC), and the marginal cost is electricity and amortized hardware rather than per-token pricing. For workloads above a critical-mass threshold — typically a few hundred million tokens per month, the kind of bill an engineering org with a hundred engineers and an aggressive AI-assist policy already carries — the self-hosted economics are now genuinely better than the hosted frontier API. That was not true a year ago.
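
The threshold argument above reduces to a break-even calculation. A minimal sketch follows, assuming a simple model in which self-hosting has a fixed monthly cost (amortized hardware, SRE time) plus a small marginal cost per token. The function name and every dollar figure are placeholders for illustration, not measurements from any deployment.

```python
def break_even_tokens_per_month(
    fixed_monthly_cost_usd: float,
    hosted_price_per_million_usd: float,
    self_hosted_marginal_per_million_usd: float,
) -> float:
    """Monthly token volume at which self-hosting and the hosted API cost the same.

    Hosted cost:      volume_in_millions * hosted_price
    Self-hosted cost: fixed_cost + volume_in_millions * marginal_price
    """
    saving_per_million = (
        hosted_price_per_million_usd - self_hosted_marginal_per_million_usd
    )
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_cost_usd / saving_per_million * 1_000_000


# Placeholder inputs (assumptions, not quotes from any vendor or invoice).
threshold = break_even_tokens_per_month(
    fixed_monthly_cost_usd=4_000.0,
    hosted_price_per_million_usd=10.0,
    self_hosted_marginal_per_million_usd=2.0,
)
print(f"break-even at about {threshold / 1e6:.0f}M tokens per month")
```

The result is only as good as the inputs. The fixed cost in particular should include the staffing needed to keep an inference fleet healthy, which is easy to undercount.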

Data residency stops being a blocker, and starts being a feature. Every regulated industry — health, finance, defense, public sector, certain regions of EU enterprise — has spent two years writing AI-procurement clauses that allow the productivity narrative while protecting the data boundary. The closed-frontier vendor's answer was a private deployment tier, available at six- and seven-figure annual commitments, with a deployment story that included "trust us about the boundary." The open-weights answer is "the inference runs in your VPC, on your hardware, with your egress rules, and the weights came down from a URL that anyone with the same architecture review can verify against the published hash." That is a categorically different procurement story.
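
Verifying downloaded weights against a published hash is a small, mechanical step. A minimal sketch using only the Python standard library follows. It assumes the publisher provides a SHA-256 digest per weights file. The function names are hypothetical, and the actual digest format of any given release should be checked against its documentation.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte weight shards fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: Path, published_sha256: str) -> bool:
    """Return True when the local file matches the published digest.

    The published value is normalized for case and surrounding whitespace,
    since digests are often pasted from a web page or a checksum file.
    """
    return sha256_of_file(path) == published_sha256.strip().lower()
```

In a deployment pipeline this check would typically run once per shard after download and again before the inference server loads the file, with a failed comparison blocking the rollout.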

Model-card transparency is now table stakes. Llama Code 4.5's model card documents training data sources, eval methodology, known failure modes, and license restrictions at a level of detail Meta's frontier-lab competitors will be pressured to match over the next two release cycles. The procurement-side line item that used to read "vendor warrants the model is fit for purpose" now reads "we can read what the model was trained on and what it scores on the evals we chose." That's not a minor disclosure — it's the start of a different procurement contract.

The open base model wave is no longer a coincidence; it's the new market structure. Kimi K2.5 powered Cursor's Composer 2.5. DeepSeek-V4 is a credible frontier-tier base model on agentic coding tasks. Llama 4.5 / Llama Code 4.5 lands this week. Qwen Coder 3 is rumored for July. The shape of production AI in 2027 isn't "frontier closed lab vs. open community alternative" — it's a half-dozen production-grade base models, several of them open-weight, each one a fine-tune away from a workload-specific specialization. Plan the architecture for that landscape, not for a single-vendor stack.

What the Apache 2.0-style license actually changes

The most under-reported line in the release is the license. Meta has been gradually liberalizing Llama's terms across releases: Llama 2 had the 700M-MAU clause, Llama 3 dropped the compete restrictions, and Llama 4.5 lands under an Apache 2.0-style commercial license that is close to, but not identical to, Apache 2.0. The operational consequences are larger than the marketing copy admits.

Commercial use without a revenue trigger. Llama 2's license required any deployer with more than 700 million monthly active users to request a separate license from Meta, a clause designed to keep the biggest hyperscalers from shipping a Meta-trained model under their own brand. The clause was irrelevant for 99% of buyers and a procurement headache for the rest. Llama 4.5's terms drop it entirely. A startup, a Fortune 500 product team, or a hyperscaler can ship Llama Code 4.5 in production without a license negotiation. That is a meaningful expansion of the addressable deployment surface.

Fine-tuning rights are explicit, and the derivative-model terms are clean. A team that fine-tunes Llama Code 4.5 on its own proprietary codebase to ship a workload-specific specialization owns the resulting weights for commercial purposes without a revenue share to Meta. That is the line item that turns the open-weight tier from "a base model you call through an API" into "a base model you specialize and ship as a competitive moat." The vertical specialization opportunities open up dramatically.

The training data provenance is documented at a level no frontier lab has matched. Meta's model card lists the corpus sources, the deduplication strategy, the licensing of major code dataset contributions, and the eval methodology. That's not a complete answer to the "was this trained on my code" question — no model card is — but it's a meaningfully better answer than the closed-lab market has shipped to date. For procurement teams whose security review demands a documented training pipeline, this is the first frontier-tier coding model that has one.

What it doesn't change

Hosting infrastructure is still your problem. A model running in your VPC requires GPUs, an inference server (vLLM, TGI, TensorRT-LLM), capacity planning, monitoring, autoscaling, and an SRE rotation. None of that is free. For workloads above the critical-mass threshold the economics still beat the hosted API; for low-volume workloads — a single team's senior engineers using a coding agent a few hours a day — the hosted API is dramatically cheaper than the staffing tax on running your own inference. Route by workload class, not by ideology.

The eval suite is still the bottleneck. Llama Code 4.5 lands within about a point of Opus 4.7 on the published SWE-Bench Multilingual number (79.2% versus 80.5%); that is not the same as producing comparable scores on your codebase, with your style guide, against your senior-engineer-authored rubric. The team that switches the model without running a representative-workload eval first is going to learn the hard way which benchmarks were close to its production distribution and which weren't. Stand up the eval before the switch.
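
A representative-workload eval can start as a paired comparison: the same PRs, graded against the same rubric, for the model in production today and the candidate. A minimal sketch follows, assuming each sample already carries a pass/fail rubric verdict for both models. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedSample:
    pr_id: str
    candidate_passed: bool  # rubric verdict for the model under evaluation
    incumbent_passed: bool  # rubric verdict for the model in production today


@dataclass(frozen=True)
class Summary:
    candidate_pass_rate: float
    incumbent_pass_rate: float
    regressions: int   # incumbent passed, candidate failed
    improvements: int  # candidate passed, incumbent failed


def summarize(samples: list[GradedSample]) -> Summary:
    """Aggregate paired rubric verdicts over the same set of PRs."""
    if not samples:
        raise ValueError("need at least one graded sample")
    total = len(samples)
    return Summary(
        candidate_pass_rate=sum(s.candidate_passed for s in samples) / total,
        incumbent_pass_rate=sum(s.incumbent_passed for s in samples) / total,
        regressions=sum(
            s.incumbent_passed and not s.candidate_passed for s in samples
        ),
        improvements=sum(
            s.candidate_passed and not s.incumbent_passed for s in samples
        ),
    )
```

Two models with identical pass rates can still disagree on which PRs they pass, so the regression list is usually more informative than the headline rate. Reading the regressed PRs by hand is where the production-distribution gaps tend to show up.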

The frontier base model still wins on the hardest cases. Open weights at parity is a real claim on the median PR, the bounded refactor, the well-scoped issue. On the long-horizon agentic task — a debugging session that spans services, an architectural change that touches dozens of files, a security review with ambiguous requirements — the frontier closed labs still hold a meaningful quality edge. The right default in 2026 is open-weight tier for the bulk of coding work, frontier tier for the hard cases, with the routing decision made by an eval, not a vibe.

Open weights are not open training data. You can run Llama Code 4.5; you cannot retrain it from scratch on the same corpus, because Meta has not released the training data. That distinction matters for two procurement conversations: deep auditability ("what exactly was the model trained on") and reproducible retraining ("can we rebuild the model if Meta disappears tomorrow"). The answer to both is "better than closed frontier labs offer, not as good as a true open-source release would offer."

Where we'd push back on the framing

"Matches Opus 4.7 on benchmarks" is true and partial. A reasonable read of Meta's published benchmarks is that Llama Code 4.5 is competitive on a focused set of coding tasks and likely behind on the long-horizon agentic work where the frontier labs still own the leaderboard. Procurement decisions should weigh the benchmarks Meta didn't publish at least as carefully as the ones it did. The right read of the launch isn't "open weights caught up." It's "open weights caught up on the workloads that matter for the median PR, and probably haven't caught up on the workloads that matter for the hardest PR."
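
"$0.30/M is half the price" is per-token, not per-task. At $0.30/M against a frontier model at $5/M, two passes instead of one still leave the token bill far lower; on token price alone, the cheaper model would need roughly 17 times the tokens per resolved task before the bill flips. The way a cheaper model ends up more expensive in total is through everything the per-token price leaves out: retries, longer agent loops, changes that escape to human review, and rework after merge. The numbers to watch in your own deployment are tokens per resolved task, escape rate to review, and rework rate after merge, not the published per-million-token price. Run an A/B before the budget meeting, not after.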

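The per-task framing is easy to put into numbers once tokens and attempts are measured. A minimal sketch follows. The $0.30 and $1.20 prices and the $5 frontier input price come from the figures quoted in this article; the frontier output price, the token counts, and the attempt counts are placeholder assumptions to be replaced with measurements from your own A/B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRun:
    input_price_per_million: float   # USD per million input tokens
    output_price_per_million: float  # USD per million output tokens
    input_tokens_per_attempt: int
    output_tokens_per_attempt: int
    attempts_per_resolved_task: float  # measured average, including retries


def cost_per_resolved_task(run: ModelRun) -> float:
    """Token cost in USD to get one task to a resolved state."""
    per_attempt = (
        run.input_tokens_per_attempt * run.input_price_per_million
        + run.output_tokens_per_attempt * run.output_price_per_million
    ) / 1_000_000
    return per_attempt * run.attempts_per_resolved_task


# Token counts and attempt counts below are placeholders, not measurements.
open_weight = ModelRun(0.30, 1.20, 20_000, 2_000, attempts_per_resolved_task=2.0)
# The 25.0 output price is a hypothetical value; only the 5.0 input price
# appears in the article.
frontier = ModelRun(5.0, 25.0, 20_000, 2_000, attempts_per_resolved_task=1.0)

print(f"open-weight: ${cost_per_resolved_task(open_weight):.4f} per resolved task")
print(f"frontier:    ${cost_per_resolved_task(frontier):.4f} per resolved task")
```

Token cost is only one term. Review time for escaped changes and rework after merge are engineer-hours, and a fuller comparison would add them per resolved task at a loaded hourly rate.
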
"Self-hosted is cheaper" is true above a workload threshold and false below it. A team running a hundred-engineer org with an aggressive AI-assist policy probably crosses the threshold this quarter. A team running a tighter operation with five senior engineers using a coding agent a few hours a day each is going to find that the self-hosted infrastructure cost — GPUs, SRE time, monitoring — exceeds the API bill by a wide margin. The cost-routing argument should be a workload-by-workload calculation, not a portfolio-wide policy.

"Apache 2.0-style" is not "Apache 2.0." Meta's license, while dramatically more permissive than Llama 2's, still includes restrictions on certain use cases (high-risk applications, military, certain disinformation use cases) and an acceptable use policy that the procurement team should read before signing. The Open Source Initiative has not certified the license as a true open-source license. For most commercial buyers this distinction is irrelevant; for buyers whose policy explicitly requires OSI-certified open source, it matters.

What we'd build differently this week

  • Map your highest-volume coding workloads. PR-review agents, CI-time code-quality bots, batch refactor jobs, in-IDE assist. Sum the tokens per month per workload class. The workloads above a few hundred million tokens per month are the ones where the self-hosted Llama Code 4.5 economics are likely favorable; the workloads below are best left on the hosted frontier API. Get the data before the procurement conversation, not during it.
  • Run a head-to-head on a representative sample of your real PRs. Not the headline PRs — the boring, bounded, single-file changes that make up the bulk of your codebase. Grade against your existing rubric. Two weeks of structured data beats a quarter of speculation about which model would have been better.
  • Stand up an eval harness that compares hosted vs self-hosted on the same model. Same prompts, same harness, same rubric, run nightly against both. The day a self-hosted deployment starts drifting from the hosted reference — a stale weights file, an inference server regression, a tokenizer mismatch — the eval will catch it before the production incident does.
  • Open the data-residency conversation in parallel. If your customers' contracts require data to stay in a specific region or VPC, the self-hosted Llama Code 4.5 deployment is the option that unlocks workloads you previously had to refuse. Map which AI-assist workloads are gated by residency rules today; those are the highest-ROI early candidates for self-hosted deployment.
  • Build the routing layer model-agnostic from day one. Llama Code 4.5 today; the next open-weight release in six months; the frontier closed-lab tier for the hardest cases. The team that ties its prompts, harnesses, and evals to one specific model is the team that pays a migration tax every cycle. Build the routing and eval harness so the specific model is a config flag, not a code dependency.
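
The last item, keeping the model a config flag, can be sketched in a few lines. This is one possible shape, assuming workloads are tagged with a class name and each class maps to a tier. Every model name, endpoint URL, and workload class below is hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    model: str     # opaque model identifier, resolved by the serving layer
    endpoint: str  # base URL of the inference endpoint


# Hypothetical config. In practice this would live in a file or a config
# service so that swapping a model is a config change, not a code change.
CONFIG_JSON = """
{
  "default_tier": "bulk",
  "tiers": {
    "bulk": {"model": "open-weight-coder", "endpoint": "http://inference.internal/v1"},
    "hard": {"model": "hosted-frontier", "endpoint": "https://frontier.example.com/v1"}
  },
  "workload_tiers": {
    "pr_review": "bulk",
    "ci_quality_bot": "bulk",
    "batch_refactor": "bulk",
    "cross_service_debug": "hard",
    "security_review": "hard"
  }
}
"""


def load_router(config_text: str) -> Callable[[str], Route]:
    """Build a lookup from workload class to route, driven entirely by config."""
    config = json.loads(config_text)
    tiers = {name: Route(**fields) for name, fields in config["tiers"].items()}
    workload_tiers = config["workload_tiers"]
    default_tier = config["default_tier"]

    def route_for(workload_class: str) -> Route:
        # Unknown workload classes fall back to the default tier.
        return tiers[workload_tiers.get(workload_class, default_tier)]

    return route_for


route_for = load_router(CONFIG_JSON)
print(route_for("pr_review").model)
print(route_for("security_review").model)
```

Prompts, harnesses, and evals then refer to a workload class rather than a model name, so the next open-weight release is a one-line change to the tier table followed by an eval run.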

Sonnet Code's take

The Llama Code 4.5 release is the moment self-hosting a frontier-grade coding model became a defensible procurement decision for regulated and data-residency-bound enterprises, and a credible cost optimization for any team above a critical-mass threshold of AI-assist volume. The right read isn't "open weights won." It's that production AI engineering in 2026 now has a real architectural fork to navigate — hosted frontier for the hardest workloads, open-weight self-hosted for the high-volume ones, with the routing decision graded by an eval suite a senior practitioner signed off on. Teams that take the fork seriously will end up with stacks that scale into 2027 without quarterly cost fire drills; teams that treat the open-weight tier as a procurement shortcut will discover the staffing tax of self-hosted inference and the quality regressions of an un-evaluated model swap at the same review meeting.

We staff that work directly. AI development at Sonnet Code is the engineering that builds the multi-tier routing layer, the self-hosted inference deployment, the per-workload eval suites, the model-card review checklist that turns a vendor's claims into a procurement signoff, and the observability plumbing that catches a self-hosted regression before it reaches production. We pair it with AI training engagements where senior practitioners — staff engineers, security architects, principal reviewers — author the rubrics and golden examples that grade what the new model actually does on your codebase, separate from what the benchmarks say it does on Meta's. If your team is reading the Llama Code 4.5 announcement this week and wondering whether self-hosting changes the procurement calculus for your AI-assist roadmap, the next conversation isn't about which open-weight model to bet on. It's about which workloads cross the self-hosting threshold, who owns the eval that grades them, and the senior practitioner whose rubric defines whether the open-weight tier is shippable on your code.