Snyk Embedded Claude and Shipped Evo: AI-Generated Code Just Got Its Own Security Stack

The release, in one paragraph

On May 7, 2026, Snyk announced it had embedded Anthropic's Claude across the Snyk AI Security Platform, and at the same time brought Evo by Snyk — the first agentic security orchestration system — to general availability. Evo runs a fleet of specialist security agents: Evo Agent Red Teaming continuously probes running AI applications and agents with prompt injection, data exfiltration, and multi-step adversarial conversations; Evo Agent Supply Chain scans the components an agent loads (MCP servers, third-party tools, model artifacts, datasets) for malicious or hidden capabilities; and Evo Runtime Policy enforces guardrails on tool calls before the call lands. The pitch is that Claude's reasoning is now powering both ends of the loop — sharper vulnerability discovery on the inbound side, faster developer-ready fixes on the outbound side. The integration is available to joint customers today, with expanded access rolling out through 2026.

The headline framing is "AI-native AppSec." The substance is one tier deeper, and it's the part procurement teams should be reading carefully: the security category for AI-generated code, AI-augmented applications, and autonomous agents has officially split off from classic AppSec, and the tooling stack that covers it is no longer optional infrastructure.

Why this lands differently than a normal scanner upgrade

Application security in 2024 was a problem with a known shape: human engineers wrote code, SAST tools scanned it, dependency scanners watched the supply chain, and a SOC team triaged what came out the other end. The shape was static enough that the category had stabilized.

That shape broke in three places this year, and the Snyk announcement is downstream of all three:

Most production code is no longer hand-written. Snyk's own framing — 65–70% of production code is AI-generated — is consistent with what every engineering survey has shown for two quarters. The code is being written by models. The reviewers, when they exist at all, are reviewing tens of thousands of lines a week. Subtle vulnerability patterns that a careful human reviewer would catch (a JWT verification that accidentally accepts none, an SQL builder that quietly concatenates a user-controlled field, a CORS policy that widened during a refactor) slip through at much higher rates than they did when the same engineers were writing the code themselves.

Agents now ship code without a pull request in between. A workspace agent that closes a Jira ticket by editing six files and merging the change is making a security-relevant decision the same way a human engineer would. Except there's no PR review, no engineer staring at the diff, no opportunity for the kind of "wait, why are we passing the raw request body to that template" gut-check that catches half the real-world vulnerabilities. The agent is operating, and the AppSec layer either keeps up with the agent or accepts the regressions.

Agents themselves are an attack surface now. An agent that takes natural-language input, calls tools, reads files, and writes back outputs is a system that can be prompt-injected through any of those channels. A README in a third-party SDK can contain malicious instructions that change the agent's behavior. A web page the agent fetches as part of a research step can subvert the agent's reasoning. A document a user uploads can hide a directive that gets the agent to call a tool the user shouldn't have access to. None of these attacks existed in 2023, all of them are documented in production environments now, and none of them are caught by a SAST scanner pointed at a code repository.

What Evo's red-teaming agents actually do (and what they don't)

Evo Agent Red Teaming is the part of the launch worth focusing on, because it's the layer most teams haven't built and won't build themselves at any reasonable budget.

The shape of the work is straightforward to describe and brutal to execute: stand up an adversarial agent whose job is to attack the agent you're shipping. Feed it the same tool surface (or a sandboxed mirror), the same input channels, the same documents and URLs. Let it try every prompt-injection variant in the literature, plus the variants it can generate from the model behind it. Watch which attacks succeed and how. Score the agent's behavior on every attack — did it execute, did it refuse, did it escalate to a human, did it leak data, did it call a tool it shouldn't have.

Three things follow that haven't sunk in yet for most buyers:

Red-teaming is a continuous process, not a quarterly engagement. Every model swap changes the agent's susceptibility profile. Every new tool added to the agent's surface adds a new attack path. Every MCP server the agent loads is a new component in the supply chain. The right cadence for adversarial probing isn't "before launch" — it's "every commit, every model change, every dependency update," and that volume only works if the red-teaming itself is automated and agentic.

The rubric matters more than the tool. Evo can probe an agent for prompt injection; it cannot tell you whether the agent's behavior under prompt injection is acceptable for your business. A clinical-decision-support agent that refuses a borderline injection but produces a verbose error message may be leaking diagnostic information through the verbosity. A customer-support agent that politely says "I cannot share that" may have already revealed account state through tool calls earlier in the trace. Whose job is it to grade those traces? The answer is a senior domain reviewer whose grading you'd defend to an auditor, and Snyk's tool doesn't ship one — it ships the infrastructure to run the probes and capture the traces.

Supply-chain scanning of MCP servers and skills is a new category. Every agent now loads a stack of third-party components — MCP servers it calls, skills it imports, tools it invokes — and each of those components is a place where a hostile maintainer can hide a malicious capability. "I'll add a tool to your agent's surface that exfiltrates secrets the first time the agent runs in CI" is exactly the kind of attack the JavaScript ecosystem has been dealing with for a decade in package.json. The MCP and skills ecosystem just inherited the problem, and Evo Agent Supply Chain is one of the first scanners pointed at it.

What it changes for development teams shipping AI features

Two structural shifts most engineering leaders haven't quite priced in yet.

AppSec posture is now part of the agent design, not bolted on after. A team that designs an agent's tool surface without a security review is no longer trading "we'll get to it after launch" for some short-term velocity — it's shipping a system that an adversarial agent will probe within a quarter and find issues in. The right shape of the conversation is: when the agent's tool surface is being designed, the AppSec engineer is in the room, and the tools are scoped, sandboxed, and audit-logged as part of the design, not after. That's a workflow change, not a tooling change.

Findings have to be triaged by humans who understand the workload, not by whoever's on rotation. A Claude-powered scanner that ships ten "potentially exploitable" findings per PR is producing a triage queue that drowns the team that has to read it. The findings that matter — the ones that would actually become incidents — are the ones graded against the workflow's threat model by someone who understands both. Without that human in the loop, the scanner becomes another dashboard nobody reads.

The eval suite and the security suite converge. A trajectory eval that grades whether the agent reasoned correctly under normal load is the same shape as a security eval that grades whether the agent reasoned correctly under adversarial load. Most teams are building these two suites separately. The teams that win this category build them once, with shared infrastructure and shared graders, and route traces into both pipelines.

Where we'd push back on the launch narrative

"AI-native AppSec" is not a feature category, it's an entire team's worth of work. Snyk shipping the infrastructure does not eliminate the need for senior security engineers, threat modelers, and domain reviewers. It moves the bottleneck from "we don't have the tooling" to "we don't have the senior practitioners to set the policy the tooling enforces." The procurement question changed; it didn't disappear.

Red-teaming results need a written disposition process, not just a dashboard. Evo can produce a 300-page report on an agent's susceptibility to a hundred different prompt-injection variants. Without a documented process for triaging findings, assigning owners, scoring severity against the agent's actual threat model, and tracking remediation, that report is shelfware. Decide who in the org owns the disposition workflow before you flip Evo on against production agents, not after.

One vendor doesn't cover the posture. Evo + Claude integrated into Snyk is one layer; capability scoping at the runtime (Cursor's Security Review, Anthropic's per-agent permissioning), sandboxed tool execution at the platform (Coder Agents' VPC isolation, Bedrock's IAM-scoped tool surfaces), and traditional SAST/DAST on the human-written code paths are all separate layers. The teams that adopt Evo and decommission the rest of their AppSec stack on the strength of the launch are making a single-vendor bet that hasn't been stress-tested in production yet.

What we'd build differently this week

Inventory the agents and AI-augmented workflows already in production. Not "which models do we use" — which agents have tool access, what tools, against what data, and what would happen if any of those tools were called with adversarial input. Most orgs don't have this inventory, and you can't secure what you can't see.
Pilot Evo Agent Red Teaming on one high-traffic agent for a week. Capture the findings, triage them against the agent's actual threat model, and measure the time it takes a human to disposition each finding. The data informs whether the rollout scales or needs better filters first.
Stand up the MCP/skills supply-chain inventory. Every MCP server, every skill, every third-party tool loaded into any production agent: catalogued, owner-assigned, pinned to a specific version, and scanned. If you can't list the MCP servers your agents are loading today, that's the first artifact to build.
Wire trajectory traces into the security pipeline, not just the eval pipeline. Same infrastructure: capture every agent run, route it to both the quality-grading and the security-grading workflows. The teams running both pipelines off shared infrastructure get twice the leverage on the same investment.
Hire (or contract) the senior AppSec reviewer who can grade red-team traces against your domain. The scanner is buy; the disposition is build, and the build needs a human whose judgment your auditors will accept.

Sonnet Code's take

The Snyk + Claude + Evo release is the moment "AI security" stopped being a vendor pitch deck and started being a procurement line item with a budget owner. The teams that win this cycle are the ones who treat agent red-teaming as continuous (not quarterly), who scope agent tool surfaces at design time (not after a finding), and who staff the senior domain reviewers who grade adversarial traces against the workflow's actual threat model. We staff that work directly: AI development at Sonnet Code is the engineering that designs the agent tool surfaces with capability scoping and sandboxed execution from the start, wires Evo (or a comparable runtime) into the deployment pipeline, and builds the trajectory-trace plumbing that feeds both eval and security suites off the same infrastructure. We pair it with AI training engagements where senior security engineers, threat modelers, and domain reviewers author the red-team prompt libraries, grade adversarial traces, and build the rubric your AppSec posture will be defended against. If your team read this week's news and is now wondering whether to flip Evo on, the next conversation isn't about the scanner. It's about the agent inventory you don't have yet and the senior reviewer who'd grade what the scanner produces.