New benchmark from OpenAI and Paradigm demonstrates rapid advances in AI’s ability to identify and exploit DeFi flaws, outpacing detection and patching.
What to know:
GPT-5.3-Codex exploited 72.2% of 120 high-severity vulnerabilities in a new EVMbench test, more than doubling the 31.9% rate of its predecessor GPT-5 in just six months.
While exploit capabilities surge, AI agents lag in detection (recovering only a fraction of known flaws) and patching (fixes that often fail to preserve contract behavior), an imbalance that favors offensive use.
The findings amplify warnings from firms like Cecuro, where specialized defensive AI detected 92% of exploited DeFi contracts, amid fears that AI is supercharging crypto hacks at low cost.
OpenAI, in partnership with crypto VC firm Paradigm, has unveiled EVMbench, a groundbreaking benchmark evaluating AI agents’ prowess in handling smart contract vulnerabilities—spanning detection, patching, and exploitation. Released on Wednesday, the tool assesses agents across 120 curated high-severity flaws drawn from 40 real-world audits, primarily from open code competitions and Tempo blockchain security reviews. These vulnerabilities collectively safeguard over $100 billion in on-chain assets, underscoring the high stakes for DeFi ecosystems.
In exploit mode, agents simulate end-to-end attacks in a sandboxed environment, draining funds via transaction replays and on-chain verification. OpenAI’s latest model, GPT-5.3-Codex, scored 72.2% success, a stark leap from GPT-5’s 31.9% just half a year earlier: exploit success more than doubled in six months. The acceleration aligns with broader trends, as AI lowers the barrier to large-scale scanning, with average exploit attempts costing as little as $1.22 per contract.
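Taking the two reported scores at face value, the implied growth rate is easy to check. The sketch below is a back-of-envelope calculation only; it assumes (simplistically) smooth exponential growth between the two measurements, which cannot hold for long since success rates cap at 100%:

```python
import math

# Reported EVMbench exploit success rates, six months apart.
gpt5_score = 0.319    # GPT-5
codex_score = 0.722   # GPT-5.3-Codex
months_elapsed = 6.0

# Growth factor over the interval and the implied doubling time,
# under a naive exponential-growth assumption.
growth = codex_score / gpt5_score
doubling_time = months_elapsed * math.log(2) / math.log(growth)

print(f"growth factor: {growth:.2f}x")                      # ~2.26x
print(f"implied doubling time: {doubling_time:.1f} months") # ~5.1 months
```

On these numbers the doubling time comes out to roughly five months, a useful sanity check against headline growth claims.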
However, performance in detection and patching modes reveals gaps: agents often surface only a subset of the seeded vulnerabilities and struggle to fix them without breaking functionality. Detection is scored by recall against ground-truth issues, while patching requires preserving contract behavior under automated tests. Experts note that clear objectives like “drain funds” favor exploits, whereas nuanced tasks like auditing demand domain-specific heuristics—echoing findings from Cecuro’s recent study, where a specialized AI outperformed general models by detecting 92% of 90 exploited DeFi contracts versus 34% for a GPT-5.1 baseline.
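Recall-based detection scoring of this kind boils down to comparing an agent’s findings against a labeled ground-truth set. The helper below is illustrative only, with hypothetical names, not EVMbench’s actual scoring code:

```python
def detection_recall(reported: set[str], ground_truth: set[str]) -> float:
    """Fraction of known vulnerabilities the agent actually flagged.

    `reported` and `ground_truth` hold vulnerability identifiers
    (e.g. audit finding IDs). Illustrative sketch, not the real
    EVMbench harness.
    """
    if not ground_truth:
        return 1.0  # nothing to find
    return len(reported & ground_truth) / len(ground_truth)

# An agent that flags 2 of 3 seeded flaws scores ~0.67 recall.
recall = detection_recall({"reentrancy", "oracle-drift"},
                          {"reentrancy", "oracle-drift", "overflow"})
print(f"{recall:.2f}")  # 0.67
```

Note that recall alone says nothing about false positives, which is one reason “partial recall” understates how far detection agents trail exploit agents in practice.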
The benchmark arrives as AI’s dual-use nature intensifies crypto security debates. Chainalysis’ 2026 Crypto Crime Report highlights AI-enabled scams yielding 4.5 times more revenue than traditional ones, with $17 billion stolen in 2025 alone. North Korean hackers and others are leveraging AI for automated exploits, while DeFi losses from hacks exceeded $2.3 billion in a single year. Forbes Council experts urge DeFi teams to shift from passive audits to active defenses, embedding AI in CI/CD pipelines for continuous monitoring, anomaly detection, and circuit breakers that halt exploits in real time.
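The circuit-breaker pattern mentioned above can be sketched in a few lines: a monitor that trips once outflows within a sliding window exceed a threshold. Class names, thresholds, and the off-chain setting are all hypothetical, purely to illustrate the pattern; a real deployment would pause the contract on-chain (e.g. via a guardian role):

```python
from collections import deque


class CircuitBreaker:
    """Trips when recent outflows exceed a threshold within a time window.

    Hypothetical monitoring sketch -- real systems would enforce the
    halt on-chain, not in an off-chain Python process.
    """

    def __init__(self, max_outflow: float, window_seconds: float):
        self.max_outflow = max_outflow
        self.window = window_seconds
        self.events: deque[tuple[float, float]] = deque()  # (timestamp, amount)
        self.tripped = False

    def record_outflow(self, amount: float, now: float) -> bool:
        """Record a withdrawal; return True if the breaker has tripped."""
        self.events.append((now, amount))
        # Drop events that fell out of the sliding window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        if sum(a for _, a in self.events) > self.max_outflow:
            self.tripped = True  # downstream logic would pause the contract
        return self.tripped


# Trip if more than 1,000 tokens leave within 60 seconds.
breaker = CircuitBreaker(max_outflow=1_000, window_seconds=60)
print(breaker.record_outflow(400, now=0))   # False: 400 <= 1,000
print(breaker.record_outflow(700, now=10))  # True: 1,100 > 1,000
```

The sliding window is the key design choice: it distinguishes a sudden drain, which an automated exploit produces, from the same volume spread over hours of normal activity.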
EVMbench has been open-sourced on GitHub to track evolving risks and promote AI-assisted auditing. Yet, as offensive capabilities outpace defensive adoption, industry voices like those on X warn of an “arms race” in Web3 security, with agents potentially manipulating markets through MEV extraction or front-running. OpenAI emphasizes responsible use, advocating for benchmarks like this to bolster defenses before vulnerabilities turn into billion-dollar breaches.