AI Cyber Model Arena: Testing AI Agents in Cybersecurity
We are thrilled to announce the AI Cyber Model Arena, a benchmark suite of 257 real-world challenges spanning five critical offensive domains: zero-day discovery, CVE detection, API security, web security, and cloud security. The initiative reflects how advances in large language model (LLM) cybersecurity capabilities are bringing AI agents into everyday security workflows.
At Wiz Research, we continuously evaluate AI models for their utility in vulnerability research and threat hunting. With this benchmark, we aim to capture the real-world cybersecurity challenges that practitioners face daily.
Our goal is broad coverage of the offensive lifecycle: discovering memory bugs from a cold start, dynamically exploiting web and API targets, and working through multi-step cloud misconfigurations across popular platforms like AWS, Azure, GCP, and Kubernetes. The benchmark is grounded in real exposure to contemporary vulnerabilities.
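To give a flavor of what a single step of a cloud-misconfiguration challenge might look like, here is a minimal, hypothetical sketch: a check for an S3 bucket whose ACL grants access to all users. The function name and structure are illustrative only and are not taken from the Arena's actual challenges.

```python
import boto3

# The canonical grantee URI AWS uses for "everyone" in bucket ACLs.
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"

def bucket_is_public(bucket_name: str) -> bool:
    """Toy check: True if the bucket's ACL grants access to all users.

    A single finding like this would be one link in a multi-step
    cloud-misconfiguration chain of the kind the benchmark describes.
    """
    s3 = boto3.client("s3")
    acl = s3.get_bucket_acl(Bucket=bucket_name)
    return any(
        grant.get("Grantee", {}).get("URI") == ALL_USERS
        for grant in acl["Grants"]
    )
```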
The evaluation separates the effect of the agent scaffold from that of the underlying model by running a multi-agent × multi-model matrix across all five domains. Each category is scored with metrics tailored to it, and every agent-model pair is given repeated trials on each challenge, with the best outcome counted, to keep the assessment realistic.
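A minimal sketch of the best-of-n scoring described above, assuming a hypothetical run_trial callable that reports whether one attempt solved a challenge; these names are illustrative assumptions, not the Arena's actual harness.

```python
from typing import Callable

def best_of_n(run_trial: Callable[[], bool], n: int = 5) -> bool:
    """Score one agent-model pair on one challenge: success if any
    of n independent trials solves it (the best outcome counts)."""
    return any(run_trial() for _ in range(n))

def domain_score(challenges: list[Callable[[], bool]], n: int = 5) -> float:
    """Fraction of a domain's challenges solved under best-of-n."""
    solved = sum(best_of_n(trial, n) for trial in challenges)
    return solved / len(challenges)
```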
To maintain fairness, every challenge runs inside a network-isolated Docker container. Agents get equitable access to the same native system tools and execution environment, with no external modifications, and rigorous validation mechanisms prevent cheating.
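For illustration, here is one plausible way to launch such an isolated container from a harness, using standard Docker CLI flags. This is a sketch under our own assumptions about the setup, not the Arena's actual configuration.

```python
import subprocess

def run_challenge(image: str, cmd: list[str],
                  timeout: int = 600) -> subprocess.CompletedProcess:
    """Run a challenge container with network access removed."""
    docker_cmd = [
        "docker", "run", "--rm",
        "--network", "none",    # no network: the isolation the post describes
        "--cap-drop", "ALL",    # drop Linux capabilities (extra hardening)
        "--pids-limit", "256",  # cap process count inside the container
        image, *cmd,
    ]
    return subprocess.run(docker_cmd, capture_output=True,
                          text=True, timeout=timeout)
```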
A key insight from our findings is that offensive capability is highly contextual: the same model can perform very differently depending on its agent configuration and the domain at hand. We remain committed to evolving the AI Cyber Model Arena with new models, challenges, and tools that push the boundaries of AI in cybersecurity.


