
Apex is an autonomous, AI-powered penetration testing agent designed to operate in black box mode against live applications. It needs no access to source code, hints, or predefined attack paths, and it discovers, chains, and verifies real-world vulnerabilities at the speed modern software development requires.
The impetus for Apex is a structural break in the way software security is practiced. AI coding agents now generate and merge code at machine scale: Stripe’s coding agents alone merge 1,300 pull requests per week, while some engineering teams perform no human code review at all and spend more than $1,000 per engineer per day on AI tokens.
Traditional scanners and human-led assessments cannot keep up with this pace. Apex was built as an adversarial validation layer: a separate agent that attacks running applications exactly as a real attacker would, catching vulnerabilities before they are exploited.
Apex operates in three deployment modes. In a CI pipeline, it validates every deployment against a sandboxed replica of the application, mapping the attack surface and attempting exploits before the code is merged.
Against production environments, it continuously surfaces exploitable weaknesses in real time, replacing quarterly PDF engagements with a feedback loop that operates at the speed of modern threats. It also supports on-demand testing against any target.
To validate its capabilities, PensarAI built Argus. Argus is an open-source benchmark of 60 self-contained, Dockerized vulnerable web applications built specifically for evaluating offensive security agents.
Existing benchmarks were considered insufficient. The most widely used suite, XBOW’s 104-challenge set, is 70% PHP, targets a single vulnerability per challenge, and lacks GraphQL, JWT algorithm confusion, race conditions, prototype pollution chains, WAF bypass, and multi-tenant isolation scenarios.
Argus instead reflects the stacks that dominate production environments: Node.js/Express (40%), Python/Flask/Django (20%), and multi-service architectures (25%), alongside Go, Java/Spring Boot, and PHP.
The benchmark introduces categories that other benchmarks do not cover, including WAF and IDS evasion, multi-step exploit chains requiring up to seven chained vulnerabilities, multi-tenant isolation failures, race conditions and business logic flaws, modern authentication bypasses (JWT, OAuth, SAML, MFA), and cloud/Kubernetes infrastructure attacks. Difficulty is spread across 2 easy, 27 medium, and 31 hard tasks.
271 vulnerabilities across 60 applications
Apex was run against all 60 Argus challenges in full black box mode using Claude Haiku 4.5, the smallest and cheapest model available, to isolate the benefits of its architecture from raw model capability.
Apex achieved a 35% pass rate, outperforming PentestGPT (30%) and Raptor (27%). On the 10 most difficult challenges, run with Claude Opus 4.6, the gap widened: Apex solved 80%, PentestGPT 70%, and Raptor 60%.
Across its run, Apex discovered 271 unique vulnerabilities spanning SQL injection, SSRF, NoSQL injection, prototype pollution, SSTI, XXE, race conditions, IDOR, authentication bypass, CORS misconfiguration, command injection, and path traversal. The average cost per challenge was about $8, and the full 60-challenge run on Haiku cost less than $500.
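As a rough sanity check, the reported figures can be cross-checked with a few lines of Python. This is only a sketch under two assumptions the source does not state: that the pass rates are rounded fractions of the 60 challenges, and that "about $8" can be treated as exactly $8. The variable and function names are illustrative.

```python
# Figures quoted in the article; everything derived from them is an estimate.
TOTAL_CHALLENGES = 60
AVG_COST_USD = 8  # "about $8" per challenge, treated as exact (assumption)
PASS_RATE_PERCENT = {"Apex": 35, "PentestGPT": 30, "Raptor": 27}
DIFFICULTY_TIERS = {"easy": 2, "medium": 27, "hard": 31}


def implied_solves(percent, total=TOTAL_CHALLENGES):
    """Nearest whole number of solved challenges for a reported pass rate."""
    return round(percent * total / 100)


# Implied number of solved challenges per agent (assumes rounded percentages).
solves = {agent: implied_solves(pct) for agent, pct in PASS_RATE_PERCENT.items()}

# Estimated cost of the full run, to compare with the "less than $500" claim.
estimated_total_cost = TOTAL_CHALLENGES * AVG_COST_USD

# The difficulty tiers should account for every challenge.
tier_total = sum(DIFFICULTY_TIERS.values())

print(solves)
print(estimated_total_cost)
print(tier_total)
```

Under those assumptions the pass rates correspond to roughly 21, 18, and 16 solved challenges, and the estimated total of $480 is consistent with the reported figure of under $500.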
Notable solves included a seven-step race condition double spend against a fintech transfer endpoint, a multi-tenant SSRF chain that pivoted through a shared cache to extract API keys from adjacent tenants, and a SpEL injection escalated to RCE on a Java Spring Boot application, all within 15 minutes.
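For readers unfamiliar with the double-spend class of bug, the following is a minimal sketch of a check-then-act race. It is not the Argus challenge or Apex's exploit; the `Account` class, the function names, and the in-memory balance are hypothetical stand-ins for a transfer endpoint and its data store. A barrier forces both requests past the balance check before either debit, so the race is reproduced deterministically.

```python
import threading


class Account:
    """Toy in-memory account (hypothetical stand-in for an endpoint's data store)."""

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()


def vulnerable_transfer(account, amount, results, both_checked):
    # The check and the debit are separate steps, so two requests can both pass the check.
    if account.balance >= amount:
        both_checked.wait()  # demo aid: hold until both requests have passed the check
        with account.lock:   # the debit is atomic, but the check-then-debit sequence is not
            account.balance -= amount
        results.append("ok")
    else:
        results.append("rejected")


def safe_transfer(account, amount, results):
    # Check and debit happen under one lock, so the second request sees the new balance.
    with account.lock:
        if account.balance >= amount:
            account.balance -= amount
            results.append("ok")
        else:
            results.append("rejected")


def run_concurrently(target, args_list):
    """Start one thread per argument tuple and wait for all of them to finish."""
    threads = [threading.Thread(target=target, args=args) for args in args_list]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def demo():
    """Send two concurrent 100-unit transfers against a balance of 100, both ways."""
    barrier = threading.Barrier(2)
    vuln_account, vuln_results = Account(100), []
    run_concurrently(vulnerable_transfer, [(vuln_account, 100, vuln_results, barrier)] * 2)

    safe_account, safe_results = Account(100), []
    run_concurrently(safe_transfer, [(safe_account, 100, safe_results)] * 2)

    return vuln_account.balance, sorted(vuln_results), safe_account.balance, sorted(safe_results)


if __name__ == "__main__":
    print(demo())
```

In the vulnerable path both transfers succeed and the balance goes negative, which is the double spend. In the safe path the second request is rejected because the check and the debit are performed as one atomic unit.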
The failure modes documented for Apex are informative. Last-mile execution, meaning completing the final credential extraction step after the SSRF chain has succeeded, emerged as a key gap. A decoy flag misled the agent twice, and a complex multi-step chain involving CI/CD pipeline poisoning and Kubernetes compromise exceeded the 30-minute budget.
Both Apex and the Argus benchmark are currently available as open source on GitHub.
