A benchmark for evaluating the cybersecurity capabilities and risks of language models.
Cybench includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. We add subtasks, which break down a task into intermediary steps for more gradated evaluation, to these tasks.
Leaderboard
- Unguided % Solved: Success rate without subtask guidance.
- Subtask-Guided % Solved: Success rate with subtask guidance.
- Subtasks % Solved: Percentage of subtasks solved per task, macro-averaged across the tasks.
- Most Difficult Task Solved (First Solve Time by Humans): The highest first solve time of successfully solved tasks. First solve time is the time it takes for the first team to solve a given challenge in a CTF competition.
About
Each task includes a task description, starter files, and an evaluator. A task can also have subtasks, each with an associated question and answer which are scored sequentially for incremental progress. The environment (S) consists of the Kali Linux container containing any task-specific local files and any task server(s) instantiated by remote files. The agent can directly interact through bash commands with the local files and/or indirectly interact through network calls with the remote files. The agent provides a response (R), which contains an action (A), which yields an observation (O) that is added to the agent's memory (M). Later, the agent can submit its answer, which the evaluator will compare against the answer key.
Categories
For task selection, we targeted tasks across 6 categories commonly found in CTF competitions.
Identify and exploit misuse or flaws in the implementation of cryptographic primitives and protocols to recover plaintext or keys.
Identify and exploit vulnerabilities in web applications, including but not limited to Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), SQL Injection, and other web-based attack vectors.
Analyze and understand the functionality of a binary executable to uncover hidden details, vulnerabilities, or undocumented features, often leading to exploit development.
Analyze and extract hidden or deleted information from data files, memory dumps, or network traffic to uncover secrets or reconstruct events.
Ethics Statement
Agents for offensive cybersecurity are dual use, both for white hat actors to do penetration testing and improve system security and for black hat actors to mount attacks and do other misdeeds. We have chosen to release our code publicly along with all the details of our runs because our testing did not reveal significant risks, and we believe that releasing code publicly will do more to benefit security than cause harm. Releasing our framework can significantly mitigate risks of new LMs and agents. The framework can be used to track the progress of LMs for penetration testing, and can help other researchers evaluate any risks relating to their work. For a more detailed ethics statement explaining our decision to release our framework, please see Section Ethics Statement in the paper.
Impact
- US AISI and UK AISI has leveraged Cybench as the only open source cybersecurity benchmark on their Joint Pre-Deployment Test on Anthropic's Claude 3.5 Sonnet and Joint Pre-Deployment Test on OpenAI o1.
- The UK AISI has incorporated Cybench in their Inspect Evals framework.
If you rely on Cybench and artifacts, we request that you cite to the underlying paper.