The US Defense Advanced Research Projects Agency, DARPA, recently kicked off a two-year AI Cyber Challenge (AIxCC), inviting top AI and cybersecurity experts to design new AI systems to help secure the major open source projects that our critical infrastructure relies on. As AI continues to grow, it’s crucial to invest in AI tools for defenders, and this competition will help advance that technology.
Google’s OSS-Fuzz and Security Engineering teams have been excited to assist AIxCC organizers in designing their challenges and competition framework. We also playtested the competition by building a Cyber Reasoning System (CRS) tackling DARPA’s exemplar challenge.
This blog post will share our approach to the exemplar challenge using open source technology found in Google’s OSS-Fuzz, highlighting opportunities where AI can supercharge the platform’s ability to find and patch vulnerabilities, which we hope will inspire innovative solutions from competitors.
Leveraging OSS-Fuzz
AIxCC challenges focus on finding and fixing vulnerabilities in open source projects. OSS-Fuzz, our fuzz testing platform, has been finding vulnerabilities in open source projects as a public service for years, resulting in over 11,000 vulnerabilities found and fixed across 1200+ projects. OSS-Fuzz is free, open source, and its projects and infrastructure are shaped very similarly to AIxCC challenges. Competitors can easily reuse its existing toolchains, fuzzing engines, and sanitizers on AIxCC projects. Our baseline Cyber Reasoning System (CRS) mainly leverages non-AI techniques and has some limitations. We highlight these as opportunities for competitors to explore how AI can advance the state of the art in fuzz testing.
Fuzzing the AIxCC challenges
For userspace Java and C/C++ challenges, fuzzing with engines such as libFuzzer, AFL(++), and Jazzer is straightforward because they use the same interface as OSS-Fuzz.
Fuzzing the kernel is trickier, so we considered two options:
Syzkaller, an unsupervised coverage guided kernel fuzzer
A general purpose coverage guided fuzzer, such as AFL
Syzkaller has been effective at finding Linux kernel vulnerabilities, but it is not a good fit for AIxCC: Syzkaller generates sequences of syscalls to fuzz the whole Linux kernel, whereas the AIxCC kernel challenges (such as the exemplar) come with a userspace harness that exercises specific parts of the kernel.
Instead, we chose to use AFL, which is typically used to fuzz userspace programs. To enable kernel fuzzing, we followed an approach similar to the one described in an older blog post from Cloudflare. We compiled the kernel with KCOV and KASAN instrumentation and ran it virtualized under QEMU. A userspace harness then acts as a fake AFL forkserver and executes each input by issuing the sequence of syscalls to be fuzzed.
After every input execution, the harness reads the KCOV coverage and stores it in AFL’s coverage counters via shared memory to enable coverage-guided fuzzing. The harness also checks the kernel dmesg log after every run to discover whether the input caused a KASAN report.
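The two per-execution steps above can be sketched as follows. This is a minimal illustration in Python rather than the C harness itself: the mixing step, the function names, and the use of a plain bytearray in place of AFL's shared memory segment are all assumptions made for the example. The 64 KiB map size matches AFL's default, and the "BUG: KASAN:" prefix is the one the kernel prints at the start of a KASAN report.

```python
import re

MAP_SIZE = 1 << 16  # AFL's default coverage map size (64 KiB)

# KASAN reports in dmesg start with "BUG: KASAN: <bug-type> in <function>".
KASAN_RE = re.compile(r"BUG: KASAN: ([\w-]+)")


def record_coverage(pcs, bitmap):
    """Fold KCOV program counters into an AFL-style edge bitmap.

    `bitmap` stands in for the shared memory region; a real harness would
    write into the segment that afl-fuzz provides instead.
    """
    prev = 0
    for pc in pcs:
        # Illustrative mixing step that spreads kernel addresses over the map.
        cur = ((pc >> 4) ^ (pc << 8)) & (MAP_SIZE - 1)
        idx = cur ^ prev
        bitmap[idx] = (bitmap[idx] + 1) & 0xFF  # wrap like an 8-bit counter
        prev = cur >> 1


def kasan_bug_type(dmesg_text):
    """Return the KASAN bug type found in a dmesg dump, or None."""
    match = KASAN_RE.search(dmesg_text)
    return match.group(1) if match else None
```

Hashing pairs of consecutive program counters, rather than single addresses, gives AFL edge-like feedback from what KCOV reports as a flat list of visited locations.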
Some changes to Cloudflare’s harness were required in order for this to be pluggable with the provided kernel challenges. We needed to turn the harness into a library/wrapper that could be linked against arbitrary AIxCC kernel harnesses.
AIxCC challenges come with their own main(), which takes a file path. main() opens and reads this file and passes its contents to the harness() function, which takes a buffer and a size representing the input. We hooked in our wrapper by wrapping main() at link time via $CC -Wl,--wrap=main harness.c harness_wrapper.a. With this linker option, references to main resolve to __wrap_main, so the wrapper's entry point runs in place of the challenge's main().
The wrapper starts by setting up KCOV, the AFL forkserver, and shared memory. The wrapper also reads the input from stdin (which is what AFL expects by default) and passes it to the harness() function in the challenge harness.
Because AIxCC's harnesses aren't within our control and may misbehave, we had to be careful with memory or file descriptor (FD) leaks within the challenge harness. Indeed, the provided harness has various FD leaks, so a fuzzing run quickly becomes useless once the FD limit is reached.
To address this, we could either:
Forcibly close FDs created while the harness runs, by comparing the entries in /proc/self/fd before and after each execution, or
Just fork the userspace harness by actually forking in the forkserver.
The first approach worked for us. The latter is likely the most reliable, but may worsen performance.
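The first option can be sketched as below. This is an illustration in Python of the before/after comparison, not the wrapper's actual C code; the function names are hypothetical and the approach is Linux-specific because it relies on /proc/self/fd.

```python
import os


def open_fds():
    """Return the set of file descriptors currently open in this process."""
    fds = set()
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            # The descriptor used to list the directory shows up in the
            # listing but is already closed by now, so fstat() filters it out.
            os.fstat(fd)
        except OSError:
            continue
        fds.add(fd)
    return fds


def run_and_close_leaks(harness, data):
    """Run harness(data), then close every descriptor it left open."""
    before = open_fds()
    try:
        harness(data)
    finally:
        leaked = open_fds() - before
        for fd in leaked:
            os.close(fd)
    return leaked
```

Closing descriptors from outside the harness is only safe when the harness keeps no state across runs that still refers to them, which is one reason forking per execution is the more robust option.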
All of these efforts enabled afl-fuzz to fuzz the Linux exemplar, but the vulnerability cannot be easily found even after hours of fuzzing, unless the fuzzer is provided with seed inputs close to the solution.
Improving fuzzing with AI
This limitation of fuzzing highlights a potential area for competitors to explore AI’s capabilities. The complicated input format, combined with slow execution speeds, makes the exact reproducer hard to discover. Using AI could unlock the ability for fuzzing to find this vulnerability quickly, for example by asking an LLM to generate seed inputs (or a script to generate them) close to the expected input format based on the harness source code. Competitors might find inspiration in some interesting experiments done by Brendan Dolan-Gavitt from NYU, which show promise for this idea.
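As a rough sketch of how LLM-suggested seeds could be fed to the fuzzer, the helpers below build a prompt from the harness source and store hex-encoded candidate inputs in a seed directory. Everything here is an assumption made for illustration: the prompt wording, the hex-per-line exchange format, and the function names are hypothetical, and the call to an actual model is deliberately left out.

```python
import hashlib
from pathlib import Path


def build_seed_prompt(harness_source):
    """Build a prompt asking a model for inputs that match the harness format."""
    return (
        "The following C harness parses a binary input and issues syscalls.\n"
        "Write example inputs that follow the format it expects, "
        "hex-encoded, one per line.\n\n" + harness_source
    )


def write_seeds(hex_lines, corpus_dir):
    """Decode hex-encoded seeds and store each unique one in corpus_dir."""
    corpus = Path(corpus_dir)
    corpus.mkdir(parents=True, exist_ok=True)
    written = []
    for line in hex_lines:
        try:
            seed = bytes.fromhex(line.strip())
        except ValueError:
            continue  # model output is untrusted, so skip malformed lines
        if not seed:
            continue
        path = corpus / hashlib.sha1(seed).hexdigest()
        if not path.exists():  # content-addressed names deduplicate seeds
            path.write_bytes(seed)
            written.append(path)
    return written
```

The resulting directory can then be passed to afl-fuzz as its input corpus with the -i option.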
Another approach: static analysis
One alternative to fuzzing for finding vulnerabilities is static analysis. Static analysis has traditionally struggled with large numbers of false positives, as well as with proving that the issues it reports are reachable and exploitable. LLMs could dramatically improve bug finding by augmenting traditional static analysis techniques with greater accuracy and deeper analysis.
Proof of understanding (PoU)
Once fuzzing finds a reproducer, we can produce key evidence required for the PoU:
The culprit commit, which can be found from git history bisection.
The expected sanitizer, which can be found by running the reproducer to get the crash and parsing the resulting stacktrace.
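Both pieces of evidence can be gathered mechanically. The sketch below assumes a linear commit history ordered oldest to newest, a hypothetical crashes(commit) callback that builds the commit and runs the reproducer, and a crash that persists in every commit after the one that introduced it. The regular expression covers the common "BUG: KASAN: ..." kernel format and the "ERROR: AddressSanitizer: ..." userspace format; it is not exhaustive.

```python
import re

SANITIZER_RE = re.compile(
    r"BUG: (KASAN|KMSAN|KCSAN): ([\w-]+)"   # kernel reports in dmesg
    r"|ERROR: (\w+Sanitizer): ([\w-]+)"     # userspace reports on stderr
)


def parse_sanitizer(log):
    """Return 'Sanitizer: bug-type' from a crash log, or None if absent."""
    match = SANITIZER_RE.search(log)
    if not match:
        return None
    name = match.group(1) or match.group(3)
    bug = match.group(2) or match.group(4)
    return f"{name}: {bug}"


def first_bad_commit(commits, crashes):
    """Binary-search for the first commit on which the reproducer crashes."""
    if not commits or not crashes(commits[-1]):
        return None  # the reproducer does not crash at the tip
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if crashes(commits[mid]):
            hi = mid       # culprit is at mid or earlier
        else:
            lo = mid + 1   # culprit is after mid
    return commits[lo]
```

In practice git bisect performs the same search directly on the repository; the function above only shows the underlying logic.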
Next step: “patching” via delta debugging
Once the culprit commit has been identified, one obvious way to “patch” the vulnerability is to just revert this commit. However, the commit may include legitimate changes that are necessary for functionality tests to pass. To ensure functionality doesn’t break, we could apply delta debugging: we progressively include or exclude different parts of the culprit commit until the vulnerability no longer triggers and all functionality tests still pass.
This is a rather brute force approach to “patching.” There is no comprehension of the code being patched and it will likely not work for more complicated patches that include subtle changes required to fix the vulnerability without breaking functionality.
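The search over parts of the culprit commit can be sketched as below. This is a greedy simplification of delta debugging, not the full ddmin algorithm, and the two callbacks are hypothetical: crash_gone(reverted) would apply the given reverts, rebuild, and run the reproducer, and tests_pass(reverted) would run the functionality tests on the same tree.

```python
def minimize_revert(hunks, crash_gone, tests_pass):
    """Find a small set of hunks to revert that stops the crash.

    Starts from reverting the whole commit, then restores one hunk at a
    time whenever the crash stays fixed without reverting it. Returns the
    hunks to revert, or None if no acceptable "patch" is found.
    """
    reverted = list(hunks)
    if not crash_gone(reverted):
        return None  # even a full revert does not stop the crash
    for hunk in hunks:
        candidate = [h for h in reverted if h != hunk]
        if crash_gone(candidate):
            reverted = candidate  # this hunk is not needed for the fix
    return reverted if tests_pass(reverted) else None


# Toy example: only one hunk introduced the bug, and the tests rely on another.
hunks = ["add_feature", "remove_bounds_check", "update_docs"]
patch = minimize_revert(
    hunks,
    crash_gone=lambda reverted: "remove_bounds_check" in reverted,
    tests_pass=lambda reverted: "add_feature" not in reverted,
)
```

Each candidate costs a full build plus a run of the reproducer, so the number of hunks in the culprit commit largely determines how long the search takes.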
Improving patching with AI
These limitations highlight a second area for competitors to apply AI’s capabilities. One approach might be to use an LLM to suggest patches. A 2024 whitepaper from Google walks through one way to build an LLM-based automated patching pipeline.
Competitors will need to address the following challenges:
Validating the patches by running crashes and tests to ensure the crash was prevented and the functionality was not impacted
Narrowing prompts to include only the functions present in the crashing stack trace, to fit prompt limitations
Building a validation step to filter out invalid patches
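The prompt-narrowing step can be sketched as follows. The frame pattern matches userspace sanitizer frames of the form "#0 0x4f1b2c in parse_header /src/foo.c:42:5"; kernel traces use a different layout and would need their own pattern. The function_sources mapping (function name to source text), the character budget, and the prompt wording are assumptions made for the example.

```python
import re

FRAME_RE = re.compile(r"#\d+\s+0x[0-9a-fA-F]+\s+in\s+([\w:~]+)")


def functions_in_trace(stack_trace):
    """Return function names from the trace, innermost first, deduplicated."""
    names = []
    for name in FRAME_RE.findall(stack_trace):
        if name not in names:
            names.append(name)
    return names


def build_patch_prompt(stack_trace, function_sources, max_chars=8000):
    """Build a prompt holding only functions that appear in the crash trace."""
    parts = ["Fix the bug that causes this crash:\n" + stack_trace]
    used = len(parts[0])
    for name in functions_in_trace(stack_trace):
        source = function_sources.get(name)
        if source is None or used + len(source) > max_chars:
            continue  # unknown function, or it would exceed the budget
        parts.append(source)
        used += len(source)
    return "\n\n".join(parts)
```

Walking the frames innermost first means the functions closest to the crash are the ones kept when the budget runs out.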
Using an LLM agent is likely another promising approach: competitors could combine an LLM’s generation capabilities with the ability to compile the code and iteratively receive test failures or stack traces as debugging feedback.
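Such an iterative loop might look like the sketch below. Both callbacks are hypothetical stand-ins: propose_patch(feedback) would query a model with the latest build or test output, and validate_patch(patch) would apply the patch, rebuild, rerun the reproducer and the functionality tests, and return a success flag together with any failure output.

```python
def repair_loop(propose_patch, validate_patch, max_attempts=5):
    """Ask for patches until one validates or the attempt budget runs out.

    Returns (patch, attempts_used), with patch set to None on failure.
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(feedback)
        ok, feedback = validate_patch(patch)
        if ok:
            return patch, attempt
    return None, max_attempts


# Toy stand-ins: the first proposal fails, the second uses the feedback.
def fake_propose(feedback):
    return "good patch" if feedback else "bad patch"


def fake_validate(patch):
    return patch == "good patch", "test_parse failed"


result = repair_loop(fake_propose, fake_validate)
```

Bounding the number of attempts keeps a patch that never validates from consuming the whole time budget.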