American Fuzzy Lop, or AFL for short, is a powerful coverage-guided fuzzer developed by Michal Zalewski (lcamtuf) at Google. Since its release in 2013, it has racked up an impressive set of trophies in the form of security vulnerabilities in high-profile software. Given its phenomenal success on real world programs, I was curious to explore in detail how it worked on an automatically generated bug.
I started off with the toy program we looked at in the previous post, with a single bug added. The bug added by LAVA will trigger whenever the first four bytes of a float-type file_entry are set to 0x6c6175de or 0xde75616c, and will cause printf to be called with an invalid format string, crashing the program.
After verifying that the bug could be triggered reliably, I compiled it with afl-gcc and started a fuzzing run. To get things started, I used a well-formed input file for the program that contained both int and float file_entry types:
Because I’m lucky enough to have a 24 core server sitting around, I gave it 24 cores (one using -M and the rest using -S) and let it run for about 4 and a half days, fully expecting that it would find the input in that time.
This did not turn out so well.
|Around 20 billion executions later, AFL had found zilch.|
At this point, I turned to Twitter, where John Regehr suggested that I look into what coverage AFL was achieving. I realized that I actually had no idea how AFL’s instrumentation worked, and that this would be a great opportunity to find out.
@moyix you should look at the actual coverage achieved by afl’s test cases and see what is going on
— John Regehr (@johnregehr) July 16, 2016
Diving Into AFL’s Instrumentation
Looking into the source code of afl-as, the program that instruments the assembly code, I noticed a curious bit of code:
|AFL skips labels following p2align directives in the assembly code.|
According to the comment, this should only affect programs compiled under OpenBSD. However, the branch I wanted instrumented was being affected by this even though I was running under Linux, not OpenBSD, and there were no jump tables present in the program.
|The .L18 block should be instrumented by AFL, but won’t be because it’s right after an alignment statement.|
Since I’m not on OpenBSD, I just commented out this if statement. As an alternate workaround, you can also add “-fno-align-labels -fno-align-loops -fno-align-jumps” to the compile command (at the cost of potentially slower binaries). After making the change I restarted, once again confident AFL would soon find my bug.
Alas, it was not to be. Another 17 hours of fuzzing on 24 cores yielded nothing, and so I went back to the drawing board. I am still fairly sure I found a real bug in AFL, but fixing it didn’t help find the bug I was interested in. (Note: it’s possible that if I had waited four days again it would have found my bug. On the other hand, AFL’s cycle counter had turned green, indicating that it thought there was little benefit in continuing to fuzz.)
|5.2 billion executions, no crashes 🙁|
At this point I remembered a post by lcamtuf that described how AFL managed to figure out that an XML file could contain CDATA tags even though its original test cases didn’t contain any examples that used CDATA. He also calls out our bug as exactly the kind of thing AFL is not designed to find:
What seemed perfectly clear, though, is that the algorithm wouldn’t be able to get past “atomic”, large-search-space checks such as:
if (strcmp(header.magic_password, "h4ck3d by p1gZ")) goto terminate_now;
if (header.magic_value == 0x12345678) goto terminate_now;
So how was AFL able to generate a CDATA tag out of thin air? It turns out that libxml2 has a set of macros that expand out some string comparisons into character-by-character comparisons that use simple if statements. This allows AFL to discover valid strings character by character, since each correct character will add new coverage, and cause further fuzzing to be done with that input.
We can also apply this to our test program. Rather than checking for the fixed constant 0x6c6175de, we can compare each byte individually. This should allow AFL to identify the trigger value one byte at a time. The new code looks like this:
|The monolithic if statement has been replaced by 4 individual branches.|
Once we make this change and compile with afl-gcc, AFL finds a crash in just 3 minutes on a single CPU!
|AFL has found the bug!|
This also makes me wonder if it might be worthwhile to implement a compiler pass that breaks down large integer comparisons into byte-sized chunks that AFL can deal with more easily. For string comparisons, one can already substitute in an inline implementation of strcmp/memcmp; an example is available in the AFL source.
A Hidden Coverage Pitfall
|Using the LLVM instrumentation mode, AFL is no longer able to find our bug.|
So this is something to keep in mind when using afl-clang-fast: the instrumentation does not work in quite the same way as the traditional afl-gcc mode, and in some special cases you may need to use AFL_DONT_OPTIMIZE in order to get the coverage instrumentation that you want.
Making AFL Smarter with a Dictionary
Although it’s great that we were able to get AFL to generate the triggering input that reveals the bug by tweaking the program, it would be nice if we could somehow get it to find the bugs in our original programs.
AFL is having trouble with our bugs because they require it to guess a 32-bit input all at once. The search space for this is pretty large: even supposing that it starts systematically flipping bits in the right part of the file, it’s going to take an average of 2 billion executions to find the right value. And of course, unless it has some reason to believe that working on that part of the file will get improved coverage, it won’t be focusing on the right file position, making it even less likely it will find the right input.
However, we can give AFL a leg up by allowing it to pick inputs that aren’t completely random. One of AFL’s features is that it supports using a dictionary of values when fuzzing. This is basically just a set of tokens that it can use when mutating a file instead of picking values at random. So one classic trick is to take all of the constants and strings found in the program binary and add them to the dictionary. Here’s a quick and dirty script that extracts the constants and strings from a binary for use with AFL:
Once we give AFL a dictionary, it finds 94% of our bugs (149/159) within 15 minutes!
Now, does this mean that LAVA’s bugs are too easy to find? At the moment, probably yes. In the real world, the triggering conditions will not always be something you can just extract with objdump and strings. The key improvement needed in LAVA is a wider variety of triggering mechanisms, which is something we’re working on.
- Its ability to find bugs is strongly related to the quality of its coverage instrumentation, and that instrumentation can vary due both to bugs in AFL and inherent differences in the various compile-time passes AFL supports.
- The structure of the code also heavily influences AFL’s behavior: seemingly small differences (making 4 one-byte comparisons vs one 4-byte comparison) can have a huge effect.
- Seeding AFL with even a naïve dictionary can be devastatingly effective.