In this series of posts, I’m going to describe how to automatically put bugs in programs, a topic on which we just published a paper at Oakland, one of the top academic security conferences. The system we developed, LAVA, can put millions of bugs into real-world programs. Why would anyone want to do this? Are my coauthors and I sociopaths who just want to watch the world burn? No, but to see why we need such a system requires a little bit of background, which is what I hope to provide in this first post.
I am sure this will come as a shock to most, but programs written by humans have bugs. Finding and fixing them is immensely time-consuming; just how much of a developer’s time is spent debugging is hard to pin down, but estimates range between 40% and 75%. And of course these errors can be not only costly for developers but catastrophic for users: attackers can exploit software bugs to run their own code, install malware, set your computer on fire, etc.
|Weekly World News has known about this problem for years.|
It should come as little surprise, then, that immense effort has been expended on finding ways to locate and fix bugs automatically. On the academic side, techniques such as fuzzing, symbolic execution, model checking, abstract interpretation, and creative combinations thereof have been proposed and refined for the past 25 years. Nor has industry been idle: companies like Coverity, Fortify, Veracode, Klocwork, GrammaTech, and many more will happily sell (or rent) you a product that automatically finds bugs in your program.
Great, so by now we must surely have solved the problem, right? Well, not so fast. We should probably check to see how well these tools and techniques work. Since they’re detectors, the usual way would be to measure the false positive and false negative rates. To measure false positives, we can just run one of these tools on our program, go through the output, and decide whether we think each bug it found is real.
The same strategy does not work for measuring false negatives. If a bug finder reports finding 42 bugs in a program, we have no way of knowing whether that’s 99% or 1% of the total. And this seems like the piece of information we’d most like to have!
|Heartbleed: detectable with static analysis tools, but only after the fact.|
To measure false negatives we need a source of bugs so that we can tell how many of them our bug-finder detects. One strategy might be to look at historical bug databases and see how many of those bugs are detected. Unfortunately, these sorts of corpora are fixed in size – there are only so many bugs out there, and analysis tools will, over time, become capable of detecting most of them. We can see how this dynamic played out with Heartbleed: shortly after the bug was disclosed, Coverity and GrammaTech found ways to improve their software so that it could detect Heartbleed.
Let me be clear – it’s a good thing that vendors can use test cases like these to improve their products! But it’s bad when these test cases are in short supply, leaving users with no good way of evaluating false negatives and bug finders with no clear path to improving their techniques.
This is where LAVA enters the picture. If we can find a way to automatically add realistic bugs to pre-existing programs, we can both measure how well current bug finding tools are doing, and provide an endless stream of examples that bug-finding tools can use to get better.
|LAVA: Large-scale Automated Vulnerability Analysis|
Goals for Automated Bug Corpora
So what do we want out of our bug injection? In our paper, we defined five goals for automated bug injection, requiring that injected bugs
1. Be cheap and plentiful
2. Span the execution lifetime of a program
3. Be embedded in representative control and data flow
4. Come with a triggering input that proves the bug exists
5. Manifest for a very small fraction of possible inputs
The first goal we’ve already discussed – if we want to evaluate tools and enable “hill climbing” by bug finders we will want a lot of bugs. If it’s too expensive to add a bug, or if we can only add a handful per program, then we don’t gain much by doing it automatically – expensive humans can already add small numbers of bugs to programs by hand.
The next two relate to whether our (necessarily artificial) bugs are reasonable proxies for real bugs. This is a tricky and contentious point, which we’ll return to in part three. For now, I’ll note that the two things called out here – occurring throughout the program and being embedded in “normal” control and data flow – are intended to capture the idea that program analyses will need to do essentially the same reasoning about program behavior to find them as they would for any other bugs. In other words, they’re intended to help ensure that getting better at finding LAVA bugs will make tools better at understanding programs generally.
The fourth is important because it allows us to demonstrate, conclusively, that the bugs we inject are real problems. Concretely, with LAVA we can demonstrate an input for each bug we inject that causes the program to crash with a segfault or bus error.
The final property is critical but not immediately obvious. We don’t want the bugs we inject to be too easy to find. In particular, if a bug manifests on most inputs, then it’s trivial to find it – just run the program and wait for the crash. We might even want this to be a tunable parameter, so that we could specify what fraction of the input space of a program causes a crash and dial the difficulty of finding the right input up or down.
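One common way to make a bug manifest rarely is to guard it with a comparison against a "magic" value drawn from attacker-controlled input. The sketch below is a hypothetical illustration of that idea, not LAVA's actual injection code: the unsafe path is reachable only when four input bytes match a constant, so roughly 1 in 2^32 random inputs triggers it.

```c
#include <stdint.h>
#include <string.h>

#define MAGIC 0x6c617661u  /* hypothetical trigger value */

/* Returns the number of bytes the caller will copy into a 64-byte
   buffer. For benign inputs the bound is safe; when the first four
   input bytes equal MAGIC, the returned length is oversized -- the
   injected bug -- and a crash-triggering input is easy to construct. */
static size_t copy_len(const uint8_t *input, size_t n) {
    uint32_t guard = 0;
    if (n >= 4)
        memcpy(&guard, input, 4);   /* attacker-controlled bytes */
    size_t len = n < 64 ? n : 64;   /* normal, safe bound */
    if (guard == MAGIC)
        len += 4096;                /* bug manifests only on trigger */
    return len;
}
```

Widening or narrowing the guard (say, matching two bytes instead of four) is one way the trigger probability could be tuned up or down.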
Ethics of Bug Injection
Could an attacker use a tool like LAVA to plant a backdoor? To do so, the attacker needs two things:

1. A way to get the program to do something bad on some secret input.
2. Not to get caught (i.e., to be stealthy, and for the bugs to be deniable).
Looking at (1), it’s clear that one bug suffices to achieve the goal; there’s no need to add millions of bugs to a program. Indeed, adding millions of bugs harms goal (2) – it would require lots of changes to the program source, which would be very difficult to hide.
|An attempted Linux kernel backdoor from 2003. Can you spot the bugdoor?|