Beyond the Borrow Checker: Differential Fuzzing

01-04-2022 | Using a modern fuzzing technique to validate the high-level logic of a safe Rust library.

Coverage growth chart from openssl_x509, from Google's Fuzz Bench

Image source: Google's FuzzBench project. Note coverage for the OpenSSL target tapers off rather quickly.

The oft-cited "infinite monkey theorem" states that a monkey hitting utterly random keys on a typewriter will, given infinite time, manage to write the complete works of Shakespeare. That's an offensively inefficient way to reproduce classic literature. But what if your goal is to justify confidence in the correctness of a program which parses Shakespearean English? In a cruel twist of fate fitting of the source material: a probabilistic technique is actually one of your best options.

Repeatedly subjecting programs to randomized and malformed inputs, aka "fuzzing" (Manes et al.), is an influential idea in software testing because, plainly, it works. Coverage-guided fuzzing, a modern variation that makes our dutiful monkeys "smarter", is a largely automated analysis capable of discovering critical vulnerabilities. Especially in programs written in memory-unsafe languages. Yet, despite the availability of free production-grade fuzzers and their respective trophy cases of real-world bugs, the technique remains criminally under-utilized in the mainstream.

We'll start with big-picture concepts, transition to sobering truths about peer-reviewed fuzzing research, then walk through a real-world libFuzzer "harness" for a public Rust library (think unit test snippet, but for configuring a fuzzing campaign). Since we're targeting safe Rust code, we'll use differential fuzzing to demonstrate that a 3rd-party container implementation is comparable to the standard library in reliability and correctness. As opposed to fuzzing for memory corruption bugs the borrow checker already protects us from.

My hope is that, armed with background knowledge and a couple of code snippets, you can make informed decisions about fuzzing in the context of software assurance - or maybe even add a harness to your next project. Note libFuzzer readily supports C and C++, where memory safety errors are always in scope, and our high-level methodology is transferable.

Give it to me straight

Before we delve into the nitty-gritty tactics, let's talk strategy - what you need to know as a decision maker. Fuzzing is a dynamic program analysis with the following characteristics:

No false positives: All results are real bugs.
- Even if they're not always exploitable vulnerabilities.
High false negatives: Many, if not most, bugs are missed.
- For reasons more complex than lack of code coverage (e.g. execution state-space not correlated with branches alone).
Unbounded yield saturation: Groce and Reghr explain why fuzzing's Return On Investment (ROI) subsides:
- Expected returns diminish according to a power law distribution given constant compute resources.
- Discovering more of a given bug class may be linearly correlated (Bohme et al.) with compute resource investment.
- Discovering bugs of linearly more diverse bug classes may require exponentially more (Bohme et al.) investment.

All recent advances in fuzzing attempt to address the latter two shortcomings, false negatives and saturation. How they do so varies wildly: mutation scheduling (Woo et al.), input grammars (Blazytko et al.), coverage-guidance (Nagy et al.), concolic execution (Yun et al.), dynamic taint analysis (Gan et al.), performance optimization (Schumilo et al.), etc.

In the broader picture of a wholistic software security initiative, it's worth noting:

Dynamic testing is only half of the story: Static security analyses, namely static taint analysis and certain variants of abstract interpretation, offer a complimentary set of strengths (e.g. ability to reason about all possible executions) at the cost of unique weaknesses (e.g. over-approximation and false positives). You shouldn't place all your eggs in the runtime basket.
- Rust democratizes advanced static analysis: Temporal memory safety (e.g. no heap or concurrency bugs) is guaranteed at compile-time. Spatial memory safety (e.g. no buffer overflow) is run-time enforced, only where necessary. This combo enables provable safety without expensive products (like licensed code analysis tools) or hard-to-scale expert processes (like code review by skilled security engineers). So long as the developer can satisfy ownership constraints.
Fuzzing favors compiled binaries: Fuzzers have traditionally excelled at finding low-level failures in native code (e.g. executed directly by the CPU, like C/C++/Rust). That's the focus of this post. The bulk of relevant advances serve this niche.
- The future of fuzzing is in flux: An average organization likely writes more web applications than web servers, so fuzzing of both managed code (e.g. interpreted or run in a VM, like Python/Java/TypeScript) and web APIs (see Mayhem4API and RESTler) is increasingly relevant in industry.

Now that we have some bearing in the space, let's look at the frontier. Where the optimistic promises of the theoretical meet the crushing pressures of the practical: applied research.

Are state-of-the-art fuzzers just monkeying around?

Yes and no. An honest answer requires nuance. Advances in "gray-box" techniques, coverage-guidance in particular, catalyzed a leap forward in the efficacy of practical fuzzers. Yet, despite dozens of top-tier publications and awarded PhDs, our subsequent progress hasn't proven empirically reliable. A surprising fraction of peer-reviewed algorithms aren't convincingly robust in practice.

Before we bear the bad news, we should cover the good: evolutionary algorithms. Gray-box fuzzers leverage lightweight instrumentation to collect limited information about program internals during a run. After employing various techniques to measure program coverage (e.g. basic block control flow edges traversed), some gray-box fuzzers prioritize mutation of specific inputs in an effort to maximize said coverage. This strategy, dubbed "evolutionary fuzzing", tunes test populations via some notion of "fitness" - borrowing ideas from genetic algorithms to generate more potent test cases. The efficacy can be shocking.

American Fuzzy Lop (AFL) was not the first evolutionary fuzzer. But it's popularity makes it very much an integral part of the security community's zeitgeist, circa 2014 to present. AFL's coverage-hungry algorithms famously demonstrated iteratively generating a valid JPEG from scratch. Inputs generated by a traditional "black-box" (e.g. no knowledge of program internals) fuzzer would have continually failed a check early in header parsing, never exercising the majority of a JPEG library's logic. AFL could infer the relationship between a library's input and its control flow - well enough to pass checks of validity.

To put an abstract spin on it: evolutionary fuzzing can force a small subset of state transitions in the target program's automata. See more states, find more security-relevant edge cases.

Now for the hard truth. AFL still falls victim to the aforementioned false positive and saturation problems. In the grand scheme of things, the JPEG example is a parlor trick for which evolutionary algorithms are particularly well-suited. Fuzzing coverage tapers logarithmically in many real-world programs without meaningful results. The evolution stops, the fuzzer gets stuck. Moreover, there exists peer-reviewed research arguing that we haven't gotten markedly better at fuzzing since AFL's 2014 debut. Two [cherry-picked] findings from that line of inquiry:

Much of fuzzing research is biased and difficult to replicate. In a stern critique of contemporary work, Klees et al. (2018) evaluated 32 fuzzing research papers - some of which used AFL as a baseline and claimed to develop improved fuzzers. The evaluation found serious problems in the experimental design of every surveyed paper. This implies each paper's conclusions are unreliable at best and actively misleading at worst.
No fuzzer, regardless of sophistication, is convincingly superior. Li et al. (2021) developed a framework for metrics-based evaluation of modern fuzzers. Their open-source, head-to-head comparison of 35 state-of-the-art fuzzers found no clear winner. Judging by the Mann-Whitney test for unique bug counts, no fuzzer beat AFL across all real-world programs. Not even contenders leveraging concolic execution, a technique often touted as the holy grail of software testing.

In essence, we're still beholden to the fickle whims of chance, still just bashing keys on that metaphorical typewriter. Fuzzing is an expensive waste of resources past some specific point in time. And nobody yet knows how to reliably determine what that point is, that's an area of study (Tramontana et al.) in itself. Much like any research thrust, fuzzing aims to repeat and quantify phenomena we don't fully understand. It's a pursuit closer to alchemy than chemistry.

Yikes. How can I action this information?

One interpretation of these findings is that we have numerous monkeys each targeting different low-hanging fruit, modulo some overlap. While that's somewhat disheartening, we can double down on what works - evolutionary algorithms - and accept the limitations. At least until the periodic wave of innovation graces us with another crest. You can still get considerable value via one or more of the below strategies:

Low-effort, majority return: Use a mature coverage-guided fuzzer, if not doing so already.
- Power law expected yield.
High-effort, supplemental return: Use multiple fuzzers to increase bug yield, if compute budget allows.
- Still power law yield, but higher bug totals on the Y-axis.
Continuous effort, continuous return: Integrate a fuzzer into CI and continually harness newly added interfaces.
- Likely linear yield? Assuming your team's secure coding skills have plateaued (both bug-rate and harnessing prowess).

So several typewriters are better than one, but be sure you've tried bashing at least one set of keys before you ship a product. New entry point for untrusted input appeared in code review? Ideal time to add a new harness. If a motivated attacker is the first person to fuzz your APIs...well that might reduce the universe of infinite possibilities down to a select few your Public Relations (PR) team won't be happy about.

That's a cop out answer. We can do better!

When a solution to a hard problem can't be reliably automated, we can fall back to augmenting human ability. As a Rust programmer, you've already been half of a human-in-the-loop property verification system. Those lifetime annotations rustc complains about? They're aiding an interprocedural static analysis. By occasionally querying your genius meat-brain for them, the compiler solves an otherwise intractable problem: eliminating memory errors.

If carefully constructed, a harness is to a fuzzer what source annotations are to a static analyzer: a much-needed means by which to bridge the semantic gap. One particularly powerful harnessing technique is "differential fuzzing" - providing the same input to multiple programs and observing differences in their behavior. This doesn't change the fundamental calculus of the technique (no false positives, high false negatives, saturation). But it does provide richer results, enabling detection of higher-level logical errors specific to the goals (Woodruff et al.) of the target program.

The beauty of differential fuzzing is that you don't need to create an abstract specification for correct behavior, at least not entirely from scratch. That creation process can be difficult and error-prone (e.g. model checking, deductive verification). Instead, our specification is provided by the programs under test. Maybe that's because one is a reference implementation written by an authority (e.g. manufacturer of device, inventor of a cryptographic algorithm). Or maybe any divergence between untrusted implementations is an issue worth triaging. Either way, differential fuzzing can test domain-specific properties far exceeding the complexity of what the borrow checker is capable of reasoning about. So long as there's a counterpart to diff against.

What's the goal of this particular differential harness?

Let's start by introducing the Program Under Test (PUT), a data-structure library I've been working on. Though scapegoat is intended as a somewhat "drop-in" (largely API-compatible) replacement for the standard library's BTreeSet/BTreeMap, it's not as fast or mature - Rust's standard ordered set/map implementations are masterpieces of systems programming.

scapegoat does, however, offer the following unique features:

#![no_std] portability: it'll run on embedded devices without a dynamic memory allocator and/or an OS (e.g. "bare-metal"). Collection storage is stack-allocated using compile-time, per-constructor configuration (via const generics).
Strictly #![forbid(unsafe_code)]: the library, and all three of its dependencies, use only the safe subset of Rust. That means all emitted code maximizes memory safety guarantees for all possible executions.
Fallibility: Out-Of-Memory (OOM) panic! is avoidable: try_* API variants return Result<_, SgError>.

How can we be sure scapegoat correctly replicates the functionality of a standard library container under this set of self-imposed constraints? Relying on unit tests alone is a bit egocentric. If a developer had the omniscience to write unit tests for all possible edge cases, they'd likely also be capable of writing correct code on the first go. This is where differential fuzzing comes in.

The remainder of this post is a hands-on tutorial in writing a differential harness for comparing std::collections::BTreeMap against scapegoat::SgMap. Repetitive per-API code is excluded for brevity (for tis "the soul of wit"!), but you can view the real harness here. Our setup will be:

Coverage-guided: We'll leverage granular instrumentation and genetic algorithms to automatically produce test cases that cover more of the code.
Structure-aware: The fuzzer's input bit-stream will be transformed into a sequence of valid library API calls and parameters. That's a best-case scenario for maximizing library coverage.
Differential: We'll treat the std::collections APIs as "known good" models, and validate that scapegoat's APIs are logically equivalent (return the same results) and equally reliable (don't panic!).

Without further ado, let's bash some bits!

Rigging Up the Harness

0) Tooling Setup

We'll use libFuzzer: a modern, mature, open-source fuzzer. It's actively maintained as part of the LLVM compiler infrastructure project - on which Rust and many others rely. The cargo-fuzz wrapper makes using libFuzzer easy and fun. Assuming you already have the Rust toolchain installed and are running x86-64 Linux or MacOS, getting set up takes just two commands:

rustup default nightly
cargo install cargo-fuzz

While we need the nightly toolchain to perform fuzzing, it doesn't introduce any dependencies to stable programs being fuzzed. In other words, you can add harnesses to production libraries with low MSRVs.

Next, clone the below repo and initialize a harness. The repo contains scapegoat v2.1 code, but with a subtle bug intentionally inserted. We'll find this bug via fuzzing.

git clone git@github.com:tnballo/diff-fuzz-blog-post.git
cd diff-fuzz-blog-post
cargo fuzz init

If you're on Windows, you can use the Docker container here.

1) Method Dispatch and Argument Generation

After running cargo fuzz init, you should have the following starter harness in fuzz\fuzz_targets\fuzz_target_1.rs:

#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // fuzzed code goes here
});

The body of fuzz_target is libFuzzer's entry point; it'll get called repeatedly with random data until an error condition (crash, panic!, assert! failure, etc) is found.

To make our harness structure-aware, we need a way to convert the random, unstructured data stream into a sequence of calls to library APIs. Per the Rust Fuzz Book, the Arbitrary trait lets us perform that conversion:

We'll create an enum in which each variant represents one library function and its parameters.
Because our enum will derive Arbitrary, the fuzzer can choose function call order and generate parameters for each call.
Every generated parameter is of the correct type but has an unpredictable value.

For this post, we'll libfuzzer-sys = "0.4" arbitrary

Now use use use use use enum } });

Note } } });

M }

Here's }

The

get,insert, remove, and extend. That last interface is defined by a trait (see std::iter:Extend), but that doesn't make it special as far as writing the harness is concerned. Fuzzing trait implementations is like fuzzing any other function.

To derive Arbitrary, we need to update fuzz/Cargo.toml to include the arbitrary crate as a dependency:

[dependencies] = { version = "1", features = ["derive"] } we can add our API-dispatch enum to fuzz_target_1.rs, along with imports for Arbitrary, Debug, and the data structures under test:#![no_main] style=color:#ccc>core::fmt::Debug; style=color:#ccc>libfuzzer_sys::fuzz_target; style=color:#ccc>arbitrary::Arbitrary; style=color:#ccc>buggy_scapegoat::SgMap; style=color:#ccc>std::collections::BTreeMap; #[derive(Arbitrary, Debug)] MapMethod<K: Ord + Debug, V: Debug> { Get { key: K }, Insert { key: K, val: V }, Remove { key: K }, Extend { other: Vec<(K, V)> }, fuzz_target!(|methods: Vec<MapMethod<usize, usize>>| { // fuzzed code goes here how we changed fuzz_target's input type from data: &[u8] to methods: Vec<MapMethod<usize, usize>> (both types implement Arbitrary). Instead of the fuzzer providing a buffer of data, it'll provide a list of function/parameter pairs. Our choice of usize keys and values is better for throughput than using, say, Strings or complex structs. But it doesn't hamper our ability to test ordered map logic.2) Logical Equivalence Assertions
The final piece is the body of fuzz_target's closure. We need to create an instance of each data structure, iterate over methods, provide each input to both libraries, and check that the output and/or effect is equivalent. Here is that differential logic:
fuzz_target!(|methods: Vec<MapMethod<usize, usize>>| { const CAPACITY: usize = 2048; let mut sg_map = SgMap::<_, _, CAPACITY>::new(); // Data structure under test let mut bt_map = BTreeMap::new(); // Reference data structure for m in methods { match m { MapMethod::Get { key } => { assert_eq!( sg_map.get(&key), bt_map.get(&key) ); } MapMethod::Insert { key, val } => { if sg_map.len() < sg_map.capacity() { assert_eq!( sg_map.insert(key, val), bt_map.insert(key, val) ); assert_eq!(sg_map.len(), bt_map.len()); } } MapMethod::Remove { key } => { assert_eq!( sg_map.remove(&key), bt_map.remove(&key) ); assert_eq!(sg_map.len(), bt_map.len()); } MapMethod::Extend { other } => { if (sg_map.len() + other.len()) <= sg_map.capacity() { sg_map.extend(other.clone().into_iter()); bt_map.extend(other.into_iter()); assert!(sg_map.iter().eq(bt_map.iter())); } } apMethod::Get and MapMethod::Remove are straightforward: both map implementations should return the same value, if any, because they maintain the same key-value pairs. To verify that both collections report the same length after item removal, we add the post-condition assert_eq!(sg_map.len(), bt_map.len());.MapMethod::Insert and MapMethod::Extend are a tad more nuanced. Both check that sg_map's fixed stack capacity won't be exceeded prior to adding items. We could have used SgMap's fallible try_insert and try_extend to handle potential errors instead of avoiding them, but the above approach offers a more direct comparison against BTreeMap's APIs. And it demonstrates a technique for encoding capacity, a factor otherwise unknown to the fuzzer, within the harness.
Map::Extend's final line, assert!(sg_map.iter().eq(bt_map.iter()));, solves a subtle problem. After extending both collections, we'd like to ensure they contain the same items. But SgMap and BTreeMap are different types, so we can't rely on the Eq trait. Instead, we compare their ordered iterators - which produce items of comparable type.
These asserts of equivalence give libFuzzer a semantically rich bug oracle. Instead of relying on crashes, the fuzzer uses a hybrid of "state-machine" and "specification" (a formidable chimera!) to validate a complex notion of "correctness".
Finding the Bug
Recall that the repo for this post purposely introduces an insidious bug. It's a logic bug within a function for removing nodes from a binary tree. The full context of how this function works isn't important (see source if interested).
The important part is that the bug can cause our library to silently "lose" a subtree. From an end-user's perspective, this means items they insert can magically disappear. But only some of the items, and only some of the time. That's bananas!
The original, bug-free code snippet looked like this:
loop { let min_node = &self.arena[min_idx]; match min_node.left_idx() { // Continue search for min node Some(lt_idx) => { min_parent_idx = min_idx; min_idx = lt_idx; 61;background-color:#171717>// More code here... the buggy code:loop { let min_node = &self.arena[min_idx]; match min_node.left_idx() { // Continue search for min node Some(lt_idx) => { min_idx = lt_idx; min_parent_idx = min_idx; 61;background-color:#171717>// More code here... bug swaps the order of the two statements nested within the Some match arm. That would be tricky to spot in code review, even for the eagle-eyed. I would know: this is a real bug I inadvertently introduced early in scapegoat's development.Worse yet, the buggy library passes all rustdoc tests. These tests are stolen ported directly from BTreeMap's API documentation, meaning all map example code works perfectly fine. The bug isn't triggered unless the internal tree is in a particular state when a particular node is removed. Because the tree is self-balancing, that state isn't trivial to reproduce. Well, unless you're using a differential fuzzer. Which can find the bad state in mere seconds.
We could start the fuzzing campaign with the simple command cargo fuzz run fuzz_target_1, but for real-world use it's often worth augmenting the command with some of the below flags:
cargo fuzz run fuzz_target_1 --release --debug-assertions -s address --jobs $(nproc) -- -max_len=65536 >--release ensures the binary is compiled with optimizations, aiding throughput.
--debug-assertions enables debug_assert! statements and integer overflow checks (despite release mode). This flag can help catch interesting invariant violations a fuzzer would otherwise miss.
-s address enables AddressSanitizer, a granular instrumentation tool that can catch memory corruption errors as tiny as a single byte in size - even if the program doesn't crash or change it's output. While this needlessly lowers throughput when fuzzing a fully safe Rust program, it's a good habit to get into if you rely on unsafe dependencies.
--jobs $(nproc) is a Linux-ism for starting one fuzzer instance/job per available host core. On a multi-core machine, this makes fuzzing far more efficient. libFuzzer's internals allow jobs to coordinate seeds/mutations among themselves.
-max_len=65536 increases the upper limit length of the byte-stream produced by the fuzzer (what we're converting to API calls). The default would've still caught our crash in this example, but testing longer call sequences is generally a good idea.

Note --release, --debug-assertions, and -s address are default settings: they apply even if these flags are omitted. But being explicit never hurts, and the flags' functionality is worth understanding.

After running the above command to kick-off a fuzzing campaign, you'll see terminal scroll with mutation and/or coverage information. But soon the fuzzer will find an assert failure! It may take a couple seconds, depending on the speed of your machine. Once the failure is found, you should see a message akin to the following buried before a stack trace:

thread '<unnamed>' panicked at 'assertion failed: sg_map.iter().eq(bt_map.iter())', fuzz_targets/fuzz_target_1.rs:49:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
==33023== ERROR: libFuzzer: deadly signal

Toward the bottom of the output, you'll see a command provided underneath the text Minimize test case with:. LibFuzzer has a built-in minimizer that will take a crashing input and attempt to shrink it down to the smallest possible input that still causes the same crash. Copy/paste and run this command, which should look like the below:

cargo fuzz tmin -O -a fuzz_target_1 fuzz/artifacts/fuzz_target_1/crash-<SOME_HASH>

For the crash discovered on my machine, the minimization result was:

Minimized artifact:

	fuzz/artifacts/fuzz_target_1/minimized-from-f4b1b840a1627a2a3b427fd923adec6aaf5c958b

Output of `std::fmt::Debug`:

	[
	    Extend {
	        other: [
	            (
	                12225489209634957737,
	                18446744073695242665,
	            ),
	            (
	                18446744073709551615,
	                18446744073709551615,
	            ),
	            (
	                18446744073709551103,
	                8029759142086311935,
	            ),
	            (
	                2965947086368182160,
	                8214752266484842281,
	            ),
	        ],
	    },
	    Remove {
	        key: 12225489209634957737,
	    },
	    Extend {
	        other: [],
	    },
	]

Now if we didn't already know where the bug is, we could manually translate this minimized call sequence into a short unit test ending with the fuzzer output's failing assert (assert!(sg_map.iter().eq(bt_map.iter()));). From there it's a matter of using gdb (or Mozilla rr) to root cause the test failure. And we get to keep that unit test in our suite afterwards.

Notice how the assert fires after the fuzzer calls extend with an empty list (last call in the sequence) and compares iterators, even though it was the previous remove call that caused our library to enter a "bad state". It's shockingly common for the root cause of an error to be far, in terms of SLoC or execution time, from where the error is detected. Thankfully, the minimized test case provides a full sequence of reproducible events.

`Takeaway`

A pessimist would grumble at the notion of relying on random chance to secure code. An optimist will highlight that stochastic processes are widely used in mathematical modeling to draw empirical conclusions. A realist knows the weakest link is the first to be compromised, and even the dumbest of fuzzers can find shallow bugs before an adversary does.

Since the heyday of AFL, coverage-guided fuzzing has been the darling of dynamic security analysis. Many an academic hath dubiously claimed to build atop the shoulders of the giant. But, to again quote Shakespare completely out of context, "no legacy is so rich as honesty". And the current data indicates the coverage-guided king has yet to be convincingly dethroned.

Rust code may seldom fall victim to the bug classes for which AddressSanitizer and MemorySanitizer are so keenly attuned. Yet fuzzing's reign is far from over. A well-designed harness can hunt down trigger-able memory corruption vulnerabilities in unsafe dependencies, or, in the differential case, even semantically-meaningful logic bugs.

Run that differential fuzzer long enough and the line between fuzzing and model checking starts to blur. You might be tempted to claim you've demonstrated the equivalence of logical state-spaces between two non-trivial programs. Informally, that could mean:

An interface-level, stochastic guarantee that the internal actions of two transition systems are semantically equivalent. Where:
The interface is defined by the APIs under test, consuming potentially attacker-controlled inputs.
Stochastic implies not absolute (constrained by limitations of dynamic testing) but powerful nonetheless.
Internal actions are "hidden" logic (e.g. inner-workings of disparate data structures backing the same API).
Transition systems are state machines represented by the real PUTs, not specifications or models.

So why aren't we all fuzzing security-sensitive surfaces, at least for a modest number of cycles? Well that comes down to the difference between apes and programmers: apes know when they should be using tools.

Fortunately for us, libFuzzer (C/C++ backend/frontend) and cargo-fuzz (Rust frontend) make it delightfully easy to get those typewriters tapping. Maybe to justify confidence in our own security posture. Or, perhaps, to manifest a CVE masterpiece from the coverage-spelunking depths of someone else's code.

I'd like to thank GitHub user TheDoctor314 for their contributions to the scapegoat crate, including stable-toolchain implementations of counterparts to nightly-only BTreeMap APIs. Much appreciated!

Read a free technical book! I'm fulfilling a lifelong dream and writing a book. It's about developing secure and robust systems software. Although a work-in-progress, the book is freely available online (no paywalls or obligations): https://highassurance.rs/

Tiemoko Ballo