
Using Static Analysis and Clang To Find Heartbleed


Background

Friday night I sat down with a glass of Macallan 15 and decided to write a static checker that would find the Heartbleed bug. I decided that I would write it as an out-of-tree clang analyzer plugin and evaluate it on a few very small functions that had the spirit of the Heartbleed bug in them, and then finally on the vulnerable OpenSSL code-base itself.

The Clang project ships an analysis infrastructure with its compiler; it’s invoked via scan-build. It hooks whatever existing make system you have to interpose the clang analyzer into the build process, and the analyzer is invoked with the same arguments as the compiler. This way, the analyzer can ‘visit’ every compilation unit in the program that compiles under clang. There are some limitations to clang analyzer that I’ll touch on in the discussion section.

This exercise added to my list of things that I can only do while drinking: I have the best success with first-order logic while drinking beer, and I have the best success with clang analyzer while drinking scotch.

Strategy

One approach to identifying Heartbleed statically was recently proposed by Coverity: taint the return values of calls to ntohl and ntohs as input data. One problem with doing static analysis on a big state machine like OpenSSL is that your analysis either has to know the state machine to track which values are attacker-influenced across the whole program, or it has to rely on some kind of annotation in the program that tells the analysis where input data is used.

I like this observation because it is pretty actionable. You mark ntohl calls as producing tainted data, which is a heuristic, but a pretty good one because programmers probably won’t htonl their own data.

What our clang analyzer plugin should do is identify locations in the program where variables are written using ntohl, taint them, and then alert when those tainted values are used as the size parameter to memcpy. Except that isn’t quite right: the use could be safe. So we’ll also check the constraints of the tainted values at the location of the call: if the tainted value hasn’t been constrained in some way by the program logic, and it’s used as an argument to memcpy, alert on a bug. This could also miss some bugs, but I’m writing this over a 24h period with some Scotch, so increasing precision can come later.

Clang analyzer details

The clang analyzer implements a type of symbolic execution to analyze C/C++ programs. Plugging in to this framework as an analyzer requires bending your mind around the clang analyzer view of program state. This is where I consumed the most scotch.

The analyzer, under the hood, performs a symbolic/abstract exploration of program state. This exploration is flow and path sensitive, so it is different from traditional compiler data flow analysis. The analysis maintains a “state” object for each path through the program, and in this state object are constraints and facts about the program’s execution on that path. This state object can be queried by your analyzer, and, your analyzer can change the state to include information produced by your analysis.

This was one of my biggest hurdles when writing the analyzer – once I have a “symbolic variable” in a particular state, how do I query the range of that symbolic variable? Say there is a program fragment that looks like this:

int data = ntohl(pkt_data);
if(data >= 0 && data < sizeof(global_arr)) {
 // CASE A
...
} else {
 // CASE B
 ...
}

When looking at this program from the analyzer’s point of view, the state “splits” at the if into two different states A and B. In state A, there is a constraint that data is between certain bounds, and in state B there is a constraint that data is NOT within certain bounds. How do you access this information from your checker?

If your checker calls the “dump” method on its given “state” object, data like the following will be printed out:

Ranges of symbol values:
 conj_$2{int} : { [-2147483648, -2], [0, 2147483647] }
 conj_$9{uint32_t} : { [0, 6] }

In this example, conj_$9{uint32_t} is our ‘data’ value above and the state is in the A state. We have a range on ‘data’ that places it between 0 and 6. How can we, as the checker, observe that there’s a difference between this range and an unconstrained range of, say, [-2147483648, 2147483647]?

The answer is, we create a formula that tests the symbolic value of ‘data’ against some conditions that we enforce, and then we ask the state what program states exist when this formula is true and when it is false. If a new formula contradicts an existing formula, the state is infeasible and no state is generated. So we create a formula that says, roughly, “data > 500” to ask if data could ever be greater than 500. When we ask the state for new states where this is true and where it is false, it will only give us a state where it is false.

This is the kind of idiom used inside of clang analyzer to answer questions about constraints on state. The array bounds checkers use this trick to identify states where the size of an array is not used as a constraint on indexes into the array.
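
To make the idiom concrete, here is a minimal sketch in the spirit of the checker’s isArgUnConstrained helper: a hypothetical couldExceed that asks whether a symbolic value may ever be greater than a bound. The calls follow the clang analyzer API of this era; treat exact signatures as approximate.

// Sketch: ask whether 'val' could ever be greater than 'bound' in 'state'.
// Returns true when a feasible state exists where val > bound.
static bool couldExceed(SVal val, uint64_t bound,
                        SValBuilder &svalBuilder,
                        ProgramStateRef state) {
  Optional<NonLoc> valNL = val.getAs<NonLoc>();
  if(!valNL)
    return true; // can't reason about it: be conservative

  // Build the formula "val > bound".
  NonLoc boundNL = svalBuilder.makeIntVal(bound, /*isUnsigned=*/true);
  SVal gt = svalBuilder.evalBinOpNN(state, BO_GT, *valNL, boundNL,
                                    svalBuilder.getConditionType());
  Optional<DefinedOrUnknownSVal> cond = gt.getAs<DefinedOrUnknownSVal>();
  if(!cond)
    return true;

  // Ask for the states where the formula holds and where it does not.
  // An infeasible side comes back as null.
  std::pair<ProgramStateRef, ProgramStateRef> states = state->assume(*cond);
  return states.first != nullptr; // the "true" state is feasible
}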

Implementation

Your analyzer is implemented as a C++ class. You define the different “check” callbacks you want to be notified of as the analyzer explores program state. For example, if your analyzer wants to consider the arguments to a function call before the function is called, you create a member method with a signature that looks like this:

void checkPreCall(const CallEvent &Call, CheckerContext &C) const;

Your analyzer can then match on the function about to be (symbolically) invoked. So our implementation works in three stages:

  1. Identify calls to ntohl/ntohs
  2. Taint the return value of those calls
  3. Identify unconstrained uses of tainted data

We accomplish the first and second with a checkPostCall visitor that roughly does this:

void NetworkTaintChecker::checkPostCall(const CallEvent &Call,
                                        CheckerContext &C) const {
  const IdentifierInfo *ID = Call.getCalleeIdentifier();

  if(ID == NULL) {
    return;
  }

  if(ID->getName() == "ntohl" || ID->getName() == "ntohs") {
    ProgramStateRef State = C.getState();
    SymbolRef       Sym = Call.getReturnValue().getAsSymbol();

    if(Sym) {
      ProgramStateRef newState = State->addTaint(Sym);
      C.addTransition(newState);
    }
  }
}

Pretty straightforward: we get the return value, taint it if present, and add the state with the tainted return value as an output of our visit via ‘addTransition’.

For the third goal, we have a checkPreCall visitor that considers a function call’s parameters like so:

void NetworkTaintChecker::checkPreCall(const CallEvent &Call,
                                       CheckerContext &C) const {
  ProgramStateRef State = C.getState();
  const IdentifierInfo *ID = Call.getCalleeIdentifier();

  if(ID == NULL) {
    return;
  }

  if(ID->getName() == "memcpy") {
    SVal SizeArg = Call.getArgSVal(2);

    if(State->isTainted(SizeArg)) {
      SValBuilder      &svalBuilder = C.getSValBuilder();
      Optional<NonLoc> SizeArgNL = SizeArg.getAs<NonLoc>();

      if(this->isArgUnConstrained(SizeArgNL, svalBuilder, State) == true) {
        ExplodedNode *loc = C.generateSink();
        if(loc) {
          BugReport *bug = new BugReport(*this->BT,
            "Tainted, unconstrained value used in memcpy size", loc);
          C.emitReport(bug);
        }
      }
    }
  }
}
Also relatively straightforward, our logic to check if a value is unconstrained is hidden in ‘isArgUnConstrained’, so if a tainted, symbolic value has insufficient constraints on it in our current path, we report a bug.

Some implementation pitfalls

It turns out that OpenSSL doesn’t use ntohs/ntohl; it has n2s / n2l macros that re-implement the byte-swapping logic. If this were in LLVM IR, it would be tractable to write a “byte-swapping recognizer” that proves when a piece of code implements the semantics of a byte swap.

There is also some behavior I have not figured out in clang’s creation of the AST for OpenSSL, where calls to ntohs are replaced with __builtin_pre(__x), which has no IdentifierInfo and thus no name. To work around this, I replaced the n2s macro with a call to a function named xyzzy and adapted my function check from above to look for a function named xyzzy. The resulting binary fails to link, but that doesn’t matter: the analyzer only needs the code to compile. This worked well enough to identify the Heartbleed bug.
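
My reconstruction of the workaround looks roughly like this (the original n2s definition is simplified from OpenSSL’s, and xyzzy is deliberately left undefined):

/* Original OpenSSL macro (simplified): byte-swap two bytes into s. */
/* #define n2s(c,s) ((s=(((unsigned int)(c[0]))<<8)| \
                         ((unsigned int)(c[1]))),(c)+=2) */

/* Workaround: route the read through a named function so the checker
   sees a call with an IdentifierInfo it can match on. */
unsigned short xyzzy(const unsigned char *c);
#define n2s(c,s) ((s) = xyzzy(c), (c) += 2)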

Solution output with demo programs and OpenSSL

First let’s look at some little toy programs. Here is one toy example with output:

$ cat demo2.c

...

int data_array[] = { 0, 18, 21, 95, 43, 32, 51};

int main(int argc, char *argv[]) {
  int   fd;
  char  buf[512] = {0};

  fd = open("dtin", O_RDONLY);

  if(fd != -1) {
    int size;
    int res;

    res = read(fd, &size, sizeof(int));

    if(res == sizeof(int)) {
      size = ntohl(size);

      if(size < sizeof(data_array)) {
        memcpy(buf, data_array, size);
      }

      memcpy(buf, data_array, size);
    }

    close(fd);
  }

  return 0;
}

$ ../docheck.sh
scan-build: Using '/usr/bin/clang' for static analysis
/usr/bin/ccc-analyzer -o demo2 demo2.c
demo2.c:30:7: warning: Tainted, unconstrained value used in memcpy size
      memcpy(buf, data_array, size);
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
scan-build: 1 bugs found.
scan-build: Run 'scan-view /tmp/scan-build-2014-04-26-223755-8651-1' to
examine bug reports.

And finally, to see it catching Heartbleed in both locations it was present in OpenSSL, see the following:

[Screenshots: analyzer warnings at both vulnerable locations in OpenSSL]

Discussion

The approach needs some improvement: we reason about whether a tainted value is “appropriately” constrained in a very coarse-grained way. Sometimes that’s the best you can do though – if your analysis doesn’t know how large a particular buffer is, perhaps it’s enough to show an analyst “hey, this value could be larger than 5000 and it is used as a parameter to memcpy, is that okay?”

I really don’t like the limitation in clang analyzer of operating on ASTs. I spent a lot of time fighting with the clang AST representation of ntohs and I still don’t understand what the source of the problem was. I kind of just want to consider a program’s semantics in a virtual machine with very simple semantics, so LLVM IR seems ideal to me. This might just be my PL roots showing though.

I really do like the clang analyzer’s interface to path constraints. That interface is pretty powerful, and once you get your head around phrasing your problem as questions about whether new states satisfying your constraints are feasible, it’s pretty straightforward to write new analyses.

Edit: Code Post

I’ve posted the code for the checker to GitHub, here.



Dear DARPA: Challenge Accepted.



We are proud to have one of only seven accepted funded-track proposals to DARPA’s Cyber Grand Challenge.

Computer security experts from academia, industry and the larger security community have organized themselves into more than 30 teams to compete in DARPA’s Cyber Grand Challenge: a first-of-its-kind tournament designed to speed the development of automated security systems able to defend against cyberattacks as fast as they are launched. DARPA also announced today that it has reached an agreement to hold the 2016 Cyber Grand Challenge final competition in conjunction with DEF CON, one of the largest computer security conferences in the world.


Our participation in this program aligns with our development of Javelin, an automated system for simulating attacks against enterprise networks. We have assembled a world-class team of experts in software security, capture the flag, and program analysis to compete in this challenge. As much as we wish the other teams luck in this competition, Trail of Bits is playing to win. Game on!


A Preview of McSema


On June 28th Artem Dinaburg and Andrew Ruef will be speaking at REcon 2014 about a project named McSema. McSema is a framework for translating x86 binaries into LLVM bitcode. This translation is the opposite of what happens inside a compiler. A compiler translates LLVM bitcode to x86 machine code. McSema translates x86 machine code into LLVM bitcode.

Why would we do such a crazy thing?

Because we wanted to analyze existing binary applications, and reasoning about LLVM bitcode is much easier than reasoning about x86 instructions. Not only is it easier to reason about LLVM bitcode, but it is easier to manipulate and re-target bitcode to a different architecture. There are many program analysis tools (e.g. KLEE, PAGAI, LLBMC) written to work on LLVM bitcode that can now be used on existing applications. Additionally it becomes much simpler to transform applications in complex ways while maintaining original application functionality.

McSema brings the world of LLVM program analysis and manipulation tools to binary executables. There are other x86 to LLVM bitcode translators, but McSema has several advantages:

  • McSema separates control flow recovery from translation, permitting the use of custom control flow recovery front-ends.
  • McSema supports FPU instructions.
  • McSema is open source and licensed under a permissive license.
  • McSema is documented, works, and will be available soon after our REcon talk.

This blog post will be a preview of McSema and will examine the challenges of translating a simple function that uses floating point arithmetic from x86 instructions to LLVM bitcode. The function we will translate is called timespi. It takes one argument, k, and returns the value of k * PI. Source code for timespi is below.

long double timespi(long double k) {
    long double pi = 3.14159265358979323846;
    return k*pi;
}

When compiled with Microsoft Visual Studio 2010, the assembly looks like the IDA Pro screenshot below.


This is what the original timespi function looks like in IDA.

After translating to LLVM bitcode with McSema and then re-emitting the bitcode as an x86 binary, the assembly looks much different.


How timespi looks after translation to LLVM and re-emission back as an x86 binary. The new code is considerably larger. Below, we explain why.

You may be saying to yourself: “Wow, that much code bloat for such a small function? What are these guys doing?”

We specifically wanted to use this example because it shows floating point support, functionality that is unique to McSema, and because it showcases the difficulties inherent in x86 to LLVM bitcode translation.

Translation Background

McSema models x86 instructions as operations on a register context. That is, there is a register context structure that contains all registers and flags, and instruction semantics are expressed as modifications of structure members. This concept is easiest to understand with a simplified pseudocode example. An operation such as ADD EAX, EBX would be translated to context[EAX] += context[EBX].
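
A minimal sketch of the idea, with illustrative names rather than McSema’s actual types:

#include <cstdint>

// Register context: one structure holds the entire architectural state.
struct RegState {
  uint32_t EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP;
  uint32_t EFLAGS;
};

// "ADD EAX, EBX" becomes a function that updates the context.
void sem_ADD_EAX_EBX(RegState &ctx) {
  uint64_t wide = (uint64_t)ctx.EAX + (uint64_t)ctx.EBX;
  ctx.EAX = (uint32_t)wide;
  // A real translator also models OF, SF, ZF, AF, and PF; only the
  // carry flag is shown here.
  const uint32_t CF = 1u << 0;
  ctx.EFLAGS = (ctx.EFLAGS & ~CF) | ((wide >> 32) ? CF : 0u);
}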

Translation Difficulties

Now let’s examine why a small function like timespi presents serious translation challenges.

The value of PI is read from the data section.

Control flow recovery must detect that the first FLD instruction references data and correctly identify the data size. McSema separates control flow recovery from translation, and hence can leverage IDA’s excellent CFG recovery via an IDAPython script.

The translation needs to support x86 FPU registers, FPU flags, and control bits.

The FPU registers aren’t like integer registers. Integer registers (EAX, ECX, EBX, etc.) are named and independent. Instructions referencing EAX will always refer to the same place in a register context.

FPU registers are a stack of 8 data registers (ST(0) through ST(7)), indexed by the TOP flag. Instructions referencing ST(i) actually refer to st_registers[(TOP + i) % 8] in a register context.


This is Figure 8-2 from the Intel IA-32 Software Development Manual. It very nicely depicts the FPU data registers and how they are implicitly referenced via the TOP flag.

Integer registers are defined solely by register contents. FPU registers are partially defined by register contents and partially by the FPU tag word. The FPU tag word is a bitmap that defines whether the contents of a floating point register are:

  • Valid (that is, a normal floating point value)
  • The value zero
  • A special value such as NaN or Infinity
  • Empty (the register is unused)

To determine the value of an FPU register, one must consult both the FPU tag word and the register contents.
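
Putting the two together, here is a sketch of what reading ST(i) involves (illustrative names and a simplified tag encoding, not McSema’s actual types):

#include <cstdint>

enum FPUTag : uint8_t { TAG_VALID, TAG_ZERO, TAG_SPECIAL, TAG_EMPTY };

struct FPUState {
  long double st_registers[8]; // physical data registers R0..R7
  FPUTag      tag[8];          // FPU tag word, one entry per register
  uint8_t     TOP;             // index of the register holding ST(0)
};

// ST(i) is indexed through TOP, and the tag word is consulted before
// the register contents are trusted.
long double readST(const FPUState &fpu, unsigned i) {
  unsigned phys = (fpu.TOP + i) % 8; // ST(i) -> physical register
  if (fpu.tag[phys] == TAG_ZERO)
    return 0.0L;                     // the tag word overrides the contents
  // A full model would raise a stack-underflow fault on TAG_EMPTY.
  return fpu.st_registers[phys];
}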

The translation needs to support at least the FLD, FSTP, and FMUL instructions.

The actual instruction operations, such as loads, stores, and multiplication, are fairly straightforward to support. The difficult part is implementing FPU execution semantics.

For instance, the FPU stores state about FPU instructions, like:

  • Last Instruction Pointer: the location of the last executed FPU instruction
  • Last Data Pointer: the address of the latest memory operand to an FPU instruction
  • Opcode: The opcode of the last executed FPU instruction

Some of these concepts are easier to translate to LLVM bitcode than others. Storing the address of the last memory operand translates very well: if the translated instruction references memory, store the memory address in the last data pointer field of the register context. Other concepts simply don’t translate. As an example, what does the “last instruction pointer” mean when a single FPU instruction is translated into multiple LLVM operations?

Self-referencing state isn’t the end of translation difficulties. FPU flags like the precision control and rounding control flags affect instruction operation. The precision control flag affects arithmetic operations, not the precision of stored registers. So one can load double extended precision values into ST(0) and ST(1) via FLD, but FMUL may store a single precision result in ST(0).
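
A rough sketch of that effect, assuming a simplified three-value precision flag:

enum Precision { PC_SINGLE, PC_DOUBLE, PC_EXTENDED };

// The precision control flag narrows the arithmetic result, not the
// storage: the register still holds a long double afterwards.
long double fmul_with_pc(long double a, long double b, Precision pc) {
  long double product = a * b;
  switch (pc) {
  case PC_SINGLE: return (float)product;  // round to a 24-bit significand
  case PC_DOUBLE: return (double)product; // round to a 53-bit significand
  default:        return product;         // keep the 64-bit significand
  }
}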

Translation Steps

Now that we’ve explored the difficulties of translation, let’s look at the steps needed to translate just the core of timespi, the FMUL instruction. The IA-32 Software Development Manual defines this instance of FMUL as “Multiply ST(0) by m64fp and store result in ST(0).” Below are just some of the steps required to translate FMUL to LLVM bitcode.

  • Check the FPU tag word for ST(0); make sure it’s not empty.
  • Read the TOP flag.
  • Read the value from st_registers[TOP], unless the FPU tag word says the value is zero, in which case just read a zero.
  • Load the value pointed to by m64fp.
  • Do the multiplication.
  • Check the precision control flag. Adjust the precision of the result as needed.
  • Write the adjusted result into st_registers[TOP].
  • Update the FPU tag word for ST(0) to match the result. Maybe we multiplied by zero?
  • Update FPU status flags in the register context. For FMUL, this is just the C1 flag.
  • Update the last FPU opcode field
  • Did our instruction reference data? Sure did! Update the last FPU data field to m64fp.
  • Skip updating the last FPU instruction field since it doesn’t really map to LLVM bitcode… for now

That’s a lot of work for a single instruction, and the list isn’t even complete. In addition to the work of translating raw instructions, there are additional steps that must be taken on function entry and exit points, for external calls, and for functions that have their address taken. Those additional details will be covered during the REcon talk.

Conclusion

Translating floating point operations is a tricky, difficult business. Seemingly simple floating point instructions hide numerous operations and translate to a large amount of LLVM bitcode. The translated code is large because McSema exposes the hidden complexity of floating point operations. Considering that there have been no attempts to optimize instruction translation, we think the current output is pretty good.

For a more detailed look at McSema, attend Artem and Andrew’s talk at REcon and keep following the Trail of Bits blog for more announcements.

EDIT: McSema is now open-source. See our announcement for more information.


McSema is Officially Open Source!


We are proud to announce that McSema is now open source! McSema is a framework for analyzing and transforming machine-code programs to LLVM bitcode. It supports translation of x86 machine code, including integer, floating point, and SSE instructions. We previously covered some features of McSema in an earlier blog post and in our talk at REcon 2014.

Our talk at REcon, where we first described McSema

Build instructions and demos are available in the repository and we encourage you to try them on your own. We have created a mailing list, mcsema-dev@googlegroups.com, dedicated to McSema development and usage. Questions about licensing or integrating McSema into your commercial project may be directed to opensource@trailofbits.com.

McSema is permissively licensed under a three-clause BSD license. Some code and utilities we incorporate (e.g. Intel PIN for semantics testing) have their own licenses and need to be downloaded separately.

Finally, we would like to thank DARPA for their sponsorship of McSema development and their continued support. This project would not have been possible without them.


ReMASTering Applications by Obfuscating during Compilation


In this post, we discuss the creation of a novel software obfuscation toolkit, MAST, implemented in the LLVM compiler and suitable for denying program understanding to even the most well-resourced adversary. Our implementation is inspired by effective obfuscation techniques used by nation-state malware and techniques discussed in academic literature. MAST enables software developers to protect applications with technology developed for offense.

MAST is a product of Cyber Fast Track, and we would like to thank Mudge and DARPA for funding our work. This project would not have been possible without their support. MAST is now a commercial product offering of Trail of Bits and companies interested in licensing it for their own use should contact info@trailofbits.com.

Background

There are a lot of risks in releasing software these days. Once upon a time, reverse engineering software presented a challenge best solved by experienced and skilled reverse engineers at great expense. It was worthwhile for reasonably well-funded groups to reverse engineer and recreate proprietary technology or for clever but bored people to generate party tricks. Despite the latter type of people causing all kinds of mild internet havoc, reverse engineering wasn’t widely considered a serious threat until relatively recently.

Over time, however, the stakes have risen; criminal entities, corporations, even nation-states have become extremely interested in software vulnerabilities. These entities seek either to defend their own networks, applications, and users, or to attack someone else’s. Historically, software obfuscation was a concern of the “good guys”, who were interested in protecting their intellectual property. It wasn’t long before malicious entities began obfuscating their own tools to protect captured tools from analysis.

A recent example of successful obfuscation is that used by the authors of the Gauss malware; several days after discovering the malware, Kaspersky Lab, a respected malware analysis lab and antivirus company, posted a public plea for assistance in decrypting a portion of the code. That even a company of professionals had trouble enough to ask for outside help is telling: obfuscation can be very effective. Professional researchers have been unable to deobfuscate Gauss to this day.

Motivation

With all of this in mind, we were inspired by Gauss to create a software protection system that leapfrogs available analysis technology. Could we repurpose techniques from software exploitation and malware obfuscation into a state-of-the-art software protection system? Our team is quite familiar with publicly available tools for assisting in reverse engineering tasks and considered how to significantly reduce their efficacy, if not deny it altogether.

Software developers seek to protect varying classes of information within a program. Our system must account for each with equal levels of protection to satisfy these potential use cases:

  • Algorithms: adversary knowledge of proprietary technology
  • Data: knowledge of proprietary data (the company’s or the user’s)
  • Vulnerabilities: knowledge of vulnerabilities within the program

In order for the software protection system to be useful to developers, it must be:

  • Easy to use: the obfuscation should be transparent to our development process, not alter or interfere with it. No annotations should be necessary, though we may want them in certain cases.
  • Cross-platform: the obfuscation should apply uniformly to all applications and frameworks that we use, including mobile or embedded devices that may run on different processor architectures.
  • Protect against state-of-the-art analysis: our obfuscation should leapfrog available static analysis tools and techniques and require novel research advances to see through.

Finally, we assume an attacker will have access to the static program image; many software applications are going to be directly accessible to a dedicated attacker. For example, an attacker interested in a mobile application, anti-virus signatures, or software patches will have the static program image to study.

Our Approach

We decided to focus primarily on preventing static analysis; these days there are a lot of tools that can be run statically over application binaries to gain information with little work or time required of the attacker, and many attackers are proficient in generating their own situation-specific tools. Static tools can often be run very quickly over large amounts of code, without requiring the attacker to have an environment in which to execute the target binary.

We decided on a group of techniques that compose together, comprising opaque predicate insertion, code diffusion, and – because our original scope was iOS applications – mangling of Objective-C symbols. These make the protected application impossible to understand without environmental data, impossible to analyze with current static analysis tools due to alias analysis limitations, and deny the effectiveness of breakpoints, method name retrieval scripts, and other common reversing techniques. In combination, these techniques attack a reverse engineer’s workflow and tools from all sides.

Further, we did all of our obfuscation work inside of a compiler (LLVM) because we wanted our technology to be thoroughly baked into the entire program. LLVM can use knowledge of the program to generate realistic opaque predicates or hide diffused code inside of false paths not taken, forcing a reverse engineer to consult the program’s environment (which might not be available) to resolve which instruction sequences are the correct ones. Obfuscating at the compiler level is more reliable than operating on an existing binary: there is no confusion about code vs. data or missing critical application behavior. Additionally, compiler-level obfuscation is transparent to current and future development tools based on LLVM. For instance, MAST could obfuscate Swift on the day of release — directly from the Xcode IDE.

Symbol Mangling

The first and simplest technique was to hinder quick Objective-C method name retrieval scripts; this is certainly the least interesting of the transforms, but would remove a large amount of human-readable information from an iOS application. Without method or other symbol names present for the proprietary code, it’s more difficult to make sense of the program at a glance.

Opaque Predicate Insertion

The second technique we applied, opaque predicate insertion, is not a new technique. It’s been done before in numerous ways, and capable analysts have developed ways around many of the common implementations. We created a stronger version of predicate insertion by inserting predicates with opaque conditions and alternate branches that look realistic to a script or person skimming the code. Realistic predicates significantly slow down a human analyst, and will also slow down tools that operate on program control flow graphs (CFGs) by ballooning the graph to be much larger than the original. Increased CFG size affects program size and execution speed, but our testing indicates the impact is smaller than, or consistent with, similar tools.
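
As a hypothetical illustration, not one of MAST’s actual predicates: for any unsigned integer x, (x * x) % 4 is always 0 or 1, so the branch below is dead, yet proving that requires number-theoretic reasoning that CFG-based tools and skimming humans won’t perform.

int real_computation(unsigned x);  // hypothetical
int decoy_computation(unsigned x); // hypothetical: realistic-looking dead code

int dispatch(unsigned x) {
  if (((x * x) % 4) == 2)        // opaque: provably always false
    return decoy_computation(x); // never executes; balloons the CFG
  return real_computation(x);
}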

Code Diffusion

The third technique, code diffusion, is by far the most interesting. We took the ideas of Return-Oriented Programming (ROP) and applied them in a defensive manner.

In a straightforward situation, an attacker exploits a vulnerability in an application and supplies their own code for the target to execute (shellcode). However, since the introduction of non-executable data mitigations like DEP and NX, attackers have had to find ways to execute malicious code without the introduction of anything new. ROP is a technique that makes use of code that is already present in the application. Usually, an attacker would compile a set of short “gadgets” in the existing program text that each perform a simple task, and then link those together, jumping from one to the other, to build up the functionality they require for their exploit — effectively creating a new program by jumping around in the existing program.

We transform application code such that it jumps around in a ROP-like way, scrambling the program’s control flow graph into disparate units. However, unlike ROP, where attackers are limited by the gadgets they can find and their ability to predict their location at runtime, we precisely control the placement of gadgets during compilation. For example, we can store gadgets in the bogus programs inserted during the opaque predicate obfuscation. After applying this technique, reverse engineers will immediately notice that the handy graph is gone from tools like IDA. Further, this transformation will make it impossible to use state-of-the-art static analysis tools, like BAP, and impedes dynamic analysis techniques that rely on concrete execution with a debugger. Code diffusion destroys the semantic value of breakpoints, because a single code snippet may be re-used by many different functions and not used by other instances of the same function.


Native code before obfuscation with MAST


Native code after obfuscation with MAST

The figures above demonstrate a very simple function before and after the code diffusion transform, using screenshots from IDA. In the first figure, there is a complete control flow graph; in the second, however, the first basic block no longer jumps directly to either of the following blocks; instead, it must refer at runtime to a data section elsewhere in the application before it knows where to jump in either case. Running this code diffusion transform over an entire application reduces the entire program from a set of connected-graph functions to a much larger set of single-basic-block “functions.”
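
Conceptually, the transform turns a direct branch into an indirect jump whose target lives in data. The sketch below illustrates the idea; it is not MAST’s actual mechanism:

typedef int (*block_fn)(void);

// Hypothetical successor table: populated at runtime, possibly derived
// from an environment key, so the CFG edge is invisible to a disassembler.
extern block_fn succ_table[2];

int diffused_branch(int cond) {
  // Before diffusion: if (cond) return then_block(); else return else_block();
  return succ_table[cond != 0](); // successor resolved only at runtime
}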

Code diffusion has a noticeable performance impact on whole-program obfuscation. In our testing, we compared the speed of bzip2 before and after our return-oriented transformation and slowdown was approximately 55% (on x86).

Environmental Keying

MAST does one more thing to make reverse engineering even more difficult — it ties the execution of the code to a specific device, such as a user’s mobile phone. While using device-specific characteristics to bind a binary to a device is not new (it is extensively used in DRM and some malware, such as Gauss), MAST is able to integrate device-checking into each obfuscation layer as it is woven through the application. The intertwining of environmental keying and obfuscation renders the program far more resistant to reverse-engineering than some of the more common approaches to device-binding.

Rather than just acquiring a copy of the application, an attacker must also acquire and analyze the execution environment of the target computer. The whole environment is typically far more challenging to get hold of, and has a much larger quantity of code to analyze. Even if the environment is captured and time is taken to reverse engineer application details, the results will not be useful against the same application running on other hosts, because every host runs its own keyed version of the binary.

Conclusions

In summary, MAST is a suite of compile-time transformations that provide easy-to-use, cross-platform, state-of-the-art software obfuscation. It can be used for a number of purposes, such as preventing attackers from reverse engineering security-related software patches; protecting your proprietary technology; protecting data within an application; and protecting your application from vulnerability hunters. While originally scoped for iOS applications, the technologies are applicable to any software that can be compiled with LLVM.


Close Encounters with Symbolic Execution


At THREADS 2014, I demonstrated a new capability of mcsema that enables the use of KLEE, a symbolic execution framework, on software available only in binary form. In the talk, I described how to use mcsema and KLEE to learn an unknown protocol defined in a binary that has never been seen before. In the example, we learned the series of steps required to navigate through a maze. Our competition in the DARPA Cyber Grand Challenge requires this capability — our “reasoning system” will have no prior knowledge and no human guidance, yet must learn to speak with dozens, hundreds, or thousands of binaries, each with unique inputs.

Symbolic Execution

In the first part of this two-part blog post, I’ll explain what symbolic execution is and how it allows our “reasoning system” to learn inputs for arbitrary binaries. In the second part, I will guide you through the maze-solving example presented at THREADS. To describe the power of symbolic execution, we are going to look at three increasingly difficult iterations of a classic computer science problem: maze solving. After that, I’ll talk about KLEE, an LLVM-based symbolic execution framework, and how mcsema enables KLEE to run on binary-only applications.

Maze Solving

One of the classic problems in first year computer science classes is maze solving. Plainly, the problem is this: you are given a map of a maze. Your task is to find a path from the start to the finish. The more formal definition is: a maze is defined by a matrix where each cell can be a step or a wall. One can move into a step cell, but not into a wall cell. The only valid move directions are up, down, left, or right. A sequence of moves from cell to cell is called a path. Some cell is marked as START and another cell is marked as END. Given this maze, find a path from START to END, or show that no such path exists.

An example maze. The step spaces are blank, the walls are +-|, the END marker is the # sign, and the current path is the X’s.

The typical solution to the maze problem is to enumerate all possible paths from START, and search for a path that terminates at END. The algorithm is neatly summarized in this stack overflow post. The algorithm works because it has a complete map of the maze. The map is used to create a finite set of valid paths. This set can be quickly searched to find a valid path.

Maze Solving sans Map

In an artificial intelligence class, one may encounter a more difficult problem: solving a maze without the map. In this problem, the solver has to discover the map prior to finding a path from the start to the end. More formally, the problem is: you are given an oracle that answers questions about maze paths. When given a path, the oracle will tell you if the path solves the maze, hits a wall, or moves to a step position. Given this oracle, find a path from the start to the end, or show there is no path.

The solution to this problem is backtracking. The solver will build the path one move at a time, asking the oracle about the path at every move. If an attempted move hits a wall, the solver will try another direction. If no direction works, the solver returns to the previous position and tries a new direction. Eventually, the solver will either find the end or visit every possible position. Backtracking works because with every answer from the oracle, the solver learns more of the map. Eventually, the solver will learn enough of the map to find the end.
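
A minimal backtracking sketch against a hypothetical oracle interface is below; note that a real solver must also track visited positions so it terminates on mazes with cycles.

#include <string>

enum Answer { SOLVED, WALL, STEP };
Answer oracle(const std::string &path); // hypothetical: judges a candidate path

bool solve(std::string &path) {
  const char moves[] = {'w', 'a', 's', 'd'}; // up, left, down, right
  for (char move : moves) {
    path.push_back(move);
    switch (oracle(path)) {
    case SOLVED:
      return true;
    case STEP:
      if (solve(path))
        return true;
      break;
    case WALL:
      break; // hit a wall: try the next direction
    }
    path.pop_back(); // backtrack
  }
  return false; // no direction works from here
}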

Maze Solving with Fake Walls

Let’s posit an even more difficult problem: a maze with fake walls. That is, some walls are really steps. Since some walls are fake, the solver learns nothing from the oracle until it asks about a complete solution. If this isn’t very clear, imagine a map made entirely of fake walls: for any path except one that solves the maze, the oracle will always answer “wall.” More formally, the problem now is: given an oracle that will verify only a complete path from the start to the end, solve the maze.

This is vastly more difficult than before: the solver can’t learn the map. The only generic solution is to ask the oracle about every possible path. The solver will eventually guess a valid path, since it must be in the set of all paths (assuming the maze is finite). This “brute force” solver is even more powerful than the previous one: it will solve all mazes, map or no map.

Despite its power, the brute force solver has a huge problem: it’s slow and impractical.

Cheat To Win

The last problem is equivalent to the following more general problem: given an oracle that verifies solutions, find a valid solution. Ideally, we want something that finds a valid solution faster than brute-force guessing, especially for generic problems, where we don’t even know what the inputs look like!

So let’s make a “generic problem solver”. Brute force is slow and impractical because it tries every single concrete input, in sequence. What if a solver could try all inputs at once? Humans do this all the time without even thinking. For instance, when we solve equations, we don’t try every number until we find the solution. We use a variable that can stand in for any number, and algorithmically identify the answer.

So how will our solver try every input at once? It will cheat to win! Our solver has an ace up its sleeve: the oracle is a real program. The solver can look at the oracle, analyze it, and find a solution without guessing. Sadly, this is impossible to do for every oracle (because you run into the halting problem). But for many real oracles, this approach works.

For instance, consider the following oracle that declares a winner or a loser:

x = input();
if(x > 5 && x < 9 && x % 4 == 0) {
  winner();
} else {
  loser();
}

The solver could determine that the winner input must be a number greater than 5, less than 9, and evenly divisible by 4. These constraints can be turned into a system of equations and solved, showing that the only winner value is 8.

A hypothetical problem solver could work like this: it will treat input into the oracle as a symbol. That is, instead of picking a specific value as the input, the value will be treated as a variable. The solver will then apply constraints to the symbol that correspond to different branches in the oracle program. When the solver finds a “valid solution” state in the oracle, it solves the accumulated constraints on the input. If the constraints can be solved, the result is a concrete input that reaches the valid solution state. The problem solver tries every possible input at once by converting the oracle into a system of constraints.

This hypothetical problem solver is real: the part that discovers the constraints is called a symbolic execution framework, and the part that solves equations is called an SMT solver.
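
To make the winner/loser example concrete, here is how one might hand those constraints to an SMT solver. This sketch uses Z3’s C++ API purely for illustration; KLEE itself uses STP in this post’s setup.

#include <iostream>
#include "z3++.h"

int main() {
  z3::context ctx;
  z3::expr x = ctx.int_const("x");

  z3::solver s(ctx);
  s.add(x > 5); // constraints gathered along the winner() path
  s.add(x < 9);
  s.add(x % ctx.int_val(4) == 0);

  if (s.check() == z3::sat) {
    z3::model m = s.get_model();
    std::cout << "winner input: " << m.eval(x) << "\n"; // prints 8
  }
  return 0;
}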

The Future Is Now

There are several software packages that combine symbolic execution with SMT solvers to analyze programs. We will be looking at KLEE because it works with LLVM bitcode. We can use KLEE as a generic problem solver to find all valid inputs given an oracle that verifies those inputs. KLEE can solve a maze with hidden walls: Felipe Manzano has an excellent blog post showing how to use KLEE to solve exactly such a maze.

So what does mcsema have to do with this? Well, KLEE works on programs written in LLVM bitcode. Before mcsema, KLEE could only analyze programs that come with source code. Using mcsema, KLEE can be a problem solver for arbitrary binary applications! For instance, given a compiled binary that checks solutions to mazes with hidden walls, KLEE could find all the valid paths through the maze. Or it could do something more useful, like automatically generate application tests with high code coverage, or maybe even find security bugs in binary programs.

But back to maze solving. In Part 2 of this blog post, we’ll take a binary that solves mazes, use mcsema to translate it to LLVM, and then use KLEE to find all valid paths through the maze. More specifically, we will take Felipe’s maze oracle and compile it to a Linux binary. Then, we will use mcsema and KLEE to find all possible maze solutions. Everything will be done without modifying the original binary. The only thing KLEE will know is how to provide input and how to check solutions. In essence, we are going to show how to use mcsema and KLEE to identify all valid inputs to a binary application.


Close Encounters with Symbolic Execution (Part 2)


This is part two of a two-part blog post that shows how to use KLEE with mcsema to symbolically execute Linux binaries (see the first post!). This part will cover how to build KLEE, mcsema, and provide a detailed example of using them to symbolically execute an existing binary. The binary we’ll be symbolically executing is an oracle for a maze with hidden walls, as promised in Part 1.

As a visual example, we’ll show how to get from an empty maze to a solved maze:

[Images: the maze before and after solving]

Building KLEE with LLVM 3.2 on Ubuntu 14.04

One of the hardest parts about using KLEE is building it. The official build instructions cover KLEE on LLVM 2.9 and LLVM 3.4 on amd64. To analyze mcsema generated bitcode, we will need to build KLEE for LLVM 3.2 on i386. This is an unsupported configuration for KLEE, but it still works very well.

We will be using the i386 version of Ubuntu 14.04. The 32-bit version of Ubuntu is required to build a 32-bit KLEE. Do not try adding -m32 to CFLAGS on a 64-bit version. It will take away hours of your time that you will never get back. Get the 32-bit Ubuntu. The exact instructions are described in great detail below. Be warned: building everything will take some time.

# These are instructions for how to build KLEE and mcsema. 
# These are a part of a blog post explaining how to use KLEE
# to symbolically execute closed source binaries.
 
# install the prerequisites
sudo apt-get install vim build-essential g++ curl python-minimal \
  git bison flex bc libcap-dev cmake libboost-dev \
  libboost-program-options-dev libboost-system-dev ncurses-dev nasm
 
# we assume everything KLEE related will live in ~/klee.
cd ~
mkdir klee
cd klee
 
# Get the LLVM and Clang source, extract both
wget http://llvm.org/releases/3.2/llvm-3.2.src.tar.gz
wget http://llvm.org/releases/3.2/clang-3.2.src.tar.gz
tar xzf llvm-3.2.src.tar.gz
tar xzf clang-3.2.src.tar.gz
 
# Move clang into the LLVM source tree:
mv clang-3.2.src llvm-3.2.src/tools/clang
 
# normally you would use cmake here, but today you HAVE to use autotools.
cd llvm-3.2.src
 
# For this example, we are only going to enable only the x86 target.
# Building will take a while. Go make some coffee, take a nap, etc.
./configure --enable-optimized --enable-assertions --enable-targets=x86
make
 
# add the resulting binaries to your $PATH (needed for later building steps)
export PATH=`pwd`/Release+Asserts/bin:$PATH
 
# Make sure you are using the correct clang when you execute clang; you may
# have accidentally installed another clang that has priority in $PATH. Let's
# verify the version, for sanity. Your output should match what's below.
# 
#$ clang --version
#clang version 3.2 (tags/RELEASE_32/final)
#Target: i386-pc-linux-gnu
#Thread model: posix
 
# Once clang is built, it's time to build STP and uClibc for KLEE.
cd ~/klee
git clone https://github.com/stp/stp.git
 
# Use CMake to build STP. Compared to LLVM and clang,
# the build time of STP will feel like an instant.
cd stp
mkdir build && cd build
cmake -G 'Unix Makefiles' -DCMAKE_BUILD_TYPE=Release ..
make
 
# After STP builds, lets set ulimit for STP and KLEE:
ulimit -s unlimited
 
# Build uclibc for KLEE
cd ../..
git clone --depth 1 --branch klee_0_9_29 https://github.com/klee/klee-uclibc.git
cd klee-uclibc
./configure -l --enable-release
make
cd ..
 
# It’s time for KLEE itself. KLEE is updated fairly often and we are 
# building on an unsupported configuration. These instructions may not 
# work for future versions of KLEE. These examples were tested with 
# commit 10b800db2c0639399ca2bdc041959519c54f89e5.
git clone https://github.com/klee/klee.git
 
# Proper configuration of KLEE with LLVM 3.2 requires this long voodoo command
cd klee
./configure --with-stp=`pwd`/../stp/build \
  --with-uclibc=`pwd`/../klee-uclibc \
  --with-llvm=`pwd`/../llvm-3.2.src \
  --with-llvmcc=`pwd`/../llvm-3.2.src/Release+Asserts/bin/clang \
  --with-llvmcxx=`pwd`/../llvm-3.2.src/Release+Asserts/bin/clang++ \
  --enable-posix-runtime
make
 
# KLEE comes with a set of tests to ensure the build works. 
# Before running the tests, libstp must be in the library path.
# Change $LD_LIBRARY_PATH to ensure linking against libstp works. 
# A lot of text will scroll by with a test summary at the end.
# Note that your results may be slightly different since the KLEE 
# project may have added or modified tests. The vast majority of 
# tests should pass. A few tests fail, but we’re building KLEE on 
# an unsupported configuration so some failure is expected.
export LD_LIBRARY_PATH=`pwd`/../stp/build/lib
make check
 
#These are the expected results:
#Expected Passes : 141
#Expected Failures : 1
#Unsupported Tests : 1
#Unexpected Failures: 11
 
# KLEE also has a set of unit tests so run those too, just to be sure. 
# All of the unit tests should pass!
make unittests
 
# Now we are ready for the second part: 
# using mcsema with KLEE to symbolically execute existing binaries.
 
# First, we need to clone and build the latest version of mcsema, which
# includes support for linked ELF binaries and comes with the necessary
# samples to get started.
cd ~/klee
git clone https://github.com/trailofbits/mcsema.git
cd mcsema
git checkout v0.1.0
mkdir build && cd build
cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release ..
make
 
# Finally, make sure our environment is correct for future steps
export PATH=$PATH:~/klee/llvm-3.2.src/Release+Asserts/bin/
export PATH=$PATH:~/klee/klee/Release+Asserts/bin/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/klee/stp/build/lib/

Translating the Maze Binary

The latest version of mcsema includes the maze program from Felipe’s blog in the examples as demo_maze. In the instructions below, we’ll compile the maze oracle to a 32-bit ELF binary and then convert the binary to LLVM bitcode via mcsema.

# Note: tests/demo_maze.sh completes these steps automatically
cd ~/klee/mcsema/mc-sema/tests
# Load our environment variables
source env.sh
# Compile the demo to a 32-bit ELF executable
${CC} -ggdb -m32 -o demo_maze demo_maze.c
# Recover the CFG using mcsema's bin_descend
${BIN_DESCEND_PATH}/bin_descend -d -func-map=maze_map.txt -i=demo_maze -entry-symbol=main
# Convert the CFG into LLVM bitcode via mcsema's cfg_to_bc
${CFG_TO_BC_PATH}/cfg_to_bc -i demo_maze.cfg -driver=mcsema_main,main,raw,return,C -o demo_maze.bc
# Optimize the bitcode
${LLVM_PATH}/opt -O3 -o demo_maze_opt.bc demo_maze.bc

We will use the optimized bitcode (demo_maze_opt.bc) generated by this step as input to KLEE. Now that everything is set up, let’s get to the fun part — finding all maze solutions with KLEE.

# create a working directory next to the other KLEE examples.
cd ~/klee/klee/examples
mkdir maze
cd maze
# copy the bitcode generated by mcsema into the working directory
cp ~/klee/mcsema/mc-sema/tests/demo_maze_opt.bc ./
# copy the register context (needed to build a driver to run the bitcode)
cp ~/klee/mcsema/mc-sema/common/RegisterState.h ./

Now that we have the maze oracle binary in LLVM bitcode, we need to tell KLEE which inputs are symbolic and when a maze is solved. To do this we will create a small driver that intercepts the read() and exit() system calls, marks input to read() as symbolic, and asserts on exit(1), a successful maze solution.
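
The shape of the driver is sketched below with hypothetical interception names; the real driver, linked as a gist in the next paragraph, is the version to use.

#include <cassert>
#include <cstdlib>

// KLEE's intrinsic for marking memory as symbolic.
extern "C" void klee_make_symbolic(void *addr, unsigned long nbytes,
                                   const char *name);

// Hypothetical stand-in for the intercepted read(): instead of concrete
// bytes, every byte of the buffer becomes a symbolic variable.
extern "C" int maze_read(int fd, void *buf, unsigned long count) {
  klee_make_symbolic(buf, count, "maze_input");
  return (int)count;
}

// Hypothetical stand-in for the intercepted exit(): the maze program
// exits with status 1 on a win, which we turn into an assertion failure
// so KLEE records the solving input as a test case.
extern "C" void maze_exit(int status) {
  assert(status != 1 && "maze solved");
  std::exit(status);
}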

To make the driver, create a file named maze_driver.c with contents from this gist and use clang to compile it into bitcode. Every function in the driver is commented to help explain how it works.

clang -I../../include/ -emit-llvm -c -o maze_driver.bc maze_driver.c

We now have two bitcode files: the translation of the maze program, and a driver to start the program and mark inputs as symbolic. The two need to be combined into one bitcode file for use with KLEE, which can be done with llvm-link. There will be a compatibility warning, which is safe to ignore in this case.

llvm-link demo_maze_opt.bc maze_driver.bc > maze_klee.bc

Running KLEE

Once we have the combined bitcode, let’s do some symbolic execution. Lots of output will scroll by, but we can see KLEE solving the maze and trying every state of the program. If you recall from the driver, we can recognize successful states because they will trigger an assert in KLEE. There are four solutions to the original maze, so let’s see how many we have. There should be 4 results — a good sign (note: your test numbers may be different):

klee --emit-all-errors -libc=uclibc maze_klee.bc
# Lots of things will scroll by
ls klee-last/*assert*
# For me, the output is:
# klee-last/test000178.assert.err  klee-last/test000315.assert.err
# klee-last/test000270.assert.err  klee-last/test000376.assert.err

Now let’s use a quick bash script to look at the outputs and see if they match the original results. The solutions identified by KLEE from the mcsema bitcode are:

  • sddwddddsddw
  • ssssddddwwaawwddddsddw
  • sddwddddssssddwwww
  • ssssddddwwaawwddddssssddwwww

… and they match the results from Felipe’s original blog post!

Conclusion

Symbolic execution is a powerful tool that can execute programs on all inputs at once. Using mcsema and KLEE, we can symbolically execute existing closed source binary programs. In this example, we found all solutions to a maze with hidden walls — starting from an opaque binary. KLEE and mcsema could do this while knowing nothing about mazes and without being tuned for string inputs.

This example is simple, but it shows what is possible: using mcsema we can apply the power of KLEE to closed source binaries. We could generate high code coverage tests for closed source binaries, or find security vulnerabilities in arbitrary binary applications.

Note: We’re looking for talented systems engineers to work on mcsema and related projects (contract and full-time). If you’re interested in being paid to work on or with mcsema, send us an email!


How We Fared in the Cyber Grand Challenge


The Cyber Grand Challenge qualifying event was held on June 3rd, at exactly noon Eastern time. At that instant, our Cyber Reasoning System (CRS) was given 131 purposely built insecure programs. During the following 24 hour period, our CRS was able to identify vulnerabilities in 65 of those programs and rewrite 94 of them to eliminate bugs built into their code. This proves, without a doubt, that it is not only possible but practical to automate the actions of a talented software auditor.

Despite the success of our CRS at finding and patching vulnerabilities, we did not qualify for the final event, to be held next year. There was a fatal flaw that lowered our overall score to 9th, below the 7th place threshold for qualification. In this blog post we’ll discuss how our CRS works, how it performed against competitor systems, what doomed its score, and what we are going to do next.

Cyber Grand Challenge Background

The goal of the Cyber Grand Challenge (CGC) is to combine the speed and scale of automation with the reasoning capabilities of human experts. Multiple teams create Cyber Reasoning Systems (CRSs) that autonomously reason about arbitrary networked programs, prove the existence of flaws in those programs, and automatically formulate effective defenses against those flaws. How well these systems work is evaluated through head-to-head tournament-style competition.

The competition has two main events: the qualifying event and the final event. The qualifying event was held on June 3, 2015. The final event is set to take place during August 2016. Only the top 7 competitors from the qualifying event proceed to the final event.

During the qualifying event, each competitor was given the same 131 challenges, or purposely built vulnerable programs, each of which contained at least one intentional vulnerability. For 24 hours, the competing CRSs faced off against each other and were scored according to four criteria. The full details are in the CGC Rules, but here’s a quick summary:

  • The CRS had to work without human intervention. Any teams found to use human assistance were disqualified.
  • The CRS had to patch bugs in challenges. Points were gained for every bug successfully patched. Challenges with no patched bugs received zero points.
  • The CRS could prove bugs exist in challenges. The points from patched challenges were doubled if the CRS could generate an input that crashed the challenge.
  • The patched challenges had to function and perform almost as well as the originals. Points were lost based on performance and functionality loss in the patched challenges.

A spreadsheet with all the qualifying event scores and other data used to make the graphs in this post is available from DARPA (Trail of Bits is the ninth place team). With the scoring in mind, let’s review the Trail of Bits CRS architecture and the design decisions we made.

Preparation

We’re a small company with a distributed workforce, so we couldn’t physically host a lot of servers. Naturally, we went with cloud computing to do processing; specifically, Amazon EC2. Those who saw our tweets know we used a lot of EC2 time. Most of that usage was purely out of caution.

We didn’t know how many challenges would be in the qualifying event — just that it would be “more than 100.” We prepared for a thousand, with each accompanied by multi-gigabyte network traffic captures. We were also terrified of an EC2 region-wide failure, so we provisioned three different CRS instances, one in each US-based EC2 region, affectionately named Biggie (us-east-1), Tupac (us-west-2), and Dre (us-west-1).

It turns out that there were only 131 challenges and no gigantic network captures in the qualifying event. During the qualifying event, all EC2 regions worked normally. We could have comfortably done the qualifying event with 17 c4.8xlarge EC2 instances, but instead we used 297. Out of our abundance of caution, we over-provisioned by a factor of ~17x.

Bug Finding

The Trail of Bits CRS was ranked second by the number of verified bugs found (Figure 1). This result is impressive considering that we started with nothing while several other teams already had existing bug finding systems prior to CGC.

Figure 1: Teams in the qualifying event ranked by number of bugs found. Orange bars signify finalists.

Our CRS used a multi-pronged strategy to find bugs (Figure 2). First, there was fuzzing. Our fuzzer is implemented with a custom dynamic binary translator (DBT) capable of running several 32-bit challenges in a single 64-bit address space. This is ideal for challenges that feature multiple binaries communicating with one another. The fuzzer’s instrumentation and mutation are separated, allowing for pluggable mutation strategies. The DBT framework can also snapshot binaries at any point during execution. This greatly improves fuzzing speed, since it’s possible to avoid replaying previous inputs when exploring new input space.

Figure 2: Our bug finding architecture. It is a feedback-based architecture that explores the state space of a program using fuzzing and symbolic execution.

In addition to fuzzing, we had not one but two symbolic execution engines. The first operated on the original unmodified binaries, and the second operated on the translated LLVM from mcsema. Each symbolic execution engine had its own strengths, and both contributed to bug finding.

The fuzzer and symbolic execution engines operate in a feedback loop mediated by a system we call MinSet. MinSet uses branch coverage to maintain a minimal set of inputs with maximal coverage. The inputs come from any source capable of generating them: PCAPs, fuzzing, symbolic execution, etc. Every tool draws its seed inputs from MinSet and feeds any newly generated inputs back into it. This feedback loop lets us explore the program’s input space with fuzzers and symbolic execution in parallel. In practice this is very effective. We log the provenance of our crashes, and most of them look something like:

Network Capture ⇒ Fuzzer ⇒ SymEx1 ⇒ Fuzzer ⇒ Crash
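In miniature, the coverage-minimization idea behind MinSet looks something like the sketch below. This is an illustration, not the CRS code; it assumes a coverage_of() callback that returns the set of branch identifiers an input reaches, and the real MinSet is considerably more sophisticated.

def update_minset(minset, new_input, coverage_of):
    # Union of the branches covered by every input kept so far.
    covered = set()
    for kept in minset:
        covered |= coverage_of(kept)
    # Keep the new input only if it reaches a branch nothing else reaches.
    if coverage_of(new_input) - covered:
        minset.append(new_input)
    return minset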

Some bugs can only be triggered when the input replays a previous nonce, which would be different on every execution of the challenge. Our bug finding system can produce inputs that contain variables based on program outputs, enabling our CRS to handle such cases.

Additionally, our symbolic executors are able to identify which inputs affect program state at the point of a crash. This is a key requirement for success in the final, as it enables the CRS to create a more controlled crash.

Patching

Our CRS’s patching effectiveness, as measured by the security score, ranked fourth (Figure 3).

Figure 3: Teams in the qualifying event ranked by patch effectiveness (security score). Orange bars signify finalists.

Our CRS patches bugs by translating challenges into LLVM bitcode with mcsema. Patches are applied to the LLVM bitcode, optimized, and then converted back into executable code. The actual patching works by gracefully terminating the challenge when invalid memory accesses are detected. Patching the LLVM bitcode representation of challenges provides us with enormous power and flexibility:

  • We can easily validate any memory access and keep track of all memory allocations.
  • Complex algorithms, such as dataflow tracking, dominator trees, dead store elimination, loop detection, etc., are very simple to implement using the LLVM compiler infrastructure.
  • Our patching method can be used on real-world software, not just CGC challenges.

We created two main patching strategies: generic patching and bug-based patching. Generic patching is an exclusion-based strategy: it first assumes that every memory access must be verified, and then excludes accesses that are provably safe. The benefit of generic patching is that it patches all possible invalid memory accesses in a challenge. Bug-based patching is an inclusion-based strategy: it first assumes only one memory access (where the CRS found a bug) must be verified, and then includes nearby accesses that may be unsafe. Each patching strategy has multiple heuristics to determine which accesses should be included or excluded from verification.

The inclusion and exclusion heuristics generate patched challenges with different security/performance tradeoffs. The patched challenges generated by these heuristics were tested for performance and security to determine which heuristic performed best while still fixing the bug. For the qualifying event, we evaluated both generic and bug-based patching, but ultimately chose a generic-only patching strategy. Bug-based patching was slightly more performant, but generic patching was more comprehensive: it patched bugs that our CRS couldn’t find.
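The two strategies can be pictured as set computations over the memory accesses in a challenge. The sketch below is purely illustrative; the names are invented, and the real heuristics operate on LLVM bitcode rather than Python sets.

def generic_patch_set(all_accesses, provably_safe):
    # Exclusion: verify every access except those analysis proves safe.
    return set(all_accesses) - set(provably_safe)

def bug_based_patch_set(buggy_access, nearby_unsafe):
    # Inclusion: verify the known-bad access plus suspicious neighbors.
    return {buggy_access} | set(nearby_unsafe)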

Functionality and Performance

Functionality and performance scores combine to create an availability score. The availability score is used as a scaling factor for points gained by patching and bug finding. This scaling factor only matters for successfully patched challenges, since those are the only challenges that can score points. The following graphs only consider functionality and performance of successfully patched challenges.
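As a rough illustration of that interaction, consider the simplified model below. The constants and structure are ours, not the official CGC formula; it only captures the rules summarized earlier (patch points, doubling for a proof of vulnerability, scaling by availability).

def challenge_score(patched, proved_bug, availability):
    # availability is in [0, 1]; an unpatched challenge scores nothing.
    if not patched:
        return 0.0
    base = 1.0       # points for a successfully patched challenge
    if proved_bug:
        base *= 2.0  # doubled if the CRS generated a crashing input
    return base * availability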

Functionality

Out of the 94 challenges that our CRS successfully patched, 56 retained full functionality, 30 retained partial functionality, and 8 were nonfunctional. Of the top 10 teams in the qualifying event, our CRS ranks 5th in terms of fully functional patched challenges (Figure 4). We suspect our patched challenges lost functionality due to problems in mcsema, our x86 to LLVM translator. We hope to verify and address these issues once DARPA open-sources the qualifying event challenges.

Figure 4: The count of perfectly functional, partially functional, and nonfunctional challenges submitted by each of the top 10 teams in the qualifying event. Orange bars signify finalists.

Performance

The performance of patched challenges is how our CRS snatched defeat from the jaws of victory. Of the top ten teams in the qualifying event, our CRS placed last in terms of patched challenge performance (Figure 5).

Figure 5: Average and median performance scores of the top ten qualifying event participants. Orange bars signify finalists.

Our CRS produces slow binaries for two reasons, one technical and one operational. The technical reason is that our patching process translates challenges into LLVM bitcode and re-emits them as executable binaries, which carries inherent overhead. The operational reason is that our patching was developed late and optimized for the wrong performance measurements.

So, why did we optimize for the wrong performance measurements? The official CGC performance measurement tools were kept secret, because the organizers wanted to ensure that no one could cheat by gaming the performance measurements. Therefore, we had to measure performance ourselves, and our metrics showed that the CPU overhead of our patched challenges was usually negligible. The main flaw we observed was that our patched challenges used too much memory. Because of this, we spent time and effort optimizing our patching to use less memory at the cost of more CPU time.

It turns out we optimized for the wrong thing, because our self-measurements did not agree with the official measurement tools (Table 1). When self-measuring, our worst-performing patching method had a median CPU overhead of 33% and a median memory overhead of 69%. The official qualifying event measured us at 76% CPU overhead and 28% memory overhead. Clearly, our self-measurements diverged considerably from the official ones.

Measurement                            Median CPU Overhead    Median Memory Overhead
Worst Self-Measured Patching Method    33%                    69%
Official Qualifying Event              76%                    28%

Table 1: Self-measured CPU and memory overhead versus the official qualifying event measurements.

We also estimated our overall score using our own performance metrics. The self-measured score of our CRS was 106, which would have put us in second place. The real overall score was 21.36, putting us in ninth.

An important aspect of software development is choosing where to focus your efforts, and we chose poorly. CGC participants had access to the official measuring system during two scored events held during the year, one in December 2014 and one in April 2015. We should have evaluated our patching system thoroughly during both scored events. Unfortunately, our patching wasn’t fully operational until after the second scored event, so we had no way to verify the accuracy of our self-measurements. The performance penalty of our patching isn’t a fundamental issue; had we known how bad it was, we would have fixed it. According to our own measurements, however, the patching was acceptable, so we focused our efforts elsewhere.

What’s Next?

According to the CGC FAQ (Question 46), teams are allowed to combine after the qualifying event. We hope to join forces with another team that qualified for the CGC final event, and use the best of both our technologies to win. The technology behind our CRS will provide a significant advantage to any team that partners with us. If you would like to discuss a potential partnership for the CGC final, please contact us at cgc@trailofbits.com.

If we cannot find a partner for the CGC final, we will focus our efforts on adapting our CRS to automatically find and patch vulnerabilities in real software. Our system is up to the task: it has already proven that it can find bugs, and all of its core components were derived from software that works on real Linux binaries. Several components even have Windows and 64-bit support, and adding support for other platforms is a possibility. If you are interested in commercial applications of our technology, please get in touch with us at cgc@trailofbits.com.

Finally, we plan to contribute back fixes and updates to the open source projects utilized in our CRS. We used numerous open source projects during development, and have made several custom fixes and modifications. We look forward to contributing these back to the community so that everyone benefits from our improvements.



Hacking for Charity: Automated Bug-finding in LibOTR


At the end of last year, we had some free time to explore new and interesting uses of the automated bug-finding technology we developed for the DARPA Cyber Grand Challenge. While the rest of the competitors are quietly preparing for the CGC Final Event, we can entertain you with tales of running our bug-finding tools against real Linux applications.

Like many good stories, this one starts with a bet:

image01

On November 4, 2014, Thomas Ptacek (of Starfighter) bet Matthew Green (of Johns Hopkins) that libotr, a popular library used in secure messaging software, would have a high severity (e.g. remote code execution, information disclosure) bug in the next 12 months. Here at Trail of Bits, we like a good wager, especially when the proceeds go to charity. And we just happened to have an automated bug-finding system lying around, itching for something to do. The temptation was too much to resist: we decided to use our automated bug-finding system from the Cyber Grand Challenge to look for bugs in libotr.

Before we go on, we should state that this was not a security audit. We simply wanted to test how well our automated bug-finding system works on real Linux software and maybe win some money for charity.

We successfully enhanced our bug-finding system to support the libotr library and tested it extensively. Our system confirmed that there were no critical bugs in code paths that we tested; since no one else reported any bugs, the bet ended with Matthew Green donating $1000 to Partners in Health.

Read on to discover the challenges encrypted communications systems present for automated testing, how we solved them, and our testing methodology. Of course, just because our system didn’t find bugs in libotr does not mean that libotr is bug-free.

Background

The automated bug-finding system, known as a Cyber Reasoning System (CRS), that we built for the Cyber Grand Challenge operates on binary code for the DECREE operating system. While DECREE is based on Linux, it differs considerably from plain Linux. DECREE has no signals, no shared memory, no threads, no sockets, no files, and only seven system calls. This means that DECREE is not binary or source compatible with Linux libraries like libotr.

After weighing our options, we decided the easiest and fastest way to test libotr was to port it to DECREE, instead of adding full Linux support to our CRS. We attempted the port in a generic manner, to ensure we could use the lessons learned to test future Linux software.

To port libotr, we had to solve two major issues: shared library dependencies (libotr depends on libgpgerror and libgcrypt) and libc support. We used LLVM to solve both problems at once. First, we used whole-program-llvm to compile libotr and all dependencies to LLVM bitcode. We then merged all the shared libraries at the bitcode level, and aggressively optimized the resulting bitcode. In one move, we eliminated the need for shared libraries, and drastically reduced the amount of libc we’d have to implement, because unused libc calls were optimized out of the resulting bitcode. To build a libc that works on DECREE, we combined libc implementations from the challenge binaries, stubbed functions that don’t make sense in DECREE, and created new implementations based on DECREE calls where appropriate.
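For the curious, the bitcode-merging step looks roughly like the sketch below, using the open-source whole-program-llvm wrapper. The directory and target names are placeholders, and our actual build differed in its details.

import os
import subprocess

# Build everything with the wllvm compiler wrapper so bitcode is embedded.
env = dict(os.environ, CC='wllvm', LLVM_COMPILER='clang')
subprocess.check_call(['make'], cwd='testapp', env=env)

# Recover whole-program bitcode from the statically linked test binary.
subprocess.check_call(['extract-bc', 'testapp/testapp'])

# Optimize aggressively; unused library and libc code falls away here.
subprocess.check_call(['opt', '-O3', 'testapp/testapp.bc',
                       '-o', 'testapp/testapp.opt.bc'])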

Automated Testing

Encrypted communications applications are, by design, difficult to automatically audit. This makes perfect sense: if an automated system could reason about how ciphertext relates to plaintext, the encrypted communication system would already be broken. These systems are also difficult to audit by random testing (e.g. fuzzing), because recipients verify the integrity of every message. Typically, when testing encrypted systems, the encryption is turned off, or data is manipulated prior to encryption or after decryption. We wanted to simulate testing a black-box binary, so we did not modify libotr in any way. Instead, we decided the best path was to make our CRS simulate a man-in-the-middle (MITM) attack. Because we tested an unmodified libotr, our CRS could not effectively attack code past message integrity checks. However, there was still plenty of attack surface: message control data, headers, and the possibility of flaws in decryption/authentication code. The problem was that our CRS was not designed to act as a MITM, so we instead architected the test application (not libotr) to be easier to attack, which resulted in the convoluted architecture below.

image02

The CRS acts as a man-in-the-middle between two applications communicating using libotr.

Creating the test application was more difficult than porting libotr to DECREE. The porting process was fairly straightforward and took about two weeks. The sample application took a bit longer, and was a much more frustrating experience: the official libotr distribution has no sample code, and the documentation leaves a lot to be desired.

Our testing was limited by the features of libotr exercised by our sample application (for instance, it doesn’t use SMP), and by the unusual test application we created. Additionally, some vulnerabilities may only occur after decryption, and modification of encrypted and authenticated data will never trigger these bugs.

Results

The results of testing libotr are very encouraging. We ran 48 Xeon CPUs for 24 hours against our libotr sample application, and did not identify any memory safety violations.

image00

This negative result does not mean that libotr is bug free. We only tested a subset of libotr, and there are considerable parts that our CRS never audited. The lack of obvious bugs is, however, a very good sign.

Conclusion

The timeframe of the libotr bet has expired without any reported high severity vulnerabilities. We audited parts of libotr with our automated bug-finding tools, and also didn’t find memory corruption vulnerabilities. In the process of setting up this test, we learned how to port Linux applications to DECREE and verified that our CRS can identify real bugs in Linux programs. Better documentation, tests, and sample applications that exercise every libotr feature would simplify both automated and manual auditing. For this experiment we constrained ourselves to an unmodified libotr. We are planning a future test where we modify libotr to enable easier automated testing.


Join us at Etsy’s Code as Craft


We’re excited to announce that Sophia D’Antoine will be the next featured speaker at Etsy’s Code as Craft series on Wednesday, February 10th from 6:30-8pm in NYC.

What is Code as Craft?

Etsy Code as Craft events are a semi-monthly series of guest speakers who explore a technical topic or computing trend, sharing both conceptual ideas and practical advice. All talks will take place at the Etsy Labs on the 7th floor at 55 Washington Street in beautiful Brooklyn (Suite 712). Come see an awesome speaker and take a whirl in our custom photo booth. We hope to see you at an upcoming event!

In her talk, Sophia will discuss the latest in iOS security and the cross-section between this topic and compiler theory. She will discuss one of our ongoing projects, MAST, a mobile application security toolkit for iOS, which we discussed on this blog last year. Since then, we’ve continued to work on it, added new features, and transitioned it from a proof-of-concept DARPA project to a full-fledged mobile app protection suite.

What’s the talk about?

iOS applications have become increasingly popular targets for hackers, reverse engineers, and software pirates. In this presentation, we discuss the current state of iOS attacks, review available security APIs, and reveal why they are not enough to defend against known threats. For high-risk applications, novel protections that go beyond those offered by Apple are required. As a solution, we discuss the design of the Mobile Application Security Toolkit (MAST), which ties together jailbreak detection, anti-debugging, and anti-reversing in LLVM to address these risks.

We hope to see you there. If you’re interested in attending, follow this link to register. MAST is still a beta product, so if you’re interested in using it on your own iOS applications after seeing this talk, contact us directly.


The Problem with Dynamic Program Analysis


Developers have access to tools like AddressSanitizer and Valgrind that will tell them when the code that they’re running accesses uninitialized memory, leaks memory, or uses memory after it’s been freed. Despite the availability of these excellent tools, memory bugs still persist, still get shipped to users, and still get exploited in the wild.

Most of today’s bug-finding tools are dynamic: they identify bugs in programs while those programs are running. This is great because all programs have massive test suites that exercise every line of code… right? Wrong. Large test suites are the exception, not the rule. Test suites definitely help find and reduce bugs, but bugs still get through.

Perhaps the solution is to pay to have your code audited by professionals. More eyes on your code is a good thing™, but the underlying issue remains. Analyses run inside the heads of experts are still “dynamic”: thinking through every code path is just not tractable.

So dynamic analyses can miss bugs because they can’t check every possible program path. What can check every possible program path?

Finding use-after-frees in millions of lines of code

We use static analysis to analyze millions of lines of code, without ever running the code. The analysis technique, called data-flow tracking, enables us to analyze and summarize properties about every possible program path. This solves the aforementioned problem of missing bugs that occur when certain program paths are not exercised.

How does an analysis that sees everything actually work? Below we describe the 1-2-3 of an actual whole-program static analysis tool that we developed and regularly use. The tool, PointsTo, finds and reports on potential use-after-free bugs in large codebases.

Step 1: Convert to LLVM bitcode

PointsTo operates on the LLVM bitcode representation of a program. We chose LLVM bitcode because it is a convenient intermediate representation for performing program analyses. Unsurprisingly, the first stage of our analysis pipeline converts a program’s source code into an LLVM bitcode database. We use an internal tool named CompInfo to produce these databases. An alternative, open-source tool for doing something similar is whole-program-llvm.

image04

Step 2: Create the data-flow graph

The key idea behind PointsTo is to analyze how pointers to allocated objects flow through the program. What we care about are assignments to and copies of pointers, pointer dereferences, and frees of pointers. These operations on pointers are represented using a data-flow graph.

Four steps to creating a data-flow graph

The most interesting step in the process is the why and how of transforming allocations and frees into special assignments. The “why” is that this transformation lets us repurpose an existing program analysis to find paths from FREE definitions to pointer dereferences. The “how” is more subtle: how does PointsTo know that it should change “new A” into an ALLOC and “delete a” into a FREE?

Imagine a hypothetical embedded system where programs are starved for memory and so the natural choice is to use a custom memory allocator called ration_memory. We created a Python modelling language to feed PointsTo information about higher-level function behaviors. Our modelling scripts tell PointsTo that “new A” returns a new object, and so we can use it to say the same thing about ration_memory.
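A model script in the spirit of that language might look like the sketch below. Every name here is invented for illustration; the real PointsTo modelling language is internal and its syntax differs.

def apply_models(model):
    # "new A" and the hypothetical custom allocator both return fresh
    # objects, which PointsTo rewrites into ALLOC definitions.
    model.returns_new_object('operator new')
    model.returns_new_object('ration_memory')
    # "delete a" consumes its argument, becoming a FREE definition.
    model.frees_argument('operator delete', arg_index=0)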

Segue: Hidden data flows

The transformation from source code into a data flow graph looked pretty simple, but that was because the source code we started with was simple. It had no function calls, and more importantly, it had no function pointers or method calls! What happens if callback below is a function pointer? What happens if callback frees x?

int *x = malloc(4);  /* x points to a fresh heap allocation */
callback(x);         /* indirect call: does it free x? PointsTo must find out */
*x += 1;             /* a use-after-free if callback freed x */

This is the secret sauce and namesake of PointsTo: we perform a context- and path-sensitive pointer analysis that tells us which function pointers point to which functions and when. Altogether, we can produce an error report that follows x through callback and back again.

Step 3: Dénouement

It’s time to report potential errors for expert analysis. PointsTo searches through the data-flow graph, looking for flows from assignments to FREE down to dereferences. These flows are converted into a program slice of the source code lines, showing the path that execution needs to follow in order to produce a use-after-free. Here’s an example program slice of a real bug:

LightHTTPD Use-After-Free

When describing this system to compiler folks, the usual first question is: but what about false-positives? What if we get a report about a use-after-free and it isn’t one? Here is where the priorities of program analysis for compilers and for vulnerabilities diverge.

False-positives in a compiler analysis can introduce bugs, and so compilers are usually conservative. That is, they accept false-negatives in order to avoid false-positives. They might miss some optimization opportunities because they can’t prove something, but at least the program will be compiled correctly *cough*.

For vulnerability analysis, this is a bad trade. False-positives in a vulnerability analysis are inconvenient, but they’re a drop in the ocean when millions of lines of code need to be looked at. False-negatives, however, are unacceptable. A false-negative is a bug that is missed and might make it to production. A tool that always finds the bug and sometimes warns you about sketchy but correct code is an investment that saves time and money during code audits.

In summary, we conclude

Analyzing programs for bugs is hard. Industry best-practices like having extensive test suites should be followed. Developers should regularly run their programs through dynamic analysis tools to pick the low-hanging fruit. But more importantly, developers should understand that test suites and dynamic analyses are not a panacea. Bugs have a nasty habit of hiding behind rarely executed code paths. That’s why all paths need to be looked at. That’s why we made PointsTo.

PointsTo was a topic of discussion at a recent Empire Hacking, a bi-monthly meetup in NYC. The talk I gave there includes more information about the design and implementation of PointsTo and, for curious readers, the slides and video are reproduced below. We hope to release more videos from Empire Hacking in the future.

PointsTo was originally produced for Cyber Fast Track and we would like to thank DARPA for funding our work. Consultants at Trail of Bits use PointsTo and other internal tools for application security reviews. Contact us if you’re interested in a detailed audit of your code supported by tools like PointsTo and our CRS.

 


2000 cuts with Binary Ninja


Using Vector35’s Binary Ninja, a promising new interactive static analysis and reverse engineering platform, I wrote a script that generated “exploits” for 2,000 unique binaries in this year’s DEFCON CTF qualifying round.

If you’re wondering how to remain competitive in a post-DARPA DEFCON CTF, I highly recommend you take a look at Binary Ninja.

Before I share how I slashed through the three challenges — 334 cuts, 666 cuts, and 1,000 cuts — I have to acknowledge the tool that made my work possible.

Compared to my experience with IDA, which is held together with duct tape and prayers, Binary Ninja’s workflow is a pleasure. It does analysis on its own intermediate language (IL), which is exposed through Python and C++ APIs. It’s comparatively simple to query blocks of code, functions, trace execution flow, query register states, and many other tasks that seem herculean within IDA.

This brought a welcome distraction from the slew of stack-based buffer overflows and unhardened heap exploitation that have come to characterize DEFCON’s CTF.

Since the original point of CTF competitions was to help people improve, I limited my options to what most participants could use. Without Binary Ninja, I would have had to:

  1. Use IDA and IDAPython; a more expensive and unpleasant proposition.
  2. Develop a Cyber Reasoning System; an unrealistic option for most participants.
  3. Reverse the binaries by hand; effectively impossible given the number of binaries.

None of these are nearly as attractive as Binary Ninja.

How Binary Ninja accelerates CTF work

This year’s qualifying challenges were heavily focused on preparing competitors for the Cyber Grand Challenge (CGC). A full third of the challenges were DECREE-based. Several required CGC-style “Proof of Vulnerability” exploits. This year the finals will be based on DECREE, so the winning CGC robot can ‘play’ against the human competitors. For the first time in its history, DEFCON CTF is abandoning the attack-defense model.

Challenge #1 : 334 cuts

334 cuts
http://download.quals.shallweplayaga.me/22ffeb97cf4f6ddb1802bf64c03e2aab/334_cuts.tar.bz2
334_cuts_22ffeb97cf4f6ddb1802bf64c03e2aab.quals.shallweplayaga.me:10334

The first challenge, 334 cuts, didn’t offer much in terms of direction. I started by connecting to the challenge service:

$ nc 334_cuts_22ffeb97cf4f6ddb1802bf64c03e2aab.quals.shallweplayaga.me 10334
send your crash string as base64, followed by a newline
easy-prasky-with-buffalo-on-bing

Okay, so it wants us to crash the service. No problem; I already had a crashing input string for that service from a previous challenge.

$ nc 334_cuts_22ffeb97cf4f6ddb1802bf64c03e2aab.quals.shallweplayaga.me 10334
send your crash string as base64, followed by a newline
easy-prasky-with-buffalo-on-bing
YWFhYWFhYWFhYWFhYWFhYWFhYWFsZGR3YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYQo=
easy-biroldo-with-mayonnaise-on-multigrain

I wasn’t expecting a second challenge name after the first. I’m guessing I’m going to need to crash a few services now. Next I extracted the tarball.

$ tar jxf 334_cuts.tar.bz2
$ ls 334_cuts
334_cuts/easy-alheira-with-bagoong-on-ngome*
334_cuts/easy-cumberland-with-gribiche-on-focaccia*
334_cuts/easy-kielbasa-with-skhug-on-scone*
334_cuts/easy-mustamakkara-with-pickapeppa-on-soda*
334_cuts/easy-alheira-with-garum-on-pikelet*
334_cuts/easy-cumberland-with-khrenovina-on-white*
334_cuts/easy-krakowska-with-franks-on-pikelet*
334_cuts/easy-mustamakkara-with-shottsuru-on-naan*
...
$ ls 334_cuts | wc -l
334

Hmm, there are 334 DECREE challenge binaries, all with food-related names. Well, time to throw them into Binja, starting with easy-biroldo-with-mayonnaise-on-multigrain. DECREE challenge binaries are secretly ELF binaries (as used on Linux and FreeBSD), so they load just fine with Binja’s ELF loader.

binarynina-overview

Binary Ninja has a simple and smooth interface

This challenge binary is fairly simple and nearly identical to easy-prasky-with-buffalo-on-bing. Each challenge binary is stripped of symbols, has a static stack buffer, a canary, and a stack-based buffer overflow. The canary is copied to the stack and checked against a hard-coded value. If the canary is overwritten with the wrong value, the challenge terminates cleanly and does not crash, so any overflow has to overwrite the canary with the expected value. It turns out all 334 challenges differ in only four ways:

  1. The size of the buffer you overflow
  2. The canary string and its length
  3. The size of the stack buffer in the recvmsg function
  4. The amount of data the writemsg function processes for each iteration of its write loop

Our crashing string has to both exactly overflow the stack buffer and pass the canary check in each of the 334 binaries. It’s best to automate collecting this information; thankfully, Binja can be used as a headless analysis engine from Python!

We start by importing binja into our python script and creating a binary view. The binary view is our main interface to Binja’s analysis.
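In the headless API, that setup is only a couple of lines. The path below is a placeholder, and this is the gist rather than the exact CTF script:

import binaryninja

bv = binaryninja.BinaryViewType.get_view_of_file('easy-prasky-with-buffalo-on-bing')
bv.update_analysis_and_wait()  # let analysis finish before querying the IL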

I was initially trying to create a generic solution without looking at the majority of the challenge binaries, so I found the main function programmatically. I did that by starting at the entry point and knowing that it made two calls.

Starting from the entry point, the second of its two calls led toward main; similarly, the next function made a single call, which I followed to reach main. All my analysis used Binja’s LowLevelIL, as sketched below.
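Here is a sketch of that walk. It assumes direct calls whose targets are LLIL constants, and it mirrors the approach described above rather than reproducing my original script verbatim:

from binaryninja import LowLevelILOperation

def find_main(bv):
    # Collect the LLIL call instructions in the entry function.
    entry = bv.get_function_at(bv.entry_point)
    calls = [il for bb in entry.low_level_il for il in bb
             if il.operation == LowLevelILOperation.LLIL_CALL]
    # The second call leads to a function whose single call reaches main.
    helper = bv.get_function_at(calls[1].dest.constant)
    calls = [il for bb in helper.low_level_il for il in bb
             if il.operation == LowLevelILOperation.LLIL_CALL]
    return bv.get_function_at(calls[0].dest.constant)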

Once we have our reference to main, the real fun begins.

binaryninja-ilview

Binary Ninja in LowLevelIL mode

The first thing we needed to figure out was the canary string. The approach I took was to collect references to all of the call instructions in main.

Then I knew that the first call was to memcpy, the second to recvmsg, and the third to the canary memcmp. Small hiccup here: sometimes the compiler would inline the memcpy. This happened when the canary string was less than 16 bytes long.

binaryninja-inlinememcpy

This Challenge Binary has an inline memcpy.😦

This was a simple fix: I counted the number of calls in the function and adjusted my offsets accordingly.

To extract the canary and the size of the canary buffer, I used the newly introduced get_parameter_at() function. This function is fantastic: at any call site, it allows you to query the function parameters with respect to calling convention and system architecture. I used it to query all the parameters of the call to memcmp, as in the sketch below.
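The query itself looks something like this sketch, assuming call_addr is the address of the memcmp call found earlier and func is the function containing it; which argument holds the constant canary depends on the binary:

s1 = func.get_parameter_at(call_addr, None, 0)  # pointer to the stack copy
s2 = func.get_parameter_at(call_addr, None, 1)  # pointer to the canary string
n  = func.get_parameter_at(call_addr, None, 2)  # length passed to memcmp
canary = bv.read(s2.value, n.value)             # read the canary bytes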

Next, I needed to know how big the buffer to overflow is. To do this, I once again used get_parameter_at() to query the first argument of the read_buf call. This points to the stack buffer we’ll overflow, and we can calculate its size by subtracting its offset from that of the canary’s stack buffer.

It turns out the other two variables were inconsequential. These two bits of information were all we needed to craft our crashing string.

I glued all this logic together and threw it at the 334 challenge. It prompted me for 10 crashing strings before giving me the flag: baby's first crs cirvyudta.

Challenge #2: 666 cuts

666 cuts
http://download.quals.shallweplayaga.me/e38431570c1b4b397fa1026bb71a0576/666_cuts.tar.bz2
666_cuts_e38431570c1b4b397fa1026bb71a0576.quals.shallweplayaga.me:10666

To start, I once again connected with netcat:

$ nc 666_cuts_e38431570c1b4b397fa1026bb71a0576.quals.shallweplayaga.me 10666
send your crash string as base64, followed by a newline
medium-chorizo-with-chutney-on-bammy

I’m expecting 666 challenge binaries.

$ tar jxf 666_cuts.tar.bz2
$ ls 666_cuts
666_cuts/medium-alheira-with-khrenovina-on-marraqueta*
666_cuts/medium-cumberland-with-hollandaise-on-bannock*
666_cuts/medium-krakowska-with-doubanjiang-on-pita*
666_cuts/medium-newmarket-with-pickapeppa-on-cholermus*
…
$ ls 666_cuts | wc -l
666

Same game as before, I throw a random binary into binja and it’s nearly identical to the set from 334. At this point I wonder if the same script will work for this challenge. I modify it to connect to the new service and run it. The new service provides 10 challenge binary names to crash and my script provides 10 crashing strings, before printing the flag: you think this is the real quaid DeifCokIj.

Challenge #3: 1000 cuts

1000 cuts
http://download.quals.shallweplayaga.me/1bf4f5b0948106ad8102b7cb141182a2/1000_cuts.tar.bz2
1000_cuts_1bf4f5b0948106ad8102b7cb141182a2.quals.shallweplayaga.me:11000

You get the idea, 1000 challenges, same script, flag is: do you want a thousand bandages gruanfir3.

Room For Improvement

Binary Ninja shows a lot of promise, but it still has a ways to go. In future versions I hope to see the addition of SSA and a flexible type system. Once SSA is added to Binary Ninja, it will be easier to identify data flows through the application, tell when types change, and determine when stack slots are reused. It’s also a foundational feature that helps build a decompiler.

Conclusion

From its silky smooth graph view to its intermediate language to its smart integration with Python, Binary Ninja provides a fantastic interface for static binary analysis. With minimal effort, it allowed me to extract data from 2000 binaries quickly and easily.

That’s the bigger story here: It’s possible to enhance our capabilities and combine mechanical efficiency with human intuition. In fact, I’d say it’s preferable. We’re not going to become more secure if we rely on machines entirely. Instead, we should focus on building tools that make us more effective; tools like Binary Ninja.

If you agree, give Binary Ninja a chance. In less than a year of development, it’s already punching above its weight class. Expect more fanboyism from myself and the rest of Trail of Bits — especially as Binary Ninja continues to improve.

My (slightly updated) script is available here. For the sake of history, the original is available here.

Binary Ninja is currently in a private beta and has a public Slack.


Your tool works better than mine? Prove it.


No doubt, DARPA’s Cyber Grand Challenge (CGC) will go down in history for advancing the state of the art in a variety of fields: symbolic execution, binary translation, and dynamic instrumentation, to name a few. But there is one contribution that we believe has been overlooked so far, and that may prove to be the most useful of them all: the dataset of challenge binaries.

Until now, if you wanted to ‘play along at home,’ you would have had to install DECREE, a custom Linux-derived operating system that has no signals, no shared memory, no threads, and only seven system calls. Sound like a hassle? We thought so.

One metric for all tools

Competitors in the Cyber Grand Challenge identify vulnerabilities in challenge binaries (CBs) written for DECREE on the 32-bit Intel x86 architecture. Since 2014, DARPA has released the source code for over 100 of these vulnerable programs. These programs were specifically designed with vulnerabilities that represent a wide variety of software flaws. They are more than simple test cases, they approximate real software with enough complexity to stress both manual and automated vulnerability discovery.

If the CBs become widely adopted as benchmarks, they could change the way we solve security problems. This mirrors the rapid evolution of the SAT and ML communities once standardized benchmarks and regular competitions were established. The challenge binaries, valid test inputs, and sample vulnerabilities create an industry standard benchmark suite for evaluating:

  • Bug-finding tools
  • Program-analysis tools (e.g. automated test coverage generation, value range analysis)
  • Patching strategies
  • Exploit mitigations

The CBs are a more robust set of tests than previous approaches to measuring the quality of software analysis tools (e.g. SAMATE tests, NSA Juliet tests, or the STONESOUP test cases). First, the CBs are complex programs like games, content management systems, and image processors, instead of just snippets of vulnerable code; to be effective, analysis tools must process real software with a fairly low bug density, not isolated vulnerable snippets. Second, unlike open source projects with added bugs, we have very high confidence that all the bugs in the CBs have been found, so analysis tools can be compared against an objective standard. Finally, the CBs also come with extensive functionality tests, triggers for the introduced bugs, patches, and performance monitoring tools, enabling benchmarking of patching tools and bug mitigation strategies.

Creating an industry standard benchmarking set will solve several problems that hamper development of future program analysis tools:

First, the absence of standardized benchmarks prevents an objective determination of which tools are “best.” Real applications don’t come with triggers for complex bugs, nor an exhaustive list of those bugs. The CBs provide metrics for comparison, such as:

  • Number of bugs found
  • Number of bugs found per unit of time or memory
  • Categories of bugs found and missed
  • Variances in performance from configuration options

Next, which mitigations are most effective? CBs come with inputs that stress original program functionality, inputs that check for the presence of known bugs, and performance measuring tools. These allow us to explore questions like:

  • What is the potential effectiveness and performance impact of various bug mitigation strategies (e.g. Control Flow Integrity, Code Pointer Integrity, Stack Cookies, etc)?
  • How much slower does the resulting program run?
  • How good is a mitigation compared to a real patch?

Play Along At Home

The teams competing in the CGC have had years to hone and adapt their bug-finding tools to the peculiarities of DECREE. But the real world doesn’t run on DECREE; it runs on Windows, Mac OS X, and Linux. We believe that research should be guided by real-world challenges and parameters. So, we decided to port* the challenge binaries to run in those environments.

It took us several attempts to find the best porting approach to minimize the amount of code changes, while preserving as much original code as possible between platforms. The eventual solution was fairly straightforward: build each compilation unit without standard include files (as all CBs are statically linked), implement CGC system calls using their native equivalents, and perform various minor fixes to make the code compatible with more compilers and standard libraries.

We’re excited about the potential of multi-platform CBs on several fronts:

  • Since there’s no need to set up a virtual machine just for DECREE, you can run the CBs on the machine you already have.
  • With that hurdle out of the way, we all now have an industry benchmark to evaluate program analysis tools. We can make comparisons such as:
    • How do the CGC tools compare to existing program analysis and bug-finding tools?
    • When a new tool is released, how does it stack up against the current best?
    • Do static analysis tools that work with source code find more bugs than dynamic analysis tools that work with binaries?
    • Are tools written for Mac OS X better than tools written for Linux, and are they better than tools written for Windows?
  • When researchers open source their code, we can evaluate how well their findings work for a particular OS or compiler.

Before you watch the competitors’ CRSs duke it out, explore the challenges that the robots will attempt to solve in an environment you’re familiar with.

Get the CGC’s Challenge Binaries in the most common operating systems.

* Big thanks to our interns, Kareem El-Faramawi and Loren Maggiore, for doing the porting, and to Artem, Peter, and Ryan for their support.


A fuzzer and a symbolic executor walk into a cloud


Finding bugs in programs is hard. Automating the process is even harder. We tackled the harder problem and produced two production-quality bug-finding systems: GRR, a high-throughput fuzzer, and PySymEmu (PSE), a binary symbolic executor with support for concrete inputs.

From afar, fuzzing is a dumb, brute-force method that works surprisingly well, and symbolic execution is a sophisticated approach, involving theorem provers that decide whether or not a program is “correct.” Through this lens, GRR is the brawn while PSE is the brains. There isn’t a dichotomy though — these tools are complementary, and we use PSE to seed GRR and vice versa.

Let’s dive in and see the challenges we faced when designing and building GRR and PSE.

GRR, the fastest fuzzer around

GRR is a high speed, full-system emulator that we use to fuzz program binaries. A fuzzing “campaign” involves executing a program thousands or millions of times, each time with a different input. The hope is that spamming a program with an overwhelming number of inputs will result in triggering a bug that crashes the program.

Note: GRR is pronounced with two fists held in the air

During DARPA’s Cyber Grand Challenge, we went web-scale and performed tens of billions of input mutations and program executions — in only 24 hours! Below are the challenges we faced when making this fuzzer, and how we solved those problems.

  1. Throughput. Typically, program fuzzing is split into discrete steps. A sample input is given to an input “mutator” which produces input variants. In turn, each variant is separately tested against the program in the hopes that the program will crash or execute new code. GRR internalizes these steps, and while doing so, completely eliminates disk I/O and program analysis ramp-up times, which account for a significant portion of the time spent in a fuzzing campaign with other common tools.
  2. Transparency. Transparency requires that the program being fuzzed cannot observe or interfere with GRR. GRR achieves transparency via perfect isolation. GRR can “host” multiple 32-bit x86 processes in memory within its 64-bit address space. The instructions of each hosted process are dynamically rewritten as they execute, guaranteeing safety while maintaining operational and behavioral transparency.
  3. Reproducibility. GRR emulates both the CPU architecture and the operating system, thereby eliminating sources of non-determinism. GRR records program executions, enabling any execution to be faithfully replayed. GRR’s strong determinism and isolation guarantees let us combine the strengths of GRR with the sophistication of PSE. GRR can snapshot a running program, enabling PSE to jump-start symbolic execution from deep within a given program execution.

PySymEmu, the PhD of binary symbolic execution

Symbolic execution as a subject is hard to penetrate. Symbolic executors “reason about” every path through a program, there’s a theorem prover in there somewhere, and something something… bugs fall out the other end.

At a high level, PySymEmu (PSE) is a special kind of CPU emulator: it has a software implementation for almost every hardware instruction. When PSE symbolically executes a binary, what it really does is perform all the ins-and-outs that the hardware would do if the CPU itself was executing the code.

frankenpse

PSE explores the relationship between the life and death of programs in an unorthodox scientific experiment

CPU instructions operate on registers and memory. Registers are names for super-fast but small data storage units. Typically, registers hold four to eight bytes of data. Memory on the other hand can be huge; for a 32-bit program, up to 4 GiB of memory can be addressed. PSE’s instruction simulators operate on registers and memory too, but they can do more than just store “raw” bytes — they can store expressions.

A program that consumes some input will generally do the same thing every time it executes. This happens because that “concrete” input will trigger the same conditions in the code, and cause the same loops to merry-go-round. PSE operates on symbolic input bytes: free variables that can initially take on any value. A fully symbolic input can be any input and therefore represents all inputs. As PSE emulates the CPU, if-then-else conditions impose constraints on the originally unconstrained input symbols. An if-then-else condition that asks “is input byte B less than 10” will constrain the symbol for B to be in the range [0, 10) along the true path, and to be in the range [10, 256) along the false path.

If-then-elses are like forks in the road when executing a program. At each such fork, PSE will ask its theorem prover: “if I follow the path down one of the prongs of the fork, then are there still inputs that satisfy the additional constraints imposed by that path?” PSE will follow each yay path separately, and ignore the nays.
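Phrased as solver queries, that fork looks like the sketch below. Z3 is used here purely for illustration; PSE’s actual solver interface differs.

import z3

b = z3.BitVec('input_byte', 8)
true_path = z3.Solver()
true_path.add(z3.ULT(b, 10))    # constraint along the "yay" branch
false_path = z3.Solver()
false_path.add(z3.UGE(b, 10))   # constraint along the other branch
print(true_path.check(), false_path.check())  # sat sat: both paths feasible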

So, what challenges did we face when creating and extending PSE?

  1. Comprehensiveness. Arbitrary program binaries can exercise any one of thousands of the instructions available to x86 CPUs. PSE implements simulation functions for hundreds of x86 instructions. PSE falls back onto a custom, single-instruction “micro-executor” in those cases where an instruction emulation is not or cannot be provided. In practice, this setup enables PSE to comprehensively emulate the entire CPU.
  2. Scale. Symbolic executors try to follow all feasible paths through a program by forking at every if-then-else condition, and constraining the symbols one way or another along each path. In practice, there are an exponential number of possible paths through a program. PSE handles the scalability problem by selecting the best path to execute for the given execution goal, and by distributing the program state space exploration process across multiple machines.
  3. Memory. Symbolic execution produces expressions representing simple operations like adding two symbolic numbers together, or constraining the possible values of a symbol down one path of an if-then-else code block. PSE gracefully handles the case where addresses pointing into memory are symbolic. Memory accessed via a symbolic address can potentially point anywhere — even point to “good” and “bad” (i.e. unmapped) memory.
  4. Extensibility. PSE is written using the Python programming language, which makes it easy to hack on. However, modifying a symbolic executor can be challenging — it can be hard to know where to make a change, and how to get the right visibility into the data that will make the change a success. PSE includes smart extension points that we’ve successfully used for supporting concolic execution and exploit generation.

Measuring excellence

So how do GRR and PSE compare to the best publicly available tools?

GRR

GRR is both a dynamic binary translator and fuzzer, and so it’s apt to compare it to AFLPIN, a hybrid of the AFL fuzzer and Intel’s PIN dynamic binary translator. During the Cyber Grand Challenge, DARPA helpfully provided a tutorial on how to use PIN with DECREE binaries. At the time, we benchmarked PIN and found that, before we even started optimizing GRR, it was already twice as fast as PIN!

The more important comparison metric is in terms of bug-finding. AFL’s mutation engine is smart and effective, especially in terms of how it chooses the next input to mutate. GRR internalizes Radamsa, another too-smart mutation engine, as one of its many input mutators. Eventually we may also integrate AFL’s mutators. During the qualifying event, GRR went face-to-face with AFL, which was integrated into the Driller bug-finding system. Our combination of GRR+PSE found more bugs. Beyond this one data point, a head-to-head comparison would be challenging and time-consuming.

PySymEmu

PSE can be most readily compared with KLEE, a symbolic executor of LLVM bitcode, or the angr binary analysis platform. LLVM bitcode is a far cry from x86 instructions, so it’s an apples-to-oranges comparison. Luckily we have McSema, our open-source and actively maintained x86-to-LLVM bitcode translator. Our experiences with KLEE have been mostly negative; it’s hard to use, hard to hack on, and it only works well on bitcode produced by the Clang compiler.

Angr uses a customized version of the Valgrind VEX intermediate representation. Using VEX enables angr to work on many different platforms and architectures. Many of the angr examples involve reverse engineering CTF challenges instead of exploitation challenges. These RE problems often require manual intervention or state knowledge to proceed. PSE is designed to try to crash the program at every possible emulated instruction. For example, PSE uses its knowledge of symbolic memory to probe every possible invalid array-like memory access, rather than just trying to solve for reaching unconstrained paths. During the qualifying event, angr went face-to-face with GRR+PSE and we found more bugs. Since then, we have improved PSE to support user interaction, concrete and concolic execution, and taint tracking.

I’ll be back!

Automating the discovery of bugs in real programs is hard. We tackled this challenge by developing two production-quality bug-finding tools: GRR and PySymEmu.

GRR and PySymEmu have been a topic of discussion in recent presentations about our CRS, and we suspect that these tools may be seen again in the near future.


Shin GRR: Make Fuzzing Fast Again


grr

We’ve mentioned GRR before – it’s our high-speed, full-system emulator used to fuzz program binaries. We developed GRR for DARPA’s Cyber Grand Challenge (CGC), and now we’re releasing it as an open-source project! Go check it out.

Fear GRR

Bugs aren’t afraid of slow fuzzers, and that’s why GRR was designed with unique and innovative features that make it scarily fast.

  • GRR emulates x86 binaries within a 64-bit address space using dynamic binary translation (DBT). As a 64-bit program, GRR can use more hardware registers and memory than the original program. This enabled easy implementation of perfect isolation without complex register-rescheduling or memory remapping logic. The translated program never sees GRR coming.
  • GRR is fast. Existing DBTs re-translate the same program on every execution. They specialize in translating long-running programs, where the translation cost is amortized over time and “hot” code is reorganized to improve performance. Fuzzing campaigns execute the same program over and over again, so all code is hot, and re-translating the same code is wasteful. GRR avoids re-translating code by caching it to disk, and it optimizes the cached code over the lifetime of the fuzzing campaign.
  • GRR eats JIT compilers and self-modifying code for breakfast. GRR translates one basic block at a time, and indexes the translated blocks in its cache using “version numbers.” A block’s version number is a Merkle hash of the contents of executable memory (see the sketch after this list). Modifying the contents of an executable page in memory invalidates its hash, thereby triggering re-translation of its code when next executed.
  • GRR is efficient. GRR uses program snapshotting to skip over irrelevant setup code that executes before the first input byte is ever read. This saves a lot of cycles in a fuzzing campaign with millions or billions of program executions. GRR also avoids kernel round trips by emulating system calls and performing all I/O within memory.
  • GRR is extensible. GRR supports pluggable input mutators, including Radamsa, and code coverage metrics, which allows you to tune GRR’s behavior to the program being fuzzed. In the CGC, we didn’t know ahead of time what binaries we would get. There is no one-size fits all way of measuring code coverage. GRR’s flexibility let us change how code coverage was measured over time. This made our fuzzer more resilient to different types of programs.
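Here is a toy version of the version-number scheme mentioned in the list above. The hashing is real; everything else is simplified, and GRR’s actual bookkeeping is more involved.

import hashlib

def block_version(exec_pages):
    # Hash each executable page, then hash the concatenated page hashes,
    # Merkle-style. Any write to executable memory changes the result,
    # which invalidates cached translations for the affected code.
    leaves = b''.join(hashlib.sha256(page).digest() for page in exec_pages)
    return hashlib.sha256(leaves).hexdigest()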

Two fists in the air

Take a look at GRR demolishing this CGC challenge that has six binaries communicating over IPC. GRR detects the crash in the 3rd binary.

NRFIN_00006

This demo shows off two nifty features of GRR:

  1. GRR can print out the trace of system calls performed by the translated binary.
  2. GRR can print out the register state on entry to every basic block executed. Instruction-granularity register traces are available when the maximum basic block size is set to one instruction.

Dig deeper

I like to think of GRR as an excavator for bugs. It’s a lean, mean, bug-finding machine. It’s also now open-source, and permissively licensed. You should check it out and we welcome contributions.



Breaking Down Binary Ninja’s Low Level IL


Hi, I’m Josh. I recently joined the team at Trail of Bits, and I’ve been an evangelist and plugin writer for the Binary Ninja reversing platform for a while now. I’ve developed plugins that make reversing easier and extended Binary Ninja’s architecture support to assist in playing the microcorruption CTF. One of my favorite features of Binary Ninja is the Low Level IL (LLIL), which enables development of powerful program analysis tools. At Trail of Bits, we have used the LLIL to automate processing of a large number of CTF binaries, as well as automate identifying memory corruptions.

I often get asked how the LLIL works. In this blog post, I answer common questions about the basics of LLIL and demonstrate how to use the Python API to write a simple function that operates on the LLIL. In a future post, I will demonstrate how to use the API to write plugins that use both the LLIL and Binary Ninja’s own dataflow analysis.

What is the Low Level IL?

Compilers use an intermediate representation (IR) to analyze and optimize the code being compiled. This IR is generated by translating the source language to a single standard language understood by the components of the toolchain. The toolchain components can then perform generic tasks on a variety of architectures without having to implement those tasks individually.

Similarly, Binary Ninja not only disassembles binary code, but also leverages the power of its own IR, called Low Level IL, in order to perform dataflow analysis. The dataflow analysis makes it possible for users to query register values and stack contents at arbitrary instructions. This analysis is architecture-agnostic because it is performed on the LLIL, not the assembly. In fact, I automatically got this dataflow analysis for free when I wrote the lifter for the MSP430 architecture.

Let’s jump right in and see how the Low Level IL works.

Viewing the Low Level IL

Within the UI, the Low Level IL is viewable only in Graph View. It can be accessed either through the “Options” menu in the bottom right corner, or via the i hotkey. The difference between IL View and Graph View is noticeable; the IL View looks much closer to a high level language with its use of infix notation. This, combined with the fact that the IL is a standardized set of instructions that all architectures are translated to, makes working with an unfamiliar language easy.

arm-x64-il-comparison.png

Graph View versus IL View; on the left, Graph View of ARM (top) and x86-64 (bottom) assembly of the same function. On the right, the IL View of their respective Graph Views.

If you aren’t familiar with this particular architecture, then you might not easily understand the semantics of the assembly code. However, the meaning of the LLIL is clear. You might also notice that there are often more LLIL instructions than there are assembly instructions. The translation of assembly to LLIL is actually a one-to-many rather than one-to-one translation because the LLIL is a simplified representation of an instruction set. For example, the x86 repne cmpsb instruction will even generate branches and loops in the LLIL:

repne cmpsb in LLIL

Low Level IL representation of the x86 instruction repne cmpsb

How is analysis performed on the LLIL? To figure that out, we’ll first dive into how the LLIL is structured.

Low Level IL Structure

According to the API documentation, LLIL instructions have a tree-based structure. The root of an LLIL instruction tree is an expression consisting of an operation and zero to four operands as child nodes. The child nodes may be integers, strings, arrays of integers, or another expression. As each child expression can have its own child expressions, an instruction tree of arbitrary order and complexity can be built. Below are some example expressions and their operands:

Operation      Operand 1                Operand 2                   Operand 3        Operand 4
LLIL_NOP
LLIL_SET_REG   dest: string or integer  src: expression
LLIL_LOAD      src: expression
LLIL_CONST     value: integer
LLIL_IF        condition: expression    true: integer               false: integer
LLIL_JUMP_TO   dest: expression         targets: array of integers

Let’s look at a couple examples of lifted x86, to get a better understanding of how these trees are generated when lifting an instruction: first, a simple mov instruction, and then a more complex lea instruction.

Example: mov eax, 2


LLIL tree for mov eax, 2

This instruction has a single operation, mov, which is translated to the LLIL expression LLIL_SET_REG. The LLIL_SET_REG instruction has two child nodes: dest and src. dest is a reg node, which is just a string representing the register that will be set. src is another expression representing how the dest register will be set.

In our x86 instruction, the destination register is eax, so the dest child is just eax; easy enough. What is the source expression? Well, 2 is a constant value, so it will be translated into an LLIL_CONST expression. An LLIL_CONST expression has a single child node, value, which is an integer. No other nodes in the tree have children, so the instruction is complete. Putting it all together, we get the tree above.
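To make the tree concrete, here is a minimal sketch of inspecting each node with the API, assuming instr holds the LowLevelILInstruction lifted from this mov:

# instr: the LowLevelILInstruction lifted from 'mov eax, 2'
print instr.operation_name      # LLIL_SET_REG
print instr.dest                # eax
print instr.src.operation_name  # LLIL_CONST
print instr.src.value           # 2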

Example: lea eax, [edx+ecx*4]


LLIL tree for lea eax, [edx+ecx*4]

The end result of this instruction is also to set the value of a register. The root of this tree will also be an LLIL_SET_REG, and its dest will be eax. The src expression is a mathematical expression consisting of an addition and multiplication…or is it?

If we add parentheses to explicitly define the order of operations, we get (edx + (ecx * 4)); thus, the root of the src sub-tree will be an LLIL_ADD expression, which has two child nodes: left and right, both of which are expressions. The left side of the addition is a register, so the left expression in our tree will be an LLIL_REG expression. This expression only has a single child. The right side of the addition is our multiplication, but the multiplier in an lea instruction has to be a power of 2 (1, 2, 4, or 8), so it can be translated to a left-shift operation, and that’s exactly what the lifter does: ecx * 4 becomes ecx << 2. So, the right expression in the tree is actually an LLIL_LSL expression (Logical Shift Left).

The LLIL_LSL expression also has left and right child expression nodes. For our left-shift operation, the left side is the ecx register, and the right side is the constant 2. We already know that both LLIL_REG and LLIL_CONST terminate with a string and integer, respectively. With the tree complete, we arrive at the tree presented above.
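Walking this deeper tree with the API follows the same pattern; a minimal sketch, assuming instr holds the LowLevelILInstruction lifted from the lea above:

# instr: the LowLevelILInstruction lifted from 'lea eax, [edx+ecx*4]'
add = instr.src                      # LLIL_ADD expression
print add.left.src                   # edx (LLIL_REG terminates in a string)
lsl = add.right                      # LLIL_LSL expression
print lsl.left.src, lsl.right.value  # ecx 2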

Now that we have an understanding of the structure of the LLIL, we are ready to dive into using the Python API. After reviewing features of the API, I will demonstrate a simple Python function to traverse an LLIL instruction and examine its tree structure.

Using the Python API

There are a few important classes related to the LLIL in the Python API: LowLevelILFunction, LowLevelILBasicBlock, and LowLevelILInstruction. There are a few others, like LowLevelILExpr and LowLevelILLabel, but those are more for writing a lifter rather than consuming IL.

Accessing Instructions

To begin playing with the IL, the first step is to get a reference to a function’s LLIL. This is accomplished through the low_level_il property of a Function object. If you’re in the GUI, you can get the LowLevelILFunction object for the currently displayed function using current_function.low_level_il.

The LowLevelILFunction class has a lot of methods, but they’re basically all for implementing a lifter, not performing analysis. In fact, this class is really only useful for retrieving or enumerating basic blocks and instructions. The __iter__ method is implemented and iterates over the basic blocks of the LLIL function, and the __getitem__ method is implemented and retrieves an LLIL instruction based on its index. The LowLevelILBasicBlock class also implements __iter__, which iterates over the individual LowLevelILInstruction objects belonging to that basic block. Therefore, it is possible to iterate over the instructions of a LowLevelILFunction two different ways, depending on your needs:

il = current_function.low_level_il

# iterate over instructions using basic blocks
for bb in il.basic_blocks:
  for instruction in bb:
    print instruction

# iterate over instructions directly
for index in range(len(il)):
  instruction = il[index]
  print instruction

Directly accessing an instruction is currently cumbersome. In Python, this is accomplished with function.low_level_il[function.get_low_level_il_at(function.arch, address)]. That’s a pretty verbose line of code. This is because the Function.get_low_level_il_at() method doesn’t actually return a LowLevelILInstruction object; it returns the integer index of the LLIL instruction. Hopefully it will be more concise in an upcoming refactor of the API.
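Spelled out as a sketch (using the console-provided current_function, and a hypothetical address of interest):

# look up the LLIL instruction lifted from a given address
addr = 0x80484d0  # hypothetical address of interest
index = current_function.get_low_level_il_at(current_function.arch, addr)
instruction = current_function.low_level_il[index]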

Parsing Instructions

The real meat of the LLIL is exposed in LowLevelILInstruction objects. The common members shared by all instructions allow you to determine:

  • The containing function of the LLIL instruction
  • The address of the assembly instruction lifted to LLIL
  • The operation of the LLIL instruction
  • The operation_name, which is just a string representation of the operation
  • The size of the operation (i.e. is this instruction manipulating a byte/short/long/long long)

As we saw in the table above, the operands vary by instruction. These can be accessed sequentially, via the operands member, or directly accessed by operand name (e.g. dest, left, etc). When accessing operands of an instruction that has a destination operand, the dest operand will always be the first element of the list.
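For example, for an LLIL_SET_REG instruction, both access styles reach the same operand; a small sketch, assuming instr is such an instruction:

# both access styles refer to the same operand
if instr.operation == LowLevelILOperation.LLIL_SET_REG:
    assert instr.dest == instr.operands[0]  # dest is always the first element
    print instr.dest, instr.src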

Example: A Simple Recursive Traversal Function

A very simple example of consuming information from the LLIL is a recursive traversal of a LowLevelILInstruction. In the example below, the operation of the expression of an LLIL instruction is output to the console, as well as its operands. If an operand is also an expression, then the function traverses that expression as well, outputting its operation and operands in turn.

def traverse_IL(il, indent):
  if isinstance(il, LowLevelILInstruction):
    print '\t'*indent + il.operation_name

    for o in il.operands:
      traverse_IL(o, indent+1)

  else:
    print '\t'*indent + str(il)

After copy-pasting this into the Binary Ninja console, select any instruction you wish to output the tree for. You can then use bv, current_function, and here to access the current BinaryView, the currently displayed function’s Function object, and the currently selected address, respectively. In the following example, I selected the ARM instruction ldr r3, [r11, #-0x8]:
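Invoking the traversal on that selection looks something like this (here and current_function are supplied by the console, as noted above):

index = current_function.get_low_level_il_at(current_function.arch, here)
traverse_IL(current_function.low_level_il[index], 0)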

Console output: the LLIL tree printed by traverse_IL for ldr r3, [r11, #-0x8]

Lifted IL vs Low Level IL

While reviewing the API, you might notice that there are function calls such as Function.get_lifted_il_at versus Function.get_low_level_il_at. This might make you unsure of which you should be processing for your analysis. The answer is fairly straightforward: with almost no exceptions, you will want to work with the Low Level IL.

Lifted IL is what the lifter first generates when parsing the executable code; an optimized version is what is exposed as the Low Level IL to the user in the UI. To demonstrate this, try creating a new binary file, and fill it with a bunch of nop instructions, followed by a ret. After disassembling the function, and switching to IL view (by pressing i in Graph View), you will see that there is only a single IL instruction present: <return> jump(pop). This is due to the nop instructions being optimized away.

It is possible to view the Lifted IL in the UI: check the box in Preferences for “Enable plugin development debugging mode.” Once checked, the “Options” tab at the bottom of the window will now present two options for viewing the IL. With the previous example, switching to Lifted IL view will now display a long list of nop instructions, in addition to the <return> jump(pop).

In general, Lifted IL is not something you will need unless you’re developing an Architecture plugin.

Start Using the LLIL

In this blog post, I described the fundamentals of Binary Ninja’s Low Level IL, and how the Python API can be used to interact with it. Around the office, Ryan has used the LLIL and its data flow analysis to solve 2000 CTF challenge binaries by identifying a buffer to overflow and a canary value that had to remain intact in each. Sophia will present “Next-level Static Analysis for Vulnerability Research” using the Binary Ninja LLIL at INFILTRATE 2017, which everyone should definitely attend. I hope this guide makes it easier to write your own plugins with Binary Ninja!

In Part 2 of this blog post, I will demonstrate the power of the Low Level IL and its dataflow analysis with another simple example. We will develop a simple, platform-agnostic plugin to navigate to virtual functions by parsing the LLIL for an object’s virtual method table and calculating the offset of the called function pointer. This makes reversing the behavior of C++ binaries easier because instructions such as call [eax+0x10] can be resolved to a known function like object->isValid(). In the meantime, get yourself a copy of Binary Ninja and start using the LLIL.


Devirtualizing C++ with Binary Ninja


In my first blog post, I introduced the general structure of Binary Ninja’s Low Level IL (LLIL), as well as how to traverse and manipulate it with the Python API. Now, we’ll do something a little more interesting.

Reverse engineering binaries compiled from object-oriented languages can be challenging, particularly when it comes to virtual functions. In C++, invoking a virtual function involves looking up a function pointer in a virtual table (vtable) and then making an indirect call. In the disassembly, all you see is something like mov rax, [rcx+0x18]; call rax. If you want to know what function it will call for a given class object, you have to find the virtual table and then figure out which function pointer is at that offset.

Or you could use this plugin!

An Example Plugin: Navigating to a Virtual Function

vtable-navigator.py is an example plugin that will navigate to a given class’s virtual function from a call instruction. First, the plugin uses the LLIL to identify a specified class’s vtable when it is referenced in a constructor. Next, it will preprocess the call instruction’s basic block to track register assignments and their corresponding LLIL expressions. Finally, the plugin will process the LowLevelILInstruction object of the call and calculate the offset of the function to be called by recursively visiting register assignment expressions.

Discovering the vtable Pointer


Figure 1: Two classes inherit from a base virtual class. Each class’s virtual table points to its respective implementation of the virtual functions.

In its simplest form, a class constructor stores a pointer to its vtable in the object’s structure in memory. The two most common ways to store the vtable pointer are referencing a hardcoded value directly, or loading the vtable pointer into a register and then copying that register’s value into memory. Thus, if we look for a write to a memory address from a register with no offset, then it’s probably the vtable.


An example constructor. The highlighted instruction stores a vtable in the object’s structure.

We can detect the first kind of vtable pointer assignment by looking for an LLIL instruction in the constructor’s LowLevelILFunction object (as described in Part 1) that stores a constant value to a memory address contained in a register.

According to the API, an LLIL_STORE instruction has two operands: dest and src. Both are LLIL expressions. For this case, we are looking for a destination value provided by a register, so dest should be an LLIL_REG expression. The value to be stored is a constant, so src should be an LLIL_CONST expression. If we match this pattern, then we assume that the constant is a vtable pointer, read the value pointed to by the constant (i.e. il.src.value), and double check that there’s a function pointer there, just to make sure it’s actually a vtable.

# If it's not a memory store, then it's not a vtable.
if il.operation != LowLevelILOperation.LLIL_STORE:
    continue

# vtable is referenced directly
if (il.dest.operation == LowLevelILOperation.LLIL_REG and
        il.src.operation == LowLevelILOperation.LLIL_CONST):
    fp = read_value(bv, il.src.value, bv.address_size)

    if not bv.is_offset_executable(fp):
        continue

    return il.src.value

Pretty straightforward, but let’s look at the second case, where the value is first loaded into a register.

For this case, we search for instructions where the dest and src operands of an LLIL_STORE are both LLIL_REG expressions. Now we need to determine the location of the virtual table based only on the register.

This is where things get cool. This situation demonstrates not only usage of the LLIL, but also how powerful the dataflow analysis performed on the LLIL can be. Without dataflow analysis, we would have to parse this LLIL_STORE instruction, figure out which register is being referenced, and then step backwards to find the last value assigned to that register. With the dataflow analysis, the register’s current value is readily available with a single call to get_reg_value_at_low_level_il_instruction.

# vtable is first loaded into a register, then stored
if (il.dest.operation == LowLevelILOperation.LLIL_REG and
        il.src.operation == LowLevelILOperation.LLIL_REG):
    reg_value = src_func.get_reg_value_at_low_level_il_instruction(
        il.instr_index, il.src.src
    )

    if reg_value.type == RegisterValueType.ConstantValue:
        fp = read_value(bv, reg_value.value, bv.address_size)

        if not bv.is_offset_executable(fp):
            continue

        return reg_value.value

Propagation of Register Assignments

Now that we know the location of the vtable, let’s figure out which offset is called. To determine this value, we need to trace back through the state of the program from the call instruction to the moment the vtable pointer is retrieved from memory, calculate the offset into the virtual table, and discover which function is being called. We accomplish this tracing by implementing a rudimentary dataflow analysis that preprocesses the basic block containing the call instruction. This preprocessing step will let us query the state of a register at any point in the basic block.

from copy import copy

def preprocess_basic_block(bb):
    defs = {}
    current_defs = {}

    for instr in bb:
        defs[instr.instr_index] = copy(current_defs)

        if instr.operation == LowLevelILOperation.LLIL_SET_REG:
            current_defs[instr.dest] = instr

        elif instr.operation == LowLevelILOperation.LLIL_CALL:
            # wipe out previous definitions since we can't
            # guarantee the call didn't modify registers.
            current_defs.clear()

    return defs

At each instruction of the basic block, we keep a table of register states. As we iterate over each LowLevelILInstruction, this table is updated when an LLIL_SET_REG operation is encountered. For each register tracked, we store the LowLevelILInstruction responsible for changing its value. Later, we can query a register’s state, retrieve that LowLevelILInstruction, and recursively evaluate its src operand, which is the expression the register currently represents.

Additionally, if an LLIL_CALL operation is encountered, then we clear the register state from that point on. The called function might modify the registers, and so it is safest to assume that all registers after the call have unknown values.

Now we have all the data that we need to model the vtable pointer dereference and calculate the virtual function offset.

Calculating the Virtual Function Offset

Before diving into the task of calculating the offset, let’s consider how we can model the behavior. Looking back at Figure 1, dispatching a virtual function can be generalized into four steps:

  1. Read a pointer to the vtable from the object’s structure in memory (LLIL_LOAD).
  2. Add an offset to the pointer value, if the function to be dispatched is not the first function (LLIL_ADD).
  3. Read the function pointer at the calculated offset (LLIL_LOAD).
  4. Call the function (LLIL_CALL).

Dispatching a virtual function can therefore be modeled by evaluating the src operand expression of the LLIL_CALL instruction, recursively visiting each expression. The base case of the recursion is reached when the LLIL_LOAD instruction of step 1 is encountered. The value of that LLIL_LOAD is the specified vtable pointer. The vtable pointer value is returned and propagates back through the previous iterations of the recursion to be used in those iterations’ evaluations.

Let’s step through the evaluation of an example, to see how the model works and how it is implemented in Python. Take the following virtual function dispatch in x86.

mov eax, [ecx] ; retrieve vtable pointer
call [eax+4]   ; call the function pointer at offset 4 of the vtable

This assembly would be translated into the following LLIL.

0: eax = [ecx].d
1: call ([eax + 4].d)

Building out the trees for these two LLIL instructions yields the following structures.


Figure 2: LLIL tree structures for the example vtable dispatch assembly.

The src operand of the LLIL_CALL is an LLIL_LOAD expression. We evaluate the src operand of the LLIL_CALL with a handler based on its operation.

from collections import defaultdict

# This lets us handle expressions in a more generic way.
# operation handlers take the following parameters:
#   vtable (int): the address of the class's vtable in memory
#   bv (BinaryView): the BinaryView passed into the plugin callback
#   expr (LowLevelILInstruction): the expression to handle
#   current_defs (dict): The current state of register definitions
#   defs (dict): The register state table for all instructions
#   load_count (int): The number of LLIL_LOAD operations encountered
operation_handler = defaultdict(lambda: (lambda *args: None))
operation_handler[LowLevelILOperation.LLIL_ADD] = handle_add
operation_handler[LowLevelILOperation.LLIL_REG] = handle_reg
operation_handler[LowLevelILOperation.LLIL_LOAD] = handle_load
operation_handler[LowLevelILOperation.LLIL_CONST] = handle_const

Therefore, the first iteration of our recursive evaluation of this virtual function dispatch is a call to handle_load.

def handle_load(vtable, bv, expr, current_defs, defs, load_count):
    load_count += 1

    if load_count == 2:
        return vtable

    addr = operation_handler[expr.src.operation](
        vtable, bv, expr.src, current_defs, defs, load_count
    )
    if addr is None:
        return

    # Read the value at the specified address.
    return read_value(bv, addr, expr.size)

handle_load first increments a count of LLIL_LOAD expressions encountered. Recall that our model for dereferencing a vtable expects two LLIL_LOAD instructions: the vtable pointer, then the function pointer we want. Tracing backwards through the program state means that we will encounter the load of the function pointer first and the load for the vtable pointer second. The count is 1 at the moment, so the recursion should not yet be terminated. Instead, the src operand of the LLIL_LOAD is recursively evaluated by a handler function for the src expression. When this call to a handler completes, addr should contain the address that points to the function pointer to be dispatched. In this case, src is an LLIL_ADD, so handle_add is called.

def handle_add(vtable, bv, expr, current_defs, defs, load_count):
    left = expr.left
    right = expr.right

    left_value = operation_handler[left.operation](
        vtable, bv, left, current_defs, defs, load_count
    )

    right_value = operation_handler[right.operation](
        vtable, bv, right, current_defs, defs, load_count
    )

    if None in (left_value, right_value):
        return None

    return left_value + right_value

handle_add recursively evaluates both the left and right sides of the LLIL_ADD expression and returns the sum of these values back to its caller. In our example, the left operand is an LLIL_REG expression, so handle_reg is called.

def handle_reg(vtable, bv, expr, current_defs, defs, load_count):
    # Retrieve the LLIL expression that this register currently
    # represents.
    set_reg = current_defs.get(expr.src, None)
    if set_reg is None:
        return None

    new_defs = defs.get(set_reg.instr_index, {})

    return operation_handler[set_reg.src.operation](
        vtable, bv, set_reg.src, new_defs, defs, load_count
    )

This is where our dataflow analysis comes into play. Using the current register state, as described by current_defs, we identify the LLIL expression that represents the current value of this LLIL_REG expression. Based on the example above, current_defs['eax'] would be the expression [ecx].d. This is another LLIL_LOAD expression, so handle_load is called again. This time, load_count is incremented to 2, which meets the base case. If we assume that in our example the user chose a class whose vtable is located at 0x1000, then handle_load will return the value 0x1000.

With the left operand evaluated, it is now time for handle_add to evaluate the right operand. This expression is an LLIL_CONST, which is very easy to evaluate; we just return the value operand of the expression. With both left and right operands evaluated, handle_add returns the sum of the expression, which is 0x1004. handle_load receives the return value from handle_add, then reads the function pointer located at that address from the BinaryView. We can then navigate to the called function by calling bv.file.navigate(bv.file.view, function_pointer) on the BinaryView object.
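For completeness, the LLIL_CONST handler described above is trivial; a minimal sketch with the same signature as the other handlers:

def handle_const(vtable, bv, expr, current_defs, defs, load_count):
    # an LLIL_CONST expression evaluates to its integer value operand
    return expr.value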

Returning to the LLIL tree structures earlier, we can annotate the structures to visualize how the recursion and concrete data propagation happens.


Figure 3: LLIL tree structures for the example vtable dispatch assembly, annotated with handler calls and propagation of concrete values back up the call chain.

Example: Rectangles and Triangles

For a real-world example, I used a slightly modified version of this C++ tutorial, which you can find here.

If you compile virtual-test.cpp for both x86-64 and ARM and open the binaries in Binary Ninja with the plugin installed, you will find that it will work on both architectures without needing any architecture-specific code. That’s the beauty of an intermediate representation!

Go Forth and Analyze

Binary Ninja’s LLIL is a powerful feature that enables cross-platform program analysis to be easily developed. As we’ve seen, its structure is simple, yet allows for representation of even the most complex instructions. The Python API is a high quality interface that we can use effectively to traverse instructions and process operations and operands with ease. More importantly, we’ve seen simple examples of how dataflow analysis, enabled by the LLIL, can allow us to develop cross-platform plugins to perform program analysis without having to implement complicated heuristics to calculate program values. What are you waiting for? Pick up a copy of Binary Ninja and start writing your own binary analysis with the LLIL, and don’t forget to attend Sophia’s presentation of “Next-level Static Analysis for Vulnerability Research” using the Binary Ninja LLIL at INFILTRATE 2017.



Manticore: Symbolic execution for humans


Earlier this week, we open-sourced a tool we rely on for dynamic binary analysis: Manticore! Manticore helps us quickly take advantage of symbolic execution, taint analysis, and instrumentation to analyze binaries. Parts of Manticore underpinned our symbolic execution capabilities in the Cyber Grand Challenge. As an open-source tool, we hope that others can take advantage of these capabilities in their own projects.

We prioritized simplicity and usability while building Manticore. We used minimal external dependencies and our API should look familiar to anyone with an exploitation or reversing background. If you have never used such a tool before, give Manticore a try.

Two interfaces. Multiple use cases.

Manticore comes with an easy-to-use command line tool that quickly generates new program “test cases” (or sample inputs) with symbolic execution. Each test case results in a unique outcome when running the program, like a normal process exit or crash (e.g., invalid program counter, invalid memory read/write).

The command line tool satisfies some use cases, but practical use requires more flexibility. That’s why we created a Python API for custom analyses and application-specific optimizations. Manticore’s expressive and scriptable Python API can help you answer questions like:

  • At point X in execution, is it possible for variable Y to be a specified value?
  • Can the program reach this code at runtime?
  • What is a program input that will cause execution of this code?
  • Is user input ever used as a parameter to this libc function?
  • How many times does the program execute this function?
  • How many instructions does the program execute if given this input?

In our first release, the API provides functionality to extend the core analysis engine. In addition to test case generation, the Manticore API can do the following (a short sketch follows the list):

  • Abandon irrelevant states
  • Run custom analysis functions at arbitrary execution points
  • Concretize symbolic memory
  • Introspect and modify emulated machine state
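For instance, here is a minimal sketch of a hook that runs at a chosen execution point and abandons an irrelevant state; the target binary and address are hypothetical:

from manticore import Manticore

m = Manticore('./basic')  # hypothetical target binary

@m.hook(0x400ca0)  # hypothetical address of an uninteresting branch
def skip(state):
    # custom analysis can inspect or modify the emulated machine state here
    print 'hit', hex(state.cpu.PC)
    state.abandon()  # stop exploring this state

m.run()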

Early applications

Manticore is one of the primary tools we use for binary analysis research. We used an earlier version as the foundation of our symbolic execution vulnerability hunting in the Cyber Grand Challenge. We’re using it to build a custom program analyzer for DARPA LADS.

In the month leading up to our release, we solicited ideas from the community on simple use cases to demonstrate Manticore’s features. Here are a few of our favorites:

  • Eric Hennenfent solved a simple reversing challenge. He presented two solutions: one using binary instrumentation and one using symbolic execution.
  • Yan and Mark replaced a variable with a tainted symbolic value to determine which specific comparisons user input could influence.
  • Josselin Feist generated an exploit using only the Manticore API. He instrumented a binary to find a crash and then determined constraints to call an arbitrary function with symbolic execution.
  • Cory Duplantis solved a reversing challenge from Google CTF 2016. His script is a great example of how straightforward it is to solve many CTF challenges with Manticore.

Finally, a shoutout to Murmus, who made a video review of Manticore only 4 hours after we open-sourced it!

It’s easy to get started

With other tools, you’d have to spend time researching their internals. With Manticore, you have a well-written interface and an approachable codebase. So, jump right in and get something useful done sooner.

Grab an Ubuntu 16.04 VM and:

# Install the system dependencies
sudo apt-get update && sudo apt-get install z3 python-pip -y
python -m pip install -U pip

# Install manticore and its dependencies
git clone https://github.com/trailofbits/manticore.git && cd manticore
sudo pip install --no-binary capstone .

You have installed the Manticore CLI and API. We included a few examples in our source repository. Let’s try the CLI first:

# Build the examples
cd examples/linux
make

# Use the Manticore CLI to discover unique test cases
manticore basic
cat mcore_*/*1.stdin | ./basic
cat mcore_*/*2.stdin | ./basic

“Basic” is a toy example that reads user input and prints one of two statements. Manticore used symbolic execution to explore `basic` and discovered the two unique inputs. It puts the sample inputs it discovers into “stdin” files that you can pipe to the binary. Next, we’ll use the API:

# Use the Manticore API to count executed instructions
cd ../script
python count_instructions.py ../linux/helloworld

The count_instructions.py script uses the Manticore API to instrument the `helloworld` binary and count the number of instructions it executes.
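The script itself is short. A rough sketch of the approach is below: hook every instruction and tally a counter in the shared context. Treat the exact hook and context calls as assumptions; the shipped example may differ.

import sys
from manticore import Manticore

m = Manticore(sys.argv[1])  # e.g. ../linux/helloworld
m.context['count'] = 0

@m.hook(None)  # assumption: a hook address of None fires on every instruction
def count(state):
    with m.locked_context() as context:
        context['count'] += 1

m.run()
print 'Executed %d instructions' % m.context['count']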

Let us know what you think!

If you’re interested in reverse engineering, binary exploitation, or just want to learn about CPU emulators and symbolic execution, we encourage you to play around with it and join #manticore on our Slack for discussion and feedback. See you there!


Semantic Analysis of Native Programs with CodeReason


Have you ever wanted to make a query into a native mode program asking about program locations that write a specific value to a register? Have you ever wanted to automatically deobfuscate obfuscated strings?

Reverse engineering a native program involves understanding its semantics at a low level until a high-level picture of functionality emerges. One challenge facing a principled understanding of a native mode program is that this understanding must extend to every instruction used by the program. Your analysis must know which instructions have what effects on memory cells and registers.

We’d like to introduce CodeReason, a machine code analysis framework we produced for DARPA Cyber Fast Track. CodeReason provides a framework for analyzing the semantics of native x86 and ARM code. We like CodeReason because it provides us a platform to make queries about the effects that native code has on overall program state. CodeReason does this by having a deep semantic understanding of native instructions.

Building this semantic understanding is time-consuming and expensive. Existing systems either have high barriers to entry, don’t do precisely what we want, or don’t apply simplifications and optimizations to their semantics. We want those simplifications because they can reduce otherwise hairy code sequences to simple expressions that are easy to understand. To motivate this, we’ll give an example of a time we used CodeReason.

Simplifying Flame

Around the time the Flame malware was revealed, some of its binaries were posted on malware.lu. The malware’s overall scheme is to store each obfuscated string in a structure in global data. The structure looks something like this:

struct ObfuscatedString {
  char padding[7];
  char hasDeobfuscated;
  short stringLen;
  char string[];
};

Each structure has variable-length data at the end, with 7 bytes of data that were apparently unused.

There are two fun things here. First, I used CodeReason to write a string deobfuscator in C. The original program logic performs string deobfuscation in three steps.

The first function checks the hasDeobfuscated field; if it is zero, it returns a pointer to the first element of the string. If the field is not zero, it calls the second function, then sets hasDeobfuscated to zero.

The second function will iterate over every character in the ‘string’ array. At each character, it will call a third function and then subtract the value returned by the third function from the character in the string array, writing the result back into the array. So it looks something like:

void inplace_buffer_decrypt(unsigned char *buf, int len) {
  int counted = 0;
  while( counted < len ) {
    unsigned char *cur = buf + counted;
    unsigned char newChar = get_decrypt_modifier_f(counted);
    *cur -= newChar;
    ++counted;
  }
  return;
}

What about the third function, ‘get_decrypt_modifier’? This function is one basic block long and looks like this:

lea ecx, [eax+11h]
add eax, 0Bh
imul ecx, eax
mov edx, ecx
shr edx, 8
mov eax, edx
xor eax, ecx
shr eax, 10h
xor eax, edx
xor eax, ecx
retn

An advantage of having a native code semantics understanding system is that I could capture this block and feed it to CodeReason and have it tell me what the equation of ‘eax’ looks like. This would tell me what this block ‘returns’ to its caller, and would let me capture the semantics of what get_decrypt_modifier does in my deobfuscator.

It would also be possible to decompile this snippet to C; however, what I’m really concerned with is the effect of the code on ‘eax’, not something as high-level as what the code “looks like” in a C decompiler’s view of the world. C decompilers also use a semantics translator, but then proxy the results of that translation through an attempt at translating to C. CodeReason lets us skip the last step and consider just the semantics, which can sometimes be more powerful.

Using CodeReason

Getting this from CodeReason looks like this:

$ ./bin/VEEShell -a X86 -f ../tests/testSkyWipe.bin
blockLen: 28
r
...
EAX = Xor32[ Xor32[ Shr32[ Xor32[ Shr32[ Mul32[ Add32[ REGREAD(EAX), I:U32(0xb) ], Add32[ REGREAD(EAX), I:U32(0x11) ] ], I:U8(0x8) ], Mul32[ Add32[ REGREAD(EAX), I:U32(0xb) ], Add32[ REGREAD(EAX), I:U32(0x11) ] ] ], I:U8(0x10) ], Shr32[ Mul32[ Add32[ REGREAD(EAX), I:U32(0xb) ], Add32[ REGREAD(EAX), I:U32(0x11) ] ], I:U8(0x8) ] ], Mul32[ Add32[ REGREAD(EAX), I:U32(0xb) ], Add32[ REGREAD(EAX), I:U32(0x11) ] ] ]
...
EIP = REGREAD(ESP)

This is cool, because if I implement functions for Xor32, Mul32, Add32, and Shr32, I have this function in C, like so:

unsigned char get_decrypt_modifier_f(unsigned int a) {
    return Xor32(
             Xor32(
               Shr32(
                 Xor32(
                   Shr32(
                     Mul32(
                       Add32( a, 0xb ),
                       Add32( a, 0x11 ) ),
                     0x8 ),
                   Mul32(
                     Add32( a, 0xb ),
                     Add32( a, 0x11 ) ) ),
                 0x10 ),
               Shr32(
                 Mul32(
                   Add32( a, 0xb ),
                   Add32( a, 0x11 ) ),
                 0x8 ) ),
             Mul32(
               Add32( a, 0xb ),
               Add32( a, 0x11 ) ) );
}

And this is also cool because it works:

C:\code\tmp>skywiper_string_decrypt.exe
CreateToolhelp32Snapshot

We’re extending CodeReason into an IDA plugin that allows us to make these queries directly from IDA, which should be really cool!

The second fun thing here is that this string deobfuscator has a race condition. If two threads try to deobfuscate the same string at the same time, they will corrupt the string forever. This could be bad if you were trying to do something important with an obfuscated string, as it would result in passing bad data to a system service or similar, which could have very bad effects.

I’ve used CodeReason to attack string obfuscations that were implemented like this:

xor eax, eax
push eax
sub eax, 0x21ece84
push eax

Where the sequence of native instructions would turn non-string immediate values into string values (through a clever use of the semantics of two’s complement arithmetic) and then push them in the correct order onto the stack, thereby building a string dynamically each time the deobfuscation code ran. CodeReason was able to look at this and, using a very simple peephole optimizer, convert the code into a sequence of memory writes of string immediate values, like:

MEMWRITE[esp] = '.dll'
MEMWRITE[esp-4] = 'nlan'

Conclusions

Having machine code in a form where it can be optimized and understood can be kind of powerful! Especially when that is available from a programmatic library. Using CodeReason, we were able to extract the semantics of string obfuscation functions and automatically implement a string de-obfuscator. Further, we were able to simplify obfuscating code into a form that expressed the de-obfuscated string values on their own. We plan to cover additional uses and capabilities of CodeReason in future blog posts.

Use our suite of Ethereum security tools


Two years ago, when we began taking on blockchain security engagements, there were no tools engineered for the work. No static analyzers, fuzzers, or reverse engineering tools for Ethereum.

So, we invested significant time and expertise to create what we needed, adapt what we already had, and refine the work continuously over dozens of audits. We’ve filled every gap in the process of creating secure blockchain software.

Today, we’re happy to share most of these tools in the spirit of helping to secure Ethereum’s foundation.

Think of what follows as the roadmap. If you are new to blockchain security, just start at the top. You have all you need. And, if you’re diligent, less reason to worry about succumbing to an attack.

Development Tools

To build a secure Ethereum codebase: get familiar with known mistakes to avoid, run a static analysis on every new checkin of code, fuzz new features, and verify your final product with symbolic execution.

1. Not So Smart Contracts

This repository contains examples of common Ethereum smart contract vulnerabilities, including real code. Review this list to ensure you’re well acquainted with possible issues.

The repository contains a subdirectory for each class of vulnerability, such as integer overflow, reentrancy, and unprotected functions. Each subdirectory contains its own readme and real-world examples of vulnerable contracts. Where appropriate, contracts that exploit the vulnerabilities are also provided.

We use these examples as test cases for our Ethereum bug-finding tools, listed below. The issues in this repository can be used to measure the effectiveness of other tools you develop or use. If you are a smart contract developer, carefully examine the vulnerable code in this repository to fully understand each issue before writing your own contracts.

2. Slither

Slither combines a set of proprietary static analyses on Solidity that detect common mistakes such as bugs in reentrancy, constructors, method access, and more. Run Slither as you develop, on every new checkin of code. We continuously incorporate new, unique bugs types that we discover in our audits.

Slither is privately available to all firms that work with us, and may become available for licensing or accessible via an API if there’s enough interest. Sign up to get notified if Slither becomes available.

Running Slither is simple: $ slither.py contract.sol

Slither will then output the vulnerabilities it finds in the contract.

3. Echidna

Echidna applies next-generation smart fuzzing to EVM bytecode. Write Echidna tests for your code after you complete new features. It provides simple, high coverage unit tests that discover security bugs. Until your app has 80+% coverage with Echidna, don’t consider it complete.

Using Echidna is simple:

  1. Add some Echidna tests to your existing code (like in this example),
  2. Run ./echidna-test contract.sol, and
  3. See if your invariants hold.

If you want to write a fancier analysis (say, abstract state machine testing), we have support for that too.

4. Manticore

Manticore uses symbolic execution to simulate complex multi-contract and multi-transaction attacks against EVM bytecode. Once your app is functional, write Manticore tests to discover hidden, unexpected, or dangerous states that it can enter. Manticore enumerates the execution states of your contract and verifies critical functionality.

If your contract doesn’t require initialization parameters, then you can use the command line to easily explore all the possible executions of your smart contract as an attacker or the contract owner:

manticore contract.sol --contract ContractName --txaccount [attacker|owner]

Manticore will generate a list of all the reachable states (including assertion failures and reverts) and the inputs that cause them. It will also automatically flag certain types of issues, like integer overflows and use of uninitialized memory.

Using the Manticore API to review more advanced contracts is simple (a minimal sketch follows these steps):

  1. Initialize your contract with the proper values
  2. Define symbolic transactions to explore potential states
  3. Review the list of resulting transactions for undesirable states
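Putting the three steps together, here is a rough sketch with the Python API; the contract file and method name are hypothetical, and the exact API may differ between releases:

from manticore.ethereum import ManticoreEVM

m = ManticoreEVM()

# 1. Initialize the contract with the proper values
owner = m.create_account(balance=1000)
contract = m.solidity_create_contract(open('contract.sol').read(), owner=owner)

# 2. Define symbolic transactions to explore potential states
sym_arg = m.make_symbolic_value()
contract.withdraw(sym_arg)  # 'withdraw' is a hypothetical method name

# 3. Review the list of resulting transactions for undesirable states
m.finalize()  # writes the discovered transactions to the workspace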

Reversing Tools

Once you’ve developed your smart contract, or you want to look at someone else’s code, you’ll want to use our reversing tools. Load the binary contract into Ethersplay or IDA-EVM. For an instruction set reference, use our EVM Opcodes Database. If you’d like to do more complex analysis, use Rattle.

1. EVM Opcode Database

Whether you’re stepping through code in the Remix debugger or reverse engineering a binary contract, you may want to look up details of EVM instructions. This reference contains a complete and concise list of EVM opcodes and their implementation details. We think this is a big time saver when compared to scrolling through the Yellow Paper, reading Go/Rust source, or checking comments in StackOverflow articles.

2. Ethersplay

Ethersplay is a graphical EVM disassembler capable of method recovery, dynamic jump computation, source code matching, and binary diffing. Use Ethersplay to investigate and debug compiled contracts or contracts already deployed to the blockchain.

Ethersplay takes EVM bytecode as input in either ascii hex encoded or raw binary format. Examples of each are test.evm and test.bytecode, respectively. Open the test.evm file in Binary Ninja, and it will automatically analyze it, identify functions, and generate a control flow graph.

Ethersplay includes two Binary Ninja plugins to help. “EVM Source Code” will correlate contract source to the EVM bytecode. “EVM Manticore Highlight” integrates Manticore with Ethersplay, graphically highlighting code coverage information from Manticore output.

3. IDA-EVM

IDA-EVM is a graphical EVM disassembler for IDA Pro capable of function recovery, dynamic jump computation, applying library signatures, and binary diffing using BinDiff.

IDA-EVM allows you to analyze and reverse engineer smart contracts without source. To use it, follow the installation instructions in the readme, then open a .evm or .bytecode file in IDA.

4. Rattle

Rattle is an EVM static analyzer that analyzes EVM bytecode directly for vulnerabilities. It does this by disassembling and recovering the EVM control flow graph, then lifting the operations to a Static Single Assignment (SSA) form called EVM::SSA. EVM::SSA optimizes out all pushes, pops, dups, and swaps, often reducing the instruction count by 75%. Rattle will eventually support storage, memory, and argument recovery, as well as static security checks similar to those implemented in Slither.

Rattle is privately available to all firms that work with us, and may become available for licensing or accessible via an API if there’s enough interest. Sign up to be notified if Rattle becomes available.

To use Rattle, supply it runtime bytecode from solc or extracted directly from the blockchain:

$ ./rattle -i path/to/input.evm

Work with us!

Please, use the tools, file issues in their respective repos, and participate in their feature and bug bounties. Let us know how they could be better on the Empire Hacking Slack in #ethereum.

Now that we’ve introduced each tool, we plan to write follow-up posts that dig into their details.
