CNN for Reverse Engineering: an Approach for Function Identification


Why is CNN useful for function identification and how to implement one

Apr 14 · 7 min read
 
 

Before deploying binaries to third-party environments, it is very common to strip them of any information that is not required for them to function properly. This is done to make reverse engineering the binary more difficult. Among the information erased are the boundaries of each function, which can be extremely useful to someone who wants to reverse engineer the binary.

Function identification is a task in the reverse engineering field where, given a compiled binary, one should determine the boundaries of each function. The boundaries of a function are its start address and its end address.

Why Neural Networks?

  • There are no simple rules for recognizing the boundaries, especially for binaries that have been optimized during compilation.
  • A huge amount of data is available: it is very easy to find code to compile, or binaries already compiled with debug information, on the internet to build our dataset.
  • Almost no domain knowledge is required! One of the big advantages of neural networks (especially deep ones) is that they process raw data well, so no feature extraction is required.

The idea of using neural networks for function identification is not new. It was first introduced in a paper called Recognizing Functions in Binaries with Neural Networks by Shin et al. The authors used a bidirectional RNN to learn the function boundaries. According to their paper, they not only achieved similar or better results relative to the former state of the art but also reduced the computation time from 587 hours to 80 hours. I think this research really demonstrates the power of neural networks.

So Why CNN? (CNN vs RNN)

CNN (Convolutional Neural Network) is highly popular in computer vision tasks. One of the reasons is that a CNN captures only local features.

Local features describe patches of the input (key points in the input). For an image, for example, a local feature can be anything concerning a specific area of the image, such as a point or an edge.

Global features are features that describe the input as a whole.

An RNN, on the other hand, is a “stronger” model in the sense that it can learn both local and global features.

But stronger is not always better. Using a model capable of learning both local and global features for a task that requires only local features might lead to overfitting and increase the training time.

For function identification, it is enough, for each byte in the binary, to look at the 10 bytes before it and the 10 bytes after it to determine whether it is the start or the end of a function. This property suggests that a CNN should achieve good results on this task.
With that being said, there are global features that can help determine the boundaries of a function. For example, call opcodes can help us determine the start of a function. However, even an RNN will have a hard time learning those features, since RNNs do not perform well on long sequences (which is why, in the Shin et al. paper, the network is trained on random 1,000-byte sequences from the binary rather than on the whole binary).

In addition, unlike an RNN, which is a sequential model, a CNN can be run in parallel, which means both training and testing the network should be faster.

We are done with the introduction. Let’s code!

Code

The code is implemented in Python 3.6 using the PyTorch library.

For simplicity, we are going to implement a model that identifies the beginning of each function, but the same code can be applied to identify the ending as well.

The full code is available here:

The Data

We are going to use the same dataset Shin et al. used in their paper.
The dataset was originally created for a paper called ByteWeight: Learning to Recognize Functions in Binary Code. Shin et al. used the same dataset to compare their results to the results reported in the original paper.

The dataset is available at http://security.ece.cmu.edu/byteweight
The dataset consists of a set of binaries compiled with debug information.

We are going to use the elf_32 dataset, but the same code can be applied to the elf_64 dataset as well (and to the PE dataset, although that requires a different debug-info parsing procedure).

The dataset can be downloaded by running:

wget --recursive --no-parent --reject html,signature http://security.ece.cmu.edu/byteweight/elf_32

Preprocessing the Data

First, we need to extract from each binary its code section and its function addresses.

ELF files are composed of sections, each containing a different kind of information. The sections we are interested in are the .text section and the .symtab section.

  • .text contains the opcodes that are executed when running the binary.
  • .symtab contains information about the functions in the binary (and more).

Note that the information in the .symtab section can be stripped from the binary. This project is useful for those cases.


For parsing the sections of the binaries we are going to use the pyelftools library.

First, let’s extract the .text section data
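A minimal sketch of this step with pyelftools might look as follows (the helper name extract_code is illustrative, not from the original snippet):

from elftools.elf.elffile import ELFFile

def extract_code(path):
    # Return the raw bytes of the .text section together with its virtual load address.
    with open(path, 'rb') as f:
        elf = ELFFile(f)
        text = elf.get_section_by_name('.text')
        # data() returns the section contents, sh_addr is the address it is loaded at
        return text.data(), text['sh_addr']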

For each byte in the code of the binary, we need to determine whether it is the start of a function.
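One way to do this is to collect the addresses of all STT_FUNC symbols from .symtab and mark the matching offsets in the code. A rough sketch (the helper names are illustrative):

from elftools.elf.elffile import ELFFile

def extract_function_starts(path):
    # Return the set of virtual addresses at which functions start, read from .symtab.
    with open(path, 'rb') as f:
        elf = ELFFile(f)
        symtab = elf.get_section_by_name('.symtab')
        starts = set()
        for sym in symtab.iter_symbols():
            if sym['st_info']['type'] == 'STT_FUNC' and sym['st_value'] != 0:
                starts.add(sym['st_value'])
        return starts

def tag_code(code, base_addr, starts):
    # 1 if the byte at this offset is the start of a function, 0 otherwise.
    return [1 if base_addr + i in starts else 0 for i in range(len(code))]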

Now let's iterate over the binaries in our dataset.
We will use the tqdm library to get a nice progress bar for our preprocessing with zero effort!
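A possible sketch of that loop, reusing the illustrative helpers from above (the dataset directory path is an assumption based on the wget command):

import os
from tqdm import tqdm

# The directory below is what the wget command above creates; adjust if needed.
dataset_dir = 'security.ece.cmu.edu/byteweight/elf_32'

data, tags = [], []
for name in tqdm(os.listdir(dataset_dir)):
    path = os.path.join(dataset_dir, name)
    code, base_addr = extract_code(path)
    starts = extract_function_starts(path)
    data.append(code)
    tags.append(tag_code(code, base_addr, starts))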

Great! We have our data and its tags.

To feed the data into the model, we should not just feed it file by file. Instead, we should choose the size of the chunk we want to train the model on each time and split our data into blocks of that size.

Also, if we want the CNN to output a vector the same size as the tag vector, we need to pad the input according to the CNN kernel size.

Let’s wrap it up under a torch.utils.data.Dataset class:
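A possible version of such a class, assuming a block size of 1,000 bytes and a kernel size of 21 (10 bytes of context on each side); these numbers and the class name are illustrative, not taken from the original gist:

import torch
from torch.utils.data import Dataset

START_SYMBOL, END_SYMBOL = 256, 257

class FunctionDataset(Dataset):
    def __init__(self, data, tags, block_size=1000, kernel_size=21):
        # Pad each file so a kernel centered on any byte always has context,
        # then cut the file into fixed-size blocks.
        pad = (kernel_size - 1) // 2
        self.blocks, self.block_tags = [], []
        for code, code_tags in zip(data, tags):
            padded = [START_SYMBOL] * pad + list(code) + [END_SYMBOL] * pad
            for i in range(0, len(code), block_size):
                block = padded[i:i + block_size + 2 * pad]
                block_tags = code_tags[i:i + block_size]
                if len(block_tags) == block_size:  # drop the short trailing block for simplicity
                    self.blocks.append(torch.tensor(block, dtype=torch.long))
                    self.block_tags.append(torch.tensor(block_tags, dtype=torch.long))

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        return self.blocks[idx], self.block_tags[idx]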

Building the Model

The input of the model is going to be a vector where each value is between 0 and 257 (0–255 for the byte values, 256 as a symbol for the start of a file, and 257 as a symbol for the end of a file).
The output of the model is going to be a matrix where each row contains two values: the probability that the byte is the start of a function and the probability that it is not (the two values sum to 1).

Since every byte value represents a different symbol we would like to convert every value to a vector. The way to do so is to use an embedding layer.
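As a quick illustration of what an embedding layer does here (the embedding dimension of 64 is an arbitrary choice for the example):

import torch
import torch.nn as nn

# 258 symbols: byte values 0-255, plus 256 (start of file) and 257 (end of file).
embedding = nn.Embedding(num_embeddings=258, embedding_dim=64)
example = torch.tensor([256, 0x55, 0x89, 0xE5, 257])  # a tiny made-up byte sequence
print(embedding(example).shape)  # torch.Size([5, 64])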

A guide for embedding:

After the embedding layer, we are ready to add the convolution layer with a ReLU activation function.
Notice that we want the convolution to work on whole bytes, so the kernel size should be: the number of bytes we want to look at × the size of each byte (the output dimension of the embedding layer).

Now we add a fully connected output layer with softmax activation function.

The whole architecture:

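Putting the pieces together, the model could look roughly like this. The layer sizes below (embedding dimension 64, 128 filters, kernel size 21) are illustrative choices rather than the exact values from the original gist, and the output is log-probabilities because the training step below uses the NLL loss:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNModel(nn.Module):
    def __init__(self, vocab_size=258, embedding_dim=64, num_filters=128, kernel_size=21):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Conv1d over the byte sequence; each filter spans kernel_size whole byte embeddings
        self.conv = nn.Conv1d(embedding_dim, num_filters, kernel_size)
        self.fc = nn.Linear(num_filters, 2)

    def forward(self, x):
        # x: (batch, padded_block_length) of symbol indices
        embedded = self.embedding(x)            # (batch, length, embedding_dim)
        embedded = embedded.permute(0, 2, 1)    # (batch, embedding_dim, length)
        features = F.relu(self.conv(embedded))  # (batch, num_filters, block_size)
        features = features.permute(0, 2, 1)    # (batch, block_size, num_filters)
        out = self.fc(features)                 # (batch, block_size, 2)
        # log-probabilities, since training uses NLLLoss
        return F.log_softmax(out, dim=-1)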

And that’s the whole model!

Training and Testing the Model

First, we need to split the data into a train set and a test set. We are going to use 90% of the data for training and 10% for testing.
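For example, using random_split (the batch size is an arbitrary choice):

from torch.utils.data import DataLoader, random_split

dataset = FunctionDataset(data, tags)
train_size = int(0.9 * len(dataset))
train_set, test_set = random_split(dataset, [train_size, len(dataset) - train_size])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32)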

Now, to create the model, we can simply instantiate CNNModel.
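With the CNNModel sketch above, that is simply:

model = CNNModel()  # default hyperparameters from the sketch above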

For the training, we are going to use the negative log-likelihood loss function with the Adam optimizer. Again, we add tqdm for a nice progress bar during training.
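A sketch of the training loop under those choices (the number of epochs is illustrative):

import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm

criterion = nn.NLLLoss()                      # expects log-probabilities, which the model outputs
optimizer = optim.Adam(model.parameters())

num_epochs = 5                                # illustrative; tune as needed
model.train()
for epoch in range(num_epochs):
    for blocks, block_tags in tqdm(train_loader):
        optimizer.zero_grad()
        output = model(blocks)                # (batch, block_size, 2)
        loss = criterion(output.reshape(-1, 2), block_tags.reshape(-1))
        loss.backward()
        optimizer.step()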

For the testing, we look at four metrics: accuracy, precision, recall, and F1 score.

Accuracy alone would be insufficient for measuring performance on this task, since most bytes are not the start of a function; even a model that classified everything as “not a start of a function” would get high accuracy.
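A simple way to compute these metrics on the test set (treating class 1 as “start of a function”):

import torch

model.eval()
tp = fp = fn = correct = total = 0
with torch.no_grad():
    for blocks, block_tags in test_loader:
        preds = model(blocks).argmax(dim=-1)
        correct += (preds == block_tags).sum().item()
        total += block_tags.numel()
        tp += ((preds == 1) & (block_tags == 1)).sum().item()
        fp += ((preds == 1) & (block_tags == 0)).sum().item()
        fn += ((preds == 0) & (block_tags == 1)).sum().item()

accuracy = correct / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f'accuracy: {accuracy:.4%}  precision: {precision:.4%}  recall: {recall:.4%}  f1: {f1:.4%}')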

Results

I trained and tested the model on my personal laptop with an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz and 16 GB of RAM.

Timing Performance

Preprocessing the whole data took 43 seconds.
Training the model on 90% of the dataset took 33 minutes and 43 seconds.
Testing the model on the remaining 10% took 24 seconds.

Shin et al. reported in their paper that the training time was 2 hours for each model. Our training took roughly a quarter of that time!

Prediction Performance

The results on the test set:

accuracy: 99.9981%
precision: 99.6905%
recall: 99.4613%
f1-score: 99.5758%

The best model for classifying the start addresses of functions reported in the Shin et al. paper achieved an F1 score of 99.24%.

So there it is: a CNN that can find the start of each function in a binary. This was the first Medium article I have written.
Hope you liked it! For more information/questions, feel free to contact me.

Thanks for reading!

[1] E. C. R. Shin, D. Song, and R. Moazzezi. Recognizing Functions in Binaries with Neural Networks (2015). In Proceedings of the 24th USENIX Security Symposium.

[2] T. Bao, J. Burket, M. Woo, R. Turner, and D. Brumley. BYTEWEIGHT: Learning to Recognize Functions in Binary Code (2014). In Proceedings of the 23rd USENIX Security Symposium.

Source: https://medium.com/@alon.stern206/cnn-for-reverse-engineering-an-approach-for-function-identification-1c6af88bca43
