Nov 7, 2017

Reversing binaries to source code via Deep Learning

Before we delve into the Deep Learning part: is it technically possible to reverse a binary back to its source?

Let's start with the problem of recovering signed versus unsigned types.

Look at the original source code:



And now we analyze the binary:



And next see the modified source:



And its binary:



So the conclusion: in this example there is NO difference at the binary level between "int x" and "unsigned int x".
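As a hedged illustration of why this can happen (the function names here are made up for the sketch and are not the listing from this post): in two's complement, adding two values gives the same result bits whether they are treated as signed or unsigned, so a compiler typically emits the same add instruction for both.

```c
#include <stdint.h>
#include <string.h>

/* Add two values as signed 32-bit integers and return the raw result bits.
   Hypothetical helper; the caller must avoid signed overflow. */
uint32_t add_as_signed(int32_t a, int32_t b)
{
    int32_t sum = a + b;
    uint32_t bits;
    memcpy(&bits, &sum, sizeof bits);   /* copy the bit pattern unchanged */
    return bits;
}

/* Add the same bit patterns as unsigned 32-bit integers. */
uint32_t add_as_unsigned(uint32_t a, uint32_t b)
{
    return a + b;                       /* wraps modulo 2^32 */
}
```

Calling add_as_signed(-5, 3) and add_as_unsigned((uint32_t)-5, 3u) gives the same 32 bits, which is why the type cannot be read back from an addition alone.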

Worse, parts of the source may have been optimized away during compilation, so a given binary can originate from infinitely many source variations, possibly written with different intentions, that all compile to the same piece of binary code. These cases can be classified as potential unknown software bugs.

Take this as an example:
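A minimal sketch of the idea, using two hypothetical functions that are not taken from this post: they are written differently but behave identically, and an optimizing compiler may well reduce both to the same instructions. Whether it actually does is something to check with objdump rather than assume.

```c
/* Straightforward version. */
int scale_plain(int x)
{
    return x * 4;
}

/* Same observable behavior, written with a different intention:
   an unused temporary and repeated additions instead of a multiply. */
int scale_verbose(int x)
{
    int debug_copy = x;     /* never read; an optimizer may drop it */
    (void)debug_copy;
    int twice = x + x;
    return twice + twice;
}
```

If both functions end up as the same machine code, nothing in the binary says which one the author wrote.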



and its compiled output:



From the above, two conclusions:

1.   It may not be possible to recover the original source that generated the binary once it has been optimized.

2.   For the signed/unsigned difference, there may be other indirect ways of identifying the original type as SIGNED or UNSIGNED, but at the simplest level the compiled output is the same.
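One such indirect way, sketched here with hypothetical helper names: comparison and division do depend on signedness, so the compiler has to pick different instructions for them (on x86, typically signed versus unsigned condition codes, and signed versus unsigned divide or shift). Seeing one of these applied to a variable hints at its original type.

```c
#include <stdint.h>

/* Signed comparison: -1 is less than 1. */
int less_signed(int32_t a, int32_t b)
{
    return a < b;
}

/* Unsigned comparison: the bit pattern of -1 is the largest value. */
int less_unsigned(uint32_t a, uint32_t b)
{
    return a < b;
}

/* Signed division rounds toward zero and keeps the sign. */
int32_t half_signed(int32_t a)
{
    return a / 2;
}

/* Unsigned division treats the top bit as part of the magnitude. */
uint32_t half_unsigned(uint32_t a)
{
    return a / 2u;
}
```

The same input bits give different answers here, unlike plain addition or assignment, so these operations leave a trace of the type in the binary.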

For the Deep Learning part, we first need to generate the training data. How do we do it?

Given the C program, compile it with "gcc -g -c unsigned.c", then run "objdump -S -d unsigned.o", and you will get the following:



From the above, you can immediately derive the mapping between each C line and its assembly. Training a system on these pairs with an LSTM could be the next target.

Better still, we can recompile the generated source and feed the result back into the system to be decompiled again. If the compilation succeeds, the first decompilation is likely to be correct. After many rounds of this "Generative Adversarial" approach, the system could potentially teach itself many other rules that are never explicitly taught but are embedded inside the compiler, and are now implicitly transferred to the system via this "GAN" type of training.

At the end of the process, you will get a system that can convert short segments of disassembly to C and vice versa.

That still leaves the problem of understanding how the segments join together.
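A successful compile only shows that the generated C is valid. A stricter feedback signal, offered here as an assumption rather than something this post specifies, is to compare the recompiled bytes against the original ones. The hypothetical helper below scores how closely two byte buffers agree.

```c
#include <stddef.h>

/* Fraction of positions at which two byte buffers agree, in [0, 1].
   If the lengths differ, the extra bytes count as mismatches.
   Hypothetical scoring function for the round-trip check. */
double byte_match_ratio(const unsigned char *a, size_t a_len,
                        const unsigned char *b, size_t b_len)
{
    size_t longer = a_len > b_len ? a_len : b_len;
    size_t shorter = a_len < b_len ? a_len : b_len;
    if (longer == 0)
        return 1.0;                 /* two empty buffers match trivially */
    size_t same = 0;
    for (size_t i = 0; i < shorter; i++)
        if (a[i] == b[i])
            same++;
    return (double)same / (double)longer;
}
```

A score of 1.0 means the round trip reproduced the original bytes exactly. Lower scores could serve as a graded training signal, although different compiler versions or flags can change the bytes even for correct source.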



