Aug 5, 2017

Achieving software bug-hunting via Deep Learning techniques

Abstract:

Given a software program as input, it is possible to deduce many properties of the program - perhaps representing them as a CFG, a PDG, or simply a list of software vulnerabilities. Each of these can be treated as a "sentence" that describes some property in detail.   Therefore, if we can collect or generate as many such sentences as possible, our "understanding" of the program becomes better.    So if a Deep Learning system can be trained to derive these "sentences", perhaps with a special focus on, for example, vulnerabilities, it becomes possible to discover the vulnerability properties of the program.

"Conceptual understanding" will be a composite of "many sentences" - all these different sentence will describe the different properties of a program in different ways, from different perspective etc.

Therefore, given a software program as input, train the system to generate as many target "sentences" as possible - sentences that describe all the necessary properties of the program, focusing on its vulnerabilities and bugs.    Other targets, for example the transformation properties of the program, will enable us to comment on the objectives of the software.

a.   Big Picture Intuition:   mapping bug-finding technologies to the current state of the art in Deep Learning
b.   Fundamental problems to be solved
c.   How To:   Construction of the system to solve the problem
d.   References

Big Picture Intuition

To describe the big picture of how to "find programming bugs", an intuitive description that maps bug-finding to the current state of the art in Deep Learning and AI is given here.

First, we know that Deep Learning is good at pattern identification:  be it visual, audio, abstract, or even mappings between the constructs of different languages.

Second, we know that Deep Learning can identify temporally dependent patterns:   language patterns, musical patterns, etc.   So a sequential flow of concepts, or a "logical construction of images", can be represented.

Third, we know that Deep Learning needs lots of data.   For a program, these data can be generated through statistical means, i.e., Monte Carlo sampling of the input space, with the output space then derived directly by running the program.
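As a rough sketch of this Monte Carlo data generation (the "./target" path and the two-argument interface are assumptions for illustration, not part of the original post), we can draw random command-line arguments, run the program under test, and record each (input, output, exit status) tuple as one data point:

# Monte Carlo sampling of the input space: run the program under test on
# random inputs and record the observed outputs.  Sketch only; assumes a
# command-line binary at "./target" taking two string arguments.
import random
import string
import subprocess

def random_arg(max_len=32):
    # One random printable string used as a command-line argument.
    length = random.randint(0, max_len)
    return "".join(random.choice(string.printable) for _ in range(length))

def sample_io_pairs(n_samples=1000, target="./target"):
    pairs = []
    for _ in range(n_samples):
        args = [random_arg(), random_arg()]
        try:
            proc = subprocess.run([target] + args, capture_output=True, timeout=5)
            pairs.append((args, proc.stdout, proc.returncode))
        except subprocess.TimeoutExpired:
            pairs.append((args, b"", None))   # hang: record it with no output
    return pairs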

Borrowing ideas from Reinforcement Learning, such as those in DeepMind's AlphaGo, where past (good) data is used to generate and identify more future (good) data, the same methodology can be applied to small program fragments.   Here, instead of beating the opponent as the decision criterion, in bug-finding we use any illegal behavior of the software as the decision criterion for retaining data - everything else is ignored and thrown away.   So "good" data here always means an illegitimate outcome of the software's behavior.

These data can be sliced in many different ways to collect statistical patterns (in Deep Learning this is analogous to ensembling, or to subsampling the dataset into mini-batches), i.e., a subset of the data often has a pattern of its own.   For example, we can focus on the boundary limits of the inputs and the corresponding boundary limits of the outputs, to derive a mathematical hyperspace describing how the two are correlated.
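For instance, one such "slice" is the boundary limits just mentioned. Building on the hypothetical sample_io_pairs() sketch above, the minimum and maximum input/output sizes can be extracted directly:

def boundary_slice(pairs):
    # One "slice" of the sampled data: the boundary limits of the inputs
    # and of the corresponding outputs (here, simply their lengths).
    input_lens = [len(a) for args, _, _ in pairs for a in args]
    output_lens = [len(out) for _, out, _ in pairs]
    return {"input_len_min": min(input_lens), "input_len_max": max(input_lens),
            "output_len_min": min(output_lens), "output_len_max": max(output_lens)}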

Another idea for reducing the need for data is intermediation - for example, in language translation:


If you have N source languages and M destination languages, there are N x M possible translation linkages.

But if you use one language (e.g., English) as the interlingua connecting the N and M languages, then you only need N + M translation linkages (for example, with N = M = 10, that is 20 linkages instead of 100).   Fewer weights to be calculated means less data and less training needed.  To encourage the pairings to be reused (via Transfer Learning, whereby the neural-net weights are reused and retraining is avoided), pairing two concepts to form a sequence is ideal - duplicating such a pair and reconnecting it to any other network is then easy.

Fourth, in Deep Learning we need a cost function - which defines the composite set of patterns the system should strive to achieve during "learning".   Since we are interested in finding the bugs in the software, the cost function should therefore measure the distance away from the software bugs.  To encode this "software bug" we will take intuition from image recognition and understanding techniques.
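One hedged way to make "distance away from the software bugs" concrete (this particular formulation is an assumption for illustration, not the post's definition) is to have the network predict, for each sampled input, the probability that it triggers a crash, and to penalize it with binary cross-entropy against the crash label observed when the program was actually run:

import math

def bug_distance_loss(predicted_crash_probs, observed_crashes):
    # Binary cross-entropy: small when the model assigns high probability
    # to inputs that really crashed the program and low probability to
    # inputs that ran cleanly.
    eps = 1e-7
    total = 0.0
    for p, crashed in zip(predicted_crash_probs, observed_crashes):
        p = min(max(p, eps), 1.0 - eps)
        y = 1.0 if crashed else 0.0
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(predicted_crash_probs)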

Fifth, in image recognition, we can attach a linguistic description to the image (captioning, see below).



[Figure: seq2seq image captioning]

And many different transformations of the image (i.e., video frames) will still retain the same linkage to the same description of the image after training.   To satisfy the definition of a "software bug", many different sequences of the above image-description pairs are needed: each "image" is a temporal snapshot of the program state, and a sequence of these states represents a "legitimate" transition sequence.

Another analogy is language translation.   Many different sentences can be formed to describe the same program in many different ways, and remapping between these different sentences can be achieved through translation:


[Figure: seq2seq encoder-decoder]

With the above mapping between program understanding and LSTM/seq2seq representations, code-clone detection, or program equivalence, can be achieved through Deep Learning.
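As a minimal sketch of such a seq2seq mapping (assuming PyTorch, and assuming the two program representations have already been tokenized into integer sequences; the vocabulary sizes and the toy training pair below are made up for illustration), an encoder GRU can read one representation, e.g. tokenized disassembly, while a decoder GRU emits another, e.g. a linearized control-flow description:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    # Encoder-decoder over token sequences: source representation in,
    # target representation out.
    def __init__(self, src_vocab, tgt_vocab, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        _, state = self.encoder(self.src_emb(src_tokens))
        dec_out, _ = self.decoder(self.tgt_emb(tgt_tokens), state)
        return self.out(dec_out)              # logits over target vocabulary

# Toy usage: one made-up (source, target) token pair, trained with teacher forcing.
model = Seq2Seq(src_vocab=1000, tgt_vocab=500)
src = torch.randint(0, 1000, (1, 20))         # e.g. tokenized disassembly
tgt = torch.randint(0, 500, (1, 15))          # e.g. tokenized CFG description
logits = model(src, tgt[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 500), tgt[:, 1:].reshape(-1))
loss.backward()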

Furthermore, different inputs, resulting in different transition sequences, will altogether form a "conceptual understanding" of the program.

"Conceptual understanding" will be a composite of "many sentences" - all these different sentence will describe the different properties of the program in different ways, from different perspective etc.

[Figure: seq2seq image captioning]


So to "understand a program", we can:

a.   Describe the program with as many attributes as possible (e.g., a collection of input-output boundary limits at every temporal stage of the program's execution).

b.   Understand how these different descriptions can temporally affect each other.

c.   Part of these descriptions will differentiate between different inputs that result in different output behavior.

d.   In the area of bug-finding, we focus only on bugs, and thus on "illegitimate" output or behavior as the objective; all other descriptive properties are thrown away to reduce the hyperparameter space.

e.   Learning all the possible "illegitimate" behaviors may take a long time, so to hasten the learning process, we can encode the known CVE vulnerabilities into different cost functions for optimization (see the sketch after this list).
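As a hedged illustration of point (e) (the chosen CWE categories and the layer sizes are made up for illustration, not taken from the post), each known vulnerability class can be given its own output, with a multi-label binary cross-entropy cost summed over the classes:

import torch
import torch.nn as nn

CWE_CLASSES = ["CWE-121", "CWE-134", "CWE-190", "CWE-476"]   # illustrative subset

class CweClassifier(nn.Module):
    # Maps an encoded program fragment to one logit per CWE category.
    def __init__(self, feature_dim=128, n_classes=len(CWE_CLASSES)):
        super().__init__()
        self.head = nn.Linear(feature_dim, n_classes)

    def forward(self, fragment_encoding):
        return self.head(fragment_encoding)

# Toy usage: a made-up fragment encoding labeled with CWE-121 and CWE-134.
model = CweClassifier()
encoding = torch.randn(1, 128)
labels = torch.tensor([[1.0, 1.0, 0.0, 0.0]])
loss = nn.BCEWithLogitsLoss()(model(encoding), labels)
loss.backward()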

All this pattern recognition and identification can also be considered "common sense".

Another way to look at it: among the many possible outcomes, or outputs, that can be generated, those retained are the most important ones learned from past experience.    Connecting all these "important experiences" together and linking them back to the different inputs will thus let us comment on the inputs intelligently.

To quote from John McCarthy's Programs with Common Sense (part of the quotation is not true today):



The composite of sequences of conceptual pairs will form the "abstraction" described above.   And "simulate arbitrary behavior" means the generation of data that will, through reinforcement learning, constrain the neural network system.

Take a simple program:

#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
 char xxx[256];                          /* fixed-size stack buffer */
 sprintf(xxx, "%s my dearest %s, Good Morning %d\n",
         "Hello", argv[1], atoi(argv[2]));   /* argv is never checked */
 return 0;
}

which is meant to print a "Hello" message.

Now we ask this simple question:   what can go wrong with this program?

Answering this question is equivalent to answering the question: what are the potential vulnerabilities of this program?

Fundamental problems to be solved

The fundamental problems in bug-hunting are:

a.    Read and understand where the program can go wrong.

b.   The program may be very large, so it is necessary to apply a "window of focus" to direct attention - only the code inside the attention window will be analyzed (see the sketch after this list).

c.   The original CWE database may be very large, so it is good to randomly, or heuristically, pick a few CWEs to compare against and use for bug identification.
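For point (b), a minimal sketch of such a "window of focus" (assuming the program has already been flattened into a token list; the window and stride sizes are arbitrary illustrations) is a sliding window, so that only one window of code is analyzed at a time:

def attention_windows(tokens, window=256, stride=128):
    # Slide a fixed-size "window of focus" over a long token sequence;
    # only the code inside one window is analyzed at a time.
    if len(tokens) <= window:
        return [tokens]
    starts = list(range(0, len(tokens) - window, stride)) + [len(tokens) - window]
    return [tokens[s:s + window] for s in starts]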

How to solve the problem

This can be broken down into many subproblems.

The first subproblem is canonical program fragments as data.   Lots of basic code fragments are needed, which can then be composed together to form larger programs.   Hand-coding the basic fragments may be possible.   But another way is to take a large program, compile it into its native form, use basic-block analysis to identify all the basic blocks, and then recover the flow of control between the basic blocks.   Each basic block, and the flow between basic blocks, can be used as a canonical fragment of data.   Given any learned code fragment, we should then be able to tell whether the fragment is a "logical" piece of code or something unseen before (and therefore "illogical").   The objective is, given any code fragment, to be able to assess its "reasonableness".   For example, code that just updates a register without ever reading values from somewhere else would be "unreasonable", or buggy.
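A rough sketch of harvesting such fragments from a compiled binary (assuming a Linux machine with objdump installed and a binary at "./target"; real basic-block analysis should also split at branch targets, which this shortcut ignores):

import subprocess

BRANCHES = ("jmp", "je", "jne", "jg", "jge", "jl", "jle", "ja", "jb", "call", "ret")

def basic_blocks(binary="./target"):
    # Disassemble with objdump and cut the instruction stream after every
    # control-transfer instruction to approximate basic-block boundaries.
    asm = subprocess.run(["objdump", "-d", binary], capture_output=True, text=True).stdout
    blocks, current = [], []
    for line in asm.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:                 # section headers, labels, byte-only lines
            continue
        mnemonic = parts[2].strip()
        if not mnemonic:
            continue
        current.append(mnemonic)
        if mnemonic.split()[0] in BRANCHES:
            blocks.append(current)         # a control transfer ends the block
            current = []
    if current:
        blocks.append(current)
    return blocks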

The next problem is to build a system that generates data more likely to "crash a program", or make it reach any one of the conditions described in CWE as an exception/error condition (see the CWE Research Concepts view, https://cwe.mitre.org/data/definitions/1000.html, for a catalogue of such conditions).


Through reinforcement learning, inputs that did not result in an exception are likely to be thrown away, while those resulting in crashes are likely to be remembered - relative to the structure of the current program (which therefore has to be part of the input).

Based on past knowledge, extremal values should be attempted as soon as possible; likewise unprintable characters or negative numbers.   Based on learning, these selected values should be given first priority in the fuzzing attempt.   So the inputs are just random values, but the outputs are learned input values, for each program, that are likely to crash it.
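A minimal sketch of that prioritized fuzzing loop (again assuming the hypothetical "./target" binary taking two arguments): try the learned/extremal seed values first, fall back to random bytes, and keep only the inputs whose runs end in a signal or a hang - the "illegitimate" outcomes that reinforcement learning should remember:

import random
import subprocess

PRIORITY_SEEDS = ["", "-1", "0", str(2**31 - 1), str(-2**31),
                  "A" * 4096, "%s%s%s%n", "\xff\xfe\x01"]

def fuzz(target="./target", rounds=10000):
    crashes = []
    for i in range(rounds):
        if i < len(PRIORITY_SEEDS):
            arg = PRIORITY_SEEDS[i]        # extremal / learned values go first
        else:
            arg = "".join(chr(random.randint(1, 255))
                          for _ in range(random.randint(0, 512)))
        try:
            proc = subprocess.run([target, arg, arg], capture_output=True, timeout=5)
            if proc.returncode < 0:        # killed by a signal => crash, keep it
                crashes.append(arg)
        except subprocess.TimeoutExpired:
            crashes.append(arg)            # hangs are also "illegitimate" behavior
    return crashes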

The next problem is to learn to map static-analysis features (input), such as disassembly, to a corresponding output such as a control-flow graph, or vice versa.   Many other combinations are possible: control-flow graph to call graph, and so on.   A more abstract representation is better, and perhaps needs less training.   So, choosing between LLVM disassembly and x86 disassembly, it may be better to choose LLVM disassembly, as it does not carry register-specific information.

The next problem is to investigate the possibility of stacking the mappings above:   given a program, remap it to its control-flow graph, then to its call graph.  A deeper and richer "meaning" of the program can be captured in the deeper network.

The next problem is to establish the fundamental key features that can be used for bug identification.   This is left for another blog post.

References

John McCarthy, "Programs with Common Sense", Proceedings of the Teddington Conference on the Mechanization of Thought Processes, 1959.




