Aug 13, 2017

On the nature of vulnerability discovery via source code hyperspace partitioning

Continuing from the previous blog:


Given any piece of software, substituting concrete values for its inputs generates a contour of the "software hyperspace" as an output.

This "output" will have different attributes and behaviors - some legal and some not - and a definition must be imposed or constructed to categorize it.   For example, writing to or reading from unallocated memory, reusing freed memory, etc. are all considered illegal.

Therefore, with all the input values substituted and the software hyperspace computed, at every stage of the software's execution (time is another input dimension of the hyperspace), the hyperspace can be partitioned into legal and illegal regions.    So long as there exists any point where illegal behavior is exhibited, the software is classified as "vulnerable".
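As a toy illustration of this classification, here is a sketch with a hypothetical `trace` function standing in for instrumented execution of a real program; the "hyperspace" is swept over both input values and execution steps, and the program is flagged as vulnerable if any point is illegal:

```python
# Toy rendering of the "hyperspace": sweep input values and execution
# steps, check a legality predicate at each point, and classify the
# program as vulnerable if ANY point exhibits illegal behavior.
def trace(x):
    # hypothetical program: indexes a 4-element buffer with x + t at each step
    buf_len = 4
    for t in range(3):                     # time is another input dimension
        index = x + t
        yield 0 <= index < buf_len         # legality at hyperspace point (x, t)

def is_vulnerable(inputs):
    # vulnerable iff any (input, time) point in the hyperspace is illegal
    return any(not legal for x in inputs for legal in trace(x))

print(is_vulnerable(range(0, 2)))   # False: indices stay in bounds
print(is_vulnerable(range(0, 4)))   # True: input 2 walks off the buffer
```

The exhaustive sweep is of course the whole problem - the question below is whether the search can be made smarter than enumeration.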

Question:   given any software hyperspace, is it possible to do gradient descent to search for the vulnerable region?

This hyperspace is likely to be unique to each program globally.   But locally - taking, say, a few lines of source code - the hyperspace structure varies less, and the "vulnerable" patterns can be constrained to an even smaller subset, implying it is possible to determine the vulnerability of a program if enough variations of input have been given during training.   The variation of the hyperspace is important - a contour with a wildly varying slope will make a gradient descent algorithm impossible, whereas locally smoother regions remain tractable.
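A minimal sketch of the gradient descent question, under a strong assumption: that an instrumented run can emit a smooth "badness" score measuring how close an execution came to an illegal state. Here the score is faked with a toy contour whose illegal region sits at input (3, -2); a real system would have to derive such a surrogate from instrumentation.

```python
import numpy as np

# Hypothetical smooth "badness" score over the input space; 0 means the
# illegal state is reached. Faked here as a quadratic bowl around (3, -2).
def badness(x):
    return -np.sum((x - np.array([3.0, -2.0])) ** 2)

def numerical_gradient(f, x, eps=1e-4):
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

def ascend_to_illegal(x, lr=0.1, steps=200):
    # gradient ascent on badness == descending toward the illegal region
    for _ in range(steps):
        x = x + lr * numerical_gradient(badness, x)
        if badness(x) > -1e-3:        # close enough to the illegal point
            break
    return x

print(ascend_to_illegal(np.array([0.0, 0.0])))  # converges near (3, -2)
```

This works only because the toy contour is smooth; as noted above, a wildly varying slope would defeat exactly this procedure.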

It is thus not useful to generate lots of data for the whole piece of software, but rather lots of data for small segments of the program, with the system trained to determine each as vulnerable or not.   Training can be supervised or unsupervised, or the outputs of both can be combined through Bayesian estimation to derive a probabilistically stronger decision.
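The Bayesian combination step might look like the following sketch, assuming (hypothetically) that the two detectors' outputs are conditionally independent given the ground truth:

```python
# Combine a supervised and an unsupervised detector's posteriors via
# Bayes' rule, under an assumed conditional-independence model.
def bayes_combine(p_sup, p_unsup, prior=0.5):
    # p_sup, p_unsup: each detector's P(vulnerable | its own evidence),
    # both computed under the same prior.
    prior_odds = prior / (1 - prior)

    def lr(p):  # likelihood ratio implied by a posterior under `prior`
        return (p / (1 - p)) / prior_odds

    posterior_odds = prior_odds * lr(p_sup) * lr(p_unsup)
    return posterior_odds / (1 + posterior_odds)

# Two weakly confident detectors reinforce each other:
print(bayes_combine(0.7, 0.8))   # ~0.90, stronger than either alone
```

Agreement pushes the combined posterior above either individual score, which is the "probabilistically higher decision" wanted here.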

To repeat the question:

Given many different source codes - with different structures of input data, software structure, and output behavior - is it possible to recognize the vulnerable pattern whenever it exists, perhaps with input values substituted?

For this:   software callgraph + dependency graph + input data + program instructions + output data can be used as a vector sequence, perhaps remapped to each LLVM bitcode or assembly instruction, so that a full execution is all the vectors in succession.   Certain instructions or features may be more important than others - for example, an array's allocated size has to be matched against its copy operations.   This "importance" will be reflected in the multilayer training, given that lots of data have been fed in.   Another important feature is the capability of "copycat learning", i.e., whatever knowledge is learned about array size allocation (which must be larger than the copy size) should be capable of being "abstracted" and "duplicated" into other networks.
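One hypothetical encoding of such a vector sequence (the opcode vocabulary and feature layout here are invented for illustration): each execution step concatenates a one-hot opcode, the operand values observed at runtime, and a static feature such as the allocation size.

```python
import numpy as np

OPCODES = ["alloca", "store", "load", "memcpy", "ret"]  # toy vocabulary

def encode_step(opcode, operands, alloc_size=0.0):
    # one-hot opcode + up to 2 runtime operand values + a static feature
    onehot = np.zeros(len(OPCODES))
    onehot[OPCODES.index(opcode)] = 1.0
    ops = np.zeros(2)
    ops[:len(operands)] = operands[:2]
    return np.concatenate([onehot, ops, [alloc_size]])

# A trace where a 16-byte buffer receives a 32-byte copy - exactly the
# size/copy mismatch a trained network should learn to flag:
trace = [
    encode_step("alloca", [16.0], alloc_size=16.0),
    encode_step("memcpy", [32.0, 0.0], alloc_size=16.0),
    encode_step("ret", []),
]
sequence = np.stack(trace)      # shape: (steps, features)
print(sequence.shape)           # (3, 8)
```

The vulnerable signal - copy length exceeding alloc_size - is present in the vectors themselves, so it is at least learnable in principle.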

Unlike natural languages, which have a limited vocabulary, the input/output space of software is continuous.   Past algorithms like seq2seq or recurrent NNs may therefore not be enough: these can be used on the graph representation of the source code, but NOT for runtime analysis, where all possible input and output values have to be considered - unless the input/output values are quantized into different levels of coarseness for training.   A proper feedback system based on reinforcement learning (which is essentially equivalent to state-of-the-art AFL fuzzing - to be elaborated in another blog) is needed to learn the proper values to substitute in order to generate the right output - specifically "illegal" output, as bug hunting is the objective.
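The AFL-style feedback loop can be sketched minimally as follows; `run_target` is a stand-in for instrumented execution (real AFL instruments compiled binaries and tracks edge coverage in shared memory - none of that machinery is shown here). New coverage plays the role of the reinforcement reward:

```python
import random

# Minimal coverage-guided fuzz loop in the spirit of AFL: mutate inputs,
# keep those that reach new "coverage", stop on illegal behavior.
def run_target(data):
    # hypothetical target: crashes on the 2-byte input 0xde 0xad
    cov = set()
    if len(data) > 0 and data[0] == 0xDE:
        cov.add("branch_a")
        if len(data) > 1 and data[1] == 0xAD:
            cov.add("branch_b")
            return cov, True          # "illegal" output, e.g. a crash
    return cov, False

def fuzz(seed, rounds=50000, rng=None):
    rng = rng or random.Random(0)
    corpus, seen = [seed], set()
    for _ in range(rounds):
        data = bytearray(rng.choice(corpus))
        data[rng.randrange(len(data))] = rng.randrange(256)  # mutate one byte
        cov, crashed = run_target(data)
        if crashed:
            return bytes(data)
        if cov - seen:                # reward: new coverage -> keep the input
            seen |= cov
            corpus.append(bytes(data))
    return None

print(fuzz(b"\x00\x00"))   # typically finds the crashing input b"\xde\xad"
```

Keeping inputs that earn new coverage is what lets random mutation climb the two-stage comparison, which blind mutation of the seed would almost never pass in one shot.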

So the optimal representation for vulnerability discovery may be a mix of all the above neural networks, each working independently, before combining the results together at the end.
