Aug 31, 2017

Internal Representation of Source Code Analysis.



Continuing from previous discussion:


Before implementing the deep learning algorithm, we need to specify an internal representation that can facilitate deep learning.

What are the basic requirements for representations?

a.   Source code to internal representation must be one-to-one mappable.   I.e., given the source code, there is only one way to represent it internally, and vice versa: given the internal representation, it should be possible to reconstruct the source code.   The representation essentially consists of a few types of information.   Meta-data:   node types (i.e., function names or operators), and operand types and instantiations (different types of variables, and different instantiations of the same type, should be treated the same).   And lastly DATA:   the values of the variables.   But underlying all of these are just numbers.
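A minimal sketch of the "underlying all these are just numbers" point: every element of the representation is mapped to an integer through a vocabulary, and because the mapping is one-to-one, the original tokens can be reconstructed.   (The statement and token names below are illustrative, not clang output.)

```python
# Sketch: operators, operand types, and normalized variable IDs all map to
# integers, so a statement becomes a reversible vector of numbers.
vocab = {}

def to_id(token):
    # Assign the next free integer to any unseen token.
    if token not in vocab:
        vocab[token] = len(vocab)
    return vocab[token]

stmt = ["add", "i32", "VAR_0", "VAR_1"]      # illustrative statement
encoded = [to_id(t) for t in stmt]           # e.g. [0, 1, 2, 3]

# One-to-one: invert the vocabulary to reconstruct the tokens.
inverse = {v: k for k, v in vocab.items()}
decoded = [inverse[i] for i in encoded]
assert decoded == stmt
```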

b.   Source code generalized:   by "generalizing the source code" it should be possible to generate many similar examples that are "near" the sample, so that through a distance function we can identify them as belonging to the same cluster.   For example, changing variable names, or any names in general, should not affect the internal representation.   One way to do this is to assign a preallocated range of IDs for variable identification, so that all variables sequentially take an ID from this store as their variable name.
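A minimal sketch of this ID-assignment idea (not a real C parser -- it uses a simple regex over identifiers and a small keyword list, so e.g. function names are renamed too):

```python
import re

# Keep a few C keywords untouched; everything else identifier-like gets a
# sequential ID, so renaming variables leaves the representation unchanged.
KEYWORDS = {"int", "char", "if", "else", "for", "while", "return", "sizeof"}

def normalize_identifiers(source):
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name in KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"VAR_{len(mapping)}"   # next ID from the store
        return mapping[name]
    return re.sub(r"\b[A-Za-z_]\w*\b", rename, source)

# Two snippets differing only in variable names map to the same form:
a = normalize_identifiers("int x = y + z;")
b = normalize_identifiers("int foo = bar + baz;")
assert a == b    # both become "int VAR_0 = VAR_1 + VAR_2;"
```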

Using the "clang" compiler to generate the IR form of the source code should be able to achieve (a) and (b).   Generally the output of clang is operator + operand pairs, with the CWE tagged afterwards, either by examining the manual CWE labels given by the dataset, or through the refinement-step analysis mentioned below in (c).
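For concreteness, clang's `-S -emit-llvm` flags emit human-readable LLVM IR, from which operator + operand pairs can be pulled out.   The extraction below is only a string-matching sketch over one common instruction shape; a real pass would use proper IR tooling rather than regexes:

```python
import re
import subprocess

def clang_ir_command(c_file, ll_file):
    # clang -S -emit-llvm writes textual LLVM IR to a .ll file.
    return ["clang", "-S", "-emit-llvm", c_file, "-o", ll_file]

def parse_ir_line(line):
    # Sketch: split a two-operand IR instruction such as
    #   %add = add nsw i32 %x, %y
    # into (operator, [operands]); returns None for other shapes.
    m = re.match(r"\s*(%\S+)\s*=\s*(\w+)[^%]*(%\S+),\s*(%\S+)", line)
    if not m:
        return None
    _result, op, lhs, rhs = m.groups()
    return op, [lhs, rhs]

cmd = clang_ir_command("sample.c", "sample.ll")
# To actually produce the IR (requires clang on PATH):
#   subprocess.run(cmd, check=True)
print(parse_ir_line("%add = add nsw i32 %x, %y"))   # → ('add', ['%x', '%y'])
```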

c.   Source code, with additional constraints on its input, may be reclassified into different classes.   For example, for "memcpy(a,b,c)", if c > sizeof(a), then execution will lead to an illegitimate condition.   This reclassification, or refinement of the classification based on imposed input values, is an added luxury:   it goes beyond what the source code itself says, and MAY OR MAY NOT be true, since dynamic/runtime data are not available.   Thus the whole class of inputs is assigned, and perhaps only a subset is achievable.
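A hypothetical sketch of this refinement step for the memcpy example.   The function name and the CWE-787 tag (out-of-bounds write) are illustrative choices, not part of any established pipeline; `dest_size` stands in for sizeof(a):

```python
# Refine the class of a parsed memcpy(a, b, c) call when an input
# constraint on c is imposed.
def refine_memcpy(dest_size, copy_len):
    if copy_len > dest_size:
        # The imposed constraint c > sizeof(a) implies an out-of-bounds
        # write -- whether it is reachable at runtime remains unknown.
        return "CWE-787"
    return "benign"

print(refine_memcpy(dest_size=16, copy_len=32))   # → CWE-787
print(refine_memcpy(dest_size=64, copy_len=32))   # → benign
```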

d.  Memory:   the system must remember what it has previously learned.   So an architecture like an RNN or LSTM will be good, as it can keep part of its previous learning as part of the weights for future decision making.   It is conjectured that less variation in the weights of the RNN or LSTM will make it less dependent on massive training.   Another conjecture is that a smaller range of variation in the meta-data will mean less training is needed.

e.   Concept of "function":   all functions can be treated as black-box operators, so they still conform to the operator + operand + CWE vector.

f.   Concept of jumping/branching:   it will be an operator (== goto) + the target ID of a tagged line (all executable lines are tagged).
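A small sketch of this encoding (the `LINE_` naming and the "none" CWE slot are illustrative assumptions):

```python
# A branch is just another [operator, operand, CWE] vector whose operand is
# the target line's ID; executable lines are tagged with sequential IDs first.
def encode_branch(target_line_id):
    return ["goto", [f"LINE_{target_line_id}"], "none"]

print(encode_branch(7))   # → ['goto', ['LINE_7'], 'none']
```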

g.   Sequence of instructions:   a sequence of instructions leads to a sequence of operator + operand + CWE vectors forming a "Window of Focus" for RNN/LSTM learning.   This is mainly to convey the CONTEXT of a sequence of C source lines as constituting a vulnerable condition; a single line of source code has minimal meaning by itself.   Taking the diagram below as an example, all the C source lines together form the "Window of Focus" sequence.
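A sketch of how such windows could be cut from a statement sequence.   The example statements and their CWE tags are hypothetical (CWE-242 is "use of inherently dangerous function", a plausible tag for a gets() call):

```python
# Turn a list of per-statement [operator, operands, CWE] vectors into
# fixed-length "Window of Focus" sequences for RNN/LSTM training.
def windows(vectors, size):
    return [vectors[i:i + size] for i in range(len(vectors) - size + 1)]

stmts = [
    ("alloca", ["VAR_0"],         "none"),
    ("call",   ["gets", "VAR_0"], "CWE-242"),   # hypothetical tag
    ("ret",    ["VAR_0"],         "none"),
]
for w in windows(stmts, 2):
    print(w)
```

Each window carries the surrounding context, so the learner sees the vulnerable call together with its neighbours rather than in isolation.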



h.   It is desirable not to have to generate an extremely large dataset to cater for all possible vulnerable conditions of the same type.   Therefore, given a limited dataset and a limited source of patterns, the system should be able to learn and recognize that limited range of patterns well.   The potential benefit a neural network can bring is flexibility: with minor changes in the environment or in the source code construction, the vulnerable pattern is still recognizable.

In summary for each C statement:

1.   Use clang to generate the IR representation.
2.   From clang output, identify all input variables, output variables, and operator.
3.   From a sequence of clang outputs, form a sequence of [operator, operand, CWE] vectors.
4.   Send for RNN/LSTM training.
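The summary steps above, sketched end to end (a minimal sketch only: a real pipeline would parse clang's IR output properly rather than splitting instruction text on whitespace, and the IR lines shown are illustrative):

```python
# Steps 2-3: from IR lines plus per-line CWE labels, build the sequence of
# [operator, operands, CWE] vectors that would be sent to RNN/LSTM training.
def build_training_sequence(ir_lines, cwe_labels):
    sequence = []
    for line, cwe in zip(ir_lines, cwe_labels):
        op, rest = line.split(None, 1)      # crude operator/operand split
        sequence.append([op, rest.split(), cwe])
    return sequence

ir = ["store i32 0, i32* %n", "load i32, i32* %n"]
seq = build_training_sequence(ir, ["none", "none"])
print(seq[0])   # → ['store', ['i32', '0,', 'i32*', '%n'], 'none']
# `seq` is the [operator, operand, CWE] sequence of step 3, ready for step 4.
```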
