Continuing from the previous post:
We will analyze in depth how pattern recognition on source code can be achieved, specifically how to classify vulnerable code according to its CWE class:
From the above list, the most common and serious are these five:
CWE-119: Failure to Constrain Operations within the Bounds of a Memory Buffer
CWE-362: Race Condition
CWE-416: Use After Free
CWE-476: NULL Pointer Dereference
CWE-190: Integer Overflow or Wraparound
For a sample C program with the above vulnerabilities:
How to generate the dataset:
1. The C source comes from the Juliet Test Suite and is compiled to a more canonical format using GCC or LLVM. Only static analysis is used.
For example:
2. The errors to be detected are:
- Overflow: Numeric calculations which produce a result too large to represent.
- Divide by Zero: Dividing a numeric value by zero.
- Invalid Shift: Shifting an integer value by an amount which produces an undefined result according to the C standard.
- Memory Errors: Accessing an invalid memory region in a way which produces an undefined result, such as accessing an array outside its bounds or accessing heap-allocated memory after the memory has been freed.
- Uninitialized Data Access: Accessing memory before the memory has been initialized, so that the result of the access is undefined under C semantics.
3. The aim, therefore, is to recognize bug types purely from the source code itself, without any runtime inputs.
4. Variable names must not be picked up as a feature for bug recognition, so data augmentation has to vary the characters used in them. One way to do this is to tokenize all the words, identify and set aside the C-specific ones (keywords and the like), and treat what remains as the set of words that can be varied through character substitution. The original program is then transformed using the new mapping.
5. Some of the remarks (comments) are to be used in training to differentiate bugs from non-bugs, but others are not. Manual filtering is needed here.
6. To de-emphasize variable names and put the weight on the structure of the C program itself, it helps to convert the C program into an intermediate (binary or graph) representation such as a CFG, DFD, or PDG, and to encode these as graphs. The C input is still needed, since bugs have to be linked back to the source code level, but it may be given less weight than the intermediate representation.
7. Using a whole-program representation such as a CFG brings its own challenge, however: the labelling in the test suite is done at the C level, so it has to be remapped onto the binary or graph representation.
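The renaming step in item 4 could be sketched roughly as follows. This is a minimal illustration, not the implementation used for the dataset: the `PROTECTED` list of library names, the generated `v<n>_<hex>` naming scheme, and the function name `rename_identifiers` are all assumptions made for the example.

```python
import random
import re

# C keywords: the "C-specific wording" that must never be renamed.
C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if",
    "inline", "int", "long", "register", "restrict", "return", "short",
    "signed", "sizeof", "static", "struct", "switch", "typedef", "union",
    "unsigned", "void", "volatile", "while",
}

# Assumption: a small hand-picked set of library names to leave untouched.
PROTECTED = {"main", "malloc", "free", "printf", "memcpy", "strlen",
             "NULL", "size_t"}

# Comments, literals and preprocessor lines are matched first so that the
# words inside them are skipped; only bare identifiers are candidates.
TOKEN_RE = re.compile(
    r"""
    /\*.*?\*/                 # block comment
  | //[^\n]*                  # line comment
  | "(?:\\.|[^"\\])*"         # string literal
  | '(?:\\.|[^'\\])*'         # character literal
  | ^[ \t]*\#[^\n]*           # preprocessor line
  | \d[\w.]*                  # numeric literal (e.g. 0x1F, 10UL)
  | [A-Za-z_]\w*              # identifier or keyword
    """,
    re.DOTALL | re.MULTILINE | re.VERBOSE,
)


def rename_identifiers(source, seed=0):
    """Return (new_source, mapping) with user identifiers replaced."""
    rng = random.Random(seed)
    mapping = {}

    def substitute(match):
        token = match.group(0)
        if not (token[0].isalpha() or token[0] == "_"):
            return token  # comment, literal, number or preprocessor line
        if token in C_KEYWORDS or token in PROTECTED:
            return token
        if token not in mapping:
            # The running index keeps names unique; the random suffix varies
            # the characters between augmented copies (different seeds).
            mapping[token] = "v%d_%04x" % (len(mapping), rng.randrange(16 ** 4))
        return mapping[token]

    return TOKEN_RE.sub(substitute, source), mapping
```

Calling `rename_identifiers(source, seed=k)` for several values of `k` gives several copies of the same program that differ only in their variable names. Note that this sketch leaves preprocessor lines untouched, so a macro defined with `#define` would have to be added to `PROTECTED` to keep the program consistent.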
For C source it is possible to impose a "windowing" of focus, or attention, where the training input comes from just a few lines, and the window can move forward or backward to extract further data as input. The advantage of windowing is a much more obvious correlation between the error label and the source of the bug. But if the bug itself spans two different windows, the association between the label and the nature of the C program is much harder to identify. The same happens if the window is too large and takes in far more of the C program than is needed to explain the bug: the correlation is harder to establish with large inputs, training takes longer, and a larger dataset is needed.
Windowing is more challenging for a binary or graph representation, because the label then has to be remapped to, for example, a disassembly boundary or a region of the graph.
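As a rough sketch of the windowing idea for C source, the function below slides a fixed-size window over the lines of a file and attaches to each window the labels that fall inside it. The function name, the `size` and `stride` parameters, and the representation of labels as a dictionary from 0-based line index to label text are assumptions made for this example.

```python
def sliding_windows(lines, labels, size=8, stride=4):
    """Yield (start, window_lines, window_labels) over a labelled source file.

    lines  -- list of source lines
    labels -- dict mapping a 0-based line index to a label string
    """
    last_start = max(len(lines) - size, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the file is covered
    for start in starts:
        end = min(start + size, len(lines))
        # Collect the distinct labels attached to lines inside this window.
        window_labels = sorted(
            {labels[i] for i in range(start, end) if i in labels}
        )
        yield start, lines[start:end], window_labels
```

Choosing `stride` smaller than `size` makes consecutive windows overlap, which softens the boundary problem described above: a bug spanning at most `size - stride + 1` consecutive lines falls wholly inside at least one window. A window with an empty label list can serve as a non-bug training example.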
Analysis of a test case:
1. Every error and non-error case is clearly labelled:
2. The labelling is done at the line level, as a remark (comment) statement.
3. The values of the operands are not given, as the current version uses only static analysis. Under dynamic analysis it would be possible to assign specific values at run time, which could potentially generate the vulnerable conditions.
4. Another example:
Both this case and (3) above can be post-instrumented at the source code level, explicitly specifying the error conditions using specific dynamic values, together with other dynamic values that result in NO ERROR conditions, thus augmenting the data to cover all possibilities.
5. Another important data augmentation is differentiating variable names from C syntax: the system has to be taught which strings are characteristic of the C language and which can be anything, for example variable names.
6. Sometimes one error leads to another, e.g. an integer overflow leading to a stack overflow. To recognize this, the window of source code under analysis has to be wide enough to cover both error conditions.
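Because the labels sit in line-level remarks (item 2), they can be pulled out mechanically before training. The sketch below assumes a hypothetical marker of the form `/* ERROR: <description> */` on the flagged line; the real test suite may use a different marker, in which case only `LABEL_RE` would need to change. The marker is also removed from the returned lines so that a model cannot simply read the answer from the comment.

```python
import re

# Assumption: the label is a comment of the form /* ERROR: <description> */.
LABEL_RE = re.compile(r"/\*\s*ERROR\s*:\s*(.*?)\s*\*/")


def extract_labels(source):
    """Return (clean_lines, labels).

    clean_lines -- source lines with the label comments stripped
    labels      -- dict mapping 0-based line index -> label description
    """
    clean_lines, labels = [], {}
    for index, line in enumerate(source.splitlines()):
        match = LABEL_RE.search(line)
        if match:
            labels[index] = match.group(1)
            # Remove the marker so it cannot leak into the training input.
            line = LABEL_RE.sub("", line).rstrip()
        clean_lines.append(line)
    return clean_lines, labels
```

Ordinary comments that do not match the marker are left in place. Deciding which of those to keep is the manual filtering step mentioned earlier.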
In the next post we will show the TensorFlow implementation for the source code CWE classification problem described above.