Feb 16, 2018

On the minimum representation needed to capture semantic information for certain vulnerabilities

First, when you read the following code:



The first question arises immediately:  data * sizeof() is not checked, so the product may overflow the maximum size that malloc() can be given.   That is a bug.

If the product overflows and wraps around to a small value, the malloc() call itself is no longer a bug, because malloc(SMALL) with a small value is perfectly valid.
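The wraparound can be seen with a few lines of C. This is a minimal sketch, not the original snippet: it assumes the element type is `int` and that the count has already been converted to `size_t`, and the helper names `unchecked_alloc_size` and `first_wrapping_count` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the byte count handed to malloc() when
 * `count * sizeof(int)` is computed with no range check.
 * Unsigned size_t arithmetic wraps modulo SIZE_MAX + 1. */
size_t unchecked_alloc_size(size_t count)
{
    return count * sizeof(int);
}

/* Smallest element count whose byte size no longer fits in size_t. */
size_t first_wrapping_count(void)
{
    return SIZE_MAX / sizeof(int) + 1;
}
```

For that first wrapping count the product comes out smaller than a single `int`, so the request reaches malloc() looking like a harmless malloc(SMALL).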

But continuing further:



Now the bug resurfaces in intPointer[]:  the loop still indexes up to data, but the buffer is only as large as the small, wrapped value.
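The mismatch between the allocation and the later indexing can be modelled without actually running an out-of-bounds loop. A minimal sketch, assuming `int` elements and an unchecked `count * sizeof(int)` product; the function names are hypothetical and only illustrate the reasoning.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: number of int elements that actually fit in a
 * buffer obtained from malloc(count * sizeof(int)) when the product
 * is computed without a check and may wrap around. */
size_t elements_allocated(size_t count)
{
    size_t bytes = count * sizeof(int);   /* may wrap to a small value */
    return bytes / sizeof(int);
}

/* The loop "for (i = 0; i < count; i++) intPointer[i] = ..." writes
 * `count` elements, so it overruns the buffer whenever the allocation
 * holds fewer than `count` elements. */
bool loop_overruns_buffer(size_t count)
{
    return elements_allocated(count) < count;
}
```

For ordinary counts the two quantities agree and nothing is reported. Once the product wraps, they diverge, which is exactly the point where the allocation looks fine but the indexing does not.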

This shows several properties that a representation of a program needs in order to detect bugs:

1.   range min/max:   any input has a limit on both sides.

2.   range output:   upon memory allocation, the values should be range-checked to determine validity.

3.   property propagation:   as we analyzed the logic flow above, we noticed that the situation turned from buggy to not buggy, and then flowed on into another function where it became buggy again.

4.   duality of outcome in two situations:   that is a sign of a logic error.

First outcome:   data * sizeof() followed by "for(i...<data;i++)" gives one consistent set of values, if there is no overflow.   Second outcome:   if there is an overflow, the two expressions end up with different values.

5.   If "data overflow" is never used as a programming method to derive values mathematically, then the existence of an "overflow" should trigger an error alert.

6.   The association between buggy and non-buggy lines has to be able to focus on the malloc() that produces intPointer[], the data * sizeof() expression, and the for() loop that uses the intPointer array.   All other lines need to be de-emphasized.
(i.e., the learning must be such that the coexistence of these conditions triggers an alert, even though there may be many other lines in between.)   If the variables cannot link these statements together, then the code will not be considered buggy either.

7.   Multiple compilations make it possible to get different views of the same program, so the linkages can be transformed into different forms, but the buggy lines will retain their property of being "buggy".

As a start, these simple representations will be used:  clang IR, gcc assembly, C source, gcc object code.

8.   Multi-parse is needed:   the first parse derives the inputs that are then used to derive the other values in the second parse.   The analogy in nature is that reading and re-reading is often needed to aid understanding, because the first read (the C program) helps to truncate and eliminate all the unnecessary information ahead of a deeper second parse (e.g., the assembly).   During the first parse, many attributes have to be dropped to maintain conciseness.
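Properties 1, 2 and 5 can be read together as a guard in front of the allocation: bound the input on both sides, range-check the size before it reaches malloc(), and treat any overflow as an error. A minimal sketch, assuming an `int` input named `data` as in the discussion above; `checked_int_alloc` and the upper limit `MAX_ELEMENTS` are hypothetical choices for illustration, not taken from the analyzed code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical upper limit for the input (property 1); pick a value
 * that suits the application. */
#define MAX_ELEMENTS 1000000

/* Returns a zero-initialized buffer of `data` ints, or NULL if `data`
 * is out of range or the byte count would overflow size_t. */
int *checked_int_alloc(int data)
{
    /* Property 1: the input has a limit on both sides. */
    if (data <= 0 || data > MAX_ELEMENTS) {
        return NULL;
    }
    size_t count = (size_t)data;

    /* Properties 2 and 5: range-check the product; an overflow is
     * reported as an error rather than silently wrapping. */
    if (count > SIZE_MAX / sizeof(int)) {
        return NULL;
    }
    int *intPointer = malloc(count * sizeof(int));
    if (intPointer == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < count; i++) {
        intPointer[i] = 0;   /* every index is within the allocation */
    }
    return intPointer;
}
```

With the guard in place, the allocation size and the loop bound are derived from the same validated count, so the two outcomes described in property 4 can no longer diverge.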
