How do we find vulnerability patterns in source code when there is not enough training data? Some ideas, using the concept of a Generative Adversarial Network:
a. Generate a C program with a known vulnerability. From this single example, use polymorphic transformation or a code-cloning mechanism to generate other examples that characterise the same vulnerability. Also train with another C program in which the vulnerability has been fixed. In this way, the system is trained to recognize a broad class of C programs that are semantically similar (they have the same bug) but syntactically different, and to tell them apart from programs that do not have the bug.
b. Given the C program, compile it, generate the disassembly (using "objdump"), and use that to train the network. Essentially, the variation is now generated by the compiler and "objdump" instead of by the polymorphic transformation tool or the code-clone tool.
c. Using code optimization (gcc -O2, for example), it is possible to generate many different variations of the assembly that essentially perform the same thing. In this way,
d. Alternatively, by substituting values for the variables in the program, it is possible to generate an almost impossibly large number of distributions of "points" that characterize the program, not forgetting that the points will have to be generated across a time axis as well. An important question is how to identify the important points distributed in space that should be used for "representing the problem". The points chosen, or values substituted, have to represent the shape of the surface across different types of input. Two further questions are which part of the surface corresponds to the problem area, and how to distinguish the different outcomes based on the different attributes of the surface.
e. Instead of polymorphic transformation or code cloning, another way is simply to make the source code itself more complex, i.e., to add lots of different kinds of code taken from other open-source projects into the existing source code.
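
Items a to c can be illustrated with a minimal sketch. It is not a real polymorphic transformation or code-clone tool: the only transformation used here is identifier renaming, chosen because it is the simplest change that keeps the semantics and alters the syntax. The seed C functions, the rename pools, and every function name below are hypothetical and exist only for illustration. The gcc and objdump command lines are built but not executed, so the sketch runs without a compiler installed.

```python
import itertools
import os
import re

# Hypothetical seed pair: the same function with and without the bug.
VULNERABLE = """
void copy_name(const char *src) {
    char buf[16];
    strcpy(buf, src);          /* no bounds check on src */
    puts(buf);
}
"""

FIXED = """
void copy_name(const char *src) {
    char buf[16];
    snprintf(buf, sizeof buf, "%s", src);   /* bounded copy */
    puts(buf);
}
"""

# Hypothetical rename pools; the pools are disjoint so names never collide.
RENAMES = {
    "copy_name": ["copy_name", "store_label", "save_input"],
    "buf": ["buf", "tmp", "dest"],
    "src": ["src", "input", "data"],
}


def rename(source, mapping):
    """Rename whole-word identifiers in a single pass so swaps cannot chain."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], source)


def variants(source):
    """Yield one syntactic variant per combination of replacement names."""
    names = list(RENAMES)
    for choice in itertools.product(*(RENAMES[n] for n in names)):
        yield rename(source, dict(zip(names, choice)))


def build_dataset():
    """Item a: label 1 = has the bug, label 0 = bug fixed."""
    data = [(text, 1) for text in variants(VULNERABLE)]
    data += [(text, 0) for text in variants(FIXED)]
    return data


def disassembly_commands(c_path, levels=("-O0", "-O1", "-O2", "-O3")):
    """Items b and c: (compile, disassemble) command pairs per optimization level.

    The commands are only constructed here, not run.
    """
    base = os.path.splitext(c_path)[0]
    commands = []
    for level in levels:
        obj = f"{base}{level}.o"
        commands.append((["gcc", level, "-c", c_path, "-o", obj],
                         ["objdump", "-d", obj]))
    return commands


if __name__ == "__main__":
    dataset = build_dataset()
    print(len(dataset), "labelled samples")
    for compile_cmd, dump_cmd in disassembly_commands("bug.c"):
        print(" ".join(compile_cmd), "&&", " ".join(dump_cmd))
```

Each command pair could be passed to subprocess.run to obtain the actual disassembly for every variant. How much the assembly differs between optimization levels depends on the compiler and the code, so the number of useful variants per sample is something to measure rather than assume.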