My Explorations Into Deep Learning: Source code representation

Let's analyze a sample program from NIST's SAMATE Juliet test suite:

And here it is after compiling it with gcc and dumping its GIMPLE representation, which prints out only the core structure:

1. Using the above GIMPLE source code, the entire program will be streamed into the LSTM network.

[Figure: input and output of an LSTM]

2. The caller-callee relationship will be ignored.

3. The "potential_flaw()" comment will be removed from the input. It will be used as the training label instead: the position of the comment (with respect to the start of the program) + the nature of the software vulnerability.

The "potential_flaw()" comment is used as the output for "supervised learning", and the "group of source code" it is supposed to be associated with (in the sense of security bugs) will be the few lines before and after the "potential_flaw()" comment.

4. The nature of the "classification" is to be described literally by the text of the "potential_flaw()" comment.
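
To make steps 1, 3 and 4 a bit more concrete, here is a minimal Python sketch of how the training pairs could be prepared. It is only a sketch under assumptions: the marker syntax `/* POTENTIAL FLAW: ... */`, the toy `SAMPLE` program, and the names `extract_examples`, `tokenize` and `window` are all hypothetical and are not taken from the code shown above. It strips the flaw comment from the program, records its position and its literal description as the label, keeps a few lines before and after as the associated "group of source code", and flattens the cleaned program into a token stream that could be fed to the LSTM.

```python
import re

# Assumed marker syntax for the flaw comment (hypothetical, adjust to the real data).
FLAW_MARKER = re.compile(r"/\*\s*POTENTIAL FLAW:\s*(.*?)\s*\*/")
# Very rough tokenizer: identifiers, integers, or any single non-space symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

# Toy program for illustration only; not an actual Juliet test case.
SAMPLE = """\
bad ()
{
  char data[10];
  /* POTENTIAL FLAW: possible buffer overflow when copying into data */
  strcpy (data, source);
  printLine (data);
}
"""


def extract_examples(source, window=2):
    """Remove flaw comments; return (clean_lines, labels).

    Each label holds the position of the removed comment (index of the
    next kept line), the literal comment text, and the surrounding lines.
    """
    clean_lines = []
    labels = []
    for line in source.splitlines():
        match = FLAW_MARKER.search(line)
        if match:
            labels.append({"position": len(clean_lines),
                           "nature": match.group(1)})
            line = FLAW_MARKER.sub("", line)
            if not line.strip():
                continue  # the comment was alone on its line, drop the line
        clean_lines.append(line)

    # Attach the few lines before and after each flaw position.
    for label in labels:
        start = max(0, label["position"] - window)
        end = min(len(clean_lines), label["position"] + window + 1)
        label["context"] = clean_lines[start:end]
    return clean_lines, labels


def tokenize(lines):
    """Flatten the cleaned program into one token stream (the LSTM input)."""
    return [tok for line in lines for tok in TOKEN.findall(line)]


if __name__ == "__main__":
    clean, labels = extract_examples(SAMPLE, window=2)
    print(tokenize(clean))
    for label in labels:
        print(label["position"], label["nature"])
```

The position here is counted in lines of the cleaned program; counting in tokens instead would work just as well, as long as the same convention is used for training and prediction.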