My Explorations Into Deep Learning: Code scanner + Reinforcement Learning => Deep Learning

Given the rich repertoire of tools for static source code scanning:

https://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis

It is possible to use one of them to scan C source code, emit the findings as comments, and use those findings to separate good sources from bad ones, and perhaps to classify the bad ones by type of security bug.

Then, using [ C source => static code analysis output ] as a sequence pair, we can feed it into an autoencoder-style encoder-decoder engine and train a deep learning model on it. The output of this engine is the assigned bug class and the probability that the classification is correct. Using the following as an example (motivated by chatbots), the input to the encoder is the C source, and the output of the decoder is the static analyzer output.
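To make the sequence-pair idea concrete, here is a minimal sketch of turning [ C source => analyzer output ] pairs into integer sequences, the form an encoder-decoder model would consume. The two toy pairs, the tokenizer regex, and the helper names (tokenize, build_vocab, encode) are assumptions made for illustration; they do not come from any particular scanner or seq2seq library.

```python
import re

# Hypothetical toy data: a C snippet and the warning a scanner might emit for it.
pairs = [
    ("char buf[8]; memcpy(buf, src, 16);", "buffer overflow in memcpy"),
    ("char buf[8]; memcpy(buf, src, 8);", "no issue"),
]

# Identifiers, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(text):
    """Split text into identifiers, numbers, and single punctuation characters."""
    return TOKEN_RE.findall(text)


def build_vocab(texts):
    """Map each distinct token to an integer id; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert text to a list of token ids, using <unk> for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]


# Separate vocabularies for the encoder side (C) and decoder side (warnings).
src_vocab = build_vocab(src for src, _ in pairs)
tgt_vocab = build_vocab(tgt for _, tgt in pairs)
encoded = [(encode(s, src_vocab), encode(t, tgt_vocab)) for s, t in pairs]
```

A real pipeline would probably want a proper C lexer instead of a regex, but the shape of the data (one id sequence in, one id sequence out) stays the same.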

http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/

As mentioned in [8], a plain autoencoder produces deterministic output, whereas a VAE (variational autoencoder) produces stochastic output.

But C source code carries far more information than is needed, so the model has to learn to ignore much of it and identify only the key sequences, orderings, and combinations that can be successfully mapped to a bug class.

Through an ensemble method, it is possible to run several different vulnerability scanners and compute the overall result as a weighted average of the individual scanners' per-class scores.
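A minimal sketch of the weighted average, assuming each scanner reports a score between 0 and 1 per bug class and that a scanner which does not mention a class contributes 0 for it. The scanner names and weights here are hypothetical; in practice the weights might come from each scanner's measured accuracy.

```python
def ensemble_scores(scanner_scores, weights):
    """Weighted average of per-class scores across scanners.

    scanner_scores: {scanner_name: {bug_class: score}}
    weights:        {scanner_name: non-negative weight}
    A class a scanner does not report counts as score 0 for that scanner.
    """
    total_weight = sum(weights[name] for name in scanner_scores)
    combined = {}
    for name, scores in scanner_scores.items():
        for bug_class, score in scores.items():
            combined[bug_class] = combined.get(bug_class, 0.0) + weights[name] * score
    return {c: s / total_weight for c, s in combined.items()}


# Hypothetical scanners and weights, for illustration only.
scores = {
    "scanner_a": {"CWE-121": 0.9, "CWE-476": 0.1},
    "scanner_b": {"CWE-121": 0.6},
}
weights = {"scanner_a": 2.0, "scanner_b": 1.0}
result = ensemble_scores(scores, weights)
```

With these numbers, CWE-121 comes out at (2 × 0.9 + 1 × 0.6) / 3 = 0.8, and CWE-476 at 0.2 / 3.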

Through Bayesian reasoning, if 80% of the scanners report a positive result, each with a confidence of 0.8, then the overall probability should be higher than 0.8, provided the scanners' errors are roughly independent of one another.
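One way to check this intuition is a naive-Bayes combination in log-odds space. The sketch below assumes the scanners are conditionally independent given the true label, that each confidence is that scanner's P(bug | its own evidence) calibrated against the same prior, and that a dissenting scanner reports 0.2. All three are assumptions for illustration; real scanners tend to share blind spots, which would shrink the gain.

```python
import math


def combine_independent(confidences, prior=0.5):
    """Combine per-scanner probabilities under a conditional-independence assumption.

    Each scanner shifts the log-odds by its own log-odds minus the prior
    log-odds, which is the log likelihood ratio of its evidence.
    """
    prior_logit = math.log(prior / (1 - prior))
    logit = prior_logit
    for p in confidences:
        logit += math.log(p / (1 - p)) - prior_logit
    return 1 / (1 + math.exp(-logit))


# Five scanners: four positive at 0.8, one dissenting at 0.2 (assumed value).
overall = combine_independent([0.8, 0.8, 0.8, 0.8, 0.2])
```

Each 0.8 is a likelihood ratio of 4 and the dissenting 0.2 is a ratio of 1/4, so the combined odds are 4^4 / 4 = 64, giving a probability of 64/65, roughly 0.985.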

Comments as classification

Given that the dataset is the NIST SAMATE Juliet test suite, which comes instrumented with comments on bug classes, these comments can be used as a reinforcement signal for whether a classification is correct or not.
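As a rough sketch of how such labels might be pulled out, the code below extracts a CWE id from a file name and collects marker comments from the source. It assumes file names start with "CWE<number>_" and that flaw and fix sites are annotated with comments beginning "POTENTIAL FLAW:", "FLAW:", or "FIX:". Both conventions should be checked against the actual dataset, and the sample source is made up for illustration.

```python
import re

# Assumed conventions, to be verified against the real test suite.
CWE_RE = re.compile(r"CWE(\d+)_")
MARKER_RE = re.compile(r"/\*\s*(POTENTIAL FLAW|FLAW|FIX):\s*(.*?)\s*\*/", re.DOTALL)


def label_from_filename(filename):
    """Return a label like 'CWE-121' if the file name starts with 'CWE<n>_'."""
    m = CWE_RE.match(filename)
    return f"CWE-{m.group(1)}" if m else None


def markers_from_source(source):
    """Return (kind, text) for each marker comment found in the source."""
    return MARKER_RE.findall(source)


# Hypothetical sample in the assumed style.
sample = """
void bad() {
    char dest[8];
    /* POTENTIAL FLAW: copy may exceed the size of dest */
    memcpy(dest, src, 16);
}
void good() {
    char dest[16];
    /* FIX: dest is large enough */
    memcpy(dest, src, 16);
}
"""

label = label_from_filename(
    "CWE121_Stack_Based_Buffer_Overflow__char_type_overrun_memcpy_01.c"
)
markers = markers_from_source(sample)
```

The label can serve as the training target, while the marker comments locate the flawed and fixed code inside a file.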

API as classification

As indicated in [4], using certain APIs in certain ways can indicate a vulnerability. For example, memcpy with an oversized input into a smaller heap or stack buffer. But to do this analysis, additional information has to be created for the different cases before the classification can go ahead.
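The memcpy case can be illustrated with a deliberately naive heuristic: record the declared size of fixed-size char arrays, then flag memcpy calls whose literal length exceeds the destination's size. This is only a sketch under strong assumptions (literal sizes, one flat scope, no pointers or macros); it is not how [4] performs its analysis, and the function name is hypothetical.

```python
import re

# 'char name[N]' declarations with a literal size.
DECL_RE = re.compile(r"\bchar\s+(\w+)\s*\[\s*(\d+)\s*\]")
# 'memcpy(dest, <anything without a comma>, N)' with a literal length.
MEMCPY_RE = re.compile(r"\bmemcpy\s*\(\s*(\w+)\s*,[^,]+,\s*(\d+)\s*\)")


def find_oversized_memcpy(source):
    """Return (dest, declared_size, copy_length) for each suspicious memcpy."""
    sizes = {name: int(n) for name, n in DECL_RE.findall(source)}
    findings = []
    for dest, length in MEMCPY_RE.findall(source):
        if dest in sizes and int(length) > sizes[dest]:
            findings.append((dest, sizes[dest], int(length)))
    return findings


code = "char buf[8]; char big[32]; memcpy(buf, src, 16); memcpy(big, src, 16);"
findings = find_oversized_memcpy(code)
```

Only the copy into buf is flagged, since 16 bytes do not fit in 8. Features like this (API name, destination size, length) are the kind of additional per-case information the classifier would need.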

Post Training vs Pre Training Architecture shift

After training is over, the input will be just the C source code, with the comments perhaps ignored. The autoencoder engine will then run in autopilot mode to classify vulnerabilities.
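Ignoring comments at inference time could be as simple as stripping them before the source reaches the model. The sketch below uses a regex and assumes comment markers never appear inside string literals, which a real C lexer would handle properly.

```python
import re

# Block comments (non-greedy, may span lines) or line comments.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def strip_comments(source):
    """Remove C block and line comments; string literals are not special-cased."""
    return COMMENT_RE.sub("", source)


cleaned = strip_comments("int x; /* FLAW: y */ int z; // note\n")
```

Running the same stripping step over the training inputs as well would keep the model from simply reading the answer out of the comments.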

A far-fetched goal is understanding algorithms and automatically picking up protocol implementations (e.g., networking protocols), and identifying vulnerable patterns in those algorithms and protocols [6].

References:

Source code auditing:

1. https://trailofbits.github.io/ctf/vulnerabilities/source.html

2. https://www.owasp.org/images/7/78/OWASP_AlphaRelease_CodeReviewGuide2.0.pdf

3. https://github.com/dpnishant/raptor

4. http://www.covert.io/research-papers/security/Vulnerability%20Extrapolation%20-%20Assisted%20Discovery%20of%20Vulnerabilities%20using%20Machine%20Learning.pdf

5. https://ayearofai.com/lenny-2-autoencoders-and-word-embeddings-oh-my-576403b0113a

6. https://www.reddit.com/r/learnprogramming/comments/6ytb4r/what_would_be_a_fun_and_brain_friendly_way_to/

7. http://kvfrans.com/variational-autoencoders-explained/

8. https://jaan.io/what-is-variational-autoencoder-vae-tutorial/

9. http://blog.fastforwardlabs.com/2016/08/12/introducing-variational-autoencoders-in-prose-and.html

10. https://arxiv.org/pdf/1602.05012.pdf