Nov 9, 2017

Deep Learning GANs and Fuzzing: Is there a pattern to successful fuzzing?

I was reading this:   

Faster Fuzzing: Reinitialization with Deep Neural Models
https://arxiv.org/pdf/1711.02807.pdf

and started thinking:

a.   Is there a pattern to "successful" fuzzing?   The art of a good fuzzing strategy is to attempt inputs that are likely to break normal assumptions (a small mutation sketch follows the list):

1.   signed vs unsigned issues, buffer/stack overflow
2.   having zero as data
3.   triggering many concurrent executions through races: in data input/processing, in data output/supply, or in general resource consumption
4.   stressing memory usage, or any other system resources (file descriptors, shared drivers, hardware devices like Bluetooth, etc.)
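
As a rough illustration of items 1, 2 and 4, here is a minimal mutation sketch; the field widths, boundary values, and padding sizes are all illustrative, not tied to any particular target:

```python
import random
import struct

# Boundary values chosen to break signed/unsigned assumptions (item 1)
# and zero-as-data assumptions (item 2); the widths are illustrative.
BOUNDARY_VALUES = [
    0,                                    # zero as data
    0x7F, 0x80, 0xFF,                     # 8-bit signed/unsigned boundaries
    0x7FFF, 0x8000, 0xFFFF,               # 16-bit boundaries
    0x7FFFFFFF, 0x80000000, 0xFFFFFFFF,   # 32-bit boundaries
]

def mutate(seed: bytes) -> bytes:
    """Return a copy of `seed` with one 4-byte field overwritten by a
    boundary value, or with an oversized run appended to provoke
    buffer/stack overflows and resource exhaustion (items 1 and 4)."""
    data = bytearray(seed)
    if random.random() < 0.5 and len(data) >= 4:
        offset = random.randrange(len(data) - 3)
        value = random.choice(BOUNDARY_VALUES)
        data[offset:offset + 4] = struct.pack("<I", value)
    else:
        # Overlong input to stress memory/buffer handling.
        data += b"A" * random.choice([256, 4096, 65536])
    return bytes(data)
```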

b.  A good pattern is always to try inputs that have not been attempted before: not totally randomized, but structured to conform to "proper usage", so that the fuzzing can go "DEEP".

Definition of "DEEP" fuzzing?   It means providing inputs that pass through as many layers of the software as possible, i.e., inputs have to be very near to VALID, deformed only enough to break the deeper layers of the software.
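
As a sketch of this near-valid idea, assume a hypothetical length-prefixed message format: keep the magic bytes and length prefix correct so the input survives the outer parsing layers, and deform only the inner payload.

```python
import random
import struct

MAGIC = b"FUZZ"  # hypothetical 4-byte magic for an imaginary protocol

def deep_mutation(payload: bytes) -> bytes:
    """Build a message that passes the outer layers (valid magic and
    correct length prefix) but carries a deformed payload, so the
    deformation is only seen by the deeper parsing logic."""
    mutated = bytearray(payload)
    if mutated:
        pos = random.randrange(len(mutated))
        mutated[pos] ^= 0xFF  # flip one payload byte
    # Outer framing stays VALID: correct magic and length prefix.
    return MAGIC + struct.pack("<I", len(mutated)) + bytes(mutated)
```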

c.   Since there is a recommended list of inputs to attempt, but at the same time they should be randomized as much as possible, this seems like a pattern-less domain in which to apply Deep Learning.

d.   If fuzzing is structured as a game to play: every step of software execution depends on the input in a certain way, and the aim of the input is to outmanoeuvre the software execution so as to make it fail. Since the software is static and cannot be modified, only the input can be, the inputs and their sequential ordering will be the training examples, and the output is just pass or fail.
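
In that framing the training set is just pairs of (input, pass/fail). A minimal sketch, assuming a hypothetical `run_target` harness that returns True when an input crashes the target (a concrete version of such a harness is sketched further below):

```python
def build_dataset(inputs, run_target):
    """Label each candidate input by actually executing the target.
    `run_target` is a hypothetical harness returning True on crash;
    the result is a supervised dataset of (input bytes, crash label)."""
    return [(x, 1 if run_target(x) else 0) for x in inputs]
```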

e.   Reading about GANs (Generative Adversarial Networks), we can apply analogies here:

TRUE data distribution === inputs that produce crashes, i.e., successful fuzzing inputs.
Generator === generates inputs intended to produce successful fuzzing.
Discriminator === reads the Generator's outputs and decides whether they belong to the TRUE data distribution or not, i.e., crashed or did not crash as fuzzing output.
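
A minimal PyTorch sketch of that pairing (all sizes are illustrative; `real_crashers` is assumed to be a batch of byte sequences already known to crash the target, scaled to [0, 1]):

```python
import torch
import torch.nn as nn

LATENT, INPUT_LEN = 64, 256  # illustrative sizes

# Generator: latent noise -> candidate fuzz input (bytes scaled to [0, 1]).
G = nn.Sequential(
    nn.Linear(LATENT, 128), nn.ReLU(),
    nn.Linear(128, INPUT_LEN), nn.Sigmoid(),
)

# Discriminator: candidate input -> probability it belongs to the
# TRUE distribution (inputs that crash the target).
D = nn.Sequential(
    nn.Linear(INPUT_LEN, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1), nn.Sigmoid(),
)

loss = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_crashers: torch.Tensor):
    batch = real_crashers.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)
    fake = G(torch.randn(batch, LATENT))

    # Discriminator: real crashing inputs -> 1, generated inputs -> 0.
    opt_d.zero_grad()
    d_loss = loss(D(real_crashers), ones) + loss(D(fake.detach()), zeros)
    d_loss.backward()
    opt_d.step()

    # Generator: try to make D believe its outputs are crashers.
    opt_g.zero_grad()
    g_loss = loss(D(fake), ones)
    g_loss.backward()
    opt_g.step()
```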

f.   Therefore the sequence of input generation, which is pseudo-randomized by nature, with known areas where randomization is and is not allowed, will be the output of the "Generator" in this GAN design.
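
One plausible way to encode "known areas where randomization is and is not allowed" is a byte mask: the generator only fills the free positions, and the fixed structural bytes are stamped in afterwards. A sketch, with a made-up template and assuming the generator emits at least as many values as the template has positions:

```python
import torch

# Hypothetical template: structural bytes that must stay fixed so the
# input still conforms to "proper usage"; None marks free positions.
TEMPLATE = [0x46, 0x55, 0x5A, 0x5A] + [None] * 12  # magic + free payload

def assemble(generated: torch.Tensor) -> bytes:
    """Merge generator output (values in [0, 1]) with the template:
    fixed positions keep their structural byte, free positions take
    the generated value scaled to 0..255. Extra generator values
    beyond the template length are ignored."""
    out = bytearray()
    gen_bytes = (generated * 255).to(torch.uint8).tolist()
    for fixed, g in zip(TEMPLATE, gen_bytes):
        out.append(fixed if fixed is not None else g)
    return bytes(out)
```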

g.   The Discriminator will attempt to emulate the original software stack and make the crash/no-crash decision.
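
In practice the crash/no-crash ground truth comes from actually running (or emulating) the target; a minimal harness sketch, assuming a local binary that reads from stdin (POSIX semantics, where a negative return code means the process died from a signal such as SIGSEGV):

```python
import subprocess

def crashed(target_path: str, data: bytes, timeout: float = 5.0) -> bool:
    """Run the target with `data` on stdin and report whether it crashed.
    A negative return code means the process was killed by a signal,
    which we treat as a successful fuzzing input."""
    try:
        result = subprocess.run(
            [target_path], input=data,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # a hang, not a crash; could be tracked separately
    return result.returncode < 0
```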

h.   Different software will have different generator/discriminator pairings.

i.   Unlike a picture, which cannot tell the world whether it itself is a DOG or a CAT, a real piece of software can, through emulation.

j.   So the Deep Learning via GAN approach, together with emulation feedback on whether the fuzzing is successful or not, can potentially learn the entire domain of vulnerabilities faster than randomized input, whose input space is a combinatorially explosive nightmare.
