https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
Key questions:
1. What is so unique about bidirectional processing?
2. Does bidirectional training lead to confusion, or interfere with the weights of previously learned material?
3. What is the interpretation of each direction's processing?
4. What characteristics of thought or understanding can be emulated through directed or bidirectional neural networks?
5. How can the relationships between different words be learned via unsupervised learning? How can we tell whether sentences phrased in different ways have the same meaning? (See the sketch after this list.)
6. How can changes of meaning be learned when the words are reordered, or when different tenses are used?
7. How can the pairing relationship between questions and their answers be learned?
8. How can the shared meaning of the same sentence translated into different languages be learned?
9. How can ordering concepts be learned: steps in the sequencing of ideas/concepts from one component to another, from one time slice to another, causes and effects?
10. How can relationships between different abstraction levels of concepts be learned: "class" vs. "instantiation", "cars" vs. "Toyota", etc., where one set of entities is just a set of instantiations of the generic class?
11. How can the basic classes of questions be answered: HOW, WHY, WHEN, WHERE, WHAT, and so on?
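On question 5, one common approach (an assumption on my part, not something stated in these notes) is to embed each sentence with a pretrained BERT encoder and compare mean-pooled vectors by cosine similarity, so that paraphrases land close together. A minimal sketch, assuming the Hugging Face `transformers` and `torch` packages; the model name is illustrative:

```python
# Sketch: compare two paraphrased sentences by cosine similarity of
# mean-pooled BERT embeddings. Model name and pooling choice are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    # Tokenize, run through BERT, and mean-pool the last hidden states.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # (768,)

a = embed("The cat sat on the mat.")
b = embed("On the mat, a cat was sitting.")
similarity = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```

Whether mean pooling is the right sentence representation is itself a design choice; the sketch only illustrates the idea.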
What is BERT?
Key innovations: masked language modeling (MLM) and next sentence prediction.
This diagram best describes innovation #1 for BERT, MLM training:
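For readers without the diagram at hand, here is a rough sketch of the MLM corruption step as described in the BERT paper: select about 15% of token positions; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged. The token ids and vocabulary size below are illustrative assumptions.

```python
# Sketch of BERT's MLM corruption step (15% selection, 80/10/10 rule).
import torch

def mask_tokens(input_ids, mask_token_id=103, vocab_size=30522, mlm_prob=0.15):
    labels = input_ids.clone()
    # Decide which positions to corrupt.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100  # compute loss only on the selected positions

    # 80% of selected positions -> [MASK]
    mask_positions = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[mask_positions] = mask_token_id

    # 10% of selected positions -> a random token
    random_positions = (
        torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~mask_positions
    )
    input_ids[random_positions] = torch.randint(vocab_size, input_ids.shape)[random_positions]

    # The remaining 10% keep the original token.
    return input_ids, labels

ids = torch.randint(1000, (1, 12))        # a fake batch of token ids
corrupted, labels = mask_tokens(ids.clone())
```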
And next sentence prediction is shown here:
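A similarly rough sketch of how next-sentence-prediction training pairs are built: half the time the second segment is the true next sentence (IsNext), half the time it is a random sentence from the corpus (NotNext). The toy corpus below is purely for illustration.

```python
# Sketch of building next-sentence-prediction examples: 50% true next
# sentence (label 1 = IsNext), 50% random sentence (label 0 = NotNext).
import random

def make_nsp_pairs(sentences):
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))          # IsNext
        else:
            pairs.append((sentences[i], random.choice(sentences), 0))  # NotNext
    return pairs

corpus = ["He went to the store.", "He bought a gallon of milk.",
          "Penguins are flightless birds.", "They live in the Antarctic."]
for a, b, label in make_nsp_pairs(corpus):
    print(label, "|", a, "||", b)
```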
Further innovations needed:
Prediction of the next word/idea is an indicator of "intelligence" or "understanding". But there may be many different variations of the next word or concept, and after training they may or may not be merged together: if they are, it is because they express the same idea; if not, it is because the ideas are distinctly different.
Once you can predict the next word or next sentence, what about multiple subsequent words or sentences, possibly with an ordering requirement entailed? How can these operations be cascaded?
If you cascade them sequentially through a deeper and deeper neural network, would it be possible to consider different neural architectures, implement some skip connections, or introduce randomized forgetting or dropping of weights?
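A minimal sketch of two of the ideas raised above, assuming PyTorch: a feed-forward block with a skip (residual) connection, plus dropout as a form of randomized forgetting. The dimensions and dropout rate are illustrative assumptions.

```python
# Sketch: a feed-forward block with a residual (skip) connection and dropout.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=768, p_drop=0.1):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.drop = nn.Dropout(p_drop)   # randomly zeroes activations at train time
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Skip connection: the input bypasses the transformation and is added back.
        return self.norm(x + self.drop(self.ff(x)))

x = torch.randn(2, 16, 768)
print(ResidualBlock()(x).shape)   # torch.Size([2, 16, 768])
```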
I.e., A -> predict B, then predict B1, B2, etc. And then what about doing backward induction/deduction to predict A?
Masking can be treated as a form of skip connection, of forgetting, or of regularization via zeroing the weights. So what about randomized masking?
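One way to read "randomized masking via zeroing the weights" is a DropConnect-style layer (my label, not the notes'): on each training pass a random subset of the weights is zeroed out, which acts as regularization. A sketch under that assumption; sizes and the masking rate are illustrative.

```python
# Sketch: zero a random subset of a linear layer's weights on each
# training forward pass (DropConnect-style regularization).
import torch
import torch.nn as nn

class RandomWeightMask(nn.Module):
    def __init__(self, in_dim=768, out_dim=768, p_mask=0.1):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.p_mask = p_mask

    def forward(self, x):
        if self.training:
            # Zero out a random fraction of the weights for this pass only.
            keep = torch.bernoulli(torch.full_like(self.linear.weight, 1 - self.p_mask))
            return nn.functional.linear(x, self.linear.weight * keep, self.linear.bias)
        return self.linear(x)

layer = RandomWeightMask()
layer.train()
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```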