Comparing the features of word2vec and FuzzyWuzzy, here is my take on the essentials of concept classification:
1. String proximity via character matching (as in FuzzyWuzzy) is good and resilient against character insertions and deletions. But connecting a string to a concept or classification can entail a totally different spelling, and thus needs association properties like those of word2vec, which essentially computes the cosine similarity between two vectors in space (see the first sketch after this list).
2. Words are grouped together via an a priori, hardwired classification: teaching is needed, not a heuristic like nearest alphabetical distance. Alphabetical distance is still useful, though, when "UNKNOWN" words are encountered: a word that is only slightly off alphabetically is probably a misspelling. But pairs like "good" vs "goods" are both in the dictionary, and therefore have to be partitioned into two different classes that must not be associated via alphabetical nearness (see the second sketch below).
3. Classifying words into different classes is meaningful only if the words are nouns. Others, such as adjectives (descriptive by nature) or verbs, need association with other words, so standalone classification is not useful. Therefore the system should be trained to detect verbs vs nouns etc. and filter them out prior to classification (see the POS-filtering sketch below).
4. Without the benefit of supervised learning, such as using cosine similarity as a loss function, seq2seq is a good alternative in the unsupervised case. The internals of seq2seq (and similarly RNNs or LSTMs) exploit the property of "associativity": every time two concepts are used one after another, it is more likely they are "associated", so no supervision is needed [13] (a crude count-based analogue is sketched below).
5. There is a practical problem of training requirements: some training needs much more data, other training needs less. Some knowledge turns over quickly as new knowledge arrives and needs frequent retraining, while other knowledge (or knowledge repositories) is more permanent, so less retraining is needed. Through various strategies of "focus", or attention, it is possible to first pass the data through a "declassifier", which identifies approximate or vague candidate classes; these are then channelled to a better-trained model, which suggests a more accurate classification based on its more expert knowledge. A few of these "expert" domains can be consulted and come together to agree on a more accurate classification, analogous to ensemble methods (see the routing sketch below).
6. Adversarial attacks: these can be a source of unlearning of the currently learned knowledge [14,17]. Perhaps it is even possible [16] to craft specific adversarial examples that mess up the current knowledge build-up. So how this source of perturbations can be detected is an interesting problem [15] (a minimal detection sketch appears below).
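A few minimal sketches for the points above follow. First, for point 1, the two notions of similarity side by side. This assumes the fuzzywuzzy and gensim (4.x) packages; the toy corpus is invented, so the word2vec similarity score is illustrative only.

```python
from fuzzywuzzy import fuzz
from gensim.models import Word2Vec

# Character-level proximity: resilient to insertions and deletions.
print(fuzz.ratio("classification", "clasification"))  # high score despite the typo

# Vector-space proximity: differently spelled words can still be "near".
corpus = [
    ["the", "service", "was", "good"],
    ["the", "service", "was", "excellent"],
    ["the", "goods", "arrived", "late"],
]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=42)
print(model.wv.similarity("good", "excellent"))  # cosine similarity of the two vectors
```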
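For point 2, a sketch of the suggested policy: snap only out-of-vocabulary ("UNKNOWN") words to their nearest dictionary entry, while leaving in-vocabulary words like "good" and "goods" as distinct classes. The toy dictionary and the threshold of 80 are assumptions.

```python
from fuzzywuzzy import process

DICTIONARY = ["good", "goods", "teaching", "classification"]  # assumed toy lexicon

def normalise(word, threshold=80):
    if word in DICTIONARY:
        # In the dictionary: "good" and "goods" stay distinct, never merged by nearness.
        return word
    # Unknown word: treat the alphabetical offset as a mistake, snap to the nearest entry.
    match, score = process.extractOne(word, DICTIONARY)
    return match if score >= threshold else word

print(normalise("goood"))  # -> "good"  (unknown, corrected)
print(normalise("goods"))  # -> "goods" (known, kept distinct)
```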
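For point 3, a sketch of filtering to nouns before classification, using NLTK's off-the-shelf tokenizer and POS tagger (resource names may vary between NLTK versions).

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def nouns_only(sentence):
    """Keep only nouns (tags NN, NNS, NNP, NNPS); drop verbs, adjectives, etc."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [word for word, tag in tagged if tag.startswith("NN")]

print(nouns_only("The teacher quickly classified the new goods"))
# -> ['teacher', 'goods']
```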
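For point 4, a full seq2seq model is too long to sketch here, but the "associativity" property it exploits can be shown with a crude count-based stand-in: every time two tokens occur one after another, their association count increases, with no labels or supervision involved.

```python
from collections import Counter

def adjacency_counts(corpus):
    # Tokens used one after another are treated as more likely "associated".
    counts = Counter()
    for sentence in corpus:
        for a, b in zip(sentence, sentence[1:]):
            counts[(a, b)] += 1
    return counts

corpus = [["deep", "learning", "model"], ["deep", "learning", "rocks"]]
print(adjacency_counts(corpus).most_common(1))
# -> [(('deep', 'learning'), 2)]
```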
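For point 5, a structural sketch of the coarse "declassifier" routing to expert models, with an ensemble-style vote at the end. The coarse_model.suggest and experts[...].predict interfaces are hypothetical, not a real library.

```python
from collections import Counter

def classify(text, coarse_model, experts):
    # Stage 1: the "declassifier" suggests vague candidate classes.
    candidates = coarse_model.suggest(text)  # hypothetical API
    # Stage 2: consult the better-trained expert model for each candidate domain.
    votes = [experts[c].predict(text) for c in candidates if c in experts]
    # Stage 3: the experts come together on a classification: majority vote,
    # analogous to an ensemble method.
    return Counter(votes).most_common(1)[0][0] if votes else None
```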
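For point 6, a minimal detection sketch in the spirit of feature squeezing [15]: run the classifier on the raw input and on a "squeezed" (normalised, e.g. spell-corrected) version, and flag a possible adversarial input when the two predictions diverge too much. The model.predict_proba and squeeze interfaces, and the threshold, are assumptions for illustration.

```python
import numpy as np

def looks_adversarial(text, model, squeeze, threshold=0.3):
    # Hypothetical interfaces: model.predict_proba returns a probability
    # vector over classes; squeeze(text) returns a normalised copy of text.
    p_raw = np.asarray(model.predict_proba(text))
    p_squeezed = np.asarray(model.predict_proba(squeeze(text)))
    # A large L1 gap between the two predictions suggests a crafted perturbation.
    return float(np.abs(p_raw - p_squeezed).sum()) > threshold
```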
References:
1. https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/
2. http://streamhacker.com/2011/10/31/fuzzy-string-matching-python/
3. http://dsnotes.com/post/glove-enwiki
4. http://sujitpal.blogspot.sg/2014/10/clustering-word-vectors-using-self.html
5. https://news.ycombinator.com/item?id=13587903
6. https://www.quora.com/How-is-GloVe-different-from-word2vec
7. http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
8. http://www.cinjon.com/papers-multimodal-seq2seq/
9. https://www.linkedin.com/pulse/unsupervised-deep-learning-dialog-chatbots-vc-ramesh
10. https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html
11. https://openreview.net/pdf?id=H1Gq5Q9el
12. https://arxiv.org/abs/1611.02683
13. https://vcrsoft.wordpress.com/2016/10/16/unsupervised-deep-learning-for-vertical-conversational-chatbots/
14. https://arxiv.org/abs/1704.08006
15. https://github.com/QData/FeatureSqueezing
16. https://github.com/gongzhitaao/tensorflow-adversarial
17. https://github.com/bogdan-kulynych/textfool