Classification of "BBC News" and a performance comparison of three model architectures, followed by 2D word embedding visualization using PCA and 3D word embedding visualization using t-SNE.
This is part of a Kaggle competition, the Toxic Comment Classification Challenge by Jigsaw. It was a multilabel classification challenge. This code is an improved version of my submission to the competition.
A word embedding is a learned representation for text in which words that have the same meaning have a similar representation. Word embeddings are in fact a class of techniques in which individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector, and the vector values are learned in a way that resembles a neural network.
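To make "similar representation" concrete, here is a minimal sketch that compares word vectors by cosine similarity. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are hypothetical and hand-picked for illustration; real embeddings are learned from a corpus and usually have tens to hundreds of dimensions.

```python
import math

# Toy vectors chosen by hand for illustration only (assumption: not learned).
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.95],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, table):
    """Return the other word in `table` whose vector is closest to `word`."""
    return max(
        (w for w in table if w != word),
        key=lambda w: cosine_similarity(table[word], table[w]),
    )


nearest_to_king = most_similar("king", embeddings)
```

With these toy values, the vectors for "king" and "queen" point in nearly the same direction, so their cosine similarity is much higher than that of "king" and "apple".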
The aim of this project is to discover OOV (out-of-vocabulary) words from Sina Weibo and to understand them using the Word2Vec model. First, a news corpus was crawled from Sina Weibo, candidate word lists were generated from it using mutual information and left and right entropy measures, and OOV words were extracted from those lists using online dictionaries. Second, the relevant corpus containing the OOV words was extracted from Weibo. Third, a third-party tool was used to segment the corpus into words, and distributed representations of the words were obtained with Word2Vec's CBOW (continuous bag-of-words) and Skip-Gram models. Fourth, the distributed representations were used to find words similar to each OOV word in order to achieve a semantic understanding of it. The final model has a high rate of correct word comprehension and is able to understand most of the OOV words.
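The first step can be sketched as follows. This is an illustrative assumption, not the project's actual code: it scores adjacent token pairs by pointwise mutual information (how strongly the two parts stick together) and by the entropy of their left and right neighbours (how freely the pair combines with its context). The function name `score_candidates` and the English toy corpus are hypothetical; the project itself works on Chinese text from Weibo.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of a Counter of neighbour frequencies."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def score_candidates(tokens):
    """Map each adjacent pair to (pmi, left_entropy, right_entropy)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    left = defaultdict(Counter)   # tokens seen immediately before each pair
    right = defaultdict(Counter)  # tokens seen immediately after each pair
    for i in range(n - 1):
        pair = (tokens[i], tokens[i + 1])
        if i > 0:
            left[pair][tokens[i - 1]] += 1
        if i + 2 < n:
            right[pair][tokens[i + 2]] += 1

    scores = {}
    for pair, count in bigrams.items():
        p_pair = count / (n - 1)
        p_first = unigrams[pair[0]] / n
        p_second = unigrams[pair[1]] / n
        pmi = math.log2(p_pair / (p_first * p_second))
        scores[pair] = (pmi, entropy(left[pair]), entropy(right[pair]))
    return scores


# Toy corpus (assumption): "new york" recurs in varied contexts.
tokens = "i like new york . she left new york today . we love new york !".split()
scores = score_candidates(tokens)
```

A pair such as ("new", "york") gets a positive mutual information score and high entropy on both sides, which is the pattern expected of a genuine word candidate. A pair that appears in only one context gets zero entropy. The thresholds used to accept a candidate are not given in the description, so none are assumed here.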