What is an “embedding” for a dictionary of words? That is the question this article addresses.
When one uses neural networks for linguistic processing, one of the first things to consider is how a word should be represented: every word needs a numeric representation before the network can do any processing on it. “Embedding” is the term used for assigning a number, or a vector (set of numbers), to represent a particular word from a dictionary. An obvious, but as it turns out not very practical, embedding is the traditional one-hot code: take the dictionary of all possible words in the vocabulary, define a vector whose length matches the size of the dictionary, and set the element at the position of a given word to 1 and all others to zero. For example, if the dictionary of all words your network will work with is simply [‘man’, ‘woman’, ‘red’, ‘blue’, ‘green’], then each vector has 5 dimensions: the word ‘man’ would be represented by [1, 0, 0, 0, 0], ‘woman’ by [0, 1, 0, 0, 0], and so on (a small code sketch of this scheme follows the list below). There are several problems with this choice of embedding:
- The size of each vector is defined by the size of the dictionary. For typical “real-world” dictionaries, each word would be represented by a vector containing a million or more numbers.
- The vectors themselves are sparse (all but one of the 1,000,000+ numbers in any particular vector are 0.0).
- Words embedded in this vector space have no particular relationship to one another based on their location in the vector space.
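To make the one-hot scheme concrete, here is a minimal sketch in Python using the five-word dictionary from the example above (the `one_hot` helper is just for illustration, not part of any library):

```python
import numpy as np

# The toy dictionary from the example above.
dictionary = ['man', 'woman', 'red', 'blue', 'green']

def one_hot(word, dictionary):
    """Return a vector with a 1 at the word's position and 0 everywhere else."""
    vec = np.zeros(len(dictionary))
    vec[dictionary.index(word)] = 1.0
    return vec

print(one_hot('man', dictionary))    # [1. 0. 0. 0. 0.]
print(one_hot('woman', dictionary))  # [0. 1. 0. 0. 0.]
```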
So the question arises: can a vector space be found in which to embed (represent) the words of a dictionary such that the vectors themselves are small (say, only 100-300 elements, rather than the 1,000,000+ elements of a one-hot encoding) and such that words that lie near one another in this multi-dimensional space have similar meanings? The nice thing about the second of these desired attributes is that rather than representing a word by an arbitrary number or vector, each chosen vector would somehow capture the syntactic or semantic characteristics of its word.
Rather than using the simple one-hot encoding, Tomas Mikolov (working at Google at the time) came up with the idea in 2013 of training a simple neural network (one with only linear activations: no sigmoid, relu, or other non-linear activations) to learn embeddings for words based on the context in which they are used. Here’s a link to Mikolov’s word2vec paper, titled “Efficient Estimation of Word Representations in Vector Space” (it’s a very readable paper!). Mikolov trained a network using a dictionary of the million most-frequently-used words in a training corpus consisting of the 6 billion words from Google News. In one example, each word was represented by a vector of 300 floating-point values (there is nothing special about this particular vector space; a new one is created each time a word2vec network architecture is defined, via its hyperparameters, and trained).
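If you want to poke at embeddings like these yourself, the 300-dimensional vectors Google published alongside the paper (trained on the Google News corpus) are freely downloadable. Here is a minimal sketch, assuming the gensim library (4.x attribute names) is installed and the `GoogleNews-vectors-negative300.bin` file has been downloaded locally; the file path is an assumption, not something fixed by the paper:

```python
from gensim.models import KeyedVectors

# Load the pretrained 300-dimensional Google News vectors
# (path/filename depends on where you downloaded the file).
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

print(vectors['king'].shape)      # (300,) -- one 300-element embedding per word
print(len(vectors.key_to_index))  # size of the vocabulary (roughly 3 million entries)
```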
The interesting thing about Mikolov’s 300-dimensional vector space is that words with similar syntactic and semantic meanings tend to group near one another in various dimensions of the space. And, surprisingly, one can do simple vector math on the learned embedding vectors to explore these relationships. For example, if you train a network to learn a set of word embeddings, take the vector representing the word “King”, subtract the vector for the word “Man”, and add the vector for the word “Woman”, you will find that the word closest to the resulting vector in the word2vec vector space turns out to be “Queen”. This was a rather amazing result, and to my knowledge Mikolov’s group was the first to find this type of multi-dimensional vector encoding that could perform semantic math! In fact, after training their network on the Google News corpus, Mikolov’s group ran some 20,000 semantic and syntactic analogy tests and found that their learned embeddings placed 65% of the syntactic vectors and 34% of the semantic vectors in the “correct” location in the vector space – where “correct” meant that, out of the million words in the dictionary, the embedding closest to the computed vector was that of the expected word.
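Here is a short sketch of the King - Man + Woman experiment using the same gensim vectors as above; gensim’s `most_similar` performs exactly this vector arithmetic (add the “positive” vectors, subtract the “negative” ones) and returns the nearest words by cosine similarity, excluding the input words themselves:

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# king - man + woman, then find the closest words to the resulting vector.
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
# 'queen' is expected to appear at or near the top of the returned list.
```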
Mikolov’s group trained their network using two different methods: Continuous Bag of Words (CBOW) and the skip-gram model. The CBOW method tries to guess a particular word based on the four words preceding and the four words following the target word (the window size of 4 is arbitrary, another hyperparameter, but Mikolov found that a +/- 4 window produced good results). Positive training examples were produced from the Google News corpus; negative examples were generated from random sets of preceding and following words. Here’s a pictorial representation of the CBOW model (from the Mikolov paper):
As you can see, in this example the input to the CBOW model consists of the embeddings of the two words preceding and the two words following the target word, and the desired output is the predicted target word. The network is trained to produce high activations for examples drawn from the corpus and low activations for random sequences of input words.
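To make the training setup concrete, here is a small sketch (plain Python, the function name is my own, not Mikolov’s) of how (context, target) pairs could be extracted from a corpus for CBOW with a +/- 2 window, matching the diagram:

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for CBOW-style training."""
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        yield context, target

sentence = "the quick brown fox jumps over the lazy dog".split()
for context, target in cbow_pairs(sentence, window=2):
    print(context, '->', target)
# e.g. ['the', 'quick', 'fox', 'jumps'] -> brown
```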
The skip-gram model turns this network on its head: rather than predicting the word that would lie between a set of preceding and succeeding words, it takes a single word as input and predicts, as output, the most likely set of words to precede and succeed that word, as shown in the second diagram, below:
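The same corpus can be turned into skip-gram training pairs by reversing the roles: each word is used to predict each of its context words individually. Again, a minimal illustrative sketch rather than Mikolov’s actual implementation:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (input_word, context_word) pairs for skip-gram-style training."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(list(skipgram_pairs(sentence, window=2))[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```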
It turns out that the skip-gram model actually does a better job of determining the “best” set of embeddings for a dictionary of words, where “best” means that the vector encodings end up located in the 300-dimensional vector space such that words with similar syntactic and semantic meanings lie near one another.
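Once a set of embeddings has been learned (by either model), “near one another” simply means high cosine similarity. Here is a small numpy sketch of a nearest-neighbor lookup over a toy embedding matrix; the random vectors are just stand-ins for trained 300-dimensional embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
dictionary = ['man', 'woman', 'king', 'queen', 'red', 'blue']
embeddings = rng.normal(size=(len(dictionary), 300))  # stand-in for trained vectors

def nearest(word, dictionary, embeddings, topn=3):
    """Return the topn words whose embeddings have the highest cosine similarity."""
    v = embeddings[dictionary.index(word)]
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ (v / np.linalg.norm(v))
    order = np.argsort(-sims)
    return [(dictionary[i], float(sims[i])) for i in order if dictionary[i] != word][:topn]

print(nearest('king', dictionary, embeddings))
```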