CBOW vs. skip-gram: why invert context and target words?

Here is my oversimplified and rather naive understanding of the difference: As we know, CBOW learns to predict a word from its context, that is, to maximize the probability of the target word given the context. This happens to be a problem for rare words. For example, given the context yesterday was a …
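To make the inversion concrete, here is a minimal sketch of how the two setups could turn the same sentence into training examples. The function name, the toy sentence, and the window size are my own illustrative assumptions, not taken from the original answer or from any library.

```python
def training_examples(tokens, window=2):
    """Build toy (CBOW, skip-gram) training examples from a token list."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: the whole context jointly predicts the single target word.
        cbow.append((context, target))
        # Skip-gram: the target word predicts each context word separately.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


# Hypothetical example sentence, chosen only for illustration.
sentence = "the quick brown fox jumps".split()
cbow, skipgram = training_examples(sentence, window=1)

for context, target in cbow:
    print("CBOW      ", context, "->", target)
for target, context_word in skipgram:
    print("skip-gram ", target, "->", context_word)
```

In this sketch CBOW gets one example per position, while skip-gram gets one example per (target, context word) pair, so each occurrence of a word contributes several separate updates.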

gensim word2vec: Find number of words in vocabulary

In recent versions, the model.wv property holds the words and vectors, and can itself report a length: the number of words it contains. So if w2v_model is your Word2Vec (or Doc2Vec or FastText) model, it's enough to just do:

    vocab_len = len(w2v_model.wv)

If your model is just a raw set of word-vectors, like a KeyedVectors …

Doc2vec: How to get document vectors

If you want to train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and tags (document ids). It can also contain some additional info (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more information).

    # Import libraries
    from gensim.models import doc2vec
    from collections import namedtuple

    # Load data
    doc1 = ["This is …

What is a projection layer in the context of neural networks?

I find the previous answers here a bit overcomplicated: a projection layer is just a simple matrix multiplication, or in the context of NNs, a regular/dense/linear layer without the non-linear activation at the end (sigmoid/tanh/ReLU/etc.). The idea is to project the (e.g.) 100K-dimensional discrete vector into a 600-dimensional continuous vector (I chose the numbers …
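As a rough illustration of that idea, the sketch below multiplies a one-hot vector by a weight matrix with no bias and no activation. The tiny sizes (10 and 4) are stand-ins for the 100K and 600 mentioned above, and the names W, one_hot and project are assumptions made for this example only.

```python
import random

# Toy sizes standing in for e.g. 100K input dimensions and 600 output dimensions.
vocab_size, embed_dim = 10, 4

random.seed(0)
# Projection matrix: one row of embed_dim weights per vocabulary entry.
W = [[random.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]


def one_hot(index, size):
    """Discrete input: all zeros except a single 1 at the word's index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(x, W):
    """Linear layer with no bias and no activation: y = x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


word_index = 3
dense = project(one_hot(word_index, vocab_size), W)

# With a one-hot input, the multiplication just selects one row of W.
print(dense == W[word_index])
```

Because the input is one-hot, the product reduces to a row lookup, which is why such a layer is often implemented as a table lookup rather than a full matrix multiplication.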

Convert word2vec bin file to text

I use this code to load the binary model and then save it to a text file:

    from gensim.models.keyedvectors import KeyedVectors

    model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
    model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

References: API and nullege.

Note: The above code is for newer versions of gensim. For earlier versions, I used this code:

    from gensim.models import word2vec

    model = word2vec.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
    model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
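For orientation, here is a small sketch of what the resulting text file looks like, assuming the usual plain-text word2vec layout: a header line with the word count and the dimensionality, then one line per word with space-separated values. The helper names and the toy vectors are hypothetical, and this does not use gensim or read the binary file.

```python
import io


def write_word2vec_text(vectors, fileobj):
    """Write {word: [floats]} in the plain-text word2vec layout."""
    dim = len(next(iter(vectors.values())))
    # Header line: "<number of words> <vector dimensionality>"
    fileobj.write(f"{len(vectors)} {dim}\n")
    for word, vec in vectors.items():
        fileobj.write(word + " " + " ".join(f"{v:.6f}" for v in vec) + "\n")


def read_word2vec_text(fileobj):
    """Read the same layout back into a {word: [floats]} dict."""
    count, dim = map(int, fileobj.readline().split())
    vectors = {}
    for _ in range(count):
        parts = fileobj.readline().rstrip("\n").split(" ")
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {parts[0]!r}")
        vectors[parts[0]] = values
    return vectors


# Hypothetical toy vectors, not real GoogleNews values.
toy = {"king": [0.5, -0.25, 1.0], "queen": [0.25, 0.75, -1.0]}

buf = io.StringIO()
write_word2vec_text(toy, buf)
text = buf.getvalue()
print(text)

buf.seek(0)
roundtrip = read_word2vec_text(buf)
```

The text form is much larger than the binary one but can be inspected with any editor or processed line by line.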

How to calculate the sentence similarity using word2vec model of gensim with python

This is actually a pretty challenging problem. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g. “he walked to the store yesterday” and “yesterday, he walked to the store”), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding …
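For contrast with the full problem described above, a common crude baseline (not the approach this answer describes) is to average the word vectors of each sentence and compare the averages by cosine similarity. The sketch below uses made-up 3-dimensional vectors and hypothetical helper names; with a real model the vectors would come from the trained word2vec model instead.

```python
import math

# Hypothetical toy vectors; a real model would supply learned ones.
toy_vectors = {
    "he": [0.1, 0.3, -0.2],
    "walked": [0.7, -0.1, 0.4],
    "to": [0.0, 0.2, 0.1],
    "the": [0.05, 0.1, 0.0],
    "store": [-0.3, 0.8, 0.5],
    "yesterday": [0.6, 0.2, -0.7],
}


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    return [w.strip(".,!?").lower() for w in sentence.split()]


def mean_vector(words, vectors):
    """Average the vectors of the words that are in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no known words in sentence")
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sentence_similarity(s1, s2, vectors):
    return cosine(mean_vector(tokenize(s1), vectors),
                  mean_vector(tokenize(s2), vectors))


sim = sentence_similarity("he walked to the store yesterday",
                          "yesterday, he walked to the store",
                          toy_vectors)
print(sim)
```

Averaging ignores word order entirely, so the two reordered sentences score as essentially identical. That happens to be right here, but the same blindness to structure is why this baseline falls short of the grammatical modeling the answer calls for.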
