doc2vec – Tarik Billa

Doc2Vec Get most similar documents

June 8, 2023 by Tarik

To get a document vector for new text, use infer_vector, which does not alter the underlying model. Here is how you do it:

    tokens = "a new sentence to match".split()
    new_vector = model.infer_vector(tokens)
    # gives you the top 10 document tags and their cosine similarity
    # (in gensim 4.x the attribute is model.dv; model.docvecs is the older name)
    sims = model.docvecs.most_similar([new_vector])

Edit: Here is an … Read more
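To make the "top document tags and their cosine similarity" result concrete, here is a minimal pure-Python sketch of that kind of ranking. It is not gensim's implementation: the `cosine` and `rank_by_cosine` helpers, the tags, and the toy two-dimensional vectors are all made up for illustration, and the query vector stands in for whatever infer_vector would return.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query, doc_vectors, topn=10):
    """Return up to topn (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query, vec)) for tag, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical stored document vectors, keyed by document tag.
doc_vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}

# Stand-in for the vector inferred from the new text.
new_vector = [2.0, 0.1]

sims = rank_by_cosine(new_vector, doc_vectors, topn=2)
for tag, score in sims:
    print(tag, round(score, 4))
```

Nothing in the stored vectors changes during the lookup, which mirrors the point above that inferring and querying leave the trained model untouched.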

How to use Gensim doc2vec with pre-trained word vectors?

May 28, 2023 by Tarik

Note that the “DBOW” (dm=0) training mode doesn’t require or even create word-vectors as part of the training. It merely learns document vectors that are good at predicting each word in turn (much like the word2vec skip-gram training mode). (Before gensim 0.12.0, there was the parameter train_words mentioned in another comment, which some documentation suggested … Read more

gensim Doc2Vec vs tensorflow Doc2Vec

April 17, 2023 by Tarik

Old question, but an answer would be useful for future visitors, so here are some of my thoughts. There are some problems in the TensorFlow implementation: window is the one-sided size, so window=5 would cover 5*2+1 = 11 words. Note that with the PV-DM version of doc2vec, the batch_size would be the number of documents. So train_word_dataset … Read more
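The window arithmetic in that excerpt can be checked with a short sketch. It assumes a one-sided window in the sense described there: `window` words to the left of a center word, `window` to the right, plus the center word itself. The `context_window` helper and the toy token list are hypothetical and are not taken from either implementation.

```python
def context_window(tokens, center, window):
    """Return the center token plus up to `window` tokens on each side."""
    start = max(0, center - window)
    end = min(len(tokens), center + window + 1)
    return tokens[start:end]


# Toy sentence of 20 placeholder tokens: w0, w1, ..., w19.
tokens = [f"w{i}" for i in range(20)]

# Away from the sentence edges, window=5 spans 5*2+1 = 11 words.
span = context_window(tokens, center=10, window=5)
print(len(span), span[0], span[-1])

# At the start of the sentence the left side is cut off.
edge_span = context_window(tokens, center=0, window=5)
print(len(edge_span))
```

The full span of 2*window+1 words only applies away from the sentence boundaries; near the edges the window is clipped.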

ImportError: cannot import name ‘joblib’ from ‘sklearn.externals’

December 30, 2022 by Tarik