gensim – Tarik Billa

I found the solution to the problem. There were two parameters I had not taken care of, which should be passed to the Phrases() model: min_count ignores all words and bigrams with a total collected count lower than this value (the default is 5), and threshold represents a threshold for forming the phrases (higher means …

Doc2Vec Get most similar documents

June 8, 2023 by Tarik

You need to use infer_vector to get a document vector for the new text; this does not alter the underlying model. Here is how you do it:
    tokens = "a new sentence to match".split()
    new_vector = model.infer_vector(tokens)
    sims = model.docvecs.most_similar([new_vector])  # gives you the top 10 document tags and their cosine similarity
Edit: Here is an …

How to get tfidf with pandas dataframe?

June 2, 2023 by Tarik

The scikit-learn implementation is really easy:
    from sklearn.feature_extraction.text import TfidfVectorizer
    v = TfidfVectorizer()
    x = v.fit_transform(df['sent'])
There are plenty of parameters you can specify; see the TfidfVectorizer documentation. The output of fit_transform is a sparse matrix. If you want to inspect it, you can call x.toarray():
    In [44]: x.toarray()
    Out[44]: array([[ 0.64612892, 0.38161415, 0. …
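To make the result easier to read than a raw array, the dense matrix can be wrapped back into a DataFrame with one column per term. This is a sketch under assumptions: the three sample sentences are invented, and only the column name 'sent' comes from the snippet above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample data; only the column name 'sent' is from the post.
df = pd.DataFrame({"sent": ["the cat sat", "the dog barked", "the cat and the dog"]})

v = TfidfVectorizer()
x = v.fit_transform(df["sent"])  # sparse matrix, one row per sentence

# vocabulary_ maps each term to its column index; sorting by that index
# recovers the column order without relying on a version-specific helper.
terms = sorted(v.vocabulary_, key=v.vocabulary_.get)

dense = x.toarray()
tfidf_df = pd.DataFrame(dense, columns=terms, index=df.index)
print(tfidf_df.round(3))
```

Converting to a dense array is fine for small data; for a large vocabulary it is usually better to keep the sparse matrix.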

How to check if a key exists in a word2vec trained model or not

May 31, 2023 by Tarik

Word2Vec also provides a 'vocab' member, which you can access directly. Using a Pythonic approach:
    if word in w2v_model.vocab:
        # Do something
EDIT: Since gensim release 2.0, the API for Word2Vec has changed. To access the vocabulary you should now use this:
    if word in w2v_model.wv.vocab:
        # Do something
EDIT 2: The attribute 'wv' is being …