tf-idf
Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?
You’re right that vocabulary is what you want. It works like this:
>>> import sklearn.feature_extraction.text
>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)
So you pass it a dict with your desired …
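A minimal sketch of the fixed-vocabulary idea from the answer above, counting terms in documents that played no part in choosing the tokens. The sample documents and the variable names (`vocabulary`, `new_documents`, `counts`) are illustrative assumptions; only `CountVectorizer` and its `vocabulary` parameter come from the answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tokens chosen ahead of time (hypothetical; in practice they could come
# from another corpus or a hand-written list).
vocabulary = ["hot", "cold", "old"]

# Documents that were NOT used to pick the tokens.
new_documents = [
    "pease porridge hot",
    "pease porridge cold",
    "pease porridge in the pot",
    "nine days old, hot hot porridge",
]

# With a fixed vocabulary, fitting learns nothing new: every column of the
# result corresponds to one entry of `vocabulary`, in the order given.
cv = CountVectorizer(vocabulary=vocabulary)
counts = cv.fit_transform(new_documents).toarray()

for doc, row in zip(new_documents, counts):
    print(dict(zip(vocabulary, row.tolist())), "<-", doc)
```

Words outside the vocabulary (such as "porridge") are simply ignored, so the matrix always has exactly one column per requested term.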
TFIDF for Large Dataset
Gensim has an efficient tf-idf model that does not need everything in memory at once. Your corpus simply needs to be an iterable, so the whole corpus never has to be held in memory at one time. The make_wiki script runs over Wikipedia in about 50 minutes on a laptop according to the …
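To illustrate why an iterable corpus is enough, here is a pure-Python sketch of streaming tf-idf in two passes. It is not Gensim's implementation; the class `LineCorpus`, the helper names, and the plain `tf * log(N / df)` weighting are all assumptions made for the example.

```python
import math
from collections import Counter


class LineCorpus:
    """Re-iterable corpus that yields one tokenised document at a time.

    Hypothetical helper: a real corpus would read lines lazily from a file;
    a list is used here only to keep the example self-contained.
    """

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.lower().split()


def document_frequencies(corpus):
    """First pass: count how many documents contain each token."""
    df = Counter()
    n_docs = 0
    for tokens in corpus:
        df.update(set(tokens))
        n_docs += 1
    return df, n_docs


def stream_tfidf(corpus, df, n_docs):
    """Second pass: yield one {token: weight} dict per document."""
    for tokens in corpus:
        tf = Counter(tokens)
        yield {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}


corpus = LineCorpus(["pease porridge hot", "pease porridge cold", "nine days old"])
df, n_docs = document_frequencies(corpus)
weights = list(stream_tfidf(corpus, df, n_docs))
print(weights[0])
```

Only the document-frequency table and the current document are held in memory at any point, which is the property the answer relies on.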
How to get tfidf with pandas dataframe?
The scikit-learn implementation is really easy:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])
There are plenty of parameters you can specify; see the TfidfVectorizer documentation. The output of fit_transform will be a sparse matrix; if you want to visualize it you can do x.toarray():
In [44]: x.toarray()
Out[44]: array([[ 0.64612892, 0.38161415, 0. …
Why is log used when calculating term frequency weight and IDF, inverse document frequency?
Debasis’s answer is correct. I am not sure why he got downvoted. Here is the intuition: if the term frequency for the word ‘computer’ in doc1 is 10 and in doc2 it’s 20, we can say that doc2 is more relevant than doc1 for the word ‘computer’. However, if the term frequency of the same word, …
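The dampening effect of the logarithm can be seen with a few numbers. This sketch assumes one common form of log scaling, `1 + log(tf)` for term frequency and `log(N / df)` for IDF, with natural logarithms; the exact formulas and the sample counts are assumptions and are not taken from the answer.

```python
import math


def log_tf(tf):
    """Sublinear term-frequency weight: 1 + ln(tf), or 0 for absent terms."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def idf(n_docs, doc_freq):
    """Inverse document frequency: ln(N / df)."""
    return math.log(n_docs / doc_freq)


# Doubling the raw count from 10 to 20 does not double the weight, and a
# count of one million is only a few times larger than a count of 10.
for tf in (10, 20, 1_000_000):
    print(f"tf={tf:>9}  raw weight={tf:>9}  log weight={log_tf(tf):.3f}")

# Likewise for IDF: in a hypothetical million-document collection, a term
# found in 10 documents is 100 times rarer than one found in 1,000, but the
# logged values differ by less than a factor of two.
n_docs = 1_000_000
for doc_freq in (1_000, 10):
    print(f"df={doc_freq:>5}  N/df={n_docs // doc_freq:>7}  idf={idf(n_docs, doc_freq):.3f}")
```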
Scikit Learn TfidfVectorizer : How to get top n terms with highest tf-idf score
You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you’re looking for:
feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]
n = 3
top_n = feature_array[tfidf_sorting][:n]
This gives me:
array([u'fruit', u'travellers', u'jupiter'], dtype="<U13")
The argsort call is really the useful one, …