Can I use CountVectorizer in scikit-learn to count token frequencies in documents that were not used to extract the tokens?

You’re right that vocabulary is what you want. It works like this:

```
>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold',
...                   'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)
```

So you pass it a dict with your desired … Read more
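For the question as asked, counting tokens in documents that were not used to build the vocabulary, a minimal sketch (the corpus strings below are made up for illustration):

```python
# Learn the vocabulary from one corpus, then count term frequencies in
# unseen documents with transform(). All strings are illustrative.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['pease porridge hot', 'pease porridge cold']
new_docs = ['some like it hot, some like it cold', 'nine days old']

cv = CountVectorizer()
cv.fit(train_docs)               # vocabulary comes from train_docs only
counts = cv.transform(new_docs)  # counts for documents the vectorizer never saw

# get_feature_names_out() on scikit-learn >= 1.0; older releases used
# get_feature_names().
print(cv.get_feature_names_out())
print(counts.toarray())
```

Tokens that appear in the new documents but not in the training corpus are simply ignored, which is the expected behavior when the vocabulary is fixed up front.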

How to get tf-idf with a pandas DataFrame?

Scikit-learn’s implementation is really easy:

```
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
x = v.fit_transform(df['sent'])
```

There are plenty of parameters you can specify; see the documentation here. The output of fit_transform will be a sparse matrix; if you want to visualize it you can call x.toarray():

In [44]: x.toarray()
Out[44]: array([[ 0.64612892, 0.38161415, 0. … Read more
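A minimal self-contained sketch of the round trip back into pandas, assuming a DataFrame with a text column named 'sent' as in the excerpt (the sentences themselves are invented):

```python
# Build tf-idf features from a DataFrame column, then wrap the sparse
# result back into a labeled DataFrame with one column per term.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'sent': ['pease porridge hot', 'pease porridge cold']})

v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

tfidf_df = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
print(tfidf_df)
```

Note that toarray() densifies the matrix, which is fine for inspection but can exhaust memory on a large vocabulary; keep the sparse matrix for downstream models.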

Why is log used when calculating term frequency weight and IDF, inverse document frequency?

Debasis’s answer is correct. I am not sure why he got downvoted. Here is the intuition: if the term frequency for the word ‘computer’ in doc1 is 10 and in doc2 it’s 20, we can say that doc2 is more relevant than doc1 for the word ‘computer’. However, if the term frequency of the same word, … Read more
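A tiny numeric illustration of the dampening the answer is describing, using the common 1 + log(tf) scaling (a standard textbook variant, assumed here since the excerpt is cut off before its own formula):

```python
import math

# Raw term frequencies for 'computer' in two documents.
tf_doc1, tf_doc2 = 10, 20

# Log scaling compresses the gap: doc2 still ranks higher, but it is not
# treated as twice as relevant just because the raw count doubled.
print(1 + math.log(tf_doc1))  # ~3.30
print(1 + math.log(tf_doc2))  # ~4.00
```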

Scikit-learn TfidfVectorizer: How to get the top n terms with the highest tf-idf score

You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you’re looking for:

```
feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]
```

This gives me:

```
array([u'fruit', u'travellers', u'jupiter'], dtype='<U13')
```

The argsort call is really the useful one, … Read more
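A self-contained sketch of the same top-n idea; the three sentences and the query stand in for the asker's corpus, which the excerpt does not show:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the fruit of jupiter', 'travellers eat fruit', 'jupiter travellers']

tfidf = TfidfVectorizer()
tfidf.fit(docs)
response = tfidf.transform(['fruit for travellers near jupiter'])

# get_feature_names_out() on recent scikit-learn; older releases used
# get_feature_names() as in the excerpt above.
feature_array = np.array(tfidf.get_feature_names_out())

# argsort gives indices of ascending scores; reversing them puts the
# highest-scoring terms first, so the first n entries are the top n.
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
print(feature_array[tfidf_sorting][:n])
```

Note this ranks terms for a single transformed document; for a multi-row matrix you would apply the argsort per row rather than flattening.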
