Can I use CountVectorizer in scikit-learn to count token frequencies in documents that were not used to extract the tokens?

You’re right that vocabulary is what you want. It works like this:

```
>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold',
...                   'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)
```

So you pass it a dict with your desired … Read more
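For the question as asked, counting tokens in documents that were not used to build the vocabulary, a minimal sketch (the corpus strings below are made up for illustration):

```python
# Learn the vocabulary from one corpus, then count term frequencies in
# unseen documents with transform(). All strings are illustrative.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['pease porridge hot', 'pease porridge cold']
new_docs = ['some like it hot, some like it cold', 'nine days old']

cv = CountVectorizer()
cv.fit(train_docs)               # vocabulary comes from train_docs only
counts = cv.transform(new_docs)  # counts for documents the vectorizer never saw

# get_feature_names_out() on scikit-learn >= 1.0; older releases used
# get_feature_names().
print(cv.get_feature_names_out())
print(counts.toarray())
```

Tokens that appear in the new documents but not in the training corpus are simply ignored, which is the expected behavior when the vocabulary is fixed up front.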

How to get tf-idf with a pandas DataFrame?

Scikit-learn’s implementation is really easy:

```
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
x = v.fit_transform(df['sent'])
```

There are plenty of parameters you can specify; see the documentation here. The output of fit_transform will be a sparse matrix; if you want to visualize it you can call x.toarray():

In [44]: x.toarray()
Out[44]: array([[ 0.64612892, 0.38161415, 0. … Read more
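A minimal self-contained sketch of the round trip back into pandas, assuming a DataFrame with a text column named 'sent' as in the excerpt (the sentences themselves are invented):

```python
# Build tf-idf features from a DataFrame column, then wrap the sparse
# result back into a labeled DataFrame with one column per term.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'sent': ['pease porridge hot', 'pease porridge cold']})

v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

tfidf_df = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
print(tfidf_df)
```

Note that toarray() densifies the matrix, which is fine for inspection but can exhaust memory on a large vocabulary; keep the sparse matrix for downstream models.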

Why is log used when calculating term frequency weight and IDF, inverse document frequency?

Debasis’s answer is correct. I am not sure why he got downvoted. Here is the intuition: if the term frequency for the word ‘computer’ in doc1 is 10 and in doc2 it’s 20, we can say that doc2 is more relevant than doc1 for the word ‘computer’. However, if the term frequency of the same word, … Read more
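A tiny numeric illustration of the dampening the answer is describing, using the common 1 + log(tf) scaling (a standard textbook variant, assumed here since the excerpt is cut off before its own formula):

```python
import math

# Raw term frequencies for 'computer' in two documents.
tf_doc1, tf_doc2 = 10, 20

# Log scaling compresses the gap: doc2 still ranks higher, but it is not
# treated as twice as relevant just because the raw count doubled.
print(1 + math.log(tf_doc1))  # ~3.30
print(1 + math.log(tf_doc2))  # ~4.00
```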

Scikit-learn TfidfVectorizer: How to get the top n terms with the highest tf-idf score

You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you’re looking for:

```
feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]
```

This gives me:

```
array([u'fruit', u'travellers', u'jupiter'], dtype='<U13')
```

The argsort call is really the useful one, … Read more
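A self-contained sketch of the same top-n idea; the three sentences and the query stand in for the asker's corpus, which the excerpt does not show:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the fruit of jupiter', 'travellers eat fruit', 'jupiter travellers']

tfidf = TfidfVectorizer()
tfidf.fit(docs)
response = tfidf.transform(['fruit for travellers near jupiter'])

# get_feature_names_out() on recent scikit-learn; older releases used
# get_feature_names() as in the excerpt above.
feature_array = np.array(tfidf.get_feature_names_out())

# argsort gives indices of ascending scores; reversing them puts the
# highest-scoring terms first, so the first n entries are the top n.
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
print(feature_array[tfidf_sorting][:n])
```

Note this ranks terms for a single transformed document; for a multi-row matrix you would apply the argsort per row rather than flattening.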
