information-retrieval – Tarik Billa

How to parse the data from Google Alerts?

December 14, 2023 by Tarik

When you create the alert, set the “Deliver To” to “Feed” and then you can consume the feed XML as you would any other feed. This is much easier to parse and digest into a database.
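As a rough sketch of what consuming the feed might look like, the snippet below parses a small hand-written sample with Python's standard library and turns each entry into a row that could be inserted into a database. It assumes the feed is Atom-formatted; the sample XML, the `parse_alert_feed` helper, and the chosen fields are illustrative and not taken from a real alert.

```python
import xml.etree.ElementTree as ET

# Assumption: the alert feed uses the Atom namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written stand-in for a downloaded feed (hypothetical content).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Google Alert - example</title>
  <entry>
    <id>tag:example.com,2023:1</id>
    <title>First result</title>
    <link href="https://example.com/first"/>
    <published>2023-12-14T10:00:00Z</published>
  </entry>
  <entry>
    <id>tag:example.com,2023:2</id>
    <title>Second result</title>
    <link href="https://example.com/second"/>
    <published>2023-12-14T11:00:00Z</published>
  </entry>
</feed>"""


def parse_alert_feed(xml_text):
    """Return one dict per <entry>, ready to be written to a database table."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        rows.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "link": link.get("href", "") if link is not None else "",
            "published": entry.findtext("atom:published", default="", namespaces=ATOM_NS),
        })
    return rows


rows = parse_alert_feed(SAMPLE_FEED)
for row in rows:
    print(row["published"], row["title"], row["link"])
```

In practice the XML text would come from fetching the feed URL that Google Alerts shows for the alert, and each row would go into an `INSERT` statement keyed on the entry `id`.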

Cosine similarity and tf-idf

July 23, 2023 by Tarik

Why is log used when calculating term frequency weight and IDF, inverse document frequency?

May 20, 2023 by Tarik

Debasis’s answer is correct. I am not sure why he got downvoted. Here is the intuition: If term frequency for the word ‘computer’ in doc1 is 10 and in doc2 it’s 20, we can say that doc2 is more relevant than doc1 for the word ‘computer’. However, if the term frequency of the same word, … Read more
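To make the effect of a logarithm concrete, here is a small sketch using the two counts from the excerpt. It assumes the common sublinear weighting `1 + log10(tf)`, which is one conventional choice and not necessarily the exact formula the answer goes on to use; `log_tf_weight` is a hypothetical helper name.

```python
import math


def log_tf_weight(tf):
    """Sublinear term-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


# Counts for the word 'computer' in doc1 and doc2, as in the excerpt.
tf_doc1, tf_doc2 = 10, 20

raw_ratio = tf_doc2 / tf_doc1
damped_ratio = log_tf_weight(tf_doc2) / log_tf_weight(tf_doc1)

print("raw ratio:", raw_ratio)
print("log-damped ratio:", round(damped_ratio, 3))
```

With raw counts doc2 scores twice as high as doc1; with the log weight it still scores higher, but by a much smaller margin.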

How to specify two Fields in Lucene QueryParser?

February 19, 2023 by Tarik

There are 3 ways to do this. The first way is to construct a query manually; this is what QueryParser does internally. This is the most powerful way to do it, and means that you don’t have to parse the user input if you want to prevent access to some of the more exotic … Read more

Python: tf-idf-cosine: to find document similarity

December 29, 2022 by Tarik

First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it in one operation with TfidfVectorizer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>' with 1787553 … Read more
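For readers who want to see the three steps spelled out (counts, TF-IDF weighting, row-wise L2 normalization), here is a dependency-free sketch on a toy corpus. It is not scikit-learn's implementation: the whitespace tokenizer, the `tfidf_rows` and `cosine` helpers, and the sample documents are illustrative assumptions, and the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` is meant to mirror what TfidfVectorizer does by default. Because every row ends up with unit length, cosine similarity reduces to a plain dot product.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf, assumed to match scikit-learn's default behaviour.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)                              # count features
        weights = {t: c * idf[t] for t, c in counts.items()}  # tf-idf weighting
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
rows = tfidf_rows(docs)
# Similarity of the first document to every document, itself included.
sims = [cosine(rows[0], r) for r in rows]
print(sims)
```

The first document is most similar to itself, then to the second document, and shares no terms with the third.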

What is the best way to compute trending topics or tags?

October 13, 2022 by Tarik

This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average. In your case a z-score is calculated by the following formula, where the trend would … Read more
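As a minimal sketch of the idea, the standard score of a new observation against its history is `(observation - mean) / standard deviation`. The `z_score` helper and the daily counts below are hypothetical, and the sketch uses the population standard deviation; how the full answer maps this onto trends may differ.

```python
import statistics


def z_score(observation, history):
    """Standard score of `observation` relative to the values in `history`."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)  # population standard deviation
    if std == 0:
        # A flat history gives no scale to measure against; report no deviation.
        return 0.0
    return (observation - mean) / std


# Hypothetical daily mention counts for one tag.
history = [2, 4, 4, 4, 5, 5, 7, 9]

print(z_score(9, history))  # well above the historical average
print(z_score(5, history))  # exactly at the historical average
```

Ranking tags by this score favours those whose current activity is unusual for that tag, so a normally quiet tag with a sudden spike can outrank a tag that is always busy.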