This is how you can do it:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
# newer scikit-learn releases expect stop_words as a list, not a frozenset
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))
vectorizer = TfidfVectorizer(ngram_range=(1,1), stop_words=my_stop_words)
X = vectorizer.fit_transform(["This is a green apple.", "This is a machine learning book."])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
# printing the tfidf vectors
print(X)
# printing the vocabulary
print(vectorizer.vocabulary_)
In this example, I created the tfidf vectors for two sample documents:
"This is a green apple."
"This is a machine learning book."
By default, "this", "is", "a", and "an" are all in the ENGLISH_STOP_WORDS list. I also added "book" to the stop word list. This is the output:
(0, 1) 0.707106781187
(0, 0) 0.707106781187
(1, 3) 0.707106781187
(1, 2) 0.707106781187
{'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}
As we can see, the word "book" is also missing from the list of features because we listed it as a stop word. TfidfVectorizer accepted the manually added word as a stop word and ignored it when building the vectors.
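If you want to double-check this yourself, here is a minimal sketch that inspects the stop word set the vectorizer actually uses and the resulting vocabulary. The variable names (docs, default_hits, effective_stop_words, vocabulary) are my own for illustration, and the list(...) conversion assumes a recent scikit-learn release that validates stop_words as a list:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Extend the built-in English stop word list with a custom word.
# Assumption: a recent scikit-learn release, which expects a list here.
my_stop_words = list(ENGLISH_STOP_WORDS.union(["book"]))

docs = ["This is a green apple.", "This is a machine learning book."]
vectorizer = TfidfVectorizer(stop_words=my_stop_words)
X = vectorizer.fit_transform(docs)

# Which of the common words are already in the built-in list?
default_hits = [w for w in ("this", "is", "a", "an") if w in ENGLISH_STOP_WORDS]

# The effective stop word set the vectorizer works with.
effective_stop_words = vectorizer.get_stop_words()

# Feature names that survived stop word removal, in alphabetical order.
vocabulary = sorted(vectorizer.vocabulary_)

print(default_hits)
print("book" in effective_stop_words)
print(vocabulary)
```

If "book" shows up in the effective stop word set and not in the vocabulary, the custom stop word was picked up as intended.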