Adding words to the stop_words list in TfidfVectorizer in sklearn

This is how you can do it:

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

my_stop_words = text.ENGLISH_STOP_WORDS.union(["book"])
vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)
X = vectorizer.fit_transform(["this is an apple.", "this is a book."])
# note: in scikit-learn >= 1.2 this is vectorizer.get_feature_names_out()
idf_values = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

# printing the tf-idf vectors
print(X)
# printing the vocabulary
print(vectorizer.vocabulary_)
```

In this example, I created the tf-idf vectors for … Read more
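The IDF values above can also be checked by hand. With scikit-learn's default `smooth_idf=True`, the formula is `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing the term. A minimal sketch for the two-document corpus above, where "apple" appears in one document:

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # scikit-learn's default IDF (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# "apple" occurs in 1 of the 2 documents above
print(round(smoothed_idf(2, 1), 4))  # -> 1.4055
```

This matches the value stored for "apple" in `vectorizer.idf_`.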

Adding words to scikit-learn’s CountVectorizer’s stop list

According to the source code for sklearn.feature_extraction.text, the full list (actually a frozenset, from its stop_words module) of ENGLISH_STOP_WORDS is exposed through __all__. Therefore, if you want to use that list plus some more items, you could do something like:

```python
from sklearn.feature_extraction import text

stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)
```

(where my_additional_stop_words is any sequence of strings) and use the … Read more
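Since ENGLISH_STOP_WORDS is a frozenset, `.union()` accepts any iterable of strings and returns a new frozenset. A self-contained sketch using a small stand-in frozenset in place of sklearn's real list (so it runs without scikit-learn installed):

```python
# stand-in for text.ENGLISH_STOP_WORDS, which is a frozenset of strings
english_stop_words = frozenset({"a", "an", "the", "is"})

my_additional_stop_words = ["foo", "bar"]  # any sequence of strings works
stop_words = english_stop_words.union(my_additional_stop_words)

print(sorted(stop_words))  # -> ['a', 'an', 'bar', 'foo', 'is', 'the']
```

The result can then be passed as the `stop_words` argument to CountVectorizer.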

NLTK and Stopwords Fail #lookuperror

You don’t seem to have the stopwords corpus on your computer. You need to start the NLTK Downloader and download all the data you need. Open a Python console and do the following:

```python
>>> import nltk
>>> nltk.download()
showing info http://nltk.github.com/nltk_data/
```

In the GUI window that opens, simply press the ‘Download’ button to download all … Read more
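If you only need the stopwords corpus and want to skip the GUI, a non-interactive alternative (assuming a standard NLTK install) is the command-line downloader:

```shell
# download just the stopwords corpus, without opening the GUI downloader
python -m nltk.downloader stopwords
```

After this, `stopwords.words('english')` should load without a LookupError.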

Faster way to remove stop words in Python

Try caching the stopwords object, as shown below. Constructing it each time you call the function seems to be the bottleneck.

```python
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")

def testFuncOld():
    text = "hello bye the the hi"
    text = " ".join([word for word in text.split() if word not in stopwords.words("english")])

def testFuncNew():
    text = "hello bye the the hi"
    text = " ".join([word for word in text.split() if word not in cachedStopWords])
```

… Read more
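A further speed-up worth noting: membership tests against a list are O(n), while a set is O(1), so converting the cached stopword list with `set(...)` helps on longer texts. A minimal self-contained sketch, using a small stand-in set rather than NLTK's full list:

```python
# stand-in for set(stopwords.words("english")); a set gives O(1) membership tests
cached_stop_words = {"the", "a", "an", "is"}

def remove_stop_words(text):
    # split, filter against the cached set, and rejoin
    return " ".join(word for word in text.split() if word not in cached_stop_words)

print(remove_stop_words("hello bye the the hi"))  # -> "hello bye hi"
```

The caching and the set conversion are independent optimizations; together they remove both the corpus reload and the linear scan from the inner loop.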

Add/remove custom stop words with spacy

Using spaCy 2.0.11, you can update its stopwords set using one of the following.

To add a single stopword:

```python
import spacy

nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")
```

To add several stopwords at once:

```python
import spacy

nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1", "my_new_stopword2"}
```

To remove a single stopword:

```python
import spacy

nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")
```

To remove several stopwords at … Read more
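Since `nlp.Defaults.stop_words` is a plain Python set, all of these updates are ordinary set operations. The sketch below mimics them on a stand-in set, so it runs without a spaCy model installed:

```python
# stand-in for nlp.Defaults.stop_words, which is a plain Python set
stop_words = {"a", "an", "the", "whatever"}

stop_words.add("my_new_stopword")                       # add one
stop_words |= {"my_new_stopword1", "my_new_stopword2"}  # add several
stop_words.remove("whatever")                           # remove one (KeyError if absent)
stop_words -= {"a", "an"}                               # remove several

print(sorted(stop_words))
```

Using `discard` instead of `remove` avoids the KeyError when the word may not be present.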

Stopword removal with NLTK

There is an in-built stopword list in NLTK made up of 2,400 stopwords for 11 languages (Porter et al.); see http://nltk.org/book/ch02.html

```python
>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.lower().split() if i not in stop])
```

… Read more
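The filtering pattern itself needs nothing NLTK-specific; with a stand-in stop set it looks like this (in the real code, `stop` comes from `set(stopwords.words('english'))`):

```python
# stand-in stop set; with NLTK this would be set(stopwords.words('english'))
stop = {"this", "is", "a"}

sentence = "This is a foo bar sentence"
# lowercase first so capitalized stopwords like "This" are still matched
tokens = [i for i in sentence.lower().split() if i not in stop]
print(tokens)  # -> ['foo', 'bar', 'sentence']
```

Wrapping the stopword list in `set(...)`, as the answer above does, keeps each membership test O(1).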