spaCy: Can’t find model ‘en_core_web_sm’ on Windows 10 and Python 3.5.3 :: Anaconda custom (64-bit)

Initially I downloaded two en packages by running the following commands in the Anaconda prompt:

    python -m spacy download en_core_web_lg
    python -m spacy download en_core_web_sm

However, I kept getting a linkage error. Running the command below finally established the link and resolved the error:

    python -m spacy download en

Also make sure to restart your runtime …

What does Keras Tokenizer method exactly do?

From the source code: fit_on_texts: Updates internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. So if you give it something like “The cat sat on the mat.”, it will create a dictionary such that word_index[“the”] = 1; word_index[“cat”] = 2. It is a word -> index dictionary …
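To make the frequency-based indexing concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the function name build_word_index and the simplified cleaning (lowercase, strip punctuation, split on whitespace) are assumptions made for illustration only.

```python
from collections import Counter
import string


def build_word_index(texts):
    """Map each word to a rank, where 1 is the most frequent word."""
    counts = Counter()
    # Hypothetical, simplified cleaning: remove ASCII punctuation entirely.
    remove_punct = str.maketrans("", "", string.punctuation)
    for text in texts:
        counts.update(text.lower().translate(remove_punct).split())
    # most_common() sorts by count; equal counts keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}


word_index = build_word_index(["The cat sat on the mat."])
print(word_index["the"], word_index["cat"])  # prints: 1 2
```

“the” occurs twice, so it gets index 1; the remaining words occur once each and are numbered in the order they first appear.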

Understanding min_df and max_df in scikit CountVectorizer

max_df is used for removing terms that appear too frequently, also known as “corpus-specific stop words”. For example: max_df = 0.50 means “ignore terms that appear in more than 50% of the documents”. max_df = 25 means “ignore terms that appear in more than 25 documents”. The default max_df is 1.0, which means “ignore terms …
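The two readings of max_df (a float as a proportion of documents, an integer as an absolute count) can be sketched in plain Python. This is only an illustration of the thresholding rule described above, not scikit-learn's code; the helper name terms_within_max_df and the whitespace tokenization are assumptions.

```python
from collections import Counter


def terms_within_max_df(docs, max_df):
    """Return the terms whose document frequency does not exceed max_df.

    A float max_df is read as a proportion of the documents,
    an int as an absolute number of documents.
    """
    doc_freq = Counter()
    for doc in docs:
        # Use a set so each term counts at most once per document.
        doc_freq.update(set(doc.lower().split()))
    limit = max_df * len(docs) if isinstance(max_df, float) else max_df
    return sorted(term for term, df in doc_freq.items() if df <= limit)


docs = ["the cat sat", "the dog ran", "the bird flew", "a cat ran"]
kept_half = terms_within_max_df(docs, 0.50)  # drop terms in more than 50% of docs
kept_two = terms_within_max_df(docs, 2)      # drop terms in more than 2 docs
print(kept_half)
```

Here “the” appears in 3 of the 4 documents, so both thresholds drop it, while “cat” and “ran” (2 documents each) are kept.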

How do you implement a “Did you mean”? [duplicate]

Actually, what Google does is very much non-trivial and, at first, counter-intuitive. They don’t do anything like checking against a dictionary; rather, they use statistics to identify “similar” queries that returned more results than your query. The exact algorithm is of course not known. There are different sub-problems to solve here, …

Java or Python for Natural Language Processing [closed]

Java vs Python for NLP is very much a matter of preference or necessity. Depending on the company or project, you’ll need to use one or the other, and often there isn’t much of a choice unless you’re heading a project. Besides NLTK (www.nltk.org), there are other libraries for text processing in Python:

TextBlob: http://textblob.readthedocs.org/en/dev/
Gensim: http://radimrehurek.com/gensim/
…

Difference between constituency parser and dependency parser

A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled. For a simple sentence “John sees Bill”, a constituency parse would be:

                Sentence
                    |
      +-------------+------------+
      |                          |
 Noun Phrase                Verb Phrase
      |                          |
    John                 +-------+--------+
                         …
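As a rough sketch, a constituency tree like this can be held in nested tuples, with phrase types as non-terminals and words as leaves. The excerpt does not show what sits under the Verb Phrase, so the split into a Verb and a Noun Phrase below is an assumption, as are the helper names terminals and phrase_labels.

```python
# A phrase is (label, child, child, ...); a word is a plain string.
tree = (
    "Sentence",
    ("Noun Phrase", "John"),
    # Assumed expansion of the verb phrase; not taken from the excerpt.
    ("Verb Phrase", ("Verb", "sees"), ("Noun Phrase", "Bill")),
)


def terminals(node):
    """Collect the words (leaves) of the tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(terminals(child))
    return words


def phrase_labels(node):
    """Collect the non-terminal labels in depth-first order."""
    if isinstance(node, str):
        return []
    labels = [node[0]]
    for child in node[1:]:
        labels.extend(phrase_labels(child))
    return labels


print(terminals(tree))      # ['John', 'sees', 'Bill']
print(phrase_labels(tree))
```

Note that the edges carry no labels in this representation; only the nodes do, which matches the description of a constituency parse above.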

How does Apple find dates, times and addresses in emails?

They likely use Information Extraction techniques for this. Here is a demo of Stanford’s SUTime tool: http://nlp.stanford.edu:8080/sutime/process

You would extract attributes about n-grams (consecutive words) in a document:

numberOfLetters
numberOfSymbols
length
previousWord
nextWord
nextWordNumberOfSymbols
…

And then use a classification algorithm, and feed it positive and negative examples:

Observation  nLetters  nSymbols  length  prevWord  nextWord  isPartOfDate
…
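The attribute extraction step could look roughly like the sketch below. The attribute names come from the list above, but how each one is computed (for example, treating any character that is neither alphanumeric nor whitespace as a symbol) is my assumption, and ngram_features and count_symbols are hypothetical helper names. Nothing here reflects what Apple actually does.

```python
def count_symbols(text):
    """Count characters that are neither alphanumeric nor whitespace."""
    return sum(1 for ch in text if not ch.isalnum() and not ch.isspace())


def ngram_features(tokens, start, n):
    """Build a feature dict for the n-gram tokens[start:start + n]."""
    ngram = " ".join(tokens[start:start + n])
    prev_word = tokens[start - 1] if start > 0 else None
    next_word = tokens[start + n] if start + n < len(tokens) else None
    return {
        "numberOfLetters": sum(1 for ch in ngram if ch.isalpha()),
        "numberOfSymbols": count_symbols(ngram),
        "length": len(ngram),
        "previousWord": prev_word,
        "nextWord": next_word,
        "nextWordNumberOfSymbols": count_symbols(next_word) if next_word else 0,
    }


tokens = "Meet me on 12/05/2014 at noon".split()
features = ngram_features(tokens, 3, 1)  # the 1-gram "12/05/2014"
print(features)
```

Each such dictionary, paired with a yes/no label such as isPartOfDate, would form one training row for the classifier.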

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you, so I am going to cover different topics; bear with me ;).

Class weights

The weights from the class_weight parameter are used to train the classifier. They are not used …