nlp – Page 14 – Tarik Billa

spaCy: Can’t find model ‘en_core_web_sm’ on Windows 10 and Python 3.5.3 :: Anaconda custom (64-bit)

December 30, 2022 by Tarik

Initially I downloaded two en packages by running the following commands in the Anaconda prompt:

    python -m spacy download en_core_web_lg
    python -m spacy download en_core_web_sm

But I kept getting a linkage error; running the command below finally established the link and resolved the error:

    python -m spacy download en

Also make sure to restart your runtime … Read more
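Before retrying the download, it can help to check whether the model package is visible to the interpreter you are actually running. The sketch below is only an illustration: it relies on the fact that spaCy models are installed as ordinary Python packages, and the helper name model_installed is hypothetical, not part of spaCy. Note also that the `en` shortcut link belongs to spaCy v2; in v3 the full package name is used instead.

```python
import importlib.util


def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name is importable."""
    # find_spec returns None when no top-level package by that name is found.
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two model packages mentioned above.
for name in ("en_core_web_sm", "en_core_web_lg"):
    status = "installed" if model_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a model shows as missing even though the download succeeded, the download probably went into a different environment than the one running this check.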

How do I do word Stemming or Lemmatization?

December 27, 2022 by Tarik

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet. Note that if you are using this lemmatizer for the first time, you must download the corpus before using it. This can be done by:

    >>> import nltk
    >>> nltk.download('wordnet')

You only have to do … Read more
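To make the stemming versus lemmatization distinction concrete, here is a toy sketch in plain Python. It is not NLTK's WordNet lemmatizer: the suffix list, the lookup table, and the function names toy_stem and toy_lemmatize are all made up for illustration. A stemmer chops suffixes by rule, while a lemmatizer maps a word to a dictionary form.

```python
# Hypothetical suffix list, checked in order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lookup table standing in for a real lexical resource.
LEMMAS = {"mice": "mouse", "ran": "run", "better": "good"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the table; fall back to the word itself."""
    return LEMMAS.get(word, word)


print(toy_stem("jumping"), toy_stem("mice"))            # rule-based stripping
print(toy_lemmatize("mice"), toy_lemmatize("jumping"))  # dictionary lookup
```

The stemmer handles regular inflection but leaves "mice" untouched, while the lookup handles irregular forms but only for words it knows. A real lemmatizer combines a large lexicon with morphological rules.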

What does Keras Tokenizer method exactly do?

December 26, 2022 by Tarik

From the source code: fit_on_texts updates the internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. So if you give it something like “The cat sat on the mat.”, it will create a dictionary such that word_index["the"] = 1; word_index["cat"] = 2. It is a word -> index dictionary … Read more
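The frequency-based indexing described here can be imitated in a few lines of plain Python. This is a simplified sketch, not Keras's actual implementation: the function name build_word_index and the regular expression used for tokenizing are assumptions, and the real Tokenizer has more options (filters, num_words, out-of-vocabulary tokens).

```python
import re
from collections import Counter


def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep runs of letters, digits and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # sorted() is stable, so ties keep their order of first appearance.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _count) in enumerate(ranked, start=1)}


word_index = build_word_index(["The cat sat on the mat."])
print(word_index)
```

Index 0 is deliberately left unused, which matches the 1-based numbering in the example above.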

Understanding min_df and max_df in scikit CountVectorizer

December 25, 2022 by Tarik

max_df is used for removing terms that appear too frequently, also known as “corpus-specific stop words”. For example: max_df = 0.50 means “ignore terms that appear in more than 50% of the documents”. max_df = 25 means “ignore terms that appear in more than 25 documents”. The default max_df is 1.0, which means “ignore terms … Read more
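The float-versus-integer distinction can be sketched without scikit-learn. The function filter_terms below is a hypothetical stand-in, not CountVectorizer itself: it splits on whitespace rather than using scikit-learn's tokenizer, but it applies the same rule that a float is a proportion of documents and an integer is an absolute document count.

```python
from collections import Counter


def filter_terms(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # set() so a term counts once per document, however often it occurs.
        doc_freq.update(set(doc.lower().split()))
    # Floats are proportions of the corpus; integers are absolute counts.
    min_count = min_df if isinstance(min_df, int) else min_df * n_docs
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs
    return {term for term, df in doc_freq.items() if min_count <= df <= max_count}


docs = ["apple banana", "apple cherry", "apple date", "banana"]
print(sorted(filter_terms(docs, max_df=0.5)))  # drops terms in more than 50% of docs
print(sorted(filter_terms(docs, min_df=2)))    # drops terms in fewer than 2 docs
```

Note the difference between max_df=1.0 (keep everything, since no term can appear in more than 100% of documents) and max_df=1 (keep only terms that appear in a single document).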

How do you implement a “Did you mean”? [duplicate]

December 24, 2022 by Tarik

Actually, what Google does is decidedly non-trivial and at first counter-intuitive. They don’t do anything like checking against a dictionary; rather, they use statistics to identify “similar” queries that returned more results than your query. The exact algorithm is of course not known. There are different sub-problems to solve here, … Read more
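The core idea, suggesting a similar query that is far more popular than the one typed, can be illustrated with the standard library. This is emphatically not Google's algorithm (which, as noted, is not public): the function suggest, the toy query log, and the similarity cutoff are all assumptions made for the sketch, and string similarity via difflib stands in for whatever notion of "similar" a real system would use.

```python
import difflib


def suggest(query, query_counts, cutoff=0.8):
    """Return a similar but more frequent query, or None if there is none."""
    candidates = difflib.get_close_matches(
        query, list(query_counts), n=5, cutoff=cutoff
    )
    if not candidates:
        return None
    best = max(candidates, key=lambda candidate: query_counts[candidate])
    # Only suggest when the alternative is strictly more popular.
    if best != query and query_counts[best] > query_counts.get(query, 0):
        return best
    return None


# Hypothetical query log: query text -> number of times it was seen.
log = {"britney spears": 1000, "brittany spears": 40, "britny spears": 5}
print(suggest("britny spears", log))
print(suggest("britney spears", log))
```

No dictionary is consulted anywhere; the "correct" spelling emerges purely from what most people typed.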

Java or Python for Natural Language Processing [closed]

December 24, 2022 by Tarik

Java vs Python for NLP is very much a matter of preference or necessity. Depending on the company/projects you’ll need to use one or the other and often there isn’t much of a choice unless you’re heading a project. Other than NLTK (www.nltk.org), there are actually other libraries for text processing in Python: TextBlob: http://textblob.readthedocs.org/en/dev/ Gensim: http://radimrehurek.com/gensim/ … Read more

word2vec: negative sampling (in layman’s terms)?

December 24, 2022 by Tarik

The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have v_c … Read more
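As a rough numerical illustration of "maximise similarity for nearby words, minimise it for others", the sketch below computes the standard negative-sampling loss for one word pair: -log σ(v_c · v_w) - Σ log σ(-v_c · v_n). The vectors are toy values, and the names v_w (a word seen near v_c) and negatives (randomly sampled words) are labels chosen for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(v_c, v_w, negatives):
    """Loss is low when v_c aligns with v_w and opposes every negative."""
    # Positive pair: reward a large dot product.
    loss = -math.log(sigmoid(dot(v_c, v_w)))
    # Negative samples: reward a small (negative) dot product.
    for v_n in negatives:
        loss -= math.log(sigmoid(-dot(v_c, v_n)))
    return loss


v_c = [1.0, 0.0]
good = negative_sampling_loss(v_c, [1.0, 0.0], [[-1.0, 0.0]])
bad = negative_sampling_loss(v_c, [-1.0, 0.0], [[1.0, 0.0]])
print(good < bad)
```

Only a handful of negatives are scored per pair, which is what makes this cheaper than normalising over the whole vocabulary.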

Difference between constituency parser and dependency parser

December 18, 2022 by Tarik

A constituency parse tree breaks a text into sub-phrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled. For a simple sentence “John sees Bill”, a constituency parse would be:

                 Sentence
                     |
       +-------------+------------+
       |                          |
  Noun Phrase                Verb Phrase
       |                          |
     John                 +-------+--------+

… Read more
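One way to see the structural difference is to write both analyses of “John sees Bill” as plain data. This is an illustrative sketch only: the expansion of the verb phrase and the relation names "subject" and "object" are assumptions (real parsers use their own tag sets). The constituency parse is a nested tree with words at the leaves; the dependency parse is a flat list of labeled head-to-dependent edges between the words themselves.

```python
# Constituency: (phrase label, children); leaves are the words.
constituency = (
    "Sentence",
    [
        ("Noun Phrase", ["John"]),
        ("Verb Phrase", [("Verb", ["sees"]), ("Noun Phrase", ["Bill"])]),
    ],
)

# Dependency: (head word, relation label, dependent word).
dependency = [
    ("sees", "subject", "John"),
    ("sees", "object", "Bill"),
]


def leaves(node):
    """Collect the terminal words of a constituency tree, left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


print(leaves(constituency))
print([f"{head} -{rel}-> {dep}" for head, rel, dep in dependency])
```

The constituency tree introduces extra nodes for phrases, whereas the dependency structure has exactly one node per word and puts the information on the edge labels.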

How does Apple find dates, times and addresses in emails?

December 11, 2022 by Tarik

They likely use Information Extraction techniques for this. Here is a demo of Stanford’s SUTime tool: http://nlp.stanford.edu:8080/sutime/process

You would extract attributes about n-grams (consecutive words) in a document: numberOfLetters, numberOfSymbols, length, previousWord, nextWord, nextWordNumberOfSymbols, …

And then use a classification algorithm, and feed it positive and negative examples:

Observation | nLetters | nSymbols | length | prevWord | nextWord | isPartOfDate … Read more
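The attribute extraction step might look like the sketch below. It is a simplification under stated assumptions: it works on single tokens rather than longer n-grams, "symbol" is taken to mean any non-alphanumeric character, and the function name token_features is hypothetical. The resulting dictionaries are the rows one would then hand to a classifier along with an isPartOfDate label.

```python
def count_symbols(text):
    """Count characters that are neither letters nor digits."""
    return sum(1 for ch in text if not ch.isalnum())


def token_features(tokens, i):
    """Build the attribute row for the token at position i."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else ""
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "numberOfLetters": sum(1 for ch in word if ch.isalpha()),
        "numberOfSymbols": count_symbols(word),
        "length": len(word),
        "previousWord": prev_word,
        "nextWord": next_word,
        "nextWordNumberOfSymbols": count_symbols(next_word),
    }


tokens = "Meet on 12/05/2022 at noon".split()
print(token_features(tokens, 2))
```

Context features such as previousWord matter because words like "on" or "at" are strong hints that a date or time follows.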

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

December 2, 2022 by Tarik

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;). Class weights The weights from the class_weight parameter are used to train the classifier. They are not used … Read more
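Separately from training-time class weights, multiclass metrics need a rule for combining per-class scores into one number. The sketch below is a plain-Python illustration, not scikit-learn's implementation: the function names are made up, though "macro" and "weighted" mirror the names scikit-learn uses for its average parameter. It computes precision, recall and F1 per class, then averages F1 two ways.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall, F1 and support for every label."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {
            "precision": precision, "recall": recall,
            "f1": f1, "support": tp + fn,
        }
    return scores


def average_f1(scores, average="macro"):
    """'macro' treats classes equally; 'weighted' weights by support."""
    if average == "macro":
        return sum(s["f1"] for s in scores.values()) / len(scores)
    if average == "weighted":
        total = sum(s["support"] for s in scores.values())
        return sum(s["f1"] * s["support"] for s in scores.values()) / total
    raise ValueError(f"unknown average: {average}")


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
scores = per_class_scores(y_true, y_pred)
print(average_f1(scores, "macro"), average_f1(scores, "weighted"))
```

With imbalanced classes the two averages diverge: macro lets a rare class count as much as a common one, while weighted lets the common classes dominate.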