nlp – Page 2 – Tarik Billa

What do the BILOU tags mean in Named Entity Recognition?

April 9, 2023 by Tarik

Based on an issue and a patch in ClearTK, it seems that BILOU stands for “Beginning, Inside and Last tokens of multi-token chunks, Unit-length chunks and Outside”. For instance, the chunking denoted by brackets (foo foo foo) (bar) no no no (bar bar) can be encoded with BILOU as B-foo, I-foo, L-foo, … Read more
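
The scheme can be sketched in a few lines of Python. This is only an illustration of the definition quoted above, not ClearTK code: the function name `bilou_encode`, the span format (start, exclusive end, label) and the sample sentence with its LOC/PER labels are all assumptions made for the example.

```python
def bilou_encode(n_tokens, spans):
    """Turn chunk spans into BILOU tags.

    spans: list of (start, end_exclusive, label) tuples (a hypothetical format).
    """
    tags = ["O"] * n_tokens                      # Outside by default
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"           # Unit-length chunk
        else:
            tags[start] = f"B-{label}"           # Beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"           # Inside
            tags[end - 1] = f"L-{label}"         # Last
    return tags


# Made-up sentence and labels, for illustration only.
tokens = ["New", "York", "City", "welcomed", "Alice"]
spans = [(0, 3, "LOC"), (4, 5, "PER")]
tags = bilou_encode(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Compared with plain BIO tagging, the extra L and U tags mark chunk ends and single-token chunks explicitly.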

Training data for sentiment analysis [closed]

April 9, 2023 by Tarik

http://www.cs.cornell.edu/home/llee/data/
http://mpqa.cs.pitt.edu/corpora/mpqa_corpus

You can use Twitter, with its smileys, like this: http://web.archive.org/web/20111119181304/http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf

Hope that gets you started. There’s more in the literature, if you’re interested in specific subtasks like negation, sentiment scope, etc. To get a focus on companies, you might pair a method with topic detection, or cheaply just a lot of mentions of … Read more

SpaCy OSError: Can’t find model ‘en’

March 19, 2023 by Tarik

Finally cleared the error! The best way to install now:

    pip install -U pip setuptools wheel
    pip install -U spacy
    python -m spacy download en_core_web_sm

Always open Anaconda Prompt / Command Prompt with admin rights to avoid linking errors! I tried multiple options, including:

    python -m spacy download en
    conda install -c conda-forge spacy
    python … Read more

What is CoNLL data format?

March 8, 2023 by Tarik

There are many different CoNLL formats, since CoNLL is a different shared task each year. The format for CoNLL 2009 is described here. Each line represents a single word with a series of tab-separated fields. Underscores (_) indicate empty values. Mate-Parser’s manual says that it uses the first 12 columns of CoNLL 2009: ID FORM LEMMA … Read more
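
As a rough sketch of how such a file can be read: the reader below splits each line on tabs and maps "_" to None. The function name `parse_conll`, the blank-line sentence separator, and the three-column sample are assumptions for illustration; pass whatever column names your CoNLL variant actually defines.

```python
def parse_conll(text, columns):
    """Parse tab-separated CoNLL-style text into a list of sentences.

    Each sentence is a list of dicts keyed by the given column names.
    Assumes sentences are separated by blank lines and "_" marks an empty value.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():                 # blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        row = {
            name: (None if value == "_" else value)
            for name, value in zip(columns, fields)
        }
        current.append(row)
    if current:                              # last sentence without trailing blank line
        sentences.append(current)
    return sentences


# Made-up sample using only the first three columns named above.
sample = "1\tDogs\tdog\n2\tbark\t_\n\n1\tHi\thi\n"
sentences = parse_conll(sample, ("ID", "FORM", "LEMMA"))
print(sentences)
```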

How to use Bert for long text classification?

February 24, 2023 by Tarik

You have basically three options:

1. Cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
2. Split your text into multiple subtexts, classify each of them, and combine the results … Read more
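
The first two options can be sketched without any model code. The helper names (`truncate`, `split_into_chunks`, `average_scores`), the overlap of 50 tokens, and averaging as the way to combine results are all assumptions for this sketch; the per-chunk scores would come from whatever classifier you run on each chunk.

```python
def truncate(token_ids, max_len=512):
    """Option 1: keep only the first max_len tokens."""
    return token_ids[:max_len]


def split_into_chunks(token_ids, max_len=512, overlap=50):
    """Option 2, step 1: split into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens (an assumed design choice).
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):   # this window reached the end
            break
    return chunks


def average_scores(chunk_scores):
    """Option 2, step 2: combine per-chunk class scores by averaging."""
    n = len(chunk_scores)
    return [sum(column) / n for column in zip(*chunk_scores)]


token_ids = list(range(1200))                   # stand-in for real token ids
chunks = split_into_chunks(token_ids)
print([len(chunk) for chunk in chunks])

# Hypothetical per-chunk scores for a two-class problem.
combined = average_scores([[0.25, 0.75], [0.75, 0.25]])
print(combined)
```

Averaging is only one way to combine chunk results; taking the maximum or a majority vote are equally simple alternatives.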

How to extract common / significant phrases from a series of text entries

February 22, 2023 by Tarik

I suspect you don’t just want the most common phrases, but rather you want the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases. To do this, you’ll essentially want to extract n-grams from your data and then find the … Read more
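
A minimal sketch of that idea, assuming pointwise mutual information (PMI) as the "interestingness" score; other association measures would slot in the same way. The function name `bigram_pmi`, the `min_count` threshold, and the sample text are made up for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Rank bigrams by pointwise mutual information, highest first.

    Bigrams seen fewer than min_count times are skipped, since PMI
    overrates rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text = "new york is big and new york is old and the sky is blue"
ranked = bigram_pmi(text.split())
for pair, score in ranked:
    print(pair, round(score, 2))
```

A pair made of two frequent words needs to co-occur far more often than chance to score well, which is what pushes plain common-word phrases down the ranking.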

Stemmers vs Lemmatizers

February 5, 2023 by Tarik

Q1: “[..] are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English” Yes. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications, their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction … Read more
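
To make the "much simpler" point concrete, here is a deliberately crude suffix-stripping stemmer. It is not the Porter algorithm or any library's stemmer; the suffix list and the `min_stem` threshold are assumptions picked for the example. It needs no dictionary and no part-of-speech information, which is exactly why stemmers are small and fast.

```python
# Longer suffixes first, so "ingly" is tried before "ly".
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["jumping", "jumped", "jumps", "jump"]
stems = {crude_stem(word) for word in words}
print(stems)
```

Four surface forms collapse into one feature, which is the kind of vocabulary reduction a bag-of-words model benefits from, even though the stems are not always real words.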

How do I do word Stemming or Lemmatization?

December 27, 2022 by Tarik

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet. Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:

>>> import nltk
>>> nltk.download('wordnet')

You only have to do … Read more
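
Once the corpus is available, usage looks roughly like the sketch below. `WordNetLemmatizer` and its `lemmatize(word, pos=...)` method are part of NLTK; the wrapper function `lemmatize` and its fallback of returning None when NLTK or the WordNet data is missing are assumptions added here so the snippet runs anywhere.

```python
def lemmatize(word, pos="n"):
    """Lemmatize with NLTK's WordNet lemmatizer.

    Returns None if NLTK is not installed or the WordNet corpus
    has not been downloaded yet (hypothetical fallback behaviour).
    """
    try:
        from nltk.stem import WordNetLemmatizer
        return WordNetLemmatizer().lemmatize(word, pos=pos)
    except (ImportError, LookupError):
        return None


noun = lemmatize("dogs")                 # default part of speech is noun
verb = lemmatize("running", pos="v")     # pass pos="v" to treat the word as a verb
print(noun, verb)
```

The part-of-speech argument matters: without it, every word is treated as a noun.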

How do you implement a “Did you mean”? [duplicate]

December 24, 2022 by Tarik

Actually, what Google does is very much non-trivial and, at first, counter-intuitive. They don’t do anything like check against a dictionary; rather, they make use of statistics to identify “similar” queries that returned more results than your query. The exact algorithm is of course not known. There are different sub-problems to solve here, … Read more
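
A toy version of the statistical idea, and certainly not Google's actual algorithm: look for similar strings in a log of past queries and suggest one only if it is more popular than what was typed. The function name `did_you_mean`, the use of `difflib` string similarity, the 0.8 cutoff, and the query counts are all assumptions for this sketch.

```python
import difflib


def did_you_mean(query, query_counts, cutoff=0.8):
    """Suggest a similar, more popular query, or None if there is none.

    query_counts: dict mapping past queries to how often they were seen.
    """
    candidates = difflib.get_close_matches(
        query, list(query_counts), n=5, cutoff=cutoff
    )
    own_count = query_counts.get(query, 0)
    better = [
        candidate for candidate in candidates
        if candidate != query and query_counts[candidate] > own_count
    ]
    return max(better, key=query_counts.get) if better else None


# Made-up query log.
query_counts = {
    "britney spears": 1000,
    "brittany spears": 40,
    "britny spears": 5,
}
suggestion = did_you_mean("britny spears", query_counts)
print(suggestion)
```

No dictionary is involved: a misspelling is detected only because a nearby query is far more common.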

Detecting syllables in a word

November 19, 2022 by Tarik

Read about the TeX approach to this problem for the purposes of hyphenation. See especially Frank Liang’s thesis, Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate and includes a small exceptions dictionary for cases where the algorithm does not work.
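
The core of the pattern approach fits in a short sketch. Patterns are letter strings with digits between the letters; every pattern that occurs in the word contributes its digits, the highest digit at each position wins, and an odd result marks a permitted break. The handful of patterns and the single exception below are a tiny illustrative set chosen for one word, not the full TeX pattern file, and the sketch ignores TeX's minimum fragment lengths.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and inter-letter digits."""
    letters = re.sub(r"\d", "", pattern)
    points = [int(digits or 0) for digits in re.split(r"[.a-z]", pattern)]
    return letters, points


def hyphenate(word, patterns, exceptions=None):
    """Return the word split into pieces at the permitted break points."""
    exceptions = exceptions or {}
    if word.lower() in exceptions:            # exceptions dictionary wins
        return exceptions[word.lower()].split("-")
    work = "." + word.lower() + "."           # dots mark the word boundaries
    points = [0] * (len(work) + 1)            # points[k] sits before work[k]
    for i in range(len(work)):
        for j in range(i + 1, len(work) + 1):
            pattern_points = patterns.get(work[i:j])
            if pattern_points:
                for k, value in enumerate(pattern_points):
                    points[i + k] = max(points[i + k], value)
    pieces, start = [], 1
    for k in range(2, len(work) - 1):         # only positions inside the word
        if points[k] % 2 == 1:                # odd value: break allowed here
            pieces.append(work[start:k])
            start = k
    pieces.append(work[start:-1])
    return pieces


# Illustrative pattern subset and exception (assumptions for this example).
RAW_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
PATTERNS = dict(parse_pattern(raw) for raw in RAW_PATTERNS)
EXCEPTIONS = {"present": "pres-ent"}

pieces = hyphenate("hyphenation", PATTERNS, EXCEPTIONS)
print("-".join(pieces))
```

Hyphenation points are not always syllable boundaries, but they are a close and cheap approximation for syllable counting.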