Training data for sentiment analysis [closed]

http://www.cs.cornell.edu/home/llee/data/
http://mpqa.cs.pitt.edu/corpora/mpqa_corpus

You can use Twitter, with its smileys, like this: http://web.archive.org/web/20111119181304/http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf

Hope that gets you started. There’s more in the literature, if you’re interested in specific subtasks like negation, sentiment scope, etc. To get a focus on companies, you might pair a method with topic detection, or cheaply just a lot of mentions of …
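
As a rough illustration of the smiley idea, here is a minimal sketch that uses emoticons as noisy labels. The emoticon lists and the function name `label_by_emoticon` are assumptions made for this example, not taken from the linked paper.

```python
# Emoticon lists are illustrative assumptions; extend them for real data.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", ":'(")


def label_by_emoticon(tweet):
    """Return (text_without_emoticons, label), or None if the signal is unclear."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all or mixed signals: skip the tweet.
        return None
    label = "positive" if has_pos else "negative"
    text = tweet
    for emoticon in POSITIVE + NEGATIVE:
        # Remove the emoticon so a classifier cannot simply memorize it.
        text = text.replace(emoticon, "")
    return " ".join(text.split()), label


if __name__ == "__main__":
    for t in ["Loving the new phone :)", "Battery died again :(", "meh"]:
        print(label_by_emoticon(t))
```

The labels obtained this way are noisy, so they are better suited to bootstrapping a training set than to evaluation.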

SpaCy OSError: Can’t find model ‘en’

FINALLY CLEARED THE ERROR!!! Best way to install now:

    pip install -U pip setuptools wheel
    pip install -U spacy
    python -m spacy download en_core_web_sm

Always open Anaconda Prompt / Command Prompt with admin rights to avoid linking errors!!!

Tried multiple options, including:

    python -m spacy download en
    conda install -c conda-forge spacy
    python …
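
To check that the install worked, something like the following may help. It is only a sketch: the helper names `model_installed` and `load_english_model` are made up for this example, and it assumes the model was installed as the pip package `en_core_web_sm`, which is loaded by its full name.

```python
import importlib.util


def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def load_english_model(package_name: str = "en_core_web_sm"):
    """Load the spaCy model by its full package name, or return None if missing."""
    if not model_installed("spacy") or not model_installed(package_name):
        return None
    import spacy  # imported lazily so the check above works without spaCy

    return spacy.load(package_name)


if __name__ == "__main__":
    nlp = load_english_model()
    if nlp is None:
        print("Model not found; run: python -m spacy download en_core_web_sm")
    else:
        print([token.text for token in nlp("The error is finally cleared.")])
```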

How to use Bert for long text classification?

You have basically three options: You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient. You can split your text into multiple subtexts, classify each of them and combine the results …
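
The splitting option can be sketched roughly as below. This is not tied to any particular BERT library: it works on plain lists of token ids, the window size of 510 assumes two positions are reserved for the [CLS] and [SEP] tokens, and averaging the per-window scores is just one assumed way to combine the results.

```python
from typing import List, Sequence


def split_into_windows(token_ids: Sequence[int],
                       max_len: int = 510,
                       stride: int = 255) -> List[List[int]]:
    """Split token ids into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the text
        start += stride
    return windows


def average_scores(per_window_scores: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-window class scores by taking the mean for each class."""
    n = len(per_window_scores)
    return [sum(column) / n for column in zip(*per_window_scores)]


if __name__ == "__main__":
    windows = split_into_windows(list(range(1200)))
    print([len(w) for w in windows])
    # The scores would come from running the classifier on each window.
    print(average_scores([[0.25, 0.75], [0.75, 0.25]]))
```

Overlapping windows reduce the chance that a decisive phrase is cut in half at a window boundary, at the cost of more forward passes.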

How to extract common / significant phrases from a series of text entries

I suspect you don’t just want the most common phrases, but rather you want the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases. To do this, you’ll essentially want to extract n-grams from your data and then find the …
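
One way to rank n-grams by how "interesting" they are is pointwise mutual information (PMI). The excerpt is cut off before it names a measure, so PMI is an assumption here, as are the function name `bigram_pmi`, the whitespace tokenization, and the `min_count` threshold.

```python
import math
from collections import Counter


def bigram_pmi(texts, min_count=2):
    """Rank bigrams by PMI, ignoring bigrams seen fewer than min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive tokenization, for illustration
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare bigrams get unreliable, inflated PMI values
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    entries = [
        "the new york office is big",
        "the new york team is here",
        "the cat is big",
        "the dog is here",
    ]
    for bigram, score in bigram_pmi(entries):
        print(bigram, round(score, 2))
```

Bigrams built from frequent words such as "the" score lower than pairs whose words mostly occur together, which is the effect described above.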

Stemmers vs Lemmatizers

Q1: “[..] are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English” Yes. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications, their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction …
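
To give a feel for how little machinery a stemmer needs, here is a deliberately crude suffix stripper. It is a toy written for this example, not the Porter algorithm or any library's stemmer, and the suffix list is an assumption.

```python
# Longest suffixes first so "ingly" is tried before "ly".
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


if __name__ == "__main__":
    words = ["connect", "connected", "connecting", "connects"]
    # Four surface forms collapse into a single vocabulary entry.
    print({crude_stem(w) for w in words})
```

No dictionary or part-of-speech tagging is involved, which is where the size and speed advantage over a lemmatizer comes from.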

How do you implement a “Did you mean”? [duplicate]

Actually what Google does is very much non-trivial and also at first counter-intuitive. They don’t do anything like check against a dictionary, but rather they make use of statistics to identify “similar” queries that returned more results than your query. The exact algorithm is of course not known. There are different sub-problems to solve here, …
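
The general idea of suggesting a similar query that performed better can be sketched as follows. This is not Google's algorithm, which is not public: the query log `query_counts`, the function name `did_you_mean`, and the similarity cutoff are all assumptions, and string similarity comes from the standard library's `difflib`.

```python
import difflib


def did_you_mean(query, query_counts, cutoff=0.8):
    """Suggest a similar logged query that has more results than `query`.

    query_counts maps previously seen queries to a result (or usage) count.
    Returns None when no similar query does better than the original.
    """
    candidates = difflib.get_close_matches(
        query, list(query_counts), n=5, cutoff=cutoff
    )
    own_count = query_counts.get(query, 0)
    better = [c for c in candidates
              if c != query and query_counts[c] > own_count]
    if not better:
        return None
    # Among similar queries, prefer the one with the most results.
    return max(better, key=query_counts.get)


if __name__ == "__main__":
    log = {
        "sentiment analysis": 900,
        "sentimental analysis": 40,
        "segment analysis": 15,
    }
    print(did_you_mean("sentimnet analysis", log))
    print(did_you_mean("sentiment analysis", log))
```

Because the suggestion comes from observed queries and their counts, no dictionary is needed, and names or jargon that users actually search for are handled naturally.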

Detecting syllables in a word

Read about the TeX approach to this problem for the purposes of hyphenation. Especially see Frank Liang’s thesis, Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it includes a small exceptions dictionary for cases where the algorithm does not work.
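
A minimal sketch of the pattern-matching idea behind this approach is shown below. The handful of patterns is an illustrative subset written in TeX's digit notation, chosen to cover one word; a real pattern file has thousands of entries plus the exceptions dictionary. Note that hyphenation points only approximate syllable boundaries.

```python
import re

# Illustrative subset only: digits mark gaps, odd values allow a break,
# even values forbid one, and the highest value at each gap wins.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split 'hy3ph' into letters 'hyph' and gap values [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    position = 0
    for ch in pattern:
        if ch.isdigit():
            values[position] = int(ch)
        else:
            position += 1
    return letters, values


def hyphenate(word, patterns=PATTERNS):
    """Return the pieces of `word` split at gaps whose final value is odd."""
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    gaps = [0] * (len(padded) + 1)     # gaps[k] is the gap before padded[k]
    for letters, values in map(parse_pattern, patterns):
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                gaps[start + offset] = max(gaps[start + offset], value)
            start = padded.find(letters, start + 1)

    pieces, last = [], 0
    # Only gaps strictly inside the word can become break points.
    for k in range(2, len(padded) - 1):
        if gaps[k] % 2 == 1:
            pieces.append(word[last:k - 1])
            last = k - 1
    pieces.append(word[last:])
    return pieces


if __name__ == "__main__":
    print("-".join(hyphenate("hyphenation")))
```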
