How to get rid of punctuation using NLTK tokenizer?

Take a look at the other tokenizing options that NLTK provides. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
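NLTK's RegexpTokenizer is essentially a thin wrapper around Python's regular-expression machinery, so the same result can be reproduced with the standard library alone, which is handy for sanity-checking the pattern:

```python
import re

# \w+ matches runs of alphanumeric characters (and underscores),
# so punctuation is simply never captured as part of a token.
text = 'Eighty-seven miles to go, yet. Onward!'
tokens = re.findall(r'\w+', text)
print(tokens)  # ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
```

One caveat: \w+ also splits on apostrophes and hyphens, so "don't" becomes ['don', 't'] and "Eighty-seven" becomes two tokens, as seen above.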

Detecting syllables in a word

Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's PhD thesis, Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it includes a small exception dictionary for the cases where the patterns do not work.
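Liang's method applies a dictionary of patterns with interleaved digits (e.g. hy3ph): for every inter-letter position, the maximum digit across all matching patterns is kept, and odd values mark allowed break points. A minimal sketch, using a hypothetical three-pattern dictionary (the real TeX tables contain thousands of patterns and also enforce minimum fragment lengths at word edges):

```python
import re

def parse_pattern(pat):
    """Split a Liang pattern like 'hy3ph' into its letters ('hyph')
    and the digit weights between them ([0, 0, 3, 0, 0])."""
    letters = re.sub(r'\d', '', pat)
    digits = [0] * (len(letters) + 1)
    i = 0
    for ch in pat:
        if ch.isdigit():
            digits[i] = int(ch)
        else:
            i += 1
    return letters, digits

def hyphenate(word, patterns):
    """Insert hyphens where the highest matching pattern digit is odd."""
    w = '.' + word.lower() + '.'      # dots anchor word boundaries
    scores = [0] * (len(w) + 1)       # scores[k] = weight just before w[k]
    for pat in patterns:
        letters, digits = parse_pattern(pat)
        idx = w.find(letters)
        while idx != -1:              # apply the pattern at every match
            for j, d in enumerate(digits):
                scores[idx + j] = max(scores[idx + j], d)
            idx = w.find(letters, idx + 1)
    out = []
    for i, ch in enumerate(word):
        if i > 0 and scores[i + 1] % 2 == 1:  # odd weight => break allowed
            out.append('-')
        out.append(ch)
    return ''.join(out)

# Hypothetical mini pattern set, just enough to handle one word:
patterns = ['hy3ph', 'hen5at', '1tio']
print(hyphenate('hyphenation', patterns))  # hy-phen-a-tion
```

The exception dictionary mentioned above would simply be checked before running the patterns, returning a stored hyphenation for the handful of words the patterns get wrong.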

How to determine the language of a piece of text?

1. TextBlob. Requires the NLTK package and uses Google.

pip install textblob

from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()

Note: this solution requires internet access, because TextBlob uses Google Translate's language detector by calling its API.

2. Polyglot. Requires numpy and some arcane libraries; unlikely to get it working on Windows. (For Windows, get an …

Java Stanford NLP: Part of Speech labels?

The Penn Treebank Project. Look at the part-of-speech tagging ps. JJ is adjective; NNS is noun, plural; VBP is verb, present tense; RB is adverb. That's for English. For Chinese, it's the Penn Chinese Treebank, and for German it's the NEGRA corpus.

CC   Coordinating conjunction
CD   Cardinal number
DT   Determiner
EX   Existential there
FW   Foreign …
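For quick reference in code, the tags quoted above can be kept in a simple lookup table (only the tags mentioned here; the full Penn tagset has several dozen entries):

```python
# Penn Treebank POS tags mentioned above, as a lookup table.
PENN_TAGS = {
    'CC': 'Coordinating conjunction',
    'CD': 'Cardinal number',
    'DT': 'Determiner',
    'EX': 'Existential there',
    'JJ': 'Adjective',
    'NNS': 'Noun, plural',
    'RB': 'Adverb',
    'VBP': 'Verb, present tense',
}

def describe(tag):
    """Human-readable description for a Penn Treebank tag."""
    return PENN_TAGS.get(tag, 'Unknown tag: ' + tag)

print(describe('JJ'))   # Adjective
print(describe('NNS'))  # Noun, plural
```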

What is the difference between lemmatization vs stemming?

Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope …
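The contrast can be made concrete with a toy sketch: a crude suffix-chopping stemmer next to a tiny hand-written lemma dictionary. Both are hypothetical illustrations only; real systems use something like the Porter stemmer and WordNet (both available via NLTK):

```python
def crude_stem(word):
    """Toy stemmer: blindly chop common English suffixes.
    Nothing guarantees the result is a real word."""
    for suffix in ('ing', 'ies', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: dictionary lookup mapping a word (given its part
# of speech) to its dictionary headword.
LEMMAS = {
    ('saw', 'verb'): 'see',
    ('saw', 'noun'): 'saw',
    ('ponies', 'noun'): 'pony',
}

def lemmatize(word, pos):
    return LEMMAS.get((word, pos), word)

print(crude_stem('ponies'))          # 'pon' -- chopped, not a real word
print(lemmatize('ponies', 'noun'))   # 'pony' -- a proper dictionary form
print(lemmatize('saw', 'verb'))      # 'see' -- lemma depends on POS
```

This shows the flavor difference from the quote: the stemmer is a cheap string heuristic, while the lemmatizer needs vocabulary and part-of-speech knowledge to return a genuine base form.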

How to compute the similarity between two text documents?

The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See esp. Introduction to Information Retrieval, which is free and available online.

Computing Pairwise Similarities

TF-IDF (and similar text transformations) are implemented in the Python …
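The approach can be sketched in pure Python (using a smoothed IDF, as common library implementations do, so that terms occurring in every document still get a nonzero weight; in practice you would use an established library rather than this):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each document into a sparse {term: tf-idf weight} dict."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            t: (count / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, count in tf.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ['the cat sat on the mat', 'the cat sat', 'dogs chase cats']
v = tfidf_vectors(docs)
print(cosine_similarity(v[0], v[1]))  # high: shared vocabulary
print(cosine_similarity(v[0], v[2]))  # 0.0: no shared terms
```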

How does the Google “Did you mean?” Algorithm work? [closed]

Here's the explanation (almost) directly from the source: Search 101!, at min 22:03. Worth watching!

Basically, according to Douglas Merrill, former CTO of Google, it works like this:

1) You type a (misspelled) word into Google.
2) You don't find what you wanted (you don't click on any results) …
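The idea described above, learning corrections from users who retype their query after not clicking anything, can be sketched as a simple counting model over hypothetical session logs. This is only an illustration of the principle, not Google's actual implementation, which draws on far richer signals:

```python
from collections import defaultdict

# Hypothetical session logs: (first query, follow-up query the same
# user typed shortly afterwards without clicking any result).
reformulations = [
    ('speel', 'spell'),
    ('speel', 'spell'),
    ('speel', 'spelt'),
    ('goverment', 'government'),
]

# Count how often each follow-up query was typed after each original.
corrections = defaultdict(lambda: defaultdict(int))
for original, followup in reformulations:
    corrections[original][followup] += 1

def did_you_mean(query):
    """Suggest the most frequent follow-up query, if any was seen."""
    candidates = corrections.get(query)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

print(did_you_mean('speel'))      # spell
print(did_you_mean('goverment'))  # government
```

The key point of the talk is exactly this: the suggestions come from aggregated user behavior, not from a hand-built dictionary.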
