Improving the extraction of human names with nltk [closed]

Question

I agree with the suggestion that “make my code better” isn’t well suited to this site, but I can point you to something you can dig into.

Disclaimer: This answer is about 7 years old and may need further updating for newer Python and NLTK versions. Please try it yourself, and if it works, share what you learned with us.

Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v2.0, but you must download some core files separately. Here is a script that can do all of that for you.

I wrote this script:

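import nltk
# Named NERTagger in the NLTK 2.x releases; NLTK 3.x renamed it to StanfordNERTagger
from nltk.tag.stanford import StanfordNERTagger

# Paths to the downloaded Stanford NER model and jar
st = StanfordNERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

# Tag each sentence and print only the tokens labeled PERSON
for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] == 'PERSON':
            print(tag)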

and got fairly good output, although a few non-names (Virgin, Galactic, Bitcoin) are tagged as PERSON too:

('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')
('Galactic', 'PERSON')
('Bitcoin', 'PERSON')
('Bitcoin', 'PERSON')
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')
('Nick', 'PERSON')
('Colas', 'PERSON')

Hope this is helpful.
