information-extraction

What is CoNLL data format?

March 8, 2023 by Tarik

There are many different CoNLL formats, since CoNLL runs a different shared task each year. The format for CoNLL 2009 is described here. Each line represents a single word as a series of tab-separated fields. Underscores (_) indicate empty values. Mate-Parser’s manual says that it uses the first 12 columns of CoNLL 2009: ID FORM LEMMA … Read more
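As a rough illustration of the "one word per line, tab-separated fields, underscore for empty" layout, here is a minimal Python sketch. The helper names (`parse_conll_line`, `read_sentences`) and the toy sample are hypothetical. The sample uses only four columns rather than real CoNLL 2009 rows, and only the first three column names (ID, FORM, LEMMA) come from the text above. The sketch also assumes the usual CoNLL convention that a blank line separates sentences.

```python
def parse_conll_line(line):
    """Split one word line into its tab-separated fields.

    Fields holding a lone underscore are treated as empty and become None.
    """
    fields = line.rstrip("\n").split("\t")
    return [None if field == "_" else field for field in fields]


def read_sentences(lines):
    """Group word lines into sentences.

    Assumption: sentences are separated by blank lines.
    """
    sentence = []
    for line in lines:
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(parse_conll_line(line))
    if sentence:
        yield sentence


# Toy data (hypothetical): ID, FORM, LEMMA, plus one empty column.
sample = "1\tThe\tthe\t_\n2\tcat\tcat\t_\n\n1\tHi\thi\t_\n"
sentences = list(read_sentences(sample.splitlines()))

for sentence in sentences:
    for row in sentence:
        token_id, form, lemma = row[:3]
        print(token_id, form, lemma)
```

Mapping `_` to `None` keeps "no value" distinct from a real string, which can be convenient if later columns are converted to numbers or looked up by name.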

PDF Parsing Using Python – extracting formatted and plain texts [closed]

January 31, 2023 by Tarik

You can also take a look at PDFMiner (or for older versions of Python see PDFMiner and PDFMiner). A particularly useful feature of PDFMiner is that you can control how it regroups text fragments when extracting them. You do this by specifying the spacing between lines, words, characters, etc. So, maybe by tweaking this … Read more

How does Apple find dates, times and addresses in emails?

December 11, 2022 by Tarik

They likely use Information Extraction techniques for this. Here is a demo of Stanford’s SUTime tool: http://nlp.stanford.edu:8080/sutime/process You would extract attributes about n-grams (consecutive words) in a document: numberOfLetters, numberOfSymbols, length, previousWord, nextWord, nextWordNumberOfSymbols, … You would then feed positive and negative examples to a classification algorithm, as a table with columns such as: Observation, nLetters, nSymbols, length, prevWord, nextWord, isPartOfDate … Read more
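To make the attribute list concrete, here is a minimal Python sketch of the feature-extraction step only. It is not Apple's or SUTime's method. The function names are hypothetical, tokens are split naively on whitespace, and a "symbol" is assumed to mean any character that is neither alphanumeric nor whitespace. The dictionary keys reuse the attribute names from the text. Each resulting row, once labeled (for example with isPartOfDate), could serve as one training example for a classifier of your choice.

```python
def count_letters(text):
    """Number of alphabetic characters in text."""
    return sum(ch.isalpha() for ch in text)


def count_symbols(text):
    """Number of symbol characters in text.

    Assumption: a symbol is anything that is neither alphanumeric
    nor whitespace (for example '/', ':', ',').
    """
    return sum(not ch.isalnum() and not ch.isspace() for ch in text)


def ngram_features(tokens, n=1):
    """Build one feature row per n-gram of consecutive tokens."""
    rows = []
    for start in range(len(tokens) - n + 1):
        end = start + n
        ngram = " ".join(tokens[start:end])
        prev_word = tokens[start - 1] if start > 0 else None
        next_word = tokens[end] if end < len(tokens) else None
        rows.append({
            "ngram": ngram,
            "numberOfLetters": count_letters(ngram),
            "numberOfSymbols": count_symbols(ngram),
            "length": len(ngram),
            "previousWord": prev_word,
            "nextWord": next_word,
            "nextWordNumberOfSymbols": count_symbols(next_word) if next_word else 0,
        })
    return rows


# Hypothetical example sentence, tokenized on whitespace.
tokens = "Lunch on 12/11/2022 at noon".split()
rows = ngram_features(tokens)
for row in rows:
    print(row)
```

Passing `n=2` yields features for consecutive word pairs instead of single words, which matters for dates or addresses that span several tokens.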