PDF Parsing Using Python – extracting formatted and plain texts [closed]

You can also take a look at PDFMiner (or for older versions of Python see PDFMiner and PDFMiner). A particular feature of interest in PDFMiner is that you can control how it regroups text parts when extracting them. You do this by specifying the space between lines, words, characters, etc. So, maybe by tweaking this … Read more

How does Apple find dates, times and addresses in emails?

They likely use Information Extraction techniques for this. Here is a demo of Stanford’s SUTime tool: http://nlp.stanford.edu:8080/sutime/process You would extract attributes about n-grams (consecutive words) in a document: numberOfLetters numberOfSymbols length previousWord nextWord nextWordNumberOfSymbols … And then use a classification algorithm, and feed it positive and negative examples: Observation nLetters nSymbols length prevWord nextWord isPartOfDate … Read more

tech