text-mining
What is CoNLL data format?
There are many different CoNLL formats because CoNLL hosts a different shared task each year. The format for CoNLL 2009 is described here. Each line represents a single word as a series of tab-separated fields. Underscores (_) indicate empty values. Mate-Parser’s manual says that it uses the first 12 columns of CoNLL 2009: ID FORM LEMMA … Read more
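As a rough illustration of the layout described above, the following sketch splits CoNLL-style lines on tabs and turns the _ marker into None. The function name parse_conll and the sample rows are made up for illustration, the code assumes that sentences are separated by blank lines, and it names only the three columns mentioned above.

```python
def parse_conll(text):
    """Parse CoNLL-style text into a list of sentences.

    Each sentence is a list of rows; each row is a list of fields,
    with "_" (the empty-value marker) replaced by None.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Assumption: a blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = [None if f == "_" else f for f in line.split("\t")]
        current.append(fields)
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample with only four columns, not real CoNLL 2009 data.
sample = "1\tDogs\tdog\t_\n2\tbark\tbark\t_\n\n1\tYes\tyes\t_\n"
for sentence in parse_conll(sample):
    for row in sentence:
        word_id, form, lemma = row[:3]  # ID, FORM, LEMMA
        print(word_id, form, lemma)
```

Keeping each row as a plain list avoids committing to column names beyond the ones listed; a real reader for a specific CoNLL year would likely map every column to a named field.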
How do I search for a pattern within a text file using Python combining regex & string/file operations and store instances of the pattern?
    import re

    pattern = re.compile(r"<(\d{4,5})>")
    with open("test.txt") as f:
        for i, line in enumerate(f):
            for match in pattern.finditer(line):
                print("Found on line %s: %s" % (i + 1, match.group()))

A couple of notes about the regex: You don’t need the ? at the end and the outer (…) if you don’t want to match the number with the angle brackets, … Read more
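The question also asks how to store the matches rather than only print them. One possible way, sketched here with a hypothetical helper named find_numbers, is to collect each hit into a list together with its line number. The sample lines are invented for illustration.

```python
import re

# Same pattern as in the answer: 4 or 5 digits inside angle brackets.
PATTERN = re.compile(r"<(\d{4,5})>")


def find_numbers(lines):
    """Return (line_number, full_match, digits) for every match in lines."""
    found = []
    for i, line in enumerate(lines, start=1):
        for match in PATTERN.finditer(line):
            found.append((i, match.group(), match.group(1)))
    return found


# Works on any iterable of strings, e.g. an open file:
#     with open("test.txt") as handle:
#         results = find_numbers(handle)
results = find_numbers(["id <1234> and <56789>", "no match <12>", "<123456>"])
for line_number, full, digits in results:
    print("Found on line %s: %s" % (line_number, full))
```

Here match.group() returns the text including the angle brackets, while match.group(1) returns only the digits captured by the parentheses.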
Inconsistent behaviour with tm_map transformation functions when using multiple cores
If you try to overwrite your memory with a program that uses parallel processing, you should first verify that it’s worth it. For instance, check if your disk is at 80%-100% writing speed; if that is the case, then your program could also just use a single core, because it is blocked by disk writing … Read more
What is “entropy and information gain”?
I assume entropy was mentioned in the context of building decision trees. To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names, each labeled with either m or f, we want to learn a model that fits the data and can be used to predict … Read more
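A minimal sketch of the two quantities named in the question, using the standard Shannon definition of entropy and the usual decision-tree definition of information gain (entropy before a split minus the size-weighted entropy after it). The toy names, their labels, and the "ends in a vowel" feature are assumptions made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(labels, groups):
    """Entropy of labels minus the size-weighted entropy of the groups."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


# Hypothetical toy data: (name, label) pairs, invented for illustration.
data = [("Anna", "f"), ("Maria", "f"), ("Emma", "f"), ("Karen", "f"),
        ("John", "m"), ("Peter", "m"), ("Luca", "m"), ("David", "m")]
labels = [label for _, label in data]

# Hypothetical feature: does the name end in a vowel?
ends_in_vowel = [label for name, label in data if name[-1].lower() in "aeiou"]
ends_in_other = [label for name, label in data if name[-1].lower() not in "aeiou"]

print(entropy(labels))
print(information_gain(labels, [ends_in_vowel, ends_in_other]))
```

With an even mix of m and f labels the entropy is 1 bit, the maximum for two classes. A feature is more useful for a split the more it reduces that value, which is what the information gain measures.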