First, your task fits into the information extraction (IE) area of research. There are broadly two levels of complexity:
- Extract from a given HTML page or a website with a fixed template (like Amazon). In this case the best way is to look at the HTML code of the pages and craft the corresponding XPath or DOM selectors to get to the right info (see the lxml sketch after this list). The disadvantage of this approach is that it does not generalize to new websites: you have to repeat the work for each website, one by one.
- Create a model that extracts the same information from many websites within one domain, on the assumption that there is some inherent regularity in how web designers present a given attribute (zip code, phone number, whatever else). In this case you engineer features so that an ML-based IE algorithm can "understand" the content of the pages. The most common features are: the DOM path, the format of the value (attribute) to be extracted, layout (bold, italic, etc.), and surrounding context words; a feature sketch follows this list. You label some values (you need at least 100-300 pages, depending on the domain, to reach reasonable quality) and then train a model on the labelled pages. There is also an unsupervised alternative, which leverages the same idea of regularity across pages: your algorithm looks for repetitive patterns across pages without any labelling and treats the most frequent ones as valid (a toy illustration of this follows as well).
The most challenging part overall will be working with the DOM tree and generating the right features. Labelling the data properly is also a tedious task. For ML models, have a look at CRF, 2D CRF, and semi-Markov CRF; a minimal training sketch follows.
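For the CRF route, here is a minimal training sketch using sklearn-crfsuite (one possible library choice, and a plain linear-chain CRF; the 2D and semi-Markov variants need specialized implementations). Each page is treated as a sequence of text nodes, each described by feature dicts like the ones sketched above; the toy data is made up:

```python
# pip install sklearn-crfsuite
import sklearn_crfsuite

# Each training example is one page, seen as a sequence of text nodes.
X_train = [
    [  # page 1
        {"dom_path": "/html/body/p[1]", "looks_like_phone": False, "is_bold": False},
        {"dom_path": "/html/body/p[1]/b", "looks_like_phone": True, "is_bold": True},
    ],
    [  # page 2
        {"dom_path": "/html/body/div/span", "looks_like_phone": True, "is_bold": False},
        {"dom_path": "/html/body/div/p", "looks_like_phone": False, "is_bold": False},
    ],
]
# One label per node: the attribute it carries, or "O" for none.
y_train = [["O", "PHONE"], ["PHONE", "O"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,  # L1 regularization
    c2=0.1,  # L2 regularization
    max_iterations=50,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```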
And finally, in the general case this is the cutting edge of IE research, not a hack that you can pull off in a few evenings.
p.s. I also think NLTK will not be very helpful here: it is an NLP library, not a Web-IE one.