“Smart” way of parsing and using website data? [closed]

I’ve done some of this recently, and here are my experiences.

There are three basic approaches:

  1. Regular Expressions.
    • Most flexible, easiest to use with loosely-structured info and changing formats.
    • Harder to do structural/tag analysis, but easier to do text matching.
    • Built-in validation of data formatting.
    • Harder to maintain than the others, because you have to write a regular expression for each pattern you want to use to extract/transform the document.
    • Generally slower than options 2 and 3.
    • Works well for lists of similarly-formatted items.
    • A good regex development/testing tool and some sample pages will help. I’ve got good things to say about RegexBuddy here. Try their demo.
    • I’ve had the most success with this. The flexibility lets you work with nasty, brutish, in-the-wild HTML code.
  2. Convert HTML to XHTML and use XML extraction tools. Clean up the HTML, convert it to legal XHTML, and use XPath/XQuery/X-whatever to query it as XML data.
    • Tools: TagSoup, HTMLTidy, etc.
    • Quality of the HTML-to-XHTML conversion is VERY important, and highly variable.
    • Best solution if the data you want is structured by the HTML layout and tags (data in HTML tables, lists, DIV/SPAN groups, etc.).
    • Most suitable for getting link structures, nested tables, images, lists, and so forth
    • Should be faster than option 1, but slower than option 3.
    • Works well if content formatting changes/is variable, but document structure/layout does not.
    • If the data isn’t structured by HTML tags, you’re in trouble.
    • Can be used with option 1.
  3. Parser generator (ANTLR, etc) — create a grammar for parsing & analyzing the page.
    • I have not tried this because it was not suitable for my (messy) pages.
    • Most suitable if the HTML is highly structured, very regular, and never changes.
    • Use this if there are easy-to-describe patterns in the document that don’t involve HTML tags but do involve recursion or complex behaviors.
    • Does not require XHTML input
    • FASTEST throughput, generally
    • Big learning curve, but easier to maintain.
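To make option 1 concrete, here’s a minimal sketch of pulling a list of similarly-formatted items out of raw HTML with a regex. The snippet and pattern are hypothetical; the point is the lazy quantifier plus capturing groups doing the extraction:

```python
import re

# Hypothetical snippet: similarly-formatted product rows.
html = """
<li class="item"><b>Widget</b> - $3.50</li>
<li class="item"><b>Gadget</b> - $7.25</li>
"""

# Lazy quantifier (.*?) and capturing groups pull out name and price
# without the match overshooting past the first closing tag.
pattern = re.compile(r'<li class="item"><b>(.*?)</b>\s*-\s*\$([\d.]+)</li>')

items = pattern.findall(html)
# items == [('Widget', '3.50'), ('Gadget', '7.25')]
```

Each new page format means a new pattern like this, which is exactly the maintenance cost noted above.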

I’ve tinkered with Web-Harvest for option 2, but I find its syntax kind of weird: a mix of XML and a pseudo-Java scripting language. If you like Java and XML-style data extraction (XPath, XQuery), it might be the ticket for you.
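For a taste of option 2 without Web-Harvest, here’s a sketch using Python’s standard library, which supports a limited XPath subset. It assumes the page has already been cleaned into well-formed XHTML (e.g. by HTMLTidy or TagSoup); the table markup is hypothetical:

```python
import xml.etree.ElementTree as ET

# Assumes the HTML has already been tidied into legal XHTML.
xhtml = """
<html><body>
  <table id="prices">
    <tr><td>Widget</td><td>3.50</td></tr>
    <tr><td>Gadget</td><td>7.25</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(xhtml)
# ElementTree's findall() accepts a limited XPath subset -- enough
# when the data is structured by layout and tags, as described above.
rows = [[td.text for td in tr.findall('td')]
        for tr in root.findall('.//table[@id="prices"]/tr')]
# rows == [['Widget', '3.50'], ['Gadget', '7.25']]
```

Note the query targets structure (the table with a given id), not text content, which is why this approach survives content changes but breaks on layout changes.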


Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP’s older regex libraries lack these, and they’re indispensable for matching data between open/close tags in HTML.
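To see why lazy quantifiers are indispensable here, compare greedy and lazy matching on nested open/close tags (shown in Python; the same distinction exists in any PCRE-style engine):

```python
import re

html = '<b>first</b> and <b>second</b>'

# Greedy .* runs to the LAST closing tag, swallowing everything between:
greedy = re.findall(r'<b>(.*)</b>', html)
# greedy == ['first</b> and <b>second']

# Lazy .*? stops at the FIRST closing tag, matching each element separately:
lazy = re.findall(r'<b>(.*?)</b>', html)
# lazy == ['first', 'second']
```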
