I’ve done some of this recently, and here are my experiences.
There are three basic approaches:
1. Regular expressions.
   - Most flexible, and easiest to use with loosely structured information and changing formats.
   - Harder to do structural/tag analysis, but easier to do text matching.
   - Built-in validation of data formatting: a pattern only matches text that has the shape you expect.
   - Harder to maintain than the others, because you have to write a regular expression for each pattern you want to use to extract or transform the document.
   - Generally slower than options 2 and 3.
   - Works well for lists of similarly formatted items.
   - A good regex development/testing tool and some sample pages will help. I’ve got good things to say about RegexBuddy here. Try their demo.
   - I’ve had the most success with this. The flexibility lets you work with nasty, brutish, in-the-wild HTML code.
2. Convert HTML to XHTML and use XML extraction tools. Clean up the HTML, convert it to legal XHTML, and use XPath/XQuery/X-whatever to query it as XML data.
   - Tools: TagSoup, HTMLTidy, etc.
   - The quality of the HTML-to-XHTML conversion is VERY important, and highly variable.
   - Best solution if the data you want is structured by the HTML layout and tags (data in HTML tables, lists, DIV/SPAN groups, etc.).
   - Most suitable for getting link structures, nested tables, images, lists, and so forth.
   - Should be faster than option 1, but slower than option 3.
   - Works well if the content formatting changes or varies, but the document structure/layout does not.
   - If the data isn’t structured by HTML tags, you’re in trouble.
   - Can be combined with option 1.
3. Parser generator (ANTLR, etc.): create a grammar for parsing and analyzing the page.
   - I have not tried this, because it was not suitable for my (messy) pages.
   - Most suitable if the HTML is highly structured, very regular, and never changes.
   - Use this if there are easy-to-describe patterns in the document that don’t involve HTML tags but do involve recursion or complex behaviors.
   - Does not require XHTML input.
   - FASTEST throughput, generally.
   - Big learning curve, but easier to maintain.
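To make option 2 a bit more concrete, here is a minimal Python sketch of the query step, also showing how it can be paired with option 1. It assumes the page has already been cleaned into well-formed XHTML by a tool such as TagSoup or HTMLTidy; the `XHTML` sample, the `prices` table id, and the `extract_prices` helper are all made up for illustration. It uses the limited XPath subset in the standard library's `xml.etree.ElementTree`, not a full XPath/XQuery engine.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: this is the output of an HTML-to-XHTML cleanup step
# (TagSoup, HTMLTidy, ...). Raw in-the-wild HTML would not parse here.
XHTML = """
<html><body>
  <table id="prices">
    <tr><td>Widget</td><td>$4.50</td></tr>
    <tr><td>Gadget</td><td>$12.00</td></tr>
    <tr><td>Broken row</td><td>call us</td></tr>
  </table>
</body></html>
"""

# Option 1 used as a validator: only cells shaped like "$12.00" are accepted.
PRICE = re.compile(r"^\$(\d+\.\d{2})$")


def extract_prices(xhtml):
    """Return (name, price) pairs from the hypothetical 'prices' table."""
    root = ET.fromstring(xhtml.strip())
    rows = []
    # Structure comes from the tags: find each row of the table by path.
    for tr in root.findall(".//table[@id='prices']/tr"):
        cells = [(td.text or "").strip() for td in tr.findall("td")]
        if len(cells) != 2:
            continue
        match = PRICE.match(cells[1])
        if match:  # rows whose price cell has an unexpected format are skipped
            rows.append((cells[0], float(match.group(1))))
    return rows


print(extract_prices(XHTML))
```

The path expression finds the rows through the document structure, and the regular expression only checks the formatting of one cell, which is roughly the division of labour described in the list above.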
I’ve tinkered with Web-Harvest for option 2, but I find its syntax kind of weird: it’s a mix of XML and some pseudo-Java scripting language. If you like Java, and like XML-style data extraction (XPath, XQuery), that might be the ticket for you.
Edit: if you use regular expressions, make sure you use a library with lazy quantifiers and capturing groups! PHP’s older POSIX regex functions (the ereg family) lack lazy quantifiers, and both features are indispensable for matching data between open/close tags in HTML.
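A small sketch of why lazy quantifiers matter, using Python's `re` module (the `SAMPLE` markup and the helper names are invented for this example):

```python
import re

# Hypothetical list of similarly formatted items.
SAMPLE = "<ul><li><b>Widget</b> $4.50</li><li><b>Gadget</b> $12.00</li></ul>"

# Lazy: (.*?) stops at the first closing </b>, so each item matches separately.
# The two capturing groups pull out the name and the price.
ITEM = re.compile(r"<li><b>(.*?)</b>\s*\$(\d+\.\d{2})</li>")

# Greedy: (.*) runs on to the LAST </b> in the input, swallowing everything between.
GREEDY_NAME = re.compile(r"<b>(.*)</b>")


def extract_items(html):
    """Return (name, price) string pairs, one per list item."""
    return ITEM.findall(html)


def greedy_names(html):
    """Show what the greedy version captures instead."""
    return GREEDY_NAME.findall(html)


print(extract_items(SAMPLE))  # [('Widget', '4.50'), ('Gadget', '12.00')]
print(greedy_names(SAMPLE))   # ['Widget</b> $4.50</li><li><b>Gadget']
```

The greedy pattern returns a single match that spans both items, which is the failure you run into when the library has no lazy quantifiers.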