Writing an HTML Parser

Question

The looseness of HTML can be accommodated by figuring out the missing open and close tags as needed. This is essentially what a validator like tidy does.

You’ll keep a stack (perhaps implicitly with a tree) of the current context. For example, {<html>, <body>} means you’re currently in the body of the html document. When you encounter a new node, you compare the requirements for that node to what’s currently on the stack.

Suppose your stack is currently just {<html>}. You encounter a <p> tag. You look up <p> in a table that tells you a paragraph must be inside the <body>. Since you’re not in the body, you implicitly push <body> onto your stack (or add a body node to your tree). Then you can put the <p> into the tree.

Now suppose you see another <p>. Your rules tell you that you cannot nest a paragraph within a paragraph, so you know you have to pop the current <p> off the stack (as though you had seen a close tag) before pushing the new paragraph onto the stack.

At the end of your document, you pop each remaining element off your stack, as though you had seen a close tag for each one.
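
The walk-through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is assumed to be already tokenized into ("open", tag) and ("close", tag) pairs, and the names REQUIRED_PARENT, NO_SELF_NESTING and normalize are made up for this example. The two tiny rule tables cover just the <html>, <body> and <p> cases discussed here.

```python
# Hypothetical rule tables for illustration; a real parser needs far more entries.
REQUIRED_PARENT = {"body": "html", "p": "body"}  # tag -> element it must be inside
NO_SELF_NESTING = {"p"}                          # tags that cannot nest in themselves


def normalize(tokens):
    """Turn a loose stream of tag tokens into a balanced list of open/close events."""
    stack, events = [], []

    def push(tag):
        stack.append(tag)
        events.append(("open", tag))

    def pop():
        events.append(("close", stack.pop()))

    def ensure_context(tag):
        # Implicitly open the required parent (and its own parents) if missing.
        parent = REQUIRED_PARENT.get(tag)
        if parent is not None and parent not in stack:
            ensure_context(parent)
            push(parent)

    for kind, tag in tokens:
        if kind == "open":
            if tag in NO_SELF_NESTING and tag in stack:
                # Act as though a close tag had been seen for the open element.
                while stack[-1] != tag:
                    pop()
                pop()
            ensure_context(tag)
            push(tag)
        elif kind == "close" and tag in stack:
            # Close any still-open children, then the element itself.
            while stack[-1] != tag:
                pop()
            pop()
        # A close tag with no matching open element is simply ignored here.

    while stack:  # end of document: close everything that is still open
        pop()
    return events


if __name__ == "__main__":
    for event in normalize([("open", "html"), ("open", "p"), ("open", "p")]):
        print(event)
```

Feeding it <html>, <p>, <p> yields an implied <body> before the first paragraph, an implied close of the first paragraph before the second, and closes for everything left on the stack at the end.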

The trick is to find a good way to represent the context requirements for each element.
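
One possible representation, offered as a sketch and not as the way any particular parser does it, is a small record per element: which parents it accepts, which parent to open implicitly when none fits, and which open elements its start tag ends. The ElementRule class, the RULES table and the plan function below are all hypothetical names, and the entries are heavily simplified, not taken from the HTML specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementRule:
    parents: frozenset = frozenset()      # elements this tag may sit directly inside
    default_parent: Optional[str] = None  # element to open implicitly when none fits
    closes: frozenset = frozenset()       # open elements that this start tag ends


# Illustrative, heavily simplified entries; not the HTML specification's rules.
RULES = {
    "html": ElementRule(),
    "body": ElementRule(frozenset({"html"}), "html"),
    "p": ElementRule(frozenset({"body"}), "body", frozenset({"p"})),
    "table": ElementRule(frozenset({"body"}), "body"),
    "tr": ElementRule(frozenset({"table"}), "table", frozenset({"tr"})),
    "td": ElementRule(frozenset({"tr"}), "tr", frozenset({"td"})),
}


def plan(stack, tag):
    """Return (to_close, to_open): the implied tags needed before pushing `tag`."""
    rule = RULES[tag]
    remaining, to_close = list(stack), []
    # Implied closes: pop back through the innermost element this tag ends.
    for i in range(len(remaining) - 1, -1, -1):
        if remaining[i] in rule.closes:
            to_close = remaining[i:][::-1]
            remaining = remaining[:i]
            break
    # Implied opens: follow default_parent links until the top of the stack fits.
    top = remaining[-1] if remaining else None
    to_open = []
    while rule.parents and top not in rule.parents:
        to_open.insert(0, rule.default_parent)
        rule = RULES[rule.default_parent]
    return to_close, to_open


if __name__ == "__main__":
    print(plan(["html"], "p"))
    print(plan(["html", "body", "p"], "p"))
    print(plan(["html"], "td"))
```

Keeping the rules as data means that supporting a new element is a table entry, not new control flow. This sketch only inspects the top of the stack when deciding what to open, so an element arriving in an unexpected context would need additional rules.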
