The looseness of HTML can be accommodated by figuring out the missing open and close tags as needed. This is essentially what a validator like tidy does.
You’ll keep a stack (perhaps implicitly with a tree) of the current context. For example, {<html>, <body>} means you’re currently in the body of the html document. When you encounter a new node, you compare the requirements for that node to what’s currently on the stack.
Suppose your stack is currently just {<html>}. You encounter a <p> tag. You look up <p> in a table that tells you a paragraph must be inside the <body>. Since you’re not in the body, you implicitly push <body> onto your stack (or add a body node to your tree). Then you can put the <p> into the tree.
Now suppose you see another <p>. Your rules tell you that you cannot nest a paragraph within a paragraph, so you know you have to pop the current <p> off the stack (as though you had seen a close tag) before pushing the new paragraph onto the stack.
At the end of your document, you pop each remaining element off your stack, as though you had seen a close tag for each one.
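The steps above can be sketched in a few lines of Python. This is only an illustration, not how tidy or any real parser is implemented: the tables `REQUIRED_PARENT` and `NO_SELF_NESTING` and the functions `open_tag`, `close_tag`, and `normalize` are hypothetical names, the rules cover just the three elements discussed here, and the input is assumed to be already tokenized into ("open" | "close", tag) pairs.

```python
# Hypothetical, deliberately tiny rule tables; real HTML has many more cases.
REQUIRED_PARENT = {"body": "html", "p": "body"}  # tag -> element it must sit inside
NO_SELF_NESTING = {"p"}                          # tags that cannot contain themselves


def open_tag(stack, events, tag):
    """Push tag, first emitting any implied close and open tags."""
    # A new <p> while a <p> is open implies a close tag for the old one
    # (and for anything still open inside it).
    if tag in NO_SELF_NESTING and tag in stack:
        while stack[-1] != tag:
            events.append("</%s>" % stack.pop())
        events.append("</%s>" % stack.pop())
    # If the required parent is not open, open it implicitly first.
    parent = REQUIRED_PARENT.get(tag)
    if parent is not None and parent not in stack:
        open_tag(stack, events, parent)
    stack.append(tag)
    events.append("<%s>" % tag)


def close_tag(stack, events, tag):
    """Pop down to and including tag; a stray close tag is ignored."""
    if tag not in stack:
        return
    while True:
        top = stack.pop()
        events.append("</%s>" % top)
        if top == tag:
            break


def normalize(tokens):
    """Return the tag sequence with all implied tags made explicit."""
    stack, events = [], []
    for kind, tag in tokens:
        if kind == "open":
            open_tag(stack, events, tag)
        else:
            close_tag(stack, events, tag)
    # End of document: close whatever is still open.
    while stack:
        events.append("</%s>" % stack.pop())
    return events


print(normalize([("open", "html"), ("open", "p"), ("open", "p")]))
# ['<html>', '<body>', '<p>', '</p>', '<p>', '</p>', '</body>', '</html>']
```

The example input is the scenario from the text: the <body> is pushed implicitly before the first <p>, the second <p> closes the first, and everything left on the stack is closed at the end.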
The trick is to find a good way to represent the context requirements for each element.
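One possible representation, offered as a sketch rather than a recommendation, is a table with one small record per element: which element must be open above it, and which open elements it implicitly closes. The names `ContextRule`, `RULES`, and `plan` are made up for this example, and the table only covers the elements mentioned above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ContextRule:
    # Element that must already be open somewhere on the stack (None = no requirement).
    required_parent: Optional[str] = None
    # Open elements that this tag implicitly closes when it appears.
    closes: FrozenSet[str] = frozenset()


# Hypothetical rule table covering only the elements used in this answer.
RULES = {
    "html": ContextRule(),
    "body": ContextRule(required_parent="html"),
    "p": ContextRule(required_parent="body", closes=frozenset({"p"})),
}


def plan(stack, tag):
    """Return (to_close, to_open): what to pop, then what to push, before pushing tag."""
    rule = RULES.get(tag, ContextRule())
    remaining = list(stack)
    to_close = []
    # Pop until no open element is one that this tag closes.
    while any(name in rule.closes for name in remaining):
        to_close.append(remaining.pop())
    # Walk up the chain of required parents that are not open yet.
    to_open = []
    parent = rule.required_parent
    while parent is not None and parent not in remaining:
        to_open.insert(0, parent)
        parent = RULES.get(parent, ContextRule()).required_parent
    return to_close, to_open


print(plan(["html"], "p"))               # ([], ['body'])
print(plan(["html", "body", "p"], "p"))  # (['p'], [])
```

Keeping the rules as data means the stack-handling code stays generic, and supporting another element is a matter of adding a table entry. Real HTML needs richer rules than a single required parent (alternatives such as several allowed parents, for instance), so a production table would need more fields than this.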