You can do so in several steps.
- Parse HTML with parse5. The downside is that the result is not a DOM, though it is fast enough and W3C-compliant.
- Serialize it to XHTML with xmlserializer, which accepts the DOM-like structures of parse5 as input.
- Parse that XHTML again with xmldom. Now you finally have a DOM.
- The xpath library builds upon xmldom, allowing you to run XPath queries. Be aware that XHTML has its own namespace, so queries like //a won’t work.
Putting it all together, you get something like this:
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;
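(async () => {
  // Read the HTML file and parse it into a parse5 tree (not a real DOM).
  const html = await fs.readFile('./test.htm');
  const document = parse5.parse(html.toString());
  // Serialize the tree to XHTML, then re-parse it with xmldom to get a DOM.
  const xhtml = xmlser.serializeToString(document);
  const doc = new dom().parseFromString(xhtml);
  // Bind the x prefix to the XHTML namespace and run the query.
  const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
  const nodes = select("//x:a/@href", doc);
  console.log(nodes);
})();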
Note that every HTML element name in a query must carry the x: prefix. For example, to match an a element inside a div, you need:
//x:div/x:a
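If typing the prefix by hand gets tedious, a small helper can add it for you. The sketch below is only an illustration: prefixElements is a hypothetical name, not part of xpath or any other library mentioned here, and it assumes plain location paths only. It refuses queries with predicates, string literals, unions, or explicit axes rather than risk rewriting them incorrectly.

```javascript
// Hypothetical helper (not part of the xpath library): adds a namespace
// prefix to every element name in a plain XPath location path.
function prefixElements(query, prefix = 'x') {
  // Bail out on anything this naive approach cannot handle safely:
  // predicates, string literals, unions, axes, or already-prefixed names.
  if (/[\[\]'"|:]/.test(query)) {
    throw new Error('Only plain location paths are supported');
  }
  return query
    .split('/')
    .map((step) =>
      // Prefix bare element names only; empty steps (from // or a leading /),
      // *, ., .., @attr and node tests such as text() are left unchanged.
      /^[A-Za-z_][\w.-]*$/.test(step) ? `${prefix}:${step}` : step
    )
    .join('/');
}

console.log(prefixElements('//div/a')); // prints "//x:div/x:a"
```

The result can be passed to the select function from the example above, provided the prefix argument matches the one given to useNamespaces.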