html-parsing – Tarik Billa

What can I do when a regular expression pattern doesn’t match anywhere in a string?

January 18, 2024 by Tarik

Oh Yes You Can Use Regexes to Parse HTML! For the task you are attempting, regexes are perfectly fine! It is true that most people underestimate the difficulty of parsing HTML with regular expressions and therefore do so poorly. But this is not some fundamental flaw related to computational theory. That silliness is parroted a … Read more

jQuery-like interface for PHP?

December 30, 2023 by Tarik

HtmlAgilityPack set node InnerText

December 26, 2023 by Tarik

Try the code below. It selects all nodes without children and filters out script nodes; you may need to add some additional filtering. In addition to your XPath expression, this one also looks for leaf nodes and filters out the text content of <script> tags. var nodes = doc.DocumentNode.SelectNodes("//body//text()[(normalize-space(.) != '') and not(parent::script) and not(*)]"); foreach (HtmlNode … Read more
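
To make the filter in that XPath expression concrete, here is a rough Python sketch of the same idea using only the standard library's html.parser instead of HtmlAgilityPack (a swapped-in technique, not the answer's code). The class name BodyTextCollector and the helper body_texts are made up for illustration, and unlike the XPath query this sketch returns stripped strings rather than nodes.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Collect non-blank text inside <body>, skipping <script> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.in_script = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Mirrors normalize-space(.) != '' and not(parent::script).
        if self.in_body and not self.in_script and data.strip():
            self.texts.append(data.strip())


def body_texts(html):
    """Hypothetical helper: return the collected text pieces as a list."""
    collector = BodyTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


sample = (
    "<html><head><title>T</title></head><body><h1>Hi</h1> "
    "<script>var x = 1;</script><p>there <b>you</b></p></body></html>"
)
print(body_texts(sample))
```

Whitespace-only text and the script body are dropped, which is the same extra filtering the XPath predicate expresses.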

Android HTML ImageGetter as AsyncTask

December 18, 2023 by Tarik

beautiful soup getting tag.id

November 25, 2023 by Tarik

You can access a tag’s attributes by treating the tag like a dictionary (documentation): for tag in soup.find_all(class_="bookmark blurb group"): print(tag.get('id')) The reason tag.id didn’t work is that it is equivalent to tag.find('id'), which results in None since there is no id tag found (documentation).
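
A minimal sketch of the difference, assuming the bs4 package is installed; the markup and id values below are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example.
html = """
<ul>
  <li id="work_101" class="bookmark blurb group">First</li>
  <li id="work_102" class="bookmark blurb group">Second</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all(class_="bookmark blurb group")

# Dictionary-style access reads attributes; .get() returns None if missing.
ids = [tag.get("id") for tag in tags]
first = tags[0]

print(ids)
print(first["id"])  # same lookup, but raises KeyError if the attribute is missing
print(first.id)     # shorthand for first.find("id"): looks for a child <id> tag
```

Use tag.get('id') when the attribute may be absent and tag['id'] when a missing attribute should be treated as an error.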

Problem with HTML Parser in IE

September 21, 2023 by Tarik

You’re modifying the document while it’s still being loaded (before the browser has “seen” the closing tag for this element). This creates a very tricky situation for the parser, and IE does not allow it. The IE blog has an explanation of this. The solution is to modify another element that’s earlier in the document and has been loaded completely (where … Read more

Writing an HTML Parser

September 13, 2023 by Tarik

The looseness of HTML can be accommodated by figuring out the missing open and close tags as needed. This is essentially what a validator like tidy does. You’ll keep a stack (perhaps implicitly with a tree) of the current context. For example, {<html>, <body>} means you’re currently in the body of the HTML document. When … Read more
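
The stack idea can be sketched in a few lines of Python on top of the standard library's html.parser tokenizer. This is only an illustration under simplifying assumptions: the rule tables VOID and IMPLIED_END, the class StackRepairer, and the helper repair are hypothetical names, attributes are dropped, and real HTML needs far more rules than the two shown here.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny rule tables; real HTML has many more cases.
VOID = {"br", "hr", "img", "input", "meta", "link"}
# Opening the key tag implicitly closes an open element of the listed kinds.
IMPLIED_END = {"p": {"p"}, "li": {"li"}}


class StackRepairer(HTMLParser):
    """Re-emit markup with missing close tags filled in from a stack."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # current context, e.g. ["html", "body"]
        self.out = []

    def _close_top(self):
        self.out.append(f"</{self.stack.pop()}>")

    def handle_starttag(self, tag, attrs):
        # Close any open element that the new tag implicitly ends.
        while self.stack and self.stack[-1] in IMPLIED_END.get(tag, ()):
            self._close_top()
        self.out.append(f"<{tag}>")  # attributes dropped for brevity
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray close tag with no matching open tag: ignore it
        while self.stack[-1] != tag:
            self._close_top()  # supply the close tags that were left out
        self._close_top()

    def handle_data(self, data):
        self.out.append(data)

    def close(self):
        super().close()  # flushes any buffered text first
        while self.stack:
            self._close_top()  # end of input closes everything still open


def repair(html):
    parser = StackRepairer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(repair("<html><body><ul><li>a<li>b</ul><p>one<p>two</i>"))
```

At any point during parsing, self.stack holds the current context in the sense described above, and every repair decision is made by comparing the incoming tag against the top of that stack.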

Cleaning HTML by removing extra/redundant formatting tags

September 13, 2023 by Tarik

What is the best practice for parsing remote content with jQuery?

September 1, 2023 by Tarik

Instead of hacking jQuery to do this, I’d suggest you drop out of jQuery for a minute and use raw XML DOM methods. Using XML DOM methods you can do this: window.onload = function(){ $.ajax({ type: 'GET', url: 'text.html', dataType: 'html', success: function(data) { //cross platform xml object creation from w3schools try //Internet Explorer … Read more

How to parse an HTML string in Google Apps Script without using XmlService? [duplicate]

August 30, 2023 by Tarik

I made cheeriogs for your problem. It works on GAS like cheerio, which is a jQuery-like API. You can use it like this: const content = UrlFetchApp.fetch('https://example.co/').getContentText(); const $ = Cheerio.load(content); Logger.log($('p .blah').first().text()); // blah blah blah … See also https://github.com/asciian/cheeriogs