HtmlAgilityPack selecting childNodes not as expected
Remove the leading forward slash from "/img[@alt]": a leading slash tells XPath to start at the root of the document instead of at the current node.
HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
The code below works correctly with the example provided and even deals with some odd input like <div><br></div>. There are still some things to improve, but the basic idea is there. See the comments.
public static string FormatLineBreaks(string html)
{
    // first - remove all the existing '\n' from HTML
    // they mean nothing in HTML, but break our …
I wrote an algorithm based on Oded’s suggestions. Here it is. Works like a charm. It removes all tags except strong, em, u and raw text nodes.
internal static string RemoveUnwantedTags(string data)
{
    if (string.IsNullOrEmpty(data)) return string.Empty;
    var document = new HtmlDocument();
    document.LoadHtml(data);
    var acceptableTags = new String[] { "strong", "em", "u" };
    var nodes = new …
Use the contains function: //div[contains(@id,'test')]
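To see what that expression selects, here is a minimal sketch. It assumes well-formed XML and uses the BCL's System.Xml classes as a stand-in for the Html Agility Pack, so that it runs without extra packages; the XPath 1.0 contains function behaves the same way in both. ContainsDemo and DivIdsContaining are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ContainsDemo
{
    // Returns the id of every <div> whose id attribute contains the given fragment.
    // Assumption: the fragment contains no single quote, because it is
    // inserted verbatim into the XPath string literal.
    public static List<string> DivIdsContaining(string xml, string fragment)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var ids = new List<string>();
        foreach (XmlNode div in doc.SelectNodes("//div[contains(@id,'" + fragment + "')]"))
        {
            // Only divs that have an id attribute can match the predicate.
            ids.Add(div.Attributes["id"].Value);
        }
        return ids;
    }
}
```

Called with "<root><div id='test1'/><div id='other'/><div id='mytest'/><div/></root>" and the fragment "test", this returns test1 and mytest. Note that contains is a plain substring match, so it matches anywhere inside the attribute value.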
If you use //, the search starts from the beginning of the document. Use .// to search all descendants of the current node:
foreach (HtmlAgilityPack.HtmlNode node2 in node.SelectNodes(".//span[@prop]"))
Or drop the prefix entirely to search only the direct children:
foreach (HtmlAgilityPack.HtmlNode node2 in node.SelectNodes("span[@prop]"))
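The difference between the prefixes can be checked with a small sketch. It assumes well-formed XML and uses System.Xml from the BCL rather than the Html Agility Pack, purely so that it is self-contained; the XPath scoping rules shown are those of XPath 1.0. XPathScopeDemo, CountFromFirstDiv and the sample markup are hypothetical.

```csharp
using System;
using System.Xml;

public static class XPathScopeDemo
{
    // Sample markup: the first <div> has one direct <span prop> child and
    // one nested inside a <p>; the second <div> has one more.
    private const string Xml =
        "<root>" +
        "<div id='a'><span prop='1'/><p><span prop='2'/></p></div>" +
        "<div id='b'><span prop='3'/></div>" +
        "</root>";

    // Evaluates the expression with the first <div> as the context node
    // and returns how many nodes it selects.
    public static int CountFromFirstDiv(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        XmlNode firstDiv = doc.SelectSingleNode("//div");
        return firstDiv.SelectNodes(xpath).Count;
    }
}
```

With the first div as the context node, "//span[@prop]" selects 3 nodes (the whole document), ".//span[@prop]" selects 2 (all descendants of the div), "span[@prop]" selects 1 (direct children only), and "/span[@prop]" selects 0, because the leading slash restarts at the document root, whose only child is root.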
The Html Agility Pack is equipped with a utility class called HtmlEntity. It has a static method with the following signature:
/// <summary>
/// Replace known entities by characters.
/// </summary>
/// <param name="text">The source text.</param>
/// <returns>The result text.</returns>
public static string DeEntitize(string text)
It supports well-known entities (like &nbsp;) and encoded characters such …
How about something like:
Using HTML Agility Pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
{
    Console.WriteLine("Found: " + table.Id);
    foreach (HtmlNode row in table.SelectNodes("tr"))
    {
        Console.WriteLine("row");
        foreach (HtmlNode cell in row.SelectNodes("th|td"))
        {
            Console.WriteLine("cell: " + cell.InnerText);
        }
    }
}
Note that you can make it prettier with LINQ-to-Objects if …
The expression you’re looking for is: //div[contains(@class, 'class1') and contains(@class, 'class2')]
I highly suggest XPath Visualizer, which can help you debug XPath expressions easily. It can be found here: http://xpathvisualizer.codeplex.com/
(Updated 2018-03-17) The problem: The problem, as you’ve spotted, is that String.Contains does not perform a word-boundary check, so Contains("float") will return true for both "foo float bar" (correct) and "unfloating" (which is incorrect). The solution is to ensure that "float" (or whatever your desired class-name is) appears alongside a word-boundary at both ends. A …
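The excerpt is cut off before its own solution, so the following is only one possible way to get a word-boundary check, not necessarily the approach the answer goes on to describe: split the class attribute on whitespace and compare whole tokens. ClassTokenDemo and HasClass are hypothetical names, and the sketch uses only the BCL.

```csharp
using System;

public static class ClassTokenDemo
{
    // Whitespace characters that can separate class names in an attribute value.
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n', '\f' };

    // True only when className appears as a complete whitespace-separated
    // token in the class attribute, not merely as a substring.
    public static bool HasClass(string classAttribute, string className)
    {
        if (string.IsNullOrWhiteSpace(classAttribute)) return false;

        string[] tokens = classAttribute.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        // Array.IndexOf compares strings by value (ordinal, case-sensitive).
        return Array.IndexOf(tokens, className) >= 0;
    }
}
```

HasClass("foo float bar", "float") is true, while HasClass("unfloating", "float") is false, even though "unfloating".Contains("float") is true. A common XPath 1.0 idiom for the same whole-token test is contains(concat(' ', normalize-space(@class), ' '), ' float ').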