jsoup – Tarik Billa

Jsoup Cookies for HTTPS scraping

January 1, 2024 by Tarik

I know I'm about 10 months late here, but a good option with Jsoup is this short piece of code:

    // This will get you the response.
    Response res = Jsoup
        .connect("url")
        .data("loginField", "[email protected]", "passField", "pass1234")
        .method(Method.POST)
        .execute();

    // This will get you the cookies.
    Map<String, String> cookies = res.cookies();

    // And this is the …
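The excerpt is cut off before its last step. As a minimal sketch, and only assuming the goal is to stay logged in on later requests, the cookie map can be attached to a new connection with Connection.cookies(Map). The class name CookieReuseSketch, the URL, and the cookie value are hypothetical, and jsoup is assumed to be on the classpath. Nothing is sent over the network until get() or execute() is called.

```java
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

final class CookieReuseSketch {

    // Builds (but does not send) a request that carries the given cookies.
    static Connection withSession(String url, Map<String, String> cookies) {
        return Jsoup.connect(url).cookies(cookies);
    }

    public static void main(String[] args) {
        // Hypothetical cookie, standing in for the map returned by res.cookies().
        Map<String, String> cookies = Map.of("JSESSIONID", "abc123");
        Connection next = withSession("https://example.com/account", cookies);

        // The prepared request now holds the session cookie.
        System.out.println(next.request().cookies());

        // Calling next.get() at this point would fetch the page with that cookie attached.
    }
}
```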

Does jsoup support xpath?

December 17, 2023 by Tarik

JSoup doesn't support XPath yet, but you may try XSoup ("Jsoup with XPath"). Here's an example quoted from the project's GitHub site:

    @Test
    public void testSelect() {
        String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
                "<table><tr><td>a</td><td>b</td></tr></table></html>";
        Document document = Jsoup.parse(html);

        String result = Xsoup.compile("//a/@href").evaluate(document).get();
        Assert.assertEquals("https://github.com", result);

        List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
        Assert.assertEquals("a", list.get(0));
        Assert.assertEquals("b", list.get(1));
    }

…

Jsoup select div having multiple classes

December 13, 2023 by Tarik

Works for me with the latest Jsoup (1.5.2).

    String html = "<div class=\"content-text right-align bold-font\">foo</div>";
    Document document = Jsoup.parse(html);
    Elements elements = document.select("div.content-text.right-align.bold-font");
    System.out.println(elements.text()); // foo

So either you're using an outdated version of Jsoup that has a bug related to this, or the actual HTML doesn't contain a <div> like that.

Jsoup select and iterate all elements

December 1, 2023 by Tarik

You can select all elements of the document using the * selector and then get the text of each one individually using Element#ownText().

    Elements elements = document.body().select("*");
    for (Element element : elements) {
        System.out.println(element.ownText());
    }
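To make the point of ownText() concrete, here is a small sketch, assuming jsoup is on the classpath; the class name OwnTextSketch and the sample markup are made up for illustration. ownText() returns only the text that belongs directly to an element, not the text of its children, so each piece of text is reported once, by the element that owns it.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class OwnTextSketch {

    // Collects the non-empty own text of every element under <body>, in document order.
    static List<String> ownTexts(String html) {
        Document document = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element element : document.body().select("*")) {
            String own = element.ownText();
            if (!own.isEmpty()) {
                texts.add(own);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical markup: the <div> owns "outer", the nested <p> owns "inner".
        for (String text : ownTexts("<div>outer<p>inner</p></div>")) {
            System.out.println(text);
        }
    }
}
```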

(how) can I download an image using JSoup?

September 11, 2023 by Tarik

I didn't even finish writing the question before I found the answer via JSoup and a little experimentation.

    // Open a connection to the image URL
    Response resultImageResponse = Jsoup.connect(imageLocation)
        .cookies(cookies)
        .ignoreContentType(true)
        .execute();

    // Write the raw bytes to disk; bodyAsBytes() holds the image's contents.
    // try-with-resources closes the stream even if the write fails.
    try (FileOutputStream out = new FileOutputStream(new java.io.File(outputFolder + name))) {
        out.write(resultImageResponse.bodyAsBytes());
    }

How to parse XML with jsoup

August 21, 2023 by Tarik

It seems the latest version of Jsoup (1.6.2, released March 28, 2012) includes some basic support for XML.

    String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><tests><test><id>xxx</id><status>xxx</status></test><test><id>xxx</id><status>xxx</status></test></tests>";
    Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
    for (Element e : doc.select("test")) {
        System.out.println(e);
    }

Give that a shot.

Jsoup: how to get an image’s absolute url?

August 20, 2023 by Tarik

Once you have the image element, e.g.:

    Element image = document.select("img").first();
    String url = image.absUrl("src");
    // url = http://www.example.com/images/chicken.jpg

Alternatively:

    String url = image.attr("abs:src");

Jsoup has a built-in absUrl() method on all nodes to resolve an attribute to an absolute URL, using the base URL of the node (which could be different from the URL …
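What "resolving against a base URL" means can be illustrated without jsoup. The sketch below uses the JDK's java.net.URI instead of absUrl(), so it shows the general idea rather than jsoup's own implementation; the class name AbsUrlSketch and the page location are hypothetical.

```java
import java.net.URI;

final class AbsUrlSketch {

    // Resolves a possibly relative reference against a base URI.
    static String resolve(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Hypothetical location of the page that contains the <img> tag.
        String base = "http://www.example.com/recipes/index.html";

        // Root-relative, parent-relative, and already-absolute src values.
        System.out.println(resolve(base, "/images/chicken.jpg"));
        System.out.println(resolve(base, "../images/chicken.jpg"));
        System.out.println(resolve(base, "https://cdn.example.org/chicken.jpg"));
    }
}
```

In jsoup itself the base comes from the document, so absUrl() returns an empty string when the document was parsed without a base URI and the attribute is relative.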

How to parse data in Talend with Java (coming from a previously produced .txt file)?

August 18, 2023 by Tarik

This is a Talend-related problem: in your code, use fully qualified names, including their packages. For your document parsing, for example, you can use:

    Document document = org.jsoup.Jsoup.parse(new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"), "utf-8");

How to parse HTML table using jsoup?

August 15, 2023 by Tarik

Yes, it is possible with JSoup. First, select the table. Then select the <tr> tags for the rows. You can start from the second row (index 1), since the first row contains only the column names. Then loop over the <th> tags and get the specific index. In your case, the indexes 7 and 5 are …
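The steps above can be sketched roughly as follows, assuming jsoup is on the classpath. The class name TableSketch, the sample table, and the column index are made up for illustration, and the sketch assumes that data cells are <td> elements while only the header row uses <th>; adjust the selector if the page differs.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

final class TableSketch {

    // Hypothetical sample markup; a real page would be fetched or read from disk.
    static final String HTML =
        "<table>"
        + "<tr><th>Name</th><th>Score</th></tr>"
        + "<tr><td>Ann</td><td>7</td></tr>"
        + "<tr><td>Bob</td><td>5</td></tr>"
        + "</table>";

    // Returns the text of one column from the first table, skipping the header row.
    static List<String> column(String html, int columnIndex) {
        Document doc = Jsoup.parse(html);
        List<String> values = new ArrayList<>();

        Element table = doc.select("table").first(); // step 1: select the table
        if (table == null) {
            return values;
        }

        Elements rows = table.select("tr");          // step 2: select the rows
        for (int i = 1; i < rows.size(); i++) {      // step 3: start at index 1 to skip the header
            Elements cells = rows.get(i).select("td");
            if (columnIndex < cells.size()) {        // step 4: pick the wanted cell by index
                values.add(cells.get(columnIndex).text());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(column(HTML, 1));
    }
}
```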

Connection error: “org.jsoup.UnsupportedMimeTypeException: Unhandled content type”

June 12, 2023 by Tarik

Use ignoreContentType():

    String myURL = "http://www.rfi.ro/podcast/emisiune/174/feed.xml";
    Document pod = Jsoup.connect(myURL).ignoreContentType(true).get();