Pandas error in Python: columns must be same length as key

You need to modify the solution a bit, because the split sometimes returns two columns and sometimes only one:

df2 = pd.DataFrame({'STATUS': ['Estimated 3:17 PM', 'Delayed 3:00 PM']})
df3 = df2['STATUS'].str.split(n=1, expand=True)
df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
print (df3)
  STATUS_ID1 STATUS_ID2
0  Estimated    3:17 PM
1    Delayed    3:00 PM

df2 = df2.join(df3)
print (df2)
              STATUS STATUS_ID1 STATUS_ID2
0 … Read more
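The point of renaming from the split result's own columns is that the code keeps working however many pieces the split produces. A runnable sketch of the approach, using the sample data from the excerpt:

```python
import pandas as pd

# Sample data from the answer above.
df2 = pd.DataFrame({'STATUS': ['Estimated 3:17 PM', 'Delayed 3:00 PM']})

# Split on the first whitespace only; expand=True returns a DataFrame
# whose column count depends on how many pieces the split produced.
df3 = df2['STATUS'].str.split(n=1, expand=True)

# Rename dynamically from df3's actual columns, so this works whether
# the split produced one column or two.
df3.columns = ['STATUS_ID{}'.format(x + 1) for x in df3.columns]

df2 = df2.join(df3)
print(df2)
```

Assigning a hard-coded two-element list to `df3.columns` is what raises "Columns must be same length as key" when a row has no whitespace and the split yields only one column.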

BeautifulSoup: what's the difference between 'lxml', 'html.parser' and 'html5lib' parsers?

From the docs' summarized table of advantages and disadvantages:

html.parser – BeautifulSoup(markup, "html.parser")
Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2)
Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

lxml – BeautifulSoup(markup, "lxml")
Advantages: Very fast, Lenient
Disadvantages: External C dependency

html5lib – BeautifulSoup(markup, "html5lib")
Advantages: Extremely lenient, Parses pages … Read more
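The leniency differences only show up on broken markup. A minimal sketch using the built-in parser (lxml and html5lib would each need a separate pip install, and each repairs the same broken input differently):

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: an unclosed <a> followed by a stray </p>.
broken = "<a></p>"

# The built-in html.parser drops the stray </p> and closes the <a>;
# html5lib would instead insert an empty <p></p> inside the link, and
# lxml would wrap the result in full <html><body> scaffolding.
soup = BeautifulSoup(broken, "html.parser")
print(soup)
```

Because the parsers disagree on invalid HTML, scraping code that relies on the repaired tree should pin one parser explicitly rather than letting BeautifulSoup pick whichever is installed.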

How do I avoid HTTP error 403 when web scraping with Python?

This is probably because of mod_security or some similar server security feature that blocks known spider/bot user agents (urllib sends something like Python-urllib/3.3.0, which is easily detected). Try setting a known browser user agent:

from urllib.request import Request, urlopen

req = Request(
    url='http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()

This works for me. By … Read more
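A runnable sketch of the header trick (the fetch itself is commented out so no network call is made; the URL is shortened from the answer):

```python
from urllib.request import Request, urlopen

# Supply a browser-like User-Agent instead of the default
# Python-urllib/x.y, which many servers reject with 403.
req = Request(
    url='http://www.cmegroup.com/trading/products/',
    headers={'User-Agent': 'Mozilla/5.0'},
)

# urllib normalizes header names to capitalized-first-word form,
# so the stored header is retrieved as 'User-agent'.
print(req.get_header('User-agent'))

# To actually fetch the page:
# webpage = urlopen(req).read()
```

If the server still returns 403 with a browser User-Agent, it is usually checking more than that header (cookies, Referer, or JavaScript challenges), and a plain urllib request will not be enough.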

Jsoup Cookies for HTTPS scraping

I know I'm kinda late by 10 months here, but a good option using Jsoup is this easy peasy piece of code:

// This will get you the response.
Response res = Jsoup
    .connect("url")
    .data("loginField", "[email protected]", "passField", "pass1234")
    .method(Method.POST)
    .execute();

// This will get you the cookies.
Map<String, String> cookies = res.cookies();

// And this is the … Read more
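The same flow, POST the credentials, capture the session cookies from the response, then reuse them on later requests, can be sketched in Python with only the standard library. The URL and form field names below are placeholders mirroring the Jsoup snippet, not a real endpoint:

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# A cookie jar shared by every request made through this opener.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# POSTing the login form would store any Set-Cookie headers in `jar`
# automatically (commented out so the sketch makes no network call):
# data = urllib.parse.urlencode(
#     {'loginField': 'user@example.com', 'passField': 'pass1234'}).encode()
# opener.open('https://example.com/login', data)

# Any later opener.open(...) through the same opener sends those
# cookies back, giving an authenticated session, analogous to passing
# res.cookies() into the next Jsoup.connect(...) call.
print(len(jar))
```

This works for sites that use plain cookie-based sessions over HTTPS; sites with CSRF tokens in the login form need the token scraped from the login page first.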

How to get all html data after all scripts and page loading is done? (puppeteer)

If you want the full HTML, the same as you see in the inspector, here it is:

const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
    const data = await page.evaluate(() => document.querySelector('*').outerHTML);
    console.log(data);
    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();

Can a Telegram bot read messages of channel

The FAQ reads:

All bots, regardless of settings, will receive:
- All service messages.
- All messages from private chats with users.
- All messages from channels where they are a member.

Bot admins and bots with privacy mode disabled will receive all messages except messages sent by other bots.

Bots with privacy mode enabled will receive:
- Commands … Read more
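In the Bot API, channel messages arrive on each update under the separate channel_post field rather than message, so handlers must check both. A minimal sketch (the update dicts below are hand-made examples, not real API output):

```python
# Channel messages arrive under "channel_post"; ordinary chat
# messages under "message". A handler that reads both:
def extract_text(update):
    msg = update.get("message") or update.get("channel_post")
    return msg.get("text") if msg else None

channel_update = {
    "update_id": 1,
    "channel_post": {"chat": {"type": "channel", "title": "news"}, "text": "hello"},
}
private_update = {
    "update_id": 2,
    "message": {"chat": {"type": "private"}, "text": "hi bot"},
}

print(extract_text(channel_update))  # hello
print(extract_text(private_update))  # hi bot
```

The bot still has to be added to the channel as an administrator to receive these updates at all; privacy mode only affects group chats, not channels.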

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)