Scrape websites with infinite scrolling

You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.

Step 1: Install Selenium using pip:

pip install selenium

Step 2: Use the code below to automate the infinite scroll and extract the page source:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import … Read more
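For reference, a minimal sketch of the scroll-and-extract loop such a script builds up to, assuming Selenium 4 with Chrome; the URL and the fixed sleep are placeholders:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/feed")  # placeholder URL

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom so the page loads the next batch of content
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # crude fixed wait; an explicit wait condition is more robust
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new loaded, so we have reached the end of the feed
    last_height = new_height

html = driver.page_source  # hand this off to BeautifulSoup or similar
driver.quit()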

How to scrape a website that requires login first with Python

This works for me:

##################################### Method 1

import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]

# The site we will navigate into, handling its session
br.open('https://github.com/login')
# … Read more
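Note that mechanize and cookielib above are Python 2 libraries. For comparison, a rough Python 3 sketch of the same login flow using requests.Session; the form field names are assumptions based on GitHub's login form and should be verified against the page you actually target:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({"User-Agent": "Chrome"})

# Fetch the login page to pick up cookies and the CSRF token
login_page = session.get("https://github.com/login")
soup = BeautifulSoup(login_page.text, "html.parser")
# Assumption: the site embeds a CSRF token in a hidden input named authenticity_token
token = soup.find("input", {"name": "authenticity_token"})["value"]

# Assumption: the form posts to /session with these field names
resp = session.post("https://github.com/session", data={
    "login": "your_username",        # placeholder credentials
    "password": "your_password",
    "authenticity_token": token,
})
print(resp.status_code)  # the session object now carries the logged-in cookies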

BeautifulSoup: extract text from anchor tag

This will help:

from bs4 import BeautifulSoup

data = '''<div class="image">
<a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
</div>
<div class="image">
<a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /></a>
</div>'''

soup = BeautifulSoup(data, 'html.parser')
for div in soup.findAll('div', attrs={'class': 'image'}):
    print(div.find('a')['href'])
    print(div.find('a').contents[0])
    print(div.find('img')['src'])

If you are looking into Amazon products, you should be using the official API. There is at least one Python package … Read more
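If you prefer CSS selectors, here is a small variant of the same extraction using select() and get_text(), assuming the same markup as above:

from bs4 import BeautifulSoup

html = '<div class="image"><a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg"/></a></div>'
soup = BeautifulSoup(html, "html.parser")

for a in soup.select("div.image a"):
    print(a["href"])     # http://www.example.com/eg1
    print(a.get_text())  # Content1 -- get_text() gathers every text node inside the tag
    print(a.img["src"])  # http://image.example.com/img1.jpg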

Crawler vs scraper

A crawler gets web pages — i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore) it downloads whatever is linked to from the starting point(s). A scraper takes pages that have been downloaded or, in a more general sense, … Read more
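To make the distinction concrete, a toy sketch of each role; the function names, depth limit, and title extraction are illustrative only:

import urllib.parse
import requests
from bs4 import BeautifulSoup

def crawl(seed, max_depth=2):
    """Crawler: download pages and follow links up to max_depth from the seed."""
    seen, pages = set(), {}
    frontier = [(seed, 0)]
    while frontier:
        url, depth = frontier.pop()
        if url in seen or depth > max_depth or not url.startswith("http"):
            continue
        seen.add(url)
        resp = requests.get(url, timeout=10)
        pages[url] = resp.text  # the crawler's job ends at "page fetched"
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            frontier.append((urllib.parse.urljoin(url, a["href"]), depth + 1))
    return pages

def scrape(html):
    """Scraper: pull structured data out of a page someone already downloaded."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None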

XPath:: Get following Sibling

You should be looking for the second tr that has a td equal to ' Color Digest ', then you need to look at either the following sibling of the first td in that tr, or the second td. Try the following:

//tr[td='Color Digest'][2]/td/following-sibling::td[1]

or

//tr[td='Color Digest'][2]/td[2]

http://www.xpathtester.com/saved/76bb0bca-1896-43b7-8312-54f924a98a89
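A quick way to check both expressions locally is lxml against a toy table shaped like the one in the question (the cell contents are made up):

from lxml import etree

xml = """<table>
  <tr><td>Color Digest</td><td>first match</td></tr>
  <tr><td>Color Digest</td><td>second match</td></tr>
</table>"""

root = etree.fromstring(xml)
# Both expressions select the cell after the label in the second matching row
print(root.xpath("//tr[td='Color Digest'][2]/td/following-sibling::td[1]/text()"))  # ['second match']
print(root.xpath("//tr[td='Color Digest'][2]/td[2]/text()"))                        # ['second match']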

Error!: SQLSTATE[HY000] [1045] Access denied for user 'divattrend_liink'@'localhost' (using password: YES)