How can I reduce execution time when running Selenium with multiprocessing?
A lot of time in your solution is spent on launching the webdriver for each URL. You can reduce this time by launching the driver only once per thread:
    (... skipped for brevity ...)

    threadLocal = threading.local()

    def get_driver():
        # Reuse the driver already attached to this thread, if any.
        driver = getattr(threadLocal, 'driver', None)
        if driver is None:
            chromeOptions = webdriver.ChromeOptions()
            chromeOptions.add_argument("--headless")
            # Current Selenium releases take `options=`; `chrome_options=` is the deprecated spelling.
            driver = webdriver.Chrome(options=chromeOptions)
            setattr(threadLocal, 'driver', driver)
        return driver

    def get_title(url):
        driver = get_driver()
        driver.get(url)
        (...)

    (...)
On my system this reduces the time from 1m7s to just 24.895s, a ~63% reduction (roughly 2.7x faster). To test yourself, download the full script.
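If you want to convince yourself that the driver really is created only once per thread, here is a minimal sketch that runs without a browser. It assumes a hypothetical `FakeDriver` class standing in for `webdriver.Chrome` (it only records that it was constructed), and the `example.com` URLs and the "title" it returns are made up for illustration.

```python
import threading
from multiprocessing.pool import ThreadPool

thread_local = threading.local()
created = []                      # every driver ever constructed
created_lock = threading.Lock()


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome; it opens no browser."""

    def __init__(self):
        with created_lock:
            created.append(self)
        self.title = ""

    def get(self, url):
        # Pretend the page title is the last path segment of the URL.
        self.title = url.rstrip("/").rsplit("/", 1)[-1]


def get_driver():
    # Reuse this thread's driver if it already has one.
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = FakeDriver()
        thread_local.driver = driver
    return driver


def get_title(url):
    driver = get_driver()
    driver.get(url)
    return driver.title


urls = [f"https://example.com/page{i}" for i in range(20)]
with ThreadPool(4) as pool:
    titles = pool.map(get_title, urls)

print(len(titles), "pages fetched with", len(created), "drivers")
```

With 20 URLs and 4 worker threads, the number of drivers created can never exceed 4, however the work happens to be distributed.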
Note: ThreadPool uses threads, which are constrained by the Python GIL. That's ok if for the most part the task is I/O bound. Depending on the post-processing you do with the scraped results, you may want to use a multiprocessing.Pool instead. This launches parallel processes which as a group are not constrained by the GIL. The rest of the code stays the same.
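To illustrate how small the swap is, here is a rough sketch in which the pool class is passed in as a parameter. The `count_words` function is a hypothetical placeholder for whatever CPU-bound post-processing you run on the scraped text, and `process_all` is just an illustrative helper name, not part of the original script.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool


def count_words(text):
    # Hypothetical CPU-bound post-processing step on scraped text.
    return len(text.split())


def process_all(texts, pool_factory, workers=4):
    # pool_factory is either ThreadPool or multiprocessing.Pool;
    # both expose the same map() interface.
    with pool_factory(workers) as pool:
        return pool.map(count_words, texts)


if __name__ == "__main__":
    scraped = ["a b c", "selenium with python", ""]
    print(process_all(scraped, ThreadPool))            # threads
    print(process_all(scraped, multiprocessing.Pool))  # processes
```

Keep the worker function at module level and the pool creation under the `if __name__ == "__main__":` guard; process pools need both on platforms that start workers by spawning a fresh interpreter.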