Managing puppeteer for memory and performance

You probably want to create a pool of multiple Chromium instances with independent browsers. The advantage of that is, when one browser crashes all other jobs can keep running. The advantage of one browser (with multiple pages) is a slight memory and CPU advantage and the cookies are shared between your pages.

Pool of puppeteer instances

The library puppteer-cluster (disclaimer: I’m the author) creates a pool of browsers or pages for you. It takes care of the creation, error handling, browser restarting, etc. for you. So you can simply queue jobs/URLs and the library takes care of everything else.

Code sample

const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER, // use one browser per worker
        maxConcurrency: 4, // cluster with four workers
    });

    // Define a task to be executed for your data (put your "crawling code" in here)
    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        // ...
    });

    // Queue URLs when the cluster is created
    cluster.queue('http://www.google.com/');
    cluster.queue('http://www.wikipedia.org/');

    // Or queue URLs anytime later
    setTimeout(() => {
        cluster.queue('http://...');
    }, 1000);
})();

You can also queue functions directly in case you have different task to do. Normally you would close the cluster after you are finished via cluster.close(), but you are free to just let it stay open. You find another example for a cluster that gets data when a request comes in in the repository.

Leave a Comment