Puppeteer: Node.js Web Scraping Library for JavaScript

In the modern web landscape, traditional HTTP clients often fall short when dealing with JavaScript-heavy websites, single-page applications (SPAs), and dynamic content. Enter Puppeteer, a powerful Node.js library that provides a high-level API to control Chrome or Chromium browsers programmatically. Unlike conventional scraping tools that only handle static HTML, Puppeteer renders pages just like a real browser, making it perfect for scraping modern web applications. You can find the project on GitHub.

What is Puppeteer?

Puppeteer is a Node.js library developed by Google that provides a high-level API to control headless Chrome or Chromium browsers. It can also be configured to run in full (non-headless) mode for debugging purposes. Puppeteer allows you to automate form submission, UI testing, keyboard input, and most importantly for our purposes, web scraping of JavaScript-rendered content.

Built by the Chrome DevTools team, Puppeteer provides fine-grained control over the browser instance, enabling you to intercept network requests, inject JavaScript, take screenshots, generate PDFs, and extract data from complex web applications that traditional scrapers cannot handle.
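
As a quick taste of that control, here is a minimal sketch that injects a script into a page and renders it to a PDF using page.evaluate() and page.pdf(); example.com is just a placeholder target:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Inject JavaScript into the page and read the result back
  const linkCount = await page.evaluate(() => document.querySelectorAll('a').length);
  console.log(`Links on page: ${linkCount}`);

  // Render the page to a PDF document
  await page.pdf({ path: 'example.pdf', format: 'A4' });

  await browser.close();
})();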

Key Features

Full Browser Automation

  • JavaScript Execution : Full support for JavaScript-heavy websites
  • DOM Manipulation : Interact with elements, click buttons, fill forms
  • Network Interception : Monitor and modify network requests
  • Cookie Management : Automatic cookie handling and session management
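
A minimal sketch of the automation features above: filling and submitting a login form, then inspecting the cookies Puppeteer handled automatically. The #username, #password, and #submit selectors are hypothetical placeholders.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com/login', { waitUntil: 'networkidle2' });

  // Fill the form fields and submit (selectors are placeholders)
  await page.waitForSelector('#username');
  await page.type('#username', 'my-user');
  await page.type('#password', 'my-password');
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click('#submit'),
  ]);

  // Cookies set during login are kept on the page and can be inspected
  const cookies = await page.cookies();
  console.log(cookies.map(c => c.name));

  await browser.close();
})();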

Advanced Scraping Capabilities

  • Dynamic Content : Handle infinite scroll, lazy loading, and AJAX requests
  • Screenshots & PDFs : Generate visual captures and documents
  • Mobile Emulation : Simulate mobile devices and viewports
  • Geolocation : Simulate different geographic locations
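
Mobile emulation and geolocation spoofing can be combined in a few lines. This sketch assumes a recent Puppeteer version that exports KnownDevices; the target URL and coordinates are placeholders:

const puppeteer = require('puppeteer');
const { KnownDevices } = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });

  // Allow the geolocation permission for the target origin
  const context = browser.defaultBrowserContext();
  await context.overridePermissions('https://example.com', ['geolocation']);

  const page = await browser.newPage();

  // Emulate an iPhone viewport and user agent
  await page.emulate(KnownDevices['iPhone X']);

  // Pretend the browser is located in Paris
  await page.setGeolocation({ latitude: 48.8566, longitude: 2.3522 });

  await page.goto('https://example.com');
  await page.screenshot({ path: 'mobile.png' });

  await browser.close();
})();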

Performance & Control

  • Headless Mode : Run browsers without UI for better performance
  • Resource Blocking : Block images, CSS, fonts to improve speed
  • Request Interception : Modify requests on the fly
  • Concurrent Execution : Run multiple browser instances simultaneously
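
To illustrate concurrency, here is a minimal sketch that scrapes several URLs in parallel with one page per URL inside a single browser instance; launching separate browser instances works the same way but costs more memory:

const puppeteer = require('puppeteer');

async function scrapeTitles(urls) {
  const browser = await puppeteer.launch({ headless: true });

  // Open one page per URL and run them concurrently
  const titles = await Promise.all(urls.map(async (url) => {
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      return { url, title: await page.title() };
    } finally {
      await page.close();
    }
  }));

  await browser.close();
  return titles;
}

scrapeTitles(['https://example.com', 'https://example.org'])
  .then(results => console.log(results));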

Use Cases

SPA and React/Vue/Angular Applications

Modern web applications often load content dynamically through JavaScript. Puppeteer can:

  • Wait for specific elements to load
  • Handle client-side routing
  • Interact with complex UI components
  • Scrape data that only appears after user interactions
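
A minimal sketch of these patterns, assuming a hypothetical SPA where results only render after client-side routing completes and a .results-list element appears:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com/app', { waitUntil: 'networkidle2' });

  // Trigger client-side routing (no full page load happens here)
  await page.click('a[href="/app/search"]');

  // Wait until the SPA has actually rendered the results
  await page.waitForSelector('.results-list', { visible: true });
  await page.waitForFunction(
    () => document.querySelectorAll('.results-list li').length > 0
  );

  const items = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.results-list li'), li => li.textContent.trim())
  );
  console.log(items);

  await browser.close();
})();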

E-commerce Price Monitoring

  • Navigate through product catalogs
  • Handle lazy-loaded images and reviews
  • Automate search and filtering
  • Extract pricing information from JavaScript-rendered pages

Social Media and News Scraping

  • Scroll through infinite feeds
  • Handle authentication flows
  • Extract comments and interactions
  • Monitor real-time content updates
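
Infinite feeds can be handled by scrolling in a loop until no new content loads. A minimal sketch, assuming the feed items use a hypothetical .feed-item selector:

const puppeteer = require('puppeteer');

async function scrapeFeed(url, maxScrolls = 10) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  for (let i = 0; i < maxScrolls; i++) {
    // Scroll to the bottom and give the feed time to load more items
    const previousHeight = await page.evaluate(() => document.body.scrollHeight);
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise(resolve => setTimeout(resolve, 1500));

    const newHeight = await page.evaluate(() => document.body.scrollHeight);
    if (newHeight === previousHeight) break; // nothing new was loaded
  }

  // Collect whatever items are now in the DOM (selector is a placeholder)
  const posts = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.feed-item'), el => el.textContent.trim())
  );

  await browser.close();
  return posts;
}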

Testing and Quality Assurance

  • Automated UI testing
  • Performance monitoring
  • Screenshot comparisons (at Piloterr, we offer a tool called Capturekit.dev for screenshot APIs)
  • Cross-browser compatibility testing

Getting Started

Installation

# Create a new project
mkdir puppeteer-scraping
cd puppeteer-scraping
npm init -y

# Install Puppeteer
npm install puppeteer

# Optional: Install additional utilities
npm install fs-extra lodash

Basic Usage

Here's a simple example to get you started:

const puppeteer = require('puppeteer');

(async () => {
  // Launch browser
  const browser = await puppeteer.launch({ 
    headless: false, // Set to true for production
    defaultViewport: null 
  });
  
  // Create a new page
  const page = await browser.newPage();
  
  // Navigate to website
  await page.goto('https://example.com');
  
  // Extract data
  const title = await page.title();
  console.log('Page title:', title);
  
  // Take screenshot
  await page.screenshot({ path: 'example.png' });
  
  // Close browser
  await browser.close();
})();

Advanced Examples

E-commerce Product Scraping

const puppeteer = require('puppeteer');
const fs = require('fs-extra');

class EcommerceScraper {
  constructor() {
    this.browser = null;
    this.page = null;
  }

  async init() {
    this.browser = await puppeteer.launch({
      headless: true,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--disable-gpu'
      ]
    });
    
    this.page = await this.browser.newPage();
    
    const userAgent = `Mozilla/5.0 (Windows NT 10.0; Win64; x64) ` +
                      `AppleWebKit/537.36 (KHTML, like Gecko) ` +
                      `Chrome/91.0.4472.124 Safari/537.36`;

    // Set user agent to avoid detection
    await this.page.setUserAgent(userAgent);
    
    // Block images to improve performance
    await this.page.setRequestInterception(true);
    this.page.on('request', (request) => {
      if (request.resourceType() === 'image') {
        request.abort();
      } else {
        request.continue();
      }
    });
  }

  async scrapeProduct(productUrl) {
    try {
      await this.page.goto(productUrl, { waitUntil: 'networkidle2' });
      
      // Wait for product details to load
      await this.page.waitForSelector('.product-title', { timeout: 10000 });
      
      const productData = await this.page.evaluate(() => {
        const title = document.querySelector('.product-title')?.textContent?.trim();
        const price = document.querySelector('.price')?.textContent?.trim();
        const description = document.querySelector('.product-description')?.textContent?.trim();
        const availability = document.querySelector('.availability')?.textContent?.trim();
        
        // Extract images
        const images = Array.from(document.querySelectorAll('.product-images img'))
          .map(img => img.src)
          .filter(src => src && src.startsWith('http'));
        
        // Extract specifications
        const specs = {};
        document.querySelectorAll('.specifications tr').forEach(row => {
          const key = row.querySelector('td:first-child')?.textContent?.trim();
          const value = row.querySelector('td:last-child')?.textContent?.trim();
          if (key && value) {
            specs[key] = value;
          }
        });
        
        return {
          title,
          price,
          description,
          availability,
          images,
          specifications: specs,
          url: window.location.href,
          scrapedAt: new Date().toISOString()
        };
      });
      
      return productData;
    } catch (error) {
      console.error(`Error scraping product: ${error.message}`);
      return null;
    }
  }

  async scrapeCategory(categoryUrl, maxProducts = 50) {
    const products = [];
    
    await this.page.goto(categoryUrl, { waitUntil: 'networkidle2' });
    
    let currentPage = 1;
    
    while (products.length < maxProducts) {
      console.log(`Scraping page ${currentPage}...`);
      
      // Wait for products to load
      await this.page.waitForSelector('.product-item', { timeout: 10000 });
      
      // Extract product links
      const productLinks = await this.page.evaluate(() => {
        return Array.from(document.querySelectorAll('.product-item a'))
          .map(a => a.href)
          .filter(href => href && href.includes('/product/'));
      });
      
      // Scrape each product
      for (const link of productLinks) {
        if (products.length >= maxProducts) break;
        
        console.log(`Scraping product: ${link}`);
        const productData = await this.scrapeProduct(link);
        
        if (productData) {
          products.push(productData);
        }
        
        // Add delay to avoid rate limiting
        await this.page.waitForTimeout(1000 + Math.random() * 2000);
      }
      
      // Try to go to next page
      const nextButton = await this.page.$('.pagination .next:not(.disabled)');
      if (nextButton) {
        await nextButton.click();
        await this.page.waitForNavigation({ waitUntil: 'networkidle2' });
        currentPage++;
      } else {
        break;
      }
    }
    
    return products;
  }

  async close() {
    if (this.browser) {
      await this.browser.close();
    }
  }
}

// Usage
async function main() {
  const scraper = new EcommerceScraper();
  
  try {
    await scraper.init();
    
    // Scrape a single product
    const product = await scraper.scrapeProduct('https://example-store.com/product/123');
    console.log('Product data:', product);
    
    // Scrape a category
    const products = await scraper.scrapeCategory('https://example-store.com/category/electronics', 20);
    
    // Save to file
    await fs.writeJson('products.json', products, { spaces: 2 });
    console.log(`Scraped ${products.length} products`);
    
  } catch (error) {
    console.error('Scraping failed:', error);
  } finally {
    await scraper.close();
  }
}

main();

Best Practices

Resource management

// Always close browsers and pages
async function safeScraping() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  
  try {
    // Your scraping logic here
    await page.goto('https://example.com');
    // ... scraping code
  } finally {
    await page.close();
    await browser.close();
  }
}

Rate limiting

async function rateLimitedScraping(urls, delayMs = 1000) {
  const browser = await puppeteer.launch();
  const results = [];
  
  for (const url of urls) {
    const page = await browser.newPage();
    
    try {
      await page.goto(url);
      // Extract data
      const data = await page.evaluate(() => {
        return { title: document.title };
      });
      results.push(data);
    } finally {
      await page.close();
    }
    
    // Wait before next request
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  
  await browser.close();
  return results;
}

Memory management

async function memoryEfficientScraping(urls) {
  const browser = await puppeteer.launch({
    // --max-old-space-size is a V8 flag; pass it to Chromium's JS engine via --js-flags
    args: ['--js-flags=--max-old-space-size=512']
  });
  
  // Reuse a single page instead of creating a new one per URL
  const page = await browser.newPage();
  const results = [];
  
  for (const url of urls) {
    await page.goto(url);
    // Extract data
    const data = await page.evaluate(() => {
      return { title: document.title };
    });
    results.push(data);
    
    // Clear any large globals your own in-page scripts may have created
    await page.evaluate(() => {
      window.largeData = null; // placeholder for app-specific cleanup
    });
  }
  
  await browser.close();
  return results;
}

Comparison with Other Scraping Tools

| Feature | Puppeteer | Playwright | Selenium | Cheerio |
| --- | --- | --- | --- | --- |
| JavaScript Execution | Yes | Yes | Yes | No |
| Cross-Browser Support | Chrome only | All major browsers | All major browsers | None (HTML parser) |
| Performance | High | High | Medium | Very High |
| API Simplicity | Excellent | Excellent | Complex | Simple |
| Resource Usage | Medium | Medium | High | Low |
| Dynamic Content | Yes | Yes | Yes | No |
| Learning Curve | Easy | Easy | Steep | Very Easy |

Troubleshooting

Memory leaks

// Problem: Pages not being closed properly
// Solution: Always close pages and use try-finally blocks

async function properCleanup() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  
  try {
    await page.goto('https://example.com');
    // ... scraping logic
  } finally {
    await page.close(); // Always close pages
    await browser.close(); // Always close browser
  }
}

Timeouts

// Problem: Elements taking too long to load
// Solution: Increase timeouts and use proper waiting strategies

await page.waitForSelector('.dynamic-content', { 
  timeout: 30000,
  visible: true 
});

// Or wait for network to be idle
await page.goto(url, { 
  waitUntil: 'networkidle2',
  timeout: 30000 
});

Detection avoidance

Alternatively, you can use Piloterr for your scraping project: its APIs help you bypass the toughest anti-bot systems on the market.

// Problem: Being detected as a bot
// Solution: Use stealth techniques

const puppeteer = require('puppeteer');

async function stealthBrowser() {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-accelerated-2d-canvas',
      '--disable-gpu',
      '--disable-dev-tools',
      '--no-first-run',
      '--no-zygote'
    ]
  });
  
  const page = await browser.newPage();
  
  const userAgent = `Mozilla/5.0 (Windows NT 10.0; Win64; x64) ` +
                   `AppleWebKit/537.36 (KHTML, like Gecko) ` +
                   `Chrome/91.0.4472.124 Safari/537.36`;
          
  // Set realistic user agent
  await page.setUserAgent(userAgent);
  
  // Set viewport
  await page.setViewport({ width: 1920, height: 1080 });
  
  // Remove webdriver property
  await page.evaluateOnNewDocument(() => {
    delete navigator.__proto__.webdriver;
  });
  
  return { browser, page };
}

Good Dockerfile configuration

FROM node:16-alpine

# Install dependencies for Puppeteer
RUN apk add --no-cache \
    chromium \
    nss \
    freetype \
    freetype-dev \
    harfbuzz \
    ca-certificates \
    ttf-freefont

# Set environment variables
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser

WORKDIR /app

# Copy package files
COPY package*.json ./

# Install dependencies
RUN npm ci --only=production

# Copy source code
COPY . .

# Run as non-root user
USER node

CMD ["node", "index.js"]

Conclusion

Puppeteer has revolutionized web scraping by providing developers with a powerful, browser-based approach to data extraction. Its ability to handle JavaScript-heavy websites, dynamic content, and complex user interactions makes it an indispensable tool for modern web scraping projects.

The library's intuitive API, excellent performance, and comprehensive feature set enable developers to build sophisticated scraping solutions that can handle the most challenging modern web applications. From e-commerce monitoring to social media data collection, Puppeteer provides the tools needed to extract valuable insights from today's dynamic web.

While Puppeteer does consume more resources than traditional HTTP clients, the trade-off is worthwhile for applications that require JavaScript execution and authentic browser behavior. Its ability to bypass anti-bot measures and handle complex authentication flows makes it particularly valuable for enterprise-level scraping projects.

As web applications continue to become more JavaScript-dependent and sophisticated, tools like Puppeteer will become increasingly essential for successful web scraping initiatives. The combination of Google's backing, active development, and strong community support ensures that Puppeteer will remain a leading choice for browser automation and web scraping.

Resources