Tutorial · 9 min read

Web Scraping with Node.js: Puppeteer, Playwright, or API?

Compare Node.js web scraping approaches: Puppeteer, Playwright, and scraping APIs. Learn when to use each with practical code examples.

FineData Team


Node.js is one of the most popular platforms for web scraping. Its async-first architecture makes it naturally suited for I/O-heavy tasks like fetching thousands of web pages. But the ecosystem offers several fundamentally different approaches, and choosing the wrong one for your use case can waste weeks of development time.

This guide compares the three main approaches: Puppeteer, Playwright, and scraping APIs. We’ll cover what each is good at, where each falls short, and show real code so you can make an informed decision.

Approach 1: Puppeteer

Puppeteer is Google’s official Node.js library for controlling Chrome. It launches a headless Chrome instance and gives you full programmatic control — navigate to pages, click buttons, fill forms, take screenshots, and extract content.

Basic Puppeteer Scraping

const puppeteer = require('puppeteer');

async function scrapeWithPuppeteer(url) {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  try {
    const page = await browser.newPage();

    // Set a realistic viewport and user agent
    await page.setViewport({ width: 1920, height: 1080 });
    await page.setUserAgent(
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    );

    await page.goto(url, { waitUntil: 'networkidle2' });

    // Extract data from the rendered page
    return await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.product-card')).map(card => ({
        title: card.querySelector('.title')?.textContent?.trim(),
        price: card.querySelector('.price')?.textContent?.trim(),
        url: card.querySelector('a')?.href,
      }));
    });
  } finally {
    // Always close the browser, even if navigation or extraction throws
    await browser.close();
  }
}

Puppeteer Strengths

  • Full browser control — Click, type, scroll, screenshot, PDF generation
  • JavaScript rendering — Handles React, Vue, Angular sites natively
  • Google-backed — Stable, well-maintained, excellent Chrome integration
  • Rich ecosystem — Large community, lots of plugins and examples

Puppeteer Weaknesses

  • Resource hungry — Each Chrome instance uses 200-500MB RAM
  • Detectable — Sites can detect Puppeteer via navigator.webdriver, missing plugins, and browser fingerprint inconsistencies
  • Chrome only — No Firefox or Safari support
  • Scaling is painful — Running 50 concurrent browsers on a server requires careful memory management
  • No built-in anti-bot bypass — You need to add stealth plugins, proxy rotation, and CAPTCHA solving yourself

Approach 2: Playwright

Playwright is Microsoft’s answer to Puppeteer. It supports Chromium, Firefox, and WebKit (the engine behind Safari), has better auto-waiting, and includes features that Puppeteer lacks.

Basic Playwright Scraping

const { chromium } = require('playwright');

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch({ headless: true });

  try {
    const context = await browser.newContext({
      viewport: { width: 1920, height: 1080 },
      userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    });

    const page = await context.newPage();

    await page.goto(url, { waitUntil: 'networkidle' });

    // Playwright has better selectors and auto-waiting
    return await page.$$eval('.product-card', cards =>
      cards.map(card => ({
        title: card.querySelector('.title')?.textContent?.trim(),
        price: card.querySelector('.price')?.textContent?.trim(),
        url: card.querySelector('a')?.href,
      }))
    );
  } finally {
    // Close the browser even if navigation or extraction fails
    await browser.close();
  }
}

What Playwright Improves Over Puppeteer

  • Multi-browser support — Chromium, Firefox, and WebKit (Safari)
  • Better auto-waiting — Actions automatically wait for elements to be visible and stable
  • Browser contexts — Lightweight isolation without full browser overhead
  • Network interception — Easier API mocking and request modification
  • Trace viewer — Built-in debugging tool for recording and replaying tests

Where Playwright Still Struggles

The core limitations are the same as Puppeteer because they share the same architecture — running a real browser:

// The anti-bot problem persists with Playwright too
const { chromium } = require('playwright');

async function scrapeProtectedSite(url) {
  const browser = await chromium.launch({
    headless: true,
    // Even with these flags, sophisticated anti-bot systems
    // detect Playwright through:
    // - WebDriver flag in navigator
    // - Differences in browser plugin list
    // - Missing GPU/WebGL fingerprint details
    // - TLS fingerprint mismatches
  });

  const page = await browser.newPage();
  await page.goto(url);

  // Often results in a CAPTCHA or block page
  const content = await page.content();

  if (content.includes('captcha') || content.includes('blocked')) {
    console.log('Detected and blocked');
    // Now what? You need to add:
    // 1. Stealth plugins (playwright-extra)
    // 2. Proxy rotation service
    // 3. CAPTCHA solving service
    // 4. Custom fingerprint spoofing
  }

  await browser.close();
}

Community projects like playwright-extra and puppeteer-extra-plugin-stealth help, but they’re in an arms race with anti-bot systems and often lag behind.

Approach 3: Scraping API

A scraping API offloads the browser execution, anti-bot bypass, proxy rotation, and CAPTCHA solving to a managed service. You send a URL, you get back rendered HTML.

Basic API Scraping with Node.js

const axios = require('axios');
const cheerio = require('cheerio');

const FINEDATA_API_KEY = 'fd_your_api_key';

async function scrapeWithApi(url) {
  const response = await axios.post(
    'https://api.finedata.ai/api/v1/scrape',
    {
      url,
      use_js_render: true,
      tls_profile: 'chrome124',
      use_residential: true,
      timeout: 30
    },
    {
      headers: {
        'x-api-key': FINEDATA_API_KEY,
        'Content-Type': 'application/json'
      }
    }
  );

  const $ = cheerio.load(response.data.body);

  const products = [];
  $('.product-card').each((i, card) => {
    products.push({
      title: $(card).find('.title').text().trim(),
      price: $(card).find('.price').text().trim(),
      url: $(card).find('a').attr('href'),
    });
  });

  return products;
}

No browser to launch, no memory to manage, no stealth plugins to configure. The HTML comes back fully rendered with anti-bot protections handled.

Scaling with the API

Where the API approach really shines is concurrent scraping. With Puppeteer/Playwright, each concurrent page needs a browser tab (and the RAM to go with it). With an API, concurrency is just HTTP requests:

const axios = require('axios');
const cheerio = require('cheerio');
const pLimit = require('p-limit'); // note: p-limit v4+ is ESM-only; use v3 or earlier with require()

const FINEDATA_API_KEY = 'fd_your_api_key';
const limit = pLimit(10); // 10 concurrent requests

async function scrapeUrl(url) {
  const response = await axios.post(
    'https://api.finedata.ai/api/v1/scrape',
    {
      url,
      use_js_render: true,
      tls_profile: 'chrome124',
      timeout: 30
    },
    {
      headers: {
        'x-api-key': FINEDATA_API_KEY,
        'Content-Type': 'application/json'
      }
    }
  );

  return response.data.body;
}

async function scrapeMany(urls) {
  const results = await Promise.all(
    urls.map(url => limit(() => scrapeUrl(url)))
  );

  return results.map((html, i) => {
    const $ = cheerio.load(html);
    return {
      url: urls[i],
      title: $('h1').text().trim(),
      products: $('.product-card').length,
    };
  });
}

// Scrape 100 URLs with 10 concurrency — uses ~50MB RAM
const urls = Array.from({ length: 100 }, (_, i) =>
  `https://store.example.com/category?page=${i + 1}`
);

scrapeMany(urls).then(results => {
  console.log(`Scraped ${results.length} pages`);
});

Doing this with Puppeteer would require either a beefy server (10 Chrome instances at 300MB each = 3GB RAM for just the browsers) or a distributed system with message queues and worker pools.
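If you'd rather not depend on p-limit at all (its newer releases are ESM-only), the limiter itself is small enough to hand-roll. A minimal sketch with the same shape as p-limit's API:

```javascript
// Minimal concurrency limiter: at most `max` tasks run at once.
// Queued tasks start as running ones finish; results resolve in call order
// when combined with Promise.all.
function createLimit(max) {
  let active = 0;
  const queue = [];

  const next = () => {
    if (active >= max || queue.length === 0) return;
    active++;
    const { fn, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(fn)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next(); // a slot freed up; start the next queued task
      });
  };

  return fn =>
    new Promise((resolve, reject) => {
      queue.push({ fn, resolve, reject });
      next();
    });
}
```

Usage mirrors p-limit: `const limit = createLimit(10);` then `limit(() => scrapeUrl(url))`.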

Batch Scraping

For even larger workloads, FineData’s batch endpoint processes multiple URLs in a single request:

async function batchScrape(urls) {
  const response = await axios.post(
    'https://api.finedata.ai/api/v1/batch',
    {
      urls,
      use_js_render: true,
      use_residential: true
    },
    {
      headers: {
        'x-api-key': FINEDATA_API_KEY,
        'Content-Type': 'application/json'
      }
    }
  );

  const batchId = response.data.batch_id;

  // Poll for results (with a cap so a stuck batch can't loop forever)
  const maxAttempts = 60; // ~5 minutes at 5s per poll
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await axios.get(
      `https://api.finedata.ai/api/v1/batch/${batchId}`,
      { headers: { 'x-api-key': FINEDATA_API_KEY } }
    );

    if (status.data.status === 'completed') {
      return status.data.results;
    }

    await new Promise(resolve => setTimeout(resolve, 5000));
  }

  throw new Error(`Batch ${batchId} did not complete in time`);
}
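Polling at a fixed five-second interval works, but for long-running batches an exponential backoff is gentler on the API and fails fast when something is stuck. A generic sketch (the helper name is ours, not part of the FineData API):

```javascript
// Retry `check` until it returns a truthy value, doubling the wait each
// attempt (capped at 30s), and give up after `maxAttempts` tries.
async function pollUntil(check, { maxAttempts = 10, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await check();
    if (result) return result;

    const delay = Math.min(baseDelayMs * 2 ** attempt, 30000);
    await new Promise(resolve => setTimeout(resolve, delay));
  }
  throw new Error(`Gave up after ${maxAttempts} attempts`);
}
```

You would call it with a closure that fetches the batch status and returns the results once `status === 'completed'`, and falsy otherwise.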

Head-to-Head Comparison

| Factor | Puppeteer | Playwright | FineData API |
|---|---|---|---|
| Language | Node.js | Node.js, Python, Java, .NET | Any (REST API) |
| Browser support | Chrome only | Chrome, Firefox, Safari | N/A (server-side) |
| JS rendering | Yes | Yes | Yes |
| RAM per page | 200-500MB | 150-300MB | ~0 (server-side) |
| Anti-bot bypass | Poor (detectable) | Poor (detectable) | Built-in |
| CAPTCHA solving | Manual integration | Manual integration | Built-in |
| Proxy support | Manual setup | Manual setup | Built-in |
| Max concurrency | 5-20 (local) | 5-30 (local) | 100+ |
| Setup time | 15-30 min | 10-20 min | 2 min |
| Maintenance | High | Moderate | Low |
| Cost | Infrastructure | Infrastructure | Per-request |
| Page interaction | Full | Full | None |
| Screenshots/PDFs | Yes | Yes | No |

When to Use Each

Choose Puppeteer when:

  • You need to interact with pages — fill forms, click buttons, navigate multi-step flows
  • You’re building end-to-end tests alongside scraping
  • You need screenshots or PDFs
  • You’re scraping a small number of unprotected sites
  • You want to stay within the Google/Chrome ecosystem

Choose Playwright when:

  • All of the Puppeteer use cases above, plus:
  • You need cross-browser testing (Firefox, Safari)
  • You want better auto-waiting and debugging tools
  • You prefer Playwright’s API, which is generally considered more ergonomic than Puppeteer’s

Choose a Scraping API when:

  • Sites have anti-bot protection (Cloudflare, CAPTCHAs, fingerprinting)
  • You need to scrape at scale (hundreds to millions of pages)
  • You don’t want to manage browser infrastructure
  • Reliability matters — your data pipeline can’t tolerate frequent failures
  • You’re working in a team and want to minimize ops burden

The Hybrid Pattern

Many production systems combine approaches:

async function smartScrape(url, options = {}) {
  // Use Playwright for sites that need interaction
  if (options.needsInteraction) {
    return scrapeWithPlaywright(url, options.steps);
  }

  // Use API for everything else (handles anti-bot, JS, scale)
  return scrapeWithApi(url);
}

Use Playwright for the 10% of cases that need browser interaction (login flows, form submissions, file downloads) and the API for the 90% that just need rendered HTML.
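The same idea generalizes into a small escalation helper that tries strategies in order, cheapest first, and falls through on failure. The names here are hypothetical, sketching the pattern rather than any specific API:

```javascript
// Try scraping strategies in order until one succeeds.
// `strategies` is an array of async functions taking a URL, cheapest first
// (e.g. plain fetch, then the API with JS rendering, then Playwright).
async function scrapeWithFallback(url, strategies) {
  let lastError;
  for (const strategy of strategies) {
    try {
      return await strategy(url);
    } catch (err) {
      lastError = err; // remember the failure and escalate to the next tier
    }
  }
  throw lastError ?? new Error('No strategies provided');
}
```

This keeps per-page costs down: most pages succeed on the cheap path, and only the hard ones pay for a full browser.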

Performance Benchmarks

Here’s what scraping 100 product pages looks like across approaches (tested on a 4-core, 8GB RAM server):

| Metric | Puppeteer | Playwright | FineData API |
|---|---|---|---|
| Total time | 8-12 min | 6-10 min | 2-4 min |
| Peak RAM | 3.2 GB | 2.4 GB | 120 MB |
| Success rate (protected) | 35% | 40% | 92% |
| Success rate (unprotected) | 95% | 97% | 99% |
| CPU usage | 60-80% | 50-70% | 5-10% |

The API approach is significantly faster and lighter because the browser execution happens on FineData’s infrastructure, and your Node.js process just handles HTTP requests and HTML parsing.

Key Takeaways

  • Puppeteer and Playwright are powerful tools for browser automation, but they’re resource-intensive, detectable by anti-bot systems, and hard to scale past a few dozen concurrent pages.
  • Scraping APIs trade per-request costs for zero infrastructure overhead, built-in anti-bot bypass, and effortless concurrency.
  • For page interaction (forms, clicks, navigation flows), use Puppeteer or Playwright. For data extraction at scale, use an API.
  • The hybrid approach combines the best of both: Playwright for interaction-heavy flows, API for everything else.
  • At scale, the API approach uses a fraction of the RAM and CPU, letting you run on a smaller (cheaper) server.

Ready to try the API approach? Check out our getting started guide or dive into our documentation. For Python-focused scraping, see our comparison of Requests + BeautifulSoup vs API.

#nodejs #puppeteer #playwright #javascript #comparison #tutorial
