Web Scraping with Node.js: Puppeteer, Playwright, or API?
Node.js is one of the most popular platforms for web scraping. Its async-first architecture makes it naturally suited for I/O-heavy tasks like fetching thousands of web pages. But the ecosystem offers several fundamentally different approaches, and choosing the wrong one for your use case can waste weeks of development time.
This guide compares the three main approaches: Puppeteer, Playwright, and scraping APIs. We’ll cover what each is good at, where each falls short, and show real code so you can make an informed decision.
Approach 1: Puppeteer
Puppeteer is Google’s official Node.js library for controlling Chrome. It launches a headless Chrome instance and gives you full programmatic control — navigate to pages, click buttons, fill forms, take screenshots, and extract content.
Basic Puppeteer Scraping
```javascript
const puppeteer = require('puppeteer');

async function scrapeWithPuppeteer(url) {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();

  // Set a realistic viewport and user agent
  await page.setViewport({ width: 1920, height: 1080 });
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
  );

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Extract data from the rendered page
  const products = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.product-card')).map(card => ({
      title: card.querySelector('.title')?.textContent?.trim(),
      price: card.querySelector('.price')?.textContent?.trim(),
      url: card.querySelector('a')?.href,
    }));
  });

  await browser.close();
  return products;
}
```
Puppeteer Strengths
- Full browser control — Click, type, scroll, screenshot, PDF generation
- JavaScript rendering — Handles React, Vue, Angular sites natively
- Google-backed — Stable, well-maintained, excellent Chrome integration
- Rich ecosystem — Large community, lots of plugins and examples
Puppeteer Weaknesses
- Resource hungry — Each Chrome instance uses 200-500MB RAM
- Detectable — Sites can detect Puppeteer via navigator.webdriver, missing plugins, and browser fingerprint inconsistencies
- Chrome only — No Firefox or Safari support
- Scaling is painful — Running 50 concurrent browsers on a server requires careful memory management
- No built-in anti-bot bypass — You need to add stealth plugins, proxy rotation, and CAPTCHA solving yourself
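To make the last point concrete, here's a minimal sketch of the proxy rotation you'd otherwise build yourself: a round-robin helper that hands each browser launch a different proxy. The proxy URLs are placeholders, not real endpoints.

```javascript
// Round-robin proxy rotator — a minimal sketch.
// The proxy URLs below are placeholders, not real endpoints.
function createProxyRotator(proxies) {
  let i = 0;
  return () => proxies[i++ % proxies.length];
}

const nextProxy = createProxyRotator([
  'http://proxy1.example.com:8080',
  'http://proxy2.example.com:8080',
]);

// Each new browser launch can then use the next proxy in the pool:
// puppeteer.launch({ args: [`--proxy-server=${nextProxy()}`] });
```

This handles rotation only; stealth plugins and CAPTCHA solving are separate integrations on top.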
Approach 2: Playwright
Playwright is Microsoft’s answer to Puppeteer. It supports Chrome, Firefox, and Safari, has better auto-waiting, and includes features that Puppeteer lacks.
Basic Playwright Scraping
```javascript
const { chromium } = require('playwright');

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    viewport: { width: 1920, height: 1080 },
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  });
  const page = await context.newPage();

  await page.goto(url, { waitUntil: 'networkidle' });

  // Playwright has better selectors and auto-waiting
  const products = await page.$$eval('.product-card', cards =>
    cards.map(card => ({
      title: card.querySelector('.title')?.textContent?.trim(),
      price: card.querySelector('.price')?.textContent?.trim(),
      url: card.querySelector('a')?.href,
    }))
  );

  await browser.close();
  return products;
}
```
What Playwright Improves Over Puppeteer
- Multi-browser support — Chromium, Firefox, and WebKit (Safari)
- Better auto-waiting — Actions automatically wait for elements to be visible and stable
- Browser contexts — Lightweight isolation without full browser overhead
- Network interception — Easier API mocking and request modification
- Trace viewer — Built-in debugging tool for recording and replaying tests
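The network interception point is worth a concrete example: blocking heavy resources (images, fonts, media) speeds up scraping noticeably. This is a sketch assuming `playwright` is installed; `shouldBlock` and `scrapeLight` are illustrative names, not library APIs.

```javascript
// Sketch: skip heavy resources that rarely matter for data extraction.
function shouldBlock(resourceType) {
  return ['image', 'font', 'media'].includes(resourceType);
}

async function scrapeLight(url) {
  // Required inside the function so the helper above stands alone.
  const { chromium } = require('playwright');
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Abort blocked resource types, let everything else through.
  await page.route('**/*', route =>
    shouldBlock(route.request().resourceType()) ? route.abort() : route.continue()
  );

  await page.goto(url, { waitUntil: 'domcontentloaded' });
  const html = await page.content();
  await browser.close();
  return html;
}
```

The same `page.route` hook also supports request modification and response mocking.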
Where Playwright Still Struggles
The core limitations are the same as Puppeteer because they share the same architecture — running a real browser:
```javascript
// The anti-bot problem persists with Playwright too
const { chromium } = require('playwright');

async function scrapeProtectedSite(url) {
  const browser = await chromium.launch({
    headless: true,
    // Even with these flags, sophisticated anti-bot systems
    // detect Playwright through:
    // - WebDriver flag in navigator
    // - Differences in browser plugin list
    // - Missing GPU/WebGL fingerprint details
    // - TLS fingerprint mismatches
  });
  const page = await browser.newPage();
  await page.goto(url);

  // Often results in a CAPTCHA or block page
  const content = await page.content();
  if (content.includes('captcha') || content.includes('blocked')) {
    console.log('Detected and blocked');
    // Now what? You need to add:
    // 1. Stealth plugins (playwright-extra)
    // 2. Proxy rotation service
    // 3. CAPTCHA solving service
    // 4. Custom fingerprint spoofing
  }

  await browser.close();
  return content;
}
```
Community projects like playwright-extra and puppeteer-extra-plugin-stealth help, but they’re in an arms race with anti-bot systems and often lag behind.
Approach 3: Scraping API
A scraping API offloads the browser execution, anti-bot bypass, proxy rotation, and CAPTCHA solving to a managed service. You send a URL, you get back rendered HTML.
Basic API Scraping with Node.js
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

const FINEDATA_API_KEY = 'fd_your_api_key';

async function scrapeWithApi(url) {
  const response = await axios.post(
    'https://api.finedata.ai/api/v1/scrape',
    {
      url,
      use_js_render: true,
      tls_profile: 'chrome124',
      use_residential: true,
      timeout: 30
    },
    {
      headers: {
        'x-api-key': FINEDATA_API_KEY,
        'Content-Type': 'application/json'
      }
    }
  );

  const $ = cheerio.load(response.data.body);
  const products = [];
  $('.product-card').each((i, card) => {
    products.push({
      title: $(card).find('.title').text().trim(),
      price: $(card).find('.price').text().trim(),
      url: $(card).find('a').attr('href'),
    });
  });
  return products;
}
```
No browser to launch, no memory to manage, no stealth plugins to configure. The HTML comes back fully rendered with anti-bot protections handled.
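Transient failures (network timeouts, rate limits) can still occur, so production callers usually wrap the request in a retry helper. Here's a minimal sketch with exponential backoff; the retry count and delays are illustrative defaults, not FineData recommendations.

```javascript
// Sketch: retry a flaky async call with exponential backoff.
// Delays double each attempt: 500ms, 1000ms, 2000ms, ...
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

Usage: `const products = await withRetry(() => scrapeWithApi(url));`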
Scaling with the API
Where the API approach really shines is concurrent scraping. With Puppeteer/Playwright, each concurrent page needs a browser tab (and the RAM to go with it). With an API, concurrency is just HTTP requests:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');
const pLimit = require('p-limit'); // p-limit v3 (CommonJS); v4+ is ESM-only

const FINEDATA_API_KEY = 'fd_your_api_key';
const limit = pLimit(10); // 10 concurrent requests

async function scrapeUrl(url) {
  const response = await axios.post(
    'https://api.finedata.ai/api/v1/scrape',
    {
      url,
      use_js_render: true,
      tls_profile: 'chrome124',
      timeout: 30
    },
    {
      headers: {
        'x-api-key': FINEDATA_API_KEY,
        'Content-Type': 'application/json'
      }
    }
  );
  return response.data.body;
}

async function scrapeMany(urls) {
  const results = await Promise.all(
    urls.map(url => limit(() => scrapeUrl(url)))
  );
  return results.map((html, i) => {
    const $ = cheerio.load(html);
    return {
      url: urls[i],
      title: $('h1').text().trim(),
      products: $('.product-card').length,
    };
  });
}

// Scrape 100 URLs with 10 concurrency — uses ~50MB RAM
const urls = Array.from({ length: 100 }, (_, i) =>
  `https://store.example.com/category?page=${i + 1}`
);

scrapeMany(urls).then(results => {
  console.log(`Scraped ${results.length} pages`);
});
```
Doing this with Puppeteer would require either a beefy server (10 Chrome instances at 300MB each = 3GB RAM for just the browsers) or a distributed system with message queues and worker pools.
Batch Scraping
For even larger workloads, FineData’s batch endpoint processes multiple URLs in a single request:
```javascript
async function batchScrape(urls) {
  const response = await axios.post(
    'https://api.finedata.ai/api/v1/batch',
    {
      urls,
      use_js_render: true,
      use_residential: true
    },
    {
      headers: {
        'x-api-key': FINEDATA_API_KEY,
        'Content-Type': 'application/json'
      }
    }
  );
  const batchId = response.data.batch_id;

  // Poll for results (capped so a stuck batch can't hang forever)
  for (let attempt = 0; attempt < 60; attempt++) {
    const status = await axios.get(
      `https://api.finedata.ai/api/v1/batch/${batchId}`,
      { headers: { 'x-api-key': FINEDATA_API_KEY } }
    );
    if (status.data.status === 'completed') {
      return status.data.results;
    }
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
  throw new Error(`Batch ${batchId} did not complete in time`);
}
```
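Batch results typically mix successes and failures, and it's cheaper to retry only the failures. A sketch of that split follows; the per-result shape (`url`, `success`, `body`) is an assumed example, not FineData's documented response schema.

```javascript
// Split batch results so failed URLs can be retried separately.
// The { url, success, body } shape is an assumed example,
// not FineData's documented response schema.
function partitionResults(results) {
  const ok = [];
  const failed = [];
  for (const r of results) {
    (r.success ? ok : failed).push(r);
  }
  return { ok, failed };
}

// Retry only what failed:
// const { ok, failed } = partitionResults(results);
// const retried = await Promise.all(failed.map(r => scrapeUrl(r.url)));
```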
Head-to-Head Comparison
| Factor | Puppeteer | Playwright | FineData API |
|---|---|---|---|
| Language | Node.js | Node.js, Python, Java, .NET | Any (REST API) |
| Browser support | Chrome only | Chrome, Firefox, Safari | N/A (server-side) |
| JS rendering | Yes | Yes | Yes |
| RAM per page | 200-500MB | 150-300MB | ~0 (server-side) |
| Anti-bot bypass | Poor (detectable) | Poor (detectable) | Built-in |
| CAPTCHA solving | Manual integration | Manual integration | Built-in |
| Proxy support | Manual setup | Manual setup | Built-in |
| Max concurrency | 5-20 (local) | 5-30 (local) | 100+ |
| Setup time | 15-30 min | 10-20 min | 2 min |
| Maintenance | High | Moderate | Low |
| Cost | Infrastructure | Infrastructure | Per-request |
| Page interaction | Full | Full | None |
| Screenshots/PDFs | Yes | Yes | No |
When to Use Each
Choose Puppeteer when:
- You need to interact with pages — fill forms, click buttons, navigate multi-step flows
- You’re building end-to-end tests alongside scraping
- You need screenshots or PDFs
- You’re scraping a small number of unprotected sites
- You want to stay within the Google/Chrome ecosystem
Choose Playwright when:
- Everything above, plus:
- You need cross-browser testing (Firefox, Safari)
- You want better auto-waiting and debugging tools
- You prefer Playwright’s API (it’s generally considered more ergonomic than Puppeteer’s)
Choose a Scraping API when:
- Sites have anti-bot protection (Cloudflare, CAPTCHAs, fingerprinting)
- You need to scrape at scale (hundreds to millions of pages)
- You don’t want to manage browser infrastructure
- Reliability matters — your data pipeline can’t tolerate frequent failures
- You’re working in a team and want to minimize ops burden
The Hybrid Pattern
Many production systems combine approaches:
```javascript
async function smartScrape(url, options = {}) {
  // Use Playwright for sites that need interaction
  if (options.needsInteraction) {
    return scrapeWithPlaywright(url, options.steps);
  }
  // Use API for everything else (handles anti-bot, JS, scale)
  return scrapeWithApi(url);
}
```
Use Playwright for the 10% of cases that need browser interaction (login flows, form submissions, file downloads) and the API for the 90% that just need rendered HTML.
Performance Benchmarks
Here’s what scraping 100 product pages looks like across approaches (tested on a 4-core, 8GB RAM server):
| Metric | Puppeteer | Playwright | FineData API |
|---|---|---|---|
| Total time | 8-12 min | 6-10 min | 2-4 min |
| Peak RAM | 3.2 GB | 2.4 GB | 120 MB |
| Success rate (protected) | 35% | 40% | 92% |
| Success rate (unprotected) | 95% | 97% | 99% |
| CPU usage | 60-80% | 50-70% | 5-10% |
The API approach is significantly faster and lighter because the browser execution happens on FineData’s infrastructure, and your Node.js process just handles HTTP requests and HTML parsing.
Key Takeaways
- Puppeteer and Playwright are powerful tools for browser automation, but they’re resource-intensive, detectable by anti-bot systems, and hard to scale past a few dozen concurrent pages.
- Scraping APIs trade per-request costs for zero infrastructure overhead, built-in anti-bot bypass, and effortless concurrency.
- For page interaction (forms, clicks, navigation flows), use Puppeteer or Playwright. For data extraction at scale, use an API.
- The hybrid approach combines the best of both: Playwright for interaction-heavy flows, API for everything else.
- At scale, the API approach uses a fraction of the RAM and CPU, letting you run on a smaller (cheaper) server.
Ready to try the API approach? Check out our getting started guide or dive into our documentation. For Python-focused scraping, see our comparison of Requests + BeautifulSoup vs API.