
How to Scrape JavaScript-Heavy Websites and SPAs

Learn why traditional scrapers fail on SPAs and how to scrape React, Vue, and Angular sites using JavaScript rendering and wait strategies.

FineData Team

The modern web runs on JavaScript. React, Vue, Angular, Next.js, Nuxt — a large share of the websites built in the last five years render their content dynamically in the browser. When you fetch these pages with a standard HTTP request, you get an empty shell: a <div id="root"></div> and a bunch of <script> tags.

This is the single biggest challenge in web scraping today. Here’s how to solve it.

Why Traditional Scrapers Fail on SPAs

When you make a request with Python’s requests library, here’s what happens:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://some-react-app.com/products")
soup = BeautifulSoup(response.text, "html.parser")

products = soup.select(".product-card")
print(len(products))  # 0 — nothing found

The response contains something like:

<!DOCTYPE html>
<html>
<head><title>Products</title></head>
<body>
  <div id="root"></div>
  <script src="/static/js/bundle.js"></script>
</body>
</html>

All the actual content — product cards, prices, images — gets injected by JavaScript after the page loads. The requests library downloads the HTML but never executes the JavaScript.

This affects a huge number of modern sites:

  • React / Next.js — Most e-commerce stores, dashboards, SaaS products
  • Vue / Nuxt — News sites, marketplaces, booking platforms
  • Angular — Enterprise applications, government portals
  • Svelte / SvelteKit — Newer sites and tools
  • Any site using client-side rendering (CSR)
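Before reaching for a browser at all, it helps to confirm that a page really is a client-rendered shell. The following is a rough heuristic, not part of any library: if the raw HTML carries almost no visible text but contains a typical SPA mount point, it is probably rendered client-side. The mount-point IDs checked here are common conventions, not an exhaustive list.

```python
import re

def looks_like_spa_shell(html: str) -> bool:
    """Rough heuristic: flag HTML that is likely an empty client-rendered shell."""
    # Drop script bodies, then strip tags to estimate visible text
    text = re.sub(r"<script.*?</script>", "", html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)
    word_count = len(text.split())
    # Common SPA mount points: React, Vue, Next.js, Gatsby
    has_mount = bool(re.search(r'id=["\'](root|app|__next|___gatsby)["\']', html))
    return has_mount and word_count < 50

shell = '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>'
print(looks_like_spa_shell(shell))  # True
```

If this returns True for your target page, plan on JS rendering from the start instead of discovering empty selectors later.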

The Traditional Solution: Headless Browsers

The classic approach is to run a headless browser — a real browser without a visible window:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://some-react-app.com/products")

# Wait for content to load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".product-card"))
)

products = driver.find_elements(By.CSS_SELECTOR, ".product-card")
print(len(products))  # Now we get results

driver.quit()

This works, but it comes with significant drawbacks:

  • Resource intensive — Each browser instance uses 200-500MB of RAM
  • Slow — Pages take 3-10 seconds to fully render
  • Fragile — Browser crashes, memory leaks, and zombie processes
  • Detectable — Sites can detect Selenium via navigator.webdriver and other fingerprints
  • Hard to scale — Running 50 concurrent browsers needs serious infrastructure

How FineData’s JS Rendering Works

FineData handles JavaScript rendering on its infrastructure, so you don’t need to manage headless browsers. You send a request with use_js_render: True, and FineData:

  1. Loads the page in a real browser environment
  2. Executes all JavaScript (React, Vue, etc.)
  3. Waits for the content to finish rendering
  4. Returns the fully-rendered HTML

import requests
from bs4 import BeautifulSoup

FINEDATA_API_KEY = "fd_your_api_key"

response = requests.post(
    "https://api.finedata.ai/api/v1/scrape",
    headers={
        "x-api-key": FINEDATA_API_KEY,
        "Content-Type": "application/json"
    },
    json={
        "url": "https://some-react-app.com/products",
        "use_js_render": True,
        "timeout": 30
    }
)

data = response.json()
soup = BeautifulSoup(data["body"], "html.parser")

products = soup.select(".product-card")
print(len(products))  # Works — JS was rendered server-side

No browser management, no Selenium drivers, no memory leaks. The rendered HTML comes back in the API response just like a regular HTTP request.

Wait Strategies: Getting the Timing Right

The trickiest part of JS rendering is knowing when the page is “done.” SPAs load data asynchronously — the initial HTML renders, then API calls fetch product data, which then gets rendered into the DOM. You need to wait for all of this to complete.

FineData supports several wait strategies:

Network Idle (Default)

{
    "url": "https://example.com",
    "use_js_render": true,
    "js_wait_for": "networkidle"
}

This waits until there are no more network requests for 500ms. It’s the safest default — most SPAs load data immediately on page render, and once those API calls finish, the content is ready.

Best for: Most SPAs, e-commerce sites, dashboards

DOM Content Loaded

{
    "url": "https://example.com",
    "use_js_render": true,
    "js_wait_for": "domcontentloaded"
}

Returns as soon as the initial HTML is parsed, without waiting for stylesheets, images, or subframes. This is faster but may miss dynamically loaded content.

Best for: Server-side rendered pages (Next.js SSR, Nuxt SSR) where the content is in the initial HTML but some JS enhancement runs after

Selector-Based Waiting

{
    "url": "https://example.com/products",
    "use_js_render": true,
    "js_wait_for": "selector:.product-card"
}

This waits until a specific CSS selector appears in the DOM. It’s the most precise strategy — you’re telling FineData exactly what element signals that the page is ready.

Best for: Pages where you know the exact element that indicates content has loaded
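Since the selector is embedded in a string with a "selector:" prefix, it's easy to mis-assemble the payload by hand. A small illustrative helper (the helper itself is an assumption, not part of the FineData client; the field names mirror the examples above):

```python
def build_selector_wait_payload(url: str, selector: str, timeout: int = 30) -> dict:
    """Assemble a scrape payload that waits for a CSS selector.

    The "selector:" prefix follows the js_wait_for format shown above.
    """
    if not selector.strip():
        raise ValueError("selector must be non-empty")
    return {
        "url": url,
        "use_js_render": True,
        "js_wait_for": f"selector:{selector}",
        "timeout": timeout,
    }

payload = build_selector_wait_payload("https://example.com/products", ".product-card")
print(payload["js_wait_for"])  # selector:.product-card
```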

Full Page Load

{
    "url": "https://example.com",
    "use_js_render": true,
    "js_wait_for": "load"
}

Waits for the window.onload event, which fires after all resources (images, stylesheets, iframes) have finished loading.

Best for: Pages where images or iframes contain important data

Handling Common SPA Patterns

Infinite Scroll

Many modern sites use infinite scroll instead of pagination — content loads as the user scrolls down. Simulating scroll events in a browser is one option, but there is usually a better one.

With FineData, use JS rendering with the networkidle wait strategy; the initial load typically brings the first batch of items. For subsequent batches, skip scrolling entirely and hit the underlying API endpoints:

import json
import requests

FINEDATA_API_KEY = "fd_your_api_key"

def scrape_infinite_scroll_api(base_api_url, pages=5):
    """
    Instead of scrolling, hit the underlying API directly.
    Most infinite scroll sites fetch from a paginated API.
    """
    all_items = []

    for page in range(1, pages + 1):
        api_url = f"{base_api_url}?page={page}&limit=20"

        response = requests.post(
            "https://api.finedata.ai/api/v1/scrape",
            headers={
                "x-api-key": FINEDATA_API_KEY,
                "Content-Type": "application/json"
            },
            json={
                "url": api_url,
                "use_js_render": False,  # API returns JSON, no JS needed
                "timeout": 30
            }
        )

        data = response.json()
        # The body contains the raw API JSON; adjust if the items
        # are nested under a key like "results" or "items"
        items = json.loads(data["body"])
        all_items.extend(items)

    return all_items

Pro tip: Open your browser’s DevTools Network tab, scroll the page, and watch for XHR/Fetch requests. You’ll often find a clean JSON API behind the infinite scroll UI. Scraping the API directly is faster, cheaper (no JS rendering needed), and more reliable.
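API response shapes vary: some endpoints return a bare JSON array, others nest the list under a key. A tolerant extractor saves per-site special-casing; the key names below are common conventions you would confirm in the DevTools Network tab, not guarantees.

```python
import json

def extract_items(body: str) -> list:
    """Pull the item list out of a paginated API response body.

    Handles a bare JSON array as well as objects that nest the list
    under common keys ("items", "results", "data", "products").
    """
    payload = json.loads(body)
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        for key in ("items", "results", "data", "products"):
            value = payload.get(key)
            if isinstance(value, list):
                return value
    return []

print(extract_items('{"results": [{"id": 1}, {"id": 2}]}'))  # [{'id': 1}, {'id': 2}]
```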

Lazy-Loaded Content

Some sites delay loading certain sections until the user scrolls to them. This is common for images and below-the-fold content:

# Use selector-based waiting for the specific content you need
response = requests.post(
    "https://api.finedata.ai/api/v1/scrape",
    headers={
        "x-api-key": FINEDATA_API_KEY,
        "Content-Type": "application/json"
    },
    json={
        "url": "https://example.com/product",
        "use_js_render": True,
        "js_wait_for": "selector:.review-section",
        "timeout": 45
    }
)

Client-Side Routing

SPAs often use client-side routing (React Router, Vue Router). URLs like /products/123 don’t correspond to actual server paths — they’re handled by JavaScript. The good news: FineData’s JS rendering handles this automatically. Just pass the full URL and the SPA’s router will navigate to the correct view.
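Because the router resolves paths client-side, you can generate the full URL for each route up front and render them one by one. A minimal sketch — the /products/{id} pattern here is an example; substitute your target SPA's actual route structure:

```python
from urllib.parse import urljoin

def build_route_urls(base_url: str, product_ids: list) -> list:
    """Build absolute URLs for client-side routes like /products/123."""
    return [urljoin(base_url, f"/products/{pid}") for pid in product_ids]

urls = build_route_urls("https://store.example.com", [101, 102])
print(urls)  # ['https://store.example.com/products/101', 'https://store.example.com/products/102']
```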

Comparison: Selenium vs Playwright vs FineData

Here’s how the approaches stack up for scraping JavaScript-heavy sites:

Factor             | Selenium       | Playwright      | FineData API
-------------------|----------------|-----------------|-------------------
Setup time         | 30+ min        | 15 min          | 2 min
RAM per page       | 200-500 MB     | 150-300 MB      | 0 (server-side)
Anti-bot bypass    | Poor           | Moderate        | Built-in
Concurrent pages   | 5-10 (local)   | 10-20 (local)   | 100+
TLS fingerprinting | Detectable     | Less detectable | Chrome-identical
Maintenance        | High           | Moderate        | None
Cost               | Infrastructure | Infrastructure  | Per-request tokens

When to use Selenium/Playwright:

  • You need to interact with pages (fill forms, click buttons, navigate flows)
  • You’re scraping a small number of pages (<100/day) and already have the infrastructure
  • You need to capture screenshots or PDFs

When to use FineData:

  • You need rendered HTML at scale (hundreds to thousands of pages)
  • Anti-bot protection is present
  • You don’t want to manage browser infrastructure
  • You need residential proxies and CAPTCHA solving alongside JS rendering

Real-World Example: Scraping a React E-Commerce Store

Let’s put it all together with a practical example — scraping a React-based product catalog:

import requests
from bs4 import BeautifulSoup
import json

FINEDATA_API_KEY = "fd_your_api_key"

def scrape_react_store(category_url):
    """Scrape products from a React-based e-commerce store."""
    # Step 1: Get the rendered category page
    response = requests.post(
        "https://api.finedata.ai/api/v1/scrape",
        headers={
            "x-api-key": FINEDATA_API_KEY,
            "Content-Type": "application/json"
        },
        json={
            "url": category_url,
            "use_js_render": True,
            "js_wait_for": "selector:[data-testid='product-grid']",
            "tls_profile": "chrome124",
            "timeout": 30
        }
    )
    data = response.json()
    soup = BeautifulSoup(data["body"], "html.parser")

    # Step 2: Extract products
    products = []
    for card in soup.select("[data-testid='product-card']"):
        product = {
            "name": card.select_one("h3").get_text(strip=True),
            "price": card.select_one("[data-testid='price']").get_text(strip=True),
            "image": card.select_one("img").get("src"),
            "link": card.select_one("a").get("href"),
        }
        products.append(product)

    # Step 3: Check for next page
    next_btn = soup.select_one("[data-testid='next-page']")
    has_next = next_btn is not None and "disabled" not in next_btn.get("class", [])

    return products, has_next

# Scrape multiple pages
all_products = []
page = 1

while True:
    url = f"https://store.example.com/electronics?page={page}"
    products, has_next = scrape_react_store(url)
    all_products.extend(products)
    print(f"Page {page}: {len(products)} products")

    if not has_next or page >= 10:
        break
    page += 1

print(f"\nTotal: {len(all_products)} products")

Token Costs for JS Rendering

JavaScript rendering adds 5 tokens per request on top of the base cost:

Configuration                     | Tokens | Use Case
----------------------------------|--------|-------------------------
Base only                         | 1      | Static HTML sites
Base + JS render                  | 6      | SPAs, React/Vue sites
Base + JS + residential           | 9      | Protected SPAs
Base + JS + residential + CAPTCHA | 19     | Heavily protected sites

For most SPA scraping, 6 tokens per request (base + JS rendering) is all you need.
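The per-request pricing in the table above makes budgeting a job straightforward. A small sketch that encodes those same increments (base 1, +5 JS rendering, +3 residential, +10 CAPTCHA):

```python
def estimate_tokens(pages: int, js_render: bool = True,
                    residential: bool = False, captcha: bool = False) -> int:
    """Estimate total token cost for a scraping job using the
    per-request pricing from the table above."""
    per_request = 1          # base cost
    if js_render:
        per_request += 5     # JS rendering
    if residential:
        per_request += 3     # residential proxy
    if captcha:
        per_request += 10    # CAPTCHA solving
    return pages * per_request

print(estimate_tokens(1000))  # 6000 tokens for 1,000 JS-rendered pages
```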

Key Takeaways

  • Standard HTTP requests return empty HTML for JavaScript-heavy sites — you need a browser to render the content.
  • Headless browsers (Selenium, Playwright) work but are resource-intensive, hard to scale, and easy to detect.
  • FineData’s JS rendering handles browser execution server-side, returning fully-rendered HTML via a simple API call.
  • Choose the right wait strategy: networkidle for most cases, selector:... when you know exactly what to wait for.
  • Look for underlying APIs behind infinite scroll and dynamic content — scraping the API directly is faster and cheaper.
  • JS rendering costs 5 extra tokens per request — a fraction of what running your own browser infrastructure costs.

For sites with additional anti-bot protection beyond JavaScript rendering, check out our guides on handling CAPTCHAs and bypassing Cloudflare.

#javascript #spa #react #vue #js-rendering #tutorial
