Scaling Web Scraping from 1K to 10M Pages per Day
Most scraping projects start the same way: a Python script with a for loop and requests.get(). It works fine for a few hundred pages. Then the requirements grow — more pages, more sites, tighter schedules — and that simple script becomes a bottleneck. Scaling web scraping is not just about making requests faster; it requires rethinking the entire architecture.
This guide walks through the architectural evolution from a simple script to a system capable of processing millions of pages per day, with practical patterns you can implement at each stage.
Stage 1: Single-Threaded Script (1K pages/day)
Where every scraping project begins:
```python
import requests
from bs4 import BeautifulSoup
import time

urls = load_urls()  # ~1,000 URLs

for url in urls:
    try:
        response = requests.get(url, timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")
        data = extract_data(soup)
        save_to_database(data)
    except Exception as e:
        log_error(url, e)
    time.sleep(1)  # Be polite
```
Throughput: with the 1-second politeness delay plus an average response time of 500 ms, each page takes ~1.5 seconds, about 40 pages/minute, or roughly 57K pages/day if the script ran around the clock. In practice you will see far less.
Bottleneck: Sequential execution. Each request waits for the previous one to complete, so network latency and parsing time stack instead of overlapping.
When this breaks: As soon as you need more than a few thousand pages per day, or when you are scraping from multiple independent domains.
Stage 2: Concurrent Requests with asyncio (10K-50K pages/day)
The first major architectural change is moving from sequential to concurrent execution:
```python
import asyncio
import aiohttp
from aiohttp import ClientTimeout

CONCURRENCY = 50  # Max simultaneous requests
RATE_LIMIT = 10   # Requests per second per domain (target; not yet enforced here)

semaphore = asyncio.Semaphore(CONCURRENCY)

async def scrape_url(session: aiohttp.ClientSession, url: str) -> dict:
    async with semaphore:
        try:
            async with session.get(url, timeout=ClientTimeout(total=30)) as response:
                html = await response.text()
                return {"url": url, "html": html, "status": response.status}
        except Exception as e:
            return {"url": url, "error": str(e)}

async def main(urls: list[str]):
    connector = aiohttp.TCPConnector(limit=CONCURRENCY)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [scrape_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

urls = load_urls()
results = asyncio.run(main(urls))
```
Throughput: 50 concurrent requests at ~500 ms average response time gives a theoretical ~100 pages/second, or ~360K pages/hour. In practice, per-domain politeness limits, retries, and parsing overhead keep real throughput well below that ceiling.
Key improvements:
- Concurrent I/O means network wait time overlaps instead of stacking
- Connection pooling reduces TCP/TLS handshake overhead
- Semaphore controls concurrency to avoid overwhelming targets or running out of file descriptors
New challenges at this stage:
- Rate limiting per domain becomes critical
- Error handling needs to be more robust (retries, backoff)
- Memory management with large numbers of concurrent responses
- No persistence — if the script crashes, all progress is lost
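The memory challenge above comes partly from asyncio.gather itself, which buffers every result until all tasks finish. One way to bound memory is asyncio.as_completed, which yields results as they arrive so each can be persisted and dropped immediately. A minimal sketch; the scrape_url and handle_result bodies here are placeholders, not the article's implementations:

```python
import asyncio

CONCURRENCY = 50
semaphore = asyncio.Semaphore(CONCURRENCY)

async def scrape_url(url: str) -> dict:
    # Placeholder fetch; swap in the aiohttp call from the example above.
    async with semaphore:
        await asyncio.sleep(0)  # stands in for network I/O
        return {"url": url, "status": 200}

def handle_result(result: dict) -> None:
    pass  # e.g. write to a database or append to a file

async def scrape_streaming(urls: list[str]) -> int:
    """Process each result as it completes instead of buffering them all."""
    tasks = [asyncio.ensure_future(scrape_url(u)) for u in urls]
    done = 0
    for future in asyncio.as_completed(tasks):
        result = await future
        handle_result(result)  # persist immediately, then drop the reference
        done += 1
    return done
```

With this shape, peak memory scales with CONCURRENCY rather than with the total number of URLs.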
Stage 3: Queue-Based Architecture (50K-500K pages/day)
At this stage, you need to separate URL discovery, fetching, and processing into independent components:
```
┌─────────────┐
│ URL Source  │
│ (Scheduler) │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Task Queue  │
│ (Redis/RMQ) │
└──────┬──────┘
       │
┌──────┼──────────────┐
▼      ▼              ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Worker 1 │ │ Worker 2 │ │ Worker N │
└────┬─────┘ └────┬─────┘ └────┬─────┘
     │            │            │
     ▼            ▼            ▼
┌─────────────────────────────────────┐
│            Results Store            │
│     (Database / Object Storage)     │
└─────────────────────────────────────┘
```
Each component has a clear responsibility:
Scheduler: Discovers or receives URLs to scrape, deduplicates them, and enqueues them with priority and rate-limiting metadata.
Task Queue: Redis, RabbitMQ, or SQS. Provides persistence (survive crashes), backpressure (prevent worker overload), and visibility (monitoring).
Workers: Stateless processes that pull URLs from the queue, scrape them, and store results. Can be scaled horizontally.
Results Store: PostgreSQL for structured data, S3/MinIO for raw HTML, or both.
```python
# Worker process (simplified)
import redis
import json

r = redis.Redis()

while True:
    # Blocking pop from queue
    _, message = r.brpop("scrape_queue")
    task = json.loads(message)
    try:
        result = scrape_with_retries(task["url"], max_retries=3)
        store_result(task["url"], result)
        r.incr("stats:success")
    except Exception as e:
        if task.get("retries", 0) < 3:
            task["retries"] = task.get("retries", 0) + 1
            r.lpush("scrape_queue", json.dumps(task))
        else:
            store_failure(task["url"], str(e))
            r.incr("stats:failed")
```
Key improvements:
- Persistence — crashed workers resume from where they left off
- Horizontal scaling — add more workers as needed
- Rate limiting can be implemented per-domain at the queue level
- Monitoring via queue depth, processing rates, and error counts
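Per-domain rate limiting at the queue level is simpler if tasks are sharded into one queue per domain at enqueue time, so a worker can skip domains that are currently throttled. A sketch of the scheduler-side routing; the key names and task schema here are illustrative:

```python
import json
from urllib.parse import urlsplit

def queue_key(url: str) -> str:
    """Route each URL to a per-domain queue, e.g. 'scrape_queue:example.com'."""
    return f"scrape_queue:{urlsplit(url).netloc.lower()}"

def make_task(url: str, priority: int = 0) -> str:
    """Serialize a task with the metadata the scheduler attaches."""
    return json.dumps({"url": url, "priority": priority, "retries": 0})

# Usage (assuming a redis.Redis client `r`):
#   r.lpush(queue_key(url), make_task(url))
```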
Stage 4: Distributed System with Rate Limiting (500K-2M pages/day)
At this scale, rate limiting becomes a first-class concern. You need per-domain rate limiters that work across multiple workers:
```python
import time
import redis

class DistributedRateLimiter:
    """Token bucket rate limiter using Redis."""

    def __init__(self, redis_client: redis.Redis, domain: str, rate: float, burst: int):
        self.redis = redis_client
        self.key = f"ratelimit:{domain}"
        self.rate = rate    # tokens per second
        self.burst = burst  # max tokens

    def acquire(self) -> bool:
        """Try to acquire a token. Returns True if allowed."""
        now = time.time()
        # Lua script runs atomically inside Redis, so concurrent workers
        # cannot race on the token count
        lua = """
        local key = KEYS[1]
        local rate = tonumber(ARGV[1])
        local burst = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])
        local data = redis.call('hmget', key, 'tokens', 'last_time')
        local tokens = tonumber(data[1]) or burst
        local last_time = tonumber(data[2]) or now
        local elapsed = now - last_time
        tokens = math.min(burst, tokens + elapsed * rate)
        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('hmset', key, 'tokens', tokens, 'last_time', now)
            redis.call('expire', key, 3600)
            return 1
        else
            return 0
        end
        """
        result = self.redis.eval(lua, 1, self.key, self.rate, self.burst, now)
        return bool(result)
```
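A worker can then wrap acquire() in a blocking helper that spins until a token is granted or gives up. This sketch works with any object exposing an acquire() -> bool method like the limiter above; the timeout and poll interval are illustrative:

```python
import time

def acquire_blocking(limiter, timeout: float = 30.0, poll: float = 0.05) -> bool:
    """Spin on the limiter until a token is granted or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if limiter.acquire():
            return True
        time.sleep(poll)  # back off briefly before asking Redis again
    return False
```

Returning False instead of blocking forever lets the worker requeue the task for later rather than stalling on a heavily throttled domain.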
Error Handling and Retry Strategy
At scale, transient errors are a certainty. Implement exponential backoff with jitter:
```python
import asyncio
import random

class MaxRetriesExceeded(Exception):
    pass

async def scrape_with_retries(url: str, max_retries: int = 5) -> dict:
    for attempt in range(max_retries):
        try:
            result = await scrape(url)
            if result["status"] < 400:
                return result
            if result["status"] == 429:  # Rate limited
                wait = min(300, (2 ** attempt) + random.uniform(0, 1))
                await asyncio.sleep(wait)
                continue
            if result["status"] >= 500:  # Server error, retry
                wait = min(60, (2 ** attempt) + random.uniform(0, 1))
                await asyncio.sleep(wait)
                continue
            return result  # 4xx (except 429) — don't retry
        except (ConnectionError, TimeoutError):
            wait = min(60, (2 ** attempt) + random.uniform(0, 1))
            await asyncio.sleep(wait)
    raise MaxRetriesExceeded(url)
```
Monitoring and Alerting
At this scale, you need real-time visibility:
```
Key Metrics:
├── Throughput: pages/minute (total and per-domain)
├── Success Rate: % of requests returning valid data
├── Queue Depth: pending tasks in queue
├── Worker Health: active workers, CPU/memory usage
├── Error Distribution: errors by type and domain
├── Latency: p50, p95, p99 response times
└── Cost: tokens consumed, proxy bandwidth used
```
Set alerts for:
- Queue depth growing faster than processing rate (backlog)
- Success rate dropping below threshold (site blocking)
- Worker count dropping (infrastructure issues)
- Error rate spikes (anti-bot update, site change)
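The alert rules above can be reduced to a periodic check over counters like the stats:success and stats:failed keys from the worker example. A sketch with illustrative thresholds; the stats dict shape here is an assumption, not a fixed schema:

```python
def should_alert(stats: dict,
                 min_success_rate: float = 0.90,
                 max_queue_growth: int = 1000) -> list[str]:
    """Return the names of alert conditions currently firing."""
    alerts = []
    total = stats["success"] + stats["failed"]
    # Success rate dropping usually means the target site is blocking us
    if total and stats["success"] / total < min_success_rate:
        alerts.append("success_rate")
    # Queue growing faster than workers drain it means a backlog is forming
    if stats["queue_depth_now"] - stats["queue_depth_prev"] > max_queue_growth:
        alerts.append("queue_backlog")
    return alerts
```

Run this on a schedule (e.g. every minute) and wire the returned names into whatever paging system you already use.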
Stage 5: Optimized at Scale (2M-10M+ pages/day)
At millions of pages per day, every inefficiency multiplies. This is where using a managed scraping API becomes particularly compelling.
Using FineData for Async Batch Processing
Instead of managing browser pools, proxy rotation, and anti-bot evasion yourself, delegate to an API and focus on orchestration:
```python
import requests
import asyncio   # used by the async polling example below
import aiohttp

API_URL = "https://api.finedata.ai/api/v1"
HEADERS = {
    "x-api-key": "fd_your_api_key",
    "Content-Type": "application/json"
}

# Submit batch job
def submit_batch(urls: list[str]) -> dict:
    response = requests.post(
        f"{API_URL}/batch",
        headers=HEADERS,
        json={
            "urls": urls,
            "use_js_render": False,
            "callback_url": "https://your-app.com/webhook/batch-complete"
        }
    )
    return response.json()

# Process in chunks of 100
all_urls = load_urls()  # 100,000+ URLs
for i in range(0, len(all_urls), 100):
    chunk = all_urls[i:i+100]
    result = submit_batch(chunk)
    print(f"Batch {result['batch_id']} submitted: {len(chunk)} URLs")
```
For real-time processing where callbacks are not suitable, use async jobs with polling:
```python
import asyncio
import aiohttp

class ScrapingError(Exception):
    pass

async def scrape_async_with_polling(url: str) -> dict:
    async with aiohttp.ClientSession() as session:
        # Submit async job
        resp = await session.post(
            f"{API_URL}/scrape/async",
            headers=HEADERS,
            json={"url": url, "use_js_render": True}
        )
        job = await resp.json()
        job_id = job["job_id"]
        # Poll for completion
        while True:
            resp = await session.get(
                f"{API_URL}/jobs/{job_id}",
                headers=HEADERS
            )
            status = await resp.json()
            if status["status"] == "completed":
                return status["result"]
            elif status["status"] == "failed":
                raise ScrapingError(status["error"])
            await asyncio.sleep(2)  # Poll interval
```
Cost Optimization at Scale
When processing millions of pages, small optimizations compound:
1. Classify URLs by difficulty before scraping.
```python
# Route easy targets through the cheap path, hard targets through the full path
EASY_DOMAINS = {"example.com", "data.gov", "wikipedia.org"}

def get_scrape_config(url: str) -> dict:
    domain = extract_domain(url)
    if domain in EASY_DOMAINS:
        return {"use_js_render": False, "use_residential": False}  # 1 token
    else:
        return {"use_js_render": True, "use_residential": True}    # 9 tokens
```
2. Cache aggressively. If a page has not changed (check via ETags or Last-Modified), skip re-scraping.
3. Request only what you need. Disable JS rendering for static HTML pages. Skip residential proxies for unprotected sites.
4. Use batch operations. Batch API calls have lower per-request overhead than individual calls.
5. Schedule smart. Many sites have lower traffic (and lower anti-bot sensitivity) during off-peak hours. Schedule large scraping jobs during these windows.
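The ETag caching in point 2 amounts to a conditional GET: send If-None-Match with the stored ETag and skip processing when the server answers 304 Not Modified. A sketch, with the HTTP call injectable so it can be tested offline; fetch_if_changed and etag_cache are illustrative names, not part of any library:

```python
import requests

def fetch_if_changed(url: str, etag_cache: dict, http_get=requests.get):
    """Conditional GET: returns None when the server says the page is unchanged."""
    headers = {}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    response = http_get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None  # unchanged — skip re-parsing and storage entirely
    etag = response.headers.get("ETag")
    if etag:
        etag_cache[url] = etag  # remember the validator for next time
    return response
```

A 304 costs almost no bandwidth and no parsing time, so on slowly changing sites this alone can cut effective scraping cost dramatically.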
Architecture at 10M Pages/Day
```
┌──────────────────────────────────────────────┐
│                 Orchestrator                 │
│  (URL scheduling, priority, deduplication)   │
└────────────┬─────────────────┬───────────────┘
             │                 │
     ┌───────▼────────┐  ┌─────▼────────┐
     │  Direct Queue  │  │  API Queue   │
     │ (Easy targets) │  │(Hard targets)│
     └───────┬────────┘  └─────┬────────┘
             │                 │
     ┌───────▼────────┐  ┌─────▼────────┐
     │  DIY Workers   │  │   FineData   │
     │ (aiohttp+proxy)│  │  Batch API   │
     └───────┬────────┘  └─────┬────────┘
             │                 │
     ┌───────▼─────────────────▼────────┐
     │         Results Pipeline         │
     │    (Validate → Parse → Store)    │
     └──────────────────────────────────┘
```
This hybrid architecture routes easy targets through a lightweight in-house scraper (minimal cost) and hard targets through FineData’s API (reliable, maintained anti-bot bypass). The results pipeline is unified — regardless of how a page was fetched, it goes through the same validation, parsing, and storage flow.
Common Pitfalls at Scale
1. Ignoring politeness. Aggressive scraping that crashes or degrades target sites invites IP bans, legal action, and technical countermeasures. Rate limit per domain, respect robots.txt crawl-delay directives, and scrape during off-peak hours when possible.
2. No deduplication. Without deduplication, URL discovery processes (sitemaps, link following) generate duplicate work. Use a Bloom filter or Redis set for efficient membership testing:
```python
import redis

r = redis.Redis()

def is_new_url(url: str) -> bool:
    return r.sadd("seen_urls", url) == 1  # Returns 1 if newly added
```
3. Unbounded retries. Failed URLs that keep retrying clog the queue. Implement a dead letter queue for URLs that fail after max retries, and review them separately.
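The retry-or-park decision can live in a small pure function, which keeps the worker loop simple and testable. A sketch; the queue names and MAX_RETRIES mirror the worker example earlier, but the helper itself is illustrative:

```python
import json

MAX_RETRIES = 3

def route_failed_task(task: dict, error: str) -> tuple[str, str]:
    """Decide whether a failed task is retried or parked on the dead letter queue."""
    retries = task.get("retries", 0)
    if retries < MAX_RETRIES:
        # Retry: bump the counter and send it back to the main queue
        return ("scrape_queue", json.dumps({**task, "retries": retries + 1}))
    # Give up: park it with the error for later human review
    return ("dead_letter_queue", json.dumps({**task, "error": error}))

# Usage with Redis (assuming client `r`):
#   queue, payload = route_failed_task(task, str(e))
#   r.lpush(queue, payload)
```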
4. Monolithic parsing. Coupling fetching and parsing makes the system fragile — a parsing bug on one site can block all processing. Separate fetching (get HTML) from parsing (extract data) into different pipeline stages.
5. No backpressure. If workers produce results faster than downstream systems can process them, memory fills up and the system crashes. Implement backpressure at every stage — queue depth limits, buffer sizes, and flow control.
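Producer-side backpressure can be as simple as refusing to enqueue while the downstream queue is over a depth limit. A sketch with injectable push and depth callables so it works against any queue; the limits are illustrative:

```python
import time

def enqueue_with_backpressure(push, depth, item,
                              max_depth: int = 10_000,
                              poll: float = 0.1,
                              timeout: float = 60.0) -> bool:
    """Block the producer while the downstream queue is over its depth limit.

    Returns False (sheds load) if the queue stays full past the timeout,
    instead of letting memory grow without bound.
    """
    deadline = time.monotonic() + timeout
    while depth() >= max_depth:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)  # wait for consumers to drain the queue
    push(item)
    return True
```

With Redis, push would be a partial over r.lpush and depth a call to r.llen on the same key.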
Key Takeaways
- Architecture evolves with scale. Do not over-engineer at 1K pages/day, but plan for the transition points.
- Queues are the backbone. A persistent task queue transforms a fragile script into a resilient system.
- Rate limiting is not optional. Both for politeness and for avoiding bans.
- Hybrid approaches win at scale. Easy targets through lightweight scrapers, hard targets through managed APIs.
- Monitor everything. At millions of pages/day, you need real-time visibility into every component.
The path from 1K to 10M pages per day is not linear — it requires fundamental architectural changes at each order of magnitude. Start simple, instrument everything, and evolve the architecture as your actual requirements (not hypothetical ones) demand it.
Ready to scale without managing infrastructure? FineData’s batch and async APIs handle millions of pages with built-in rate limiting, proxy rotation, and anti-bot bypass. Get started free.