Scaling Web Scraping from 1K to 10M Pages per Day
Most scraping projects start the same way: a Python script with a for loop and requests.get(). It works fine for a few hundred pages. Then the requirements grow — more pages, more sites, tighter schedules — and that simple script becomes a bottleneck. Scaling web scraping is not just about making requests faster; it requires rethinking the entire architecture.
This guide walks through the architectural evolution from a simple script to a system capable of processing millions of pages per day, with practical patterns you can implement at each stage.
Stage 1: Single-Threaded Script (1K pages/day)
Where every scraping project begins:
```python
import requests
from bs4 import BeautifulSoup
import time

urls = load_urls()  # ~1,000 URLs

for url in urls:
    try:
        response = requests.get(url, timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")
        data = extract_data(soup)
        save_to_database(data)
    except Exception as e:
        log_error(url, e)
    time.sleep(1)  # Be polite
```
Throughput: with the 1-second politeness delay plus an average response time of 500 ms, each page takes ~1.5 seconds, about 40 pages/minute, or roughly 57K pages/day if the script ran around the clock. In practice you will see far less.
Bottleneck: Sequential execution. Each request waits for the previous one to complete, so network latency and parsing time stack instead of overlapping.
When this breaks: As soon as you need more than a few thousand pages per day, or when you are scraping from multiple independent domains.
Stage 2: Concurrent Requests with asyncio (10K-50K pages/day)
The first major architectural change is moving from sequential to concurrent execution:
```python
import asyncio
import aiohttp
from aiohttp import ClientTimeout

CONCURRENCY = 50  # Max simultaneous requests
RATE_LIMIT = 10   # Requests per second per domain (target; not yet enforced here)

semaphore = asyncio.Semaphore(CONCURRENCY)

async def scrape_url(session: aiohttp.ClientSession, url: str) -> dict:
    async with semaphore:
        try:
            async with session.get(url, timeout=ClientTimeout(total=30)) as response:
                html = await response.text()
                return {"url": url, "html": html, "status": response.status}
        except Exception as e:
            return {"url": url, "error": str(e)}

async def main(urls: list[str]):
    connector = aiohttp.TCPConnector(limit=CONCURRENCY)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [scrape_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

urls = load_urls()
results = asyncio.run(main(urls))
```
Throughput: 50 concurrent requests at ~500 ms average response time gives a theoretical ~100 pages/second, or ~360K pages/hour. In practice, per-domain politeness limits, retries, and parsing overhead keep real throughput well below that ceiling.
Key improvements:
- Concurrent I/O means network wait time overlaps instead of stacking
- Connection pooling reduces TCP/TLS handshake overhead
- Semaphore controls concurrency to avoid overwhelming targets or running out of file descriptors
New challenges at this stage:
- Rate limiting per domain becomes critical
- Error handling needs to be more robust (retries, backoff)
- Memory management with large numbers of concurrent responses
- No persistence — if the script crashes, all progress is lost
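The memory challenge above comes partly from asyncio.gather itself, which buffers every result until all tasks finish. One way to bound memory is asyncio.as_completed, which yields results as they arrive so each can be persisted and dropped immediately. A minimal sketch; the scrape_url and handle_result bodies here are placeholders, not the article's implementations:

```python
import asyncio

CONCURRENCY = 50
semaphore = asyncio.Semaphore(CONCURRENCY)

async def scrape_url(url: str) -> dict:
    # Placeholder fetch; swap in the aiohttp call from the example above.
    async with semaphore:
        await asyncio.sleep(0)  # stands in for network I/O
        return {"url": url, "status": 200}

def handle_result(result: dict) -> None:
    pass  # e.g. write to a database or append to a file

async def scrape_streaming(urls: list[str]) -> int:
    """Process each result as it completes instead of buffering them all."""
    tasks = [asyncio.ensure_future(scrape_url(u)) for u in urls]
    done = 0
    for future in asyncio.as_completed(tasks):
        result = await future
        handle_result(result)  # persist immediately, then drop the reference
        done += 1
    return done
```

With this shape, peak memory scales with CONCURRENCY rather than with the total number of URLs.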
Stage 3: Queue-Based Architecture (50K-500K pages/day)
At this stage, you need to separate URL discovery, fetching, and processing into independent components:
```
┌─────────────┐
│ URL Source  │
│ (Scheduler) │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Task Queue  │
│ (Redis/RMQ) │
└──────┬──────┘
       │
┌──────┼──────────────┐
▼      ▼              ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Worker 1 │ │ Worker 2 │ │ Worker N │
└────┬─────┘ └────┬─────┘ └────┬─────┘
     │            │            │
     ▼            ▼            ▼
┌─────────────────────────────────────┐
│            Results Store            │
│     (Database / Object Storage)     │
└─────────────────────────────────────┘
```
Each component has a clear responsibility:
Scheduler: Discovers or receives URLs to scrape, deduplicates them, and enqueues them with priority and rate-limiting metadata.
Task Queue: Redis, RabbitMQ, or SQS. Provides persistence (survive crashes), backpressure (prevent worker overload), and visibility (monitoring).
Workers: Stateless processes that pull URLs from the queue, scrape them, and store results. Can be scaled horizontally.
Results Store: PostgreSQL for structured data, S3/MinIO for raw HTML, or both.
```python
# Worker process (simplified)
import redis
import json

r = redis.Redis()

while True:
    # Blocking pop from queue
    _, message = r.brpop("scrape_queue")
    task = json.loads(message)
    try:
        result = scrape_with_retries(task["url"], max_retries=3)
        store_result(task["url"], result)
        r.incr("stats:success")
    except Exception as e:
        if task.get("retries", 0) < 3:
            task["retries"] = task.get("retries", 0) + 1
            r.lpush("scrape_queue", json.dumps(task))
        else:
            store_failure(task["url"], str(e))
            r.incr("stats:failed")
```
Key improvements:
- Persistence — crashed workers resume from where they left off
- Horizontal scaling — add more workers as needed
- Rate limiting can be implemented per-domain at the queue level
- Monitoring via queue depth, processing rates, and error counts
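Per-domain rate limiting at the queue level is simpler if tasks are sharded into one queue per domain at enqueue time, so a worker can skip domains that are currently throttled. A sketch of the scheduler-side routing; the key names and task schema here are illustrative:

```python
import json
from urllib.parse import urlsplit

def queue_key(url: str) -> str:
    """Route each URL to a per-domain queue, e.g. 'scrape_queue:example.com'."""
    return f"scrape_queue:{urlsplit(url).netloc.lower()}"

def make_task(url: str, priority: int = 0) -> str:
    """Serialize a task with the metadata the scheduler attaches."""
    return json.dumps({"url": url, "priority": priority, "retries": 0})

# Usage (assuming a redis.Redis client `r`):
#   r.lpush(queue_key(url), make_task(url))
```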
Stage 4: Distributed System with Rate Limiting (500K-2M pages/day)
At this scale, rate limiting becomes a first-class concern. You need per-domain rate limiters that work across multiple workers:
```python
import time
import redis

class DistributedRateLimiter:
    """Token bucket rate limiter using Redis."""

    def __init__(self, redis_client: redis.Redis, domain: str, rate: float, burst: int):
        self.redis = redis_client
        self.key = f"ratelimit:{domain}"
        self.rate = rate    # tokens per second
        self.burst = burst  # max tokens

    def acquire(self) -> bool:
        """Try to acquire a token. Returns True if allowed."""
        now = time.time()
        # Lua script runs atomically inside Redis, so concurrent workers
        # cannot race on the token count
        lua = """
        local key = KEYS[1]
        local rate = tonumber(ARGV[1])
        local burst = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])
        local data = redis.call('hmget', key, 'tokens', 'last_time')
        local tokens = tonumber(data[1]) or burst
        local last_time = tonumber(data[2]) or now
        local elapsed = now - last_time
        tokens = math.min(burst, tokens + elapsed * rate)
        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('hmset', key, 'tokens', tokens, 'last_time', now)
            redis.call('expire', key, 3600)
            return 1
        else
            return 0
        end
        """
        result = self.redis.eval(lua, 1, self.key, self.rate, self.burst, now)
        return bool(result)
```
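A worker can then wrap acquire() in a blocking helper that spins until a token is granted or gives up. This sketch works with any object exposing an acquire() -> bool method like the limiter above; the timeout and poll interval are illustrative:

```python
import time

def acquire_blocking(limiter, timeout: float = 30.0, poll: float = 0.05) -> bool:
    """Spin on the limiter until a token is granted or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if limiter.acquire():
            return True
        time.sleep(poll)  # back off briefly before asking Redis again
    return False
```

Returning False instead of blocking forever lets the worker requeue the task for later rather than stalling on a heavily throttled domain.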
Error Handling and Retry Strategy
At scale, transient errors are a certainty. Implement exponential backoff with jitter:
```python
import asyncio
import random

class MaxRetriesExceeded(Exception):
    pass

async def scrape_with_retries(url: str, max_retries: int = 5) -> dict:
    for attempt in range(max_retries):
        try:
            result = await scrape(url)
            if result["status"] < 400:
                return result
            if result["status"] == 429:  # Rate limited
                wait = min(300, (2 ** attempt) + random.uniform(0, 1))
                await asyncio.sleep(wait)
                continue
            if result["status"] >= 500:  # Server error, retry
                wait = min(60, (2 ** attempt) + random.uniform(0, 1))
                await asyncio.sleep(wait)
                continue
            return result  # 4xx (except 429) — don't retry
        except (ConnectionError, TimeoutError):
            wait = min(60, (2 ** attempt) + random.uniform(0, 1))
            await asyncio.sleep(wait)
    raise MaxRetriesExceeded(url)
```
Monitoring and Alerting
At this scale, you need real-time visibility:
```
Key Metrics:
├── Throughput: pages/minute (total and per-domain)
├── Success Rate: % of requests returning valid data
├── Queue Depth: pending tasks in queue
├── Worker Health: active workers, CPU/memory usage
├── Error Distribution: errors by type and domain
├── Latency: p50, p95, p99 response times
└── Cost: tokens consumed, proxy bandwidth used
```
Set alerts for:
- Queue depth growing faster than processing rate (backlog)
- Success rate dropping below threshold (site blocking)
- Worker count dropping (infrastructure issues)
- Error rate spikes (anti-bot update, site change)
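The alert rules above can be reduced to a periodic check over counters like the stats:success and stats:failed keys from the worker example. A sketch with illustrative thresholds; the stats dict shape here is an assumption, not a fixed schema:

```python
def should_alert(stats: dict,
                 min_success_rate: float = 0.90,
                 max_queue_growth: int = 1000) -> list[str]:
    """Return the names of alert conditions currently firing."""
    alerts = []
    total = stats["success"] + stats["failed"]
    # Success rate dropping usually means the target site is blocking us
    if total and stats["success"] / total < min_success_rate:
        alerts.append("success_rate")
    # Queue growing faster than workers drain it means a backlog is forming
    if stats["queue_depth_now"] - stats["queue_depth_prev"] > max_queue_growth:
        alerts.append("queue_backlog")
    return alerts
```

Run this on a schedule (e.g. every minute) and wire the returned names into whatever paging system you already use.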
Stage 5: Optimized at Scale (2M-10M+ pages/day)
At millions of pages per day, every inefficiency multiplies. This is where using a managed scraping API becomes particularly compelling.
Using FineData for Async Batch Processing
Instead of managing browser pools, proxy rotation, and anti-bot evasion yourself, delegate to an API and focus on orchestration:
```python
import requests
import asyncio   # used by the async polling example below
import aiohttp

API_URL = "https://api.finedata.ai/api/v1"
HEADERS = {
    "x-api-key": "fd_your_api_key",
    "Content-Type": "application/json"
}

# Submit batch job
def submit_batch(urls: list[str]) -> dict:
    response = requests.post(
        f"{API_URL}/batch",
        headers=HEADERS,
        json={
            "urls": urls,
            "use_js_render": False,
            "callback_url": "https://your-app.com/webhook/batch-complete"
        }
    )
    return response.json()

# Process in chunks of 100
all_urls = load_urls()  # 100,000+ URLs
for i in range(0, len(all_urls), 100):
    chunk = all_urls[i:i+100]
    result = submit_batch(chunk)
    print(f"Batch {result['batch_id']} submitted: {len(chunk)} URLs")
```
For real-time processing where callbacks are not suitable, use async jobs with polling:
```python
import asyncio
import aiohttp

class ScrapingError(Exception):
    pass

async def scrape_async_with_polling(url: str) -> dict:
    async with aiohttp.ClientSession() as session:
        # Submit async job
        resp = await session.post(
            f"{API_URL}/scrape/async",
            headers=HEADERS,
            json={"url": url, "use_js_render": True}
        )
        job = await resp.json()
        job_id = job["job_id"]
        # Poll for completion
        while True:
            resp = await session.get(
                f"{API_URL}/jobs/{job_id}",
                headers=HEADERS
            )
            status = await resp.json()
            if status["status"] == "completed":
                return status["result"]
            elif status["status"] == "failed":
                raise ScrapingError(status["error"])
            await asyncio.sleep(2)  # Poll interval
```
Cost Optimization at Scale
When processing millions of pages, small optimizations compound:
1. Classify URLs by difficulty before scraping.
```python
# Route easy targets through the cheap path, hard targets through the full path
EASY_DOMAINS = {"example.com", "data.gov", "wikipedia.org"}

def get_scrape_config(url: str) -> dict:
    domain = extract_domain(url)
    if domain in EASY_DOMAINS:
        return {"use_js_render": False, "use_residential": False}  # 1 token
    else:
        return {"use_js_render": True, "use_residential": True}    # 9 tokens
```
2. Cache aggressively. If a page has not changed (check via ETags or Last-Modified), skip re-scraping.
3. Request only what you need. Disable JS rendering for static HTML pages. Skip residential proxies for unprotected sites.
4. Use batch operations. Batch API calls have lower per-request overhead than individual calls.
5. Schedule smart. Many sites have lower traffic (and lower anti-bot sensitivity) during off-peak hours. Schedule large scraping jobs during these windows.
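The ETag caching in point 2 amounts to a conditional GET: send If-None-Match with the stored ETag and skip processing when the server answers 304 Not Modified. A sketch, with the HTTP call injectable so it can be tested offline; fetch_if_changed and etag_cache are illustrative names, not part of any library:

```python
import requests

def fetch_if_changed(url: str, etag_cache: dict, http_get=requests.get):
    """Conditional GET: returns None when the server says the page is unchanged."""
    headers = {}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    response = http_get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None  # unchanged — skip re-parsing and storage entirely
    etag = response.headers.get("ETag")
    if etag:
        etag_cache[url] = etag  # remember the validator for next time
    return response
```

A 304 costs almost no bandwidth and no parsing time, so on slowly changing sites this alone can cut effective scraping cost dramatically.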
Architecture at 10M Pages/Day
```
┌──────────────────────────────────────────────┐
│                 Orchestrator                 │
│  (URL scheduling, priority, deduplication)   │
└────────────┬─────────────────┬───────────────┘
             │                 │
     ┌───────▼────────┐  ┌─────▼────────┐
     │  Direct Queue  │  │  API Queue   │
     │ (Easy targets) │  │(Hard targets)│
     └───────┬────────┘  └─────┬────────┘
             │                 │
     ┌───────▼────────┐  ┌─────▼────────┐
     │  DIY Workers   │  │   FineData   │
     │ (aiohttp+proxy)│  │  Batch API   │
     └───────┬────────┘  └─────┬────────┘
             │                 │
     ┌───────▼─────────────────▼────────┐
     │         Results Pipeline         │
     │    (Validate → Parse → Store)    │
     └──────────────────────────────────┘
```
This hybrid architecture routes easy targets through a lightweight in-house scraper (minimal cost) and hard targets through FineData’s API (reliable, maintained anti-bot bypass). The results pipeline is unified — regardless of how a page was fetched, it goes through the same validation, parsing, and storage flow.
Common Pitfalls at Scale
1. Ignoring politeness. Aggressive scraping that crashes or degrades target sites invites IP bans, legal action, and technical countermeasures. Rate limit per domain, respect robots.txt crawl-delay directives, and scrape during off-peak hours when possible.
2. No deduplication. Without deduplication, URL discovery processes (sitemaps, link following) generate duplicate work. Use a Bloom filter or Redis set for efficient membership testing:
```python
import redis

r = redis.Redis()

def is_new_url(url: str) -> bool:
    return r.sadd("seen_urls", url) == 1  # Returns 1 if newly added
```
3. Unbounded retries. Failed URLs that keep retrying clog the queue. Implement a dead letter queue for URLs that fail after max retries, and review them separately.
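The retry-or-park decision can live in a small pure function, which keeps the worker loop simple and testable. A sketch; the queue names and MAX_RETRIES mirror the worker example earlier, but the helper itself is illustrative:

```python
import json

MAX_RETRIES = 3

def route_failed_task(task: dict, error: str) -> tuple[str, str]:
    """Decide whether a failed task is retried or parked on the dead letter queue."""
    retries = task.get("retries", 0)
    if retries < MAX_RETRIES:
        # Retry: bump the counter and send it back to the main queue
        return ("scrape_queue", json.dumps({**task, "retries": retries + 1}))
    # Give up: park it with the error for later human review
    return ("dead_letter_queue", json.dumps({**task, "error": error}))

# Usage with Redis (assuming client `r`):
#   queue, payload = route_failed_task(task, str(e))
#   r.lpush(queue, payload)
```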
4. Monolithic parsing. Coupling fetching and parsing makes the system fragile — a parsing bug on one site can block all processing. Separate fetching (get HTML) from parsing (extract data) into different pipeline stages.
5. No backpressure. If workers produce results faster than downstream systems can process them, memory fills up and the system crashes. Implement backpressure at every stage — queue depth limits, buffer sizes, and flow control.
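Producer-side backpressure can be as simple as refusing to enqueue while the downstream queue is over a depth limit. A sketch with injectable push and depth callables so it works against any queue; the limits are illustrative:

```python
import time

def enqueue_with_backpressure(push, depth, item,
                              max_depth: int = 10_000,
                              poll: float = 0.1,
                              timeout: float = 60.0) -> bool:
    """Block the producer while the downstream queue is over its depth limit.

    Returns False (sheds load) if the queue stays full past the timeout,
    instead of letting memory grow without bound.
    """
    deadline = time.monotonic() + timeout
    while depth() >= max_depth:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)  # wait for consumers to drain the queue
    push(item)
    return True
```

With Redis, push would be a partial over r.lpush and depth a call to r.llen on the same key.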
Key Takeaways
- Architecture evolves with scale. Do not over-engineer at 1K pages/day, but plan for the transition points.
- Queues are the backbone. A persistent task queue transforms a fragile script into a resilient system.
- Rate limiting is not optional. Both for politeness and for avoiding bans.
- Hybrid approaches win at scale. Easy targets through lightweight scrapers, hard targets through managed APIs.
- Monitor everything. At millions of pages/day, you need real-time visibility into every component.
The path from 1K to 10M pages per day is not linear — it requires fundamental architectural changes at each order of magnitude. Start simple, instrument everything, and evolve the architecture as your actual requirements (not hypothetical ones) demand it.
Ready to scale without managing infrastructure? FineData’s batch and async APIs handle millions of pages with built-in rate limiting, proxy rotation, and anti-bot bypass. Get started free.