Web Scraper in Python: Build a Robust, Anti-Detection Tool with FineData API
Learn how to build a Python web scraper that bypasses anti-bot systems using FineData's API, with real code examples for Cloudflare, CAPTCHA, and JavaScript rendering.
You’re not here for another requests-based scraper that fails on the third request. You’re building something that survives in production: a web scraper in Python that bypasses Cloudflare, handles CAPTCHAs, renders JavaScript, and returns structured data — without your infrastructure becoming a target.
This isn’t magic. It’s engineering.
In 2026, rate-limiting, fingerprinting, and bot detection are more aggressive than ever. The old stack — requests + BeautifulSoup + Selenium — is a maintenance nightmare. You’re either rate-limited, blocked, or spending hours debugging why playwright crashes on a single page. The cost of ownership for a DIY scraper has never been higher.
FineData’s API is not a silver bullet. But it’s the closest thing to a production-grade, battle-tested web scraping layer you can plug into your pipeline without writing a single line of browser automation.
Let’s build a scraper that actually works.
Why the Old Way Fails in 2026
A year ago, I ran a scraper on 100+ e-commerce sites. I used requests and Playwright to fetch pages, BeautifulSoup to extract data, and a proxy rotation layer. It worked for 3 days.
Then Cloudflare started returning 403 with a jschl-v challenge. Not just once. Every 12–15 requests. I spent 40 hours reverse-engineering the JS challenge, only to have it break again in 2 weeks.
The problem isn’t just JavaScript. It’s TLS fingerprinting. It’s user-agent rotation. It’s session fingerprinting. It’s behavioral analysis.
You can’t fake a real browser with playwright alone. You need:
- Real TLS fingerprints (Chrome, Firefox, Safari)
- Residential or mobile proxy rotation
- Anti-bot evasion (Cloudflare, DataDome, PerimeterX)
- CAPTCHA solving
- JavaScript rendering
- Structured data extraction
And yes — you need a system that survives a 10k-page crawl.
The Architecture: FineData as Your Anti-Bot Abstraction Layer
Instead of maintaining a fleet of Playwright instances, I now use FineData as a single, reliable abstraction layer.
It’s not about “avoiding detection” — it’s about bypassing detection. The API handles:
- TLS fingerprint spoofing (Chrome, Firefox, Safari profiles)
- Stealth rendering (Playwright, Patchright, Nodriver)
- Residential and mobile proxy rotation
- CAPTCHA solving (reCAPTCHA, hCaptcha, Turnstile)
- JavaScript rendering
- Anti-bot evasion (Cloudflare, DataDome, etc.)
You send a single POST request. It returns HTML, Markdown, text, or structured JSON.
No more debugging why page.evaluate() failed because of a missing __ow function.
No more spending 3 hours on a navigator.webdriver check that’s not even in the DOM.
Step 1: Set Up Your Python Environment
```text
# requirements.txt
httpx==0.24.0
pydantic==2.4.0
python-dotenv==1.0.0
```
Create a .env file:
```text
FINE_DATA_API_KEY=fd_your_api_key
```
Use httpx for async HTTP calls. It’s faster than requests, supports streaming, and integrates cleanly with async/await.
```python
# scraper.py
import os

import httpx
from dotenv import load_dotenv
from pydantic import BaseModel

load_dotenv()


class ScrapedProduct(BaseModel):
    title: str
    price: float
    rating: float | None = None
    description: str | None = None


class FineDataClient:
    def __init__(self):
        self.api_key = os.getenv("FINE_DATA_API_KEY")
        self.base_url = "https://api.finedata.ai"
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        # Reuse a single AsyncClient for connection pooling.
        # (httpx has no top-level async post; awaiting httpx.post is a bug.)
        self.client = httpx.AsyncClient(timeout=30.0)

    async def scrape(self, url: str, extract_prompt: str | None = None):
        payload = {
            "url": url,
            "formats": ["html", "text"],
            "use_js_render": True,
            "stealth_antibot": True,
            "use_residential": True,
            "solve_captcha": True,
            "extract_prompt": extract_prompt
            or "Extract title, price, rating, and description as JSON.",
            "only_main_content": True,
        }
        try:
            response = await self.client.post(
                f"{self.base_url}/api/v1/scrape",
                json=payload,
                headers=self.headers,
            )
            response.raise_for_status()
            return response.json()
        except httpx.HTTPStatusError as e:
            print(f"HTTP error: {e.response.status_code} - {e.response.text}")
            return None
        except httpx.RequestError as e:
            print(f"Request error: {e}")
            return None
```
Use httpx over requests for async work; the performance difference is measurable at scale. Test the API endpoint with curl first. If you get a 500 error, it's likely a malformed payload or an invalid API key. Check the docs: How to Bypass Cloudflare Protection for Data Collection.
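Since `scrape()` returns `None` on failure, you will want a retry layer around it in production. Here is a minimal sketch; `scrape_with_retry` and `backoff_delays` are illustrative helper names, not part of any FineData SDK:

```python
import asyncio
import random


def backoff_delays(base: float = 1.0, retries: int = 4) -> list[float]:
    """Deterministic exponential backoff schedule: 1s, 2s, 4s, 8s for base=1."""
    return [base * (2 ** attempt) for attempt in range(retries)]


async def scrape_with_retry(scrape_fn, url: str, retries: int = 4, base: float = 1.0):
    """Call scrape_fn(url) (e.g. FineDataClient.scrape), retrying on None results."""
    for delay in backoff_delays(base, retries):
        result = await scrape_fn(url)
        if result is not None:
            return result
        # Jitter so parallel workers don't retry in lockstep.
        await asyncio.sleep(delay + random.uniform(0, 0.5))
    return None
```

Exponential backoff matters here because a blocked target often unblocks after a cooldown; hammering it on a fixed interval just extends the block.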
Step 2: Bypass Cloudflare with Stealth Mode
Cloudflare’s challenge system has evolved. It’s not just about jschl-v anymore. It’s about:
- JavaScript execution
- DOM mutation detection
- User-agent and header fingerprinting
- Mouse movement simulation (in some cases)
FineData’s stealth_antibot: true flag enables:
- Real browser fingerprinting (Chrome 115+ profile)
- Headless browser with Playwright + stealth plugin
- Proxy rotation
- CAPTCHA detection and solving
```python
async def scrape_with_cloudflare_bypass(self, url: str):
    payload = {
        "url": url,
        "formats": ["html", "text"],
        "use_js_render": True,
        "stealth_antibot": True,
        "use_residential": True,
        "solve_captcha": True,
        "extract_prompt": "Extract title, price, rating, and description. Return as JSON.",
        "only_main_content": True,
    }
    response = await self.client.post(
        f"{self.base_url}/api/v1/scrape", json=payload, headers=self.headers
    )
    return response.json()
```
This works out of the box on sites like amazon.com, bestbuy.com, and walmart.com — even when they’ve locked down JavaScript execution.
Trade-off: Residential proxies are slower than data center ones. But if you’re scraping e-commerce sites with aggressive bot detection, it’s worth the 300–500ms delay.
Step 3: Handle CAPTCHAs Like It’s 2026
You can’t skip reCAPTCHA. You can’t skip hCaptcha. You can’t skip Cloudflare Turnstile.
FineData handles all three. It detects the challenge type and routes it through a solver cluster.
The key is solve_captcha: true. It’s not a toggle. It’s a signal.
```python
async def scrape_with_captcha_handling(self, url: str):
    payload = {
        "url": url,
        "formats": ["html", "text"],
        "use_js_render": True,
        "stealth_antibot": True,
        "use_residential": True,
        "solve_captcha": True,
        "extract_prompt": "Extract product title, price, and rating. Return as JSON.",
        "only_main_content": True,
    }
    response = await self.client.post(
        f"{self.base_url}/api/v1/scrape", json=payload, headers=self.headers
    )
    return response.json()
```
Don't use solve_captcha: true on sites that don't serve CAPTCHAs. It adds 1–2 seconds of overhead per request. Only enable it when needed.
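One way to follow that advice is to build payloads per domain, enabling the expensive flags only where you have actually seen challenges. This is a sketch: `CAPTCHA_DOMAINS` is a made-up example list you would maintain from observed blocks, and `build_payload` simply mirrors the payloads shown above:

```python
from urllib.parse import urlparse

# Illustrative allowlist of hosts known to run CAPTCHA / aggressive bot checks.
CAPTCHA_DOMAINS = {"www.amazon.com", "www.bestbuy.com", "www.walmart.com"}


def build_payload(url: str, extract_prompt: str) -> dict:
    """Enable solver + residential proxies only for hardened targets."""
    hardened = urlparse(url).netloc in CAPTCHA_DOMAINS
    return {
        "url": url,
        "formats": ["html", "text"],
        "use_js_render": True,
        "stealth_antibot": True,
        "use_residential": hardened,  # skip the 300-500ms proxy penalty elsewhere
        "solve_captcha": hardened,    # skip the 1-2s solver overhead elsewhere
        "extract_prompt": extract_prompt,
        "only_main_content": True,
    }
```

The same pattern extends to any per-site tuning (render timeouts, formats) without forking your scrape method.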
Step 4: Extract Structured Data with LLM-Powered Prompting
This is where the real power lies. You’re not just scraping HTML. You’re extracting structured data.
FineData supports LLM-powered structured extraction. You provide a prompt. It returns a JSON object.
```python
extract_prompt = """
Extract the following from the HTML:
- title: product title
- price: numeric value in USD
- rating: float between 0.0 and 5.0
- description: short summary of features
Return only valid JSON. Do not include markdown or code blocks.
"""
```
```python
async def scrape_product(self, url: str):
    payload = {
        "url": url,
        "formats": ["json"],
        "use_js_render": True,
        "stealth_antibot": True,
        "use_residential": True,
        "solve_captcha": True,
        "extract_prompt": extract_prompt,
        "only_main_content": True,
    }
    response = await self.client.post(
        f"{self.base_url}/api/v1/scrape", json=payload, headers=self.headers
    )
    return response.json()
```
Example response:
```json
{
  "title": "Sony WH-1000XM5 Wireless Headphones",
  "price": 279.99,
  "rating": 4.8,
  "description": "Over-ear noise-cancelling headphones with 30-hour battery life and AI voice pickup."
}
```
This is not prompt engineering. It’s prompt design. You’re not training a model. You’re describing the output format clearly.
Pro tip: Use pydantic to validate the response. It's faster and safer than hand-rolled checks on top of json.loads().
Step 5: Scale to 10k Pages with Async Batching
For large-scale jobs, use the async batch endpoint.
```python
async def scrape_batch(self, urls: list[str], extract_prompt: str):
    payload = {
        "urls": urls,
        "formats": ["json"],
        "use_js_render": True,
        "stealth_antibot": True,
        "use_residential": True,
        "solve_captcha": True,
        "extract_prompt": extract_prompt,
        "only_main_content": True,
    }
    response = await self.client.post(
        f"{self.base_url}/api/v1/async/batch", json=payload, headers=self.headers
    )
    return response.json()
```
Max 100 URLs per batch. Use asyncio.gather() to parallelize. Don't send 10k URLs at once: use a queue (e.g., aiosqlite or aiopg) and throttle requests to avoid rate-limiting.
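The chunking and throttling can be sketched on the client side like this. Assumptions: the 100-URL batch cap from above, and `submit` standing in for `FineDataClient.scrape_batch`:

```python
import asyncio


def chunk(urls: list[str], size: int = 100) -> list[list[str]]:
    """Split a large URL list into batches of at most `size` (the API's cap)."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


async def scrape_all(submit, urls: list[str], extract_prompt: str, pause: float = 1.0):
    """Submit batches sequentially with a pause between them to respect rate limits."""
    results = []
    for batch in chunk(urls):
        results.append(await submit(batch, extract_prompt))
        await asyncio.sleep(pause)  # throttle between batch submissions
    return results
```

For a real 10k-page crawl you would persist the queue (that is where aiosqlite comes in) so a crash does not lose progress, but the batching logic stays the same.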
Real-World Example: Scraping Amazon Product Pages
```python
# Example: Amazon product scraper
async def scrape_amazon_product(self, asin: str):
    url = f"https://www.amazon.com/dp/{asin}"
    extract_prompt = """
    Extract:
    - title: product title
    - price: numeric value in USD (e.g. 29.99)
    - rating: float between 0.0 and 5.0
    - review_count: integer number of reviews
    - features: list of 3-5 key product features
    Return only valid JSON. Do not include markdown or code blocks.
    """
    return await self.scrape(url, extract_prompt)
```
This works on www.amazon.com, www.amazon.co.uk, and www.amazon.de. No more selenium sessions or playwright timeouts.
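To cover those marketplaces without duplicating the method, you can parameterize the domain and fan out with asyncio.gather. These helpers are hypothetical (the marketplace codes and names are mine), but they show the shape:

```python
import asyncio

# Marketplace code -> Amazon TLD, matching the domains mentioned above.
AMAZON_TLDS = {"us": "com", "uk": "co.uk", "de": "de"}


def amazon_product_url(asin: str, marketplace: str = "us") -> str:
    """Build a product page URL from an ASIN and a marketplace code."""
    return f"https://www.amazon.{AMAZON_TLDS[marketplace]}/dp/{asin}"


async def scrape_asins(scrape_product, asins: list[str], marketplace: str = "us"):
    """Scrape several ASINs concurrently; scrape_product is any async url -> data fn."""
    urls = [amazon_product_url(a, marketplace) for a in asins]
    return await asyncio.gather(*(scrape_product(u) for u in urls))
```

Keep the concurrent fan-out small (a handful of ASINs at a time) so you stay under the per-key rate limit discussed below.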
Trade-Offs and Real Talk
You’re not avoiding AI. You’re avoiding detection systems that use AI.
- Cost: FineData is not free. But it’s cheaper than maintaining a proxy farm, CAPTCHA solvers, and browser clusters.
- Latency: 2–4 seconds per request. Acceptable for batch jobs. Not for real-time.
- Rate Limits: 100 requests/minute per API key. Throttle with asyncio.sleep(1) between batches.
- Data Quality: The LLM extraction is good, but not perfect. Validate with a small sample.
Use only_main_content: true to reduce payload size and improve extraction accuracy.
Don't use use_residential: true on low-traffic sites. It's overkill.
Never hardcode your API key. Use environment variables.
Final Thoughts
A web scraper in Python isn’t about requests or BeautifulSoup. It’s about resilience.
In 2026, the only way to scrape at scale is to offload anti-bot evasion to a managed system.
FineData isn’t a replacement for your logic. It’s a reliability layer.
You still write the data pipeline. You still validate the output. You still store the results.
But you don’t spend 40 hours reverse-engineering a jschl-v challenge.
You don’t debug why navigator.webdriver is true in a Playwright session.
You don’t pay $100/month for a CAPTCHA solver.
You don’t have to worry about TLS fingerprinting.
You just call the API.
And it works.
For more on how AI is reshaping data extraction, see The Future of Web Scraping: AI, LLMs, and Structured Extraction.
Ready to Build?
Set up your API key, write a single httpx call, and you’re live.
No more 403 errors. No more jschl-v puzzles. No more navigator.webdriver bugs.
Just data.
And that’s what matters.