Web Scraper in Python: Build a Robust, Anti-Detection Tool with FineData API
Learn how to build a Python web scraper that bypasses anti-bot systems using FineData's API, with real code examples for Cloudflare, CAPTCHA, and JavaScript rendering.
You’re not here for another requests-based scraper that fails on the third request. You’re building something that survives in production: a web scraper in Python that bypasses Cloudflare, handles CAPTCHAs, renders JavaScript, and returns structured data — without your infrastructure becoming a target.
This isn’t magic. It’s engineering.
In 2026, rate-limiting, fingerprinting, and bot detection are more aggressive than ever. The old stack — requests + BeautifulSoup + Selenium — is a maintenance nightmare. You’re either rate-limited, blocked, or spending hours debugging why playwright crashes on a single page. The cost of ownership for a DIY scraper has never been higher.
FineData’s API is not a silver bullet. But it’s the closest thing to a production-grade, battle-tested web scraping layer you can plug into your pipeline without writing a single line of browser automation.
Let’s build a scraper that actually works.
Why the Old Way Fails in 2026
A year ago, I ran a scraper on 100+ e-commerce sites. I used requests and Playwright to fetch pages, BeautifulSoup to extract data, and a proxy rotation layer. It worked for 3 days.
Then Cloudflare started returning 403 with a jschl-v challenge. Not just once. Every 12–15 requests. I spent 40 hours reverse-engineering the JS challenge, only to have it break again in 2 weeks.
The problem isn’t just JavaScript. It’s TLS fingerprinting. It’s user-agent rotation. It’s session fingerprinting. It’s behavioral analysis.
You can’t fake a real browser with playwright alone. You need:
- Real TLS fingerprints (Chrome, Firefox, Safari)
- Residential or mobile proxy rotation
- Anti-bot evasion (Cloudflare, DataDome, PerimeterX)
- CAPTCHA solving
- JavaScript rendering
- Structured data extraction
And yes — you need a system that survives a 10k-page crawl.
The Architecture: FineData as Your Anti-Bot Abstraction Layer
Instead of maintaining a fleet of Playwright instances, I now use FineData as a single, reliable abstraction layer.
It’s not about “avoiding detection” — it’s about bypassing detection. The API handles:
- TLS fingerprint spoofing (Chrome, Firefox, Safari profiles)
- Stealth rendering (Playwright, Patchright, Nodriver)
- Residential and mobile proxy rotation
- CAPTCHA solving (reCAPTCHA, hCaptcha, Turnstile)
- JavaScript rendering
- Anti-bot evasion (Cloudflare, DataDome, etc.)
You send a single POST request. It returns HTML, Markdown, text, or structured JSON.
No more debugging why page.evaluate() failed because of a missing __ow function.
No more spending 3 hours on a navigator.webdriver check that’s not even in the DOM.
Step 1: Set Up Your Python Environment
```text
# requirements.txt
httpx==0.24.0
pydantic==2.4.0
python-dotenv==1.0.0
```
Create a .env file:
```text
FINE_DATA_API_KEY=fd_your_api_key
```
Use httpx for async HTTP calls. It’s faster than requests, supports streaming, and integrates cleanly with async/await.
```python
# scraper.py
import os

import httpx
from dotenv import load_dotenv
from pydantic import BaseModel

load_dotenv()


class ScrapedProduct(BaseModel):
    title: str
    price: float
    rating: float | None = None
    description: str | None = None


class FineDataClient:
    def __init__(self):
        self.api_key = os.getenv("FINE_DATA_API_KEY")
        self.base_url = "https://api.finedata.ai"
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        # Reuse a single AsyncClient for connection pooling.
        # (httpx has no top-level async post; awaiting httpx.post is a bug.)
        self.client = httpx.AsyncClient(timeout=30.0)

    async def scrape(self, url: str, extract_prompt: str | None = None):
        payload = {
            "url": url,
            "formats": ["html", "text"],
            "use_js_render": True,
            "stealth_antibot": True,
            "use_residential": True,
            "solve_captcha": True,
            "extract_prompt": extract_prompt
            or "Extract title, price, rating, and description as JSON.",
            "only_main_content": True,
        }
        try:
            response = await self.client.post(
                f"{self.base_url}/api/v1/scrape",
                json=payload,
                headers=self.headers,
            )
            response.raise_for_status()
            return response.json()
        except httpx.HTTPStatusError as e:
            print(f"HTTP error: {e.response.status_code} - {e.response.text}")
            return None
        except httpx.RequestError as e:
            print(f"Request error: {e}")
            return None
```
Use httpx over requests for async work; the performance difference is measurable at scale. Test the API endpoint with curl first. If you get a 500 error, it's likely a malformed payload or an invalid API key. Check the docs: How to Bypass Cloudflare Protection for Data Collection.
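Since `scrape()` returns `None` on failure, you will want a retry layer around it in production. Here is a minimal sketch; `scrape_with_retry` and `backoff_delays` are illustrative helper names, not part of any FineData SDK:

```python
import asyncio
import random


def backoff_delays(base: float = 1.0, retries: int = 4) -> list[float]:
    """Deterministic exponential backoff schedule: 1s, 2s, 4s, 8s for base=1."""
    return [base * (2 ** attempt) for attempt in range(retries)]


async def scrape_with_retry(scrape_fn, url: str, retries: int = 4, base: float = 1.0):
    """Call scrape_fn(url) (e.g. FineDataClient.scrape), retrying on None results."""
    for delay in backoff_delays(base, retries):
        result = await scrape_fn(url)
        if result is not None:
            return result
        # Jitter so parallel workers don't retry in lockstep.
        await asyncio.sleep(delay + random.uniform(0, 0.5))
    return None
```

Exponential backoff matters here because a blocked target often unblocks after a cooldown; hammering it on a fixed interval just extends the block.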
Step 2: Bypass Cloudflare with Stealth Mode
Cloudflare’s challenge system has evolved. It’s not just about jschl-v anymore. It’s about:
- JavaScript execution
- DOM mutation detection
- User-agent and header fingerprinting
- Mouse movement simulation (in some cases)
FineData’s stealth_antibot: true flag enables:
- Real browser fingerprinting (Chrome 115+ profile)
- Headless browser with Playwright + stealth plugin
- Proxy rotation
- CAPTCHA detection and solving
```python
async def scrape_with_cloudflare_bypass(self, url: str):
    payload = {
        "url": url,
        "formats": ["html", "text"],
        "use_js_render": True,
        "stealth_antibot": True,
        "use_residential": True,
        "solve_captcha": True,
        "extract_prompt": "Extract title, price, rating, and description. Return as JSON.",
        "only_main_content": True,
    }
    response = await self.client.post(
        f"{self.base_url}/api/v1/scrape", json=payload, headers=self.headers
    )
    return response.json()
```
This works out of the box on sites like amazon.com, bestbuy.com, and walmart.com — even when they’ve locked down JavaScript execution.
Trade-off: Residential proxies are slower than data center ones. But if you’re scraping e-commerce sites with aggressive bot detection, it’s worth the 300–500ms delay.
Step 3: Handle CAPTCHAs Like It’s 2026
You can’t skip reCAPTCHA. You can’t skip hCaptcha. You can’t skip Cloudflare Turnstile.
FineData handles all three. It detects the challenge type and routes it through a solver cluster.
The key is solve_captcha: true. It’s not a toggle. It’s a signal.
```python
async def scrape_with_captcha_handling(self, url: str):
    payload = {
        "url": url,
        "formats": ["html", "text"],
        "use_js_render": True,
        "stealth_antibot": True,
        "use_residential": True,
        "solve_captcha": True,
        "extract_prompt": "Extract product title, price, and rating. Return as JSON.",
        "only_main_content": True,
    }
    response = await self.client.post(
        f"{self.base_url}/api/v1/scrape", json=payload, headers=self.headers
    )
    return response.json()
```
Don't use solve_captcha: true on sites that don't serve CAPTCHAs. It adds 1–2 seconds of overhead per request. Only enable it when needed.
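One way to follow that advice is to build payloads per domain, enabling the expensive flags only where you have actually seen challenges. This is a sketch: `CAPTCHA_DOMAINS` is a made-up example list you would maintain from observed blocks, and `build_payload` simply mirrors the payloads shown above:

```python
from urllib.parse import urlparse

# Illustrative allowlist of hosts known to run CAPTCHA / aggressive bot checks.
CAPTCHA_DOMAINS = {"www.amazon.com", "www.bestbuy.com", "www.walmart.com"}


def build_payload(url: str, extract_prompt: str) -> dict:
    """Enable solver + residential proxies only for hardened targets."""
    hardened = urlparse(url).netloc in CAPTCHA_DOMAINS
    return {
        "url": url,
        "formats": ["html", "text"],
        "use_js_render": True,
        "stealth_antibot": True,
        "use_residential": hardened,  # skip the 300-500ms proxy penalty elsewhere
        "solve_captcha": hardened,    # skip the 1-2s solver overhead elsewhere
        "extract_prompt": extract_prompt,
        "only_main_content": True,
    }
```

The same pattern extends to any per-site tuning (render timeouts, formats) without forking your scrape method.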
Step 4: Extract Structured Data with LLM-Powered Prompting
This is where the real power lies. You’re not just scraping HTML. You’re extracting structured data.
FineData supports LLM-powered structured extraction. You provide a prompt. It returns a JSON object.
```python
extract_prompt = """
Extract the following from the HTML:
- title: product title
- price: numeric value in USD
- rating: float between 0.0 and 5.0
- description: short summary of features
Return only valid JSON. Do not include markdown or code blocks.
"""
```
```python
async def scrape_product(self, url: str):
    payload = {
        "url": url,
        "formats": ["json"],
        "use_js_render": True,
        "stealth_antibot": True,
        "use_residential": True,
        "solve_captcha": True,
        "extract_prompt": extract_prompt,
        "only_main_content": True,
    }
    response = await self.client.post(
        f"{self.base_url}/api/v1/scrape", json=payload, headers=self.headers
    )
    return response.json()
```
Example response:
```json
{
  "title": "Sony WH-1000XM5 Wireless Headphones",
  "price": 279.99,
  "rating": 4.8,
  "description": "Over-ear noise-cancelling headphones with 30-hour battery life and AI voice pickup."
}
```
This is not prompt engineering. It’s prompt design. You’re not training a model. You’re describing the output format clearly.
Pro tip: Use pydantic to validate the response. It's faster and safer than hand-rolled checks on top of json.loads().
Step 5: Scale to 10k Pages with Async Batching
For large-scale jobs, use the async batch endpoint.
```python
async def scrape_batch(self, urls: list[str], extract_prompt: str):
    payload = {
        "urls": urls,
        "formats": ["json"],
        "use_js_render": True,
        "stealth_antibot": True,
        "use_residential": True,
        "solve_captcha": True,
        "extract_prompt": extract_prompt,
        "only_main_content": True,
    }
    response = await self.client.post(
        f"{self.base_url}/api/v1/async/batch", json=payload, headers=self.headers
    )
    return response.json()
```
Max 100 URLs per batch. Use asyncio.gather() to parallelize. Don't send 10k URLs at once: use a queue (e.g., aiosqlite or aiopg) and throttle requests to avoid rate-limiting.
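The chunking and throttling can be sketched on the client side like this. Assumptions: the 100-URL batch cap from above, and `submit` standing in for `FineDataClient.scrape_batch`:

```python
import asyncio


def chunk(urls: list[str], size: int = 100) -> list[list[str]]:
    """Split a large URL list into batches of at most `size` (the API's cap)."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


async def scrape_all(submit, urls: list[str], extract_prompt: str, pause: float = 1.0):
    """Submit batches sequentially with a pause between them to respect rate limits."""
    results = []
    for batch in chunk(urls):
        results.append(await submit(batch, extract_prompt))
        await asyncio.sleep(pause)  # throttle between batch submissions
    return results
```

For a real 10k-page crawl you would persist the queue (that is where aiosqlite comes in) so a crash does not lose progress, but the batching logic stays the same.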
Real-World Example: Scraping Amazon Product Pages
```python
# Example: Amazon product scraper
async def scrape_amazon_product(self, asin: str):
    url = f"https://www.amazon.com/dp/{asin}"
    extract_prompt = """
    Extract:
    - title: product title
    - price: numeric value in USD (e.g. 29.99)
    - rating: float between 0.0 and 5.0
    - review_count: integer number of reviews
    - features: list of 3-5 key product features
    Return only valid JSON. Do not include markdown or code blocks.
    """
    return await self.scrape(url, extract_prompt)
```
This works on www.amazon.com, www.amazon.co.uk, and www.amazon.de. No more selenium sessions or playwright timeouts.
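To cover those marketplaces without duplicating the method, you can parameterize the domain and fan out with asyncio.gather. These helpers are hypothetical (the marketplace codes and names are mine), but they show the shape:

```python
import asyncio

# Marketplace code -> Amazon TLD, matching the domains mentioned above.
AMAZON_TLDS = {"us": "com", "uk": "co.uk", "de": "de"}


def amazon_product_url(asin: str, marketplace: str = "us") -> str:
    """Build a product page URL from an ASIN and a marketplace code."""
    return f"https://www.amazon.{AMAZON_TLDS[marketplace]}/dp/{asin}"


async def scrape_asins(scrape_product, asins: list[str], marketplace: str = "us"):
    """Scrape several ASINs concurrently; scrape_product is any async url -> data fn."""
    urls = [amazon_product_url(a, marketplace) for a in asins]
    return await asyncio.gather(*(scrape_product(u) for u in urls))
```

Keep the concurrent fan-out small (a handful of ASINs at a time) so you stay under the per-key rate limit discussed below.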
Trade-Offs and Real Talk
You’re not avoiding AI. You’re avoiding detection systems that use AI.
- Cost: FineData is not free. But it’s cheaper than maintaining a proxy farm, CAPTCHA solvers, and browser clusters.
- Latency: 2–4 seconds per request. Acceptable for batch jobs. Not for real-time.
- Rate Limits: 100 requests/minute per API key. Throttle with asyncio.sleep(1) between batches.
- Data Quality: The LLM extraction is good, but not perfect. Validate with a small sample.
Use only_main_content: true to reduce payload size and improve extraction accuracy.
Don't use use_residential: true on low-traffic sites. It's overkill.
Never hardcode your API key. Use environment variables.
Final Thoughts
A web scraper in Python isn’t about requests or BeautifulSoup. It’s about resilience.
In 2026, the only way to scrape at scale is to offload anti-bot evasion to a managed system.
FineData isn’t a replacement for your logic. It’s a reliability layer.
You still write the data pipeline. You still validate the output. You still store the results.
But you don’t spend 40 hours reverse-engineering a jschl-v challenge.
You don’t debug why navigator.webdriver is true in a Playwright session.
You don’t pay $100/month for a CAPTCHA solver.
You don’t have to worry about TLS fingerprinting.
You just call the API.
And it works.
For more on how AI is reshaping data extraction, see The Future of Web Scraping: AI, LLMs, and Structured Extraction.
Ready to Build?
Set up your API key, write a single httpx call, and you’re live.
No more 403 errors. No more jschl-v puzzles. No more navigator.webdriver bugs.
Just data.
And that’s what matters.