Tutorial

How to Scrape Amazon Product Data with Python in 2026

Learn how to extract Amazon product data including titles, prices, reviews, and ratings using Python. Complete tutorial with code examples.

FineData Team


Amazon is the world’s largest online marketplace with over 350 million products. Whether you’re building a price comparison tool, doing competitive research, or feeding an analytics pipeline, Amazon product data is incredibly valuable. But extracting it programmatically is one of the hardest web scraping challenges out there.

In this guide, you’ll learn how to scrape Amazon product data reliably using Python and the FineData API, from a single product page to thousands of listings at scale.

Why Scraping Amazon Is So Challenging

Amazon invests heavily in anti-bot technology. If you’ve tried scraping Amazon with a simple requests.get(), you’ve likely seen one of these:

  • CAPTCHA pages — Amazon serves CAPTCHAs aggressively to suspected bots
  • IP bans — Datacenter IPs get blocked within a few dozen requests
  • Dynamic content — Product details, reviews, and pricing are loaded via JavaScript
  • Request fingerprinting — Amazon inspects TLS fingerprints, headers, and browser characteristics
  • Rate limiting — Even with rotating proxies, too many requests trigger throttling

A naive approach might work for 10 requests, but it will fail at any meaningful scale. Let’s build something that actually works.

Setting Up Your Environment

First, install the dependencies:

pip install requests beautifulsoup4

You’ll also need a FineData API key. Sign up at finedata.ai and grab your key from the dashboard.

import requests
from bs4 import BeautifulSoup
import json
import time

FINEDATA_API_KEY = "fd_your_api_key"
FINEDATA_URL = "https://api.finedata.ai/api/v1/scrape"

def scrape_page(url, use_js=False):
    """Fetch a page through FineData's API."""
    response = requests.post(
        FINEDATA_URL,
        headers={
            "x-api-key": FINEDATA_API_KEY,
            "Content-Type": "application/json"
        },
        json={
            "url": url,
            "use_js_render": use_js,
            "tls_profile": "chrome124",
            "use_residential": True,
            "timeout": 30
        }
    )
    response.raise_for_status()
    return response.json()

We’re using use_residential: True because Amazon blocks most datacenter IPs. Residential proxies rotate through real consumer IP addresses, which Amazon treats as legitimate traffic.

Extracting Product Data from a Single Page

Let’s start with the core task: extracting structured data from an Amazon product page.

def parse_product_page(html):
    """Extract product details from Amazon product page HTML."""
    soup = BeautifulSoup(html, "html.parser")

    product = {}

    # Product title
    title_el = soup.select_one("#productTitle")
    product["title"] = title_el.get_text(strip=True) if title_el else None

    # Price — Amazon uses multiple price containers
    price_el = (
        soup.select_one(".a-price .a-offscreen")
        or soup.select_one("#priceblock_ourprice")
        or soup.select_one("#priceblock_dealprice")
        or soup.select_one(".a-price-whole")
    )
    product["price"] = price_el.get_text(strip=True) if price_el else None

    # Rating (e.g., "4.5 out of 5 stars")
    rating_el = soup.select_one("#acrPopover .a-icon-alt")
    if rating_el:
        rating_text = rating_el.get_text(strip=True)
        try:
            product["rating"] = float(rating_text.split(" ")[0])
        except ValueError:
            product["rating"] = None
    else:
        product["rating"] = None

    # Number of reviews
    reviews_el = soup.select_one("#acrCustomerReviewText")
    if reviews_el:
        reviews_text = reviews_el.get_text(strip=True)
        count_str = reviews_text.split(" ")[0].replace(",", "")
        product["review_count"] = int(count_str) if count_str.isdigit() else None
    else:
        product["review_count"] = None

    # Availability
    avail_el = soup.select_one("#availability span")
    product["availability"] = (
        avail_el.get_text(strip=True) if avail_el else None
    )

    # Product images
    images = []
    img_block = soup.select_one("#imgTagWrapperId img")
    if img_block and img_block.get("data-a-dynamic-image"):
        img_data = json.loads(img_block["data-a-dynamic-image"])
        images = list(img_data.keys())
    product["images"] = images

    # Feature bullets
    bullets = soup.select("#feature-bullets .a-list-item")
    product["features"] = [
        b.get_text(strip=True) for b in bullets
        if b.get_text(strip=True)
    ]

    return product

Now put it together:

def scrape_amazon_product(asin):
    """Scrape a single Amazon product by ASIN."""
    url = f"https://www.amazon.com/dp/{asin}"
    result = scrape_page(url, use_js=True)

    html = result["body"]
    product = parse_product_page(html)
    product["asin"] = asin
    product["url"] = url

    return product

# Example usage
product = scrape_amazon_product("B0BSHF7WHW")
print(json.dumps(product, indent=2))

We enable JavaScript rendering (use_js=True) because Amazon loads pricing and availability dynamically. Without it, you’ll often get incomplete data.
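Before spending tokens on a request, it can also pay to sanity-check your input. ASINs are 10-character alphanumeric codes (most non-book ASINs start with "B0"); a quick validator like the hypothetical helper below catches malformed IDs before they hit the API:

```python
import re

# Hypothetical helper, not part of the FineData API: ASINs are
# 10-character alphanumeric codes, so reject anything else up front.
ASIN_RE = re.compile(r"^[A-Z0-9]{10}$")

def is_valid_asin(asin):
    """Return True if the string looks like a well-formed ASIN."""
    return bool(ASIN_RE.match(asin))

print(is_valid_asin("B0BSHF7WHW"))   # True
print(is_valid_asin("not-an-asin"))  # False
```

Filtering a list with this before calling `scrape_amazon_product` avoids wasting requests on IDs that can never resolve.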

Handling Product Variations

Many Amazon products have variations — different sizes, colors, or configurations. These are typically loaded via AJAX when a user clicks a variation button, and the data lives in a JavaScript object embedded in the page source.

import re

def extract_variations(html):
    """Extract product variation data from page source."""
    variations = []

    # Amazon embeds variation data in a JS object
    pattern = r'"dimensionValuesDisplayData"\s*:\s*(\{[^}]+\})'
    match = re.search(pattern, html)

    if match:
        try:
            dim_data = json.loads(match.group(1))
            for asin, values in dim_data.items():
                variations.append({
                    "asin": asin,
                    "attributes": values
                })
        except json.JSONDecodeError:
            pass

    return variations

For a complete picture, you’d scrape each variation’s ASIN separately — but this gives you the list of available variations without extra requests.
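To see how the regex behaves, here is a self-contained sketch run against a sample snippet. The embedded JSON is illustrative (the ASINs are made up), not taken from a real product page:

```python
import json
import re

# Illustrative page source containing a dimensionValuesDisplayData
# object, in the same shape Amazon embeds it.
sample_html = '''
<script>
var obj = {"dimensionValuesDisplayData" : {"B0EXAMPLE01": ["Black", "Small"],
"B0EXAMPLE02": ["Black", "Large"]}};
</script>
'''

pattern = r'"dimensionValuesDisplayData"\s*:\s*(\{[^}]+\})'
match = re.search(pattern, sample_html)

# The captured group is valid JSON mapping ASIN -> attribute values.
dim_data = json.loads(match.group(1))
for asin, values in dim_data.items():
    print(asin, values)
```

Note that `[^}]+` stops at the first closing brace, which works here because the variation values are arrays (brackets), not nested objects.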

Scraping Search Results for Product Discovery

Often you don’t have specific ASINs — you want to discover products by searching Amazon. Here’s how to scrape Amazon search results:

def scrape_amazon_search(query, max_pages=3):
    """Scrape Amazon search results for a given query."""
    all_products = []

    for page in range(1, max_pages + 1):
        url = (
            f"https://www.amazon.com/s"
            f"?k={query.replace(' ', '+')}&page={page}"
        )
        result = scrape_page(url, use_js=True)
        html = result["body"]
        soup = BeautifulSoup(html, "html.parser")

        items = soup.select('[data-component-type="s-search-result"]')

        for item in items:
            product = {}

            # ASIN from data attribute
            product["asin"] = item.get("data-asin", "")

            # Title
            title_el = item.select_one("h2 a span")
            product["title"] = (
                title_el.get_text(strip=True) if title_el else None
            )

            # Price
            price_whole = item.select_one(".a-price-whole")
            price_frac = item.select_one(".a-price-fraction")
            if price_whole:
                price_str = price_whole.get_text(strip=True).rstrip(".")
                if price_frac:
                    price_str += "." + price_frac.get_text(strip=True)
                product["price"] = float(price_str.replace(",", ""))
            else:
                product["price"] = None

            # Rating
            rating_el = item.select_one(".a-icon-alt")
            if rating_el:
                try:
                    product["rating"] = float(
                        rating_el.get_text().split(" ")[0]
                    )
                except ValueError:
                    product["rating"] = None
            else:
                product["rating"] = None

            # Review count
            reviews_el = item.select_one(
                '[aria-label*="stars"] + span .a-size-base'
            )
            if reviews_el:
                text = reviews_el.get_text(strip=True).replace(",", "")
                product["review_count"] = int(text) if text.isdigit() else None
            else:
                product["review_count"] = None

            product["url"] = (
                f"https://www.amazon.com/dp/{product['asin']}"
            )

            all_products.append(product)

        # Be polite — add a delay between pages
        time.sleep(2)

    return all_products

# Search for wireless earbuds
results = scrape_amazon_search("wireless earbuds", max_pages=2)
print(f"Found {len(results)} products")
for p in results[:5]:
    title = (p["title"] or "untitled")[:60]
    print(f"  {title}... — ${p['price']}")
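Search results are also easy to hand off to a spreadsheet. A quick sketch using the standard `csv` module (the `products` list below is sample data in the same shape `scrape_amazon_search` returns):

```python
import csv

# Sample rows in the shape scrape_amazon_search produces.
products = [
    {"asin": "B0EXAMPLE01", "title": "Example Earbuds", "price": 29.99,
     "rating": 4.4, "review_count": 1280,
     "url": "https://www.amazon.com/dp/B0EXAMPLE01"},
]

fields = ["asin", "title", "price", "rating", "review_count", "url"]
with open("search_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(products)
```

`DictWriter` keeps the column order stable, so repeated exports diff cleanly.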

Scaling to Thousands of Products

When you need to scrape hundreds or thousands of product pages, sequential requests are too slow. FineData supports batch scraping to parallelize the work:

def scrape_products_batch(asins, batch_size=20):
    """Scrape multiple products using FineData's batch endpoint."""
    all_products = []

    for i in range(0, len(asins), batch_size):
        batch = asins[i:i + batch_size]
        urls = [f"https://www.amazon.com/dp/{asin}" for asin in batch]

        response = requests.post(
            "https://api.finedata.ai/api/v1/batch",
            headers={
                "x-api-key": FINEDATA_API_KEY,
                "Content-Type": "application/json"
            },
            json={
                "urls": urls,
                "use_js_render": True,
                "use_residential": True
            }
        )
        response.raise_for_status()
        batch_result = response.json()

        # Poll for results
        batch_id = batch_result["batch_id"]
        while True:
            status_resp = requests.get(
                f"https://api.finedata.ai/api/v1/batch/{batch_id}",
                headers={"x-api-key": FINEDATA_API_KEY}
            )
            status = status_resp.json()

            if status["status"] == "completed":
                for job in status["results"]:
                    if job["status"] == "completed":
                        product = parse_product_page(job["body"])
                        product["url"] = job["url"]
                        all_products.append(product)
                break

            time.sleep(5)

        print(f"Processed {min(i + batch_size, len(asins))}/{len(asins)}")

    return all_products

This processes 20 URLs at a time in parallel, dramatically reducing total scraping time.

Best Practices for Amazon Scraping

1. Use Residential Proxies

Amazon’s anti-bot system is sophisticated enough to detect most datacenter IP ranges. Residential proxies are essential for consistent results. FineData’s use_residential flag handles this automatically.

2. Enable JavaScript Rendering

Amazon dynamically loads prices, availability, and review data. Always use use_js_render: True for product pages to get complete data.

3. Respect Rate Limits

Even with rotating residential proxies, hammering Amazon with hundreds of requests per second will trigger blocks. Add delays between requests (1-3 seconds for sequential, or use batch endpoints that handle pacing for you).
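When a request does get throttled, retrying with exponential backoff usually recovers it. A generic sketch (the `flaky` stub stands in for `scrape_page` or any request function):

```python
import time

def with_retries(fetch, retries=3, base_delay=1.0):
    """Call fetch(), retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt))

# Demo with a stub that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In production you would catch a narrower exception type (e.g. `requests.HTTPError`) rather than bare `Exception`.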

4. Handle Edge Cases

Amazon product pages aren’t uniform. Some products have:

  • Multiple sellers with different prices (Buy Box vs. other offers)
  • Subscribe & Save pricing
  • Lightning deals with countdown timers
  • Out-of-stock items with no price

Build your parser to gracefully handle missing elements rather than crashing.
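One concrete way to do that is to centralize the messy coercions. This defensive sketch (not exhaustive) normalizes the raw price strings the selectors above can return:

```python
def parse_price(text):
    """Coerce a raw Amazon price string to float, or None.

    Handles '$1,299.99', the bare '39.' that .a-price-whole yields,
    and non-price text like 'Currently unavailable'.
    """
    if not text:
        return None
    cleaned = text.strip().lstrip("$").replace(",", "").rstrip(".")
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_price("$1,299.99"))              # 1299.99
print(parse_price("Currently unavailable"))  # None
```

Returning `None` instead of raising lets one malformed listing pass through without killing a thousand-page run.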

5. Cache Results

Product data doesn’t change every second. Cache results for at least 15-30 minutes to avoid unnecessary requests and token usage.
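A minimal in-memory TTL cache is enough to get started (a real pipeline might use Redis or the database itself; `fake_fetch` below is a stand-in for `scrape_amazon_product`):

```python
import time

_cache = {}

def cached_scrape(asin, fetch, ttl=1800):
    """Return a cached result for `asin` if younger than `ttl` seconds,
    otherwise call fetch(asin) and store the result."""
    now = time.time()
    if asin in _cache:
        ts, value = _cache[asin]
        if now - ts < ttl:
            return value
    value = fetch(asin)
    _cache[asin] = (now, value)
    return value

# Demo: the second lookup hits the cache instead of refetching.
hits = {"n": 0}
def fake_fetch(asin):
    hits["n"] += 1
    return {"asin": asin, "price": 19.99}

cached_scrape("B0EXAMPLE01", fake_fetch)
cached_scrape("B0EXAMPLE01", fake_fetch)
print(hits["n"])  # 1
```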

Token Cost Estimation

Here’s what a typical Amazon scraping session costs in FineData tokens:

| Operation | Tokens per Request | Notes |
|---|---|---|
| Base request | 1 | Always charged |
| JS rendering | +5 | Needed for Amazon |
| Residential proxy | +3 | Recommended for Amazon |
| Total per page | 9 | |

For 1,000 product pages, that’s 9,000 tokens. If you encounter CAPTCHAs (rare with residential proxies), add 10 tokens per CAPTCHA solve.
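The arithmetic is simple enough to fold into a budgeting helper before you kick off a large job:

```python
def estimate_tokens(pages, js_render=True, residential=True, captcha_solves=0):
    """Estimate FineData token cost from the per-request rates above."""
    per_page = 1 + (5 if js_render else 0) + (3 if residential else 0)
    return pages * per_page + captcha_solves * 10

print(estimate_tokens(1000))  # 9000
```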

Storing Your Data

Once you’ve scraped product data, you’ll want to store it. Here’s a quick example using SQLite:

import sqlite3

def init_db():
    conn = sqlite3.connect("amazon_products.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS products (
            asin TEXT PRIMARY KEY,
            title TEXT,
            price REAL,
            rating REAL,
            review_count INTEGER,
            availability TEXT,
            features TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    return conn

def save_product(conn, product):
    conn.execute("""
        INSERT OR REPLACE INTO products
        (asin, title, price, rating, review_count, availability, features)
        VALUES (?, ?, ?, ?, ?, ?, ?)
    """, (
        product.get("asin"),
        product.get("title"),
        product.get("price"),
        product.get("rating"),
        product.get("review_count"),
        product.get("availability"),
        json.dumps(product.get("features", []))
    ))
    conn.commit()
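Here is the same schema exercised end to end, but against an in-memory database so it runs anywhere (the product dict is sample data):

```python
import json
import sqlite3

# Same columns as init_db, minus the timestamp, in :memory: for a demo.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        asin TEXT PRIMARY KEY, title TEXT, price REAL, rating REAL,
        review_count INTEGER, availability TEXT, features TEXT
    )
""")

product = {"asin": "B0EXAMPLE01", "title": "Example Earbuds",
           "price": 29.99, "rating": 4.4, "review_count": 1280,
           "availability": "In Stock", "features": ["ANC", "USB-C"]}

conn.execute(
    "INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?, ?, ?, ?)",
    (product["asin"], product["title"], product["price"],
     product["rating"], product["review_count"],
     product["availability"], json.dumps(product["features"])),
)
conn.commit()

row = conn.execute("SELECT title, price FROM products").fetchone()
print(row)  # ('Example Earbuds', 29.99)
```

`INSERT OR REPLACE` keyed on the ASIN means re-scraping a product simply refreshes its row.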

For a more complete data pipeline with scheduling and alerts, check out our guide on building a price monitoring tool.

Key Takeaways

  • Amazon is one of the hardest sites to scrape due to aggressive anti-bot measures, CAPTCHAs, and dynamic content loading.
  • Residential proxies and JavaScript rendering are essential for reliable Amazon scraping.
  • Structure your scraper to handle missing elements gracefully — Amazon product pages vary significantly.
  • Use batch scraping to parallelize requests when working with large product lists.
  • Cache results and add delays between requests to stay under the radar and minimize token usage.
  • Store scraped data in a database for analysis and tracking over time.

Ready to start scraping Amazon data? Sign up for FineData and get free tokens to try it out. For more advanced patterns, check out our tutorial on handling CAPTCHAs and our API documentation.

#amazon #python #ecommerce #product-data #tutorial
