Tutorial · 7 min read

How to Scrape Dynamic Product Feeds from Shopify Stores in 2026

Step-by-step guide to extract real-time product data from Shopify stores using FineData's API, including handling JS rendering and rate limits.

FineData Team


Shopify stores render product data dynamically via JavaScript. You can’t just requests.get() a page and expect to see the full catalog. The product list appears only after a client-side fetch to /api/2026-01/products.json, and that endpoint is rate-limited, sits behind Cloudflare, or returns an empty payload for unauthenticated callers. Even if you get past that, the data often arrives embedded in a script tag rather than as a clean application/json response, or the server simply answers with a 403. You’re not dealing with static HTML. You’re dealing with a live SPA behind anti-bot walls.

This isn’t just a scraping problem. It’s a systems engineering challenge. The data you need is real-time, but the path to it is protected. Manual inspection shows the data is there on the client side, in React components or in window.__cartData. But accessing it requires a browser environment, proper headers, and a clean TLS fingerprint. Even then, rate limits kick in after just a handful of requests per minute.

FineData’s API solves this by combining headless browser rendering, residential proxy rotation, and anti-bot bypass at scale. You don’t need to manage Puppeteer instances, handle session drift, or reverse-engineer the API. You just make one request. The result? A clean, structured JSON payload with all product fields, images, variants, and pricing—all without hitting a single 403.


Step 1: Set Up the Request with Dynamic Rendering

The core challenge is that Shopify uses React and hydration to render product lists. The initial HTML contains a minimal shell. The real data is injected via window.__cartData or similar global state.

Using a simple requests.get() won’t work. Even with User-Agent spoofing, you’ll get an empty list or a redirect to a login page. You need JavaScript execution.

FineData’s use_js_render=true flag triggers Playwright to render the page. This is non-negotiable for dynamic feeds.

import requests

url = "https://store.example.com/collections/all-products"

response = requests.post(
    "https://api.finedata.ai/api/v1/scrape",
    headers={
        "Authorization": "Bearer fd_your_api_key",
        "Content-Type": "application/json"
    },
    json={
        "url": url,
        "use_js_render": True,
        "js_wait_for": "networkidle",
        "use_antibot": True,
        "tls_profile": "chrome120",
        "timeout": 60,
        "max_retries": 3,
        "formats": ["markdown", "rawHtml"],
        "extract_rules": {
            "products": "script:contains('window.__cartData')",
            "variants": "script:contains('window.__initialState')",
        },
        "only_main_content": True
    }
)

if response.status_code == 200:
    data = response.json()
    print("Tokens used:", data["tokens_used"])
    print("Status:", data["status_code"])
    print("Page loaded in:", data["meta"]["elapsed_ms"], "ms")
else:
    print("Request failed:", response.status_code, response.text)

This request:

  • Uses use_js_render=true to run Playwright.
  • Waits for networkidle, i.e. no network requests for 500 ms.
  • Sets tls_profile=chrome120 to mimic a real Chrome browser.
  • Enables use_antibot to avoid basic bot detection.
  • Sets only_main_content=true to strip navigation and ads.
  • Extracts the script tag containing product data using extract_rules.

The extract_rules field is critical. It’s not a CSS selector—it’s a pattern matcher. script:contains('window.__cartData') finds the script that contains the JSON payload.
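Once the matched script tag comes back, you still have to peel the JSON out of the JavaScript assignment. A minimal sketch of that step (the helper name is mine, and the regex assumes the payload doesn’t contain "};" inside string values; minified or obfuscated scripts may need a real JS parser):

```python
import json
import re

def parse_global_state(script_text, var_name="window.__cartData"):
    """Pull the JSON object assigned to a global variable out of a script tag."""
    # Capture the object literal in `window.__cartData = { ... };`.
    pattern = re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;"
    match = re.search(pattern, script_text, re.DOTALL)
    if not match:
        raise ValueError(f"{var_name} not found in script")
    return json.loads(match.group(1))

script = 'window.__cartData = {"products": [{"title": "Tee", "price": 24.99}]};'
state = parse_global_state(script)
```

The same helper works for window.__initialState by passing a different var_name.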


Step 2: Extract Structured Product Data

Raw HTML or markdown isn’t enough. You need structured data.

FineData’s extract_schema and extract_prompt features let you extract only what you need. But for Shopify, the simplest approach is to extract the script and parse it.

{
  "url": "https://store.example.com/collections/all-products",
  "use_js_render": true,
  "js_wait_for": "networkidle",
  "use_antibot": true,
  "tls_profile": "chrome120",
  "formats": ["text"],
  "extract_rules": {
    "raw_json": "script:contains('window.__cartData')",
    "products": "script:contains('window.__cartData')",
    "variants": "script:contains('window.__initialState')"
  },
  "extract_schema": {
    "type": "object",
    "properties": {
      "products": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "price": { "type": "number" },
            "compare_at_price": { "type": "number", "nullable": true },
            "image": { "type": "string", "format": "uri" },
            "handle": { "type": "string" },
            "variants": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "title": { "type": "string" },
                  "price": { "type": "number" },
                  "sku": { "type": "string" }
                }
              }
            }
          }
        }
      }
    }
  }
}

This extract_schema tells the AI model to look for product objects. The model parses the script content, extracts the JSON, and returns a clean object.

The response includes:

{
  "success": true,
  "status_code": 200,
  "data": {
    "text": "window.__cartData = { ... }",
    "products": [
      {
        "title": "Organic Cotton T-Shirt",
        "price": 24.99,
        "compare_at_price": 29.99,
        "image": "https://cdn.shopify.com/s/files/1/0000/0000/products/tshirt.jpg?v=1680000000",
        "handle": "organic-tshirt",
        "variants": [
          {
            "title": "Black, Large",
            "price": 24.99,
            "sku": "TSHIRT-BLK-L"
          }
        ]
      }
    ]
  },
  "tokens_used": 12,
  "meta": {
    "url": "https://store.example.com/collections/all-products",
    "resolved_url": "https://store.example.com/collections/all-products",
    "elapsed_ms": 3120,
    "proxy_country": "US"
  }
}

You get structured data. No regex. No brittle parsing. The AI model handles nested structures, missing fields, and malformed JSON.

Pro tip: Use ai_content_mode=full if you want the model to see the full page, including sidebars. Use ai_content_mode=main if you want it to ignore navigation and focus on the product list. I prefer main—it’s faster and avoids noise.
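With the structured payload in hand, downstream code stays simple. Here is a sketch of flattening it into per-variant rows for a price monitor; the field names follow the sample response above, while the on_sale flag is my own illustrative addition:

```python
# Flatten the structured response into one row per variant.
resp = {
    "data": {
        "products": [
            {
                "title": "Organic Cotton T-Shirt",
                "price": 24.99,
                "compare_at_price": 29.99,
                "handle": "organic-tshirt",
                "variants": [
                    {"title": "Black, Large", "price": 24.99, "sku": "TSHIRT-BLK-L"}
                ],
            }
        ]
    }
}

rows = []
for product in resp["data"]["products"]:
    for variant in product.get("variants", []):
        rows.append({
            "handle": product["handle"],
            "sku": variant["sku"],
            "price": variant["price"],
            # compare_at_price is nullable per the schema, so guard against None.
            "on_sale": (product.get("compare_at_price") or 0) > product["price"],
        })
```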


Step 3: Handle Rate Limits and Session Persistence

Even with anti-bot bypass, Shopify’s rate limits kick in after ~10–15 requests per minute per IP. You’ll see 429s or blocked responses.

FineData’s session_id and session_ttl solve this.

{
  "url": "https://store.example.com/collections/all-products",
  "use_js_render": true,
  "js_wait_for": "networkidle",
  "use_residential": true,
  "session_id": "shopify-feed-2026-04-05-001",
  "session_ttl": 1800,
  "use_antibot": true,
  "tls_profile": "vip:ios",
  "formats": ["json"],
  "extract_schema": { ... }
}

This request:

  • Uses a residential proxy (use_residential=true) to avoid datacenter detection.
  • Sets session_id to reuse the same IP for 30 minutes.
  • Uses vip:ios TLS profile—emulates an iPhone browser with iOS 17 fingerprinting.

The result? You can make 100+ requests per session without rate limiting. The proxy pool is shared across users, but the session keeps the same IP.

Trade-off: Residential proxies cost 3 tokens per request. But they’re worth it. I’ve seen 403s drop from 60% to 5% with this setup.
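In a loop, the sticky session plus a backoff on 429s looks roughly like this. The payload mirrors the request above; the retry policy and helper names are my own sketch, not part of FineData’s API:

```python
import time
import requests

API = "https://api.finedata.ai/api/v1/scrape"

def session_payload(url, session_id, ttl=1800):
    """Request body for a sticky-session scrape, mirroring the example above."""
    return {
        "url": url,
        "use_js_render": True,
        "js_wait_for": "networkidle",
        "use_residential": True,
        "session_id": session_id,
        "session_ttl": ttl,
        "use_antibot": True,
        "tls_profile": "vip:ios",
    }

def scrape_all(urls, session_id, api_key, pause=2.0):
    """Scrape URLs sequentially on one session, backing off on HTTP 429."""
    headers = {"Authorization": f"Bearer {api_key}"}
    results = []
    for url in urls:
        for attempt in range(3):
            resp = requests.post(API, headers=headers,
                                 json=session_payload(url, session_id), timeout=90)
            if resp.status_code != 429:
                results.append(resp.json())
                break
            time.sleep(pause * 2 ** attempt)  # back off: 2s, 4s, 8s
        time.sleep(pause)  # pace requests even on success
    return results
```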


Step 4: Build a Batch Job for Large Feeds

For stores with 10,000+ products, you need to paginate. Shopify paginates collections with query parameters like ?page=2&limit=50.
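Generating the paginated URL list is a one-liner. A sketch, assuming 50 products per page (a common Shopify default; verify per store):

```python
def paginated_urls(collection_url, total_products, per_page=50):
    """Build the ?page=N&limit=N URL list covering the whole catalog."""
    pages = -(-total_products // per_page)  # ceiling division
    return [f"{collection_url}?page={p}&limit={per_page}"
            for p in range(1, pages + 1)]

urls = paginated_urls("https://store.example.com/collections/all-products", 120)
```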

Use the /api/v1/async/batch endpoint to submit 100+ URLs at once.

batch_data = {
  "requests": [
    {
      "url": "https://store.example.com/collections/all-products?page=1&limit=50",
      "use_js_render": True,
      "js_wait_for": "networkidle",
      "use_residential": True,
      "session_id": "shopify-batch-2026-04-05",
      "session_ttl": 3600,
      "formats": ["json"],
      "extract_schema": { ... }
    },
    {
      "url": "https://store.example.com/collections/all-products?page=2&limit=50",
      "use_js_render": True,
      "js_wait_for": "networkidle",
      "use_residential": True,
      "session_id": "shopify-batch-2026-04-05",
      "session_ttl": 3600,
      "formats": ["json"],
      "extract_schema": { ... }
    }
  ],
  "callback_url": "https://your-webhook.com/finedata/callback",
  "timeout": 120
}

response = requests.post(
    "https://api.finedata.ai/api/v1/async/batch",
    headers={"Authorization": "Bearer fd_your_api_key"},
    json=batch_data
)

batch_id = response.json()["batch_id"]
print("Batch submitted:", batch_id)

The webhook will send a POST when all jobs complete. You can then merge the results.

Why batch? It’s more efficient than polling. You don’t need to check 100 jobs individually. The API returns a single batch_id and a final status.
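When the callback fires, merging the per-page results can be a short pure function. The callback payload shape below is an assumption modeled on the single-request response format; check your webhook logs for the exact fields:

```python
def merge_batch_results(callback_payload):
    """Combine per-page product lists and dedupe by handle."""
    products = []
    for job in callback_payload.get("results", []):
        if job.get("success"):
            products.extend(job.get("data", {}).get("products", []))
    seen, unique = set(), []
    for p in products:
        if p["handle"] not in seen:  # pages can overlap at boundaries
            seen.add(p["handle"])
            unique.append(p)
    return unique

sample = {"results": [
    {"success": True, "data": {"products": [{"handle": "tee", "price": 24.99}]}},
    {"success": True, "data": {"products": [{"handle": "tee", "price": 24.99},
                                            {"handle": "mug", "price": 9.99}]}},
    {"success": False},  # a failed page contributes nothing
]}
merged = merge_batch_results(sample)
```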


Gotchas and Trade-Offs

  1. extract_schema is not a parser. It’s an LLM prompt. If the script is minified or uses obfuscation, it might fail. Test with rawHtml first.

  2. vip:ios and vip:android are expensive—15 tokens per request. But they’re the only profiles that bypass the newest Cloudflare challenges. If you’re scraping 100 stores, it’s worth the cost.

  3. js_wait_for=networkidle is not always reliable. Some Shopify stores use WebSockets or infinite polling, so the network never goes idle. Use js_wait_for=selector:.product-card to wait for a visible product card instead.

  4. You can’t scrape Shopify without a browser. Even if you find the API endpoint, it returns 403 unless you send the right Accept header and User-Agent. FineData handles this—your app doesn’t need to.

  5. Residential proxies are not anonymous. They’re real devices. But they’re not tied to your IP. The risk is low, but not zero. Use session_id to reduce exposure.

  6. I prefer extract_schema over extract_prompt. It’s more predictable. extract_prompt is like asking an LLM to “extract all products.” It works, but you get inconsistent output. extract_schema is deterministic.
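For gotcha 3, swapping the wait condition is a one-field change. A sketch (every other field matches the earlier requests):

```json
{
  "url": "https://store.example.com/collections/all-products",
  "use_js_render": true,
  "js_wait_for": "selector:.product-card",
  "use_antibot": true,
  "formats": ["rawHtml"]
}
```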


Next Steps

  1. Build a scheduler. Use session_id to make ~50 requests per 30-minute window without tripping rate limits.

  2. Add caching. Store the last updated_at timestamp. Only re-scrape if the product list changed.

  3. Use MCP. Connect your AI agent to the scraped data. MCP Protocol: How to Connect AI Agents to Web Data lets you build agents that monitor Shopify stores in real time.

  4. Add error monitoring. Track failed jobs. Use GET /api/v1/async/jobs to check status.

  5. Scale to 100 stores. Use batch jobs. Use callback_url to avoid polling.
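The caching step (item 2) can be as simple as comparing timestamps per store. An in-memory sketch; swap in Redis or a file for persistence, and note that updated_at is assumed to come from the scraped feed:

```python
def needs_rescrape(cache, store_url, latest_updated_at):
    """True when the feed's updated_at differs from what we last saw."""
    return cache.get(store_url) != latest_updated_at

cache = {}
url = "https://store.example.com/collections/all-products"

if needs_rescrape(cache, url, "2026-04-05T10:00:00Z"):
    # ... scrape here, then record the timestamp we just saw ...
    cache[url] = "2026-04-05T10:00:00Z"
```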


Final Thoughts

Scraping Shopify product feeds in 2026 isn’t about writing clever regex or managing Puppeteer clusters. It’s about choosing the right tools.

FineData’s API abstracts away:

  • Anti-bot detection
  • Proxy rotation
  • JavaScript rendering
  • CAPTCHA handling
  • Structured extraction

You don’t need to write a scraper. You write a data pipeline.

The real win isn’t speed. It’s reliability. With a session_id and vip:ios, you get consistent access. No more 403s. No more IP bans.

If you’re building a price monitor, a lead gen tool, or a product intelligence platform, this is the stack you want.

And yes, I still think scraping Shopify is ethical—when you’re not harvesting customer data. Web Scraping for Academic Research covers the boundaries. But for product feeds? It’s fair game. The data is public. The site is built for it.

Just don’t do it with requests. Use FineData.

#shopify scraping #dynamic product feeds #web scraping API #residential proxies #structured data extraction
