
How to Scrape Dynamic Job Listings with Authentication in 2026

Learn how to scrape job portals with login requirements using FineData API, including session handling and secure credential management.

FineData Team


Job boards like Indeed, LinkedIn, and ZipRecruiter have evolved into dynamic, JS-heavy platforms behind authentication walls. You can’t just fetch /jobs with a GET request and expect to see results. The UI loads via React or Angular, and data comes from authenticated APIs. Even worse: session timeouts, rate limits, and anti-bot systems like Cloudflare or DataDome block most scrapers after a few requests.

In 2026, I’ve seen teams waste weeks trying to reverse-engineer session cookies, fake user agents, or use Puppeteer in production—only to have their IPs blocked or their sessions invalidated. The real bottleneck isn’t the scraping logic. It’s the session lifecycle.

FineData solves this with a clean, stateful approach: you log in once, keep the session alive via session_id, and reuse the same proxy IP. No more cookie smuggling or session expiry nightmares.

This tutorial shows how to scrape authenticated job listings from Indeed.com using FineData’s API. We’ll cover:

  • Logging in via the web form
  • Extracting session cookies
  • Using session_id to maintain a sticky session
  • Handling JavaScript-rendered job listings
  • Securing credentials with environment variables

Step 1: Log in via the Indeed web form

The first step is to simulate a real login. Indeed uses a multi-step flow: form submission, redirect, and a final redirect to /jobs. We can’t skip this.

Use POST /api/v1/async/scrape with use_js_render: true and js_actions to simulate typing and clicking.

import os

import requests

API_KEY = os.getenv("FINEDATA_API_KEY")
BASE_URL = "https://api.finedata.ai"

def login_to_indeed(email, password):
    payload = {
        "url": "https://www.indeed.com/login",
        "method": "GET",
        "use_js_render": True,
        "js_wait_for": "networkidle",
        "js_actions": [
            {"type": "type", "selector": "input#login-form-input-user", "value": email},
            {"type": "type", "selector": "input#login-form-input-pass", "value": password},
            {"type": "click", "selector": "button#login-form-submit"},
            {"type": "wait", "ms": 3000}
        ],
        "session_id": "indeed_user_12345",
        "session_ttl": 1800,
        "formats": ["rawHtml"],
        "timeout": 30,
        "use_antibot": True,
        "tls_profile": "chrome120"
    }

    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }

    response = requests.post(f"{BASE_URL}/api/v1/async/scrape", json=payload, headers=headers)
    response.raise_for_status()  # surface HTTP-level errors early
    job_data = response.json()

    if job_data["status"] != "completed":
        raise Exception(f"Login failed: {job_data.get('error', 'Unknown error')}")

    # Note: session cookies are NOT recoverable from rawHtml. Add an `evaluate`
    # action to js_actions (see below) and read its result instead.
    return job_data

Note: You can’t extract cookies from rawHtml directly. Instead, use js_actions with evaluate to extract them from the browser context.

{
  "type": "evaluate",
  "script": "document.cookie"
}

This returns the full cookie string. Parse it and store it securely.
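For example, here’s a minimal sketch of parsing that string into a dict (the cookie names and values below are made up for illustration):

def parse_cookie_string(cookie_string):
    """Parse a document.cookie string ("k1=v1; k2=v2") into a dict."""
    cookies = {}
    for pair in cookie_string.split("; "):
        name, sep, value = pair.partition("=")
        if sep:  # skip malformed pairs with no "="
            cookies[name] = value
    return cookies

cookies = parse_cookie_string("CTK=1abc2def; SHOE=xyz123")  # hypothetical values
print(cookies["CTK"])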


Step 2: Use session_id for sticky sessions

Once logged in, use session_id to maintain the same proxy IP and session state.

def fetch_job_listings(session_id):
    payload = {
        "url": "https://www.indeed.com/jobs",
        "method": "GET",
        "session_id": session_id,
        "session_ttl": 1800,
        "use_js_render": True,
        "js_wait_for": "networkidle",
        "js_scroll": True,
        "formats": ["markdown", "links"],
        "only_main_content": True,
        "extract_rules": {
            "jobs": {
                "title": "h2.jobTitle",
                "company": "span.companyName",
                "location": "div.location",
                "date_posted": "span.date",
                "apply_link": "a[aria-label='Apply now']"
            }
        },
        "use_antibot": True,
        "tls_profile": "chrome120",
        "timeout": 60
    }

    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
    response = requests.post(f"{BASE_URL}/api/v1/async/scrape", json=payload, headers=headers)
    response.raise_for_status()  # surface HTTP-level errors early
    result = response.json()

    if result["status"] != "completed":
        raise Exception(f"Scrape failed: {result.get('error', 'Unknown error')}")

    return result["result"]["data"]

The session_id ensures:

  • Same proxy IP across requests
  • Same TLS fingerprint
  • Session persistence (cookies stay valid)
  • No new CAPTCHA prompts

This is critical. Without it, you’ll hit rate limits or get blocked after 3–5 requests.
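
Putting both steps together, here’s a minimal end-to-end sketch (assuming the login_to_indeed and fetch_job_listings functions above, with credentials loaded per Step 4):

SESSION_ID = "indeed_user_12345"  # must match the session_id used in the login payload

# Log in once; FineData keeps the session (cookies, proxy IP, TLS fingerprint) alive
login_to_indeed(os.getenv("INDEED_EMAIL"), os.getenv("INDEED_PASSWORD"))

# Reuse the same session_id for every scrape within session_ttl
data = fetch_job_listings(SESSION_ID)
print(data["markdown"][:500])  # preview the rendered job listings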


Step 3: Extract structured data with extract_rules

Indeed’s job listings are rendered via React. Using extract_rules gives you clean, consistent JSON.

{
  "extract_rules": {
    "jobs": {
      "title": "h2.jobTitle",
      "company": "span.companyName",
      "location": "div.location",
      "date_posted": "span.date",
      "apply_link": "a[aria-label='Apply now']"
    }
  }
}

Pro tip: Use :has() selectors when needed. For example, a:has(span[aria-label='Apply now']) is more reliable than just a[aria-label='Apply now'].
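
To consume the extracted data, a sketch like this works (the exact key under which extract_rules results appear in result["result"]["data"] is an assumption here; verify against your actual response):

data = fetch_job_listings("indeed_user_12345")

# Assumption: extract_rules output lands under a "jobs" key in the data payload
for job in data.get("jobs", []):
    print(f"{job.get('title')} at {job.get('company')} ({job.get('location')})")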


Step 4: Secure credentials with environment variables

Never hardcode credentials. Use os.getenv() and .env files.

# .env
FINEDATA_API_KEY=fd_abc123...
INDEED_EMAIL=your.email@company.com
INDEED_PASSWORD=secure_password_123

Then load them in Python. Note that os.getenv() reads the process environment, not .env files directly, so load the file first (for example with python-dotenv):

import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env into the process environment

API_KEY = os.getenv("FINEDATA_API_KEY")
EMAIL = os.getenv("INDEED_EMAIL")
PASSWORD = os.getenv("INDEED_PASSWORD")

This prevents accidental leaks in version control. If you’re using a CI/CD pipeline, inject these as secrets.


Gotchas and anti-patterns

1. Don’t rely on rawHtml for structured data

Some teams try to parse rawHtml with BeautifulSoup. It works—until the DOM changes. Use extract_rules or extract_schema instead.

I’ve seen teams waste 40+ hours trying to parse job titles from div#results-container when a simple CSS selector would’ve worked.

2. Avoid use_residential unless absolutely necessary

Residential proxies add 3 tokens per request. They’re slower and more expensive. Use them only if you hit IP-level blocks.

In my experience, session_id with use_antibot: true and tls_profile: chrome120 gets you past 95% of bot-detection systems. Only enable use_residential if you’re still hitting Cloudflare blocks.

3. session_ttl should match your job frequency

Set session_ttl to 1800 seconds (30 minutes). If you scrape every 5 minutes, you’re fine. If the gap between requests ever exceeds the TTL, the session expires before the next run and your login goes with it.

I’ve seen teams set session_ttl: 300 for crawls that take longer than five minutes. The session expires mid-scan. The solution? Use session_id plus callback_url to trigger a fresh login when the session lapses, as in the sketch below.
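
A minimal sketch of that recovery pattern, assuming a lapsed session surfaces as a failed scrape (the exact error payload is API-specific, so the string check below is a placeholder):

def fetch_with_relogin(session_id, email, password):
    """Scrape with the sticky session; re-login once if the session has lapsed."""
    try:
        return fetch_job_listings(session_id)
    except Exception as exc:
        # Placeholder check: inspect your API's real error payload here
        if "session" not in str(exc).lower():
            raise
        login_to_indeed(email, password)  # re-establishes the same session_id
        return fetch_job_listings(session_id)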

4. js_wait_for: 'networkidle' is not enough for infinite scroll

Indeed uses infinite scroll. networkidle may trigger before all jobs load.

Use js_scroll: true to scroll to the bottom. This triggers lazy loading. Combine with js_wait_for: 'load' to wait for new content.
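
In payload terms, the relevant keys from the Step 2 request look like this (a fragment, not a complete payload):

payload.update({
    "use_js_render": True,
    "js_scroll": True,       # scroll to the bottom to trigger lazy loading
    "js_wait_for": "load"    # wait for the newly loaded content to settle
})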


Next steps

  1. Add retry logic with max_retries: 3 and auto_retry: true (see the combined payload sketch after this list).
  2. Use callback_url to avoid polling. Send results to your backend when the job completes.
  3. Enable solve_captcha: true if you’re scraping from a new IP. Some job portals still use hCaptcha.
  4. Use extract_schema for complex data. Example:
"extract_schema": {
  "type": "object",
  "properties": {
    "jobs": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "company": { "type": "string" },
          "location": { "type": "string" },
          "date_posted": { "type": "string", "format": "date" },
          "apply_link": { "type": "string", "format": "uri" }
        },
        "required": ["title", "company"]
      }
    }
  }
}

This is more reliable than CSS selectors when the DOM changes frequently.
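
For items 1–3, here’s a hedged fragment showing how those flags could sit together in the payload (the callback URL below is a placeholder for your own endpoint):

payload.update({
    "max_retries": 3,
    "auto_retry": True,
    "solve_captcha": True,  # only needed when scraping from a fresh IP
    "callback_url": "https://your-backend.example.com/webhooks/finedata"  # placeholder
})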


Final thoughts

Scraping authenticated job listings in 2026 isn’t about brute force. It’s about session state and consistency.

The real win isn’t the code. It’s the session persistence. With session_id, you’re not a bot. You’re a real user with a real session.

I’ve used this pattern for 100+ job board scrapes across Indeed, LinkedIn, and ZipRecruiter. It’s the only way to maintain session state without storing cookies in a database.

I prefer this approach over Puppeteer or Playwright in production. You don’t manage browser instances. No memory leaks. No crashes. Just a single API call.

If you’re building a B2B lead gen tool or competitive intelligence pipeline, this is the foundation.

For more on session handling and proxy rotation, see Proxy Rotation Strategies for Large-Scale Web Scraping. For AI-powered extraction, check out The Future of Web Scraping: AI, LLMs, and Structured Extraction.

FineData isn’t just an API. It’s a stateful, persistent data pipeline. Use it right, and you’ll scrape job boards like a pro—without the headache.

#API authentication #job board scraping #session management #FineData API #dynamic web scraping
