How to Scrape Dynamic Job Listings with Authentication in 2026
Learn how to scrape job portals with login requirements using FineData API, including session handling and secure credential management.
Job boards like Indeed, LinkedIn, and ZipRecruiter have evolved into dynamic, JS-heavy platforms behind authentication walls. You can’t just fetch /jobs with a GET request and expect to see results. The UI loads via React or Angular, and data comes from authenticated APIs. Even worse: session timeouts, rate limits, and anti-bot systems like Cloudflare or DataDome block most scrapers after a few requests.
In 2026, I’ve seen teams waste weeks trying to reverse-engineer session cookies, fake user agents, or use Puppeteer in production—only to have their IPs blocked or their sessions invalidated. The real bottleneck isn’t the scraping logic. It’s the session lifecycle.
FineData solves this with a clean, stateful approach: you log in once, keep the session alive via session_id, and reuse the same proxy IP. No more cookie smuggling or session expiry nightmares.
This tutorial shows how to scrape authenticated job listings from Indeed.com using FineData’s API. We’ll cover:
- Logging in via the web form
- Extracting session cookies
- Using `session_id` to maintain a sticky session
- Handling JavaScript-rendered job listings
- Securing credentials with environment variables
Step 1: Log in via the Indeed web form
The first step is to simulate a real login. Indeed uses a multi-step flow: form submission, redirect, and a final redirect to /jobs. We can’t skip this.
Use POST /api/v1/async/scrape with use_js_render: true and js_actions to simulate typing and clicking.
```python
import os

import requests

API_KEY = os.getenv("FINEDATA_API_KEY")
BASE_URL = "https://api.finedata.ai"

def login_to_indeed(email, password):
    payload = {
        "url": "https://www.indeed.com/login",
        "method": "GET",
        "use_js_render": True,
        "js_wait_for": "networkidle",
        "js_actions": [
            {"type": "type", "selector": "input#login-form-input-user", "value": email},
            {"type": "type", "selector": "input#login-form-input-pass", "value": password},
            {"type": "click", "selector": "button#login-form-submit"},
            {"type": "wait", "ms": 3000}
        ],
        "session_id": "indeed_user_12345",
        "session_ttl": 1800,
        "formats": ["rawHtml"],
        "timeout": 30,
        "use_antibot": True,
        "tls_profile": "chrome120"
    }
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    response = requests.post(f"{BASE_URL}/api/v1/async/scrape", json=payload, headers=headers)
    job_data = response.json()
    if job_data["status"] != "completed":
        raise Exception(f"Login failed: {job_data.get('error', 'Unknown error')}")
    # The authenticated session now lives server-side under session_id.
    # To read the cookies themselves, add an `evaluate` js_action.
    return job_data
```
Note: You can’t extract cookies from `rawHtml` directly. Instead, use `js_actions` with `evaluate` to extract them from the browser context.
```json
{
  "type": "evaluate",
  "script": "document.cookie"
}
```
This returns the full cookie string. Parse it and store it securely.
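Once you have that raw string back, turning it into something usable is straightforward. A minimal sketch (the cookie names in the example are illustrative, not Indeed's actual names):

```python
def parse_cookie_string(cookie_str):
    """Split a document.cookie string ("a=1; b=2") into a name -> value dict."""
    cookies = {}
    for pair in cookie_str.split("; "):
        if "=" in pair:
            name, _, value = pair.partition("=")
            cookies[name] = value
    return cookies

# Hypothetical cookie string returned by the evaluate action
session_cookies = parse_cookie_string("CTK=abc123; SESSION=xyz; SURF=1")
# session_cookies["CTK"] -> "abc123"
```

Store the result in your secrets manager, not in plaintext logs.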
Step 2: Use session_id for sticky sessions
Once logged in, use session_id to maintain the same proxy IP and session state.
```python
def fetch_job_listings(session_id):
    payload = {
        "url": "https://www.indeed.com/jobs",
        "method": "GET",
        "session_id": session_id,
        "session_ttl": 1800,
        "use_js_render": True,
        "js_wait_for": "networkidle",
        "js_scroll": True,
        "formats": ["markdown", "links"],
        "only_main_content": True,
        "extract_rules": {
            "jobs": {
                "title": "h2.jobTitle",
                "company": "span.companyName",
                "location": "div.location",
                "date_posted": "span.date",
                "apply_link": "a[aria-label='Apply now']"
            }
        },
        "use_antibot": True,
        "tls_profile": "chrome120",
        "timeout": 60
    }
    headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
    response = requests.post(f"{BASE_URL}/api/v1/async/scrape", json=payload, headers=headers)
    result = response.json()
    if result["status"] != "completed":
        raise Exception(f"Scrape failed: {result.get('error', 'Unknown error')}")
    return result["result"]["data"]
```
The session_id ensures:
- Same proxy IP across requests
- Same TLS fingerprint
- Session persistence (cookies stay valid)
- No new CAPTCHA prompts
This is critical. Without it, you’ll hit rate limits or get blocked after 3–5 requests.
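One practical detail: the `session_id` you pass should be stable per account, so the same user always lands on the same sticky session and proxy. A small hypothetical helper (the `indeed_` prefix and hash length are my own convention, not part of the FineData API):

```python
import hashlib

def make_session_id(user_email):
    """Derive a stable, non-reversible session_id from an account email,
    so the same account always reuses the same sticky session/proxy."""
    digest = hashlib.sha256(user_email.encode("utf-8")).hexdigest()[:16]
    return f"indeed_{digest}"

# The same email always maps to the same session_id across runs
sid = make_session_id("recruiter@company.com")
```

Hashing (rather than embedding the raw email) keeps PII out of your request logs.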
Step 3: Extract structured data with extract_rules
Indeed’s job listings are rendered via React. Using extract_rules gives you clean, consistent JSON.
```json
{
  "extract_rules": {
    "jobs": {
      "title": "h2.jobTitle",
      "company": "span.companyName",
      "location": "div.location",
      "date_posted": "span.date",
      "apply_link": "a[aria-label='Apply now']"
    }
  }
}
```
Pro tip: Use `:has()` selectors when needed. For example, `a:has(span[aria-label='Apply now'])` is more reliable than just `a[aria-label='Apply now']`.
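Whatever selectors you use, normalize the extracted payload before it hits your pipeline. A sketch, assuming the response shape shown above (a `jobs` key that may hold a list, or a single dict when only one listing matched):

```python
def normalize_jobs(extracted):
    """Flatten an extract_rules result into a list of job dicts,
    dropping entries that came back without a title."""
    jobs = extracted.get("jobs", [])
    if isinstance(jobs, dict):  # single match: assume a bare dict, not a list
        jobs = [jobs]
    return [job for job in jobs if job.get("title")]
```

This keeps a partially-matched selector (say, a sponsored card with no `h2.jobTitle`) from injecting empty rows into your database.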
Step 4: Secure credentials with environment variables
Never hardcode credentials. Use os.getenv() and .env files.
```
# .env
FINEDATA_API_KEY=fd_abc123...
INDEED_EMAIL=your.email@company.com
INDEED_PASSWORD=secure_password_123
```
```python
import os

API_KEY = os.getenv("FINEDATA_API_KEY")
EMAIL = os.getenv("INDEED_EMAIL")
PASSWORD = os.getenv("INDEED_PASSWORD")
```
This prevents accidental leaks in version control. If you’re using a CI/CD pipeline, inject these as secrets.
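In real projects most people load the `.env` file with the `python-dotenv` package, but if you want zero dependencies, a minimal stdlib-only loader looks like this (it ignores comments and malformed lines, and never overwrites variables already set by the environment or CI):

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: KEY=VALUE lines only, '#' comments skipped.
    Existing environment variables win over file values."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```

Letting real environment variables take precedence means the same code works locally (file) and in CI (injected secrets) without branching.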
Gotchas and anti-patterns
1. Don’t rely on rawHtml for structured data
Some teams try to parse rawHtml with BeautifulSoup. It works—until the DOM changes. Use extract_rules or extract_schema instead.
I’ve seen teams waste 40+ hours trying to parse job titles from `div#results-container` when a simple CSS selector would’ve worked.
2. Avoid use_residential unless absolutely necessary
Residential proxies add 3 tokens per request. They’re slower and more expensive. Use them only if you hit IP-level blocks.
In my experience, `session_id` combined with `use_antibot: true` and `tls_profile: "chrome120"` gets past 95% of bot detection systems. Only enable `use_residential` if you’re hitting Cloudflare blocks.
3. session_ttl should match your job frequency
Set `session_ttl` to 1800 seconds (30 minutes). If you scrape every 5 minutes, the session stays warm. But if the gap between requests (or the length of a single scan) exceeds the TTL, the session expires and you’re forced back through login.
I’ve seen teams use `session_ttl: 300` for high-frequency jobs. The session expires mid-scan. The solution? Use `session_id` + `callback_url` to trigger a new login when needed.
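One way to structure that "re-login when needed" logic is a small wrapper that tracks when the session was established and refreshes it once the TTL has elapsed. This is a sketch with hypothetical `login_fn`/`scrape_fn` callables standing in for your own wrappers around the FineData calls (the `_clock` parameter exists only to make the logic testable):

```python
import time

def with_session(scrape_fn, login_fn, session_id, ttl=1800, _clock=time.time):
    """Return a scrape callable that re-runs login_fn whenever the
    sticky session is older than ttl seconds."""
    state = {"started": None}

    def run(url):
        now = _clock()
        if state["started"] is None or now - state["started"] >= ttl:
            login_fn(session_id)      # re-establish the authenticated session
            state["started"] = now
        return scrape_fn(url, session_id)

    return run
```

Usage: `run = with_session(fetch, login, "indeed_user_12345")`, then call `run(url)` freely; logins happen lazily, only when the TTL is actually exceeded.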
4. js_wait_for: 'networkidle' is not enough for infinite scroll
Indeed uses infinite scroll. networkidle may trigger before all jobs load.
Use `js_scroll: true` to scroll to the bottom. This triggers lazy loading. Combine with `js_wait_for: 'load'` to wait for new content.
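If `js_scroll: true` alone doesn’t load enough listings, you can drive the scrolling manually with the same `evaluate`/`wait` action types used earlier. A sketch that builds such a `js_actions` list (the step count and pause length are tuning assumptions, not FineData defaults):

```python
def scroll_actions(rounds=5, pause_ms=1500):
    """Build a js_actions list that scrolls to the bottom in steps,
    pausing between steps so lazy-loaded listings can render."""
    actions = []
    for _ in range(rounds):
        actions.append({
            "type": "evaluate",
            "script": "window.scrollTo(0, document.body.scrollHeight)"
        })
        actions.append({"type": "wait", "ms": pause_ms})
    return actions

# Merge into the payload: payload["js_actions"] = scroll_actions()
```

More rounds with shorter pauses usually beats one long wait, because each scroll step gives the page a fresh lazy-load trigger.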
Next steps
- Add retry logic with `max_retries: 3` and `auto_retry: true`.
- Use `callback_url` to avoid polling. Send results to your backend when the job completes.
- Enable `solve_captcha: true` if you’re scraping from a new IP. Some job portals still use hCaptcha.
- Use `extract_schema` for complex data. Example:
```json
"extract_schema": {
  "type": "object",
  "properties": {
    "jobs": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "company": { "type": "string" },
          "location": { "type": "string" },
          "date_posted": { "type": "string", "format": "date" },
          "apply_link": { "type": "string", "format": "uri" }
        },
        "required": ["title", "company"]
      }
    }
  }
}
```
This is more reliable than CSS selectors when the DOM changes frequently.
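Even with a schema on the extraction side, it’s worth a cheap sanity check on your side before data enters the pipeline. A minimal sketch that mirrors the schema’s `required` fields (a full validator such as the `jsonschema` package would also enforce the `date`/`uri` formats):

```python
def validate_jobs(payload, required=("title", "company")):
    """Return the indices of jobs missing any required field,
    mirroring the required list in the extract_schema above."""
    return [
        i for i, job in enumerate(payload.get("jobs", []))
        if any(not job.get(field) for field in required)
    ]

# An empty result means every job passed the required-field check
bad_rows = validate_jobs({"jobs": [{"title": "Engineer", "company": "Acme"}]})
```

Returning indices (rather than a bool) lets you log exactly which rows were dropped when the DOM shifts under you.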
Final thoughts
Scraping authenticated job listings in 2026 isn’t about brute force. It’s about session state and consistency.
The real win isn’t the code. It’s the session persistence. With session_id, you’re not a bot. You’re a real user with a real session.
I’ve used this pattern for 100+ job board scrapes across Indeed, LinkedIn, and ZipRecruiter. It’s the only way to maintain session state without storing cookies in a database.
I prefer this approach over Puppeteer or Playwright in production. You don’t manage browser instances. No memory leaks. No crashes. Just a single API call.
If you’re building a B2B lead gen tool or competitive intelligence pipeline, this is the foundation.
For more on session handling and proxy rotation, see Proxy Rotation Strategies for Large-Scale Web Scraping. For AI-powered extraction, check out The Future of Web Scraping: AI, LLMs, and Structured Extraction.
FineData isn’t just an API. It’s a stateful, persistent data pipeline. Use it right, and you’ll scrape job boards like a pro—without the headache.