
Web Scraping Legal Guide: GDPR, CCPA, and robots.txt in 2026

Legal guide to web scraping in 2026: court cases, GDPR, CCPA, robots.txt, Terms of Service, and best practices for compliant and ethical data collection.

FineData Team


Web scraping occupies a complex legal space. It is simultaneously one of the most common data collection methods on the internet and one of the most legally contested. Search engines, price comparison sites, academic researchers, journalists, and businesses of every size rely on web scraping — yet the legal boundaries remain nuanced and jurisdiction-dependent.

This guide provides a practical overview of the legal landscape for web scraping in 2026. It is not legal advice — consult a qualified attorney for your specific situation — but it covers the key frameworks, court precedents, and best practices that every scraping practitioner should understand.

The Legal Landscape

Web scraping legality depends on several intersecting areas of law:

  1. Computer fraud and access laws (CFAA in the US, Computer Misuse Act in the UK)
  2. Copyright law (who owns the data, and does scraping constitute copying?)
  3. Contract law (Terms of Service as a binding agreement)
  4. Data protection regulations (GDPR, CCPA)
  5. Trespass to chattels (using someone’s server resources without permission)

No single law governs web scraping. Instead, the legality of a specific scraping activity depends on what you are scraping, how you are scraping it, where the data subjects are located, and what you do with the data.

Key Court Cases

hiQ Labs v. LinkedIn (2022)

This is the most significant US court case for web scraping. hiQ Labs scraped publicly available LinkedIn profiles to provide workforce analytics. LinkedIn sent a cease-and-desist letter and blocked hiQ’s access. hiQ sued, and the case went to the Supreme Court and back.

Key ruling: The Ninth Circuit ruled that scraping publicly available data does not violate the Computer Fraud and Abuse Act (CFAA). The CFAA’s prohibition on accessing a computer “without authorization” applies to systems where authorization is required (like password-protected areas), not to publicly accessible websites.

What this means for scrapers:

  • Scraping data that is publicly accessible (no login required) is generally not a CFAA violation
  • This does not address copyright, ToS, or data protection concerns — only the CFAA
  • The ruling is binding in the Ninth Circuit and persuasive elsewhere, but is not universal law

Meta v. Bright Data (2024)

Meta (Facebook) sued Bright Data for scraping public Facebook and Instagram profiles. The court dismissed several of Meta’s claims, finding that scraping publicly available data — data that can be accessed without logging in — did not violate the CFAA or California’s computer access statute.

Key ruling: The court reinforced that public data scraping is not unauthorized computer access. However, Bright Data had initially logged in to access some data, and the court allowed claims related to scraping logged-in-only content to proceed.

What this means: The public/private distinction matters enormously. Scraping data visible to anyone without authentication is on much stronger legal footing than scraping data behind a login wall.

Ryanair v. PR Aviation (EU, 2015)

The European Court of Justice ruled that databases not meeting the threshold of “substantial investment” for database right protection under the EU Database Directive are not protected. Ryanair’s terms of service, which prohibited scraping, were considered a contractual matter rather than a database right issue.

What this means for EU scrapers: The EU Database Directive provides additional protections beyond copyright for databases, but the threshold for protection is high. Terms of Service restrictions may still apply as a contractual matter.

robots.txt and Its Legal Status

The robots.txt file is a voluntary standard (now formalized as RFC 9309) that lets website owners communicate their preferences about automated access. Its legal status is nuanced.

What robots.txt Is

A text file at the root of a website specifying which paths automated agents should and should not access:

User-agent: *
Disallow: /private/
Disallow: /api/internal/
Crawl-delay: 10

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml

robots.txt is not legally binding in most jurisdictions. It is a convention, not a contract. However:

  • Respecting robots.txt demonstrates good faith. In legal disputes, courts consider whether the scraper respected robots.txt as evidence of responsible behavior.
  • Ignoring robots.txt can support a trespass claim. If a website explicitly disallows scraping and you proceed anyway, it weakens your legal position.
  • Some jurisdictions give robots.txt more weight. In the EU, ignoring robots.txt while scraping personal data could be viewed as not meeting the “legitimate interest” balancing test under GDPR.

Best Practices

  1. Always check robots.txt before scraping a new domain.
  2. Respect Disallow directives unless you have a specific, defensible reason not to.
  3. Honor Crawl-delay to avoid overloading the server (it is a de facto extension, not part of RFC 9309, but widely used).
  4. Identify your scraper with a descriptive User-Agent that includes contact information.
  5. Document your compliance — keep logs showing that you checked and respected robots.txt.
A robots.txt check can be automated with Python's standard library:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def check_robots(url: str, user_agent: str = "MyCompanyBot/1.0") -> bool:
    """Check whether robots.txt allows this user agent to fetch the URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        return True  # If robots.txt is unavailable, default to allowed
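Crawl-delay (best practice 3) can be read the same way. A minimal sketch using the standard library's parser on an inline robots.txt; the MyCompanyBot name and sample rules are illustrative:

```python
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

def delay_for(user_agent: str, robots_txt: str, default: float = 1.0) -> float:
    """Return the crawl delay robots.txt requests for this agent, or a default."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

print(delay_for("MyCompanyBot/1.0", SAMPLE_ROBOTS))  # 10.0
```

Feeding the returned value into your request loop keeps your pacing aligned with the site's stated preference.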

GDPR: Scraping and European Data Protection

The General Data Protection Regulation is the most impactful data protection law for web scraping. If you scrape data about individuals in the EU/EEA — names, email addresses, photos, employment history, social media posts — GDPR applies regardless of where your company is located.

When GDPR Applies to Scraping

GDPR applies when you collect or process personal data of EU residents. Personal data includes any information that can identify a natural person, directly or indirectly:

  • Names, email addresses, phone numbers
  • Social media profiles and posts
  • IP addresses
  • Photos where individuals are identifiable
  • Employment information
  • Location data

GDPR does not apply to: Non-personal data (product prices, weather data, stock prices, anonymous statistics), or data about legal entities (company information, business addresses).

Legal Bases for Processing

Under GDPR, you need a legal basis to process personal data. For scraping, the two most relevant bases are:

Legitimate Interest (Article 6(1)(f)): You can process personal data if you have a legitimate interest that is not overridden by the data subject’s rights. This requires a balancing test:

  • What is your legitimate interest? (market research, fraud prevention, journalism)
  • Is scraping necessary for that interest, or are there less intrusive alternatives?
  • What are the data subjects’ reasonable expectations?
  • What is the impact on the data subjects?

Consent (Article 6(1)(a)): Rarely applicable for scraping, since you typically do not have a relationship with the data subjects before collecting their data.
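One way to make the balancing test auditable is to record it as structured data alongside your scraping configuration. A minimal sketch; the field names and example values are illustrative, not a legal template:

```python
from dataclasses import dataclass, asdict

@dataclass
class LegitimateInterestAssessment:
    """Hypothetical record of an Article 6(1)(f) balancing test."""
    purpose: str        # the legitimate interest being pursued
    necessity: str      # why scraping, rather than a less intrusive alternative
    expectations: str   # the data subjects' reasonable expectations
    impact: str         # effect on data subjects and mitigations
    outcome: str        # e.g. "proceed", "proceed with safeguards", "do not proceed"

lia = LegitimateInterestAssessment(
    purpose="Aggregate salary benchmarks from public job postings",
    necessity="No official API or licensed dataset covers these postings",
    expectations="Posters publish listings for broad public viewing",
    impact="No individual profiling; personal names are discarded at ingestion",
    outcome="proceed with safeguards",
)
```

Serializing such records (for example with asdict) gives you the documented assessment that the checklist below calls for.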

GDPR Compliance Checklist for Scrapers

  • Legal basis: Document your legitimate interest assessment
  • Data minimization: Only collect data you actually need
  • Purpose limitation: Define and document why you need the data
  • Storage limitation: Set retention periods and delete data when no longer needed
  • Transparency: Provide a privacy notice explaining your data collection
  • Data subject rights: Implement processes for access, deletion, and objection requests
  • Data Protection Impact Assessment: Conduct a DPIA for large-scale scraping of personal data
  • Record of processing: Maintain records of your scraping activities
  • Security: Protect scraped personal data with appropriate measures
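The storage-limitation requirement translates directly into code. A minimal sketch, assuming a 90-day retention period; the period itself should come from your own documented policy:

```python
from datetime import datetime, timedelta, timezone

# Assumed policy for illustration; set this from your own retention schedule.
RETENTION = timedelta(days=90)

def purge_expired(records: list[dict]) -> list[dict]:
    """Keep only records collected within the retention window (storage limitation)."""
    cutoff = datetime.now(timezone.utc) - RETENTION
    return [r for r in records if r["collected_at"] >= cutoff]
```

Running a purge like this on a schedule, and logging each run, doubles as evidence for the record-of-processing requirement.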

The “Publicly Available” Question Under GDPR

A common misconception is that GDPR does not apply to publicly available data. This is wrong. GDPR applies to all personal data processing, regardless of whether the data is publicly available. The fact that data is public may be a factor in the legitimate interest balancing test (data subjects have a lower expectation of privacy for data they voluntarily made public), but it does not exempt you from GDPR obligations.

CCPA: California Consumer Privacy Act

The CCPA (and its successor, the CPRA) provides California residents with rights over their personal information. It is less prescriptive about legal bases than GDPR but still imposes obligations on businesses that collect personal information.

When CCPA Applies

CCPA applies if your business:

  • Has annual gross revenue over $25 million, OR
  • Buys, sells, or shares the personal information of 100,000+ California residents, OR
  • Derives 50%+ of revenue from selling personal information

AND you collect personal information of California residents.

Key Requirements

Right to Know: Consumers can request what personal information you have collected about them and how you collected it.

Right to Delete: Consumers can request deletion of their personal information.

Right to Opt-Out: Consumers can opt out of the “sale” of their personal information. If you scrape data and share it with third parties (even for free), this may constitute a “sale” under CCPA.

Notice at Collection: You must inform consumers about the categories of personal information you collect and the purposes. For scraping, this is challenging since you typically do not interact with data subjects directly. A publicly accessible privacy policy describing your scraping practices is a minimum requirement.

Terms of Service Considerations

Most websites include Terms of Service (ToS) that prohibit scraping. The legal enforceability of these terms depends on several factors:

Browse-Wrap vs. Click-Wrap

Click-wrap agreements (where users actively click “I agree”) are generally enforceable. If you create an account on a site and agree to ToS that prohibit scraping, you are likely bound by those terms.

Browse-wrap agreements (where ToS are linked at the bottom of the page and “accepted” by merely using the site) have a weaker legal standing. Courts have frequently found that users did not have adequate notice of browse-wrap terms.

Practical Implications

  • Scraping without an account (publicly accessible pages) generally does not create a ToS relationship, especially for browse-wrap agreements.
  • Scraping with an account means you likely agreed to ToS, and violating them could support a breach of contract claim.
  • ToS cannot override legal rights. In some jurisdictions, ToS provisions that restrict access to public information may be unenforceable as contrary to public policy.

Best Practices for Compliant Scraping

Based on the current legal landscape, here are practical guidelines for compliant web scraping:

1. Scrape Only Publicly Available Data

Stick to data that is accessible without logging in. This provides the strongest legal protection under the CFAA and equivalent laws.

2. Respect robots.txt

Check and follow robots.txt directives. Document your compliance. Use the Crawl-delay directive to pace your requests.

3. Minimize Personal Data Collection

If your scraping involves personal data:

  • Collect only what you need (data minimization)
  • Anonymize or pseudonymize when possible
  • Set retention limits and delete data when no longer needed
  • Document your legitimate interest assessment
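Pseudonymization can be as simple as replacing direct identifiers with a keyed hash. A minimal sketch; the PSEUDONYM_KEY environment variable is illustrative, and note that keyed pseudonyms remain personal data under GDPR as long as the key exists:

```python
import hashlib
import hmac
import os

# Illustrative key source; in practice, manage this secret through your key store.
SECRET_KEY = os.environ.get("PSEUDONYM_KEY", "change-me").encode()

def pseudonymize(value: str) -> str:
    """Replace a direct identifier (e.g. an email) with a stable keyed hash."""
    digest = hmac.new(SECRET_KEY, value.strip().lower().encode(), hashlib.sha256)
    return digest.hexdigest()[:16]
```

A keyed hash keeps records linkable for analytics while ensuring the raw identifier never sits in your dataset.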

4. Identify Your Scraper

Use a descriptive User-Agent string that identifies your company and provides contact information:

User-Agent: MyCompanyBot/1.0 (+https://mycompany.com/bot-info; bot@mycompany.com)

This builds trust, allows site owners to contact you, and demonstrates good faith.
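Setting that header in code is straightforward. A minimal sketch using the standard library; the bot name and URLs are illustrative:

```python
from urllib.request import Request

UA = "MyCompanyBot/1.0 (+https://mycompany.com/bot-info; bot@mycompany.com)"

def build_request(url: str) -> Request:
    """Build a request that identifies the scraper via its User-Agent header."""
    return Request(url, headers={"User-Agent": UA})
```

The same header dictionary works with requests or any other HTTP client; the point is that every request your scraper sends carries the identifying string.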

5. Do Not Overload Target Servers

Rate limit your requests. Scraping that degrades a website’s performance can support trespass-to-chattels claims and is simply bad practice:

import time

# Respect crawl-delay, default to 1 second
def polite_scrape(urls: list[str], crawl_delay: float = 1.0):
    """Fetch URLs one at a time, pausing between requests.

    scrape() and process() are placeholders for your own fetch and handling logic.
    """
    for url in urls:
        result = scrape(url)      # your fetch function
        process(result)           # your handling of the scraped result
        time.sleep(crawl_delay)   # pause between requests

6. Check for Official APIs

Before scraping, check if the site offers an official API or data feed. Using an authorized data source eliminates legal risk and typically provides cleaner, more structured data.

7. Do Not Circumvent Access Controls

Do not bypass CAPTCHAs, break password protection, or circumvent technical access restrictions on content that is not intended to be public. This is both legally risky (potential CFAA violation) and ethically questionable.

Note: anti-bot systems on publicly accessible pages are different from access controls on private content. A site that protects public product listings with Cloudflare is not making that content “private” — it is trying to manage automated access. The legal distinction between “access control” and “anti-bot measure on public content” is still evolving.

8. Document Everything

Maintain records of:

  • Your robots.txt compliance checks
  • Rate limiting settings
  • Data retention policies
  • Legitimate interest assessments (for GDPR)
  • Dates and sources of data collection

Documentation demonstrates good faith and provides evidence of compliance if challenged.
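A lightweight way to keep such records is an append-only JSON Lines log. A minimal sketch; the file name and event fields are illustrative:

```python
import json
import time

def log_compliance(event: dict, path: str = "compliance_log.jsonl") -> None:
    """Append a timestamped compliance record (e.g. a robots.txt check) to a JSONL file."""
    entry = {"ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), **event}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_compliance({"action": "robots_check", "domain": "example.com", "allowed": True})
```

One line per event keeps the log greppable and easy to hand over if your practices are ever questioned.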

FineData’s Approach to Ethical Scraping

At FineData, we build compliance considerations into the platform:

  • robots.txt awareness — our documentation encourages users to check robots.txt before scraping
  • Rate limiting — built-in request rate controls prevent server overload
  • No credential storage — the API does not store or manage login credentials for target sites
  • Transparent operation — clear logging of what is accessed and when
  • Data handling — scraped content is returned to the user and not stored on our servers beyond the request lifecycle
A minimal request through the FineData scrape endpoint:

import requests

# FineData handles the technical complexity while you manage compliance
response = requests.post(
    "https://api.finedata.ai/api/v1/scrape",
    headers={
        "x-api-key": "fd_your_api_key",
        "Content-Type": "application/json"
    },
    json={
        "url": "https://example.com/public-data",
        "use_js_render": False,
        "timeout": 30
    }
)

The responsibility for legal compliance ultimately lies with you, the data collector. FineData provides the technical infrastructure, but you must ensure that your scraping activities comply with applicable laws.

Regional Considerations

United States

  • CFAA: Public data scraping is generally permissible (hiQ v. LinkedIn)
  • State privacy laws: CCPA/CPRA (California), VCDPA (Virginia), CPA (Colorado), CTDPA (Connecticut)
  • Copyright: Facts are not copyrightable, but compilations with creative selection may be

European Union

  • GDPR: Applies to all personal data, even if publicly available
  • EU Database Directive: Additional protection for databases with substantial investment
  • ePrivacy Directive: May apply to certain tracking and data collection activities

United Kingdom

  • UK GDPR: Similar to EU GDPR, maintained post-Brexit
  • Computer Misuse Act: Similar to CFAA, prohibits unauthorized access
  • Copyright, Designs and Patents Act: Protects original databases

Australia

  • Privacy Act 1988: Applies to personal information collection
  • No specific web scraping legislation, but general privacy principles apply

China

  • Personal Information Protection Law (PIPL): China’s comprehensive data protection law
  • Data Security Law: Additional requirements for data collection and cross-border transfer
  • Civil Code: Privacy rights provisions

Looking Ahead

The legal landscape for web scraping continues to evolve:

AI training data. The use of scraped data for AI model training is a frontier legal issue, with ongoing litigation (e.g., The New York Times v. OpenAI). This may lead to new frameworks for data collection rights.

State privacy laws. The patchwork of US state privacy laws continues to expand, with more states enacting CCPA-like legislation. A federal privacy law remains a possibility.

International data transfers. Scraping data across borders is increasingly scrutinized, particularly for personal data subject to GDPR or PIPL.

robots.txt and AI. The intersection of robots.txt, AI crawlers, and training data rights is an active area of discussion. New standards for machine learning data collection preferences may emerge.

Key Takeaways

  1. Publicly available data is generally fair game for scraping in the US (under CFAA), but other laws (GDPR, copyright, ToS) may still apply.
  2. GDPR applies to all personal data regardless of whether it is public. Conduct a legitimate interest assessment if scraping personal data of EU residents.
  3. Respect robots.txt. It is not legally binding in most jurisdictions, but compliance demonstrates good faith.
  4. Do not circumvent access controls. Stick to publicly accessible content.
  5. Document your compliance practices. Good records are your best defense.
  6. When in doubt, consult a lawyer. This guide provides a framework, but specific legal advice requires a qualified attorney familiar with your jurisdiction and use case.

FineData provides the technical infrastructure for web scraping with built-in rate limiting and compliance-friendly features. Start with 1,000 free tokens and scrape responsibly.

#legal #gdpr #ccpa #robots-txt #compliance #ethics
