Lead Generation with Web Data: From HTML to CRM
Learn how sales teams use web scraping for lead generation — from identifying data sources and extracting contacts to CRM integration and compliance.
Every sale starts with a lead. For B2B sales teams, the quality and volume of your lead pipeline directly determines revenue outcomes. And the web is, by far, the richest source of lead data available — company directories, professional profiles, industry listings, event attendee lists, job postings, and public business registrations.
The challenge isn’t finding data — it’s extracting it systematically, cleaning it, and getting it into your CRM in a usable format. This guide walks through the complete process of building a web-powered lead generation pipeline, from source identification to CRM integration.
Where to Find Lead Data Online
Business Directories
Directories are purpose-built lists of companies, often with contact information, industry classification, and company details:
- Industry-specific directories — Clutch (agencies), G2 (software), Capterra (SaaS), ThomasNet (manufacturing)
- General business directories — Yellow Pages, BBB, Yelp (for local businesses)
- Government registries — SEC EDGAR (public companies), state business registrations, SBA databases
- Chamber of Commerce listings
Professional Networks
- LinkedIn — The definitive B2B professional database (note: heavy restrictions on automated access)
- Crunchbase — Startup and company data with funding information
- AngelList / Wellfound — Startup ecosystem data
Company Websites
Individual company websites contain rich signals:
- Team / About pages — Decision-maker names and titles
- Contact pages — Direct email addresses and phone numbers
- Blog / News sections — Technology choices, growth signals, company priorities
- Job postings — Hiring indicates growth, and job descriptions reveal tech stack and pain points
Event and Conference Sites
- Attendee lists — Published lists from trade shows and conferences
- Speaker directories — Industry leaders and decision-makers
- Sponsor lists — Companies investing in industry visibility
Review and Community Sites
- G2, Capterra, TrustRadius — Companies using competitor products (high-intent leads)
- Stack Overflow, GitHub — Developers using relevant technologies
- Reddit, Quora — People asking questions your product answers
What to Extract
For B2B leads, a complete record typically includes:
| Field | Source | Priority |
|---|---|---|
| Company name | Directory, website | Required |
| Website URL | Directory, search | Required |
| Industry / vertical | Directory, manual classification | High |
| Company size | LinkedIn, directory, job postings | High |
| Contact name | Team page, directory, LinkedIn | High |
| Job title | Team page, LinkedIn | High |
| Email address | Contact page, pattern detection | Medium |
| Phone number | Contact page, directory | Medium |
| Location | Directory, website | Medium |
| Technology stack | Job postings, BuiltWith | Contextual |
| Funding stage | Crunchbase, press releases | Contextual |
| Recent news | Blog, press releases | Contextual |
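In code, that record can be modeled as a lightweight dataclass. A sketch, with field names chosen here for illustration rather than any fixed schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Lead:
    # Required fields
    company_name: str
    website: str
    # High-priority fields
    industry: Optional[str] = None
    company_size: Optional[int] = None
    contact_name: Optional[str] = None
    job_title: Optional[str] = None
    # Medium-priority fields
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = None
    # Contextual fields
    tech_stack: list = field(default_factory=list)
    funding_stage: Optional[str] = None
    recent_news: list = field(default_factory=list)

# Only the required fields are needed to create a record;
# enrichment fills in the rest later in the pipeline.
lead = Lead(company_name="Acme Corp", website="https://acme.example")
```

Keeping the required fields positional means a record can't enter the pipeline without a company name and website, which every downstream step depends on.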
Building the Extraction Pipeline
Scraping Directory Listings
Directories are usually the highest-ROI starting point. They’re structured, contain many companies per page, and are designed to be browsable.
```python
import requests
import time
from bs4 import BeautifulSoup

FINEDATA_API = "https://api.finedata.ai/api/v1/scrape"
API_KEY = "fd_your_api_key"

def scrape_directory_page(url):
    """Scrape a directory listing page and extract company entries."""
    response = requests.post(
        FINEDATA_API,
        headers={
            "x-api-key": API_KEY,
            "Content-Type": "application/json"
        },
        json={
            "url": url,
            "use_js_render": True,
            "tls_profile": "chrome124",
            "timeout": 30
        }
    )
    if response.status_code != 200:
        return []

    html = response.json()["body"]
    soup = BeautifulSoup(html, "html.parser")

    companies = []
    for listing in soup.select(".company-listing"):
        company = {
            "name": safe_text(listing.select_one(".company-name")),
            "website": safe_attr(listing.select_one("a.website-link"), "href"),
            "location": safe_text(listing.select_one(".location")),
            "description": safe_text(listing.select_one(".description")),
            "category": safe_text(listing.select_one(".category")),
        }
        if company["name"]:
            companies.append(company)
    return companies

def safe_text(element):
    return element.get_text(strip=True) if element else None

def safe_attr(element, attr):
    return element.get(attr) if element else None
```
Enriching with Company Website Data
Once you have a list of companies, visit each website to extract additional details:
```python
def enrich_company(company):
    """Visit a company website and extract contact and context data."""
    website = company.get("website")
    if not website:
        return company

    # Strip any trailing slash so the joined URLs don't contain "//"
    website = website.rstrip("/")

    # Try the pages most likely to list contact details
    contact_urls = [
        f"{website}/contact",
        f"{website}/contact-us",
        f"{website}/about",
        f"{website}/team"
    ]

    for url in contact_urls:
        response = requests.post(
            FINEDATA_API,
            headers={
                "x-api-key": API_KEY,
                "Content-Type": "application/json"
            },
            json={
                "url": url,
                "use_js_render": True,
                "tls_profile": "chrome124",
                "timeout": 20
            }
        )
        if response.status_code == 200:
            html = response.json()["body"]
            company.update(extract_contact_info(html))
        time.sleep(1)  # polite delay between page fetches
    return company
```
```python
import re

def extract_contact_info(html):
    """Extract emails and phone numbers from raw HTML."""
    data = {}

    # Extract email addresses
    emails = re.findall(
        r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        html
    )

    # Filter out common non-lead mailboxes
    filtered_emails = [
        e for e in emails
        if not any(prefix in e.lower() for prefix in
                   ["noreply", "no-reply", "support", "info@", "admin@", "webmaster"])
    ]
    if filtered_emails:
        data["emails"] = list(set(filtered_emails))

    # Extract phone numbers (US format)
    phones = re.findall(
        r"[\+]?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",
        html
    )
    if phones:
        data["phone"] = phones[0].strip()
    return data
```
Data Cleaning and Enrichment
Raw scraped data is messy. Before it goes into your CRM, it needs cleaning:
Deduplication
The same company often appears across multiple sources with slightly different names. Implement fuzzy matching:
```python
from difflib import SequenceMatcher

def find_duplicates(companies, threshold=0.85):
    """Identify potential duplicate companies using fuzzy name matching."""
    duplicates = []
    for i, a in enumerate(companies):
        for j, b in enumerate(companies[i+1:], i+1):
            name_a = a["name"].lower().strip()
            name_b = b["name"].lower().strip()
            similarity = SequenceMatcher(None, name_a, name_b).ratio()
            if similarity >= threshold:
                duplicates.append((i, j, similarity))
    return duplicates
```
Data Standardization
- Company names — Remove Inc., LLC, Ltd. suffixes for matching; keep them for the CRM record
- Phone numbers — Normalize to E.164 format
- Addresses — Parse into structured components (street, city, state, zip)
- Job titles — Map variations to standard titles (VP of Sales, Vice President Sales, Head of Sales → “VP Sales”)
- URLs — Normalize (remove trailing slashes, www prefix, HTTP/HTTPS differences)
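The rules above can be sketched as a few small helpers. This is a simplified illustration; the suffix pattern and title map are examples, not exhaustive lists:

```python
import re
from urllib.parse import urlparse

# Legal suffixes stripped for matching only; the CRM keeps the full name
LEGAL_SUFFIXES = re.compile(r",?\s+(inc\.?|llc|ltd\.?|corp\.?)$", re.IGNORECASE)

# Map common title variants onto one canonical form
TITLE_MAP = {
    "vp of sales": "VP Sales",
    "vice president sales": "VP Sales",
    "head of sales": "VP Sales",
}

def matching_name(name):
    """Company name with a trailing legal suffix removed, for dedup matching."""
    return LEGAL_SUFFIXES.sub("", name).strip().lower()

def normalize_us_phone(raw):
    """Normalize a US phone number to E.164 (+1XXXXXXXXXX); None if malformed."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return f"+1{digits}" if len(digits) == 10 else None

def normalize_url(url):
    """Force https, lowercase the host, drop www prefix and trailing slash."""
    parsed = urlparse(url if "://" in url else f"https://{url}")
    host = parsed.netloc.lower().removeprefix("www.")
    path = parsed.path.rstrip("/")
    return f"https://{host}{path}"

def normalize_title(title):
    return TITLE_MAP.get(title.strip().lower(), title.strip())
```

Run these before deduplication, not after: two records for the same company often only collide once their names, URLs, and phone numbers are in canonical form.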
Email Verification
Don’t load unverified emails into your CRM. At minimum:
- Check syntax validity
- Verify the domain has MX records
- Use an email verification service for bounce checking
Loading bad emails damages your sender reputation and skews your CRM data quality.
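A minimal sketch of the first two checks, syntax and MX lookup. The MX check assumes the third-party dnspython package is available; bounce checking still needs a dedicated verification service:

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def has_valid_syntax(email):
    """Cheap first pass: reject anything that isn't shaped like an email."""
    return bool(EMAIL_RE.match(email))

def domain_has_mx(email):
    """Check the domain for MX records. Requires dnspython (pip install dnspython)."""
    try:
        import dns.resolver  # third-party: dnspython
    except ImportError:
        return None  # can't verify without the library
    domain = email.rsplit("@", 1)[-1]
    try:
        return len(dns.resolver.resolve(domain, "MX")) > 0
    except Exception:
        return False

def is_loadable(email):
    """Load only emails that pass syntax and (when checkable) MX validation."""
    return has_valid_syntax(email) and domain_has_mx(email) is not False
```

Treating an unavailable MX check as "unknown" rather than "failed" keeps the pipeline running in environments without DNS access, while still rejecting domains that definitively can't receive mail.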
Lead Scoring
Not every company is a good lead. Build a scoring model based on:
- Company size — Does it match your ideal customer profile?
- Industry — Is it in a target vertical?
- Technology signals — Are they using complementary or competing tools?
- Growth signals — Are they hiring? Recently funded?
- Engagement signals — Did they visit your website? Download content?
```python
def score_lead(company):
    score = 0

    # Company size scoring
    size = company.get("employee_count", 0)
    if 50 <= size <= 500:
        score += 30  # Sweet spot for our product
    elif 500 < size <= 2000:
        score += 20
    elif 10 <= size < 50:
        score += 10

    # Industry scoring
    target_industries = ["saas", "ecommerce", "fintech", "marketing"]
    if company.get("industry", "").lower() in target_industries:
        score += 25

    # Direct contact details available
    if company.get("emails"):
        score += 15
    if company.get("phone"):
        score += 10

    # Growth signals
    if company.get("is_hiring"):
        score += 10
    if company.get("recent_funding"):
        score += 10

    return score
```
CRM Integration
The final step is getting clean, scored leads into your sales team’s hands.
Common CRM Integrations
Most CRMs offer APIs for programmatic lead creation:
- Salesforce — REST API or Bulk API for high volume
- HubSpot — Contacts API with easy setup
- Pipedrive — Simple REST API
- Close — Well-documented API for sales-focused teams
HubSpot Example
```python
import requests

def push_to_hubspot(lead, hubspot_api_key):
    """Create a contact in HubSpot."""
    # Note: custom properties such as lead_source and lead_score must
    # already exist in your HubSpot portal before they can be set here.
    contact_data = {
        "properties": {
            "company": lead["name"],
            "website": lead.get("website", ""),
            "email": lead.get("emails", [None])[0],
            "phone": lead.get("phone", ""),
            "city": lead.get("city", ""),
            "state": lead.get("state", ""),
            "industry": lead.get("industry", ""),
            "lead_source": "web_scraping",
            "lead_score": lead.get("score", 0)
        }
    }
    response = requests.post(
        "https://api.hubapi.com/crm/v3/objects/contacts",
        headers={
            "Authorization": f"Bearer {hubspot_api_key}",
            "Content-Type": "application/json"
        },
        json=contact_data
    )
    return response.status_code == 201  # 201 Created on success
```
Best Practices for CRM Loading
- Map fields carefully. Ensure scraped data maps to the right CRM fields.
- Set the lead source. Tag leads as “Web Scraping” or “Data Enrichment” so you can track conversion rates by source.
- Avoid duplicates. Check for existing records before creating new ones. Match on email, company name, or domain.
- Include context. Add notes about where the lead was found and why it’s relevant.
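Duplicate checks don't have to wait for a CRM round-trip: a local index keyed on email and website domain catches most repeats before the API call. A sketch, where the choice of keys is an assumption to adapt to your own matching rules:

```python
from urllib.parse import urlparse

def dedup_keys(lead):
    """Candidate identity keys for a lead: primary email and website domain."""
    keys = set()
    emails = lead.get("emails") or []
    if emails:
        keys.add(("email", emails[0].lower()))
    website = lead.get("website")
    if website:
        domain = urlparse(website).netloc.lower().removeprefix("www.")
        if domain:
            keys.add(("domain", domain))
    return keys

def filter_new_leads(leads, seen=None):
    """Yield only leads whose identity keys haven't been seen before."""
    seen = seen if seen is not None else set()
    for lead in leads:
        keys = dedup_keys(lead)
        if keys & seen:
            continue  # matches an already-loaded lead; skip
        seen |= keys
        yield lead

batch = [
    {"name": "Acme", "website": "https://www.acme.com", "emails": ["a@acme.com"]},
    {"name": "Acme Inc", "website": "https://acme.com"},  # same domain: dropped
]
unique = list(filter_new_leads(batch))
```

Persisting the `seen` set between runs (for example, seeded from a CRM export) extends the same check across batches.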
Compliance Considerations
Lead generation with web data sits in a legal gray area that requires careful navigation:
GDPR (European Union)
If you’re collecting data about individuals in the EU:
- You need a lawful basis (most commonly legitimate interest) for processing personal data
- Individuals have the right to access, rectify, and delete their data
- You must provide a way for people to opt out
- Data processing must be proportionate to the purpose
CAN-SPAM (United States)
If you’re emailing leads:
- Include a clear unsubscribe mechanism
- Don’t use misleading subject lines or sender information
- Include your physical address
General Best Practices
- Only collect data that’s publicly available
- Respect robots.txt and terms of service
- Don’t scrape data that’s clearly behind authentication walls
- Maintain a suppression list for people who ask to be removed
- Document your data collection and processing procedures
- Consult a lawyer if you’re operating in regulated industries or across borders
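The suppression list in particular is easy to operationalize: a normalized set checked before any email leaves your pipeline. A minimal sketch (the in-memory set stands in for whatever store you use):

```python
def build_suppression_set(entries):
    """Normalize opt-out emails to lowercase for exact matching."""
    return {e.strip().lower() for e in entries if e.strip()}

def is_suppressed(email, suppressed):
    """True if this address has asked to be removed."""
    return email.strip().lower() in suppressed

suppressed = build_suppression_set(["Jane@Acme.com ", "", "bob@example.org"])
```

Apply this check at send time as well as at CRM-load time, so an opt-out received after a lead was loaded still takes effect.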
Putting It All Together
A complete lead generation pipeline looks like this:
- Source identification — Map directories, websites, and databases relevant to your ICP
- Data extraction — Use FineData’s API to scrape structured data at scale
- Cleaning and deduplication — Standardize, deduplicate, and verify the data
- Enrichment — Augment basic records with additional context from company websites
- Scoring — Rank leads by fit and intent signals
- CRM integration — Push qualified leads into your sales workflow
- Feedback loop — Track which leads convert and refine your scoring model
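Wired together, the steps above reduce to a short driver. A sketch with the stages passed in as callables; every function name here is hypothetical, standing in for the implementations developed earlier in this guide:

```python
def run_pipeline(directory_urls, scrape, enrich, clean, score, push, min_score=40):
    """Orchestrate: scrape -> enrich -> clean -> score -> push qualified leads."""
    pushed = []
    for url in directory_urls:
        for company in scrape(url):
            lead = clean(enrich(company))
            lead["score"] = score(lead)
            if lead["score"] >= min_score:
                push(lead)
                pushed.append(lead)
    return pushed

# Stub stages to show the shape; replace with real implementations
qualified = run_pipeline(
    ["https://directory.example/page/1"],
    scrape=lambda url: [{"name": "Acme"}],
    enrich=lambda c: {**c, "emails": ["a@acme.example"]},
    clean=lambda c: c,
    score=lambda c: 50,
    push=lambda c: None,
)
```

Passing the stages in as arguments keeps each step swappable and testable in isolation, which makes the feedback loop (step 7) much cheaper: re-scoring a batch is just a call with a new `score` function.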
The teams that generate the most pipeline from web data aren’t just scraping more — they’re scraping smarter. They target the right sources, extract the right fields, and build systematic processes that improve over time.
Get started with FineData and turn the open web into your most productive lead source.