
Lead Generation with Web Data: From HTML to CRM

Learn how sales teams use web scraping for lead generation — from identifying data sources and extracting contacts to CRM integration and compliance.

FineData Team

Every sale starts with a lead. For B2B sales teams, the quality and volume of your lead pipeline directly determines revenue outcomes. And the web is, by far, the richest source of lead data available — company directories, professional profiles, industry listings, event attendee lists, job postings, and public business registrations.

The challenge isn’t finding data — it’s extracting it systematically, cleaning it, and getting it into your CRM in a usable format. This guide walks through the complete process of building a web-powered lead generation pipeline, from source identification to CRM integration.

Where to Find Lead Data Online

Business Directories

Directories are purpose-built lists of companies, often with contact information, industry classification, and company details:

  • Industry-specific directories — Clutch (agencies), G2 (software), Capterra (SaaS), ThomasNet (manufacturing)
  • General business directories — Yellow Pages, BBB, Yelp (for local businesses)
  • Government registries — SEC EDGAR (public companies), state business registrations, SBA databases
  • Chamber of Commerce listings

Professional Networks

  • LinkedIn — The definitive B2B professional database (note: heavy restrictions on automated access)
  • Crunchbase — Startup and company data with funding information
  • AngelList / Wellfound — Startup ecosystem data

Company Websites

Individual company websites contain rich signals:

  • Team / About pages — Decision-maker names and titles
  • Contact pages — Direct email addresses and phone numbers
  • Blog / News sections — Technology choices, growth signals, company priorities
  • Job postings — Hiring indicates growth, and job descriptions reveal tech stack and pain points

Event and Conference Sites

  • Attendee lists — Published lists from trade shows and conferences
  • Speaker directories — Industry leaders and decision-makers
  • Sponsor lists — Companies investing in industry visibility

Review and Community Sites

  • G2, Capterra, TrustRadius — Companies using competitor products (high-intent leads)
  • Stack Overflow, GitHub — Developers using relevant technologies
  • Reddit, Quora — People asking questions your product answers

What to Extract

For B2B leads, a complete record typically includes:

| Field | Source | Priority |
| --- | --- | --- |
| Company name | Directory, website | Required |
| Website URL | Directory, search | Required |
| Industry / vertical | Directory, manual classification | High |
| Company size | LinkedIn, directory, job postings | High |
| Contact name | Team page, directory, LinkedIn | High |
| Job title | Team page, LinkedIn | High |
| Email address | Contact page, pattern detection | Medium |
| Phone number | Contact page, directory | Medium |
| Location | Directory, website | Medium |
| Technology stack | Job postings, BuiltWith | Contextual |
| Funding stage | Crunchbase, press releases | Contextual |
| Recent news | Blog, press releases | Contextual |
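The fields above can be captured in a simple record type so every stage of the pipeline works with the same shape. A minimal sketch (the field names here are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Lead:
    """A single lead record; fields mirror the table above."""
    company_name: str
    website_url: str
    industry: Optional[str] = None
    company_size: Optional[int] = None
    contact_name: Optional[str] = None
    job_title: Optional[str] = None
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = None
    tech_stack: list = field(default_factory=list)
    funding_stage: Optional[str] = None
    recent_news: Optional[str] = None
```

Required fields are positional; everything else defaults to empty so partially enriched records are still valid.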

Building the Extraction Pipeline

Scraping Directory Listings

Directories are usually the highest-ROI starting point. They’re structured, contain many companies per page, and are designed to be browsable.

import requests
import time
from bs4 import BeautifulSoup

FINEDATA_API = "https://api.finedata.ai/api/v1/scrape"
API_KEY = "fd_your_api_key"

def scrape_directory_page(url):
    """Scrape a directory listing page and extract company entries."""
    response = requests.post(
        FINEDATA_API,
        headers={
            "x-api-key": API_KEY,
            "Content-Type": "application/json"
        },
        json={
            "url": url,
            "use_js_render": True,
            "tls_profile": "chrome124",
            "timeout": 30
        }
    )

    if response.status_code != 200:
        return []

    html = response.json()["body"]
    soup = BeautifulSoup(html, "html.parser")
    companies = []

    for listing in soup.select(".company-listing"):
        company = {
            "name": safe_text(listing.select_one(".company-name")),
            "website": safe_attr(listing.select_one("a.website-link"), "href"),
            "location": safe_text(listing.select_one(".location")),
            "description": safe_text(listing.select_one(".description")),
            "category": safe_text(listing.select_one(".category")),
        }
        if company["name"]:
            companies.append(company)

    return companies


def safe_text(element):
    return element.get_text(strip=True) if element else None

def safe_attr(element, attr):
    return element.get(attr) if element else None
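Because directory listings span many pages, a small helper can generate the page URLs to feed into scrape_directory_page. The ?page=N query pattern below is an assumption — check the target directory's actual pagination scheme before relying on it:

```python
def build_page_urls(base_url, num_pages):
    """Generate paginated directory URLs, assuming a ?page=N query scheme."""
    return [f"{base_url}?page={n}" for n in range(1, num_pages + 1)]

# Usage (hypothetical directory URL):
# all_companies = []
# for url in build_page_urls("https://example-directory.com/agencies", 10):
#     all_companies.extend(scrape_directory_page(url))
#     time.sleep(2)  # be polite between pages
```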

Enriching with Company Website Data

Once you have a list of companies, visit each website to extract additional details:

def enrich_company(company):
    """Visit a company website and extract contact and context data."""
    website = company.get("website")
    if not website:
        return company
    website = website.rstrip("/")  # avoid double slashes when building page URLs

    # Try the contact page first
    contact_urls = [
        f"{website}/contact",
        f"{website}/contact-us",
        f"{website}/about",
        f"{website}/team"
    ]

    for url in contact_urls:
        response = requests.post(
            FINEDATA_API,
            headers={
                "x-api-key": API_KEY,
                "Content-Type": "application/json"
            },
            json={
                "url": url,
                "use_js_render": True,
                "tls_profile": "chrome124",
                "timeout": 20
            }
        )

        if response.status_code == 200:
            html = response.json()["body"]
            extracted = extract_contact_info(html)
            company.update(extracted)

        time.sleep(1)

    return company


def extract_contact_info(html):
    """Extract emails, phone numbers, and names from HTML."""
    import re

    data = {}

    # Extract emails
    emails = re.findall(
        r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        html
    )
    # Filter out common non-lead emails
    filtered_emails = [
        e for e in emails
        if not any(prefix in e.lower() for prefix in
            ["noreply", "no-reply", "support", "info@", "admin@", "webmaster"])
    ]
    if filtered_emails:
        data["emails"] = list(set(filtered_emails))

    # Extract phone numbers (US format)
    phones = re.findall(
        r"[\+]?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",
        html
    )
    if phones:
        data["phone"] = phones[0].strip()

    return data
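When a contact page yields names but no addresses, the "pattern detection" approach from the field table can fill the gap: most companies use one of a handful of address formats, which you can generate and then run through verification. A sketch (the pattern list is illustrative, not exhaustive):

```python
def guess_email_patterns(first, last, domain):
    """Generate common corporate email patterns for a named contact.
    Candidates must still be verified before use."""
    first, last = first.lower(), last.lower()
    return [
        f"{first}.{last}@{domain}",   # jane.doe@
        f"{first[0]}{last}@{domain}", # jdoe@
        f"{first}@{domain}",          # jane@
        f"{first}{last[0]}@{domain}", # janed@
    ]
```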

Data Cleaning and Enrichment

Raw scraped data is messy. Before it goes into your CRM, it needs cleaning:

Deduplication

The same company often appears across multiple sources with slightly different names. Implement fuzzy matching:

from difflib import SequenceMatcher

def find_duplicates(companies, threshold=0.85):
    """Identify potential duplicate companies using fuzzy name matching."""
    duplicates = []
    for i, a in enumerate(companies):
        for j, b in enumerate(companies[i+1:], i+1):
            name_a = a["name"].lower().strip()
            name_b = b["name"].lower().strip()

            similarity = SequenceMatcher(None, name_a, name_b).ratio()
            if similarity >= threshold:
                duplicates.append((i, j, similarity))

    return duplicates

Data Standardization

  • Company names — Remove Inc., LLC, Ltd. suffixes for matching; keep them for the CRM record
  • Phone numbers — Normalize to E.164 format
  • Addresses — Parse into structured components (street, city, state, zip)
  • Job titles — Map variations to standard titles (VP of Sales, Vice President Sales, Head of Sales → “VP Sales”)
  • URLs — Normalize (remove trailing slashes, www prefix, HTTP/HTTPS differences)
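URL normalization, for example, is a small pure function. A minimal sketch — real-world matching may also need to strip tracking parameters and handle subdomains:

```python
from urllib.parse import urlparse

def normalize_url(url):
    """Normalize a URL for matching: lowercase host, no scheme,
    no www prefix, no trailing slash."""
    if "://" not in url:
        url = "https://" + url  # urlparse needs a scheme to split host/path
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")
```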

Email Verification

Don’t load unverified emails into your CRM. At minimum:

  • Check syntax validity
  • Verify the domain has MX records
  • Use an email verification service for bounce checking

Loading bad emails damages your sender reputation and skews your CRM data quality.
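The syntax check is easy to do inline; MX and bounce checks need a DNS library or an external verification service, so they are omitted from this sketch:

```python
import re

# Pragmatic pattern — full RFC 5322 validation is considerably more complex.
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def is_valid_email_syntax(email):
    """Return True if the address passes a basic syntax check."""
    return bool(EMAIL_RE.match(email))
```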

Lead Scoring

Not every company is a good lead. Build a scoring model based on:

  • Company size — Does it match your ideal customer profile?
  • Industry — Is it in a target vertical?
  • Technology signals — Are they using complementary or competing tools?
  • Growth signals — Are they hiring? Recently funded?
  • Engagement signals — Did they visit your website? Download content?

def score_lead(company):
    score = 0

    # Company size scoring
    size = company.get("employee_count", 0)
    if 50 <= size <= 500:
        score += 30  # Sweet spot for our product
    elif 500 < size <= 2000:
        score += 20
    elif 10 <= size < 50:
        score += 10

    # Industry scoring
    target_industries = ["saas", "ecommerce", "fintech", "marketing"]
    if company.get("industry", "").lower() in target_industries:
        score += 25

    # Has direct contact
    if company.get("emails"):
        score += 15
    if company.get("phone"):
        score += 10

    # Growth signals
    if company.get("is_hiring"):
        score += 10
    if company.get("recent_funding"):
        score += 10

    return score

CRM Integration

The final step is getting clean, scored leads into your sales team’s hands.

Common CRM Integrations

Most CRMs offer APIs for programmatic lead creation:

  • Salesforce — REST API or Bulk API for high volume
  • HubSpot — Contacts API with easy setup
  • Pipedrive — Simple REST API
  • Close — Well-documented API for sales-focused teams

HubSpot Example

import requests

def push_to_hubspot(lead, hubspot_api_key):
    """Create a contact in HubSpot."""
    contact_data = {
        "properties": {
            "company": lead["name"],
            "website": lead.get("website", ""),
            "email": (lead.get("emails") or [None])[0],
            "phone": lead.get("phone", ""),
            "city": lead.get("city", ""),
            "state": lead.get("state", ""),
            "industry": lead.get("industry", ""),
            "lead_source": "web_scraping",
            "lead_score": lead.get("score", 0)
        }
    }

    response = requests.post(
        "https://api.hubapi.com/crm/v3/objects/contacts",
        headers={
            "Authorization": f"Bearer {hubspot_api_key}",
            "Content-Type": "application/json"
        },
        json=contact_data
    )

    return response.status_code == 201

Best Practices for CRM Loading

  • Map fields carefully. Ensure scraped data maps to the right CRM fields.
  • Set the lead source. Tag leads as “Web Scraping” or “Data Enrichment” so you can track conversion rates by source.
  • Avoid duplicates. Check for existing records before creating new ones. Match on email, company name, or domain.
  • Include context. Add notes about where the lead was found and why it’s relevant.
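As an example, a duplicate check against HubSpot's contact search endpoint might look like this — a sketch; confirm the endpoint and filter syntax against HubSpot's current CRM API documentation:

```python
import requests

def build_email_search_payload(email):
    """Build a HubSpot search payload filtering contacts by exact email."""
    return {
        "filterGroups": [{
            "filters": [
                {"propertyName": "email", "operator": "EQ", "value": email}
            ]
        }]
    }

def contact_exists(email, hubspot_api_key):
    """Return True if a contact with this email already exists in HubSpot."""
    response = requests.post(
        "https://api.hubapi.com/crm/v3/objects/contacts/search",
        headers={
            "Authorization": f"Bearer {hubspot_api_key}",
            "Content-Type": "application/json",
        },
        json=build_email_search_payload(email),
    )
    return response.status_code == 200 and response.json().get("total", 0) > 0
```

Call contact_exists before push_to_hubspot and skip (or update) when it returns True.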

Compliance Considerations

Lead generation with web data sits in a legal gray area that requires careful navigation:

GDPR (European Union)

If you’re collecting data about individuals in the EU:

  • You need a legitimate interest basis for processing personal data
  • Individuals have the right to access, rectify, and delete their data
  • You must provide a way for people to opt out
  • Data processing must be proportionate to the purpose

CAN-SPAM (United States)

If you’re emailing leads:

  • Include a clear unsubscribe mechanism
  • Don’t use misleading subject lines or sender information
  • Include your physical address

General Best Practices

  • Only collect data that’s publicly available
  • Respect robots.txt and terms of service
  • Don’t scrape data that’s clearly behind authentication walls
  • Maintain a suppression list for people who ask to be removed
  • Document your data collection and processing procedures
  • Consult a lawyer if you’re operating in regulated industries or across borders

Putting It All Together

A complete lead generation pipeline looks like this:

  1. Source identification — Map directories, websites, and databases relevant to your ICP
  2. Data extraction — Use FineData’s API to scrape structured data at scale
  3. Cleaning and deduplication — Standardize, deduplicate, and verify the data
  4. Enrichment — Augment basic records with additional context from company websites
  5. Scoring — Rank leads by fit and intent signals
  6. CRM integration — Push qualified leads into your sales workflow
  7. Feedback loop — Track which leads convert and refine your scoring model
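The steps above can be wired together in a small orchestrator. This sketch injects the stage functions (enrich, score, push stand in for the functions earlier in this guide; min_score is an arbitrary threshold you would tune):

```python
def run_pipeline(raw_companies, enrich, score, push, min_score=40):
    """Dedupe, enrich, score, and push leads; returns the leads pushed."""
    seen = set()
    pushed = []
    for company in raw_companies:
        key = company["name"].lower().strip()
        if key in seen:          # crude exact-match dedupe; see fuzzy
            continue             # matching above for the robust version
        seen.add(key)
        company = enrich(company)
        company["score"] = score(company)
        if company["score"] >= min_score:
            push(company)
            pushed.append(company)
    return pushed
```

Injecting the stages keeps the orchestrator testable: in tests you can pass stubs, in production you pass enrich_company, score_lead, and push_to_hubspot.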

The teams that generate the most pipeline from web data aren’t just scraping more — they’re scraping smarter. They target the right sources, extract the right fields, and build systematic processes that improve over time.

Get started with FineData and turn the open web into your most productive lead source.

#lead-generation #sales #b2b #crm #data-enrichment
