
B2B Data Enrichment: Building Quality Lead Lists with Web Scraping

Learn how to enrich B2B lead data using web scraping — from company websites and directories to CRM integration and data quality scoring.

FineData Team

In B2B sales, the quality of your data directly predicts the quality of your outcomes. A list of 10,000 names with no context is nearly useless. A list of 500 companies with verified contacts, technology stacks, company sizes, funding stages, and recent activity signals — that’s a pipeline.

Data enrichment is the process of taking a basic lead record (maybe just a company name or domain) and layering on additional context that helps sales teams prioritize, personalize, and convert. This guide covers how to build a B2B data enrichment pipeline using web scraping.

What Data Enrichment Actually Means

At its core, enrichment transforms incomplete records into actionable intelligence. Starting with a company domain, you can enrich with:

Company-level data:

  • Company name, description, industry vertical
  • Headquarters location, office locations
  • Employee count, revenue range
  • Founding year, funding history
  • Technology stack (what tools they use)

Contact-level data:

  • Decision-maker names and titles
  • Professional email addresses
  • Direct phone numbers
  • LinkedIn profile URLs
  • Role and department

Signal data:

  • Recent news and press mentions
  • Job postings (growth indicator)
  • Technology changes (adoption signals)
  • Review activity (satisfaction indicator)
  • Social media presence and activity

Data Sources for B2B Enrichment

Company Websites

The most authoritative source. A company’s own website contains:

  • About / Team page — Leadership names, titles, headshots
  • Contact page — Email addresses, phone numbers, office addresses
  • Careers page — Open roles (growth signal), technology stack (from job descriptions)
  • Blog / News — Recent priorities, product launches, partnerships
  • Footer — Social media links, legal entity information
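
Contact pages are the usual source of email addresses. A minimal sketch of pulling them out of raw HTML; the `extract_emails` helper and the generic-prefix list are illustrative, not part of any FineData API:

```python
import re

# Simple pattern for email addresses embedded in HTML or mailto: links.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Shared inboxes that don't identify a specific person.
GENERIC_PREFIXES = {"info", "support", "hello", "contact", "sales", "noreply"}

def extract_emails(html, domain):
    """Return deduplicated emails on the company's own domain,
    separating personal addresses from generic inboxes."""
    found = {m.lower() for m in EMAIL_RE.findall(html)}
    on_domain = [e for e in found if e.endswith("@" + domain)]
    personal = [e for e in on_domain if e.split("@")[0] not in GENERIC_PREFIXES]
    generic = [e for e in on_domain if e.split("@")[0] in GENERIC_PREFIXES]
    return {"personal": sorted(personal), "generic": sorted(generic)}
```

Filtering to the company's own domain also discards third-party addresses (vendors, widget providers) that often appear in page source.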

Business Directories

Aggregated databases with structured company data:

  • Crunchbase — Funding, investors, leadership, company details
  • LinkedIn Company Pages — Employee count, industry, headquarters
  • BuiltWith / Wappalyzer — Technology stack detection
  • Glassdoor — Employee count, culture insights, salary data
  • D&B / ZoomInfo — Revenue, industry codes, corporate hierarchy

Public Records

  • SEC EDGAR — Financial filings for public companies
  • State business registrations — Legal entity information
  • Patent databases — Innovation and R&D signals
  • Court records — Legal activity (relevant for compliance-focused sales)

Social and Community

  • GitHub — Developer activity, open-source involvement
  • Twitter / X — Company activity, engagement levels
  • Industry forums — Mentions, questions, recommendations

Building an Enrichment Pipeline

Step 1: Start with What You Have

Most enrichment starts with a seed list. This might be:

  • A list of domains from a conference attendee list
  • Company names from an industry directory
  • Domains extracted from email addresses of existing leads

# Example seed data
seed_companies = [
    {"domain": "acme-corp.com"},
    {"domain": "techstart.io"},
    {"domain": "bigretail.com"},
]
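
The third seed source — domains pulled from existing lead emails — can be sketched as follows; the helper name and free-provider list are illustrative:

```python
# Free-mail providers that don't identify a company.
FREE_PROVIDERS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com", "icloud.com"}

def domains_from_emails(emails):
    """Build a deduplicated seed list from lead email addresses,
    skipping free-mail domains."""
    domains = []
    seen = set()
    for email in emails:
        if "@" not in email:
            continue
        domain = email.split("@")[-1].lower().strip()
        if domain in FREE_PROVIDERS or domain in seen:
            continue
        seen.add(domain)
        domains.append({"domain": domain})
    return domains
```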

Step 2: Scrape Company Websites

Visit each company’s website to extract foundational data:

import requests
from bs4 import BeautifulSoup
import re

FINEDATA_API = "https://api.finedata.ai/api/v1/scrape"
API_KEY = "fd_your_api_key"

def enrich_from_website(domain):
    """Extract company information from their website."""
    enriched = {"domain": domain}

    # Scrape the homepage
    homepage = scrape_page(f"https://{domain}")
    if homepage:
        enriched.update(extract_homepage_data(homepage))

    # Scrape the about page
    about = scrape_page(f"https://{domain}/about")
    if about:
        enriched.update(extract_about_data(about))

    # Scrape the team page
    for team_path in ["/team", "/about/team", "/about-us", "/our-team", "/people"]:
        team = scrape_page(f"https://{domain}{team_path}")
        if team and "team" in team.lower():
            enriched["team_members"] = extract_team_members(team)
            break

    # Scrape careers for growth signals
    for careers_path in ["/careers", "/jobs", "/join-us", "/work-with-us"]:
        careers = scrape_page(f"https://{domain}{careers_path}")
        if careers and ("job" in careers.lower() or "career" in careers.lower()):
            enriched["open_positions"] = count_job_listings(careers)
            enriched["hiring_departments"] = extract_departments(careers)
            break

    return enriched


def scrape_page(url):
    """Scrape a single page and return HTML content."""
    try:
        response = requests.post(
            FINEDATA_API,
            headers={
                "x-api-key": API_KEY,
                "Content-Type": "application/json"
            },
            json={
                "url": url,
                "use_js_render": True,
                "tls_profile": "chrome124",
                "timeout": 20
            },
            timeout=30  # fail fast if the HTTP request itself hangs
        )
        if response.status_code == 200:
            return response.json()["body"]
    except Exception:
        pass
    return None


def extract_homepage_data(html):
    """Extract key data from a company homepage."""
    soup = BeautifulSoup(html, "html.parser")

    data = {}

    # Extract company description from meta tags
    meta_desc = soup.find("meta", attrs={"name": "description"})
    if meta_desc:
        data["description"] = meta_desc.get("content", "")

    # Extract social links
    social_patterns = {
        "linkedin": r"linkedin\.com/company/[\w-]+",
        "twitter": r"twitter\.com/[\w]+",
        "github": r"github\.com/[\w-]+",
    }

    page_text = str(soup)
    for platform, pattern in social_patterns.items():
        match = re.search(pattern, page_text)
        if match:
            data[f"{platform}_url"] = f"https://{match.group()}"

    return data


def extract_team_members(html):
    """Extract team member names and titles from a team page."""
    soup = BeautifulSoup(html, "html.parser")
    members = []

    # Common patterns for team member cards
    for card in soup.select(
        "[class*='team-member'], [class*='person'], [class*='staff'], "
        "[class*='leadership'], [class*='executive']"
    ):
        name_el = card.select_one("h2, h3, h4, [class*='name']")
        title_el = card.select_one("p, span, [class*='title'], [class*='role'], [class*='position']")

        if name_el:
            member = {"name": name_el.get_text(strip=True)}
            if title_el:
                member["title"] = title_el.get_text(strip=True)
            members.append(member)

    return members

Step 3: Technology Stack Detection

Understanding what technologies a company uses helps with relevance scoring and personalization:

def detect_technology_stack(html, domain):
    """Detect technologies used by analyzing the HTML source."""
    technologies = []

    tech_signatures = {
        "React": ["react", "_reactRootContainer", "__NEXT_DATA__"],
        "Vue.js": ["vue", "__vue__", "vue-router"],
        "Angular": ["ng-app", "ng-version", "angular"],
        "WordPress": ["wp-content", "wp-includes"],
        "Shopify": ["cdn.shopify.com", "Shopify.theme"],
        "HubSpot": ["hs-scripts.com", "hubspot"],
        "Google Analytics": ["google-analytics.com", "gtag"],
        "Intercom": ["intercom", "intercomSettings"],
        "Drift": ["drift.com", "driftt"],
        "Segment": ["cdn.segment.com", "analytics.js"],
        "Stripe": ["js.stripe.com", "stripe"],
        "Salesforce": ["force.com", "salesforce"],
    }

    html_lower = html.lower()
    for tech, signatures in tech_signatures.items():
        if any(sig.lower() in html_lower for sig in signatures):
            technologies.append(tech)

    return technologies

Step 4: Data Quality Scoring

Not all enriched records are equally valuable. Score each record based on completeness and freshness:

def calculate_quality_score(company):
    """Score enriched data quality from 0-100."""
    score = 0
    max_score = 100

    # Core fields (50 points)
    core_fields = {
        "description": 10,
        "employee_count": 10,
        "industry": 10,
        "location": 10,
        "founded_year": 5,
        "revenue_range": 5,
    }

    for field, points in core_fields.items():
        if company.get(field):
            score += points

    # Contact fields (30 points)
    if company.get("team_members") and len(company["team_members"]) > 0:
        score += 15
    if company.get("emails") and len(company["emails"]) > 0:
        score += 10
    if company.get("phone"):
        score += 5

    # Signal fields (20 points)
    if company.get("technologies"):
        score += 5
    if company.get("open_positions") is not None:
        score += 5
    if company.get("linkedin_url"):
        score += 5
    if company.get("recent_news"):
        score += 5

    return min(score, max_score)

Deduplication

As you enrich data from multiple sources, duplicates are inevitable. The same company might appear as “Acme Corp”, “Acme Corporation”, “ACME Corp.”, and “acme-corp.com.”

Matching Strategies

  1. Domain matching — The most reliable. Normalize domains (strip www., trailing slashes) and match exactly.
  2. Name matching — Use fuzzy matching with a threshold (>0.85 similarity). Normalize by removing common suffixes (Inc, LLC, Ltd).
  3. Composite matching — Combine domain + name + location for the highest confidence matches.

from difflib import SequenceMatcher

def normalize_domain(domain):
    """Normalize a domain for comparison."""
    domain = domain.lower().strip()
    domain = domain.replace("https://", "").replace("http://", "")
    domain = domain.replace("www.", "")
    return domain.rstrip("/")


def normalize_company_name(name):
    """Normalize company name for matching."""
    name = name.lower().strip()
    suffixes = [" inc", " inc.", " llc", " ltd", " ltd.", " corp", " corp.",
                " co.", " company", " gmbh", " ag", " sa"]
    for suffix in suffixes:
        if name.endswith(suffix):
            name = name[:-len(suffix)].strip()
    return name


def is_duplicate(company_a, company_b, threshold=0.85):
    """Determine if two companies are duplicates."""
    # Domain match — high confidence
    domain_a = normalize_domain(company_a.get("domain", ""))
    domain_b = normalize_domain(company_b.get("domain", ""))
    if domain_a and domain_b and domain_a == domain_b:
        return True

    # Name match — medium confidence
    name_a = normalize_company_name(company_a.get("name", ""))
    name_b = normalize_company_name(company_b.get("name", ""))
    if name_a and name_b:
        similarity = SequenceMatcher(None, name_a, name_b).ratio()
        if similarity >= threshold:
            return True

    return False
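
The pairwise check extends naturally to list-level deduplication. A minimal sketch keyed on normalized domains, keeping the higher-scoring record when two collide (the normalization helper is repeated so the snippet stands alone; the score field name is an assumption):

```python
def normalize_domain(domain):
    """Same normalization as above: strip scheme, www., trailing slash."""
    domain = domain.lower().strip()
    for prefix in ("https://", "http://"):
        domain = domain.removeprefix(prefix)  # Python 3.9+
    return domain.removeprefix("www.").rstrip("/")

def dedupe_by_domain(companies, score_field="enrichment_score"):
    """Collapse records sharing a domain, keeping the highest-scoring one."""
    best = {}
    for company in companies:
        key = normalize_domain(company.get("domain", ""))
        if not key:
            continue
        current = best.get(key)
        if current is None or company.get(score_field, 0) > current.get(score_field, 0):
            best[key] = company
    return list(best.values())
```

For name-only records (no domain), fall back to the fuzzy `is_duplicate` check above, which is slower but catches the "Acme Corp" / "Acme Corporation" case.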

CRM Integration

Enriched data needs to flow into your CRM to be actionable. The key is mapping your enriched fields to CRM fields and handling updates gracefully:

from datetime import datetime, timezone

def sync_to_crm(enriched_companies, crm_client):
    """Sync enriched data to CRM, creating or updating records."""
    results = {"created": 0, "updated": 0, "skipped": 0}

    for company in enriched_companies:
        quality = calculate_quality_score(company)

        # Skip low-quality records
        if quality < 30:
            results["skipped"] += 1
            continue

        # Check if company already exists in CRM
        existing = crm_client.find_company(domain=company["domain"])

        if existing:
            # Update with new enrichment data
            crm_client.update_company(existing["id"], {
                "enrichment_score": quality,
                "technologies": ", ".join(company.get("technologies", [])),
                "open_positions": company.get("open_positions", 0),
                "last_enriched": datetime.utcnow().isoformat()
            })
            results["updated"] += 1
        else:
            # Create new company record
            crm_client.create_company({
                "name": company.get("name", company["domain"]),
                "domain": company["domain"],
                "description": company.get("description", ""),
                "industry": company.get("industry", ""),
                "employee_count": company.get("employee_count"),
                "enrichment_score": quality,
                "lead_source": "web_enrichment",
                "technologies": ", ".join(company.get("technologies", [])),
            })
            results["created"] += 1

    return results
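
sync_to_crm assumes a crm_client exposing find_company, update_company, and create_company; no specific CRM SDK is implied. A minimal in-memory stand-in, useful for testing the sync logic before wiring up a real CRM, might look like:

```python
import itertools

class InMemoryCRM:
    """Hypothetical stand-in for the crm_client interface assumed above."""

    def __init__(self):
        self._companies = {}              # id -> record
        self._ids = itertools.count(1)    # auto-incrementing record IDs

    def find_company(self, domain):
        """Return the first record matching the domain, or None."""
        for record in self._companies.values():
            if record.get("domain") == domain:
                return record
        return None

    def update_company(self, company_id, fields):
        """Merge new fields into an existing record."""
        self._companies[company_id].update(fields)

    def create_company(self, fields):
        """Create a record and return its new ID."""
        company_id = next(self._ids)
        self._companies[company_id] = {"id": company_id, **fields}
        return company_id
```

Swapping this for a real client (HubSpot, Salesforce, etc.) only requires implementing the same three methods against that CRM's API.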

Compliance and Ethics

B2B data enrichment involves collecting publicly available business information, but there are still important guardrails:

  • Only collect public data. Don’t scrape behind login walls or access private databases.
  • Respect opt-outs. If someone asks to be removed from your lists, remove them immediately.
  • GDPR applies to business contacts. In the EU, even B2B email addresses are personal data. Ensure you have a legitimate interest basis and provide opt-out mechanisms.
  • CCPA considerations. California residents have rights over their personal information regardless of B2B context.
  • Data minimization. Only collect what you actually need for your sales process.
  • Secure storage. Treat enriched lead data with appropriate security controls.

Maintaining Data Quality Over Time

Enrichment isn’t a one-time activity. Company data decays — people change jobs, companies move offices, phone numbers change. Plan for ongoing maintenance:

  • Re-enrich on a schedule. Run your enrichment pipeline on existing records every 3-6 months.
  • Monitor bounce rates. If email deliverability drops, it’s time to re-enrich those records.
  • Track enrichment freshness. Add a “last_enriched” timestamp to every record and flag stale data.
  • Score degradation. Automatically lower quality scores over time if records aren’t refreshed.
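
The last point can be as simple as exponential decay on the stored score. A sketch, assuming records carry the last_enriched ISO timestamp set during CRM sync (the 180-day half-life is an arbitrary choice, not a recommendation):

```python
from datetime import datetime, timezone

def decayed_score(base_score, last_enriched_iso, half_life_days=180):
    """Halve a record's quality score every `half_life_days` since
    it was last enriched — simple exponential decay."""
    last = datetime.fromisoformat(last_enriched_iso)
    if last.tzinfo is None:
        last = last.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
    age_days = (datetime.now(timezone.utc) - last).days
    return round(base_score * 0.5 ** (max(age_days, 0) / half_life_days))
```

Records whose decayed score drops below your sync threshold become natural candidates for the next re-enrichment run.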

Getting Started

Building a B2B enrichment pipeline doesn’t require a massive investment. Start simple:

  1. Export your existing lead list with whatever data you have
  2. Identify the gaps — what information would help your sales team most?
  3. Start with company websites — the richest and most accessible source
  4. Use FineData’s API to handle the scraping infrastructure
  5. Build quality scoring so your sales team focuses on the best leads
  6. Iterate — add more data sources and refine your enrichment logic over time

The difference between a mediocre sales pipeline and a high-converting one often comes down to data quality. Enrichment bridges the gap between a name on a list and a qualified, contextualized lead that a sales rep can actually work with.

Start enriching your B2B data with FineData today.

#b2b #data-enrichment #leads #sales #quality
