B2B Data Enrichment: Building Quality Lead Lists with Web Scraping
Learn how to enrich B2B lead data using web scraping — from company websites and directories to CRM integration and data quality scoring.
In B2B sales, the quality of your data directly predicts the quality of your outcomes. A list of 10,000 names with no context is nearly useless. A list of 500 companies with verified contacts, technology stacks, company sizes, funding stages, and recent activity signals — that’s a pipeline.
Data enrichment is the process of taking a basic lead record (maybe just a company name or domain) and layering on additional context that helps sales teams prioritize, personalize, and convert. This guide covers how to build a B2B data enrichment pipeline using web scraping.
What Data Enrichment Actually Means
At its core, enrichment transforms incomplete records into actionable intelligence. Starting with a company domain, you can enrich with:
Company-level data:
- Company name, description, industry vertical
- Headquarters location, office locations
- Employee count, revenue range
- Founding year, funding history
- Technology stack (what tools they use)
Contact-level data:
- Decision-maker names and titles
- Professional email addresses
- Direct phone numbers
- LinkedIn profile URLs
- Role and department
Signal data:
- Recent news and press mentions
- Job postings (growth indicator)
- Technology changes (adoption signals)
- Review activity (satisfaction indicator)
- Social media presence and activity
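To make this concrete, here is a sketch of a single record before and after enrichment. The field names and values are illustrative, not a fixed schema:

```python
# A bare seed record: all we know is the domain.
seed = {"domain": "acme-corp.com"}

# The same record after enrichment (illustrative values).
enriched = {
    "domain": "acme-corp.com",
    "name": "Acme Corp",
    "industry": "Manufacturing",
    "employee_count": 250,
    "location": "Chicago, IL",
    "technologies": ["HubSpot", "Salesforce"],
    "open_positions": 12,
    "team_members": [
        {"name": "Jane Doe", "title": "VP of Operations"},
    ],
}
```

Every added field gives a rep another lever for prioritization (employee count, open positions) or personalization (technologies, named contacts).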
Data Sources for B2B Enrichment
Company Websites
The most authoritative source. A company’s own website contains:
- About / Team page — Leadership names, titles, headshots
- Contact page — Email addresses, phone numbers, office addresses
- Careers page — Open roles (growth signal), technology stack (from job descriptions)
- Blog / News — Recent priorities, product launches, partnerships
- Footer — Social media links, legal entity information
Business Directories
Aggregated databases with structured company data:
- Crunchbase — Funding, investors, leadership, company details
- LinkedIn Company Pages — Employee count, industry, headquarters
- BuiltWith / Wappalyzer — Technology stack detection
- Glassdoor — Employee count, culture insights, salary data
- D&B / ZoomInfo — Revenue, industry codes, corporate hierarchy
Public Records
- SEC EDGAR — Financial filings for public companies
- State business registrations — Legal entity information
- Patent databases — Innovation and R&D signals
- Court records — Legal activity (relevant for compliance-focused sales)
Social and Community
- GitHub — Developer activity, open-source involvement
- Twitter / X — Company activity, engagement levels
- Industry forums — Mentions, questions, recommendations
Building an Enrichment Pipeline
Step 1: Start with What You Have
Most enrichment starts with a seed list. This might be:
- A list of domains from a conference attendee list
- Company names from an industry directory
- Domains extracted from email addresses of existing leads
# Example seed data
seed_companies = [
    {"domain": "acme-corp.com"},
    {"domain": "techstart.io"},
    {"domain": "bigretail.com"},
]
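For the third option above, pulling seed domains out of existing lead emails is a few lines of string handling. This sketch filters out a few common free-mail providers; the `FREE_MAIL` set is an assumption you would extend for your own data:

```python
# Free-mail providers to skip: their domains are not company domains.
FREE_MAIL = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com"}

def domains_from_emails(emails):
    """Extract unique company domains from a list of email addresses."""
    domains = set()
    for email in emails:
        if "@" not in email:
            continue
        domain = email.split("@", 1)[1].lower().strip()
        if domain and domain not in FREE_MAIL:
            domains.add(domain)
    return [{"domain": d} for d in sorted(domains)]

seeds = domains_from_emails(["jane@acme-corp.com", "bob@gmail.com"])
print(seeds)  # [{'domain': 'acme-corp.com'}]
```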
Step 2: Scrape Company Websites
Visit each company’s website to extract foundational data:
import requests
from bs4 import BeautifulSoup
import re

FINEDATA_API = "https://api.finedata.ai/api/v1/scrape"
API_KEY = "fd_your_api_key"

def enrich_from_website(domain):
    """Extract company information from their website."""
    enriched = {"domain": domain}

    # Scrape the homepage
    homepage = scrape_page(f"https://{domain}")
    if homepage:
        enriched.update(extract_homepage_data(homepage))

    # Scrape the about page
    about = scrape_page(f"https://{domain}/about")
    if about:
        enriched.update(extract_about_data(about))

    # Scrape the team page
    for team_path in ["/team", "/about/team", "/about-us", "/our-team", "/people"]:
        team = scrape_page(f"https://{domain}{team_path}")
        if team and "team" in team.lower():
            enriched["team_members"] = extract_team_members(team)
            break

    # Scrape careers for growth signals
    for careers_path in ["/careers", "/jobs", "/join-us", "/work-with-us"]:
        careers = scrape_page(f"https://{domain}{careers_path}")
        if careers and ("job" in careers.lower() or "career" in careers.lower()):
            enriched["open_positions"] = count_job_listings(careers)
            enriched["hiring_departments"] = extract_departments(careers)
            break

    return enriched

def scrape_page(url):
    """Scrape a single page and return HTML content."""
    try:
        response = requests.post(
            FINEDATA_API,
            headers={
                "x-api-key": API_KEY,
                "Content-Type": "application/json"
            },
            json={
                "url": url,
                "use_js_render": True,
                "tls_profile": "chrome124",
                "timeout": 20
            }
        )
        if response.status_code == 200:
            return response.json()["body"]
    except Exception:
        pass
    return None
def extract_homepage_data(html):
    """Extract key data from a company homepage."""
    soup = BeautifulSoup(html, "html.parser")
    data = {}

    # Extract company description from meta tags
    meta_desc = soup.find("meta", attrs={"name": "description"})
    if meta_desc:
        data["description"] = meta_desc.get("content", "")

    # Extract social links
    social_patterns = {
        "linkedin": r"linkedin\.com/company/[\w-]+",
        "twitter": r"twitter\.com/[\w]+",
        "github": r"github\.com/[\w-]+",
    }
    page_text = str(soup)
    for platform, pattern in social_patterns.items():
        match = re.search(pattern, page_text)
        if match:
            data[f"{platform}_url"] = f"https://{match.group()}"

    return data

def extract_team_members(html):
    """Extract team member names and titles from a team page."""
    soup = BeautifulSoup(html, "html.parser")
    members = []

    # Common patterns for team member cards
    for card in soup.select(
        "[class*='team-member'], [class*='person'], [class*='staff'], "
        "[class*='leadership'], [class*='executive']"
    ):
        name_el = card.select_one("h2, h3, h4, [class*='name']")
        title_el = card.select_one("p, span, [class*='title'], [class*='role'], [class*='position']")
        if name_el:
            member = {"name": name_el.get_text(strip=True)}
            if title_el:
                member["title"] = title_el.get_text(strip=True)
            members.append(member)

    return members
Step 3: Technology Stack Detection
Understanding what technologies a company uses helps with relevance scoring and personalization:
def detect_technology_stack(html, domain):
    """Detect technologies used by analyzing the HTML source."""
    technologies = []

    tech_signatures = {
        "React": ["react", "_reactRootContainer", "__NEXT_DATA__"],
        "Vue.js": ["vue", "__vue__", "vue-router"],
        "Angular": ["ng-app", "ng-version", "angular"],
        "WordPress": ["wp-content", "wp-includes"],
        "Shopify": ["cdn.shopify.com", "Shopify.theme"],
        "HubSpot": ["hs-scripts.com", "hubspot"],
        "Google Analytics": ["google-analytics.com", "gtag"],
        "Intercom": ["intercom", "intercomSettings"],
        "Drift": ["drift.com", "driftt"],
        "Segment": ["cdn.segment.com", "analytics.js"],
        "Stripe": ["js.stripe.com", "stripe"],
        "Salesforce": ["force.com", "salesforce"],
    }

    html_lower = html.lower()
    for tech, signatures in tech_signatures.items():
        if any(sig.lower() in html_lower for sig in signatures):
            technologies.append(tech)

    return technologies
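To see the signature-matching approach in action, here is a condensed, self-contained version run against a saved HTML snippet (the signature table is abbreviated; the full table above covers more tools):

```python
def detect_stack(html):
    """Condensed signature matcher: lowercase substring hits → tech names."""
    tech_signatures = {
        "WordPress": ["wp-content", "wp-includes"],
        "Shopify": ["cdn.shopify.com", "shopify.theme"],
        "HubSpot": ["hs-scripts.com", "hubspot"],
    }
    html_lower = html.lower()
    return [tech for tech, sigs in tech_signatures.items()
            if any(sig in html_lower for sig in sigs)]

sample_html = '<link href="/wp-content/themes/acme/style.css" rel="stylesheet">'
print(detect_stack(sample_html))  # ['WordPress']
```

Substring matching is deliberately crude: it trades false positives (e.g. the word "react" in marketing copy) for zero external dependencies. For higher accuracy, cross-check against a dedicated detector like BuiltWith or Wappalyzer.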
Step 4: Data Quality Scoring
Not all enriched records are equally valuable. Score each record based on completeness and freshness:
def calculate_quality_score(company):
    """Score enriched data quality from 0-100."""
    score = 0
    max_score = 100

    # Core fields (50 points)
    core_fields = {
        "description": 10,
        "employee_count": 10,
        "industry": 10,
        "location": 10,
        "founded_year": 5,
        "revenue_range": 5,
    }
    for field, points in core_fields.items():
        if company.get(field):
            score += points

    # Contact fields (30 points)
    if company.get("team_members") and len(company["team_members"]) > 0:
        score += 15
    if company.get("emails") and len(company["emails"]) > 0:
        score += 10
    if company.get("phone"):
        score += 5

    # Signal fields (20 points)
    if company.get("technologies"):
        score += 5
    if company.get("open_positions") is not None:
        score += 5
    if company.get("linkedin_url"):
        score += 5
    if company.get("recent_news"):
        score += 5

    return min(score, max_score)
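Once each record carries a score, routing can be as simple as bucketing into tiers. The thresholds below are illustrative starting points, not recommendations:

```python
def tier_for_score(score):
    """Map a 0-100 enrichment quality score to a sales tier."""
    if score >= 70:
        return "A"   # complete record: route to reps immediately
    if score >= 40:
        return "B"   # partially enriched: queue for another enrichment pass
    return "C"       # too sparse to work: hold or discard

scored = [("acme-corp.com", 85), ("techstart.io", 55), ("bigretail.com", 20)]
tiers = {domain: tier_for_score(s) for domain, s in scored}
print(tiers)  # {'acme-corp.com': 'A', 'techstart.io': 'B', 'bigretail.com': 'C'}
```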
Deduplication
As you enrich data from multiple sources, duplicates are inevitable. The same company might appear as “Acme Corp”, “Acme Corporation”, “ACME Corp.”, and “acme-corp.com”.
Matching Strategies
- Domain matching — The most reliable. Normalize domains (strip www., trailing slashes) and match exactly.
- Name matching — Use fuzzy matching with a threshold (>0.85 similarity). Normalize by removing common suffixes (Inc, LLC, Ltd).
- Composite matching — Combine domain + name + location for the highest confidence matches.
from difflib import SequenceMatcher

def normalize_domain(domain):
    """Normalize a domain for comparison."""
    domain = domain.lower().strip()
    domain = domain.replace("https://", "").replace("http://", "")
    domain = domain.replace("www.", "")
    return domain.rstrip("/")

def normalize_company_name(name):
    """Normalize company name for matching."""
    name = name.lower().strip()
    suffixes = [" inc", " inc.", " llc", " ltd", " ltd.", " corp", " corp.",
                " co.", " company", " gmbh", " ag", " sa"]
    for suffix in suffixes:
        if name.endswith(suffix):
            name = name[:-len(suffix)].strip()
    return name

def is_duplicate(company_a, company_b, threshold=0.85):
    """Determine if two companies are duplicates."""
    # Domain match — high confidence
    domain_a = normalize_domain(company_a.get("domain", ""))
    domain_b = normalize_domain(company_b.get("domain", ""))
    if domain_a and domain_b and domain_a == domain_b:
        return True

    # Name match — medium confidence
    name_a = normalize_company_name(company_a.get("name", ""))
    name_b = normalize_company_name(company_b.get("name", ""))
    if name_a and name_b:
        similarity = SequenceMatcher(None, name_a, name_b).ratio()
        if similarity >= threshold:
            return True

    return False
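With a pairwise check in hand, collapsing a full list means keeping one record per group. This self-contained sketch uses domain-only matching and keeps the first record seen; in practice you might keep the record with the highest quality score instead:

```python
def dedupe_by_domain(companies):
    """Keep the first record seen for each normalized domain."""
    seen = set()
    unique = []
    for company in companies:
        # Inline domain normalization: lowercase, strip scheme, www, slash.
        domain = company.get("domain", "").lower().strip()
        domain = domain.replace("https://", "").replace("http://", "")
        domain = domain.replace("www.", "").rstrip("/")
        if domain and domain in seen:
            continue
        if domain:
            seen.add(domain)
        unique.append(company)
    return unique

records = [
    {"domain": "acme-corp.com", "name": "Acme Corp"},
    {"domain": "www.acme-corp.com", "name": "ACME Corp."},
    {"domain": "techstart.io", "name": "TechStart"},
]
print(len(dedupe_by_domain(records)))  # 2
```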
CRM Integration
Enriched data needs to flow into your CRM to be actionable. The key is mapping your enriched fields to CRM fields and handling updates gracefully:
from datetime import datetime

def sync_to_crm(enriched_companies, crm_client):
    """Sync enriched data to CRM, creating or updating records."""
    results = {"created": 0, "updated": 0, "skipped": 0}

    for company in enriched_companies:
        quality = calculate_quality_score(company)

        # Skip low-quality records
        if quality < 30:
            results["skipped"] += 1
            continue

        # Check if company already exists in CRM
        existing = crm_client.find_company(domain=company["domain"])

        if existing:
            # Update with new enrichment data
            crm_client.update_company(existing["id"], {
                "enrichment_score": quality,
                "technologies": ", ".join(company.get("technologies", [])),
                "open_positions": company.get("open_positions", 0),
                "last_enriched": datetime.utcnow().isoformat()
            })
            results["updated"] += 1
        else:
            # Create new company record
            crm_client.create_company({
                "name": company.get("name", company["domain"]),
                "domain": company["domain"],
                "description": company.get("description", ""),
                "industry": company.get("industry", ""),
                "employee_count": company.get("employee_count"),
                "enrichment_score": quality,
                "lead_source": "web_enrichment",
                "technologies": ", ".join(company.get("technologies", [])),
            })
            results["created"] += 1

    return results
Compliance and Ethics
B2B data enrichment involves collecting publicly available business information, but there are still important guardrails:
- Only collect public data. Don’t scrape behind login walls or access private databases.
- Respect opt-outs. If someone asks to be removed from your lists, remove them immediately.
- GDPR applies to business contacts. In the EU, even B2B email addresses are personal data. Ensure you have a legitimate interest basis and provide opt-out mechanisms.
- CCPA considerations. California residents have rights over their personal information regardless of B2B context.
- Data minimization. Only collect what you actually need for your sales process.
- Secure storage. Treat enriched lead data with appropriate security controls.
Maintaining Data Quality Over Time
Enrichment isn’t a one-time activity. Company data decays — people change jobs, companies move offices, phone numbers change. Plan for ongoing maintenance:
- Re-enrich on a schedule. Run your enrichment pipeline on existing records every 3 to 6 months.
- Monitor bounce rates. If email deliverability drops, it’s time to re-enrich those records.
- Track enrichment freshness. Add a “last_enriched” timestamp to every record and flag stale data.
- Score degradation. Automatically lower quality scores over time if records aren’t refreshed.
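The last two points combine naturally: store a last_enriched timestamp with each record and apply a decay when reading the score. The linear decay and 90-day half-life below are illustrative assumptions, not a standard:

```python
from datetime import datetime, timedelta, timezone

def decayed_score(base_score, last_enriched, half_life_days=90):
    """Reduce a quality score as the record ages past its last enrichment."""
    age_days = (datetime.now(timezone.utc) - last_enriched).days
    if age_days <= 0:
        return base_score
    # Linear decay: lose half the score over half_life_days, floor at zero.
    decay = min(age_days / (half_life_days * 2), 1.0)
    return round(base_score * (1 - decay))

now = datetime.now(timezone.utc)
print(decayed_score(80, now))                           # 80 (fresh)
print(decayed_score(80, now - timedelta(days=90)))      # 40 (half decayed)
print(decayed_score(80, now - timedelta(days=180)))     # 0  (fully stale)
```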
Getting Started
Building a B2B enrichment pipeline doesn’t require a massive investment. Start simple:
- Export your existing lead list with whatever data you have
- Identify the gaps — what information would help your sales team most?
- Start with company websites — the richest and most accessible source
- Use FineData’s API to handle the scraping infrastructure
- Build quality scoring so your sales team focuses on the best leads
- Iterate — add more data sources and refine your enrichment logic over time
The difference between a mediocre sales pipeline and a high-converting one often comes down to data quality. Enrichment bridges the gap between a name on a list and a qualified, contextualized lead that a sales rep can actually work with.
Start enriching your B2B data with FineData today.