228 lines
10 KiB
Markdown
228 lines
10 KiB
Markdown
# Data enrichment
|
|
|
|
Enriching a company, domain, or authorized contact list with public data. Use these workflows only for permitted business research or user-authorized contact enrichment, and respect site terms, robots/access controls, privacy law, opt-out obligations, and rate limits.
|
|
|
|
## SERP first. Web-scraping is the last resort.
|
|
|
|
`google-serp` is your primary enrichment tool. Reasons:
|
|
|
|
- **Google has already extracted the structured fields you want.** `.knowledge_graph` carries HQ, founder, founded year, parent company, employees, industry. `.organic_results[]` titles and snippets carry person → role → employer mappings (LinkedIn titles are literally `Name — Role at Company`). `.local_results[]` carry phone/address/hours.
|
|
- **It avoids unnecessary direct access.** Many target sites gate, rate-limit, or restrict scraping. Google's snippets can answer public, high-level questions without rendering the target page.
|
|
- **It's a broad public index.** Quoted queries (`--q '"info@example.com"'`, `--q '"+1 555 123 4567"'`, `--q '"Acme Corp"'`) can find publicly indexed business contact or company references.
|
|
|
|
Use `google-serp` (or `google-news` for recency, `google-maps` for places, `google-shopping` for products) **first**. Only fall through to `web-scraping` when:
|
|
- A specific field you need isn't in any SERP snippet, AND
|
|
- The target page renders that field server-side or via JS that the scraper can handle, AND
|
|
- The user explicitly needs it and has authority to access it (don't fan out to N web-scraping calls when SERP would have answered N - 0 of them).
|
|
|
|
The patterns below show the full chain so you understand when to escalate. Most rows in a real CSV stop after step 1 or 2.
|
|
|
|
---
|
|
|
|
## Person enrichment
|
|
|
|
### Step 1 — SERP for role, employer, LinkedIn URL
|
|
|
|
```bash
|
|
hasdata google-serp --q '"Jane Doe" linkedin' --num 5 --raw \
|
|
| jq -c '.organic_results[] | select(.link | contains("linkedin.com/in/")) |
|
|
{title, snippet, link}'
|
|
```
|
|
|
|
The result usually looks like:
|
|
|
|
```json
|
|
{
|
|
"title": "Jane Doe — Senior Engineer at Acme Corp | LinkedIn",
|
|
"snippet": "San Francisco, CA · 500+ connections · Engineering @ Acme. Previously...",
|
|
"link": "https://www.linkedin.com/in/janedoe"
|
|
}
|
|
```
|
|
|
|
You now have role, employer, location, LinkedIn URL, and a connection-count hint — without scraping anything. **Stop here unless a specific extra field is required.**
|
|
|
|
### Step 2 — Refine with targeted SERP queries
|
|
|
|
If step 1 didn't carry what you need, ask Google more specifically:
|
|
|
|
```bash
|
|
# Disambiguate by company
|
|
hasdata google-serp --q '"Jane Doe" "Acme Corp"' --num 10 --raw \
|
|
| jq -c '.organic_results[] | {title, snippet, link}'
|
|
|
|
# Other social profiles
|
|
hasdata google-serp --q '"Jane Doe" site:twitter.com OR site:x.com' --num 3 --raw
|
|
hasdata google-serp --q '"Jane Doe" site:github.com' --num 3 --raw
|
|
|
|
# Past employers / bio paragraphs
|
|
hasdata google-serp --q '"Jane Doe" bio OR background OR experience' --num 5 --raw \
|
|
| jq -r '.organic_results[].snippet'
|
|
```
|
|
|
|
### Step 3 — Web-scraping (only if SERP came up short)
|
|
|
|
When SERP snippets truncated the field you need, or the user explicitly wants full profile content, first confirm the profile is public or the user has authorization to access it:
|
|
|
|
```bash
|
|
hasdata web-scraping --url "https://www.linkedin.com/in/janedoe" \
|
|
--output-format markdown --no-screenshot --no-block-resources \
|
|
--raw | jq -r .markdown
|
|
```
|
|
|
|
Or for structured fields, AI extraction:
|
|
|
|
```bash
|
|
hasdata web-scraping --url "https://www.linkedin.com/in/janedoe" \
|
|
--ai-extract-rules-json '{
|
|
"headline": {"type": "string"},
|
|
"location": {"type": "string"},
|
|
"company": {"type": "string"},
|
|
"role": {"type": "string"},
|
|
"followers": {"type": "number"},
|
|
"experience": {"type": "list", "output": {
|
|
"company": {"type": "string"},
|
|
"role": {"type": "string"},
|
|
"duration": {"type": "string"}
|
|
}}
|
|
}' --raw | jq .
|
|
```
|
|
|
|
LinkedIn sometimes blocks the public preview; if it does, fall back to step 2 (combining SERP snippets) — it's almost always enough.
|
|
|
|
### Email lookup
|
|
|
|
Triangulate, don't promise. Use this only for business contact discovery, user-authorized enrichment, or another legitimate purpose. SERP first, scraping last; never present a guessed personal email as verified.
|
|
|
|
```bash
|
|
# 1. Has Google already indexed the email anywhere?
|
|
hasdata google-serp --q '"jane.doe@acme.com"' --num 10 --raw \
|
|
| jq -c '.organic_results[] | {title, snippet, link}'
|
|
|
|
# 2. What email format does the company use? Look for any indexed @company.com address.
|
|
hasdata google-serp --q 'site:acme.com "@acme.com"' --num 10 --raw \
|
|
| jq -r '.organic_results[].snippet' \
|
|
| grep -oE '[A-Za-z0-9._-]+@acme\.com' | sort -u
|
|
|
|
# 3. Pattern-guess + SERP-verify
|
|
for guess in "jane.doe" "jdoe" "jane" "j.doe" "janed"; do
|
|
count=$(hasdata google-serp --q "\"$guess@acme.com\"" --num 1 --raw \
|
|
| jq -r '.organic_results | length')
|
|
[ "$count" -gt 0 ] && echo "$guess@acme.com (appears in SERP)"
|
|
done
|
|
|
|
# 4. Last resort — scrape the company's public contact / about / team pages for emails
|
|
hasdata web-scraping --url "https://acme.com/about" --extract-emails --raw \
|
|
| jq -r '.emails // [] | .[]'
|
|
```
|
|
|
|
Always tell the user when an email is a pattern-guess vs. confirmed via SERP/scrape, and avoid collecting personal contact data when the user lacks authorization.
|
|
|
|
---
|
|
|
|
## Company enrichment
|
|
|
|
### Step 1 — SERP knowledge_graph
|
|
|
|
```bash
|
|
hasdata google-serp --q "Acme Corp" --num 5 --raw | jq '.knowledge_graph // {}'
|
|
```
|
|
|
|
`.knowledge_graph` typically contains: founder, founded (year), headquarters, parent_organization, ceo, employees (range), revenue, stock_price, industry, products. **For the majority of company enrichment requests, this single call is the entire answer.**
|
|
|
|
### Step 2 — Targeted SERP for specific fields
|
|
|
|
```bash
|
|
# Headquarters
|
|
hasdata google-serp --q '"Acme Corp" headquarters' --num 5 --raw \
|
|
| jq -r '.organic_results[].snippet'
|
|
|
|
# Funding / acquisition signals
|
|
hasdata google-serp --q '"Acme Corp" raises OR acquires OR acquired OR ipo OR funding' --num 10 --raw \
|
|
| jq -c '.organic_results[] | {title, snippet, link}'
|
|
|
|
# Recent news
|
|
hasdata google-news --q "Acme Corp" --gl us --raw \
|
|
| jq -c '.news_results[] | {title, source: .source.name, date, link}'
|
|
|
|
# LinkedIn company page
|
|
hasdata google-serp --q '"Acme Corp" site:linkedin.com/company' --num 3 --raw \
|
|
| jq -c '.organic_results[] | {title, snippet, link}'
|
|
|
|
# Employee profiles in a specific function/region
|
|
hasdata google-serp \
|
|
--q 'site:linkedin.com/in "Acme Corp" engineer' --gl us --num 25 --raw \
|
|
| jq -r '.organic_results[] | "\(.title)\t\(.link)"'
|
|
```
|
|
|
|
### Step 3 — Web-scraping (only when SERP can't fill a specific field)
|
|
|
|
```bash
|
|
# Company About page → AI-extract structured fields
|
|
hasdata web-scraping --url "https://acme.com/about" \
|
|
--ai-extract-rules-json '{
|
|
"name": {"type": "string"},
|
|
"founded": {"type": "number"},
|
|
"headquarters": {"type": "string"},
|
|
"employees": {"type": "string"},
|
|
"industry": {"type": "string"},
|
|
"description": {"type": "string"},
|
|
"products": {"type": "list"}
|
|
}' --raw | jq .
|
|
```
|
|
|
|
Reach for this only when the user wants something SERP can't provide (e.g. mission statement verbatim, full product taxonomy, leadership team page parsed into rows).
|
|
|
|
---
|
|
|
|
## CSV row enrichment
|
|
|
|
For a list of N rows, fan out one or two SERP calls per row. Keep web-scraping out of the loop unless a specific row needs it.
|
|
|
|
```bash
|
|
# Input: people.csv with one column "name"
|
|
while IFS=, read -r name; do
|
|
result=$(hasdata google-serp --q "\"$name\" linkedin" --num 1 --raw)
|
|
linkedin=$(echo "$result" | jq -r '.organic_results[0].link // ""')
|
|
title=$(echo "$result" | jq -r '.organic_results[0].title // ""')
|
|
snippet=$(echo "$result" | jq -r '.organic_results[0].snippet // ""')
|
|
printf '%s\t%s\t%s\t%s\n' "$name" "$title" "$snippet" "$linkedin"
|
|
done < people.csv > enriched.tsv
|
|
```
|
|
|
|
That's it. One SERP call per row, role/employer/LinkedIn extracted from the title and snippet. Add a second SERP call only if a row's first result didn't match (`select(.title | test("\(name)"; "i"))` filtering for confidence).
|
|
|
|
---
|
|
|
|
## Reverse-lookup
|
|
|
|
Always SERP-first with the literal value quoted. Use reverse lookup only for user-authorized investigation, business contact verification, or another legitimate purpose; do not use it for doxxing, stalking, harassment, or collecting private personal data.
|
|
|
|
```bash
|
|
# Business email → public identity signal
|
|
hasdata google-serp --q '"jane@example.com"' --num 10 --raw \
|
|
| jq -c '.organic_results[] | {title, snippet, link}'
|
|
|
|
# Business phone → owner / business
|
|
hasdata google-serp --q '"+1 555 123 4567"' --num 10 --raw
|
|
# Combine with yelp-search / yellowpages-search if it's a business number.
|
|
|
|
# Domain → company
|
|
hasdata google-serp --q "site:example.com" --num 5 --raw \
|
|
| jq '.organic_results[0].title'
|
|
hasdata google-serp --q "Acme Corp" --num 5 --raw | jq '.knowledge_graph // {}'
|
|
```
|
|
|
|
Only scrape the domain (`web-scraping --url "https://example.com"`) when you specifically need the homepage's body text.
|
|
|
|
---
|
|
|
|
## Tips for reliable enrichment
|
|
|
|
- **Always quote names and other multi-token strings** — `"Jane Doe"` matches the exact person; `Jane Doe` matches noise.
|
|
- **Use `site:` aggressively** — `site:linkedin.com/in/`, `site:linkedin.com/company/`, `site:github.com`, `site:crunchbase.com`. Google's `site:` is the cheapest way to scope an enrichment search.
|
|
- **Read the `.knowledge_graph`** before doing anything else for a company. If it's populated, you're often done.
|
|
- **AI-extract over CSS selectors** when you do need to scrape — LinkedIn / Crunchbase / About-page markup changes constantly; AI extraction with field names + descriptions survives layout churn.
|
|
- **Cross-source verify** — never enrich from a single source. If LinkedIn's title says "Acme Corp" and a `--q '"Jane Doe" "Acme Corp"'` SERP corroborates with multiple results, confidence is high.
|
|
- **Mark guesses** — pattern-guessed emails, inferred locations, single-source roles should be flagged to the user as unverified.
|
|
- **Respect privacy and authorization** — do not collect or infer personal contact details without a legitimate purpose and user authority.
|