182 lines
8.5 KiB
Markdown
182 lines
8.5 KiB
Markdown
# web-scraping reference
|
||
|
||
Subcommand: `web-scraping` (10 credits/call). Single endpoint for arbitrary URL scraping with JS rendering, proxies, AI extraction, screenshots, markdown conversion.
|
||
|
||
> **`web-scraping` is the last resort, not the default.** Before reaching for it, ask: would `google-serp` (or `google-news` / `google-shopping` / `google-maps`) already have the field I need? Google's `.knowledge_graph`, `.organic_results[].snippet`, `.local_results[]` carry pre-extracted public facts without direct page access. Only invoke `web-scraping` when:
|
||
>
|
||
> - the user gave you a specific URL to read, OR
|
||
> - SERP came up short for a specific field, OR
|
||
> - the target page renders content that doesn't show up in any SERP snippet.
|
||
>
|
||
> Use it only for public pages or content the user is authorized to access, and respect site terms, robots/access controls, privacy law, and rate limits.
|
||
>
|
||
> See `references/enrichment.md` for the SERP-first patterns.
|
||
|
||
```bash
|
||
hasdata web-scraping --url "URL" [flags] --raw | jq .
|
||
```
|
||
|
||
## Required
|
||
|
||
- `--url URL` — target page
|
||
|
||
## Output formats
|
||
|
||
- `--output-format html|text|markdown|json` (repeatable)
|
||
- One format → that format directly (e.g. `--output-format markdown` returns markdown text under `.markdown`)
|
||
- Multiple → JSON response with one key per format
|
||
- `--output-format json` combined with others → wraps everything in JSON
|
||
|
||
```bash
|
||
# LLM-friendly markdown for prompt context
|
||
hasdata web-scraping --url "$URL" --output-format markdown --raw | jq -r .markdown
|
||
```
|
||
|
||
## Proxy & rendering
|
||
|
||
- `--proxy-type datacenter|residential` (default datacenter)
|
||
- `--proxy-country US|UK|DE|IE|FR|IT|SE|BR|CA|JP|SG|IN|ID` (default US)
|
||
- `--js-rendering` / `--no-js-rendering` (default on) — full headless browser
|
||
- `--block-ads` / `--no-block-ads` (default on)
|
||
- `--block-resources` / `--no-block-resources` (default on) — blocks images/CSS for speed
|
||
- `--screenshot` / `--no-screenshot` (default on; the result includes a screenshot URL)
|
||
- `--remove-base64-images` — strip inline base64 images from response
|
||
- `--extract-emails` / `--no-extract-emails` (default on)
|
||
- `--extract-links` (default off)
|
||
|
||
## Wait controls
|
||
|
||
- `--wait MS` — fixed wait after page load
|
||
- `--wait-for "CSS_SELECTOR"` — wait until selector appears
|
||
|
||
## Custom JS scenario (complex array — JSON only)
|
||
|
||
```bash
|
||
hasdata web-scraping --url "$URL" \
|
||
--js-scenario-json '[
|
||
{"wait": 2000},
|
||
{"click": ".load-more"},
|
||
{"waitFor": ".item"},
|
||
{"scrollY": 1000},
|
||
{"fill": ["input#q", "espresso"]}
|
||
]' --raw
|
||
```
|
||
|
||
Supported actions: `evaluate`, `click`, `wait`, `waitFor`, `waitForAndClick`, `scrollX`, `scrollY`, `fill`. Executed sequentially.
|
||
|
||
Accepts raw JSON, `@file.json`, or `-` (stdin).
|
||
|
||
## Headers (kvSlice + JSON escape)
|
||
|
||
```bash
|
||
# Repeatable kv form (splits on first `=`)
|
||
hasdata web-scraping --url "$URL" \
|
||
--headers "User-Agent=hasdata-cli" \
|
||
--headers "Accept-Language=en-US,en;q=0.9" \
|
||
--headers "Cookie=session=abc=def" \
|
||
--raw
|
||
|
||
# JSON base + kv overrides
|
||
hasdata web-scraping --url "$URL" \
|
||
--headers-json '{"User-Agent":"base","X-Common":"shared"}' \
|
||
--headers "User-Agent=override" \
|
||
--raw
|
||
```
|
||
|
||
## CSS-selector data extraction (kvSlice or JSON)
|
||
|
||
```bash
|
||
# Lightweight kv form: --extract-rules KEY=SELECTOR
|
||
hasdata web-scraping --url "https://quotes.toscrape.com" \
|
||
--extract-rules "quote=.quote .text" \
|
||
--extract-rules "author=.quote .author" \
|
||
--raw | jq .
|
||
|
||
# JSON form for complex selectors / attributes
|
||
hasdata web-scraping --url "$URL" \
|
||
--extract-rules-json '{"title":"h1","links":"a @href","price":".price-now"}' \
|
||
--raw
|
||
```
|
||
|
||
`@href`, `@src`, etc. extract attributes. Without `@`, extracts text content.
|
||
|
||
## AI extraction (LLM-driven)
|
||
|
||
```bash
|
||
hasdata web-scraping --url "$URL" \
|
||
--ai-extract-rules-json '{
|
||
"headline": {"type": "string", "description": "the main story headline"},
|
||
"comments_count": {"type": "number"},
|
||
"is_paid_content": {"type": "boolean"},
|
||
"tags": {"type": "list", "description": "topic tags"},
|
||
"author": {"type": "item", "output": {
|
||
"name": {"type": "string"},
|
||
"verified": {"type": "boolean"}
|
||
}}
|
||
}' --raw | jq .
|
||
```
|
||
|
||
Supported types: `string`, `number`, `boolean`, `list`, `item` (nested object — defines its shape under `output`).
|
||
|
||
## Tag filtering
|
||
|
||
- `--include-only-tags "main,article"` (comma-joined CSS selectors) — keep only matching elements
|
||
- `--exclude-tags script --exclude-tags style` (repeatable) — remove elements
|
||
|
||
```bash
|
||
hasdata web-scraping --url "$URL" \
|
||
--output-format markdown \
|
||
--include-only-tags "article,main" \
|
||
--exclude-tags script --exclude-tags style --exclude-tags nav \
|
||
--raw | jq -r .markdown
|
||
```
|
||
|
||
## URL blocklist
|
||
|
||
```bash
|
||
--block-urls-json '["**.googletagmanager.com/**","**.doubleclick.net/**"]'
|
||
```
|
||
|
||
Glob patterns block specific subresource URLs from loading.
|
||
|
||
## Saving binary output
|
||
|
||
The `web-scraping` response is JSON, but if `--output-format` is set to a single non-JSON format, the wrapped result is still JSON. Use `jq -r .markdown > file.md` to extract text. For screenshots specifically, the response contains a screenshot URL — fetch it separately with `curl`.
|
||
|
||
## Non-obvious use cases
|
||
|
||
- **Page-to-prompt grounding** — `--output-format markdown` produces clean LLM-ready text from any URL. Strip nav/ads with `--exclude-tags script --exclude-tags style --exclude-tags nav`. Beats fetch + regex.
|
||
- **JavaScript-rendered SPAs that `curl` can't read** — default `--js-rendering` uses a real browser, so React/Vue/Angular pages return their hydrated DOM, not the empty shell.
|
||
- **Geo/availability testing where allowed** — `--proxy-type residential` can model residential network availability; use only for authorized tests where the target's terms and access controls permit it.
|
||
- **Geo-targeted content** — `--proxy-country DE` to see what users in Germany see (different prices, currencies, A/B variants, or geo-blocked content).
|
||
- **Quick "is this page real" check** — `--screenshot` (default on) returns a screenshot URL in the response; verify visually without manually opening the URL.
|
||
- **Universal price extractor** — `--ai-extract-rules-json '{"price":{"type":"number"},"currency":{"type":"string"},"in_stock":{"type":"boolean"}}'` works on arbitrary retailer pages without writing a CSS selector. Cheaper than maintaining per-site selectors when the user only needs occasional spot-checks.
|
||
- **Authenticated content with user authority** — `--headers Cookie=session=...` injects auth cookies if the user has them. Use only with explicit permission and authority to access that account/content; never use cookies to bypass someone else's access controls.
|
||
- **Convert paginated lists to a clean record set** — combine `--js-scenario-json` (click "Load more" 5×) with `--ai-extract-rules-json` (pull the list shape). Lets you scrape paginated SPAs with one CLI call instead of N.
|
||
- **Headless screenshot of a layout** — set `--js-rendering`, `--no-block-resources` (so CSS loads), and capture the screenshot URL from the response. Useful for "render this URL and show me what it looks like".
|
||
- **Markdown for RAG ingestion** — pipe `.markdown` from many URLs into a JSONL corpus; embed and store. The CLI handles JS, ads, images so you don't need a custom pipeline.
|
||
- **Fallback for any other API** — when no purpose-built API exists for a vertical (e.g. niche directories, government pages, less-popular real-estate sites), `web-scraping` is the catch-all.
|
||
- **Detect content changes** — schedule `web-scraping --url X --output-format markdown` and diff the output across runs to flag pricing-page or terms-of-service changes.
|
||
- **Read PDFs / non-HTML resources** — `--output-format text` works on text-extractable PDFs accessible via URL (the underlying renderer handles them).
|
||
- **AI extraction for forms or tables** — pages with structured data in HTML tables are easy: `--ai-extract-rules-json '{"rows":{"type":"list","output":{"name":{"type":"string"},"value":{"type":"number"}}}}'`. The model fills in nested rows.
|
||
|
||
## Common patterns
|
||
|
||
```bash
|
||
# Full-page markdown for RAG
|
||
hasdata web-scraping --url "$URL" \
|
||
--output-format markdown --no-screenshot --no-block-resources \
|
||
--raw | jq -r .markdown >> corpus.md
|
||
|
||
# JS-heavy SPA: wait + scroll
|
||
hasdata web-scraping --url "$URL" \
|
||
--js-scenario-json '[{"wait":2000},{"scrollY":2000},{"wait":1500}]' \
|
||
--wait-for ".item" \
|
||
--output-format html --raw | jq -r .html
|
||
|
||
# Extract structured data from an arbitrary page
|
||
hasdata web-scraping --url "$URL" \
|
||
--ai-extract-rules-json '{"price":{"type":"number"},"in_stock":{"type":"boolean"}}' \
|
||
--raw | jq .
|
||
```
|