8.5 KiB
web-scraping reference
Subcommand: web-scraping (10 credits/call). Single endpoint for arbitrary URL scraping with JS rendering, proxies, AI extraction, screenshots, markdown conversion.
web-scrapingis the last resort, not the default. Before reaching for it, ask: wouldgoogle-serp(orgoogle-news/google-shopping/google-maps) already have the field I need? Google's.knowledge_graph,.organic_results[].snippet,.local_results[]carry pre-extracted public facts without direct page access. Only invokeweb-scrapingwhen:
- the user gave you a specific URL to read, OR
- SERP came up short for a specific field, OR
- the target page renders content that doesn't show up in any SERP snippet.
Use it only for public pages or content the user is authorized to access, and respect site terms, robots/access controls, privacy law, and rate limits.
See
references/enrichment.mdfor the SERP-first patterns.
hasdata web-scraping --url "URL" [flags] --raw | jq .
Required
--url URL— target page
Output formats
--output-format html|text|markdown|json(repeatable)- One format → that format directly (e.g.
--output-format markdownreturns markdown text under.markdown) - Multiple → JSON response with one key per format
--output-format jsoncombined with others → wraps everything in JSON
- One format → that format directly (e.g.
# LLM-friendly markdown for prompt context
hasdata web-scraping --url "$URL" --output-format markdown --raw | jq -r .markdown
Proxy & rendering
--proxy-type datacenter|residential(default datacenter)--proxy-country US|UK|DE|IE|FR|IT|SE|BR|CA|JP|SG|IN|ID(default US)--js-rendering/--no-js-rendering(default on) — full headless browser--block-ads/--no-block-ads(default on)--block-resources/--no-block-resources(default on) — blocks images/CSS for speed--screenshot/--no-screenshot(default on; the result includes a screenshot URL)--remove-base64-images— strip inline base64 images from response--extract-emails/--no-extract-emails(default on)--extract-links(default off)
Wait controls
--wait MS— fixed wait after page load--wait-for "CSS_SELECTOR"— wait until selector appears
Custom JS scenario (complex array — JSON only)
hasdata web-scraping --url "$URL" \
--js-scenario-json '[
{"wait": 2000},
{"click": ".load-more"},
{"waitFor": ".item"},
{"scrollY": 1000},
{"fill": ["input#q", "espresso"]}
]' --raw
Supported actions: evaluate, click, wait, waitFor, waitForAndClick, scrollX, scrollY, fill. Executed sequentially.
Accepts raw JSON, @file.json, or - (stdin).
Headers (kvSlice + JSON escape)
# Repeatable kv form (splits on first `=`)
hasdata web-scraping --url "$URL" \
--headers "User-Agent=hasdata-cli" \
--headers "Accept-Language=en-US,en;q=0.9" \
--headers "Cookie=session=abc=def" \
--raw
# JSON base + kv overrides
hasdata web-scraping --url "$URL" \
--headers-json '{"User-Agent":"base","X-Common":"shared"}' \
--headers "User-Agent=override" \
--raw
CSS-selector data extraction (kvSlice or JSON)
# Lightweight kv form: --extract-rules KEY=SELECTOR
hasdata web-scraping --url "https://quotes.toscrape.com" \
--extract-rules "quote=.quote .text" \
--extract-rules "author=.quote .author" \
--raw | jq .
# JSON form for complex selectors / attributes
hasdata web-scraping --url "$URL" \
--extract-rules-json '{"title":"h1","links":"a @href","price":".price-now"}' \
--raw
@href, @src, etc. extract attributes. Without @, extracts text content.
AI extraction (LLM-driven)
hasdata web-scraping --url "$URL" \
--ai-extract-rules-json '{
"headline": {"type": "string", "description": "the main story headline"},
"comments_count": {"type": "number"},
"is_paid_content": {"type": "boolean"},
"tags": {"type": "list", "description": "topic tags"},
"author": {"type": "item", "output": {
"name": {"type": "string"},
"verified": {"type": "boolean"}
}}
}' --raw | jq .
Supported types: string, number, boolean, list, item (nested object — defines its shape under output).
Tag filtering
--include-only-tags "main,article"(comma-joined CSS selectors) — keep only matching elements--exclude-tags script --exclude-tags style(repeatable) — remove elements
hasdata web-scraping --url "$URL" \
--output-format markdown \
--include-only-tags "article,main" \
--exclude-tags script --exclude-tags style --exclude-tags nav \
--raw | jq -r .markdown
URL blocklist
--block-urls-json '["**.googletagmanager.com/**","**.doubleclick.net/**"]'
Glob patterns block specific subresource URLs from loading.
Saving binary output
The web-scraping response is JSON, but if --output-format is set to a single non-JSON format, the wrapped result is still JSON. Use jq -r .markdown > file.md to extract text. For screenshots specifically, the response contains a screenshot URL — fetch it separately with curl.
Non-obvious use cases
- Page-to-prompt grounding —
--output-format markdownproduces clean LLM-ready text from any URL. Strip nav/ads with--exclude-tags script --exclude-tags style --exclude-tags nav. Beats fetch + regex. - JavaScript-rendered SPAs that
curlcan't read — default--js-renderinguses a real browser, so React/Vue/Angular pages return their hydrated DOM, not the empty shell. - Geo/availability testing where allowed —
--proxy-type residentialcan model residential network availability; use only for authorized tests where the target's terms and access controls permit it. - Geo-targeted content —
--proxy-country DEto see what users in Germany see (different prices, currencies, A/B variants, or geo-blocked content). - Quick "is this page real" check —
--screenshot(default on) returns a screenshot URL in the response; verify visually without manually opening the URL. - Universal price extractor —
--ai-extract-rules-json '{"price":{"type":"number"},"currency":{"type":"string"},"in_stock":{"type":"boolean"}}'works on arbitrary retailer pages without writing a CSS selector. Cheaper than maintaining per-site selectors when the user only needs occasional spot-checks. - Authenticated content with user authority —
--headers Cookie=session=...injects auth cookies if the user has them. Use only with explicit permission and authority to access that account/content; never use cookies to bypass someone else's access controls. - Convert paginated lists to a clean record set — combine
--js-scenario-json(click "Load more" 5×) with--ai-extract-rules-json(pull the list shape). Lets you scrape paginated SPAs with one CLI call instead of N. - Headless screenshot of a layout — set
--js-rendering,--no-block-resources(so CSS loads), and capture the screenshot URL from the response. Useful for "render this URL and show me what it looks like". - Markdown for RAG ingestion — pipe
.markdownfrom many URLs into a JSONL corpus; embed and store. The CLI handles JS, ads, images so you don't need a custom pipeline. - Fallback for any other API — when no purpose-built API exists for a vertical (e.g. niche directories, government pages, less-popular real-estate sites),
web-scrapingis the catch-all. - Detect content changes — schedule
web-scraping --url X --output-format markdownand diff the output across runs to flag pricing-page or terms-of-service changes. - Read PDFs / non-HTML resources —
--output-format textworks on text-extractable PDFs accessible via URL (the underlying renderer handles them). - AI extraction for forms or tables — pages with structured data in HTML tables are easy:
--ai-extract-rules-json '{"rows":{"type":"list","output":{"name":{"type":"string"},"value":{"type":"number"}}}}'. The model fills in nested rows.
Common patterns
# Full-page markdown for RAG
hasdata web-scraping --url "$URL" \
--output-format markdown --no-screenshot --no-block-resources \
--raw | jq -r .markdown >> corpus.md
# JS-heavy SPA: wait + scroll
hasdata web-scraping --url "$URL" \
--js-scenario-json '[{"wait":2000},{"scrollY":2000},{"wait":1500}]' \
--wait-for ".item" \
--output-format html --raw | jq -r .html
# Extract structured data from an arbitrary page
hasdata web-scraping --url "$URL" \
--ai-extract-rules-json '{"price":{"type":"number"},"in_stock":{"type":"boolean"}}' \
--raw | jq .