# web-scraping reference Subcommand: `web-scraping` (10 credits/call). Single endpoint for arbitrary URL scraping with JS rendering, proxies, AI extraction, screenshots, markdown conversion. > **`web-scraping` is the last resort, not the default.** Before reaching for it, ask: would `google-serp` (or `google-news` / `google-shopping` / `google-maps`) already have the field I need? Google's `.knowledge_graph`, `.organic_results[].snippet`, `.local_results[]` carry pre-extracted public facts without direct page access. Only invoke `web-scraping` when: > > - the user gave you a specific URL to read, OR > - SERP came up short for a specific field, OR > - the target page renders content that doesn't show up in any SERP snippet. > > Use it only for public pages or content the user is authorized to access, and respect site terms, robots/access controls, privacy law, and rate limits. > > See `references/enrichment.md` for the SERP-first patterns. ```bash hasdata web-scraping --url "URL" [flags] --raw | jq . ``` ## Required - `--url URL` — target page ## Output formats - `--output-format html|text|markdown|json` (repeatable) - One format → that format directly (e.g. `--output-format markdown` returns markdown text under `.markdown`) - Multiple → JSON response with one key per format - `--output-format json` combined with others → wraps everything in JSON ```bash # LLM-friendly markdown for prompt context hasdata web-scraping --url "$URL" --output-format markdown --raw | jq -r .markdown ``` ## Proxy & rendering - `--proxy-type datacenter|residential` (default datacenter) - `--proxy-country US|UK|DE|IE|FR|IT|SE|BR|CA|JP|SG|IN|ID` (default US) - `--js-rendering` / `--no-js-rendering` (default on) — full headless browser - `--block-ads` / `--no-block-ads` (default on) - `--block-resources` / `--no-block-resources` (default on) — blocks images/CSS for speed - `--screenshot` / `--no-screenshot` (default on; the result includes a screenshot URL) - `--remove-base64-images` — strip inline base64 images from response - `--extract-emails` / `--no-extract-emails` (default on) - `--extract-links` (default off) ## Wait controls - `--wait MS` — fixed wait after page load - `--wait-for "CSS_SELECTOR"` — wait until selector appears ## Custom JS scenario (complex array — JSON only) ```bash hasdata web-scraping --url "$URL" \ --js-scenario-json '[ {"wait": 2000}, {"click": ".load-more"}, {"waitFor": ".item"}, {"scrollY": 1000}, {"fill": ["input#q", "espresso"]} ]' --raw ``` Supported actions: `evaluate`, `click`, `wait`, `waitFor`, `waitForAndClick`, `scrollX`, `scrollY`, `fill`. Executed sequentially. Accepts raw JSON, `@file.json`, or `-` (stdin). ## Headers (kvSlice + JSON escape) ```bash # Repeatable kv form (splits on first `=`) hasdata web-scraping --url "$URL" \ --headers "User-Agent=hasdata-cli" \ --headers "Accept-Language=en-US,en;q=0.9" \ --headers "Cookie=session=abc=def" \ --raw # JSON base + kv overrides hasdata web-scraping --url "$URL" \ --headers-json '{"User-Agent":"base","X-Common":"shared"}' \ --headers "User-Agent=override" \ --raw ``` ## CSS-selector data extraction (kvSlice or JSON) ```bash # Lightweight kv form: --extract-rules KEY=SELECTOR hasdata web-scraping --url "https://quotes.toscrape.com" \ --extract-rules "quote=.quote .text" \ --extract-rules "author=.quote .author" \ --raw | jq . # JSON form for complex selectors / attributes hasdata web-scraping --url "$URL" \ --extract-rules-json '{"title":"h1","links":"a @href","price":".price-now"}' \ --raw ``` `@href`, `@src`, etc. extract attributes. Without `@`, extracts text content. ## AI extraction (LLM-driven) ```bash hasdata web-scraping --url "$URL" \ --ai-extract-rules-json '{ "headline": {"type": "string", "description": "the main story headline"}, "comments_count": {"type": "number"}, "is_paid_content": {"type": "boolean"}, "tags": {"type": "list", "description": "topic tags"}, "author": {"type": "item", "output": { "name": {"type": "string"}, "verified": {"type": "boolean"} }} }' --raw | jq . ``` Supported types: `string`, `number`, `boolean`, `list`, `item` (nested object — defines its shape under `output`). ## Tag filtering - `--include-only-tags "main,article"` (comma-joined CSS selectors) — keep only matching elements - `--exclude-tags script --exclude-tags style` (repeatable) — remove elements ```bash hasdata web-scraping --url "$URL" \ --output-format markdown \ --include-only-tags "article,main" \ --exclude-tags script --exclude-tags style --exclude-tags nav \ --raw | jq -r .markdown ``` ## URL blocklist ```bash --block-urls-json '["**.googletagmanager.com/**","**.doubleclick.net/**"]' ``` Glob patterns block specific subresource URLs from loading. ## Saving binary output The `web-scraping` response is JSON, but if `--output-format` is set to a single non-JSON format, the wrapped result is still JSON. Use `jq -r .markdown > file.md` to extract text. For screenshots specifically, the response contains a screenshot URL — fetch it separately with `curl`. ## Non-obvious use cases - **Page-to-prompt grounding** — `--output-format markdown` produces clean LLM-ready text from any URL. Strip nav/ads with `--exclude-tags script --exclude-tags style --exclude-tags nav`. Beats fetch + regex. - **JavaScript-rendered SPAs that `curl` can't read** — default `--js-rendering` uses a real browser, so React/Vue/Angular pages return their hydrated DOM, not the empty shell. - **Geo/availability testing where allowed** — `--proxy-type residential` can model residential network availability; use only for authorized tests where the target's terms and access controls permit it. - **Geo-targeted content** — `--proxy-country DE` to see what users in Germany see (different prices, currencies, A/B variants, or geo-blocked content). - **Quick "is this page real" check** — `--screenshot` (default on) returns a screenshot URL in the response; verify visually without manually opening the URL. - **Universal price extractor** — `--ai-extract-rules-json '{"price":{"type":"number"},"currency":{"type":"string"},"in_stock":{"type":"boolean"}}'` works on arbitrary retailer pages without writing a CSS selector. Cheaper than maintaining per-site selectors when the user only needs occasional spot-checks. - **Authenticated content with user authority** — `--headers Cookie=session=...` injects auth cookies if the user has them. Use only with explicit permission and authority to access that account/content; never use cookies to bypass someone else's access controls. - **Convert paginated lists to a clean record set** — combine `--js-scenario-json` (click "Load more" 5×) with `--ai-extract-rules-json` (pull the list shape). Lets you scrape paginated SPAs with one CLI call instead of N. - **Headless screenshot of a layout** — set `--js-rendering`, `--no-block-resources` (so CSS loads), and capture the screenshot URL from the response. Useful for "render this URL and show me what it looks like". - **Markdown for RAG ingestion** — pipe `.markdown` from many URLs into a JSONL corpus; embed and store. The CLI handles JS, ads, images so you don't need a custom pipeline. - **Fallback for any other API** — when no purpose-built API exists for a vertical (e.g. niche directories, government pages, less-popular real-estate sites), `web-scraping` is the catch-all. - **Detect content changes** — schedule `web-scraping --url X --output-format markdown` and diff the output across runs to flag pricing-page or terms-of-service changes. - **Read PDFs / non-HTML resources** — `--output-format text` works on text-extractable PDFs accessible via URL (the underlying renderer handles them). - **AI extraction for forms or tables** — pages with structured data in HTML tables are easy: `--ai-extract-rules-json '{"rows":{"type":"list","output":{"name":{"type":"string"},"value":{"type":"number"}}}}'`. The model fills in nested rows. ## Common patterns ```bash # Full-page markdown for RAG hasdata web-scraping --url "$URL" \ --output-format markdown --no-screenshot --no-block-resources \ --raw | jq -r .markdown >> corpus.md # JS-heavy SPA: wait + scroll hasdata web-scraping --url "$URL" \ --js-scenario-json '[{"wait":2000},{"scrollY":2000},{"wait":1500}]' \ --wait-for ".item" \ --output-format html --raw | jq -r .html # Extract structured data from an arbitrary page hasdata web-scraping --url "$URL" \ --ai-extract-rules-json '{"price":{"type":"number"},"in_stock":{"type":"boolean"}}' \ --raw | jq . ```