--- name: hasdata description: Use HasData APIs for web scraping and structured web data extraction. risk: safe source: official source_type: official source_repo: HasData/hasdata-cli license: MIT license_source: "https://github.com/HasData/hasdata-cli/blob/main/LICENSE" date_added: "2026-06-04" --- # HasData Cloud platform for extracting public web data. One API key, three execution modes. All endpoints sit under `https://api.hasdata.com` and authenticate with `x-api-key`. ```bash curl -G 'https://api.hasdata.com/scrape/google/serp' \ --data-urlencode 'q=coffee' \ -H 'x-api-key: ' ``` `401` invalid key, `403` quota exhausted, `429` concurrency cap, `500` server error (retry). ## When to Use Use this skill when: - The user needs web scraping. - The user needs search engine results. - The user needs structured data extraction. - The user needs ecommerce, travel, jobs, or local business data. - The user explicitly asks about HasData. ## Three execution modes | Mode | Latency | When | Endpoint | |---|---|---|---| | **Web Scraping API** | seconds | Arbitrary URL — JS rendering, CSS/AI extraction, screenshots | `POST /scrape/web` | | **Scraper APIs** (sync) | seconds | Pre-parsed JSON for known platforms (Google, Amazon, Zillow, …) | `GET /scrape//` | | **Scraper Jobs** (async) | minutes–hours | Bulk extraction, recursive crawling, webhook fan-out | `POST /scrapers//jobs` | **Decision rule.** Default to a **Scraper API** when one exists for the platform (pre-parsed JSON, no selector maintenance). Use **Web Scraping** for arbitrary URLs not covered by an API. Reach for a **Scraper Job** only when no API equivalent exists — `crawler`, `contacts`, `sec-edgar`, `amazon-bestsellers`, `amazon-product-reviews` — *or* when async fan-out + webhooks save engineering time over a paginated client loop. ## Always-true response shape ```json { "requestMetadata": { "id": "…", "status": "ok", "url": "…" }, "...": "endpoint-specific" } ``` Treat data as valid only if `requestMetadata.status === "ok"`. HTTP 200 alone isn't enough. ## High-leverage patterns - **SERP-first enrichment.** Google SERP can surface public snippets for company and professional-profile lookup. Use it for business or authorized research, avoid unnecessary direct scraping, and treat personal email/phone lookup as allowed only with a legitimate purpose and user authorization. - **AI Mode + verify.** `/scrape/google/ai-mode` for the answer + references → `/scrape/web` (markdown) on each reference URL → cited RAG context, no vector DB. - **Maps → leads.** `/scrape/google-maps/search` returns business websites and phones; collect contact details only from public, permitted sources and apply opt-out, rate, and privacy-law constraints before any outreach use. - **Crawler → corpus.** `crawler` Scraper Job with `outputFormat: ["markdown"]` + `includePaths: "/docs/.+"` produces an LLM-ready corpus in one submission. - **Pre-extracted via SERP rich snippets.** `knowledgeGraph`, `localResults`, `inlineShoppingResults`, `relatedQuestions` carry pre-parsed public facts. Always check them before considering direct page access. ## When to call from code (the wiring) - **Auth:** `x-api-key` header on every request. Read from `HASDATA_API_KEY` env. Never hardcode, never log. - **Timeouts:** **set client timeout ≥ 300 s.** HasData's own deadline is 300 s; shorter clients produce phantom failures while still being billed on completion. - **Retries:** `429` and `5xx` only — exponential backoff, jitter. Never retry `4xx` (auth, validation). - **Concurrency:** cap at your plan limit. The free tier is 1; anything higher just generates `429`s. - **Async jobs:** the submit response handle is `body.id` (integer), **not `jobId`**. Persist it immediately. Poll `GET /scrapers/jobs/` every 10–30 s with backoff; treat webhooks as best-effort and always pair with polling. On `finished` the status carries `data: {csv, json, xlsx}` short-lived URLs — download immediately. See `references/code-recipes.md` for ready-to-paste Python and TypeScript clients with retry, backoff, bounded concurrency, and the full job lifecycle. ## Common gotchas - **300 s server deadline.** Match client timeout. - **Disable `jsRendering` first**, enable only if the page needs it — most static pages parse fine without a headless browser. - **No `cookies` parameter** — cookies go through `headers["Cookie"]`. - **`includePaths` regex is case-sensitive.** `/blog/.+` won't match `/Blog/...`. - **Scraper Job `data` is double-wrapped.** Each row is `body.data[i].data`; outer wraps with `id`, `jobId`, `dataId`, `createdAt`, `updatedAt`. - **`requestMetadata.status === "ok"` is the only success signal.** HTTP 200 alone isn't enough. - **Webhooks are best-effort with 3 retries.** Always have a polling fallback. ## References - [`references/web-scraping.md`](references/web-scraping.md) — `POST /scrape/web` parameters, JS scenarios, AI extraction, cookie auth. - [`references/search.md`](references/search.md) — Google SERP / Light / AI Mode / News / Shopping / Bing / Trends + pagination. - [`references/ecommerce.md`](references/ecommerce.md) — Amazon (product, search, seller, seller-products) and Shopify. - [`references/real-estate.md`](references/real-estate.md) — Zillow, Redfin (bracketed filters). - [`references/travel.md`](references/travel.md) — Airbnb, Booking, Google Flights (occupancy rules, token pagination, IATA codes). - [`references/local-business.md`](references/local-business.md) — Maps (search/place/reviews/photos/posts), Yelp, YellowPages. - [`references/jobs.md`](references/jobs.md) — Indeed and Glassdoor. - [`references/youtube.md`](references/youtube.md) — YouTube search / video / channel / transcript. - [`references/scraper-jobs.md`](references/scraper-jobs.md) — async submit/poll/results, Crawler, Contacts, SEC EDGAR, webhook receiver. - [`references/code-recipes.md`](references/code-recipes.md) — Python / TypeScript clients with retry, backoff, concurrency, polling. ## Resources - Sitemap: - API status codes: - Credits & concurrency: - Dashboard: ## Limitations * Requires access to HasData services and valid credentials. * Data quality and available fields depend on the target website and extraction method used. * JavaScript-heavy websites may require rendering, which can affect performance and cost. * Use only for public data or content the user is authorized to access; respect site terms, robots/access controls, privacy law, and rate limits. * Rate limits, quotas, and account restrictions may apply depending on the endpoint and subscription plan.