# Scraper Jobs — async, bulk Use only when there's no Scraper-API equivalent (`crawler`, `contacts`, `sec-edgar`, `amazon-bestsellers`, `amazon-product-reviews`) or when you want webhook-driven fan-out without managing your own polling loop. Otherwise the matching Scraper API + paginated client loop is simpler. | Slug | Notes | |---|---| | `crawler` | Recursive site crawl. Accepts every Web Scraping API parameter. | | `contacts` | URL list → emails / phones / social profiles. | | `sec-edgar` | Bulk SEC filings by CIK / ticker / company name. | | `google-serp`, `google-maps`, `google-maps-reviews`, `google-trends` | Bulk Google. | | `amazon-search`, `amazon-product`, `amazon-product-reviews`, `amazon-seller-products`, `amazon-bestsellers` | Bulk Amazon. | | `shopify` | Multi-store crawl. | | `zillow`, `redfin`, `airbnb` | Bulk real estate. | | `yelp`, `yellow-pages` | Bulk local. | | `indeed`, `glassdoor` | Bulk jobs. | ## Lifecycle 1. `POST /scrapers//jobs` → returns the full job record. **The handle is `body.id` (numeric integer), not `jobId`** despite older doc snippets — store this. Status starts as `pending`. 2. `GET /scrapers/jobs/` — poll status. 3. `GET /scrapers/jobs//results?page=…&limit=100` — once `status === "finished"`. 4. `DELETE /scrapers/jobs/` — stop early (rows produced before stop are kept). Status values: `pending` → `in_progress` → `finished` (or `stopped` if cancelled). **Shortcut for finished jobs:** the status response on a `finished` job carries a `data` object with direct download URLs: ```json "data": { "csv": "https://f005.backblazeb2.com/file/.../{uuid}.csv", "json": "https://f005.backblazeb2.com/file/.../{uuid}.json", "xlsx": "https://f005.backblazeb2.com/file/.../{uuid}.xlsx" } ``` For one-shot ingestion, fetch `data.json` directly instead of paging `/results`. **These URLs are short-lived** — download immediately on `finished`. ## End-to-end (Python) ```python import os, time, requests API_KEY = os.environ["HASDATA_API_KEY"] H = {"x-api-key": API_KEY, "Content-Type": "application/json"} BASE = "https://api.hasdata.com" def submit(slug, body): r = requests.post(f"{BASE}/scrapers/{slug}/jobs", headers=H, json=body, timeout=60) r.raise_for_status() return r.json()["id"] # numeric job id — not "jobId" def wait(job_id, poll=10, cap=60, timeout=3600): deadline = time.time() + timeout while time.time() < deadline: s = requests.get(f"{BASE}/scrapers/jobs/{job_id}", headers=H, timeout=60).json() if s["status"] in ("finished", "stopped"): return s time.sleep(poll) poll = min(poll * 1.5, cap) raise TimeoutError(job_id) def results(job_id): page = 1 while True: body = requests.get( f"{BASE}/scrapers/jobs/{job_id}/results", headers=H, params={"page": page, "limit": 100}, timeout=120, ).json() for row in body["data"]: yield row["data"] # double-wrapped — see below if body["meta"]["currentPage"] >= body["meta"]["lastPage"]: return page += 1 ``` ### Response shapes Submit (live): ```json { "id": 416349, // ← the job handle, integer "scraperId": 26, "status": "pending", "creditsSpent": 0, "dataRowsCount": 0, "input": { ... }, "createdAt": "...", "updatedAt": "...", "scraper": { "slug": "contacts", ... }, "columns": [ ... ] } ``` Status (live; numeric fields arrive as **strings** when populated): ```json { "id": 416349, "status": "finished", "creditsSpent": "5", // string! "dataRowsCount": "1", // string! "input": { ... }, "data": { "csv": "https://f005.backblazeb2.com/.../{uuid}.csv", "json": "https://f005.backblazeb2.com/.../{uuid}.json", "xlsx": "https://f005.backblazeb2.com/.../{uuid}.xlsx" } } ``` Results page: ```json { "meta": { "total": 1, "perPage": 100, "currentPage": 1, "lastPage": 1, "firstPage": 1, "firstPageUrl": "/?page=1", "lastPageUrl": "/?page=1", "nextPageUrl": null, "previousPageUrl": null }, "data": [ { "id": "...", "jobId": 416349, "dataId": "...", "data": { /* the actual scraped row */ }, "createdAt": "...", "updatedAt": "..." } ] } ``` **Double `data`** — the row is `body["data"][i]["data"]`; the outer wraps with `id`, `jobId`, `dataId`, `createdAt`, `updatedAt`. ## Common body fields - `limit` (int) — max rows. `0` = no cap. - `webhook.url` (string, https), `webhook.events` (any subset of `scraper.job.started`, `scraper.data.scraped`, `scraper.job.finished`), `webhook.headers` (sent on every callback — pin a shared secret here). ## Webhooks ```python # Submit with webhook submit("indeed", { "keywords": ["software engineer", "data scientist"], "locations": ["New York, NY", "Remote"], "limit": 500, "webhook": { "url": "https://your.app/hasdata-hook", "events": ["scraper.data.scraped", "scraper.job.finished"], "headers": {"x-shared-secret": SHARED_SECRET}, }, }) ``` ```python from flask import Flask, request, abort app = Flask(__name__) @app.post("/hasdata-hook") def hook(): if request.headers.get("x-shared-secret") != SHARED_SECRET: abort(401) e = request.json if e["event"] == "scraper.data.scraped": save_row(e["jobId"], e["data"]) elif e["event"] == "scraper.job.finished": finalize(e["jobId"]) return "", 200 # 2xx prevents retry ``` - Async with **3 retries** on non-2xx. **Order not guaranteed** — payload is the source of truth. - **No documented HMAC.** Pin a shared secret via `webhook.headers`, or just fetch results via the API on `scraper.job.finished` and ignore per-row deliveries. - **Always pair webhooks with polling.** A long quiet period probably means missed callbacks. ## Per-scraper bodies ### `crawler` — recursive site crawl Accepts every Web Scraping API parameter applied to **every page**. | Field | Notes | |---|---| | `urls` | **Required.** Seed URLs. | | `maxDepth` | Hops from seed. | | `includePaths` / `excludePaths` | Regex. **Case-sensitive.** | | `limit` | Cap on pages. `0` = unlimited. | ```python job = submit("crawler", { "urls": ["https://docs.example.com"], "maxDepth": 5, "includePaths": "/docs/.+", "outputFormat": ["markdown"], "excludeTags": ["script", "style", "nav", "footer"], "limit": 2000, }) ``` ### `contacts` — URLs → contact info ```python submit("contacts", {"urls": ["https://example.com/about", "https://example.com/team"]}) ``` Verified row schema (one row per input URL): ```json { "url": "https://example.com/about", "emails": ["..."], "phoneNumbers": ["..."], "linkedin": ["..."], "xcom": ["..."], // X / Twitter — note key is "xcom" "facebook": ["..."], "instagram": ["..."], "dribbble": ["..."], "clutch": ["..."] } ``` Empty arrays for missing categories — never null. If you only have a domain, discover URLs first via SERP `site:example.com`. ### `sec-edgar` — bulk SEC filings ```python submit("sec-edgar", { "limit": 100, "ciks": ["AAPL", "789019", "Alphabet Inc."], "filingTypes": "10-K, 10-Q, 8-K", "startDate": "2024-01-01", "endDate": "2025-12-31", }) ``` `ciks` accepts CIKs, tickers, or company names mixed. ### Bulk-API equivalents `google-serp`, `google-maps`, `amazon-search`, `indeed`, `glassdoor`, etc. Jobs accept arrays of inputs (`keywords[]`, `locations[]`, etc.). Use them when you want webhook fan-out; otherwise the synchronous Scraper API + paginated client loop is simpler. ### Crawler vs Contacts vs Web Scraping batch - **crawler** — unknown URL set, recursive discovery. - **contacts** — known URL list, want extracted contact fields. - **`/scrape/batch/web`** — known URL list, want full HTML/markdown/AI extraction at >1k scale. ## Gotchas - **Persist the job `id` immediately** (the integer from the submit response — *not* `jobId`). Only handle to status, results, stop. - **Result file retention is short.** Download right after `finished`. - **Webhooks are best-effort.** Always poll as a backup. - **`includePaths` regex is case-sensitive.** - **Status `stopped` is terminal.** Rows already produced remain available. - **Don't poll faster than every 10 s** — wastes concurrency cap. - **Double-wrapped results** — `body["data"][i]["data"]`, not `body["data"][i]`.