Scraper Jobs — async, bulk

Use only when there's no Scraper-API equivalent (crawler, contacts, sec-edgar, amazon-bestsellers, amazon-product-reviews) or when you want webhook-driven fan-out without managing your own polling loop. Otherwise the matching Scraper API + paginated client loop is simpler.

| Slug | Notes |
| --- | --- |
| crawler | Recursive site crawl. Accepts every Web Scraping API parameter. |
| contacts | URL list → emails / phones / social profiles. |
| sec-edgar | Bulk SEC filings by CIK / ticker / company name. |
| google-serp, google-maps, google-maps-reviews, google-trends | Bulk Google. |
| amazon-search, amazon-product, amazon-product-reviews, amazon-seller-products, amazon-bestsellers | Bulk Amazon. |
| shopify | Multi-store crawl. |
| zillow, redfin, airbnb | Bulk real estate. |
| yelp, yellow-pages | Bulk local. |
| indeed, glassdoor | Bulk jobs. |

Lifecycle

  1. POST /scrapers/<slug>/jobs → returns the full job record. The handle is body.id (numeric integer), not jobId despite older doc snippets — store this. Status starts as pending.
  2. GET /scrapers/jobs/<id> — poll status.
  3. GET /scrapers/jobs/<id>/results?page=…&limit=100 — once status === "finished".
  4. DELETE /scrapers/jobs/<id> — stop early (rows produced before stop are kept).

Status values: pending → in_progress → finished (or stopped if cancelled).

Shortcut for finished jobs: the status response on a finished job carries a data object with direct download URLs:

"data": {
  "csv":  "https://f005.backblazeb2.com/file/.../{uuid}.csv",
  "json": "https://f005.backblazeb2.com/file/.../{uuid}.json",
  "xlsx": "https://f005.backblazeb2.com/file/.../{uuid}.xlsx"
}

For one-shot ingestion, fetch data.json directly instead of paging /results. These URLs are short-lived — download immediately on finished.
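
The shortcut above can be sketched as follows. This is a minimal illustration, not a documented client: export_url and download_export are hypothetical helper names, the sketch assumes the download URLs can be fetched without the x-api-key header, and the layout of the downloaded JSON file is not shown in this reference, so inspect it before relying on it.

```python
import json
import urllib.request

def export_url(status, fmt="json"):
    """Return the direct download URL for `fmt`, or None if it is not available."""
    if status.get("status") != "finished":
        return None
    return (status.get("data") or {}).get(fmt)

def download_export(status, fmt="json", timeout=120):
    """Fetch the export right away, because the URLs are short-lived."""
    url = export_url(status, fmt)
    if url is None:
        raise ValueError("job not finished or export missing; page /results instead")
    # Assumption: the file URL needs no API key header.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
    # Parse JSON exports; return csv/xlsx as raw bytes.
    return json.loads(raw) if fmt == "json" else raw
```

Call download_export with the status record as soon as polling reports finished; if it raises, fall back to paging /results.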

End-to-end (Python)

import os, time, requests

API_KEY = os.environ["HASDATA_API_KEY"]
H = {"x-api-key": API_KEY, "Content-Type": "application/json"}
BASE = "https://api.hasdata.com"

def submit(slug, body):
    r = requests.post(f"{BASE}/scrapers/{slug}/jobs", headers=H, json=body, timeout=60)
    r.raise_for_status()
    return r.json()["id"]                            # numeric job id — not "jobId"

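def wait(job_id, poll=10, cap=60, timeout=3600):
    """Poll until the job is terminal, backing off from `poll` seconds up to `cap`."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        r = requests.get(f"{BASE}/scrapers/jobs/{job_id}", headers=H, timeout=60)
        r.raise_for_status()                         # surface 4xx/5xx instead of a KeyError below
        s = r.json()
        if s["status"] in ("finished", "stopped"):   # both are terminal
            return s
        time.sleep(poll)
        poll = min(poll * 1.5, cap)
    raise TimeoutError(job_id)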

def results(job_id):
    """Yield every scraped row of a finished job, one page of 100 at a time."""
    page = 1
    while True:
        r = requests.get(
            f"{BASE}/scrapers/jobs/{job_id}/results",
            headers=H, params={"page": page, "limit": 100}, timeout=120,
        )
        r.raise_for_status()                        # surface 4xx/5xx before parsing
        body = r.json()
        for row in body["data"]:
            yield row["data"]                       # double-wrapped, see below
        if body["meta"]["currentPage"] >= body["meta"]["lastPage"]:
            return
        page += 1
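
To see how often the polling loop above actually hits the API, the backoff can be previewed in isolation. This sketch only reproduces the arithmetic of wait() (start at 10 s, multiply by 1.5, cap at 60 s); poll_delays is a hypothetical helper name and makes no requests.

```python
from itertools import islice

def poll_delays(start=10.0, factor=1.5, cap=60.0):
    """Yield successive sleep intervals: start, start*factor, ... capped at `cap`."""
    delay = start
    while True:
        yield delay
        delay = min(delay * factor, cap)

first = list(islice(poll_delays(), 6))
print(first)  # [10.0, 15.0, 22.5, 33.75, 50.625, 60.0]
```

After the fifth poll the interval stays at 60 s, so a long-running job settles at one status request per minute.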

Response shapes

Submit (live):

{
  "id": 416349,                        // ← the job handle, integer
  "scraperId": 26,
  "status": "pending",
  "creditsSpent": 0,
  "dataRowsCount": 0,
  "input": { ... },
  "createdAt": "...", "updatedAt": "...",
  "scraper": { "slug": "contacts", ... },
  "columns": [ ... ]
}

Status (live; numeric fields arrive as strings when populated):

{
  "id": 416349,
  "status": "finished",
  "creditsSpent": "5",                 // string!
  "dataRowsCount": "1",                // string!
  "input": { ... },
  "data": {
    "csv":  "https://f005.backblazeb2.com/.../{uuid}.csv",
    "json": "https://f005.backblazeb2.com/.../{uuid}.json",
    "xlsx": "https://f005.backblazeb2.com/.../{uuid}.xlsx"
  }
}
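
Because creditsSpent and dataRowsCount are integers in the submit response but strings in the status response, it can help to coerce them once at the boundary. A minimal sketch, assuming only the two fields shown above need conversion; normalize_status is a hypothetical helper name.

```python
NUMERIC_FIELDS = ("creditsSpent", "dataRowsCount")

def to_int(value, default=0):
    """Convert "5" or 5 to 5; fall back to `default` for None or junk."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

def normalize_status(status):
    """Return a copy of a job record with the numeric fields as ints."""
    out = dict(status)
    for key in NUMERIC_FIELDS:
        if key in out:
            out[key] = to_int(out[key])
    return out

print(normalize_status({"id": 416349, "creditsSpent": "5", "dataRowsCount": "1"}))
# {'id': 416349, 'creditsSpent': 5, 'dataRowsCount': 1}
```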

Results page:

{
  "meta": {
    "total": 1, "perPage": 100,
    "currentPage": 1, "lastPage": 1,
    "firstPage": 1, "firstPageUrl": "/?page=1",
    "lastPageUrl": "/?page=1",
    "nextPageUrl": null, "previousPageUrl": null
  },
  "data": [
    {
      "id": "...", "jobId": 416349, "dataId": "...",
      "data": { /* the actual scraped row */ },
      "createdAt": "...", "updatedAt": "..."
    }
  ]
}

Double data: the scraped row is body["data"][i]["data"]. The outer object is an envelope that adds id, jobId, dataId, createdAt, and updatedAt.
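
The unwrapping can be isolated from the HTTP call so it is easy to test against a saved page. A small sketch based on the results-page shape shown above; unwrap and has_next are hypothetical helper names.

```python
def unwrap(page):
    """Yield the scraped rows from one /results page, dropping the envelope."""
    for item in page.get("data", []):
        yield item["data"]

def has_next(page):
    """True while meta.currentPage is still below meta.lastPage."""
    meta = page["meta"]
    return meta["currentPage"] < meta["lastPage"]

sample = {
    "meta": {"currentPage": 1, "lastPage": 1},
    "data": [{"id": "a", "jobId": 416349, "data": {"url": "https://example.com/about"}}],
}
rows = list(unwrap(sample))
```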

Common body fields

  • limit (int) — max rows. 0 = no cap.
  • webhook.url (string, https), webhook.events (any subset of scraper.job.started, scraper.data.scraped, scraper.job.finished), webhook.headers (sent on every callback — pin a shared secret here).
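
A job body with these common fields can be assembled and checked locally before submitting. This is a sketch under the constraints listed above (https URL, events drawn from the three named values); with_common_fields is a hypothetical helper, and the API may enforce further rules not covered here.

```python
from urllib.parse import urlparse

WEBHOOK_EVENTS = {
    "scraper.job.started",
    "scraper.data.scraped",
    "scraper.job.finished",
}

def with_common_fields(body, limit=0, webhook_url=None, events=(), headers=None):
    """Return a copy of `body` with limit and (optionally) webhook settings added."""
    if limit < 0:
        raise ValueError("limit must be >= 0 (0 means no cap)")
    out = dict(body, limit=limit)
    if webhook_url is not None:
        if urlparse(webhook_url).scheme != "https":
            raise ValueError("webhook.url must be https")
        unknown = set(events) - WEBHOOK_EVENTS
        if unknown:
            raise ValueError(f"unknown webhook events: {sorted(unknown)}")
        out["webhook"] = {
            "url": webhook_url,
            "events": list(events),
            "headers": dict(headers or {}),  # e.g. a shared secret header
        }
    return out
```

The returned dict can be passed as the body argument of submit().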

Webhooks

# Submit with webhook (continues the End-to-end script above: os, submit)
SHARED_SECRET = os.environ["HASDATA_WEBHOOK_SECRET"]   # hypothetical variable name; any secret store works

submit("indeed", {
    "keywords":  ["software engineer", "data scientist"],
    "locations": ["New York, NY", "Remote"],
    "limit":     500,
    "webhook":   {
        "url":     "https://your.app/hasdata-hook",
        "events":  ["scraper.data.scraped", "scraper.job.finished"],
        "headers": {"x-shared-secret": SHARED_SECRET},
    },
})

# Receive callbacks (Flask)
from flask import Flask, request, abort
app = Flask(__name__)

@app.post("/hasdata-hook")
def hook():
    # Reject callbacks that do not carry the pinned shared secret.
    if request.headers.get("x-shared-secret") != SHARED_SECRET:
        abort(401)
    e = request.json
    if e["event"] == "scraper.data.scraped":
        save_row(e["jobId"], e["data"])      # save_row: your own persistence function
    elif e["event"] == "scraper.job.finished":
        finalize(e["jobId"])                 # finalize: your own completion handler
    return "", 200                  # 2xx prevents retry

  • Async with 3 retries on non-2xx. Order is not guaranteed; treat the payload as the source of truth.
  • No documented HMAC. Pin a shared secret via webhook.headers, or just fetch results via the API on scraper.job.finished and ignore per-row deliveries.
  • Always pair webhooks with polling. A long quiet period probably means missed callbacks.
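
One way to pair webhooks with polling is to record when each job last produced a callback and poll only the jobs that have gone quiet. A minimal sketch: CallbackWatchdog is a hypothetical class, and the 300 s quiet threshold is an arbitrary assumption, not a documented value.

```python
import time

class CallbackWatchdog:
    """Track the last callback time per job and report jobs that went quiet."""

    def __init__(self, quiet_after=300.0, clock=time.monotonic):
        self.quiet_after = quiet_after   # seconds without a callback (assumed value)
        self.clock = clock               # injectable so tests can fake time
        self.last_seen = {}

    def touch(self, job_id):
        """Call after submit and again from the webhook handler on every event."""
        self.last_seen[job_id] = self.clock()

    def done(self, job_id):
        """Call on scraper.job.finished, or once polling shows a terminal status."""
        self.last_seen.pop(job_id, None)

    def quiet_jobs(self):
        """Job ids whose last callback is at least `quiet_after` seconds old."""
        now = self.clock()
        return [j for j, t in self.last_seen.items() if now - t >= self.quiet_after]
```

A background task can call quiet_jobs() periodically and run the status poll for each id it returns.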

Per-scraper bodies

crawler — recursive site crawl

Accepts every Web Scraping API parameter; each one is applied to every crawled page.

| Field | Notes |
| --- | --- |
| urls | Required. Seed URLs. |
| maxDepth | Hops from seed. |
| includePaths / excludePaths | Regex. Case-sensitive. |
| limit | Cap on pages. 0 = unlimited. |

job = submit("crawler", {
    "urls":         ["https://docs.example.com"],
    "maxDepth":     5,
    "includePaths": "/docs/.+",
    "outputFormat": ["markdown"],
    "excludeTags":  ["script", "style", "nav", "footer"],
    "limit":        2000,
})
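
Before spending credits on a crawl, an includePaths pattern can be tried locally against a few known URLs. This sketch is only an approximation: it assumes the pattern is searched (not full-matched) against the URL path, which this reference does not confirm, and matches_include is a hypothetical helper. It does illustrate the documented case sensitivity.

```python
import re
from urllib.parse import urlparse

def matches_include(url, pattern):
    """Assumed semantics: case-sensitive regex search over the URL path."""
    return re.search(pattern, urlparse(url).path) is not None

pattern = "/docs/.+"
for url in (
    "https://docs.example.com/docs/intro",   # matches
    "https://docs.example.com/Docs/intro",   # no match: capital D
    "https://docs.example.com/docs/",        # no match: nothing after the slash
):
    print(url, matches_include(url, pattern))
```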

contacts — URLs → contact info

submit("contacts", {"urls": ["https://example.com/about", "https://example.com/team"]})

Verified row schema (one row per input URL):

{
  "url": "https://example.com/about",
  "emails":       ["..."],
  "phoneNumbers": ["..."],
  "linkedin":     ["..."],
  "xcom":         ["..."],          // X / Twitter — note key is "xcom"
  "facebook":     ["..."],
  "instagram":    ["..."],
  "dribbble":     ["..."],
  "clutch":       ["..."]
}

Empty arrays for missing categories — never null. If you only have a domain, discover URLs first via SERP site:example.com.
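
Since there is one row per input URL, rows for several pages of the same site usually need merging. A small sketch using the keys from the schema above; merge_contacts is a hypothetical helper that keeps first-seen order and drops exact duplicates.

```python
CONTACT_KEYS = (
    "emails", "phoneNumbers", "linkedin", "xcom",
    "facebook", "instagram", "dribbble", "clutch",
)

def merge_contacts(rows):
    """Combine contact rows into one dict of de-duplicated lists."""
    merged = {key: [] for key in CONTACT_KEYS}
    for row in rows:
        for key in CONTACT_KEYS:
            for value in row.get(key, []):   # missing categories are empty arrays
                if value not in merged[key]:
                    merged[key].append(value)
    return merged

rows = [
    {"url": "https://example.com/about", "emails": ["hi@example.com"], "xcom": []},
    {"url": "https://example.com/team", "emails": ["hi@example.com", "jobs@example.com"]},
]
combined = merge_contacts(rows)
```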

sec-edgar — bulk SEC filings

submit("sec-edgar", {
    "limit":       100,
    "ciks":        ["AAPL", "789019", "Alphabet Inc."],
    "filingTypes": "10-K, 10-Q, 8-K",
    "startDate":   "2024-01-01",
    "endDate":     "2025-12-31",
})

ciks accepts any mix of CIKs, tickers, and company names in the same array.
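
The body above can be built from typed inputs so that malformed dates fail locally rather than after submission. A sketch only: sec_edgar_body is a hypothetical helper, and it assumes filingTypes is always the comma-separated string shown in the example.

```python
from datetime import date

def sec_edgar_body(ciks, filing_types, start, end, limit=100):
    """Build a sec-edgar job body; dates are ISO strings (YYYY-MM-DD)."""
    start_d = date.fromisoformat(start)   # raises ValueError on malformed dates
    end_d = date.fromisoformat(end)
    if start_d > end_d:
        raise ValueError("startDate must not be after endDate")
    return {
        "limit": limit,
        "ciks": list(ciks),
        "filingTypes": ", ".join(filing_types),   # "10-K, 10-Q, 8-K"
        "startDate": start_d.isoformat(),
        "endDate": end_d.isoformat(),
    }

body = sec_edgar_body(
    ["AAPL", "789019", "Alphabet Inc."],
    ["10-K", "10-Q", "8-K"],
    "2024-01-01", "2025-12-31",
)
```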

Bulk-API equivalents

google-serp, google-maps, amazon-search, indeed, glassdoor, etc. Jobs accept arrays of inputs (keywords[], locations[], etc.). Use them when you want webhook fan-out; otherwise the synchronous Scraper API + paginated client loop is simpler.

Crawler vs Contacts vs Web Scraping batch

  • crawler — unknown URL set, recursive discovery.
  • contacts — known URL list, want extracted contact fields.
  • /scrape/batch/web — known URL list, want full HTML/markdown/AI extraction at >1k scale.

Gotchas

  • Persist the job id immediately (the integer id from the submit response, not jobId). It is the only handle to status, results, and stop.
  • Result file retention is short. Download right after finished.
  • Webhooks are best-effort. Always poll as a backup.
  • includePaths regex is case-sensitive.
  • Status stopped is terminal. Rows already produced remain available.
  • Don't poll faster than every 10 s — wastes concurrency cap.
  • Double-wrapped results: the row is body["data"][i]["data"], not body["data"][i].