8.4 KiB
Scraper Jobs — async, bulk
Use only when there's no Scraper-API equivalent (crawler, contacts, sec-edgar, amazon-bestsellers, amazon-product-reviews) or when you want webhook-driven fan-out without managing your own polling loop. Otherwise the matching Scraper API + paginated client loop is simpler.
| Slug | Notes |
|---|---|
crawler |
Recursive site crawl. Accepts every Web Scraping API parameter. |
contacts |
URL list → emails / phones / social profiles. |
sec-edgar |
Bulk SEC filings by CIK / ticker / company name. |
google-serp, google-maps, google-maps-reviews, google-trends |
Bulk Google. |
amazon-search, amazon-product, amazon-product-reviews, amazon-seller-products, amazon-bestsellers |
Bulk Amazon. |
shopify |
Multi-store crawl. |
zillow, redfin, airbnb |
Bulk real estate. |
yelp, yellow-pages |
Bulk local. |
indeed, glassdoor |
Bulk jobs. |
Lifecycle
POST /scrapers/<slug>/jobs→ returns the full job record. The handle isbody.id(numeric integer), notjobIddespite older doc snippets — store this. Status starts aspending.GET /scrapers/jobs/<id>— poll status.GET /scrapers/jobs/<id>/results?page=…&limit=100— oncestatus === "finished".DELETE /scrapers/jobs/<id>— stop early (rows produced before stop are kept).
Status values: pending → in_progress → finished (or stopped if cancelled).
Shortcut for finished jobs: the status response on a finished job carries a data object with direct download URLs:
"data": {
"csv": "https://f005.backblazeb2.com/file/.../{uuid}.csv",
"json": "https://f005.backblazeb2.com/file/.../{uuid}.json",
"xlsx": "https://f005.backblazeb2.com/file/.../{uuid}.xlsx"
}
For one-shot ingestion, fetch data.json directly instead of paging /results. These URLs are short-lived — download immediately on finished.
End-to-end (Python)
import os, time, requests
API_KEY = os.environ["HASDATA_API_KEY"]
H = {"x-api-key": API_KEY, "Content-Type": "application/json"}
BASE = "https://api.hasdata.com"
def submit(slug, body):
r = requests.post(f"{BASE}/scrapers/{slug}/jobs", headers=H, json=body, timeout=60)
r.raise_for_status()
return r.json()["id"] # numeric job id — not "jobId"
def wait(job_id, poll=10, cap=60, timeout=3600):
deadline = time.time() + timeout
while time.time() < deadline:
s = requests.get(f"{BASE}/scrapers/jobs/{job_id}", headers=H, timeout=60).json()
if s["status"] in ("finished", "stopped"):
return s
time.sleep(poll)
poll = min(poll * 1.5, cap)
raise TimeoutError(job_id)
def results(job_id):
page = 1
while True:
body = requests.get(
f"{BASE}/scrapers/jobs/{job_id}/results",
headers=H, params={"page": page, "limit": 100}, timeout=120,
).json()
for row in body["data"]:
yield row["data"] # double-wrapped — see below
if body["meta"]["currentPage"] >= body["meta"]["lastPage"]:
return
page += 1
Response shapes
Submit (live):
{
"id": 416349, // ← the job handle, integer
"scraperId": 26,
"status": "pending",
"creditsSpent": 0,
"dataRowsCount": 0,
"input": { ... },
"createdAt": "...", "updatedAt": "...",
"scraper": { "slug": "contacts", ... },
"columns": [ ... ]
}
Status (live; numeric fields arrive as strings when populated):
{
"id": 416349,
"status": "finished",
"creditsSpent": "5", // string!
"dataRowsCount": "1", // string!
"input": { ... },
"data": {
"csv": "https://f005.backblazeb2.com/.../{uuid}.csv",
"json": "https://f005.backblazeb2.com/.../{uuid}.json",
"xlsx": "https://f005.backblazeb2.com/.../{uuid}.xlsx"
}
}
Results page:
{
"meta": {
"total": 1, "perPage": 100,
"currentPage": 1, "lastPage": 1,
"firstPage": 1, "firstPageUrl": "/?page=1",
"lastPageUrl": "/?page=1",
"nextPageUrl": null, "previousPageUrl": null
},
"data": [
{
"id": "...", "jobId": 416349, "dataId": "...",
"data": { /* the actual scraped row */ },
"createdAt": "...", "updatedAt": "..."
}
]
}
Double data — the row is body["data"][i]["data"]; the outer wraps with id, jobId, dataId, createdAt, updatedAt.
Common body fields
limit(int) — max rows.0= no cap.webhook.url(string, https),webhook.events(any subset ofscraper.job.started,scraper.data.scraped,scraper.job.finished),webhook.headers(sent on every callback — pin a shared secret here).
Webhooks
# Submit with webhook
submit("indeed", {
"keywords": ["software engineer", "data scientist"],
"locations": ["New York, NY", "Remote"],
"limit": 500,
"webhook": {
"url": "https://your.app/hasdata-hook",
"events": ["scraper.data.scraped", "scraper.job.finished"],
"headers": {"x-shared-secret": SHARED_SECRET},
},
})
from flask import Flask, request, abort
app = Flask(__name__)
@app.post("/hasdata-hook")
def hook():
if request.headers.get("x-shared-secret") != SHARED_SECRET:
abort(401)
e = request.json
if e["event"] == "scraper.data.scraped":
save_row(e["jobId"], e["data"])
elif e["event"] == "scraper.job.finished":
finalize(e["jobId"])
return "", 200 # 2xx prevents retry
- Async with 3 retries on non-2xx. Order not guaranteed — payload is the source of truth.
- No documented HMAC. Pin a shared secret via
webhook.headers, or just fetch results via the API onscraper.job.finishedand ignore per-row deliveries. - Always pair webhooks with polling. A long quiet period probably means missed callbacks.
Per-scraper bodies
crawler — recursive site crawl
Accepts every Web Scraping API parameter applied to every page.
| Field | Notes |
|---|---|
urls |
Required. Seed URLs. |
maxDepth |
Hops from seed. |
includePaths / excludePaths |
Regex. Case-sensitive. |
limit |
Cap on pages. 0 = unlimited. |
job = submit("crawler", {
"urls": ["https://docs.example.com"],
"maxDepth": 5,
"includePaths": "/docs/.+",
"outputFormat": ["markdown"],
"excludeTags": ["script", "style", "nav", "footer"],
"limit": 2000,
})
contacts — URLs → contact info
submit("contacts", {"urls": ["https://example.com/about", "https://example.com/team"]})
Verified row schema (one row per input URL):
{
"url": "https://example.com/about",
"emails": ["..."],
"phoneNumbers": ["..."],
"linkedin": ["..."],
"xcom": ["..."], // X / Twitter — note key is "xcom"
"facebook": ["..."],
"instagram": ["..."],
"dribbble": ["..."],
"clutch": ["..."]
}
Empty arrays for missing categories — never null. If you only have a domain, discover URLs first via SERP site:example.com.
sec-edgar — bulk SEC filings
submit("sec-edgar", {
"limit": 100,
"ciks": ["AAPL", "789019", "Alphabet Inc."],
"filingTypes": "10-K, 10-Q, 8-K",
"startDate": "2024-01-01",
"endDate": "2025-12-31",
})
ciks accepts CIKs, tickers, or company names mixed.
Bulk-API equivalents
google-serp, google-maps, amazon-search, indeed, glassdoor, etc. Jobs accept arrays of inputs (keywords[], locations[], etc.). Use them when you want webhook fan-out; otherwise the synchronous Scraper API + paginated client loop is simpler.
Crawler vs Contacts vs Web Scraping batch
- crawler — unknown URL set, recursive discovery.
- contacts — known URL list, want extracted contact fields.
/scrape/batch/web— known URL list, want full HTML/markdown/AI extraction at >1k scale.
Gotchas
- Persist the job
idimmediately (the integer from the submit response — notjobId). Only handle to status, results, stop. - Result file retention is short. Download right after
finished. - Webhooks are best-effort. Always poll as a backup.
includePathsregex is case-sensitive.- Status
stoppedis terminal. Rows already produced remain available. - Don't poll faster than every 10 s — wastes concurrency cap.
- Double-wrapped results —
body["data"][i]["data"], notbody["data"][i].