hasdata
How to Install
git clone https://github.com/official && cp skills/hasdata ~/.claude/skills/
Copy SKILL.md into your .cursorrules file
HasData
Cloud platform for extracting public web data. One API key, three execution modes. All endpoints sit under https://api.hasdata.com and authenticate with x-api-key.
curl -G 'https://api.hasdata.com/scrape/google/serp' \
--data-urlencode 'q=coffee' \
-H 'x-api-key: <your-api-key>'
Status codes: 401 invalid key, 403 quota exhausted, 429 concurrency cap, 500 server error (retry).
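For callers working from Python rather than the shell, the curl example can be mirrored roughly as follows. This is a minimal sketch using only the standard library: `build_serp_request`, `describe_status`, and `STATUS_MEANINGS` are illustrative names, not part of any HasData SDK. The request is built but not sent, and the retry flag on 429 follows the retry guidance later in this document.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.hasdata.com"

# (meaning, should_retry) per status code, taken from the list above.
# Treating 429 as retryable follows the retry guidance in this document.
STATUS_MEANINGS = {
    401: ("invalid key", False),
    403: ("quota exhausted", False),
    429: ("concurrency cap", True),
    500: ("server error", True),
}


def build_serp_request(query: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request from the curl example."""
    params = urllib.parse.urlencode({"q": query})
    return urllib.request.Request(
        f"{BASE_URL}/scrape/google/serp?{params}",
        headers={"x-api-key": api_key},
        method="GET",
    )


def describe_status(code: int) -> tuple[str, bool]:
    """Return (meaning, should_retry) for an HTTP status code."""
    return STATUS_MEANINGS.get(code, ("unlisted status", False))
```

To send the request, pass it to `urllib.request.urlopen(req, timeout=300)`; the key itself would normally come from the `HASDATA_API_KEY` environment variable.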
When to Use
Use this skill when:
- The user needs web scraping.
- The user needs search engine results.
- The user needs structured data extraction.
- The user needs ecommerce, travel, jobs, or local business data.
- The user explicitly asks about HasData.
Three execution modes
| Mode | Latency | When | Endpoint |
|---|---|---|---|
| Web Scraping API | seconds | Arbitrary URL — JS rendering, CSS/AI extraction, screenshots | POST /scrape/web |
| Scraper APIs (sync) | seconds | Pre-parsed JSON for known platforms (Google, Amazon, Zillow, …) | GET /scrape/<vertical>/<resource> |
| Scraper Jobs (async) | minutes–hours | Bulk extraction, recursive crawling, webhook fan-out | POST /scrapers/<slug>/jobs |
Decision rule. Default to a Scraper API when one exists for the platform (pre-parsed JSON, no selector maintenance). Use Web Scraping for arbitrary URLs not covered by an API. Reach for a Scraper Job only when no API equivalent exists — crawler, contacts, sec-edgar, amazon-bestsellers, amazon-product-reviews — or when async fan-out + webhooks save engineering time over a paginated client loop.
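The decision rule can be written down as a small helper. This is only a sketch: `choose_mode` and its parameters are hypothetical, and whether a platform has a Scraper API is something the caller has to supply, since the full platform list lives in the reference files.

```python
# Job-only slugs named in the decision rule above.
JOB_ONLY_SLUGS = {
    "crawler",
    "contacts",
    "sec-edgar",
    "amazon-bestsellers",
    "amazon-product-reviews",
}


def choose_mode(target: str, has_scraper_api: bool, wants_async_fanout: bool = False) -> str:
    """Pick an execution mode following the decision rule."""
    # No sync equivalent exists, or async fan-out + webhooks is the point.
    if target in JOB_ONLY_SLUGS or wants_async_fanout:
        return "scraper-job"
    # Pre-parsed JSON is the default whenever the platform is covered.
    if has_scraper_api:
        return "scraper-api"
    # Arbitrary URL not covered by a Scraper API.
    return "web-scraping"
```

For example, `choose_mode("crawler", has_scraper_api=False)` selects a Scraper Job, while a Zillow lookup with `has_scraper_api=True` stays on the sync Scraper API.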
Always-true response shape
{ "requestMetadata": { "id": "…", "status": "ok", "url": "…" }, "...": "endpoint-specific" }
Treat data as valid only if requestMetadata.status === "ok". HTTP 200 alone isn't enough.
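One way to enforce this check in a client is to unwrap every response through a single function. The sketch below assumes the body has already been parsed from JSON into a dict; `HasDataError` and `unwrap` are illustrative names, not library APIs.

```python
from typing import Any


class HasDataError(Exception):
    """Raised when a response does not report requestMetadata.status == "ok"."""


def unwrap(http_status: int, body: dict[str, Any]) -> dict[str, Any]:
    """Return the endpoint-specific payload, or raise if the request failed."""
    meta = body.get("requestMetadata") or {}
    status = meta.get("status")
    # HTTP 200 alone is not enough: both conditions must hold.
    if http_status != 200 or status != "ok":
        raise HasDataError(
            f"request {meta.get('id')!r} failed: http={http_status}, status={status!r}"
        )
    # Everything except the metadata envelope is endpoint-specific data.
    return {key: value for key, value in body.items() if key != "requestMetadata"}
```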
High-leverage patterns
- SERP-first enrichment. Google SERP can surface public snippets for company and professional-profile lookup. Use it for business or authorized research, avoid unnecessary direct scraping, and treat personal email/phone lookup as allowed only with a legitimate purpose and user authorization.
- AI Mode + verify. /scrape/google/ai-mode for the answer + references → /scrape/web (markdown) on each reference URL → cited RAG context, no vector DB.
- Maps → leads. /scrape/google-maps/search returns business websites and phones; collect contact details only from public, permitted sources and apply opt-out, rate, and privacy-law constraints before any outreach use.
- Crawler → corpus. crawler Scraper Job with outputFormat: ["markdown"] + includePaths: "/docs/.+" produces an LLM-ready corpus in one submission.
- Pre-extracted via SERP rich snippets. knowledgeGraph, localResults, inlineShoppingResults, relatedQuestions carry pre-parsed public facts. Always check them before considering direct page access.
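The last pattern, checking SERP rich snippets before fetching any page, might look like the sketch below. It assumes the four keys appear at the top level of the parsed SERP response, which this document implies but does not spell out; `pre_parsed_facts` is a hypothetical helper.

```python
from typing import Any

# Rich-snippet blocks named in the pattern above.
RICH_SNIPPET_KEYS = (
    "knowledgeGraph",
    "localResults",
    "inlineShoppingResults",
    "relatedQuestions",
)


def pre_parsed_facts(serp_body: dict[str, Any]) -> dict[str, Any]:
    """Return the non-empty rich-snippet blocks present in a SERP response."""
    return {key: serp_body[key] for key in RICH_SNIPPET_KEYS if serp_body.get(key)}
```

If the returned dict already answers the question, no direct page access is needed.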
When to call from code (the wiring)
- Auth: x-api-key header on every request. Read from the HASDATA_API_KEY env variable. Never hardcode, never log.
- Timeouts: set client timeout ≥ 300 s. HasData's own deadline is 300 s; a shorter client timeout reports failures for requests that still complete server-side and are billed.
- Retries: 429 and 5xx only, with exponential backoff and jitter. Never retry other 4xx (auth, validation).
- Concurrency: cap at your plan limit. The free tier is 1; anything higher just generates 429s.
- Async jobs: the submit response handle is body.id (integer), not jobId. Persist it immediately. Poll GET /scrapers/jobs/<id> every 10–30 s with backoff; treat webhooks as best-effort and always pair them with polling. On finished, the status carries data: {csv, json, xlsx} as short-lived URLs; download immediately.
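The retry rule above (429 and 5xx only, exponential backoff, jitter) can be sketched as follows. The transport is injected as a `send` callable so the logic stays independent of any HTTP library. The function names, the base delay, and the 60 s cap are assumptions, and the jitter used here is the "full jitter" variant, one common choice among several.

```python
import random
import time
from typing import Callable


def should_retry(status: int) -> bool:
    """Retry only on 429 and 5xx; other 4xx are caller errors."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    send: Callable[[], tuple[int, dict]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, dict]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        is_last = attempt == max_attempts - 1
        if is_last or not should_retry(status):
            return status, body
        sleep(backoff_delay(attempt))
    raise RuntimeError("unreachable")
```

The final response is returned even when it is still a 429 or 5xx, so the caller decides how to surface the failure.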
See references/code-recipes.md for ready-to-paste Python and TypeScript clients with retry, backoff, bounded concurrency, and the full job lifecycle.
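As a smaller illustration of the polling half of the job lifecycle, the sketch below waits for a job to finish. `fetch_status` is a hypothetical callable that performs `GET /scrapers/jobs/<id>` and returns the parsed body. The `status` field name, the 1.5× interval growth, and the one-hour timeout are assumptions; only the `finished` state and the `data` URLs come from this document.

```python
import time
from typing import Any, Callable


def wait_for_job(
    job_id: int,
    fetch_status: Callable[[int], dict[str, Any]],
    interval: float = 10.0,
    max_interval: float = 30.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, str]:
    """Poll until the job reports "finished", then return its result URLs."""
    deadline = clock() + timeout
    while True:
        job = fetch_status(job_id)
        if job.get("status") == "finished":
            # Result URLs are short-lived: download right after this returns.
            return job["data"]
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout} s")
        sleep(interval)
        # Back off gradually from 10 s toward the 30 s ceiling.
        interval = min(max_interval, interval * 1.5)
```

A webhook receiver can short-circuit this loop when it fires, but the loop remains the fallback.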
Common gotchas
- 300 s server deadline. Match client timeout.
- Disable jsRendering first, enable only if the page needs it; most static pages parse fine without a headless browser.
- No cookies parameter: cookies go through headers["Cookie"].
- includePaths regex is case-sensitive. /blog/.+ won't match /Blog/....
- Scraper Job data is double-wrapped. Each row is body.data[i].data; outer wraps with id, jobId, dataId, createdAt, updatedAt.
- requestMetadata.status === "ok" is the only success signal. HTTP 200 alone isn't enough.
- Webhooks are best-effort with 3 retries. Always have a polling fallback.
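Two of these gotchas are easy to handle with small helpers. The sketch below assumes a job result body has already been parsed from JSON; `job_rows` and `cookie_header` are illustrative names.

```python
from typing import Any


def job_rows(body: dict[str, Any]) -> list[Any]:
    """Strip the outer wrapper (id, jobId, dataId, createdAt, updatedAt) from each row."""
    return [item["data"] for item in body.get("data", [])]


def cookie_header(cookies: dict[str, str]) -> dict[str, str]:
    """Build the headers entry that stands in for a cookies parameter."""
    return {"Cookie": "; ".join(f"{name}={value}" for name, value in cookies.items())}
```

The dict returned by `cookie_header` is meant to be merged into the `headers` object of a `/scrape/web` request.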
References
- references/web-scraping.md: POST /scrape/web parameters, JS scenarios, AI extraction, cookie auth.
- references/search.md: Google SERP / Light / AI Mode / News / Shopping / Bing / Trends + pagination.
- references/ecommerce.md: Amazon (product, search, seller, seller-products) and Shopify.
- references/real-estate.md: Zillow, Redfin (bracketed filters).
- references/travel.md: Airbnb, Booking, Google Flights (occupancy rules, token pagination, IATA codes).
- references/local-business.md: Maps (search/place/reviews/photos/posts), Yelp, YellowPages.
- references/jobs.md: Indeed and Glassdoor.
- references/youtube.md: YouTube search / video / channel / transcript.
- references/scraper-jobs.md: async submit/poll/results, Crawler, Contacts, SEC EDGAR, webhook receiver.
- references/code-recipes.md: Python / TypeScript clients with retry, backoff, concurrency, polling.
Resources
- Sitemap: https://docs.hasdata.com/llms.txt
- API status codes: https://docs.hasdata.com/api-codes
- Credits & concurrency: https://docs.hasdata.com/credits-and-concurrency
- Dashboard: https://app.hasdata.com
Limitations
- Requires access to HasData services and valid credentials.
- Data quality and available fields depend on the target website and extraction method used.
- JavaScript-heavy websites may require rendering, which can affect performance and cost.
- Use only for public data or content the user is authorized to access; respect site terms, robots/access controls, privacy law, and rate limits.
- Rate limits, quotas, and account restrictions may apply depending on the endpoint and subscription plan.
Details
| Category | Other → General |
| Source | official |
| Stars | N/A |
| Risk Level | Safe |