wise-scraper
# WISE Scraper
WISE teaches an AI coding agent structured, repeatable web scraping for JS-rendered sites. The goal is a working scraping project built from shipped WISE assets.
Rule 0 — Orient before acting. Before opening a browser or writing any code, read
`references/guide.md` § Big Picture to understand what you're building and what decisions you need to make. Only then start exploration.
Orient → Explore → Evidence → Choose tier → Exploit → JSONL → Assemble
- Orient — read the schema, templates, and runner options; understand what's shipped
- Explore — inspect the live site with `agent-browser`, test selectors, map navigation
- Evidence — record selector proof and DOM observations before designing the exploit
- Choose tier — prefer shipped plumbing, escalate only when justified; ask about runtime preference if unclear
- Exploit — assemble a profile from template fragments, run it, extend with hooks or task-local code
- Process — JSONL is the intermediate truth; assemble markdown/CSV/JSON later
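The Exploit step produces a YAML profile. As an illustration only — the field names below are invented for this sketch; the authoritative shapes live in `references/schema.cue` and `templates/*.yaml` — an assembled profile might look roughly like:

```yaml
# Hypothetical profile sketch — field names are illustrative, not the shipped schema.
start_url: "https://example.com/listings"
navigate:
  wait_for: ".results"            # settle condition before extraction
paginate:
  next_selector: "a.next"         # click-through pagination fragment
  max_pages: 10
extract:
  item_selector: ".results .card"
  # DOM-eval extraction: expressions run in the live page, not an HTML parser
  fields:
    title: "el => el.querySelector('h2')?.textContent?.trim()"
output:
  format: jsonl                   # intermediate truth; assemble CSV/markdown later
  path: "out/items.jsonl"
```

The point is composition: navigation, pagination, and extraction come from separate template fragments and are merged into one profile.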
Use when: JS-rendered sites, pagination, UI state, filter combos, structured repeatable output.
Not when: a stable API/export exists, or static curl is clearly enough.
## Agent Contract
- Orient first. Read `references/guide.md` § Big Picture and scan `templates/*.yaml` before touching `agent-browser` or writing code.
- Explore before exploiting. Use `agent-browser` to inspect DOM, interactions, and state.
- Show evidence. Record selectors, DOM snippets, or snapshots before writing profiles.
- Assemble from fragments. Templates in `templates/*.yaml` are composable — combine them. They are not alternatives.
- Infer runtime preference. If the user mentions Crawlee, Scrapy, or a Python pipeline, use Tier 4. If unclear, ask.
- DOM eval for live extraction. HTML parsing libraries are for post-processing only.
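The last point bears a sketch: DOM eval happens in the live page, while HTML parsing belongs after the run. A minimal stdlib example of the post-processing side — the helper name is hypothetical, not shipped WISE code — that flattens an already-captured HTML field to plain text:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def clean_html_field(raw: str) -> str:
    """Post-processing only: strip markup from a field captured during the run."""
    parser = _TextExtractor()
    parser.feed(raw)
    # Collapse whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())
```

This runs over JSONL records after the browser session is closed; it never replaces DOM eval against the live page.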
## Exploit Tiers
| Tier | When | What |
|---|---|---|
| 1 | Target fits declarative flow | Assemble template fragments + shipped agent-browser runner |
| 2 | Target needs adaptation | Copy/adapt runner modules, hooks, helpers, or AI adapter |
| 3 | Target exceeds reference boundary | Bespoke project, carrying WISE discipline |
| 4 | User prefers alternative runtime | Same YAML profile, executed via Crawlee or Scrapy+Playwright runner |
When escalating, explain why the simpler tier is insufficient. For Tier 4, the user's runtime preference (or project context like existing package.json/requirements.txt) determines the choice.
## Runner Boundary
The shipped runner (references/runner/) uses agent-browser for browser driving. It handles: YAML profile interpretation, DOM-eval extraction, selectors, interactions, pagination, matrix, post-processing.
Alternative runners interpret the same YAML profile with a different backend. See references/comparisons.md for Crawlee and Scrapy+Playwright runner designs.
The agent may extend beyond any runner: hooks, helper scripts, chaining, AI-assisted extraction.
## Read Next — by step
Do not read all references upfront. Read only what the current step needs:
| Step | Read |
|---|---|
| Orient | references/guide.md § Big Picture |
| Explore | agent-browser CLI help (agent-browser --help) |
| Choose tier / runtime | SKILL.md § Exploit Tiers, references/comparisons.md (if Tier 4) |
| Write profile | references/field-guide.md, references/schema.cue, scan templates/*.yaml |
| Add hooks | references/guide.md § Hook System |
| Add AI adapter | references/ai-adapter.md |
| Config / CLI | references/guide.md § Config Composition, § Runner CLI Reference |
| Worked examples | examples/overview.md |
## Working Rules
- Assemble from template fragments — combine pieces, don't pick one template
- Header-based table mapping — not positional
- Sort verification required — verify state changed after sort interactions
- Avoid ambiguous clicks — scope by CSS/role/context
- JSONL is intermediate truth — assemble final formats later
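Two of these rules sketched together — header-keyed table mapping and JSONL-to-CSV assembly — assuming extracted rows arrive as cell arrays alongside a header row (both function names are illustrative, not shipped helpers):

```python
import csv
import io
import json


def map_rows_by_header(headers: list[str], rows: list[list[str]]) -> list[dict]:
    """Key each cell by its column header, never by position."""
    return [dict(zip(headers, row)) for row in rows]


def jsonl_to_csv(jsonl_text: str, fieldnames: list[str]) -> str:
    """Assemble the final CSV from the JSONL intermediate records."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()
```

Header-based mapping survives column reordering on the target site; positional mapping silently corrupts the data.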
## Common Failure Modes
- Jumping to `agent-browser` or code before reading the framework
- Designing the exploit before collecting exploration evidence
- Jumping to bespoke code when template fragments would work
- Using HTML parsing on the live page instead of DOM eval
- Reaching for AI when selectors and plumbing are sufficient
- Ignoring user runtime preference (Crawlee/Scrapy) and defaulting to shipped runner