wise-scraper
# WISE Scraper
WISE teaches an AI coding agent structured, repeatable web scraping for JS-rendered sites. The goal is a working scraping project built from shipped WISE assets.
Rule 0 — Orient before acting. Before opening a browser or writing any code, read
`references/guide.md` § Big Picture to understand what you're building and what decisions you need to make. Only then start exploration.
Orient → Explore → Evidence → Choose tier → Exploit → JSONL → Assemble
- Orient — read the schema, templates, and runner options; understand what's shipped
- Explore — inspect the live site with `agent-browser`, test selectors, map navigation
- Evidence — record selector proof and DOM observations before designing the exploit
- Choose tier — prefer shipped plumbing, escalate only when justified; ask about runtime preference if unclear
- Exploit — assemble a profile from template fragments, run it, extend with hooks or task-local code
- Process — JSONL is the intermediate truth; assemble markdown/CSV/JSON later
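The Exploit step produces a YAML profile. As an illustration only — the field names below are invented for this sketch; the authoritative shapes live in `references/schema.cue` and `templates/*.yaml` — an assembled profile might look roughly like:

```yaml
# Hypothetical profile sketch — field names are illustrative, not the shipped schema.
start_url: "https://example.com/listings"
navigate:
  wait_for: ".results"            # settle condition before extraction
paginate:
  next_selector: "a.next"         # click-through pagination fragment
  max_pages: 10
extract:
  item_selector: ".results .card"
  # DOM-eval extraction: expressions run in the live page, not an HTML parser
  fields:
    title: "el => el.querySelector('h2')?.textContent?.trim()"
output:
  format: jsonl                   # intermediate truth; assemble CSV/markdown later
  path: "out/items.jsonl"
```

The point is composition: navigation, pagination, and extraction come from separate template fragments and are merged into one profile.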
Use when: JS-rendered sites, pagination, UI state, filter combos, structured repeatable output.
Not when: a stable API/export exists, or static curl is clearly enough.
## Agent Contract
- Orient first. Read `references/guide.md` § Big Picture and scan `templates/*.yaml` before touching `agent-browser` or writing code.
- Explore before exploiting. Use `agent-browser` to inspect DOM, interactions, and state.
- Show evidence. Record selectors, DOM snippets, or snapshots before writing profiles.
- Assemble from fragments. Templates in `templates/*.yaml` are composable — combine them. They are not alternatives.
- Infer runtime preference. If the user mentions Crawlee, Scrapy, or a Python pipeline, use Tier 4. If unclear, ask.
- DOM eval for live extraction. HTML parsing libraries are for post-processing only.
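The last point bears a sketch: DOM eval happens in the live page, while HTML parsing belongs after the run. A minimal stdlib example of the post-processing side — the helper name is hypothetical, not shipped WISE code — that flattens an already-captured HTML field to plain text:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def clean_html_field(raw: str) -> str:
    """Post-processing only: strip markup from a field captured during the run."""
    parser = _TextExtractor()
    parser.feed(raw)
    # Collapse whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())
```

This runs over JSONL records after the browser session is closed; it never replaces DOM eval against the live page.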
## Exploit Tiers
| Tier | When | What |
|---|---|---|
| 1 | Target fits declarative flow | Assemble template fragments + shipped agent-browser runner |
| 2 | Target needs adaptation | Copy/adapt runner modules, hooks, helpers, or AI adapter |
| 3 | Target exceeds reference boundary | Bespoke project, carrying WISE discipline |
| 4 | User prefers alternative runtime | Same YAML profile, executed via Crawlee or Scrapy+Playwright runner |
When escalating, explain why the simpler tier is insufficient. For Tier 4, the user's runtime preference (or project context like existing package.json/requirements.txt) determines the choice.
## Runner Boundary
The shipped runner (references/runner/) uses agent-browser for browser driving. It handles: YAML profile interpretation, DOM-eval extraction, selectors, interactions, pagination, matrix, post-processing.
Alternative runners interpret the same YAML profile with a different backend. See references/comparisons.md for Crawlee and Scrapy+Playwright runner designs.
The agent may extend beyond any runner: hooks, helper scripts, chaining, AI-assisted extraction.
## Read Next — by step
Do not read all references upfront. Read only what the current step needs:
| Step | Read |
|---|---|
| Orient | references/guide.md § Big Picture |
| Explore | agent-browser CLI help (agent-browser --help) |
| Choose tier / runtime | SKILL.md § Exploit Tiers, references/comparisons.md (if Tier 4) |
| Write profile | references/field-guide.md, references/schema.cue, scan templates/*.yaml |
| Add hooks | references/guide.md § Hook System |
| Add AI adapter | references/ai-adapter.md |
| Config / CLI | references/guide.md § Config Composition, § Runner CLI Reference |
| Worked examples | examples/overview.md |
## Working Rules
- Assemble from template fragments — combine pieces, don't pick one template
- Header-based table mapping — not positional
- Sort verification required — verify state changed after sort interactions
- Avoid ambiguous clicks — scope by CSS/role/context
- JSONL is intermediate truth — assemble final formats later
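Two of these rules sketched together — header-keyed table mapping and JSONL-to-CSV assembly — assuming extracted rows arrive as cell arrays alongside a header row (both function names are illustrative, not shipped helpers):

```python
import csv
import io
import json


def map_rows_by_header(headers: list[str], rows: list[list[str]]) -> list[dict]:
    """Key each cell by its column header, never by position."""
    return [dict(zip(headers, row)) for row in rows]


def jsonl_to_csv(jsonl_text: str, fieldnames: list[str]) -> str:
    """Assemble the final CSV from the JSONL intermediate records."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()
```

Header-based mapping survives column reordering on the target site; positional mapping silently corrupts the data.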
## Common Failure Modes
- Jumping to `agent-browser` or code before reading the framework
- Designing the exploit before collecting exploration evidence
- Jumping to bespoke code when template fragments would work
- Using HTML parsing on the live page instead of DOM eval
- Reaching for AI when selectors and plumbing are sufficient
- Ignoring user runtime preference (Crawlee/Scrapy) and defaulting to shipped runner