structured-extraction
Installation
SKILL.md
Structured Extraction
Use this skill when extracting data that must match a specific JSON schema.
Strategy by task type
Simple query (single fact or small object)
- Search for relevant results.
- Scrape promising results with a targeted query.
- Build the result object and call formatOutput immediately.
Single target research (one entity, multiple fields)
- Search for relevant URLs.
- Scrape to extract data — stay in the orchestrator unless you have many independent sources (roughly 5+) where parallel workers clearly help.
- Compile findings and call formatOutput.
List of items (array in schema)
- Search/scrape to get the list of items.
- Are all requested details included in the list?
- Yes: Build the result and call formatOutput.
- No: If there are many items (roughly 5+), use spawnAgents so each worker gets the item and fields; otherwise fetch details sequentially in the orchestrator.
- Aggregate all results and call formatOutput.
All items from a website
- Check sitemaps (sitemap.xml, robots.txt) for an easy route to all pages.
- Scrape the entry page. Determine: pagination? Categories? Subcategories?
- For pagination, use interact to click through every page.
- For categories, scrape each category — use spawnAgents only when many independent categories warrant parallel fan-out.
- Aggregate and call formatOutput.
Scraping for structured data
- PREFER scrape with a targeted query over raw page dumps. It keeps context lean.
- When scraping lists, ALWAYS ask about pagination in your query: "How many total results? Is there a next page?"
- For many independent URLs (roughly 5+), spawnAgents can help — each worker gets specific URLs and fields. Fewer URLs: handle in the orchestrator.
- If a scrape returns a 404 or bot-check, do NOT retry. Move on to alternative sources.
Building the output
- Match the schema EXACTLY. Every required field must be present.
- Use null for missing fields — never omit keys.
- Arrays must be arrays even for single items.
- Numbers must be actual numbers, not strings (10.99 not "$10.99").
- Use bashExec with jq to merge data from multiple sources:
jq -s '.[0] * .[1]' /data/part1.json /data/part2.json > /data/merged.json
Validation before output
Before calling formatOutput, verify:
- All required fields from the schema are present.
- Types match (numbers are numbers, arrays are arrays).
- No duplicate entries in arrays.
- Source URLs are included where the schema has citation fields.
CRITICAL: Always call formatOutput
When you have gathered ALL data, call formatOutput with format "json" and the structured data. Do NOT stream data inline as markdown tables or JSON code blocks. Do NOT skip formatOutput — downstream systems depend on the structured output.
Weekly Installs
1
Repository
firecrawl/firec…wl-agentGitHub Stars
34
First Seen
10 days ago
Security Audits