web-scraper
Web Scraper
Overview
Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Pagination, change monitoring, and CSV/JSON export.
When to Use This Skill
- When the user mentions "scraper" or related topics
- When the user mentions "scraping" or related topics
- When the user mentions "extrair dados web" (extract web data) or related topics
- When the user mentions "web scraping" or related topics
- When the user mentions "raspar dados" (scrape data) or related topics
- When the user mentions "coletar dados site" (collect site data) or related topics
Do Not Use This Skill When
- The task is unrelated to web scraper
- A simpler, more specific tool can handle the request
- The user needs general-purpose assistance without domain expertise
How It Works
Execute phases in strict order. Each phase feeds the next.
1. CLARIFY -> 2. RECON -> 3. STRATEGY -> 4. EXTRACT -> 5. TRANSFORM -> 6. VALIDATE -> 7. FORMAT
Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.
Fast path: If user provides URL + clear data target + the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. Still validate and format.
Capabilities
- Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
- Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
- Output formats: Markdown tables (default), JSON, CSV
- Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
- Multi-URL: extract same structure across sources with comparison and diff
- Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
- Auto-escalation: WebFetch fails silently -> automatic Browser fallback
- Data transforms: cleaning, normalization, deduplication, enrichment
- Differential mode: detect changes between scraping runs
Web Scraper
Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.
Phase 1: Clarify
Establish extraction parameters before touching any URL.
Required Parameters
| Parameter | Resolve | Default |
|---|---|---|
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |
Optional Parameters
| Parameter | Resolve | Default |
|---|---|---|
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |
Clarification Rules
- If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
- If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
- Default to Markdown table output. Mention alternatives only if relevant.
- Accept requests in any language. Always respond in the user's language.
- If user says "everything" or "all data", perform recon first, then present what's available and let user choose.
Discovery Mode
When user has a topic but no specific URL:
- Use WebSearch to find the most relevant pages
- Present top 3-5 URLs with descriptions
- Let user choose which to scrape, or scrape all
- Proceed to Phase 2 with selected URL(s)
Example: "find and extract pricing data for CRM tools" -> WebSearch("CRM tools pricing comparison 2026") -> Present top results -> User selects -> Extract
Phase 2: Reconnaissance
Analyze the target page before extraction.
Step 2.1: Initial Fetch
Use WebFetch to retrieve and analyze the page structure:
```
WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)
```
Step 2.2: Evaluate Fetch Quality
| Signal | Interpretation | Action |
|---|---|---|
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |
Step 2.3: Content Classification
Classify into an extraction mode:
| Mode | Indicators | Examples |
|---|---|---|
| table | HTML `<table>`, grid layout with headers | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User-specified CSS selectors or fields | Anything not matching above |
Record: page type, extraction mode, JS rendering needed (yes/no), available fields, structured data present (JSON-LD etc.).
If user asked for "everything", present the available fields and let them choose.
Phase 3: Strategy Selection
Choose the extraction approach based on recon results.
Decision Tree
Structured data (JSON-LD, microdata) has what we need?
|
+-- YES --> STRATEGY E: Extract structured data directly
|
+-- NO: Content fully visible in WebFetch?
|
+-- YES: Need precise element targeting?
| |
| +-- NO --> STRATEGY A: WebFetch + AI extraction
| +-- YES --> STRATEGY B: Browser automation
|
+-- NO: JavaScript rendering detected?
|
+-- YES --> STRATEGY B: Browser automation
+-- NO: API/JSON/XML endpoint or download link?
|
+-- YES --> STRATEGY C: Bash (curl + jq)
+-- NO --> Report access issue to user
Strategy A: WebFetch With AI Extraction
Best for: Static pages, articles, simple tables, well-structured HTML.
Use WebFetch with a targeted extraction prompt tailored to the mode:
```
WebFetch(
  url = URL,
  prompt = "Extract [DATA_TARGET] from this page.
    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
    Rules:
    - If a value is missing or unclear, use 'N/A'
    - Do not include navigation, ads, footers, or unrelated content
    - Preserve original values exactly (numbers, currencies, dates)
    - Include ALL matching items, not just the first few
    - For each item, also extract the URL/link if available"
)
```
Auto-escalation: If WebFetch returns suspiciously few items (less than 50% of expected from recon), or mostly empty fields, automatically escalate to Strategy B without asking user. Log the escalation in notes.
Strategy B: Browser Automation
Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.
Sequence:
1. Get tab context: `tabs_context_mcp(createIfEmpty=true)` -> get tabId
2. Navigate to URL: `navigate(url=TARGET_URL, tabId=TAB)`
3. Wait for content to load: `computer(action="wait", duration=3, tabId=TAB)`
4. Check for cookie/consent banners: `find(query="cookie consent or accept button", tabId=TAB)`. If found, dismiss it (prefer the privacy-preserving option)
5. Read page structure: `read_page(tabId=TAB)` or `get_page_text(tabId=TAB)`
6. Locate target elements: `find(query="[DESCRIPTION]", tabId=TAB)`
7. Extract precise data with JavaScript via `javascript_tool`:

```javascript
// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
  const cells = row.querySelectorAll('td, th');
  return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
```

```javascript
// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
  link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
```

8. For lazy-loaded content, scroll and re-extract: `computer(action="scroll", scroll_direction="down", tabId=TAB)` then `computer(action="wait", duration=2, tabId=TAB)`
Strategy C: Bash (curl + jq)
Best for: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.

```bash
# JSON API
curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'

# CSV download
curl -s "CSV_URL" -o /tmp/scraped_data.csv

# XML parsing
curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)
# ... parse and output JSON
"
```
Strategy D: Hybrid
When a single strategy is insufficient, combine:
- WebSearch to discover relevant URLs
- WebFetch for initial content assessment
- Browser automation for JS-heavy sections
- Bash for post-processing (jq, python for data cleaning)
Strategy E: Structured Data Extraction
When JSON-LD, microdata, or OpenGraph is present:
- Use the Browser `javascript_tool` to extract structured data:

```javascript
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
const data = Array.from(scripts).map(s => {
  try { return JSON.parse(s.textContent); } catch { return null; }
}).filter(Boolean);
JSON.stringify(data);
```

- This often provides cleaner, more reliable data than DOM scraping
- Fall back to DOM extraction only for fields not in structured data
Pagination Handling
When pagination is detected and user wants multiple pages:
Page-number pagination (any strategy):
- Extract data from current page
- Identify URL pattern (e.g. `?page=N`, `/page/N`, `&offset=N`)
- Iterate through pages up to user's max (default: 5 pages)
- Show progress: "Extracting page 2/5..."
- Concatenate all results, deduplicate if needed
Infinite scroll (Browser only):
- Extract currently visible data
- Record item count
- Scroll down: `computer(action="scroll", scroll_direction="down", tabId=TAB)`
- Wait: `computer(action="wait", duration=2, tabId=TAB)`
- Extract newly loaded data
- Compare count; if no new items after 2 scrolls, stop
- Repeat until no new content or max iterations (default: 5)
"Load More" button (Browser only):
- Extract currently visible data
- Find button: `find(query="load more button", tabId=TAB)`
- Click it: `computer(action="left_click", ref=REF, tabId=TAB)`
- Wait and extract new content
- Repeat until button disappears or max iterations reached
Phase 4: Extract
Execute the selected strategy using mode-specific patterns. See references/extraction-patterns.md for CSS selectors and JavaScript snippets.
Table Mode
WebFetch prompt:
"Extract ALL rows from the table(s) on this page.
Return as a markdown table with exact column headers.
Include every row - do not truncate or summarize.
Preserve numeric precision, currencies, and units."
List Mode
WebFetch prompt:
"Extract each [ITEM_TYPE] from this page.
For each item, extract: [FIELD_LIST].
Return as a JSON array of objects with these keys: [KEY_LIST].
Include ALL items, not just the first few. Include link/URL for each item if available."
Article Mode
WebFetch prompt:
"Extract article metadata:
- title, author, date, tags/categories, word count estimate
- Key factual data points, statistics, and named entities
Return as structured markdown. Summarize the content; do not reproduce full text."
Product Mode
WebFetch prompt:
"Extract product data with these exact fields:
- name, brand, price, currency, originalPrice (if discounted),
availability, description (first 200 chars), rating, reviewCount,
specifications (as key-value pairs), productUrl, imageUrl
Return as JSON. Use null for missing fields."
Also check for JSON-LD Product schema (Strategy E) first.
Contact Mode
WebFetch prompt:
"Extract contact information for each person/entity:
- name, title, role, email, phone, address, organization, website, linkedinUrl
Return as a markdown table. Only extract real contacts visible on the page."
FAQ Mode
WebFetch prompt:
"Extract all question-answer pairs from this page.
For each FAQ item extract:
- question: the exact question text
- answer: the answer text (first 300 chars if long)
- category: the section/category if grouped
Return as a JSON array of objects."
Pricing Mode
WebFetch prompt:
"Extract all pricing plans/tiers from this page.
For each plan extract:
- planName, monthlyPrice, annualPrice, currency
- features (array of included features)
- limitations (array of limits or excluded features)
- ctaText (call-to-action button text)
- highlighted (true if marked as recommended/popular)
Return as JSON. Use null for missing fields."
Events Mode
WebFetch prompt:
"Extract all events/sessions from this page.
For each event extract:
- title, date, time, endTime, location, description (first 200 chars)
- speakers (array of names), category, registrationUrl
Return as JSON. Use null for missing fields."
Jobs Mode
WebFetch prompt:
"Extract all job listings from this page.
For each job extract:
- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
- postedDate, description (first 200 chars), applyUrl, tags
Return as JSON. Use null for missing fields."
Custom Mode
When user provides specific selectors or field descriptions:
- Use Browser automation with `javascript_tool` and the user's CSS selectors
- Or use WebFetch with a prompt built from the user's field descriptions
- Always confirm extracted schema with user before proceeding to multi-URL
Multi-URL Extraction
When extracting from multiple URLs:
- Extract from the first URL to establish the data schema
- Show user the first results and confirm the schema is correct
- Extract from remaining URLs using the same schema
- Add a `source` column/field to every record with the origin URL
- Combine all results into a single output
- Show progress: "Extracting 3/7 URLs..."
Phase 5: Transform
Clean, normalize, and enrich extracted data before validation. See references/data-transforms.md for patterns.
Automatic Transforms (Always Apply)
| Transform | Action |
|---|---|
| Whitespace cleanup | Trim, collapse multiple spaces, remove \n in cells |
| HTML entity decode | `&amp;` -> &, `&lt;` -> <, `&#39;` -> ' |
| Unicode normalization | NFKC normalization for consistent characters |
| Empty string to null | "" -> null (for JSON), "" -> N/A (for tables) |
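A minimal Python sketch of the four automatic transforms, assuming cell values arrive as raw strings (the function name `clean_cell` is illustrative):

```python
import html
import unicodedata

def clean_cell(value):
    """Whitespace cleanup, HTML entity decode, NFKC normalization,
    and empty-string-to-None, applied in one pass."""
    if value is None:
        return None
    value = html.unescape(value)                  # &amp; -> &, &#39; -> '
    value = unicodedata.normalize("NFKC", value)  # fold full-width and compatibility chars
    value = " ".join(value.split())               # trim, collapse spaces, drop embedded \n
    return value or None                          # "" becomes None (render as N/A in tables)
```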
Conditional Transforms (Apply When Relevant)
| Transform | When | Action |
|---|---|---|
| Price normalization | Product/pricing modes | Extract numeric value + currency symbol |
| Date normalization | Any dates found | Normalize to ISO-8601 (YYYY-MM-DD) |
| URL resolution | Relative URLs extracted | Convert to absolute URLs |
| Phone normalization | Contact mode | Standardize to E.164 format if possible |
| Deduplication | Multi-page or multi-URL | Remove exact duplicate rows |
| Sorting | User requested or natural | Sort by user-specified field |
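Hedged sketches of three of these conditional transforms; the regex and date formats are illustrative and real pages need broader patterns:

```python
import re
from datetime import datetime
from urllib.parse import urljoin

def normalize_price(raw):
    """Split a display price like '$1,299.00' into (1299.0, '$')."""
    m = re.search(r"([$€£])\s*([\d,]+(?:\.\d+)?)", raw)
    if not m:
        return None, None
    return float(m.group(2).replace(",", "")), m.group(1)

def normalize_date(raw, formats=("%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d")):
    """Try a few common formats; return ISO-8601 or the original string."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return raw  # leave unparseable dates untouched rather than guess

def resolve_url(page_url, href):
    """Convert a relative link to an absolute URL against the page it came from."""
    return urljoin(page_url, href)
```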
Data Enrichment (Only When Useful)
| Enrichment | When | Action |
|---|---|---|
| Currency conversion | User asks for single currency | Note original + convert (approximate) |
| Domain extraction | URLs in data | Add domain column from full URLs |
| Word count | Article mode | Count words in extracted text |
| Relative dates | Dates present | Add "X days ago" column if useful |
Deduplication Strategy
When combining data from multiple pages or URLs:
- Exact match: rows with identical values in all fields -> keep first
- Near match: rows with same key fields (name+source) but different details -> keep most complete (fewer nulls), flag in notes
- Report: "Removed N duplicate rows" in delivery notes
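The near-match rule can be sketched as follows; `key_fields` defaults to the name+source pair mentioned above, and "most complete" means fewest None values:

```python
def deduplicate(rows, key_fields=("name", "source")):
    """Collapse rows sharing the key fields, keeping the most complete copy.
    Exact duplicates collapse automatically; first-appearance order is kept."""
    best, order = {}, []
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        if key not in best:
            best[key] = row
            order.append(key)
        elif sum(v is None for v in row.values()) < sum(v is None for v in best[key].values()):
            best[key] = row  # newer copy has fewer nulls: keep it, flag in notes
    return [best[k] for k in order]
```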
Phase 6: Validate
Verify extraction quality before delivering results.
Validation Checks
| Check | Action |
|---|---|
| Item count | Compare extracted count to expected count from recon |
| Empty fields | Count N/A or null values per field |
| Data type consistency | Numbers should be numeric, dates parseable |
| Duplicates | Flag exact duplicate rows (post-dedup) |
| Encoding | Check for HTML entities, garbled characters |
| Completeness | All user-requested fields present in output |
| Truncation | Verify data wasn't cut off (check last items) |
| Outliers | Flag values that seem anomalous (e.g. $0.00 price) |
Confidence Rating
Assign to every extraction:
| Rating | Criteria |
|---|---|
| HIGH | All fields populated, count matches expected, no anomalies |
| MEDIUM | Minor gaps (<10% empty fields) or count slightly differs |
| LOW | Significant gaps (>10% empty), structural issues, partial data |
Always report confidence with specifics:
Confidence: HIGH - 47 items extracted, all 6 fields populated, matches expected count from page analysis.
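One way to compute the rating mechanically, treating None, empty strings, and "N/A" as empty (the thresholds mirror the criteria table above):

```python
def rate_confidence(rows, expected_count, fields):
    """HIGH: no empty fields and count matches; MEDIUM: under 10% empty
    fields (or only a count mismatch); LOW: anything worse."""
    total = len(rows) * len(fields)
    empty = sum(1 for row in rows for f in fields if row.get(f) in (None, "", "N/A"))
    empty_ratio = empty / total if total else 1.0
    count_ok = expected_count is None or len(rows) == expected_count
    if empty == 0 and count_ok:
        return "HIGH"
    if empty_ratio < 0.10:
        return "MEDIUM"
    return "LOW"
```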
Auto-Recovery (Try Before Reporting Issues)
| Issue | Auto-Recovery Action |
|---|---|
| Missing data | Re-attempt with Browser if WebFetch was used |
| Encoding problems | Apply HTML entity decode + unicode normalization |
| Incomplete results | Check for pagination or lazy-loading, fetch more |
| Count mismatch | Scroll/paginate to find remaining items |
| All fields empty | Page likely JS-rendered, switch to Browser strategy |
| Partial fields | Try JSON-LD extraction as supplement |
Log all recovery attempts in delivery notes. Inform user of any irrecoverable gaps with specific details.
Phase 7: Format And Deliver
Structure results according to user preference. See references/output-templates.md for complete formatting templates.
Delivery Envelope
ALWAYS wrap results with this metadata header:
## Extraction Results
**Source:** [Page Title](http://example.com)
**Date:** YYYY-MM-DD HH:MM UTC
**Items:** N records (M fields each)
**Confidence:** HIGH | MEDIUM | LOW
**Strategy:** A (WebFetch) | B (Browser) | C (API) | E (Structured Data)
**Format:** Markdown Table | JSON | CSV
---
[DATA HERE]
---
**Notes:**
- [Any gaps, issues, or observations]
- [Transforms applied: deduplication, normalization, etc.]
- [Pages scraped if paginated: "Pages 1-5 of 12"]
- [Auto-escalation if it occurred: "Escalated from WebFetch to Browser"]
Markdown Table Rules
- Left-align text columns (`:---`), right-align numbers (`---:`)
- Consistent column widths for readability
- Include summary row for numeric data when useful (totals, averages)
- Maximum 10 columns per table; split wider data into multiple tables or suggest JSON format
- Truncate long cell values to 60 chars with `...` indicator
- Use `N/A` for missing values, never leave cells empty
- For multi-page results, show combined table (not per-page)
JSON Rules
- Use camelCase for keys (e.g. `productName`, `unitPrice`)
- Wrap in metadata envelope:

```json
{
  "metadata": {
    "source": "URL",
    "title": "Page Title",
    "extractedAt": "ISO-8601",
    "itemCount": 47,
    "fieldCount": 6,
    "confidence": "HIGH",
    "strategy": "A",
    "transforms": ["deduplication", "priceNormalization"],
    "notes": []
  },
  "data": [ ... ]
}
```

- Pretty-print with 2-space indentation
- Numbers as numbers (not strings), booleans as booleans
- null for missing values (not empty strings)
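A sketch of producing the envelope programmatically; `wrap_envelope` is a hypothetical helper and the field values are placeholders:

```python
import json
from datetime import datetime, timezone

def wrap_envelope(data, source, title, confidence, strategy, transforms=()):
    """Wrap extracted records in the metadata envelope, pretty-printed
    with 2-space indentation as the JSON rules require."""
    envelope = {
        "metadata": {
            "source": source,
            "title": title,
            "extractedAt": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "itemCount": len(data),
            "fieldCount": len(data[0]) if data else 0,
            "confidence": confidence,
            "strategy": strategy,
            "transforms": list(transforms),
            "notes": [],
        },
        "data": data,
    }
    return json.dumps(envelope, indent=2, ensure_ascii=False)
```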
CSV Rules
- First row is always headers
- Quote any field containing commas, quotes, or newlines
- UTF-8 encoding with BOM for Excel compatibility
- Use `,` as delimiter (standard)
- Include metadata as comments: `# Source: URL`
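These rules map directly onto Python's csv module: `utf-8-sig` prepends the BOM, and `QUOTE_MINIMAL` quotes exactly the fields containing commas, quotes, or newlines. A sketch (the `write_csv` helper is illustrative):

```python
import csv

def write_csv(path, rows, fieldnames, source_url):
    """Write rows as CSV: UTF-8 with BOM, metadata comment line,
    header row, and minimal quoting."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        f.write(f"# Source: {source_url}\n")
        writer = csv.DictWriter(f, fieldnames=fieldnames, quoting=csv.QUOTE_MINIMAL)
        writer.writeheader()
        writer.writerows(rows)
```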
File Output
When user requests file save:
- Markdown: `.md` extension
- JSON: `.json` extension
- CSV: `.csv` extension
- Confirm path before writing
- Report full file path and item count after saving
Multi-URL Comparison Format
When comparing data across multiple sources:
- Add `Source` as the first column/field
- Use short identifiers for sources (domain name or user label)
- Group by source or interleave based on user preference
- Highlight differences if user asks for comparison
- Include summary: "Best price: $X at store-b.com"
Differential Output
When user requests change detection (diff mode):
- Compare current extraction with previous run
- Mark new items with `[NEW]`
- Mark removed items with `[REMOVED]`
- Mark changed values with `[WAS: old_value]`
- Include summary: "Changes since last run: +5 new, -2 removed, 3 modified"
Rate Limiting
- Maximum 1 request per 2 seconds for sequential page fetches
- For multi-URL jobs, process sequentially with pauses
- If a site returns 429 (Too Many Requests), stop and report to user
Access Respect
- If a page blocks access (403, CAPTCHA, login wall), report to user
- Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls
- Do NOT scrape behind authentication unless user explicitly provides access
- Respect robots.txt directives when known
Copyright
- Do NOT reproduce large blocks of copyrighted article text
- For articles: extract factual data, statistics, and structured info; summarize narrative content
- Always include source attribution (http://example.com) in output
Data Scope
- Extract ONLY what the user explicitly requested
- Warn user before collecting potentially sensitive data at scale (emails, phone numbers, personal information)
- Do not store or transmit extracted data beyond what the user sees
Failure Protocol
When extraction fails or is blocked:
- Explain the specific reason (JS rendering, bot detection, login, etc.)
- Suggest alternatives (different URL, API if available, manual approach)
- Never retry aggressively or escalate access attempts
Quick Reference: Mode Cheat Sheet
| User Says... | Mode | Strategy | Output Default |
|---|---|---|---|
| "extract the table" | table | A or B | Markdown table |
| "get all products/prices" | product | E then A | Markdown table |
| "scrape the listings" | list | A or B | Markdown table |
| "extract contact info / team page" | contact | A | Markdown table |
| "get the article data" | article | A | Markdown text |
| "extract the FAQ" | faq | A or B | JSON |
| "get pricing plans" | pricing | A or B | Markdown table |
| "scrape job listings" | jobs | A or B | Markdown table |
| "get event schedule" | events | A or B | Markdown table |
| "find and extract [topic]" | discovery | WebSearch | Markdown table |
| "compare prices across sites" | multi-URL | A or B | Comparison table |
| "what changed since last time" | diff | any | Diff format |
References
- Extraction patterns (references/extraction-patterns.md): CSS selectors, JavaScript snippets, JSON-LD parsing, domain tips.
- Output templates (references/output-templates.md): Markdown, JSON, CSV templates with complete examples.
- Data transforms (references/data-transforms.md): cleaning, normalization, deduplication, enrichment patterns.
Best Practices
- Provide clear, specific context about your project and requirements
- Review all suggestions before applying them to production code
- Combine with other complementary skills for comprehensive analysis
Common Pitfalls
- Using this skill for tasks outside its domain expertise
- Applying recommendations without understanding your specific context
- Not providing enough project context for accurate analysis