Apify Scraper Builder

Build production-ready Apify Actors using Node.js/TypeScript and Crawlee.

Crawler Type Decision Tree

| Scenario | Crawler | Why |
|---|---|---|
| Static HTML, no JavaScript | CheerioCrawler | Fastest, lowest memory |
| JavaScript-rendered content | PlaywrightCrawler | Modern, cross-browser |
| Legacy sites, specific Chrome behavior | PuppeteerCrawler | Chrome-specific features |
| Need to handle both static and JS | PlaywrightCrawler | More versatile |
| High-volume scraping (1000s pages) | CheerioCrawler | Best performance |
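
The decision table can also be read as a short function. The sketch below is only an illustration: the `SiteProfile` shape and the `chooseCrawler` name are hypothetical, and the order of the checks is one reasonable reading of the table, not a rule from Crawlee.

```typescript
// Hypothetical description of a target site; field names are illustrative only.
interface SiteProfile {
    needsJavaScript: boolean;     // content is rendered client-side, or the site mixes static and JS pages
    needsChromeSpecific: boolean; // legacy site or behavior that depends on Chrome
}

type CrawlerName = 'CheerioCrawler' | 'PlaywrightCrawler' | 'PuppeteerCrawler';

// Walks the table from the most specific requirement to the cheapest option.
function chooseCrawler(site: SiteProfile): CrawlerName {
    if (site.needsChromeSpecific) return 'PuppeteerCrawler';
    if (site.needsJavaScript) return 'PlaywrightCrawler';
    // Static HTML: the fastest, lowest-memory choice, also suited to high-volume runs.
    return 'CheerioCrawler';
}
```

For example, `chooseCrawler({ needsJavaScript: true, needsChromeSpecific: false })` selects `PlaywrightCrawler`.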

Actor Creation Workflow

Step 1: Initialize Project

python scripts/init_actor.py my-scraper --type cheerio

Or create the structure manually:

my-scraper/
├── .actor/
│   ├── actor.json           # REQUIRED
│   ├── input_schema.json    # Recommended
│   └── Dockerfile           # REQUIRED
├── src/
│   └── main.ts              # Entry point
├── package.json
└── tsconfig.json

Step 2: Configure actor.json

{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}

Step 3: Define Input Schema

python scripts/generate_input_schema.py "Scrape product pages with URLs, max items limit, and proxy support"

Or use templates from references/input-schema-guide.md

Step 4: Implement Crawler

Use patterns from references/crawlee-patterns.md

Step 5: Validate Configuration

python scripts/validate_actor.py /path/to/actor

Step 6: Deploy

apify login
apify push

Project Structure

Required Files

.actor/actor.json

{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "minMemoryMbytes": 256,
    "maxMemoryMbytes": 4096,
    "dockerfile": "./Dockerfile",
    "input": "./input_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}

.actor/Dockerfile (Node.js)

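FROM apify/actor-node:20

COPY package*.json ./
# Install all dependencies, including devDependencies: tsc is needed for the build step below
RUN npm --quiet set progress=false \
    && npm install --include=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && (npm list || true) \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

COPY . ./
# Compile TypeScript to dist/ (the "build" script), then drop devDependencies from the image
RUN npm run build \
    && npm prune --omit=dev
CMD npm start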

package.json

{
    "name": "my-scraper",
    "version": "0.0.1",
    "type": "module",
    "main": "dist/main.js",
    "scripts": {
        "start": "node dist/main.js",
        "build": "tsc"
    },
    "dependencies": {
        "apify": "^3.0.0",
        "crawlee": "^3.0.0"
    },
    "devDependencies": {
        "typescript": "^5.0.0"
    }
}

Input Schema Editors

| Editor | Use Case | Example |
|---|---|---|
| textfield | Single-line text | Name, URL |
| textarea | Multi-line text | CSS selectors, notes |
| requestListSources | URL list with labels | Start URLs |
| proxy | Proxy configuration | Apify Proxy settings |
| json | JSON object/array | Custom configuration |
| select | Dropdown options | Country, category |
| checkbox | Boolean toggle | Debug mode |
| number | Integer/float | Max items, delay |
| datepicker | Date selection | Date range filter |
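
Each editor only makes sense for certain field types, which is what the checklist item "Input schema editors match field types" refers to. The sketch below shows one way to check that pairing. The `EDITOR_TYPES` map and the `editorMatchesType` helper are hypothetical, and the mapping is inferred from the "Use Case" column above, so treat it as an assumption and confirm it against the input schema specification.

```typescript
type SchemaType = 'string' | 'array' | 'object' | 'boolean' | 'integer' | 'number';

// Assumed editor-to-type pairing, inferred from the table above (not an official list).
const EDITOR_TYPES: Record<string, SchemaType[] | undefined> = {
    textfield: ['string'],
    textarea: ['string'],
    requestListSources: ['array'],
    proxy: ['object'],
    json: ['object', 'array'],
    select: ['string'],
    checkbox: ['boolean'],
    // The table lists "Integer/float"; check which numeric types the spec accepts.
    number: ['integer', 'number'],
    datepicker: ['string'],
};

interface SchemaField {
    type: string;
    editor?: string;
}

// Returns true when the field has no editor, or when its editor is known and fits its type.
function editorMatchesType(field: SchemaField): boolean {
    if (field.editor === undefined) return true;
    const allowed = EDITOR_TYPES[field.editor];
    if (allowed === undefined) return false; // unknown editor name
    return allowed.some((t) => t === field.type);
}
```

A field such as `{ type: 'string', editor: 'checkbox' }` would be flagged, since a checkbox implies a boolean.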

Common Input Schema Pattern

{
    "title": "Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from",
            "editor": "requestListSources",
            "prefill": [{"url": "https://example.com"}]
        },
        "maxItems": {
            "title": "Max Items",
            "type": "integer",
            "description": "Maximum number of items to scrape",
            "default": 100,
            "minimum": 1
        },
        "proxyConfig": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Proxy settings for the scraper",
            "editor": "proxy",
            "default": {"useApifyProxy": true}
        }
    },
    "required": ["startUrls"]
}
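
The `default` values in a schema describe what the Actor expects, but the code still has to cope with missing or odd input, for example during local runs. A minimal sketch of normalizing the input from this schema follows. `ScraperInput` and `normalizeInput` are hypothetical names, and rejecting an empty `startUrls` array is an added assumption (the schema only marks the field as required).

```typescript
interface ScraperInput {
    startUrls: { url: string }[];
    maxItems: number;
    proxyConfig: { useApifyProxy: boolean };
}

// Applies the schema's defaults and enforces its constraints on raw input.
function normalizeInput(raw: Partial<ScraperInput> | null): ScraperInput {
    // "required": ["startUrls"]; an empty list is also rejected here (assumption).
    if (!raw || !raw.startUrls || raw.startUrls.length === 0) {
        throw new Error('Input must contain at least one start URL');
    }
    // "default": 100, "minimum": 1, "type": "integer"
    const maxItems = raw.maxItems ?? 100;
    if (!isFinite(maxItems) || Math.floor(maxItems) !== maxItems || maxItems < 1) {
        throw new Error('maxItems must be an integer >= 1');
    }
    return {
        startUrls: raw.startUrls,
        maxItems,
        // "default": {"useApifyProxy": true}
        proxyConfig: raw.proxyConfig ?? { useApifyProxy: true },
    };
}
```

In an Actor this could wrap the value returned by `Actor.getInput()` so the rest of the code works with a fully populated object.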

Crawlee Patterns

CheerioCrawler (Fast HTML Parsing)

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
}>();

const crawler = new CheerioCrawler({
    // Note: this caps the number of requests (pages), not the number of stored items
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('h1').text().trim();
        const price = $('.price').text().trim();

        await Dataset.pushData({
            url: request.url,
            title,
            price,
        });

        // Enqueue pagination links
        await enqueueLinks({
            selector: 'a.next-page',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
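
In the example above, `maxItems` is passed to `maxRequestsPerCrawl`, which limits requests rather than stored items. The two counts diverge as soon as a page yields several items or none. One way to enforce an item limit is a small counter shared by all handler calls. The `ItemBudget` class below is a hypothetical helper, not part of Crawlee or the Apify SDK.

```typescript
// Tracks how many more items may be stored during the crawl.
class ItemBudget {
    private remaining: number;

    constructor(maxItems: number) {
        this.remaining = Math.max(0, maxItems);
    }

    // Returns the leading items that still fit and deducts them from the budget.
    take<T>(items: T[]): T[] {
        const accepted = items.slice(0, this.remaining);
        this.remaining -= accepted.length;
        return accepted;
    }

    // True once no further items may be stored.
    isExhausted(): boolean {
        return this.remaining === 0;
    }
}
```

Inside a `requestHandler`, the scraped items would go through `budget.take(...)` before `Dataset.pushData`, and the handler could skip `enqueueLinks` once `budget.isExhausted()` returns true. Because `take` is synchronous, concurrent handler calls cannot overshoot the limit.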

PlaywrightCrawler (JavaScript Rendering)

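// PlaywrightCrawler needs the `playwright` package in dependencies and a base image
// that ships browsers (for example apify/actor-node-playwright-chrome).
import { Actor } from 'apify';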
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

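const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
    // Declared so that input?.proxyConfig below type-checks
    proxyConfig?: { useApifyProxy?: boolean };
}>();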

const proxyConfiguration = await Actor.createProxyConfiguration(
    input?.proxyConfig
);

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ page, request, enqueueLinks }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                title: item.querySelector('h2')?.textContent?.trim(),
                price: item.querySelector('.price')?.textContent?.trim(),
            }))
        );

        for (const product of products) {
            await Dataset.pushData({
                url: request.url,
                ...product,
            });
        }

        await enqueueLinks({
            selector: 'a.pagination',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();

PuppeteerCrawler (Chrome-specific)

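// PuppeteerCrawler needs the `puppeteer` package in dependencies and a base image
// that ships Chrome (for example apify/actor-node-puppeteer-chrome).
import { Actor } from 'apify';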
import { PuppeteerCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
}>();

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request }) {
        await page.waitForSelector('.content');

        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            content: document.querySelector('.content')?.innerHTML,
        }));

        await Dataset.pushData({
            url: request.url,
            ...data,
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();

Scripts

Initialize New Actor

python scripts/init_actor.py <name> --type <cheerio|playwright|puppeteer> [--path <dir>]

Validate Actor Configuration

python scripts/validate_actor.py <actor-path>

Generate Input Schema

python scripts/generate_input_schema.py "<description>" [--output <path>]

Deployment Commands

# Install Apify CLI
npm install -g apify-cli

# Login to Apify
apify login

# Create new Actor from template (interactive)
apify create my-actor

# Run Actor locally
apify run --purge

# Push to Apify platform
apify push

# Build Actor remotely
apify actors build

# Call Actor remotely
apify actors call <actor-id>

# Pull Actor code from Apify
apify actors pull <actor-id>

Validation Checklist

Before Building

  • Correct crawler type selected for target site
  • Input schema defines all required parameters
  • Dependencies in package.json are correct

Configuration

  • actor.json has actorSpecification: 1
  • actor.json has valid name and version
  • Dockerfile uses correct Node.js base image
  • Input schema editors match field types
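
The first two configuration checks are easy to automate. The sketch below is a hypothetical `validateActorConfig` helper, not the logic of `scripts/validate_actor.py`. It assumes that `version` follows the MAJOR.MINOR form shown in the examples ("0.0") and only requires `name` to be non-empty.

```typescript
interface ActorConfig {
    actorSpecification?: number;
    name?: string;
    version?: string;
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function validateActorConfig(config: ActorConfig): string[] {
    const errors: string[] = [];
    if (config.actorSpecification !== 1) {
        errors.push('actorSpecification must be 1');
    }
    if (!config.name || config.name.trim() === '') {
        errors.push('name must be a non-empty string');
    }
    // Assumed format: MAJOR.MINOR, as in the "0.0" used throughout this document.
    if (!config.version || !/^\d+\.\d+$/.test(config.version)) {
        errors.push('version must look like MAJOR.MINOR, e.g. "0.0"');
    }
    return errors;
}
```

The function could be fed the parsed contents of `.actor/actor.json` and its result printed before running `apify push`.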

Code Quality

  • Error handling for network failures
  • Proxy configuration used for production
  • Rate limiting/delays configured
  • Data validation before pushData
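
As an illustration of "Data validation before pushData", the sketch below filters out incomplete records. The `ProductItem` shape mirrors the `url`, `title`, and `price` fields used in the crawler examples, and `toValidItem` is a hypothetical helper. Which fields count as mandatory is an assumption that depends on the target site.

```typescript
interface ProductItem {
    url: string;
    title: string;
    price: string;
}

interface RawItem {
    url?: string;
    title?: string | null;
    price?: string | null;
}

// Returns a cleaned item, or null when a mandatory field is missing or blank.
function toValidItem(raw: RawItem): ProductItem | null {
    const title = (raw.title ?? '').trim();
    const price = (raw.price ?? '').trim();
    if (!raw.url || title === '' || price === '') return null;
    return { url: raw.url, title, price };
}
```

In a `requestHandler`, only non-null results would be passed to `Dataset.pushData`, and rejected records could be logged for later inspection.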

Pre-Deployment

  • apify run --purge succeeds locally
  • Output data structure is correct
  • Memory limits are appropriate

References

| Topic | File |
|---|---|
| actor.json Specification | references/actor-json-spec.md |
| Input Schema Editors | references/input-schema-guide.md |
| Crawlee Patterns | references/crawlee-patterns.md |

Templates

| Template | Description | Path |
|---|---|---|
| Cheerio | Fast HTML scraping | templates/crawlee-cheerio/ |
| Playwright | JS-rendered content | templates/crawlee-playwright/ |
| Puppeteer | Chrome-specific | templates/crawlee-puppeteer/ |