Apify Scraper Builder

Build production-ready Apify Actors using Node.js/TypeScript and Crawlee.

Crawler Type Decision Tree

Scenario	Crawler	Why
Static HTML, no JavaScript	CheerioCrawler	Fastest, lowest memory
JavaScript-rendered content	PlaywrightCrawler	Modern, cross-browser
Legacy sites, specific Chrome behavior	PuppeteerCrawler	Chrome-specific features
Mix of static and JavaScript-rendered pages	PlaywrightCrawler	More versatile
High-volume scraping (1000s of pages)	CheerioCrawler	Best performance
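
The table above can be read as a simple rule of thumb. The following is a minimal sketch of that rule as a function; `SiteProfile` and `chooseCrawler` are hypothetical names used only for illustration and are not part of Crawlee or the Apify SDK.

```typescript
// Hypothetical helper that encodes the decision table as code.
type CrawlerType = 'CheerioCrawler' | 'PlaywrightCrawler' | 'PuppeteerCrawler';

interface SiteProfile {
    /** True when the data only appears after client-side JavaScript runs. */
    needsJavaScript: boolean;
    /** True when the target depends on Chrome-specific behavior. */
    needsChromeSpecificFeatures?: boolean;
}

function chooseCrawler(site: SiteProfile): CrawlerType {
    // Chrome-specific requirements take priority over everything else.
    if (site.needsChromeSpecificFeatures) return 'PuppeteerCrawler';
    // Any JavaScript rendering (including mixed sites) needs a browser.
    if (site.needsJavaScript) return 'PlaywrightCrawler';
    // Static HTML: the fastest, lowest-memory option.
    return 'CheerioCrawler';
}
```

For example, `chooseCrawler({ needsJavaScript: false })` returns `'CheerioCrawler'`, which is also the choice the table suggests for high-volume crawls of static pages.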

Actor Creation Workflow

Step 1: Initialize Project

python scripts/init_actor.py my-scraper --type cheerio

Or create the structure manually:

my-scraper/
├── .actor/
│   ├── actor.json           # REQUIRED
│   ├── input_schema.json    # Recommended
│   └── Dockerfile           # REQUIRED
├── src/
│   └── main.ts              # Entry point
├── package.json
└── tsconfig.json

Step 2: Configure actor.json

{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}

Step 3: Define Input Schema

python scripts/generate_input_schema.py "Scrape product pages with URLs, max items limit, and proxy support"

Or use templates from references/input-schema-guide.md

Step 4: Implement Crawler

Use patterns from references/crawlee-patterns.md

Step 5: Validate Configuration

python scripts/validate_actor.py /path/to/actor
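
To illustrate the kind of checks a validation step might perform, here is a rough TypeScript sketch. It is not the implementation of `scripts/validate_actor.py`, which is not shown here; `validateActorConfig` is a hypothetical function, and the `MAJOR.MINOR` version pattern is an assumption based on the `"0.0"` value used in the examples.

```typescript
// Hypothetical validator for the parsed contents of .actor/actor.json.
interface ActorConfig {
    actorSpecification?: unknown;
    name?: unknown;
    version?: unknown;
}

function validateActorConfig(config: ActorConfig): string[] {
    const errors: string[] = [];

    if (config.actorSpecification !== 1) {
        errors.push('actorSpecification must be 1');
    }
    if (typeof config.name !== 'string' || config.name.trim() === '') {
        errors.push('name must be a non-empty string');
    }
    // Assumption: versions look like MAJOR.MINOR, e.g. "0.0".
    if (typeof config.version !== 'string' || !/^\d+\.\d+$/.test(config.version)) {
        errors.push('version must look like MAJOR.MINOR');
    }
    return errors;
}
```

An empty array means the three basic fields look reasonable; each string in the result describes one problem.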

Step 6: Deploy

apify login
apify push

Project Structure

Required Files

.actor/actor.json

{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "minMemoryMbytes": 256,
    "maxMemoryMbytes": 4096,
    "dockerfile": "./Dockerfile",
    "input": "./input_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}

.actor/Dockerfile (Node.js)

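# Build stage: install all dependencies (including dev) and compile TypeScript
FROM apify/actor-node:20 AS builder

COPY package*.json ./
RUN npm install --include=dev --audit=false

COPY . ./
RUN npm run build

# Runtime stage: production dependencies plus the compiled output
FROM apify/actor-node:20

# Compiled JavaScript from the build stage (start script runs dist/main.js)
COPY --from=builder /usr/src/app/dist ./dist

COPY package*.json ./
RUN npm --quiet set progress=false \
    && npm install --omit=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && (npm list --omit=dev --all || true) \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

COPY . ./
CMD npm start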

package.json

{
    "name": "my-scraper",
    "version": "0.0.1",
    "type": "module",
    "main": "dist/main.js",
    "scripts": {
        "start": "node dist/main.js",
        "build": "tsc"
    },
    "dependencies": {
        "apify": "^3.0.0",
        "crawlee": "^3.0.0"
    },
    "devDependencies": {
        "typescript": "^5.0.0"
    }
}

Input Schema Editors

Editor	Use Case	Example
`textfield`	Single-line text	Name, URL
`textarea`	Multi-line text	CSS selectors, notes
`requestListSources`	URL list with labels	Start URLs
`proxy`	Proxy configuration	Apify Proxy settings
`json`	JSON object/array	Custom configuration
`select`	Dropdown options	Country, category
`checkbox`	Boolean toggle	Debug mode
`number`	Integer/float	Max items, delay
`datepicker`	Date selection	Date range filter
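
Each editor only makes sense for certain property types. The sketch below checks that pairing for a set of schema properties. The editor-to-type mapping is an assumption inferred from the table above and the pattern in the next section, not an authoritative copy of the Apify input schema specification; `findEditorMismatches` is a hypothetical helper.

```typescript
type SchemaType = 'string' | 'array' | 'object' | 'boolean' | 'integer' | 'number';

// Assumed mapping from editor name to the property types it can edit.
const EDITOR_TYPES = new Map<string, SchemaType[]>([
    ['textfield', ['string']],
    ['textarea', ['string']],
    ['requestListSources', ['array']],
    ['proxy', ['object']],
    ['json', ['object', 'array']],
    ['select', ['string']],
    ['checkbox', ['boolean']],
    ['number', ['integer', 'number']],
    ['datepicker', ['string']],
]);

interface SchemaProperty {
    type: SchemaType;
    editor?: string;
}

// Returns the names of properties whose editor does not fit their type.
function findEditorMismatches(properties: Record<string, SchemaProperty>): string[] {
    const mismatches: string[] = [];
    for (const name of Object.keys(properties)) {
        const prop = properties[name];
        if (prop === undefined || prop.editor === undefined) continue;
        const allowed = EDITOR_TYPES.get(prop.editor);
        // Unknown editors are reported as mismatches too.
        if (allowed === undefined || allowed.indexOf(prop.type) === -1) {
            mismatches.push(name);
        }
    }
    return mismatches;
}
```

Properties without an explicit `editor` are skipped, since this sketch only checks editors that are set explicitly.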

Common Input Schema Pattern

{
    "title": "Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from",
            "editor": "requestListSources",
            "prefill": [{"url": "https://example.com"}]
        },
        "maxItems": {
            "title": "Max Items",
            "type": "integer",
            "description": "Maximum number of items to scrape",
            "default": 100,
            "minimum": 1
        },
        "proxyConfig": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Proxy settings for the scraper",
            "editor": "proxy",
            "default": {"useApifyProxy": true}
        }
    },
    "required": ["startUrls"]
}
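
On the code side, the input described by this schema can be given a matching type and normalized once at startup. The following is a minimal sketch, assuming the three fields above; `ScraperInput` and `normalizeInput` are hypothetical names, and the fallback values simply mirror the schema defaults.

```typescript
interface ScraperInput {
    startUrls: { url: string }[];
    maxItems: number;
    proxyConfig: { useApifyProxy?: boolean };
}

// Applies the schema defaults and rejects input the schema would not allow.
function normalizeInput(raw: Partial<ScraperInput> | null): ScraperInput {
    const startUrls = raw?.startUrls ?? [];
    if (startUrls.length === 0) {
        throw new Error('Input must contain at least one start URL');
    }

    const maxItems = raw?.maxItems ?? 100; // schema default
    if (!Number.isInteger(maxItems) || maxItems < 1) {
        throw new Error('maxItems must be an integer >= 1');
    }

    return {
        startUrls,
        maxItems,
        proxyConfig: raw?.proxyConfig ?? { useApifyProxy: true }, // schema default
    };
}
```

In an Actor, the value returned by `Actor.getInput()` could be passed through a function like this before the crawler is constructed, so the rest of the code does not need optional chaining on every field.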

Crawlee Patterns

CheerioCrawler (Fast HTML Parsing)

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
}>();

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('h1').text().trim();
        const price = $('.price').text().trim();

        await Dataset.pushData({
            url: request.url,
            title,
            price,
        });

        // Enqueue pagination links
        await enqueueLinks({
            selector: 'a.next-page',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();

PlaywrightCrawler (JavaScript Rendering)

import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

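const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
    // Matches the "proxyConfig" field from the input schema
    proxyConfig?: { useApifyProxy?: boolean; apifyProxyGroups?: string[] };
}>();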

const proxyConfiguration = await Actor.createProxyConfiguration(
    input?.proxyConfig
);

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ page, request, enqueueLinks }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                title: item.querySelector('h2')?.textContent?.trim(),
                price: item.querySelector('.price')?.textContent?.trim(),
            }))
        );

        for (const product of products) {
            await Dataset.pushData({
                url: request.url,
                ...product,
            });
        }

        await enqueueLinks({
            selector: 'a.pagination',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();

PuppeteerCrawler (Chrome-specific)

import { Actor } from 'apify';
import { PuppeteerCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
}>();

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request }) {
        await page.waitForSelector('.content');

        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            content: document.querySelector('.content')?.innerHTML,
        }));

        await Dataset.pushData({
            url: request.url,
            ...data,
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();

Scripts

Initialize New Actor

python scripts/init_actor.py <name> --type <cheerio|playwright|puppeteer> [--path <dir>]

Validate Actor Configuration

python scripts/validate_actor.py <actor-path>

Generate Input Schema

python scripts/generate_input_schema.py "<description>" [--output <path>]

Deployment Commands

# Install Apify CLI
npm install -g apify-cli

# Login to Apify
apify login

# Create new Actor from template (interactive)
apify create my-actor

# Run Actor locally
apify run --purge

# Push to Apify platform
apify push

# Build Actor remotely
apify actors build

# Call Actor remotely
apify actors call <actor-id>

# Pull Actor code from Apify
apify actors pull <actor-id>

Validation Checklist

Before Building

Correct crawler type selected for target site
Input schema defines all required parameters
Dependencies in package.json are correct

Configuration

actor.json has actorSpecification: 1
actor.json has valid name and version
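Dockerfile uses a base image that matches the crawler (e.g. apify/actor-node for Cheerio, a browser image such as apify/actor-node-playwright-chrome for Playwright)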
Input schema editors match field types

Code Quality

Error handling for network failures
Proxy configuration used for production
Rate limiting/delays configured
Data validation before pushData
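
As an illustration of the last item, the sketch below validates and cleans a scraped record before it is stored. It assumes the `url`, `title`, and `price` fields used in the crawler examples above; `ProductItem` and `toValidItem` are hypothetical names, and the URL check is a deliberately simple prefix test rather than a full parser.

```typescript
interface ProductItem {
    url: string;
    title: string;
    price: string;
}

interface RawItem {
    url?: string;
    title?: string | null;
    price?: string | null;
}

// Returns a cleaned item, or null when a required field is missing or empty.
function toValidItem(candidate: RawItem): ProductItem | null {
    const url = candidate.url ?? '';
    const title = (candidate.title ?? '').trim();
    const price = (candidate.price ?? '').trim();

    // Simple sanity check: only absolute http(s) URLs are accepted.
    if (!/^https?:\/\//.test(url)) return null;
    if (title === '' || price === '') return null;

    return { url, title, price };
}
```

Inside a `requestHandler`, a record would only be passed to `Dataset.pushData` when this function returns a non-null value; skipped records can be logged so that selector changes on the target site are noticed early.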

Pre-Deployment

apify run --purge succeeds locally
Output data structure is correct
Memory limits are appropriate

References

Topic	File
actor.json Specification	`references/actor-json-spec.md`
Input Schema Editors	`references/input-schema-guide.md`
Crawlee Patterns	`references/crawlee-patterns.md`

Templates

Template	Description	Path
Cheerio	Fast HTML scraping	`templates/crawlee-cheerio/`
Playwright	JS-rendered content	`templates/crawlee-playwright/`
Puppeteer	Chrome-specific	`templates/crawlee-puppeteer/`
