anysite-cli
Anysite CLI
Command-line tool for web data extraction, dataset pipelines, and database operations.
Agent Planning Workflow
BEFORE planning any data collection task, follow this sequence:
1. Discover available endpoints
   anysite describe --search "<keyword>" # Search by domain (linkedin, company, user, etc.)
2. Select the endpoints needed for the task: identify which endpoints will provide the required data
3. Inspect each selected endpoint
   anysite describe /api/linkedin/company # View input params and output fields
4. Only then plan: now you know the exact parameters, field names, and data structure to build your config or API calls
This prevents errors from wrong endpoint paths, missing required parameters, or incorrect field names in dependencies.
Best Practices
- Use dataset pipelines for multi-step tasks
  - If a task requires sequential API calls, LLM enrichment, or chained data processing, create a dataset.yaml config instead of running multiple ad-hoc commands
  - Dataset pipelines handle dependencies, incremental collection, and error recovery automatically
  - Even for "simple" tasks that grow in scope, a dataset config is easier to maintain
  - Benefits: run history, incremental sync, scheduling, notifications, DB loading
- Save data in Parquet format by default, unless the user requests another format or CSV/JSON fits better
- Prefer datasets over ad-hoc scripts: one dataset.yaml replaces dozens of shell commands
Quick Start Checklist
Before any data collection task:
# 1. Check CLI is available
anysite --version
# If not found: source .venv/bin/activate or pip install anysite-cli
# 2. Update schema cache (required for endpoint discovery)
anysite schema update
# 3. Verify API key
anysite config get api_key
# If not set: anysite config set api_key sk-xxxxx
Endpoint Discovery
ALWAYS discover endpoints before writing API calls or dataset configs:
anysite describe # List all endpoints
anysite describe --search "company" # Search by keyword
anysite describe /api/linkedin/company # Full details: input params + output fields
Prerequisites
pip install "anysite-cli[data]" # DuckDB + PyArrow for dataset commands
pip install "anysite-cli[llm]" # LLM analysis (openai/anthropic)
pip install "anysite-cli[postgres]" # PostgreSQL adapter
pip install "anysite-cli[clickhouse]" # ClickHouse adapter
anysite config set api_key sk-xxxxx # Configure API key
anysite schema update # Update schema cache
anysite llm setup # Configure LLM provider (paste key directly or use env var)
anysite db add pg --type postgres --host localhost --database mydb --user app --password secret
# Or via env var: anysite db add pg ... --password-env PGPASS
anysite db add ch --type clickhouse --host ch.example.com --port 8443 --database analytics --user app --password secret --ssl
Single API Call
anysite api /api/linkedin/user user=satyanadella
anysite api /api/linkedin/company company=anthropic --format table
anysite api /api/linkedin/search/users keywords="CTO" count=50 --format csv --output ctos.csv
anysite api /api/linkedin/user user=satyanadella --fields "name,headline,urn.value" -q | jq
URN/Name Parameter Formats
Parameters like location, current_companies, industry accept two formats:
# Single name (text search) — resolves to URNs automatically
location="London"
current_companies="Microsoft"
# Multiple URNs (direct) — use JSON array in single quotes
'location=["urn:li:geo:101165590", "urn:li:geo:101282230"]'
'current_companies=["urn:li:company:1035", "urn:li:company:1441"]'
Note: List of names ["Microsoft", "Google"] is NOT supported — use either one name OR multiple URNs.
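When the CLI is driven from a script, the two accepted formats can be enforced before the call is made. The sketch below is illustrative only: `format_filter_param` is a hypothetical helper (not part of anysite-cli), the `urn:li:` prefix check is an assumption drawn from the examples above, and pairing these filters with `/api/linkedin/search/users` is assumed rather than confirmed (check with `anysite describe`).

```python
import json

def format_filter_param(name, value):
    """Build a `name=value` argument for `anysite api` (hypothetical helper).

    A single string is passed through as a text search. A list must contain
    only URNs and is encoded as a JSON array; lists of names are rejected.
    """
    if isinstance(value, str):
        return f"{name}={value}"
    values = list(value)
    # Assumption: URNs start with "urn:li:", as in the examples above.
    if not values or not all(isinstance(v, str) and v.startswith("urn:li:") for v in values):
        raise ValueError(f"{name}: a list must contain only URNs, not names")
    return f"{name}={json.dumps(values)}"

# Argument list suitable for subprocess.run(argv); no shell quoting is needed
# because each element is passed to the program as-is.
argv = [
    "anysite", "api", "/api/linkedin/search/users",
    format_filter_param("location", "London"),
    format_filter_param("current_companies",
                        ["urn:li:company:1035", "urn:li:company:1441"]),
]
print(argv)
```

The single quotes shown in the shell examples are only needed to protect the JSON array from the shell; an argument list avoids that concern.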
Batch Processing
anysite api /api/linkedin/user --from-file users.txt --input-key user \
--parallel 5 --rate-limit "10/s" --on-error skip --progress --stats
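A batch run needs an input file with one value per line. As a rough sketch, a script could prepare that file and assemble the same command shown above; `write_input_file` and `batch_command` are hypothetical helpers, and `example-user` is a placeholder value.

```python
import tempfile
from pathlib import Path

def write_input_file(path, values):
    """Write one input value per line, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for value in values:
        value = value.strip()
        if value and value not in seen:
            seen.add(value)
            lines.append(value)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines

def batch_command(endpoint, input_file, input_key, parallel=5, rate_limit="10/s"):
    """Assemble the batch invocation from this section as an argument list."""
    return [
        "anysite", "api", endpoint,
        "--from-file", str(input_file), "--input-key", input_key,
        "--parallel", str(parallel), "--rate-limit", rate_limit,
        "--on-error", "skip", "--progress", "--stats",
    ]

workdir = Path(tempfile.mkdtemp())
users_file = workdir / "users.txt"
written = write_input_file(users_file, ["satyanadella", " ", "satyanadella", "example-user"])
cmd = batch_command("/api/linkedin/user", users_file, "user")
print(" ".join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```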
Dataset Pipeline (Multi-Source Collection)
For complex data collection with dependencies, LLM enrichment, or scheduling, use dataset pipelines.
Initialize
anysite dataset init my-dataset
# Creates my-dataset/dataset.yaml with template config
Five Source Types
- Independent: single API call with static params
- from_file: batch calls iterating over input file values
- Dependent: batch calls using values extracted from a parent source
- Union (type: union): combine records from multiple parent sources into one
- LLM (type: llm): process parent data through LLM without API calls
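One way to read the five types is as a decision on which keys a source entry carries. The function below is a conceptual sketch of that reading, not the CLI's own validation logic; `source_kind` is a hypothetical name and the entries are plain dicts as they would appear after loading dataset.yaml.

```python
def source_kind(source):
    """Classify a source entry from dataset.yaml into one of the five types."""
    kind = source.get("type", "api")  # `api` is the default source type
    if kind == "union":
        return "union"
    if kind == "llm":
        return "llm"
    if "dependency" in source:
        return "dependent"
    if "from_file" in source:
        return "from_file"
    return "independent"

sources = [
    {"id": "search_results", "endpoint": "/api/linkedin/search/users",
     "params": {"keywords": "software engineer", "count": 50}},
    {"id": "companies", "endpoint": "/api/linkedin/company",
     "from_file": "companies.txt", "input_key": "company"},
    {"id": "employees", "endpoint": "/api/linkedin/company/employees",
     "dependency": {"from_source": "companies", "field": "urn.value"}},
    {"id": "all_search_results", "type": "union", "sources": ["search_results"]},
    {"id": "employees_analyzed", "type": "llm",
     "dependency": {"from_source": "employees", "field": "name"}},
]
kinds = {s["id"]: source_kind(s) for s in sources}
print(kinds)
```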
Comprehensive Dataset YAML Reference
name: my-dataset # Dataset name (required)
description: Optional description # Human-readable description
sources:
# === TYPE 1: Independent source (single API call) ===
- id: search_results # Unique identifier (required)
endpoint: /api/linkedin/search/users # API endpoint (required for type: api)
params: # Static API parameters
keywords: "software engineer"
count: 50
parallel: 1 # Concurrent requests: 1-10 (default: 1)
rate_limit: "10/s" # Rate limit: "N/s", "N/m", "N/h"
on_error: stop # Error handling: stop | skip (default: stop)
- id: search_extra # Another search (can be combined with union)
endpoint: /api/linkedin/search/users
params: { keywords: "data engineer", count: 50 }
# === TYPE 2: from_file source (batch from file) ===
- id: companies
endpoint: /api/linkedin/company
from_file: companies.txt # Input file: .txt (line per value), .csv, .jsonl
file_field: company_slug # CSV column name (for CSV files only)
input_key: company # API parameter to fill with each value
parallel: 3
# === TYPE 3: Dependent source (values from parent) ===
- id: employees
endpoint: /api/linkedin/company/employees
dependency:
from_source: companies # Parent source ID (required)
field: urn.value # Dot-notation path to extract from parent records
match_by: name # Alternative: fuzzy match instead of exact field
dedupe: true # Remove duplicate values (default: false)
input_key: companies # API parameter for extracted values
input_template: # Transform values before API call
companies:
- type: company
value: "{value}" # {value} = extracted value placeholder
count: 5
refresh: auto # Incremental behavior: auto (default) | always
# === TYPE 4: Union source (combine multiple sources) ===
- id: all_search_results
type: union # Source type: api (default) | union | llm
sources: [search_results, search_extra] # Parent source IDs to combine (required)
dedupe_by: urn.value # Optional: field path for deduplication (dot-notation)
# NOTE: type: union cannot have endpoint, dependency, from_file, input_key, params
# NOTE: all sources in the list must have the same endpoint (same data structure)
# Records are annotated with _union_source = parent source ID
# === TYPE 5: LLM source (process parent data without API) ===
- id: employees_analyzed
type: llm # Source type: api (default) | union | llm
dependency:
from_source: employees
field: name # Required by schema (not used for LLM sources)
llm: # LLM enrichment steps (required for type: llm)
- type: classify # Step types: classify | enrich | summarize | generate
categories: "developer,recruiter,executive" # Comma-separated (omit for auto-detect)
output_column: role_type # Output column name (default: category)
fields: [headline] # Record fields to include in LLM prompt
- type: enrich
add: # Field specs (required for enrich)
- "sentiment:positive/negative/neutral" # Enum: value1/value2/value3
- "language:string" # Types: string | number | integer | boolean
- "quality_score:1-10" # Range: min-max
fields: [headline, summary]
temperature: 0.0 # LLM temperature: 0.0-1.0 (default: 0.0)
provider: openai # Provider override: openai | anthropic
model: gpt-4o-mini # Model override
- type: summarize
max_length: 50 # Max words (default: 100)
output_column: bio
- type: generate
prompt: "Write pitch for {name}" # Template with {field} placeholders (required)
output_column: pitch
temperature: 0.7 # Higher for creative text
export: # Export destinations (runs after Parquet write)
- type: file
path: ./output/{{source}}-{{date}}.csv # Templates: {{date}}, {{source}}, {{dataset}}
format: csv # Format: json | jsonl | csv
# NOTE: type: llm cannot have endpoint, from_file, input_key, params
# === OPTIONAL BLOCKS (any source type) ===
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: employees, field: internal_id.value }
input_key: user
transform: # Transform for exports only (Parquet keeps all)
filter: '.follower_count > 100 and .location != ""' # Safe expression
fields: # Field selection with dot-notation aliases
- name
- urn.value AS urn_id
- headline
add_columns: # Static columns to inject
batch: "q1-2026"
export:
- type: file
path: ./output/profiles.csv
format: csv
- type: webhook
url: https://example.com/hook
headers: { X-Token: abc }
db_load: # Database loading config
table: people # Custom table name (default: source ID)
key: urn.value # Unique key for diff-based incremental sync
sync: full # Sync mode: full (default) | append (no DELETE)
fields: # Fields to load (default: all except _input_value)
- name
- urn.value AS urn_id
- headline
exclude: [_input_value, raw_html] # Fields to skip
storage:
format: parquet # Storage format (only parquet supported)
path: ./data/ # Storage directory (relative to dataset.yaml)
schedule:
cron: "0 9 * * *" # Cron expression for scheduling
notifications:
on_complete:
- url: "https://hooks.slack.com/xxx"
headers: { Authorization: "Bearer token" }
on_failure:
- url: "https://alerts.example.com/fail"
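The dependency block in the reference above extracts a value from each parent record by dot-notation path and substitutes it for `{value}` in `input_template`. The following is a minimal sketch of that idea under stated assumptions: the helper names are hypothetical, records missing the field are skipped, and dedupe keeps the first occurrence. The CLI's actual behavior in those edge cases is not documented here.

```python
def get_path(record, path):
    """Resolve a dot-notation path such as 'urn.value'; None if any key is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def fill_template(template, value):
    """Recursively replace the '{value}' placeholder inside a nested template."""
    if isinstance(template, str):
        return template.replace("{value}", str(value))
    if isinstance(template, list):
        return [fill_template(item, value) for item in template]
    if isinstance(template, dict):
        return {k: fill_template(v, value) for k, v in template.items()}
    return template

def dependent_inputs(parent_records, field, template, dedupe=False):
    """Build one templated input per value extracted from the parent records."""
    values = [get_path(r, field) for r in parent_records]
    values = [v for v in values if v is not None]  # assumption: skip missing
    if dedupe:
        values = list(dict.fromkeys(values))  # assumption: keep first-seen order
    return [fill_template(template, v) for v in values]

parents = [
    {"name": "A", "urn": {"value": "urn:li:company:1035"}},
    {"name": "B", "urn": {"value": "urn:li:company:1035"}},
    {"name": "C"},  # no urn.value
]
template = {"companies": [{"type": "company", "value": "{value}"}]}
inputs = dependent_inputs(parents, "urn.value", template, dedupe=True)
print(inputs)
```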
type: union Source Details
The type: union source combines records from multiple parent sources:
- Requires: non-empty sources list (parent source IDs)
- Optional: dedupe_by field path for removing duplicates (supports dot-notation)
- Cannot have: endpoint, dependency, from_file, input_key, params
- Validation: all parent sources must have the same endpoint (same data structure)
- Use case: merge multiple search results before a single dependent source processes them
sources:
- id: search_cto
endpoint: /api/linkedin/search/users
params: { keywords: "CTO fintech", count: 50 }
- id: search_vp
endpoint: /api/linkedin/search/users
params: { keywords: "VP Engineering", count: 50 }
# Union combines all search results
- id: all_candidates
type: union
sources: [search_cto, search_vp]
dedupe_by: urn.value # Remove duplicates by URN
# Single dependent source processes all candidates
- id: profiles
endpoint: /api/linkedin/user
dependency:
from_source: all_candidates
field: urn.value
input_key: user
Records from union sources are annotated with _union_source (the parent source ID they came from).
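Conceptually, a union concatenates the parent records, tags each with its origin, and optionally drops repeats by the `dedupe_by` path. The sketch below illustrates that behavior only; it assumes the first occurrence wins and that records lacking the dedupe field are kept, neither of which is confirmed for the CLI.

```python
def get_path(record, path):
    """Resolve a dot-notation path such as 'urn.value'; None if any key is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def union_records(parents, dedupe_by=None):
    """Combine records from parent sources.

    `parents` maps parent source ID -> list of records, in `sources` order.
    Each output record is annotated with `_union_source`.
    """
    combined = []
    seen = set()
    for source_id, records in parents.items():
        for record in records:
            if dedupe_by is not None:
                key = get_path(record, dedupe_by)
                if key is not None and key in seen:
                    continue  # assumption: first occurrence wins
                seen.add(key)
            combined.append({**record, "_union_source": source_id})
    return combined

parents = {
    "search_cto": [{"urn": {"value": "u1"}}, {"urn": {"value": "u2"}}],
    "search_vp": [{"urn": {"value": "u2"}}, {"urn": {"value": "u3"}}],
}
merged = union_records(parents, dedupe_by="urn.value")
print([(r["urn"]["value"], r["_union_source"]) for r in merged])
```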
type: llm Source Details
The type: llm source processes existing parent data without making API calls:
- Requires: dependency (parent source) and non-empty llm list
- Cannot have: endpoint, from_file, input_key, params
- Use case: run LLM enrichment on already-collected data, or re-analyze with different prompts
# Collect only the LLM source (reads parent Parquet, applies LLM steps)
anysite dataset collect dataset.yaml --source employees_analyzed
Collect Commands
anysite dataset collect dataset.yaml --dry-run # Preview plan
anysite dataset collect dataset.yaml # Run collection
anysite dataset collect dataset.yaml --load-db pg # Collect + auto-load to DB
anysite dataset collect dataset.yaml --incremental # Skip already-collected inputs
anysite dataset collect dataset.yaml --source employees # Single source + dependencies
anysite dataset collect dataset.yaml --no-llm # Skip LLM enrichment steps
Incremental Collection (Resume Where You Left Off)
For from_file and dependency sources, anysite tracks collected input values in metadata.json. This enables resuming collection without re-fetching existing data.
Workflow:
# First run — collects all inputs
anysite dataset collect dataset.yaml
# Later: add new items to input file, run with --incremental
anysite dataset collect dataset.yaml --incremental
# → Only new items are collected, existing ones skipped
# Force full re-collection
anysite dataset reset-cursor dataset.yaml
anysite dataset collect dataset.yaml
Per-source refresh option:
sources:
- id: profiles
refresh: auto # (default) respects --incremental
- id: posts
refresh: always # always re-collects, ignores --incremental
# use for time-sensitive data (feeds, activity)
Reset cursor:
anysite dataset reset-cursor dataset.yaml # all sources
anysite dataset reset-cursor dataset.yaml --source posts # specific source
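The interaction between `--incremental`, `refresh`, and `reset-cursor` can be modeled in a few lines. This is a conceptual sketch only: the metadata.json layout used here (source ID mapped to a list of collected inputs) is an assumption made for illustration, and the real file format may differ.

```python
import json
import tempfile
from pathlib import Path

def load_state(path):
    """Read the tracking file, or return an empty state if it does not exist."""
    path = Path(path)
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

def plan_inputs(inputs, state, source_id, incremental=False, refresh="auto"):
    """Return the inputs that would be collected on this run."""
    if refresh == "always" or not incremental:
        return list(inputs)  # full collection
    collected = set(state.get(source_id, []))
    return [v for v in inputs if v not in collected]

def record_collected(path, source_id, values):
    """Merge newly collected inputs into the tracking file."""
    state = load_state(path)
    state[source_id] = sorted(set(state.get(source_id, [])) | set(values))
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")

def reset_cursor(path, source_id=None):
    """Forget collected inputs for one source, or for all sources."""
    state = load_state(path)
    if source_id is None:
        state = {}
    else:
        state.pop(source_id, None)
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")

meta = Path(tempfile.mkdtemp()) / "metadata.json"  # hypothetical location
record_collected(meta, "profiles", ["alice", "bob"])  # first run
todo = plan_inputs(["alice", "bob", "carol"], load_state(meta), "profiles", incremental=True)
print(todo)
```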
Query with DuckDB
anysite dataset query dataset.yaml --sql "SELECT * FROM companies LIMIT 10"
anysite dataset query dataset.yaml --source profiles --fields "name, urn.value AS id"
anysite dataset query dataset.yaml --interactive # SQL shell
anysite dataset stats dataset.yaml --source companies
anysite dataset profile dataset.yaml
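The `--fields` syntax combines dot-notation paths with optional `AS` aliases. The sketch below shows one plausible reading of that syntax on a single record; the helper names are hypothetical, and using the full path as the column name when no alias is given is an assumption.

```python
import re

# "<path>" or "<path> AS <alias>", with optional surrounding whitespace.
_FIELD_SPEC = re.compile(r"^\s*(?P<path>\S+)(?:\s+AS\s+(?P<alias>\S+))?\s*$", re.IGNORECASE)

def get_path(record, path):
    """Resolve a dot-notation path such as 'urn.value'; None if any key is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def parse_select_spec(spec):
    """Split one field spec into (path, output column name)."""
    match = _FIELD_SPEC.match(spec)
    if not match:
        raise ValueError(f"invalid field spec: {spec!r}")
    path = match.group("path")
    return path, match.group("alias") or path  # assumption: default name is the path

def select_fields(record, fields):
    """Project a record onto a comma-separated field list."""
    selected = {}
    for spec in fields.split(","):
        path, column = parse_select_spec(spec)
        selected[column] = get_path(record, path)
    return selected

record = {"name": "Ada", "headline": "Engineer", "urn": {"value": "urn:li:member:1"}}
row = select_fields(record, "name, urn.value AS id")
print(row)
```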
Load into Database
anysite dataset load-db dataset.yaml -c pg --drop-existing # Full load
anysite dataset load-db dataset.yaml -c pg # Incremental sync (when db_load.key set)
anysite dataset load-db dataset.yaml -c pg --snapshot 2026-01-15
Compare Snapshots
anysite dataset diff dataset.yaml --source profiles --key urn.value
anysite dataset diff dataset.yaml --source profiles --key urn.value --from 2026-01-30 --to 2026-02-01
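A keyed comparison of two snapshots can be pictured as follows. This is an illustrative sketch, not the output format of `anysite dataset diff`: the added/removed/changed grouping and the `diff_by_key` name are assumptions.

```python
def get_path(record, path):
    """Resolve a dot-notation path such as 'urn.value'; None if any key is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def diff_by_key(old_records, new_records, key):
    """Compare two snapshots, matching records by the value at `key`."""
    old = {get_path(r, key): r for r in old_records}
    new = {get_path(r, key): r for r in new_records}
    return {
        "added": [new[k] for k in new if k not in old],
        "removed": [old[k] for k in old if k not in new],
        "changed": [(old[k], new[k]) for k in new if k in old and old[k] != new[k]],
    }

snapshot_a = [
    {"urn": {"value": "u1"}, "headline": "CTO"},
    {"urn": {"value": "u2"}, "headline": "VP Engineering"},
]
snapshot_b = [
    {"urn": {"value": "u2"}, "headline": "SVP Engineering"},
    {"urn": {"value": "u3"}, "headline": "Founder"},
]
result = diff_by_key(snapshot_a, snapshot_b, "urn.value")
print({name: len(items) for name, items in result.items()})
```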
Scheduling & History
anysite dataset schedule dataset.yaml --incremental --load-db pg # Generate cron entry
anysite dataset history my-dataset
anysite dataset logs my-dataset --run 42
anysite dataset reset-cursor dataset.yaml # Clear incremental state
Database Operations
# Connection management
anysite db add pg --type postgres --host localhost --database mydb --user app --password secret
anysite db add pg --type postgres --host localhost --database mydb --user app --password-env PGPASS
anysite db add ch --type clickhouse --host ch.example.com --port 8443 --database analytics --user app --password secret --ssl
anysite db add local --type sqlite --path ./data.db
anysite db add replica --type postgres --host replica.example.com --read-only
anysite db list
anysite db test pg
# Data operations
cat data.jsonl | anysite db insert pg --table users --stdin --auto-create
anysite db query pg --sql "SELECT * FROM users" --format table
# Pipe API output to database
anysite api /api/linkedin/user user=satyanadella -q --format jsonl \
| anysite db insert pg --table profiles --stdin --auto-create
# Database discovery (schema introspection, sample data, LLM descriptions)
anysite db discover mydb # Discover schema
anysite db discover mydb --with-llm # Add LLM table/column descriptions
anysite db discover mydb --tables users,posts # Filter tables
anysite db discover mydb --exclude-tables _migrations
# View saved catalogs
anysite db catalog # List all catalogs
anysite db catalog mydb # Show full catalog
anysite db catalog mydb --table users # Show specific table
anysite db catalog mydb --json # JSON output for agents
Credentials: --password saves the value directly in ~/.anysite/connections.yaml, while --password-env references an environment variable. If both are given, the direct value takes priority. LLM API keys follow the same pattern via anysite llm setup.
Supported databases: SQLite, PostgreSQL, ClickHouse. ClickHouse uses the clickhouse-connect driver (HTTP protocol, port 8123 by default, 8443 for HTTPS/SSL).
LLM Analysis Commands
Analyze collected dataset records using an LLM. Requires anysite llm setup first.
anysite llm summarize dataset.yaml --source profiles --fields "name,headline" --format table
anysite llm classify dataset.yaml --source posts --categories "positive,negative,neutral"
anysite llm enrich dataset.yaml --source profiles \
--add "seniority:junior/mid/senior" --add "is_technical:boolean"
anysite llm generate dataset.yaml --source profiles \
--prompt "Write intro for {name} who works as {headline}" --temperature 0.7
anysite llm match dataset.yaml --source-a profiles --source-b companies --top-k 3
anysite llm deduplicate dataset.yaml --source profiles --key name --threshold 0.8
anysite llm cache-stats
anysite llm cache-clear
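The `--add` specs passed to `anysite llm enrich` (and the `add:` entries in dataset.yaml) come in three shapes: an enum (`a/b/c`), a scalar type, or a numeric range (`min-max`). A small validator can catch malformed specs before a run. This is a sketch of the documented syntax only; `parse_enrich_spec` is a hypothetical helper and the returned dict layout is invented for illustration.

```python
import re

SCALAR_TYPES = {"string", "number", "integer", "boolean"}

def parse_enrich_spec(spec):
    """Parse 'name:kind' where kind is a scalar type, 'min-max', or 'a/b/c'."""
    name, sep, kind = spec.partition(":")
    if not sep or not name or not kind:
        raise ValueError(f"expected 'name:kind', got {spec!r}")
    if kind in SCALAR_TYPES:
        return {"name": name, "kind": "scalar", "type": kind}
    range_match = re.fullmatch(r"(\d+)-(\d+)", kind)
    if range_match:
        low, high = int(range_match.group(1)), int(range_match.group(2))
        if low > high:
            raise ValueError(f"{name}: range minimum exceeds maximum")
        return {"name": name, "kind": "range", "min": low, "max": high}
    if "/" in kind:
        values = kind.split("/")
        if not all(values):
            raise ValueError(f"{name}: empty enum value")
        return {"name": name, "kind": "enum", "values": values}
    raise ValueError(f"{name}: unrecognized kind {kind!r}")

specs = ["seniority:junior/mid/senior", "is_technical:boolean", "quality_score:1-10"]
parsed = [parse_enrich_spec(s) for s in specs]
print(parsed)
```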
Use anysite describe --search <keyword> for more endpoints.
Key Patterns
- Output formats: --format json|jsonl|csv|table
- Field selection: --fields "name,headline,urn.value" (dot-notation for nested)
- Error handling: --on-error stop|skip|retry
- Config priority: CLI args > ENV vars > ~/.anysite/config.yaml > defaults
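The config priority order behaves like a layered lookup in which the first layer that defines a key wins. The sketch below models that with `collections.ChainMap`; the `ANYSITE_` environment prefix and the default values are assumptions made for illustration, not documented names.

```python
from collections import ChainMap

# Hypothetical defaults, for illustration only.
DEFAULTS = {"format": "json", "on_error": "stop"}

def resolve_config(cli_args, environ, file_config, prefix="ANYSITE_"):
    """Merge settings with priority: CLI args > ENV vars > config file > defaults."""
    # Unset CLI options (None) must not shadow lower-priority layers.
    cli = {k: v for k, v in cli_args.items() if v is not None}
    # Assumption: env vars look like ANYSITE_API_KEY -> api_key.
    env = {k[len(prefix):].lower(): v for k, v in environ.items() if k.startswith(prefix)}
    return dict(ChainMap(cli, env, file_config, DEFAULTS))

settings = resolve_config(
    cli_args={"format": None, "api_key": None},
    environ={"ANYSITE_API_KEY": "sk-env", "ANYSITE_FORMAT": "csv", "HOME": "/home/user"},
    file_config={"api_key": "sk-file", "format": "table"},
)
print(settings)
```

`ChainMap` returns the value from the first mapping that contains the key, which matches the left-to-right priority listed above.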
References
For detailed option tables and advanced configuration:
- api-reference.md — all CLI options
- llm-reference.md — LLM provider config, cache, prompts
- dataset-guide.md — full YAML schema, advanced features