Browser Automation with agent-browser

Two Pathways

Browser CLI commands — use agent-browser: prefix via Bash for all browser control: navigating, clicking, filling forms, taking screenshots, reading snapshots.

agent-browser:
open http://localhost:3001
wait 2000
snapshot -i

Single commands:

agent-browser:
open http://example.com

agent-browser:
close

JS eval in browser — use exec:agent-browser via Bash when you need to run JavaScript in the page context. The body is piped to eval --stdin. Use this for DOM inspection, custom extraction logic, or anything requiring programmatic page access.

exec:agent-browser
document.title

exec:agent-browser
JSON.stringify([...document.querySelectorAll('h1')].map(h => h.textContent))

Always close tabs when done: every open is tracked. Use agent-browser:\nclose (or --session <name> close) when finished. Leaving sessions open accumulates stale tabs — the hook will warn you when other sessions are still open.

Core Workflow

Every browser automation follows this pattern:

Navigate: agent-browser open <url>
Snapshot: agent-browser snapshot -i (get element refs like @e1, @e2)
Interact: Use refs to click, fill, select
Re-snapshot: After navigation or DOM changes, get fresh refs

agent-browser open https://example.com/form
agent-browser snapshot -i
# Output: @e1 [input type="email"], @e2 [input type="password"], @e3 [button] "Submit"

agent-browser fill @e1 "user@example.com"
agent-browser fill @e2 "password123"
agent-browser click @e3
agent-browser wait --load networkidle
agent-browser snapshot -i  # Check result

Essential Commands

# Navigation
agent-browser open <url>              # Navigate (aliases: goto, navigate)
agent-browser close                   # Close browser

# Snapshot
agent-browser snapshot -i             # Interactive elements with refs (recommended)
agent-browser snapshot -i -C          # Include cursor-interactive elements (divs with onclick, cursor:pointer)
agent-browser snapshot -s "#selector" # Scope to CSS selector

# Interaction (use @refs from snapshot)
agent-browser click @e1               # Click element
agent-browser fill @e2 "text"         # Clear and type text
agent-browser type @e2 "text"         # Type without clearing
agent-browser select @e1 "option"     # Select dropdown option
agent-browser check @e1               # Check checkbox
agent-browser press Enter             # Press key
agent-browser scroll down 500         # Scroll page

# Get information
agent-browser get text @e1            # Get element text
agent-browser get url                 # Get current URL
agent-browser get title               # Get page title

# Wait
agent-browser wait @e1                # Wait for element
agent-browser wait --load networkidle # Wait for network idle
agent-browser wait --url "**/page"    # Wait for URL pattern
agent-browser wait 2000               # Wait milliseconds

# Capture
agent-browser screenshot              # Screenshot to temp dir
agent-browser screenshot --full       # Full page screenshot
agent-browser pdf output.pdf          # Save as PDF

Common Patterns

Form Submission

agent-browser open https://example.com/signup
agent-browser snapshot -i
agent-browser fill @e1 "Jane Doe"
agent-browser fill @e2 "jane@example.com"
agent-browser select @e3 "California"
agent-browser check @e4
agent-browser click @e5
agent-browser wait --load networkidle

Authentication with State Persistence

# Login once and save state
agent-browser open https://app.example.com/login
agent-browser snapshot -i
agent-browser fill @e1 "$USERNAME"
agent-browser fill @e2 "$PASSWORD"
agent-browser click @e3
agent-browser wait --url "**/dashboard"
agent-browser state save auth.json

# Reuse in future sessions
agent-browser state load auth.json
agent-browser open https://app.example.com/dashboard

Data Extraction

agent-browser open https://example.com/products
agent-browser snapshot -i
agent-browser get text @e5           # Get specific element text
agent-browser get text body > page.txt  # Get all page text

# JSON output for parsing
agent-browser snapshot -i --json
agent-browser get text @e1 --json

Parallel Sessions

agent-browser --session site1 open https://site-a.com
agent-browser --session site2 open https://site-b.com

agent-browser --session site1 snapshot -i
agent-browser --session site2 snapshot -i

agent-browser session list

Connect to Existing Chrome

# Auto-discover running Chrome with remote debugging enabled
agent-browser --auto-connect open https://example.com
agent-browser --auto-connect snapshot

# Or with explicit CDP port
agent-browser --cdp 9222 snapshot

Visual Browser (Headed Mode)

Use --headed as the first flag on the first line — it propagates to all commands in the block:

agent-browser:
--headed open https://example.com
wait --load networkidle
snapshot -i

agent-browser:
--headed open https://example.com
highlight @e1
record start demo.webm

Local Files (PDFs, HTML)

# Open local files with file:// URLs
agent-browser --allow-file-access open file:///path/to/document.pdf
agent-browser --allow-file-access open file:///path/to/page.html
agent-browser screenshot output.png

iOS Simulator (Mobile Safari)

# List available iOS simulators
agent-browser device list

# Launch Safari on a specific device
agent-browser -p ios --device "iPhone 16 Pro" open https://example.com

# Same workflow as desktop - snapshot, interact, re-snapshot
agent-browser -p ios snapshot -i
agent-browser -p ios tap @e1          # Tap (alias for click)
agent-browser -p ios fill @e2 "text"
agent-browser -p ios swipe up         # Mobile-specific gesture

# Take screenshot
agent-browser -p ios screenshot mobile.png

# Close session (shuts down simulator)
agent-browser -p ios close

Requirements: macOS with Xcode, Appium (npm install -g appium && appium driver install xcuitest)

Real devices: Works with physical iOS devices if pre-configured. Use --device "<UDID>" where UDID is from xcrun xctrace list devices.

Ref Lifecycle (Important)

Refs (@e1, @e2, etc.) are invalidated when the page changes. Always re-snapshot after:

Clicking links or buttons that navigate
Form submissions
Dynamic content loading (dropdowns, modals)

agent-browser click @e5              # Navigates to new page
agent-browser snapshot -i            # MUST re-snapshot
agent-browser click @e1              # Use new refs

Semantic Locators (Alternative to Refs)

When refs are unavailable or unreliable, use semantic locators:

agent-browser find text "Sign In" click
agent-browser find label "Email" fill "user@test.com"
agent-browser find role button click --name "Submit"
agent-browser find placeholder "Search" type "query"
agent-browser find testid "submit-btn" click

JavaScript Evaluation (exec pathway)

Use this pathway when you need to run JavaScript in the browser context — not ordinary CLI commands. This goes through exec:agent-browser via Bash, which pipes your code to agent-browser eval --stdin. Shell quoting can corrupt complex expressions — use the heredoc form.

Use exec:agent-browser via Bash. The code body is piped directly to agent-browser eval --stdin — no shell, no escaping.

exec:agent-browser
document.title

exec:agent-browser
JSON.stringify(
  Array.from(document.querySelectorAll("img"))
    .filter(i => !i.alt)
    .map(i => ({ src: i.src.split("/").pop(), width: i.width }))
)

Never base64-encode the code. Never add agent-browser eval flags. Write plain JavaScript directly as the exec body.

Complete Command Reference

Core Navigation & Lifecycle

agent-browser open <url>              # Navigate (aliases: goto, navigate)
agent-browser close                   # Close browser (aliases: quit, exit)
agent-browser back                    # Go back
agent-browser forward                 # Go forward
agent-browser reload                  # Reload page

Snapshots & Element References

agent-browser snapshot                # Accessibility tree with semantic refs
agent-browser snapshot -i             # Interactive elements with @e refs
agent-browser snapshot -i -C          # Include cursor-interactive divs (onclick, pointer)
agent-browser snapshot -s "#sel"      # Scope snapshot to CSS selector
agent-browser snapshot --json         # JSON output for parsing

Interaction - Click, Fill, Type, Select

agent-browser click <sel>             # Click element
agent-browser click <sel> --new-tab   # Open link in new tab
agent-browser dblclick <sel>          # Double-click
agent-browser focus <sel>             # Focus element
agent-browser type <sel> <text>       # Type into element (append)
agent-browser fill <sel> <text>       # Clear and fill
agent-browser select <sel> <val>      # Select dropdown option
agent-browser check <sel>             # Check checkbox
agent-browser uncheck <sel>           # Uncheck checkbox
agent-browser press <key>             # Press key (Enter, Tab, Control+a, etc.) (alias: key)

Keyboard & Text Input

agent-browser keyboard type <text>    # Type with real keystrokes (no selector, uses focus)
agent-browser keyboard inserttext <text>  # Insert text without triggering key events
agent-browser keydown <key>           # Hold key down
agent-browser keyup <key>             # Release key

Mouse & Drag

agent-browser hover <sel>             # Hover element
agent-browser drag <src> <tgt>        # Drag and drop
agent-browser mouse move <x> <y>      # Move mouse to coordinates
agent-browser mouse down [button]     # Press mouse button (left/right/middle)
agent-browser mouse up [button]       # Release mouse button
agent-browser mouse wheel <dy> [dx]   # Scroll wheel

Scrolling & Viewport

agent-browser scroll <dir> [px]       # Scroll (up/down/left/right, optional px)
agent-browser scrollintoview <sel>    # Scroll element into view (alias: scrollinto)
agent-browser set viewport <w> <h>    # Set viewport size (e.g., 1920 1080)
agent-browser set device <name>       # Emulate device (e.g., "iPhone 14")

Get Information

agent-browser get text <sel>          # Get text content
agent-browser get html <sel>          # Get innerHTML
agent-browser get value <sel>         # Get input value
agent-browser get attr <sel> <attr>   # Get attribute value
agent-browser get title               # Get page title
agent-browser get url                 # Get current URL
agent-browser get count <sel>         # Count matching elements
agent-browser get box <sel>           # Get bounding box {x, y, width, height}
agent-browser get styles <sel>        # Get computed CSS styles

Check State

agent-browser is visible <sel>        # Check if visible
agent-browser is enabled <sel>        # Check if enabled (not disabled)
agent-browser is checked <sel>        # Check if checked (checkbox/radio)

File Operations

agent-browser upload <sel> <files>    # Upload files to file input
agent-browser screenshot [path]       # Screenshot to temp or custom path
agent-browser screenshot --full       # Full page screenshot
agent-browser screenshot --annotate   # Annotated with numbered element labels
agent-browser pdf <path>              # Save as PDF

Semantic Locators (Alternative to Selectors)

agent-browser find role <role> <action> [value]       # By ARIA role
agent-browser find text <text> <action>               # By text content
agent-browser find label <label> <action> [value]     # By form label
agent-browser find placeholder <ph> <action> [value]  # By placeholder text
agent-browser find alt <text> <action>                # By alt text
agent-browser find title <text> <action>              # By title attribute
agent-browser find testid <id> <action> [value]       # By data-testid
agent-browser find first <sel> <action> [value]       # First matching element
agent-browser find last <sel> <action> [value]        # Last matching element
agent-browser find nth <n> <sel> <action> [value]     # Nth matching element

# Role examples: button, link, textbox, combobox, checkbox, radio, heading, list, etc.
# Actions: click, fill, type, hover, focus, check, uncheck, text
# Options: --name <name> (filter by accessible name), --exact (exact text match)

Waiting

agent-browser wait <selector>         # Wait for element to be visible
agent-browser wait <ms>               # Wait for time in milliseconds
agent-browser wait --text "Welcome"   # Wait for text to appear
agent-browser wait --url "**/dash"    # Wait for URL pattern
agent-browser wait --load networkidle # Wait for load state (load, domcontentloaded, networkidle)
agent-browser wait --fn "window.ready === true"  # Wait for JS condition

JavaScript Evaluation (exec pathway — not a direct tool command)

exec:agent-browser
<plain JS>

Use exec:agent-browser via Bash. Code is piped to agent-browser eval --stdin. No base64, no flags.

Browser Environment

agent-browser set geo <lat> <lng>     # Set geolocation
agent-browser set offline [on|off]    # Toggle offline mode
agent-browser set headers <json>      # Set HTTP headers
agent-browser set credentials <u> <p> # HTTP basic auth
agent-browser set media [dark|light]  # Emulate color scheme (prefers-color-scheme)

Cookies & Storage

agent-browser cookies                 # Get all cookies
agent-browser cookies set <name> <val> # Set cookie
agent-browser cookies clear           # Clear cookies
agent-browser storage local           # Get all localStorage
agent-browser storage local <key>     # Get specific key
agent-browser storage local set <k> <v>  # Set value
agent-browser storage local clear     # Clear all localStorage
agent-browser storage session         # Same for sessionStorage
agent-browser storage session <key>   # Get sessionStorage key
agent-browser storage session set <k> <v>  # Set sessionStorage
agent-browser storage session clear   # Clear sessionStorage

Network & Interception

agent-browser network route <url>              # Intercept requests
agent-browser network route <url> --abort      # Block requests
agent-browser network route <url> --body <json>  # Mock response with JSON
agent-browser network unroute [url]            # Remove routes
agent-browser network requests                 # View tracked requests
agent-browser network requests --filter api    # Filter by keyword

Tabs & Windows

agent-browser tab                     # List active tabs
agent-browser tab new [url]           # Open new tab (optionally with URL)
agent-browser tab <n>                 # Switch to tab n
agent-browser tab close [n]           # Close tab (current or specific)
agent-browser window new              # Open new window

Frames

agent-browser frame <sel>             # Switch to iframe by selector
agent-browser frame main              # Switch back to main frame

Dialogs

agent-browser dialog accept [text]    # Accept alert/confirm (with optional prompt text)
agent-browser dialog dismiss          # Dismiss dialog

State Persistence (Auth, Sessions)

agent-browser state save <path>       # Save authenticated session
agent-browser state load <path>       # Load session state
agent-browser state list              # List saved state files
agent-browser state show <file>       # Show state summary
agent-browser state rename <old> <new> # Rename state
agent-browser state clear [name]      # Clear specific session
agent-browser state clear --all       # Clear all states
agent-browser state clean --older-than <days>  # Delete old states

Debugging & Analysis

agent-browser highlight <sel>        # Highlight element visually
agent-browser console                # View console messages (log, error, warn)
agent-browser console --clear        # Clear console
agent-browser errors                 # View JavaScript errors
agent-browser errors --clear         # Clear errors
agent-browser trace start [path]     # Start DevTools trace
agent-browser trace stop [path]      # Stop and save trace
agent-browser profiler start         # Start Chrome DevTools profiler
agent-browser profiler stop [path]   # Stop and save .json profile

Visual Debugging

agent-browser:
--headed open <url>
record start <file.webm>

agent-browser:
record stop

Comparisons & Diffs

agent-browser diff snapshot                              # Compare current vs last snapshot
agent-browser diff snapshot --baseline before.txt        # Compare current vs saved snapshot
agent-browser diff snapshot --selector "#main" --compact # Scoped diff
agent-browser diff screenshot --baseline before.png      # Visual pixel diff
agent-browser diff screenshot --baseline b.png -o d.png  # Save diff to custom path
agent-browser diff screenshot --baseline b.png -t 0.2    # Color threshold 0-1
agent-browser diff url https://v1.com https://v2.com     # Compare two URLs
agent-browser diff url https://v1.com https://v2.com --screenshot  # With visual diff
agent-browser diff url https://v1.com https://v2.com --selector "#main"  # Scoped

Sessions & Parallelism

agent-browser --session <name> <cmd>  # Run in named session (isolated instance)
agent-browser session list            # List active sessions
agent-browser session show            # Show current session
# Example: agent-browser --session agent1 open site.com
#          agent-browser --session agent2 open other.com

Browser Connection

agent-browser connect <port>          # Connect via Chrome DevTools Protocol
agent-browser --auto-connect open <url>  # Auto-discover running Chrome
agent-browser --cdp 9222 <cmd>        # Explicit CDP port

Setup & Installation

agent-browser install                 # Download Chromium browser
agent-browser install --with-deps     # Also install system dependencies (Linux)

Advanced: Local Files & Protocols

agent-browser --allow-file-access open file:///path/to/file.pdf
agent-browser --allow-file-access open file:///path/to/page.html

Advanced: iOS/Mobile Testing

agent-browser device list             # List available iOS simulators
agent-browser -p ios --device "iPhone 16 Pro" open <url>  # Launch on device
agent-browser -p ios snapshot -i      # Snapshot on iOS
agent-browser -p ios tap @e1          # Tap (alias for click)
agent-browser -p ios swipe up         # Mobile gestures
agent-browser -p ios screenshot mobile.png
agent-browser -p ios close            # Close simulator
# Requires: macOS, Xcode, Appium (npm install -g appium && appium driver install xcuitest)

Windows: "Daemon not found" Fix

If agent-browser fails with Daemon not found. Set AGENT_BROWSER_HOME environment variable or run from project directory. on Windows, the AGENT_BROWSER_HOME env var is missing or pointing to the wrong path. It must point to the npm package directory containing dist/daemon.js:

:: Find and set the correct path (run in cmd or PowerShell)
for /f "delims=" %i in ('npm root -g') do setx AGENT_BROWSER_HOME "%i\agent-browser"

Open a new terminal after running. See references/windows-troubleshooting.md for details on stale port files and Git Bash setup.

Key Patterns for Agents

Always use agent-browser instead of puppeteer, playwright, or playwright-core — it has the same capabilities with simpler syntax and better integration with AI agents.

Which pathway to use:

Ordinary browser control (navigation, clicking, forms, screenshots) → call the agent-browser tool directly
Need to run JavaScript in the page → use exec:agent-browser via Bash with plain JS as the body

Multi-step workflows (ordinary pathway — direct tool calls):

agent-browser open <url>
agent-browser snapshot -i (get refs)
agent-browser fill @e1 "value"
agent-browser click @e2
agent-browser wait --load networkidle (after navigation)
agent-browser snapshot -i (re-snapshot for new refs)

JavaScript inspection (exec pathway — when you need page access):

exec:agent-browser
document.title

Debugging complex interactions: Use headed mode — put --headed as the first flag on the first line of an agent-browser: block. It propagates to all subsequent commands in the block.

Ground truth verification: Use the ordinary pathway (agent-browser screenshot) for visual confirmation; use the exec pathway for JavaScript-level inspection.

agent-browser