If a human can do it on a screen, you can too. No API? No integration? No problem.

USE AS A FALLBACK — NOT FIRST CHOICE

Before reaching for any ClawdCursor tool, ask:

Is there a native API? (Gmail API, GitHub API, Slack API) → use the API

Is there a CLI? (git, npm, curl) → use the CLI

Can you edit the file directly? → do that

Is there a browser automation layer? (Playwright, Puppeteer) → use that

None of the above work? Now use ClawdCursor. It's for the last mile.

Modes at a Glance

Mode	Command	Brain	Tools available
`serve`	`clawdcursor serve`	You (REST client)	All 42 tools via HTTP
`mcp`	`clawdcursor mcp`	You (MCP client)	All 42 tools via MCP stdio
`start`	`clawdcursor start`	Built-in LLM pipeline	All 42 tools + autonomous agent

In serve and mcp modes: you reason, ClawdCursor acts. There is no built-in LLM. You call tools, interpret results, decide next steps.

Connecting

Option A — REST (`clawdcursor serve`)

clawdcursor serve        # starts on http://127.0.0.1:3847

All POST endpoints require: Authorization: Bearer <token> (token saved to ~/.clawdcursor/token)

GET  /tools              → all tool schemas (OpenAI function-calling format)
POST /execute/{name}     → run a tool: {"param": "value"}
GET  /health             → {"status":"ok","version":"0.7.5"}
GET  /docs               → full documentation

Example:

POST /execute/get_windows     {}
POST /execute/mouse_click     {"x": 640, "y": 400}
POST /execute/type_text       {"text": "hello world"}

If the server isn't running, start it yourself — don't ask the user:

clawdcursor serve
# wait 2 seconds, then verify: GET /health

Option B — MCP (`clawdcursor mcp`)

{
  "mcpServers": {
    "clawdcursor": {
      "command": "clawdcursor",
      "args": ["mcp"]
    }
  }
}

Works with Claude Code, Cursor, Windsurf, Zed, or any MCP-compatible client. All 42 tools are exposed identically.

Option C — Autonomous agent (`clawdcursor start`)

POST /task    {"task": "Open Notepad and write Hello"}   → submit task
GET  /status  → {"status": "acting"} | "idle" | "waiting_confirm"
POST /confirm {"approved": true}                         → approve safety-gated action
POST /abort                                              → stop current task

Use delegate_to_agent tool to submit tasks from within MCP/REST sessions. Requires clawdcursor start running on port 3847.

Polling pattern:

POST /task  {"task": "...", "returnPartial": true}
→ poll GET /status every 2s:
    "acting"           → still running, keep polling
    "waiting_confirm"  → STOP. Ask user → POST /confirm {"approved": true}
    "idle"             → done, check GET /task-logs for result
→ if 60s+ with no progress: POST /abort, retry with simpler phrasing

returnPartial mode — send {"returnPartial": true} with POST /task: ClawdCursor skips Stage 3 (expensive vision) and returns control to you if Stage 2 fails:

{"partial": true, "stepsCompleted": [...], "context": "got stuck on dialog"}

You finish the task with MCP tools, then call POST /learn to save what worked.

POST /learn — adaptive learning: After completing a task with your own tool calls, teach ClawdCursor for next time:

POST /learn
{
  "processName": "EXCEL",
  "task": "create table with headers",
  "actions": [
    {"action": "key", "description": "Ctrl+Home to go to A1"},
    {"action": "type", "description": "Type header name"},
    {"action": "key", "description": "Tab to next column"}
  ],
  "shortcuts": {"next_cell": "Tab", "next_row": "Enter"},
  "tips": ["Use Tab between columns, Enter between rows"]
}

This enriches the app's guide JSON. Stage 2 reads it on the next run — no vision fallback needed.

The Universal Loop

Every GUI task follows the same pattern regardless of transport:

1. ORIENT  →  read_screen() or get_windows()          see what's open and focused
2. ACT     →  smart_click() / smart_type() / key_press()   do the thing
3. VERIFY  →  check return value → window state → text check → screenshot
4. REPEAT  →  until done

Verification (cheapest to most expensive)

Tool return value — every tool reports success/failure. Check it first.
Window state — get_active_window(), get_windows() — did a dialog appear? Did the title change?
Text check — read_screen() or smart_read() — is the expected text visible?
Screenshot — desktop_screenshot() — only when text methods fail. Costs the most.
Negative check — look for error dialogs, wrong window, unchanged screen.

Always verify after: sends, saves, deletes, form submissions. Skip verification for: mid-sequence keystrokes, scrolling.

Tool Decision Trees

Perception — always start here

read_screen()          → FIRST. Accessibility tree: buttons, inputs, text, with coords.
                          Fast, structured, works on native apps.
ocr_read_screen()      → When a11y tree is empty (canvas UIs, image-based apps).
smart_read()           → Combines OCR + a11y. Good first call when unsure.
desktop_screenshot()   → LAST RESORT. Only when you need pixel-level visual detail.
desktop_screenshot_region(x,y,w,h) → Zoomed crop when you need detail in one area.

Clicking

smart_click("Save")              → FIRST. Finds by label/text via OCR + a11y, clicks.
                                   Pass processId to target the right window.
invoke_element(name="Save")      → When you know the exact automation ID from read_screen.
cdp_click(text="Submit")         → Browser elements. Requires cdp_connect() first.
mouse_click(x, y)                → LAST RESORT. Raw coordinates from a screenshot.

Typing

smart_type("Email", "user@x.com")  → FIRST. Finds field by label, focuses, types.
cdp_type(label="Email", text="…")  → Browser inputs. Requires cdp_connect() first.
type_text("hello")                 → Clipboard paste into whatever is focused.
                                     Use after manually focusing with smart_click.

Browser / CDP

1. navigate_browser(url)     → opens URL, auto-enables CDP
2. cdp_connect()             → connect to browser DevTools Protocol
3. cdp_page_context()        → list interactive elements on page
4. cdp_read_text()           → extract DOM text (returns empty on canvas apps → use OCR)
5. cdp_click(text="…")       → click by visible text
6. cdp_type(label, text)     → fill input by label
7. cdp_evaluate(script)      → run JavaScript in page context
8. cdp_scroll(direction, px) → scroll page via DOM (not mouse wheel)
9. cdp_list_tabs()           → list all open tabs
10. cdp_switch_tab(target)   → switch to a specific tab

If CDP isn't connected, switch tabs with keyboard:

key_press("ctrl+1")          → tab 1
key_press("ctrl+tab")        → next tab
key_press("ctrl+shift+tab")  → previous tab

Window Management

get_windows()                         → list all open windows (use to find PIDs)
get_active_window()                   → what's in the foreground right now
focus_window(processName="Discord")   → bring to front (auto-minimizes phantom off-screen windows)
minimize_window(processName="calc")   → minimize a window — 1 call, cross-platform
                                        also accepts: processId, title

Rule: Always focus_window() before key_press() or type_text(). Keystrokes go to whatever has focus — if that's your terminal, not the target app.

Canvas apps (Google Docs, Figma, Notion)

DOM has no readable text. Pattern:

ocr_read_screen()          → read content (DOM extraction fails)
mouse_click(x, y)          → click into the canvas area
type_text("your text")     → clipboard paste works even on canvas

Quick Patterns

Open app and type:

open_app("notepad") → wait(2) → smart_read() → type_text("Hello") → smart_read()

Read a webpage:

navigate_browser(url) → wait(3) → cdp_connect() → cdp_read_text()

Fill a web form:

cdp_connect() → cdp_type("Email", "x@x.com") → cdp_type("Password", "…") → cdp_click("Submit")

Cross-app copy/paste:

focus_window("Chrome") → key_press("ctrl+a") → key_press("ctrl+c")
→ read_clipboard() → focus_window("Notepad") → type_text(clipboard)

Send email via Outlook:

open_app("outlook") → wait(2) → smart_click("New Email")
→ mouse_click(to_field_x, to_field_y) → type_text("recipient@x.com") → key_press("Tab")
→ mouse_click(subject_x, subject_y) → type_text("Subject") → key_press("Tab")
→ mouse_click(body_x, body_y) → type_text("Body text")
→ mouse_click(send_x, send_y)

Autonomous complex task (requires clawdcursor start):

delegate_to_agent("Open Gmail, find latest email from Stripe, forward to billing@x.com")
→ poll GET /status every 2s
→ if waiting_confirm: ask user → POST /confirm {"approved": true}
→ if idle: task done

Full Tool Reference (42 tools)

Speed: ⚡ Free/instant · 🔵 Cheap · 🟡 Moderate · 🔴 Vision (expensive)

Perception (6)

Tool	What it does	When
`read_screen`	A11y tree — buttons, inputs, text, coords	⚡ Default first read
`smart_read`	OCR + a11y combined	🔵 When unsure which to use
`ocr_read_screen`	Raw OCR text with bounding boxes	🔵 Canvas UIs, empty a11y trees
`desktop_screenshot`	Full screen image (1280px wide)	⚡ Last resort visual check
`desktop_screenshot_region`	Zoomed crop of specific area	⚡ Fine-grained visual detail
`get_screen_size`	Screen dimensions and DPI	⚡ Coordinate calculations

Mouse (7)

Tool	What it does	When
`smart_click`	Find element by text/label, click	🔵 First choice for clicking
`mouse_click`	Left click at (x, y)	⚡ Last resort
`mouse_double_click`	Double click at (x, y)	⚡ Open files, select words
`mouse_right_click`	Right click at (x, y)	⚡ Context menus
`mouse_hover`	Move cursor without clicking	⚡ Hover menus
`mouse_scroll`	Scroll at position (physical mouse wheel)	⚡ Scroll content
`mouse_drag`	Drag from start to end — accepts `startX/startY/endX/endY` or `x1/y1/x2/y2`	⚡ Resize, select ranges

Keyboard (5)

Tool	What it does	When
`smart_type`	Find input by label, focus it, type	🔵 First choice for form fields
`type_text`	Clipboard paste into focused element	⚡ After manually focusing
`key_press`	Send key combo (`ctrl+s`, `Return`, `alt+tab`)	⚡ After focus_window
`shortcuts_list`	List keyboard shortcuts for current app	⚡ Before reaching for mouse
`shortcuts_execute`	Run a named shortcut (fuzzy match)	⚡ Save, copy, paste, undo

Window Management (5)

Tool	What it does	When
`get_windows`	List all open windows with PIDs and bounds	⚡ Situational awareness
`get_active_window`	Current foreground window	⚡ Check current focus
`get_focused_element`	Element with keyboard focus	⚡ Debug wrong-field typing
`focus_window`	Bring window to front (auto-clears off-screen phantoms)	⚡ Always before key_press
`minimize_window`	Minimize by processName, processId, or title	⚡ Clear focus stealers

UI Elements (2)

Tool	What it does	When
`find_element`	Search UI tree by name or type	⚡ Find automation IDs
`invoke_element`	Invoke element by automation ID or name	⚡ When ID known from read_screen

Clipboard (2)

Tool	What it does	When
`read_clipboard`	Read clipboard text	⚡ After copy operations
`write_clipboard`	Write text to clipboard	⚡ Before paste operations

Browser / CDP (11)

Tool	What it does	When
`cdp_connect`	Connect to browser DevTools Protocol	⚡ First step for any browser task
`cdp_page_context`	List interactive elements on page	⚡ After connect
`cdp_read_text`	Extract DOM text	⚡ Read page content
`cdp_click`	Click by CSS selector or visible text	⚡ Browser clicks
`cdp_type`	Type into input by label or selector	⚡ Browser form filling
`cdp_select_option`	Select dropdown option	⚡ Select elements
`cdp_evaluate`	Run JavaScript in page context	⚡ Custom queries
`cdp_scroll`	Scroll page via DOM (`direction`, `amount` px)	⚡ DOM-level scroll
`cdp_wait_for_selector`	Wait for element to appear	⚡ After navigation/AJAX
`cdp_list_tabs`	List all browser tabs	⚡ When on wrong tab
`cdp_switch_tab`	Switch to a tab by title or index	⚡ After cdp_list_tabs

Orchestration (4)

Tool	What it does	When
`open_app`	Launch application by name	⚡ First step for desktop tasks
`navigate_browser`	Open URL (auto-enables CDP)	⚡ First step for browser tasks
`wait`	Pause N seconds	⚡ After opening apps, let UI render
`delegate_to_agent`	Send task to built-in autonomous agent	🟡 Complex multi-step tasks (requires `clawdcursor start`)

Provider Setup (agent mode only)

Provider	Setup	Cost
Ollama (local)	`ollama pull qwen2.5:7b && ollama serve`	$0 — fully offline, no data leaves machine
Any cloud	Set env var: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GEMINI_API_KEY`, `MOONSHOT_API_KEY`, etc.	Varies
OpenClaw users	Auto-detected from `~/.openclaw/agents/main/auth-profiles.json`	No extra setup

Run clawdcursor doctor to auto-detect and validate providers.

Security

Network isolation: Binds to 127.0.0.1 only. Verify: netstat -an | findstr 3847 — should show 127.0.0.1:3847, never 0.0.0.0:3847
Ollama: 100% offline. Screenshots stay in RAM, never leave the machine.
Cloud providers: Screenshots/text sent only to your configured provider. No telemetry, no analytics, no third-party logging.
Token auth: All mutating POST endpoints require Authorization: Bearer <token>. Token at ~/.clawdcursor/token.
Safety tiers: Auto / Preview / Confirm. Agents must never self-approve Confirm actions.

Coordinate System

All mouse tools use image-space coordinates from a 1280px-wide viewport — matching screenshots from desktop_screenshot. DPI scaling is handled automatically. Do not pre-scale coordinates.

Safety

Tier	Actions	Behavior
🟢 Auto	Navigation, reading, opening apps	Runs immediately
🟡 Preview	Typing, form filling	Logged
🔴 Confirm	Send, delete, purchase	Pauses — always ask user first

Never self-approve Confirm actions.
Alt+F4 and Ctrl+Alt+Delete are blocked.
Server binds to 127.0.0.1 only.
First run requires explicit user consent for desktop control.

Error Recovery

Problem	Fix
Port 3847 not responding	`clawdcursor serve` — wait 2s — `GET /health`
401 Unauthorized	Token changed — read `~/.clawdcursor/token` and use fresh value
CDP not available	Chrome must be open. `navigate_browser(url)` auto-enables it.
CDP on wrong tab	`cdp_list_tabs()` → `cdp_switch_tab(target)`
`focus_window` fails	`get_windows()` to confirm title/processName, then retry
`smart_click` can't find element	`read_screen()` for coords → `mouse_click(x, y)`
`key_press` goes to wrong window	You skipped `focus_window` — always focus first
`cdp_read_text` returns empty	Canvas app — use `ocr_read_screen()` instead
Same action fails 3+ times	Try a completely different approach

Platform Support

Platform	A11y	OCR	CDP
Windows (x64/ARM64)	PowerShell + .NET UIA	Windows.Media.Ocr	Chrome/Edge
macOS (Intel/Apple Silicon)	JXA + System Events	Apple Vision	Chrome/Edge
Linux (x64/ARM64)	AT-SPI	Tesseract	Chrome/Edge

macOS: Grant Accessibility in System Settings → Privacy → Accessibility. Linux: sudo apt install tesseract-ocr for OCR support.

clawdcursor

Modes at a Glance

Connecting

Option A — REST (`clawdcursor serve`)

Option B — MCP (`clawdcursor mcp`)

Option C — Autonomous agent (`clawdcursor start`)

The Universal Loop

Verification (cheapest to most expensive)

Tool Decision Trees

Perception — always start here

Clicking

Typing

Browser / CDP

Window Management

Canvas apps (Google Docs, Figma, Notion)

Quick Patterns

Full Tool Reference (42 tools)

Perception (6)

Mouse (7)

Keyboard (5)

Window Management (5)

UI Elements (2)

Clipboard (2)

Browser / CDP (11)

Orchestration (4)

Provider Setup (agent mode only)

Security

Coordinate System

Safety

Error Recovery

Platform Support

More from amrdab/clawd-cursor

clawd-cursor

clawdcursor

Modes at a Glance

Connecting

Option A — REST (clawdcursor serve)

Option B — MCP (clawdcursor mcp)

Option C — Autonomous agent (clawdcursor start)

The Universal Loop

Verification (cheapest to most expensive)

Tool Decision Trees

Perception — always start here

Clicking

Typing

Browser / CDP

Window Management

Canvas apps (Google Docs, Figma, Notion)

Quick Patterns

Full Tool Reference (42 tools)

Perception (6)

Mouse (7)

Keyboard (5)

Window Management (5)

UI Elements (2)

Clipboard (2)

Browser / CDP (11)

Orchestration (4)

Provider Setup (agent mode only)

Security

Coordinate System

Safety

Error Recovery

Platform Support

More from amrdab/clawd-cursor

clawd-cursor

Option A — REST (`clawdcursor serve`)

Option B — MCP (`clawdcursor mcp`)

Option C — Autonomous agent (`clawdcursor start`)