Cursor IDE Browser Automation

Browser automation for Cursor IDE through the cursor-ide-browser MCP (Model Context Protocol) server, which uses accessibility snapshots for precise element interaction.

Core Mechanism

Accessibility Snapshot First: Always get a snapshot before interacting with elements. The snapshot provides structured page information with element references (ref) needed for all interactions.

// Standard workflow
browser_navigate(url="https://example.com")
browser_snapshot()  // Required: Get element references
browser_click(element="Button", ref="ref-from-snapshot")

Essential Workflow

Navigate to target page
Snapshot to get element references (required before any interaction)
Convert to Markdown (⭐ Recommended) for easier searching, locating, and reading
Search the Markdown file with grep to find information or locate interactive elements
Interact using refs from snapshot
Wait for dynamic content if needed
Verify with screenshots or console messages

Quick example:

browser_navigate(url="https://example.com")
browser_snapshot()  // Creates .log file
mcp_snapshot-query_convert_to_markdown(file_path="snapshot.log")
grep(pattern="button|登录", path="snapshot.md")  // Find elements ("登录" = "log in")
browser_click(element="Login", ref="ref-from-grep-results")

Key Tools

Navigation:

browser_navigate(url, position?) - Navigate to URL
browser_navigate_back() - Go back

Page Information:

browser_snapshot() - Required before interactions - Get accessibility tree with element refs
browser_take_screenshot(fullPage?, filename?) - Capture a screenshot of the page
browser_console_messages() - Get console logs
browser_network_requests() - Get network activity

Element Interaction:

browser_click(element, ref, doubleClick?, button?, modifiers?) - Click element
browser_type(element, ref, text, submit?, slowly?) - Type text
browser_hover(element, ref) - Hover
browser_select_option(element, ref, values) - Select dropdown
browser_press_key(key) - Press key (supports PageDown, PageUp, ArrowDown, ArrowUp, Space, End, Home for scrolling)

Synchronization:

browser_wait_for(text?, textGone?, time?) - Wait for text or time

Tab Management:

browser_tabs(action, index?, position?) - Manage tabs (list/new/close/select)

Element References

**element**: Human-readable description (for permission confirmation)
**ref**: Technical reference from snapshot (required for interaction)
Refs are page-state specific - get a new snapshot after navigation or page changes
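
One way to avoid acting on expired refs is to track them on the caller's side. The sketch below is a hypothetical helper, not part of cursor-ide-browser: `RefTracker`, `StaleRefError`, and their methods are names invented for illustration. It assumes you record the refs found after each browser_snapshot() and mark the page as changed after each navigation or page-altering action.

```python
class StaleRefError(RuntimeError):
    """Raised when a ref from an outdated snapshot is about to be used."""


class RefTracker:
    """Caller-side bookkeeping for snapshot refs (hypothetical helper)."""

    def __init__(self) -> None:
        self._refs = set()
        self._fresh = False

    def record_snapshot(self, refs) -> None:
        # Call after every browser_snapshot(): replaces all previously known refs.
        self._refs = set(refs)
        self._fresh = True

    def mark_page_changed(self) -> None:
        # Call after navigation or any action that changes the page.
        self._fresh = False

    def require(self, ref: str) -> str:
        # Return the ref only if it belongs to the current snapshot.
        if not self._fresh:
            raise StaleRefError("page changed since the last snapshot; take a new one")
        if ref not in self._refs:
            raise StaleRefError(f"{ref!r} is not in the current snapshot")
        return ref
```

Passing every ref through `require()` before an interaction turns a stale ref into an explicit error instead of a click on the wrong element.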

Snapshot Files

Snapshots are automatically saved as YAML-formatted files with a .log extension:

Location: C:\Users\{username}\.cursor\browser-logs\snapshot-{timestamp}.log
Format: YAML accessibility tree with role, ref, name, children
Usage: Extract ref values for element interactions
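
When a script needs to find the most recent snapshot itself, the timestamp in the filename can be parsed directly. This is a minimal sketch under two assumptions: filenames follow the shape of the example snapshot-2026-01-10T23-43-30-351Z.log, and refs appear in the YAML as unquoted `ref: <value>` entries. The function names are illustrative only; check references/snapshot-format.md for the actual structure.

```python
import re
from datetime import datetime
from pathlib import Path

# Filename shape taken from the example snapshot-2026-01-10T23-43-30-351Z.log
_NAME_RE = re.compile(r"^snapshot-(\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}-\d{3}Z)\.log$")
# Assumption: refs are stored as "ref: <value>" with an unquoted value
_REF_RE = re.compile(r"\bref:\s*([\w-]+)")


def snapshot_time(path: Path):
    """Parse the timestamp embedded in a snapshot filename, or return None."""
    match = _NAME_RE.match(path.name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H-%M-%S-%fZ")


def newest_snapshot(log_dir: Path):
    """Return the snapshot file with the latest filename timestamp, or None."""
    dated = [(snapshot_time(p), p) for p in log_dir.glob("snapshot-*.log")]
    dated = [(t, p) for t, p in dated if t is not None]
    if not dated:
        return None
    return max(dated, key=lambda pair: pair[0])[1]


def extract_refs(snapshot_text: str):
    """Collect every ref value in document order."""
    return _REF_RE.findall(snapshot_text)
```

On the location listed above, the directory would be `Path.home() / ".cursor" / "browser-logs"`, and `newest.with_suffix(".md")` gives the matching Markdown filename.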

Querying Snapshots

⭐ Recommended Workflow: Convert to Markdown + Grep

Best practice for finding information and locating interactive elements:

Get snapshot → Creates .log file
Convert to Markdown → More readable format with structured content
Use grep → Fast text search across the entire document
Extract refs → Use found refs for interactions

// Step 1: Get page snapshot
browser_snapshot()  // Creates: snapshot-2026-01-10T23-43-30-351Z.log

// Step 2: Convert to Markdown (RECOMMENDED)
mcp_snapshot-query_convert_to_markdown(
  file_path="snapshot-2026-01-10T23-43-30-351Z.log",
  include_ref=true
)  // Saves snapshot-2026-01-10T23-43-30-351Z.md

// Step 3: Search with grep (much easier than querying raw YAML)
// "搜索" = "search", "登录" = "log in"
grep(pattern="搜索|button|登录", path="snapshot-2026-01-10T23-43-30-351Z.md", -i=true)
grep(pattern="^\\[.*\\]\\(ref-|^\\*\\*.*\\*\\* `ref-", path="snapshot-2026-01-10T23-43-30-351Z.md")  // Find all links/buttons

// Step 4: Use found refs for interaction
browser_click(element="Login button", ref="ref-found-from-grep")
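
To see what the links/buttons pattern from Step 3 selects, here is a small Python sketch that applies it to sample text. The sample line shapes (`[text](ref-…)` for links, `**text** `ref-…`` for buttons) are inferred from the grep pattern itself, not from documented converter output, and the pattern is extended with capture groups so the ref can be pulled out.

```python
import re

# The document's grep pattern, with capture groups added around the ref.
INTERACTIVE_RE = re.compile(r"^\[.*\]\((ref-[\w-]+)|^\*\*.*\*\* `(ref-[\w-]+)")


def interactive_lines(markdown: str):
    """Return (ref, line) pairs for lines that look like links or buttons."""
    found = []
    for line in markdown.splitlines():
        match = INTERACTIVE_RE.search(line)
        if match:
            # Exactly one of the two groups is set, depending on the branch.
            found.append((match.group(1) or match.group(2), line))
    return found


# Assumed sample of converted Markdown (shape inferred from the pattern).
SAMPLE = "# Page\n[Home](ref-a1)\n**Login** `ref-b2`\nPlain text\n"
```

`interactive_lines(SAMPLE)` keeps the link and button lines and skips the heading and the plain paragraph.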

Why this workflow is preferred:

✅ More readable: Markdown format is human-friendly
✅ Faster search: grep is more efficient than parsing YAML
✅ Better context: See surrounding content with -C flag
✅ Easy element discovery: Links and buttons clearly formatted
✅ Preserves refs: All element references included for interaction
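
For scripts that run outside the agent, the same search-with-context idea can be approximated in plain Python. This is a sketch only; `grep_context` is a hypothetical helper that mimics `grep -i -C N` over text already read from the Markdown file.

```python
import re


def grep_context(text: str, pattern: str, context: int = 3, ignore_case: bool = True):
    """Return (line_number, line) pairs for matches plus surrounding lines."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    lines = text.splitlines()
    keep = set()
    for index, line in enumerate(lines):
        if regex.search(line):
            # Keep the match and up to `context` lines on each side.
            start = max(0, index - context)
            stop = min(len(lines), index + context + 1)
            keep.update(range(start, stop))
    # Line numbers are 1-based, as grep -n would print them.
    return [(i + 1, lines[i]) for i in sorted(keep)]
```

Overlapping context windows are merged, so each line is returned once.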

Alternative: Direct Query Tools

For programmatic element finding, use snapshot-query MCP tools:

Command line:

browser_snapshot()  # Generate snapshot
uvx snapshot-query snapshot.log find-name "search"  # Find element

MCP tools:

mcp_snapshot-query_find_by_name(file_path="snapshot.log", name="搜索")  // "搜索" = "search"
mcp_snapshot-query_find_by_role(file_path="snapshot.log", role="button")
mcp_snapshot-query_find_by_text(file_path="snapshot.log", text="登录")  // "登录" = "log in"
mcp_snapshot-query_find_by_regex(file_path="snapshot.log", pattern="\\d+\\s*ft", field="name")
mcp_snapshot-query_find_by_name_bm25(file_path="snapshot.log", name="search query", top_k=5)
mcp_snapshot-query_count_elements(file_path="snapshot.log")
mcp_snapshot-query_get_element_path(file_path="snapshot.log", ref="ref-xxx")
mcp_snapshot-query_extract_all_refs(file_path="snapshot.log")

Integrated workflow:

browser_snapshot()  // Creates snapshot file
// Query snapshot to find element ref
const result = mcp_snapshot-query_find_by_name(file_path="snapshot.log", name="Login")
browser_click(element="Login", ref=result.ref)  // Use ref from query

⭐ snapshot-query works with OCR results too:

The snapshot-query tools can process OCR results from fast-paddleocr-mcp. After OCR processing, you get a .snapshot.log file that can be queried just like browser snapshots:

// OCR generates webpage.png.snapshot.log
mcp_fast-paddleocr-mcp_ocr_image(image_path="webpage.png", language="ch")

// Query OCR results with snapshot-query
mcp_snapshot-query_find_by_text(
  file_path="webpage.png.snapshot.log",
  text="8 ft",
  case_sensitive=false
)

// Use regex to find measurements
mcp_snapshot-query_find_by_regex(
  file_path="webpage.png.snapshot.log",
  pattern="\\d+\\s*(ft|cm|meters?)",
  field="name"
)

// BM25 ranked keyword search for looser matches
mcp_snapshot-query_find_by_name_bm25(
  file_path="webpage.png.snapshot.log",
  name="height measurement",
  top_k=5
)

// Convert to Markdown for analysis
mcp_snapshot-query_convert_to_markdown(
  file_path="webpage.png.snapshot.log",
  include_ref=true
)
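
A note on measurement regexes: alternation (`|`) has the lowest precedence, so the units must be grouped for the number prefix to apply to all of them. The sketch below uses Python's `re` module to show the difference; it assumes the snapshot-query regex engine follows the same precedence rules, which is standard but not confirmed here.

```python
import re

# Ungrouped: parsed as (\d+\s*ft) OR (cm) OR (meters?)
LOOSE = re.compile(r"\d+\s*ft|cm|meters?")
# Grouped: the number prefix applies to every unit
GROUPED = re.compile(r"\d+\s*(?:ft|cm|meters?)")

TEXT = "The page lists 8 ft, 250 cm and 3 meters."


def find_measurements(text: str):
    """Return number-plus-unit matches using the grouped pattern."""
    return GROUPED.findall(text)


loose_hits = LOOSE.findall(TEXT)      # ['8 ft', 'cm', 'meters']
grouped_hits = find_measurements(TEXT)  # ['8 ft', '250 cm', '3 meters']
```

The ungrouped form also matches a bare "cm" inside unrelated words such as "acme", which the grouped form rejects.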

See references/snapshot-query.md for complete snapshot-query documentation.

Common Patterns

Login flow:

browser_navigate(url="https://example.com/login")
browser_snapshot()
// Find username input ref from snapshot
browser_type(element="Username", ref="ref-username", text="user")
// Find password input ref from snapshot
browser_type(element="Password", ref="ref-password", text="pass")
// Find login button ref from snapshot
browser_click(element="Login", ref="ref-login-btn")
browser_wait_for(text="Welcome")

Search and extract (with Markdown workflow):

// Query text: "How many children does Khamenei have"
browser_navigate(url="https://www.baidu.com/s?wd=哈梅内伊有几个孩子")
browser_snapshot()  // Creates snapshot.log
// Convert to Markdown for easier searching
mcp_snapshot-query_convert_to_markdown(
  file_path="snapshot.log",
  include_ref=true
)
// Search for information using grep ("六名" = "six people", "6个" = "6", "子女" = "children")
grep(pattern="六名|6个|子女", path="snapshot.md", -i=true, -C=3)
// Find interactive elements (links/buttons)
grep(pattern="^\\[.*\\]\\(ref-|^\\*\\*.*\\*\\* `ref-", path="snapshot.md")
// Click on found link using ref
browser_click(element="Article link", ref="ref-45py92vjdrs")
browser_wait_for(text="Results")
browser_take_screenshot(filename="results.png")

Debug page issues:

browser_snapshot()
browser_console_messages()  // Check for errors
browser_network_requests()  // Check failed requests

Scrolling web pages:

browser_press_key(key="PageDown")   // Scroll down one page
browser_press_key(key="PageUp")     // Scroll up one page
browser_press_key(key="ArrowDown")  // Scroll down line by line
browser_press_key(key="ArrowUp")    // Scroll up line by line
browser_press_key(key="Space")      // Scroll down one screen
browser_press_key(key="End")        // Scroll to bottom
browser_press_key(key="Home")       // Scroll to top
browser_wait_for(time=1)            // Wait after scrolling for content to load

OCR processing with fast-paddleocr-mcp:

// Take screenshot of webpage
browser_take_screenshot(filename="webpage.png", fullPage=false)

// Process with OCR (generates .md and .snapshot.log files)
mcp_fast-paddleocr-mcp_ocr_image(
  image_path="webpage.png",
  language="ch"  // Use "ch" for Chinese+English, "en" for English only
)

// Query OCR results with snapshot-query
mcp_snapshot-query_find_by_text(
  file_path="webpage.png.snapshot.log",
  text="tallest",
  case_sensitive=false
)

// Use BM25 ranked keyword search for better relevance
mcp_snapshot-query_find_by_name_bm25(
  file_path="webpage.png.snapshot.log",
  name="height tallest person",
  top_k=5
)

// Convert OCR snapshot to Markdown for easier analysis
mcp_snapshot-query_convert_to_markdown(
  file_path="webpage.png.snapshot.log",
  include_ref=true
)

Cross-verification workflow:

// Navigate to multiple sources for verification
browser_navigate(url="https://source1.com/article")
browser_snapshot()
// Extract information from source 1

browser_navigate(url="https://source2.com/article")
browser_snapshot()
// Extract information from source 2

// Compare and verify information consistency
// Prefer authoritative sources (Wikipedia, official records, etc.)
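
The comparison step can be sketched in Python once a value has been extracted from each source. This is an illustrative helper, not part of any tool mentioned here: `cross_check` and its `min_agree` threshold are assumptions, and the normalization (whitespace collapsing and case folding) is deliberately simple.

```python
from collections import Counter


def cross_check(claims: dict, min_agree: int = 2):
    """Return (value, agreeing_sources) if enough sources agree, else (None, []).

    `claims` maps a source name to the value extracted from that source.
    """
    normalized = {
        source: " ".join(value.split()).casefold()
        for source, value in claims.items()
        if value and value.strip()
    }
    if not normalized:
        return None, []
    value, count = Counter(normalized.values()).most_common(1)[0]
    if count < min_agree:
        return None, []
    agreeing = sorted(s for s, v in normalized.items() if v == value)
    return value, agreeing
```

For example, `cross_check({"source1.com": "Six", "source2.com": " six ", "source3.com": "four"})` reports agreement between the first two sources, while two sources that disagree yield no verified value.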

Important Notes

Always snapshot before interaction - Refs are required and page-specific
⭐ Convert to Markdown first - Use convert_to_markdown + grep for finding information and elements (much easier than querying raw YAML)
Wait for dynamic content - Use browser_wait_for() for async operations
Refs expire - Get new snapshot after navigation or page changes
Multi-tab support - Use viewId parameter or browser_tabs() to manage tabs
Position control - Use position="side" when user mentions side panel
OCR limitations - OCR may merge adjacent text (e.g., "otherreliablesourcesccordingtoG"). Key information is usually extracted correctly, but verify important details
Cross-verification - For critical information, verify across multiple authoritative sources (Wikipedia, official records, etc.)
Tool combination - Combine browser automation + OCR + snapshot-query for comprehensive web content analysis
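
Because OCR may drop the spaces between words, an exact phrase search can miss text that is actually present. One workaround, sketched below in Python with hypothetical helper names, is to compare both sides with whitespace removed. It does not recover dropped characters, as the second check on the example string from the note above shows.

```python
import re


def squash(text: str) -> str:
    """Remove all whitespace and fold case for space-insensitive comparison."""
    return re.sub(r"\s+", "", text).casefold()


def contains_ignoring_spaces(ocr_text: str, phrase: str) -> bool:
    """True if `phrase` occurs in `ocr_text` once spaces are ignored."""
    return squash(phrase) in squash(ocr_text)


MERGED = "otherreliablesourcesccordingtoG"
found_phrase = contains_ignoring_spaces(MERGED, "other reliable sources")  # True
found_damaged = contains_ignoring_spaces(MERGED, "according to")           # False: the "a" was lost
```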

Best Practices & Lessons Learned

Workflow Optimization

Standard workflow: Navigate → Snapshot → Convert to Markdown → Search → Interact
OCR workflow: Screenshot → OCR → Query with snapshot-query → Extract information
Verification workflow: Multiple sources → Extract → Compare → Verify consistency

Tool Integration

Browser + OCR: Use browser_take_screenshot() + fast-paddleocr-mcp to extract text from visual content
OCR + snapshot-query: OCR generates .snapshot.log files that can be queried with all snapshot-query tools
Markdown + grep: Convert snapshots/OCR results to Markdown for easier searching

Key Insights

snapshot-query is universal: Works with both browser snapshots and OCR results
Markdown conversion is recommended: Much easier to search and read than raw YAML
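BM25 ranked search: Use find_by_name_bm25() for better relevance when exact matches are unclear. BM25 scores keyword overlap rather than meaning, so include the terms you expect to appear on the page
Cross-verification: Always verify critical information against multiple authoritative sources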
OCR accuracy: Works well for key information but may merge adjacent text - verify important details
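
For intuition about what BM25 ranking does, here is a minimal Python sketch of the standard Okapi BM25 formula applied to element names. It is not the snapshot-query implementation: the tokenizer, the parameters `k1=1.5` and `b=0.75`, and the function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str):
    """Lowercase word tokens; a simplistic stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.casefold())


def bm25_rank(query: str, docs, top_k: int = 5, k1: float = 1.5, b: float = 0.75):
    """Return up to top_k docs with a positive BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avg_len = (sum(len(t) for t in tokenized) / n) or 1.0
    # Document frequency: in how many docs each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scored = []
    for index, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, index))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [docs[i] for _, i in scored[:top_k]]


NAMES = ["Login button", "Search box", "Height of the tallest person", "Footer link"]
```

`bm25_rank("tallest height", NAMES)` returns the height entry, while `bm25_rank("sign in", NAMES)` returns nothing because no name shares a token with the query. That is why BM25 is best described as ranked keyword matching.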

Detailed Reference

Complete tool reference: See references/tools.md for all tools with full parameters
Examples and patterns: See references/examples.md for detailed workflows
Snapshot file format: See references/snapshot-format.md for YAML structure details
Snapshot querying: See references/snapshot-query.md for querying snapshot files
