Claude Computer Use Advanced

Overview

Claude's computer use tool enables AI agents to interact with desktop environments programmatically. This skill provides advanced patterns for building sophisticated UI automation workflows, testing applications, extracting data from screens, and controlling applications across operating systems.

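Computer use lets Claude operate software the way a person does rather than through an application's API: it views the screen, clicks elements, types text, and presses keyboard shortcuts. Claude does not control the machine directly; it requests actions through the tool, and your application executes them and returns the results (usually a screenshot). This makes it possible to automate most desktop tasks that have a graphical interface, subject to the limitations listed under Security & Limitations.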

Core Capabilities:

Take screenshots and analyze visual content
Click on screen elements using precise coordinates
Type text and submit forms
Press keyboard shortcuts and special keys
Scroll, drag, and perform complex mouse operations
Use zoom tool to inspect specific screen regions (Opus 4.5 only)
Execute multi-step automation workflows
Control applications and software programmatically

Key Models:

Claude Opus 4.5 (computer_20251124): Newest version with zoom tool for enhanced vision accuracy
Claude 4 / Sonnet 3.7 (computer_20250124): Stable version with full action support

When to Use

Use computer use for these common scenarios:

Desktop Automation

Batch processing repetitive tasks
Form filling and data entry workflows
File and folder management
Configuration and setup automation
Multi-application workflows

Application Testing

Automated UI testing frameworks
Cross-platform testing workflows
Visual regression testing
User interaction simulation
Quality assurance automation

Screen Analysis & Data Extraction

Extracting text and data from applications
Analyzing visual layouts and designs
Reading content from non-API systems
Screenshot analysis and understanding
OCR-like text extraction from screens

Application Control & Integration

Controlling legacy applications without APIs
Creating automation agents for closed systems
Building RPA (Robotic Process Automation) workflows
Orchestrating multi-application processes
Software testing and validation

Computer Use Tasks

Task 1: Taking Screenshots

Screenshots form the foundation of computer use - Claude needs to see the screen to know what to do next.

When to Use:

As the first action in any workflow
Before clicking on elements (to get coordinates)
To analyze the current application state
To understand what's visible on screen
After actions to verify results

How It Works: When Claude requests the screenshot action, your application captures the entire display and returns it as a base64-encoded image in the tool result. Claude then analyzes the image to understand the interface, identify elements, and determine the next action.

Basic Example:

# Tool input Claude sends to request a screenshot
action = {
    "action": "screenshot"
}
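
Returning the captured image to Claude could look like the sketch below. It assumes the capture already exists as PNG bytes; the helper name `screenshot_tool_result` is hypothetical, while the `tool_result` and base64 image block shapes follow the Messages API tool-use format.

```python
import base64


def screenshot_tool_result(tool_use_id: str, png_bytes: bytes) -> dict:
    """Wrap raw PNG bytes as a tool_result block for the next user message.

    Hypothetical helper: capturing the screen itself is left to your code.
    """
    encoded = base64.standard_b64encode(png_bytes).decode("ascii")
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,  # must match the id of Claude's tool_use block
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": encoded,
                },
            }
        ],
    }
```

The returned dict goes into the `content` list of the next `user` message, so Claude sees the screenshot as the result of its request.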

Coordinate System:

Origin (0, 0) is at the top-left of the screen
X increases to the right
Y increases downward
Coordinates refer to pixel positions on the display

Best Practices:

Always start with a screenshot to establish the initial state
Take screenshots after significant actions to verify success
Use zoom tool (Opus 4.5) for precise element location
Consider display resolution when planning coordinates
Typical display sizes: 1024x768 (recommended) up to 1280x800; larger screens may be downscaled before analysis, which reduces coordinate accuracy
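
If the physical display is larger than the screenshot you send, coordinates from Claude need to be mapped back before clicking. The sketch below is one way to do that, assuming uniform scaling on each axis; `scale_to_display` is a hypothetical helper, not part of the tool.

```python
def scale_to_display(coord, screenshot_size, display_size):
    """Map an (x, y) point from screenshot space to real display pixels.

    Assumes the screenshot is a plain resize of the full display.
    """
    scale_x = display_size[0] / screenshot_size[0]
    scale_y = display_size[1] / screenshot_size[1]
    # Clamp so rounding never produces a point just outside the display.
    x = min(round(coord[0] * scale_x), display_size[0] - 1)
    y = min(round(coord[1] * scale_y), display_size[1] - 1)
    return (x, y)


# A click at the centre of a 1024x768 screenshot on a 2048x1536 display
print(scale_to_display((512, 384), (1024, 768), (2048, 1536)))
```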

Task 2: Clicking Elements

Clicking is the primary way to interact with UI elements - buttons, links, checkboxes, menu items, and any clickable interface element.

When to Use:

Activating buttons and submitting forms
Opening menus and selecting options
Clicking links to navigate
Toggling checkboxes and radio buttons
Selecting items in lists or dropdowns

How It Works: The click action takes x,y coordinates and sends a mouse click at that position. Claude uses the screenshot to identify element locations and then clicks on them.

Basic Example:

action = {
    "action": "left_click",
    "coordinate": [640, 400]  # [x, y] as seen in the screenshot
}

Coordinate Precision:

Aim for the center of clickable elements
Use zoom tool (Opus 4.5) when precise coordinates are needed
If click misses, take another screenshot and adjust
Small elements may require zoomed region for accuracy
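
Aiming for the center is simple arithmetic once an element's bounding box is known. The sketch below assumes the box is given as (x1, y1, x2, y2) in screenshot coordinates; both the box and the helper name `center_of` are illustrative.

```python
def center_of(box):
    """Return the integer center [x, y] of a bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) // 2, (y1 + y2) // 2]


# A button spanning x 600..680 and y 380..420
click = {"action": "left_click", "coordinate": center_of((600, 380, 680, 420))}
```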

Advanced Click Types:

left_click: Standard single click
right_click: Opens context menus
double_click: Selects text or activates double-click actions
middle_click: Used by some applications (for example, to open a link in a new tab)
left_click_drag, or left_mouse_down followed by left_mouse_up: Press, move, and release for drag operations

Best Practices:

Take a screenshot before clicking to find correct coordinates
Identify button/link boundaries visually
Use center coordinates for best accuracy
Verify clicks succeeded with follow-up screenshots
Handle missed clicks by retrying with adjusted coordinates

Task 3: Typing Text & Form Input

Typing enables text input - entering data into forms, search boxes, command prompts, and text fields.

When to Use:

Filling out form fields with text
Entering search queries
Typing commands or code
Inputting credentials (in secure environments only)
Text area and field population

How It Works: The type action sends keyboard input character by character. The text is typed at the current cursor position, typically after clicking a text field.

Basic Example:

# Click the text field first to focus it
action = {"action": "left_click", "coordinate": [500, 300]}

# Then type the text
action = {
    "action": "type",
    "text": "Hello, World!"
}

Text Input Workflow:

Take screenshot to see the form
Click on the target text field
Type the text
Optionally press Enter or Tab to submit/move to next field
Screenshot to verify input
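
The workflow above can be written out as a fixed action sequence, which is handy for exercising your own action executor in tests. This is a sketch: in a live session Claude chooses each action itself, and `fill_field_actions` is a hypothetical helper. The dicts use the `action` field of the computer tool's input.

```python
def fill_field_actions(field_coord, text, submit_key=None):
    """Build the screenshot, click, type, (key), screenshot sequence for one field."""
    actions = [
        {"action": "screenshot"},                                   # see the form
        {"action": "left_click", "coordinate": list(field_coord)},  # focus the field
        {"action": "type", "text": text},                           # enter the text
    ]
    if submit_key:
        # Optional Return or Tab to submit or move to the next field
        actions.append({"action": "key", "text": submit_key})
    actions.append({"action": "screenshot"})                        # verify the input
    return actions


steps = fill_field_actions((500, 300), "Hello, World!", submit_key="Tab")
```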

Special Characters:

# Use the key action for special keys and shortcuts (the key name goes in "text")
action = {"action": "key", "text": "Return"}       # Enter key
action = {"action": "key", "text": "Tab"}          # Tab key
action = {"action": "key", "text": "BackSpace"}    # Delete previous character
action = {"action": "key", "text": "ctrl+a"}       # Select all
action = {"action": "key", "text": "ctrl+c"}       # Copy
action = {"action": "key", "text": "ctrl+v"}       # Paste

Best Practices:

Always click the text field first to focus it
Clear existing text with Ctrl+A and Delete if needed
Use Tab to move between form fields
Press Enter to submit forms
Take screenshots between actions to verify input

Task 4: Keyboard Control & Special Keys

Keyboard actions provide precise control over keys, shortcuts, and special inputs beyond text typing.

When to Use:

Pressing keyboard shortcuts (Ctrl+C, Ctrl+V, etc.)
Using special keys (Enter, Tab, Escape, arrows)
Navigating menus and dialogs with keyboard
Using application-specific hotkeys
Controlling focus and navigation

How It Works: The key action presses keyboard keys or combinations. It can send single keys or key combinations (like Ctrl+A).

Common Keys:

# Navigation
{"action": "key", "text": "Return"}      # Enter key
{"action": "key", "text": "Tab"}         # Tab to next field
{"action": "key", "text": "BackSpace"}   # Delete previous character
{"action": "key", "text": "Delete"}      # Delete forward
{"action": "key", "text": "Escape"}      # Escape/Cancel

# Arrows
{"action": "key", "text": "Up"}
{"action": "key", "text": "Down"}
{"action": "key", "text": "Left"}
{"action": "key", "text": "Right"}

# Shortcuts
{"action": "key", "text": "ctrl+a"}      # Select all
{"action": "key", "text": "ctrl+c"}      # Copy
{"action": "key", "text": "ctrl+v"}      # Paste
{"action": "key", "text": "ctrl+z"}      # Undo
{"action": "key", "text": "ctrl+s"}      # Save
{"action": "key", "text": "alt+Tab"}     # Switch windows
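
When executing a key action, the combination string usually has to be split into modifiers and a final key before it reaches your input backend. The sketch below assumes the `+`-separated form used in the examples above; `split_key_combo` is a hypothetical helper.

```python
def split_key_combo(combo):
    """Split 'ctrl+shift+s' into (['ctrl', 'shift'], 's').

    Modifiers are lowercased; the final key keeps its case (e.g. 'Tab').
    """
    parts = [p.strip() for p in combo.split("+") if p.strip()]
    if not parts:
        raise ValueError("empty key combination")
    modifiers = [p.lower() for p in parts[:-1]]
    return modifiers, parts[-1]


modifiers, key = split_key_combo("alt+Tab")
```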

Key Holding:

{"action": "hold_key", "text": "shift", "duration": 2}  # Hold Shift down for 2 seconds

Best Practices:

Use keyboard shortcuts when available
Tab through form fields instead of clicking each one
Use Escape to close dialogs or cancel operations
Combine arrow keys for navigation
Use Ctrl+A before typing to replace selected text

Task 5: Using the Zoom Tool (Opus 4.5 Exclusive)

The zoom tool is a powerful feature exclusive to Claude Opus 4.5 that lets you inspect specific regions of the screen at full resolution, enabling precise element location and visual analysis.

What It Does: The zoom tool captures a rectangular region of the screen and returns it at full resolution without downscaling. This allows Claude to see fine details, read small text, identify exact element boundaries, and determine precise click coordinates.

When to Use:

Locating small UI elements accurately
Reading fine-print text
Analyzing icon details
Identifying exact button positions
Handling crowded interfaces
Improving coordinate precision for clicks

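How It Works: The zoom action takes a region parameter, [x1, y1, x2, y2], expressed in full-screenshot coordinates. The tool definition must set enable_zoom to true for the action to be available. In the region: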

(x1, y1) = top-left corner of region
(x2, y2) = bottom-right corner of region

Basic Example:

# Zoom into a specific region to see details
action = {
    "type": "zoom",
    "coordinate": [400, 200, 800, 400]  # [x1, y1, x2, y2]
}

Zoom Workflow Example:

# 1. Take full screenshot to understand layout
{"action": "screenshot"}

# 2. Identify region with uncertain element location
# Need to find exact position of "Submit" button

# 3. Zoom into that region for precise view
{"action": "zoom", "region": [300, 350, 700, 450]}

# 4. With precise view, identify exact coordinates
# See "Submit" button at pixel position [550, 385]

# 5. Click with confidence
{"action": "left_click", "coordinate": [550, 385]}

Region Selection:

Small regions (50x50) for individual elements
Medium regions (200x200) for control groups
Larger regions up to full screen
Leave sufficient margins around target element

Vision Accuracy Benefits:

Opus 4.5's improved vision can read text more accurately
Zoom provides full resolution for detail inspection
Better at identifying element boundaries
Helps with crowded or complex UIs
Reduces click coordinate errors

Best Practices:

Use zoom when initial screenshot doesn't clearly show element
Zoom into area 20-30 pixels beyond element on all sides
Express the region in the same coordinate space as the full screenshot
Zoom only when precision is critical
Combine with screenshots for optimal efficiency
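
The margin advice above can be applied mechanically. The sketch below pads an element's bounding box and clamps it to the display, assuming the box is in full-screenshot coordinates; `padded_zoom_region` is a hypothetical helper, and the 25-pixel default simply sits in the 20-30 pixel range suggested above.

```python
def padded_zoom_region(box, display_size, margin=25):
    """Grow (x1, y1, x2, y2) by `margin` pixels per side, clamped to the display."""
    x1, y1, x2, y2 = box
    width, height = display_size
    return [
        max(0, x1 - margin),
        max(0, y1 - margin),
        min(width, x2 + margin),
        min(height, y2 + margin),
    ]


# Region around a button at x 500..600, y 380..400 on a 1024x768 display
region = padded_zoom_region((500, 380, 600, 400), (1024, 768))
```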

Task 6: Multi-Step Automation Workflows

Complex automation requires coordinating multiple actions across steps - this task covers orchestrating sophisticated workflows.

When to Use:

Multi-application workflows
Complex data entry processes
Testing procedures with multiple steps
Sequential automation tasks
Conditional workflows (if this, then that)

How It Works: Agent loops execute sequences of actions, using screenshots to understand results and determine next steps. The loop continues until the workflow is complete.

Basic Agent Loop Pattern:

# Illustrative action sequence; in a real loop Claude issues each action
# after seeing the result of the previous one
actions = []

# Step 1: Take screenshot to see initial state
actions.append({"action": "screenshot"})

# Step 2: Analyze screenshot and click button
actions.append({"action": "left_click", "coordinate": [100, 50]})

# Step 3: Take screenshot to see result
actions.append({"action": "screenshot"})

# Step 4: Fill form based on new state
actions.append({"action": "left_click", "coordinate": [200, 200]})
actions.append({"action": "type", "text": "Form data"})

# Step 5: Submit
actions.append({"action": "left_click", "coordinate": [200, 300]})

# Step 6: Verify with screenshot
actions.append({"action": "screenshot"})
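
The loop itself can be sketched independently of the API client. The version below is a sketch under stated assumptions: `send` and `execute` are callables you supply (hypothetical names). `send(messages)` returns a response whose `content` is a list of blocks with a `type` attribute, and `execute(tool_input)` performs one action and returns tool_result content.

```python
def run_agent_loop(send, execute, messages, max_steps=20):
    """Alternate between the model and the action executor until no tool is requested.

    send(messages)      -> response object with a .content list of blocks
    execute(tool_input) -> content for the tool_result (string or list of blocks)
    """
    for _ in range(max_steps):
        response = send(messages)
        messages.append({"role": "assistant", "content": response.content})

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                tool_results.append(
                    {
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": execute(block.input),
                    }
                )

        if not tool_results:
            # No further actions requested: the workflow is complete.
            return messages

        messages.append({"role": "user", "content": tool_results})

    # Step budget exhausted; the caller decides how to handle this.
    return messages
```

The `max_steps` cap is a design choice to stop a runaway loop; pick a value that fits the longest workflow you expect.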

Error Recovery:

# If a click misses or action fails:
# 1. Take screenshot
# 2. Re-evaluate coordinates
# 3. Retry with adjusted position
# 4. Use zoom for precision if needed
# 5. Continue workflow
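
For scripted replays, the retry steps above can be approximated as follows. This is a rough sketch: the fixed pixel offsets are a crude stand-in for re-evaluating coordinates from a fresh screenshot, and `click` and `verify` are hypothetical callables you would supply (`verify` would typically take a new screenshot and check for the expected change).

```python
DEFAULT_OFFSETS = ((0, 0), (0, 5), (0, -5), (5, 0), (-5, 0))


def click_with_retry(click, verify, coord, offsets=DEFAULT_OFFSETS):
    """Click at coord, then at nearby points, until verify() reports success.

    Returns the coordinate that worked, or None if every attempt failed.
    """
    for dx, dy in offsets:
        attempt = (coord[0] + dx, coord[1] + dy)
        click(attempt)
        if verify():
            return attempt
    return None
```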

Workflow State Tracking:

Track which steps are complete
Remember important data extracted
Maintain context about current application state
Use screenshots as state checkpoints
Save intermediate results for verification
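
One way to hold this state is a small dataclass. The sketch below is illustrative only; the class name and fields are assumptions, and checkpoints are stored as whatever screenshot representation you already use (for example, base64 strings).

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowState:
    steps: list                                       # ordered step names
    completed: list = field(default_factory=list)     # steps finished so far
    extracted: dict = field(default_factory=dict)     # data pulled from the screen
    checkpoints: dict = field(default_factory=dict)   # step name -> screenshot

    def mark_done(self, step, screenshot=None):
        """Record a finished step and, optionally, its checkpoint screenshot."""
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        if step not in self.completed:
            self.completed.append(step)
        if screenshot is not None:
            self.checkpoints[step] = screenshot

    def remaining(self):
        """Steps not yet completed, in their original order."""
        return [s for s in self.steps if s not in self.completed]


state = WorkflowState(steps=["open_form", "fill_form", "submit"])
state.mark_done("open_form", screenshot="<base64 image>")
state.extracted["order_id"] = "A-1001"  # example value only
```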

Best Practices:

Take screenshots at workflow boundaries
Verify each major step with feedback
Handle unexpected states gracefully
Use try/except-style patterns to catch and recover from errors
Log important transitions for debugging

Task 7: Application Control & System Interaction

Beyond UI clicks, computer use enables controlling applications, navigating system interfaces, and performing system-level tasks.

When to Use:

Window/application navigation
File system interaction (opening files, folders)
System settings configuration
Application launching and management
Multi-window workflows

How It Works: Standard mouse/keyboard operations work with any application:

Clicking desktop, taskbar, menu items
Opening file dialogs and navigating folders
Using File/Edit/View menus
Performing system-level operations
Managing multiple windows

Application Navigation Example:

# Open application menu
{"action": "left_click", "coordinate": [500, 10]}  # Menu bar

# Take screenshot to see menu
{"action": "screenshot"}

# Click menu item
{"action": "left_click", "coordinate": [520, 100]}

# Take another screenshot to confirm the dialog opened
{"action": "screenshot"}

# Interact with dialog
{"action": "left_click", "coordinate": [300, 300]}

File System Navigation:

# Open File > Open dialog with keyboard shortcut
{"action": "key", "text": "ctrl+o"}

# Take screenshot to see file dialog
{"action": "screenshot"}

# Navigate to folder (multiple methods)
# Method 1: Type path directly
{"action": "type", "text": "/path/to/folder"}

# Method 2: Double-click folders in explorer
{"action": "double_click", "coordinate": [400, 200]}

# Select and open file
{"action": "left_click", "coordinate": [400, 250]}
{"action": "key", "text": "Return"}

Multi-Window Workflows:

# Switch between windows
{"action": "key", "text": "alt+Tab"}

# Take screenshot to verify window
{"action": "screenshot"}

# Interact with new window
{"action": "left_click", "coordinate": [500, 400]}

Best Practices:

Use keyboard shortcuts when available
Navigate menus through visual screenshots
Handle different menu layouts gracefully
Use file dialog navigation carefully
Take screenshots between window switches

Quick Start Example

Here's a complete example of a simple automation workflow - filling out a web form:

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Define the computer use tool
tools = [
    {
        "name": "computer",
        "type": "computer_20251124",  # Opus 4.5 version with zoom
        "display_width_px": 1024,
        "display_height_px": 768,
        "display_number": 1,
        "enable_zoom": True,  # Needed for the zoom action
    }
]

# Describe the task and ask Claude to begin with a screenshot
messages = [
    {
        "role": "user",
        "content": (
            "Fill out the contact form with name 'John Doe' and email "
            "'john@example.com', then submit it. "
            "First, take a screenshot to see the current state of the screen."
        ),
    }
]

# Computer use is a beta feature: use the beta endpoint with the beta flag
response = client.beta.messages.create(
    model="claude-opus-4-5-20251101",
    max_tokens=1024,
    tools=tools,
    messages=messages,
    betas=["computer-use-2025-11-24"],
)

# Process the response - it will contain tool use actions
for content_block in response.content:
    if content_block.type == "tool_use":
        tool_input = content_block.input
        action = tool_input["action"]

        # Execute the action and get result
        if action == "screenshot":
            # Return screenshot (base64 encoded)
            result = capture_screenshot()  # Your implementation
        elif action == "left_click":
            # Execute click
            result = click_at_coordinates(tool_input["coordinate"])  # Your implementation
        elif action == "type":
            # Type text
            result = type_text(tool_input["text"])  # Your implementation
        else:
            raise NotImplementedError(f"Unhandled action: {action}")

        # Continue with the agent loop
        # ... (add result back to messages as a tool_result and continue)
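
The if/elif chain above grows quickly as more actions are supported. A dispatch table is one alternative, sketched below under the assumption that each handler takes the tool input dict and returns tool_result content; `run_action` and the handler names are hypothetical. Failures are reported back with `is_error` so Claude can react instead of the loop crashing.

```python
def run_action(tool_use_id, tool_input, handlers):
    """Look up and run the handler for one requested action.

    handlers maps an action name (e.g. "left_click") to a callable that
    takes the tool input dict and returns tool_result content.
    """
    action = tool_input.get("action")
    handler = handlers.get(action)
    if handler is None:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"Unsupported action: {action}",
            "is_error": True,
        }
    try:
        content = handler(tool_input)
    except Exception as exc:  # report the failure to Claude rather than raising
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"{action} failed: {exc}",
            "is_error": True,
        }
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
```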

API Reference

For detailed API specifications, parameters, response formats, and advanced usage patterns, see:

references/computer-use-api.md - Complete API documentation with all tool versions and action types

Advanced Patterns

For sophisticated automation patterns, multi-step workflows, zoom tool techniques, and best practices:

references/advanced-patterns.md - Advanced automation, error handling, and optimization

Security & Deployment

For security considerations, safe deployment practices, and operational guidelines:

references/security-deployment.md - Security, containerization, monitoring, and responsible use

Security & Limitations

Security Considerations:

Isolation: Deploy in isolated virtual machines or containers with minimal privileges
Network Control: Restrict internet access via domain allowlists
Credentials: Avoid providing sensitive credentials unless absolutely necessary
Confirmation: Request human approval for significant decisions
Input Validation: Validate and sanitize user inputs to reduce the risk of prompt injection
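
A domain allowlist check might look like the sketch below. It is illustrative only: real enforcement belongs at the network layer (a proxy or firewall), and `is_allowed` and the example domains are assumptions.

```python
from urllib.parse import urlsplit


def is_allowed(url, allowed_domains):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


ALLOWED = {"example.com", "internal.example.org"}
print(is_allowed("https://docs.example.com/page", ALLOWED))
```

Matching on the parsed hostname, rather than searching the raw URL string, avoids accepting look-alike hosts such as `example.com.evil.net`.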

Known Limitations:

Latency: Not suitable for time-sensitive interactive tasks
Vision Accuracy: Computer vision may misidentify elements or coordinates
Application Support: Spreadsheets and specialized applications can be unreliable
Account Management: Cannot reliably create accounts or share content on social platforms
Prompt Injection: Vulnerable to prompt injection in web-based environments
Resolution: Recommended maximum 1280x800 resolution
Token Cost: Screenshots consume tokens due to vision processing
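
To get a feel for the token cost, Anthropic's vision documentation gives a rough estimate of width times height divided by 750 tokens per image. The sketch below applies that approximation; actual usage can differ, for example when an image is resized first.

```python
def estimate_image_tokens(width_px, height_px):
    """Approximate tokens for one image: (width * height) / 750."""
    return round(width_px * height_px / 750)


for size in [(1024, 768), (1280, 800)]:
    print(size, estimate_image_tokens(*size))
```

By this estimate a workflow that takes twenty 1024x768 screenshots spends roughly 21,000 input tokens on images alone.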

Related Skills

anthropic-expert - Overview of Claude computer use and tool use capabilities
claude-opus-4-5-guide - Opus 4.5 features including zoom tool enhancements
multi-ai-research - Research patterns for investigating third-party applications

Learn More: Start with the Quick Start example above, then explore the reference guides for advanced patterns and complete API documentation.
