GUI Agent Mobile Skill

This skill wraps the gui_agent_skill CLI so that Codex can execute complex Android GUI workflows.

When To Use

New task: python -m gui_agent_skill.cli execute --task "<task>" [--provider <provider>] [--device-id <id>] [--max-steps <n>] [--timeout-sec <sec>] [--stateless]
Continue task: python -m gui_agent_skill.cli continue [--session-id <id>] [--reply "<text>"] [--task "<task>"] [--device-id <id>] [--max-steps <n>] [--timeout-sec <sec>]
Status: python -m gui_agent_skill.cli status [--device-id <id>]
Providers: python -m gui_agent_skill.cli providers
Direct coordinate tap (no model planning): python -m gui_agent_skill.cli tap --x <x> --y <y> [--coord-space auto|pixel|ratio] [--device-id <id>] [--post-delay-ms <ms>] [--timeout-sec <sec>]
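
The execute command above can be assembled and run from Python. This is a minimal sketch, not the skill's own wrapper: the helper names build_execute_argv and run_cli are hypothetical, and it assumes the CLI prints a single JSON document to stdout. Only the flags listed above are used.

```python
import json
import subprocess
import sys
from typing import Any, Dict, List, Optional

CLI_MODULE = "gui_agent_skill.cli"


def build_execute_argv(
    task: str,
    provider: Optional[str] = None,
    device_id: Optional[str] = None,
    max_steps: Optional[int] = None,
    timeout_sec: Optional[int] = None,
    stateless: bool = False,
) -> List[str]:
    """Build the argv list for `execute` (hypothetical helper)."""
    argv = [sys.executable, "-m", CLI_MODULE, "execute", "--task", task]
    if provider is not None:
        argv += ["--provider", provider]
    if device_id is not None:
        argv += ["--device-id", device_id]
    if max_steps is not None:
        argv += ["--max-steps", str(max_steps)]
    if timeout_sec is not None:
        argv += ["--timeout-sec", str(timeout_sec)]
    if stateless:
        argv.append("--stateless")
    return argv


def run_cli(argv: List[str]) -> Dict[str, Any]:
    """Run the CLI and parse its output.

    Assumption: the result is one JSON object on stdout. If it is not,
    json.loads raises json.JSONDecodeError, which the caller should report.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    return json.loads(proc.stdout)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when the task text contains spaces or quotes.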

Fallback when module import fails:

Always parse the returned JSON and report the value of success.
Preserve and surface session_id for follow-up turns.
Respect timeout controls: pass --timeout-sec for bounded runtime and check timed_out in error responses.
When terminated_subprocesses is present, report that forced cleanup happened (timeout/interruption/tail cleanup).
Use next_action to drive interaction:
- continue: proceed with next step
- needs_reply: ask user for explicit reply content
- complete: close task
Include caption and screenshot_path when available.
Check session_mode and continuation_supported:
- session_mode=stateful: normal execute -> continue
- session_mode=stateless: do not call continue; run a new execute --stateless instead
If error=tap_only_mode_enabled, switch to tap/click; do not retry execute/continue.
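
The rules above amount to a small decision table over the response fields. The sketch below is one way to encode it; the function name plan_follow_up and its return labels are hypothetical, and the field types (booleans for timed_out and continuation_supported, strings for the rest) are assumptions rather than a documented schema.

```python
from typing import Any, Dict


def plan_follow_up(resp: Dict[str, Any]) -> str:
    """Map a parsed CLI response to the caller's next move (illustrative labels)."""
    # Tap-only installs reject execute/continue; do not retry them.
    if resp.get("error") == "tap_only_mode_enabled":
        return "use_tap"
    # Surface timeouts explicitly instead of silently retrying.
    if resp.get("timed_out"):
        return "report_timeout"

    action = resp.get("next_action")
    if action == "complete":
        return "close_task"
    if action == "needs_reply":
        return "ask_user"
    if action == "continue":
        # Stateless sessions cannot be continued; start a new execute --stateless.
        if resp.get("session_mode") == "stateless" or resp.get("continuation_supported") is False:
            return "new_stateless_execute"
        return "continue_session"
    # Unknown or missing next_action: report what came back and stop.
    return "report_and_stop"
```

Whatever branch is taken, session_id, caption, screenshot_path, and terminated_subprocesses should still be reported when they are present.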

Use two complementary modes based on task complexity:

Direct execution mode (default): the GUI Agent receives a single complex task and carries out its multiple actions/clicks itself.
Planner-controlled mode (for complex global tasks): Codex/Claude acts as the planner and the GUI Agent acts as the executor.

When to switch to planner-controlled mode:

Planner-controlled workflow:

Start one global session with execute, then iteratively use continue.
Planner inspects each new screenshot/state and decides the next micro-steps.
Executor receives explicit, concrete commands (UI element identity, relative position, row/column/layer description, buttons, sequence) and performs them.
Repeat inspect -> plan -> execute until task completion.
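
The inspect -> plan -> execute loop above can be sketched as follows. The three callables are injected and hypothetical: execute and cont stand in for calls to the execute and continue commands, and plan stands in for the planner's reasoning over the latest response. The stop conditions assume the next_action values described earlier.

```python
from typing import Any, Callable, Dict, List

Response = Dict[str, Any]


def run_planner_loop(
    task: str,
    execute: Callable[[str], Response],
    cont: Callable[[str, str], Response],
    plan: Callable[[Response], str],
    max_turns: int = 10,
) -> List[Response]:
    """Drive one global session: execute once, then continue until done."""
    resp = execute(task)  # start the single global session
    history = [resp]
    for _ in range(max_turns):
        # Stop when the task is complete or the user must supply a reply.
        if resp.get("next_action") in ("complete", "needs_reply"):
            break
        # Planner inspects the latest state and writes a concrete instruction.
        instruction = plan(resp)
        # Executor performs it within the same session.
        resp = cont(resp.get("session_id"), instruction)
        history.append(resp)
    return history
```

The max_turns bound is an addition of this sketch; it keeps the planner from looping indefinitely if the executor never reports completion.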

Direct coordinate mode:

Use tap only when the user explicitly asks for coordinate-based control.
This path skips adapter/model planning and sends adb shell input tap directly.
Prefer --coord-space ratio when the user gives normalized coordinates, or auto for mixed input.
After each tap, inspect returned screenshot_path and coordinate fields before the next action.
This is the only available control path when tap_only_mode=true (for example: installed with python install.py --tap-only).

Use stateless mode for short, incremental actions where each call must start a new conversation without resetting the phone environment:

python -m gui_agent_skill.cli execute --task "<task>" --stateless [--device-id <id>] [--provider <provider>]

Behavior:

This pattern is generic and applies to games and non-game global workflows alike.

Instruction style requirement in planner-controlled mode:

Do not use coordinate-based commands.
Use semantic location language (for example: "top row middle grass tile", "leftmost tile in the second row", "bottom toolbar shuffle button").
To improve efficiency, the planner can issue one or more semantic actions per turn.

Instruction style requirement in direct coordinate mode:

Coordinate commands are allowed.
Verify coordinate conversion using returned coordinate.screen_size, coordinate.computed, and coordinate.tap.
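
For a quick sanity check of a ratio-space tap, the expected pixel position can be computed independently and compared against the returned coordinate fields. This sketch assumes that ratio coordinates lie in [0, 1] and scale linearly with the screen width and height; the CLI's actual rounding and clamping may differ, and the function name ratio_to_pixel is hypothetical.

```python
from typing import Tuple


def ratio_to_pixel(x_ratio: float, y_ratio: float, width: int, height: int) -> Tuple[int, int]:
    """Convert normalized coordinates to pixels (assumed linear scaling)."""
    if not (0.0 <= x_ratio <= 1.0 and 0.0 <= y_ratio <= 1.0):
        raise ValueError("ratio coordinates are expected to be within [0, 1]")
    # Clamp to the last valid pixel so a ratio of 1.0 stays on screen
    # (an assumption of this sketch, not documented CLI behavior).
    x = min(int(round(x_ratio * width)), width - 1)
    y = min(int(round(y_ratio * height)), height - 1)
    return x, y
```

A difference of more than a pixel or two between this estimate and the reported tap position suggests the wrong --coord-space was used.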

execute/continue can operate real devices; confirm the user's intent before risky actions.
If a command fails, check ADB connectivity first; then check provider configuration unless running in tap-only mode.
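
One way to check ADB connectivity is to run adb devices and look for the target serial in the "device" state. The sketch below parses that command's text output; the helper names are hypothetical, and it assumes adb is on PATH.

```python
import subprocess
from typing import Dict


def parse_adb_devices(output: str) -> Dict[str, str]:
    """Parse `adb devices` output into {serial: state}."""
    devices: Dict[str, str] = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the header line, and adb daemon status lines ("* daemon ...").
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def list_adb_devices() -> Dict[str, str]:
    """Run `adb devices` and return the parsed result."""
    proc = subprocess.run(["adb", "devices"], capture_output=True, text=True)
    return parse_adb_devices(proc.stdout)
```

A serial listed as "unauthorized" or "offline" is visible to ADB but not usable; only the "device" state indicates a ready connection.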