GUI Agent Mobile Skill
This skill wraps the gui_agent_skill CLI so that Codex can execute complex Android GUI workflows.
When To Use
- User asks to control an Android phone/emulator UI.
- User asks for multi-step mobile automation with session continuation.
- User asks to inspect current device/app screen state.
Command Workflow
- New task:
  python -m gui_agent_skill.cli execute --task "<task>" [--provider <provider>] [--device-id <id>] [--max-steps <n>] [--timeout-sec <sec>] [--stateless]
- Continue task:
  python -m gui_agent_skill.cli continue [--session-id <id>] [--reply "<text>"] [--task "<task>"] [--device-id <id>] [--max-steps <n>] [--timeout-sec <sec>]
- Status:
  python -m gui_agent_skill.cli status [--device-id <id>]
- Providers:
  python -m gui_agent_skill.cli providers
- Direct coordinate tap (no model planning):
  python -m gui_agent_skill.cli tap --x <x> --y <y> [--coord-space auto|pixel|ratio] [--device-id <id>] [--post-delay-ms <ms>] [--timeout-sec <sec>]
Fallback when module import fails:
  python cli.py execute ...
  python cli.py continue ...
  python cli.py status ...
  python cli.py providers
  python cli.py tap ...
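The execute command above can also be assembled programmatically. The following is a minimal sketch, not part of gui_agent_skill: build_execute_cmd and run_cli are hypothetical helper names, and run_cli assumes the CLI prints a single JSON object on stdout (the Response Handling section implies JSON output, but the exact framing is an assumption).

```python
import json
import subprocess
import sys


def build_execute_cmd(task, provider=None, device_id=None,
                      max_steps=None, timeout_sec=None, stateless=False):
    """Assemble argv for the documented `execute` subcommand (hypothetical helper)."""
    # sys.executable stands in for `python` so the same interpreter is reused.
    cmd = [sys.executable, "-m", "gui_agent_skill.cli", "execute", "--task", task]
    if provider is not None:
        cmd += ["--provider", provider]
    if device_id is not None:
        cmd += ["--device-id", device_id]
    if max_steps is not None:
        cmd += ["--max-steps", str(max_steps)]
    if timeout_sec is not None:
        cmd += ["--timeout-sec", str(timeout_sec)]
    if stateless:
        cmd.append("--stateless")
    return cmd


def run_cli(cmd):
    """Run the command and parse stdout as JSON (assumed output format)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return json.loads(proc.stdout)
```

Passing the arguments as a list avoids shell quoting problems when the task text contains spaces or quotes.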
Response Handling
- Always parse returned JSON and report success.
- Preserve and surface session_id for follow-up turns.
- Respect timeout controls: pass --timeout-sec for bounded runtime and check timed_out in error responses.
- When terminated_subprocesses is present, report that forced cleanup happened (timeout/interruption/tail cleanup).
- Use next_action to drive interaction:
  - continue: proceed with next step
  - needs_reply: ask user for explicit reply content
  - complete: close task
- Include caption and screenshot_path when available.
- Check session_mode and continuation_supported:
  - session_mode=stateful: normal execute -> continue
  - session_mode=stateless: do not call continue; run a new execute --stateless instead
- If error=tap_only_mode_enabled, switch to tap/click; do not retry execute/continue.
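The handling rules above can be sketched as a single dispatch function over the parsed response. This is an illustration under assumptions: the field names come from this section, but their types (success and timed_out as booleans, error as a string) are assumed, and decide_next_step and its return labels are hypothetical.

```python
def decide_next_step(response):
    """Map a parsed CLI response (dict) to a (step, detail) tuple.

    Step labels are hypothetical; field types are assumptions.
    """
    # Tap-only installs reject execute/continue, so retrying would fail again.
    if response.get("error") == "tap_only_mode_enabled":
        return ("use_tap", None)
    if not response.get("success", False):
        reason = "timed_out" if response.get("timed_out") else response.get("error")
        return ("report_failure", reason)
    action = response.get("next_action")
    if action == "complete":
        return ("done", None)
    if action == "needs_reply":
        # Caller should ask the user for explicit reply content.
        return ("ask_user", response.get("session_id"))
    if action == "continue":
        # Stateless sessions cannot be continued; start a fresh execute instead.
        if (response.get("session_mode") == "stateless"
                or response.get("continuation_supported") is False):
            return ("execute_stateless", None)
        return ("continue", response.get("session_id"))
    return ("report_failure", "unknown next_action")
```

Checking the tap-only error before anything else keeps the caller from retrying a path that cannot succeed.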
Execution Modes (Direct vs Planner-Controlled)
Use two complementary modes based on task complexity:
- Direct execution mode (default): GUI Agent can receive and execute a single complex task with multiple actions/clicks.
- Planner-controlled mode (for complex global tasks): Codex/Claude acts as planner and GUI Agent acts as executor.
When to switch to planner-controlled mode:
- Long-horizon tasks with many dependent steps.
- High-branching tasks where each screen state changes the next action.
- Tasks that need precise, low-risk, step-by-step control.
Planner-controlled workflow:
- Start one global session with execute, then iteratively use continue.
- Planner inspects each new screenshot/state and decides the next micro-steps.
- Executor receives explicit, concrete commands (UI element identity, relative position, row/column/layer description, buttons, sequence) and performs them.
- Repeat inspect -> plan -> execute until task completion.
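The inspect -> plan -> execute cycle above might look like the loop below. It is a sketch only: planner_loop, run_cli, and plan_next are hypothetical names, the callables are injected so the loop does not depend on a real device, and passing each planner instruction through the --task flag of continue (rather than --reply) is an assumption.

```python
def planner_loop(run_cli, plan_next, task, max_turns=10):
    """Drive one global session: execute once, then continue until complete.

    run_cli(args) -> dict    : runs the CLI with these arguments, returns parsed JSON
    plan_next(response) -> str or None : next semantic instruction, or None to stop
    """
    response = run_cli(["execute", "--task", task])
    history = [response]
    for _ in range(max_turns):
        if response.get("next_action") == "complete":
            break
        # Planner inspects the latest state (caption, screenshot_path, ...).
        instruction = plan_next(response)
        if instruction is None:
            break
        args = ["continue", "--task", instruction]
        session_id = response.get("session_id")
        if session_id:
            args += ["--session-id", session_id]
        response = run_cli(args)
        history.append(response)
    return history
```

The max_turns bound is a safety stop so a planner that never reaches completion cannot loop forever.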
Direct coordinate mode:
- Use tap only when the user explicitly asks for coordinate-based control.
- This path skips adapter/model planning and sends adb shell input tap directly.
- Prefer --coord-space ratio when user gives normalized coordinates, or auto for mixed input.
- After each tap, inspect returned screenshot_path and coordinate fields before the next action.
- This is the only available control path when tap_only_mode=true (for example: installed with python install.py --tap-only).
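Choosing a --coord-space value for the tap command can be sketched as follows. The heuristic is an assumption, not the CLI's own logic: values in [0, 1] are treated as normalized, anything else as pixels, and a mix falls back to auto. choose_coord_space and build_tap_cmd are hypothetical helpers.

```python
import sys


def choose_coord_space(x, y):
    """Pick a --coord-space value from the input shape (heuristic, an assumption).

    Note: pixel taps at (0, 0) or (1, 1) would be misread as ratios here.
    """
    x_norm = 0.0 <= x <= 1.0
    y_norm = 0.0 <= y <= 1.0
    if x_norm and y_norm:
        return "ratio"
    if not x_norm and not y_norm:
        return "pixel"
    return "auto"  # one value looks normalized, the other does not


def build_tap_cmd(x, y, device_id=None, post_delay_ms=None, timeout_sec=None):
    """Assemble argv for the documented `tap` subcommand (hypothetical helper)."""
    cmd = [sys.executable, "-m", "gui_agent_skill.cli", "tap",
           "--x", str(x), "--y", str(y),
           "--coord-space", choose_coord_space(x, y)]
    if device_id is not None:
        cmd += ["--device-id", device_id]
    if post_delay_ms is not None:
        cmd += ["--post-delay-ms", str(post_delay_ms)]
    if timeout_sec is not None:
        cmd += ["--timeout-sec", str(timeout_sec)]
    return cmd
```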
Stateless Mode
Use stateless mode for short, incremental actions where each call must start a new conversation without resetting the phone environment:
python -m gui_agent_skill.cli execute --task "<task>" --stateless [--device-id <id>] [--provider <provider>]
Behavior:
- Starts a fresh adapter conversation for each call.
- Skips local session persistence in gui_agent_skill.
- Keeps current app/screen context (no forced Home reset in local/gelab path).
- Best for minimal one-turn tasks.
This pattern is generic and applies to games and non-game global workflows alike.
Instruction style requirement in planner-controlled mode:
- Do not use coordinate-based commands.
- Use semantic location language (for example: "top row middle grass tile", "leftmost tile in the second row", "bottom toolbar shuffle button").
- To improve efficiency, planner can issue one or multiple semantic actions in one turn.
Instruction style requirement in direct coordinate mode:
- Coordinate commands are allowed.
- Verify coordinate conversion using returned coordinate.screen_size, coordinate.computed, and coordinate.tap.
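One way to check a ratio-space tap against the returned coordinate object is sketched below. The shapes are assumptions, since this document does not specify them: coordinate.screen_size is taken to be [width, height] and coordinate.tap to be [x, y] in pixels. coordinate.computed is left out because its layout is not described here. expected_pixels and tap_matches are hypothetical helpers.

```python
def expected_pixels(ratio_x, ratio_y, screen_size):
    """Convert normalized coordinates to pixels (assumes screen_size = [width, height])."""
    width, height = screen_size
    return (round(ratio_x * width), round(ratio_y * height))


def tap_matches(ratio_x, ratio_y, coordinate, tolerance_px=1):
    """Compare the reported tap against the expected pixel position.

    Assumes coordinate["tap"] = [x, y] in pixels; tolerance absorbs rounding differences.
    """
    exp_x, exp_y = expected_pixels(ratio_x, ratio_y, coordinate["screen_size"])
    tap_x, tap_y = coordinate["tap"]
    return abs(tap_x - exp_x) <= tolerance_px and abs(tap_y - exp_y) <= tolerance_px
```

For example, on an assumed 1080 x 2400 screen, a ratio tap at (0.5, 0.25) is expected to land near pixel (540, 600).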
Safety Notes
- execute/continue can operate real devices; confirm intent for risky actions.
- If command fails, check ADB connectivity first; then check provider configuration unless running in tap-only mode.
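The ADB connectivity check can be scripted by parsing the output of adb devices. This is a minimal sketch with hypothetical helper names; it parses text only, so the caller is assumed to supply the output, for example from subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout.

```python
def parse_adb_devices(output):
    """Parse `adb devices` output into a {serial: state} dict."""
    devices = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the header line, and adb daemon status lines ("* daemon ...").
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def ready_devices(output):
    """Return serials whose state is 'device' (connected and authorized)."""
    return [serial for serial, state in parse_adb_devices(output).items()
            if state == "device"]
```

An empty result, or a device listed as unauthorized or offline, points to a connection problem rather than a provider misconfiguration.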
Repository
ugorange/gui_agent_skill