browser-automation
Installation
SKILL.md
Browser Automation
CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:
- Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
- Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
- Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex
actcommands may need even longer.- Always report task results before finishing. After completing the automation task, you MUST proactively summarize the results to the user — including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
Automate web browsing using npx -y @midscene/web@1. By default, launches a headless Chrome via Puppeteer that persists across CLI calls — no session loss between commands. Also supports CDP mode and Bridge mode to connect to an existing Chrome browser.
What act Can Do
Inside a single act call in the browser, Midscene can click, right-click, double-click, hover, type or clear text, press keys, scroll, drag, long-press, and continue through multi-step page flows based on what is currently visible. When touch input is enabled, it can also handle swipe- or pinch-style interactions on touch-oriented pages.
When to Use
This skill has three modes. Choose based on the user's intent: