cli-review-runner

Automates the 10-item agent-friendliness audit from cli-for-agents. Runs black-box probes against a target CLI and emits a structured report mapping each finding to a rule ID (e.g., help-examples-in-help, err-non-zero-exit-codes, safe-dry-run-flag). Default mode is read-only - probes never run destructive verbs with real arguments.

When to Apply

User asks to review or audit a CLI for agent-friendliness, automation readiness, or CI use
User has just finished building a CLI and wants a pre-ship sanity check
User is grading their own or a third-party CLI against the cli-for-agents catalog
User is asking why a CLI is hanging an agent, blowing up context, or failing to compose in a pipeline
PR review for a CLI change - quickly regress-test the --help, errors, and dry-run flags

How to Use

The skill is orchestrated by scripts/review.sh. Point it at the target CLI (absolute path or PATH-resolvable name) and pick an output format.

# Default: text table on stdout, exit 0 if all passed, 1 if any failed
bash scripts/review.sh --target /usr/local/bin/mycli

# Machine-readable output
bash scripts/review.sh --target gh --format json
bash scripts/review.sh --target kubectl --format ndjson

# Supply subcommand list when auto-discovery misses them
bash scripts/review.sh --target gh --subcommands pr,issue,repo

# Preview what would run without touching the target CLI
bash scripts/review.sh --target mycli --dry-run

# Include risky probes on destructive verbs (off by default)
bash scripts/review.sh --target mycli --include-destructive

See bash scripts/review.sh --help for the full flag list.

Workflow Overview

--target <cli>
   │
   ▼
[1] Validate target       fail fast if path missing or not executable
   │
   ▼
[2] Load rule catalog     references/rule-catalog.tsv (45 rules)
   │
   ▼
[3] Discover subcommands  parse top-level --help (gh/kubectl/commander shapes)
   │
   ▼
[4] Run probes P1..P10    each probe emits NDJSON findings to a temp file
   │
   ▼
[5] Render report         scripts/render.sh  ->  text | json | ndjson

Read references/workflow.md when you need the full probe-by-probe breakdown, failure modes, and how to extend the catalog.

Probe Coverage

Probe	Rules tested	Coverage
P1 Non-interactive	`interact-no-hang-on-stdin`, `interact-no-input-flag`, `interact-flags-first`, `interact-detect-tty`, `interact-no-timed-prompts`, `interact-no-arrow-menus`, `input-no-prompt-fallback`	Run under `</dev/null` with timeout; inspect help for interactive language
P2 Layered help	`help-per-subcommand`, `help-no-flag-required`, `help-layered-discovery`	Top-level line count; per-subcommand `--help`; zero-arg invocation
P3 Help examples	`help-examples-in-help`, `help-flag-summary`, `help-suggest-next-steps`	Grep each subcommand help for `Examples:`, short+long flag pairs, "See also"
P4 Actionable errors	`err-actionable-fix`, `err-include-example-invocation`, `err-exit-fast-on-missing-required`, `err-no-stack-traces-by-default`	Invoke with bogus flag; grep stderr for fix + example; check for raw stack traces
P5 stderr channeling	`err-stderr-not-stdout`	Error text must land on fd 2
P6 Exit codes	`err-non-zero-exit-codes`	Usage error and runtime error must produce distinct non-zero codes
P7 stdin composition	`input-accept-stdin-dash`, `input-flags-over-positional`	Grep help for `-` stdin convention; count positionals vs flags
P8 Structured output	`output-json-flag`, `output-respect-no-color`	`--json` produces JSON; `NO_COLOR=1` suppresses ANSI
P9 Destructive safety	`safe-dry-run-flag`, `safe-force-bypass-flag`, `safe-no-prompts-with-no-input`	Inspect destructive verbs' `--help` for `--dry-run` / `--yes` / `--force` / `--no-input`
P10 Command structure	`struct-resource-verb`, `struct-standard-flag-names`, `struct-no-hidden-subcommand-catchall`, `struct-flag-order-independent`	Uniform shape, `--help`/`--version` present, unknown subcommand errors, flag position independence

Coverage: 30 of the 45 rules in cli-for-agents are black-box testable. The remaining 15 (idempotency, state reconciliation, NDJSON streaming, bounded output, crash-only recovery, env-var fallback, secret-stdin, confirm-by-typing-name) require either real invocation or source-code inspection - the report lists them as "manual review required".

Configuration

config.json stores the verb classifier lists and default timeout. The skill works without any setup - defaults are reasonable. Override per-invocation via flags or edit the file for project-wide changes.

{
  "timeout_seconds": 5,
  "safe_verbs": ["list", "get", "show", "status", "describe", "help", "version", "config", "ls", "inspect"],
  "destructive_verbs": ["delete", "drop", "destroy", "remove", "reset", "purge", "rm", "del"]
}

Safety Model

All probes are read-only by default:

Safe verbs (list, get, show, ...) may be invoked with bogus flags to test error handling
Destructive verbs (delete, drop, ...) are ONLY inspected via --help - never executed with arguments
Every probe runs with a 5-second wall-clock timeout under </dev/null
The target CLI is sandboxed to its own process; no shell metacharacters in arguments

When --include-destructive is passed, probes may invoke destructive verbs with bogus flags too. This exposes the rare case where a CLI does something destructive before validating flags. Only enable this against CLIs you trust, or in a disposable test environment.

Self-test

Before shipping changes to probes, run the self-test - it generates a mock CLI that deliberately violates specific rules and asserts the probes detect them:

bash scripts/selftest.sh

Expected output: Results: 8 passed, 0 failed. Any failure points at a regression in scripts/lib/probes.sh.

Files

File	Purpose
scripts/review.sh	Main entry point - orchestrates probes, renders the report
scripts/render.sh	Output formatter: text / json / ndjson
scripts/selftest.sh	Sanity check against a deliberately-buggy mock CLI
scripts/lib/common.sh	Shared helpers: timeout, verb classifier, JSON escape, catalog loader
scripts/lib/probes.sh	Probe functions `probe_p1..probe_p10`
references/rule-catalog.tsv	45 rules from cli-for-agents, mapped to probes
references/workflow.md	Detailed probe-by-probe methodology, failure modes, extension guide
gotchas.md	Known edge cases discovered during use
config.json	Verb classifier lists and default timeout

Related Skills

cli-for-agents - the 45-rule design catalog this skill audits against. Read rule files there when the report flags an issue and you need the full explanation.