gspdev-benchmark
Snapshots live in dev/benchmarks/ as JSON files. The workflow is release-centric:
- At each release, capture a baseline:
benchmark.sh release→{version}.release.json - During development, capture snapshots:
benchmark.sh capture [label] - Before PR, compare against release:
benchmark.sh compare(auto-selects release baseline)
This measures the cumulative impact of changes heading into the next release.
Read $ARGUMENTS to determine the mode:
| Input | Mode |
|---|---|
| (no args) | Capture + auto-compare against release baseline |
capture [label] |
Capture with optional label |
release |
Capture as the release baseline for this version |
compare [v1] [v2] |
Compare two snapshots (default: release vs latest) |
trend |
Show trajectory across all snapshots |
Step 1: Execute mode
Default mode (no args)
- Run
bash dev/scripts/benchmark.sh capture - If a release baseline exists, automatically run
benchmark.sh compare - Present the comparison table and analysis
This is the most common invocation — "where are we vs the last release?"
Capture mode
Run bash dev/scripts/benchmark.sh capture [label].
After capture, read the JSON from dev/benchmarks/ and present a summary:
- Total weight and red-zone count
- Top 5 heaviest skills with their breakdown
- Rate limit risk (API conversations, double-dispatch count)
- If a release baseline exists, auto-compare and highlight changes
Release mode
Run bash dev/scripts/benchmark.sh release.
This is part of the publish flow — capture the definitive baseline for this version. All future compare calls will diff against this baseline until the next release.
Remind the user: "Add benchmark.sh release to your /gspdev-publish checklist."
Compare mode
Run bash dev/scripts/benchmark.sh compare [v1] [v2].
Default (no args) compares release baseline → latest capture. After the table output, analyze:
- Flag any skill that moved from green/yellow to red
- Flag any skill whose exec_context increased (potential pass-through regression)
- Summarize net pipeline path changes
- Rate limit risk changes
- Highlight the biggest wins and regressions
Trend mode
Run bash dev/scripts/benchmark.sh trend.
After the table, identify:
- Overall trajectory (improving, stable, regressing)
- Largest single-version improvement
- Skills that consistently stay in red zone
Step 2: Suggest next actions
Based on the results, suggest relevant actions:
- If red-zone skills exist: "Consider optimizing {skill} — {specific suggestion}"
- If exec_context is non-zero for agent-spawning skills: "Pass-through exec_context detected in {skill}"
- If rate limit risk increased: "New double-dispatch skill adds API conversation overhead"
- If tests have failures: "Fix test failures before benchmarking"
- If total weight decreased significantly: celebrate the win with specific numbers
More from jubscodes/get-shit-pretty
get-shit-pretty
Design engineering for AI coding tools. Full pipeline: brand research, strategy, identity, guidelines, UI design, critique, accessibility audit, build, and review. Expertise skills (color, typography, visuals) serve the entire pipeline. 14 specialized agents with Apple HIG, Nielsen's heuristics, WCAG 2.2 AA, and design token standards.
15gsp-visuals
Define visual direction — imagery, 3D, video, textures, and surface treatments
14gsp-accessibility
Quick contrast checks and token WCAG audits — inline, no agent
14gsp-help
Show all skills
14gsp-color
Design color systems — palettes, contrast, semantic mapping, dark mode
14gsp-typography
Design type systems — scale, pairing, fluid type, vertical rhythm
14