benchmark-sandbox
Pass
Audited by Gen Agent Trust Hub on Mar 16, 2026
Risk Level: SAFE | COMMAND_EXECUTION | EXTERNAL_DOWNLOADS
Full Analysis
- Remote Command Execution: The skill orchestrates evaluations within Vercel Sandboxes (ephemeral microVMs) using the sandbox SDK. This allows for isolated execution of build, verify, and deploy pipelines using Claude Code.
- Credential Resolution: To enable authenticated benchmarking, the scripts resolve Anthropic and Vercel credentials from the local environment or macOS Keychain. These secrets are securely passed to the remote sandbox environment to facilitate API access and project deployments.
- Automated Tool Installation: The runner automatically installs required Node.js tools such as the Vercel CLI, Claude Code, and agent-browser within the sandbox. These installations ensure the benchmark environment is pre-configured for complex evaluation scenarios.
- Secret Redaction in Logs: A structured logger is implemented to identify and redact sensitive patterns (like API keys) from benchmark output and logs, helping to prevent accidental exposure of secrets in results or reports.
- Vercel Project Integration: The skill automates project linking and deployment to the vercel-labs organization. It manages the lifecycle of these ephemeral projects, including environment variable synchronization and build error retries.
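The secret redaction finding can be illustrated with a minimal sketch. This is not the skill's actual logger: the function names `redact` and `logEvent`, the `[REDACTED]` placeholder, and the two regular expressions are assumptions made for illustration, and the pattern list the audited skill uses is not shown in this report.

```typescript
// Illustrative patterns only (assumed shapes, not taken from the audited skill).
const SECRET_PATTERNS: RegExp[] = [
  /sk-ant-[A-Za-z0-9_-]{8,}/g,        // keys starting with an "sk-ant-" prefix
  /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g,  // HTTP bearer tokens
];

const PLACEHOLDER = "[REDACTED]";

// Replace known secret values and secret-looking patterns in a string.
function redact(text: string, knownSecrets: string[] = []): string {
  let out = text;
  // First mask exact values the runner already knows (e.g. resolved credentials).
  for (const secret of knownSecrets) {
    if (secret.length > 0) {
      out = out.split(secret).join(PLACEHOLDER);
    }
  }
  // Then mask anything that merely looks like a secret.
  for (const pattern of SECRET_PATTERNS) {
    out = out.replace(pattern, PLACEHOLDER);
  }
  return out;
}

// Emit one structured (JSON) log line with the message redacted.
function logEvent(
  level: "info" | "error",
  message: string,
  knownSecrets: string[] = [],
): string {
  const line = JSON.stringify({ level, message: redact(message, knownSecrets) });
  console.log(line);
  return line;
}
```

Masking exact known values before applying patterns means a credential is hidden even when it does not match any of the assumed shapes. Pattern matching alone would miss tokens with an unfamiliar format.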
Audit Metadata