codex-readiness-integration-test

Warn

Audited by Gen Agent Trust Hub on Feb 17, 2026

Risk Level: MEDIUM (COMMAND_EXECUTION, PROMPT_INJECTION, DATA_EXFILTRATION)
Full Analysis
  • [COMMAND_EXECUTION] (MEDIUM): The scripts/run_plan.py utility executes shell commands derived from a JSON plan using subprocess.Popen(shell=True). These commands are generated by an LLM based on the repository context.
  • Evidence: scripts/run_plan.py (lines 112-121) executes arbitrary strings in a shell environment.
  • Mitigation: A regex-based denylist (DENYLIST_PATTERNS) is implemented to block destructive commands like rm -rf and mkfs. Additionally, SKILL.md specifies a workflow requirement where users must manually approve the prompt and plan before execution.
  • [PROMPT_INJECTION] (LOW): The skill is susceptible to indirect prompt injection. It ingests data from the local repository (such as AGENTS.md and log files), which LLM evaluators then process to determine test success or failure.
  • Mandatory Evidence Chain (Category 8):
      • Ingestion points: scripts/collect_evidence.py reads AGENTS.md and logs/*.log from the current working directory.
      • Boundary markers: No explicit delimiters or instructions to ignore embedded commands are present in the evaluator prompts (references/agentic_loop_eval.md, references/change_quality.md).
      • Capability inventory: The skill can execute shell commands via scripts/run_plan.py and the codex CLI, and it can read/write files within the repository scope.
      • Sanitization: Command execution is restricted by a basic denylist in scripts/run_plan.py, but there is no sanitization of the content ingested into the LLM prompts.
  • [DATA_EXFILTRATION] (LOW): The skill collects comprehensive repository state, including git diffs (with untracked files), directory structures, and logs, into a single evidence.json file. No direct network exfiltration was found in the scripts, but this file aggregates sensitive local data for LLM processing.
  • Evidence: scripts/collect_evidence.py uses git diff and git ls-files --others --exclude-standard to gather repository content.
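
As an illustration of the COMMAND_EXECUTION finding above, the following is a minimal sketch of how a regex denylist might gate plan commands before they reach a shell. It is not the contents of scripts/run_plan.py: the two patterns, the is_denied helper, and the run_step function are hypothetical stand-ins, and only the name DENYLIST_PATTERNS and the use of subprocess.Popen(shell=True) come from the audit.

```python
import re
import subprocess

# Hypothetical patterns for illustration only; the actual DENYLIST_PATTERNS
# in scripts/run_plan.py is not reproduced in this audit and may differ.
DENYLIST_PATTERNS = [
    re.compile(r"\brm\s+-rf\b"),  # recursive forced delete
    re.compile(r"\bmkfs\b"),      # filesystem creation
]


def is_denied(command: str) -> bool:
    """Return True if any denylist pattern matches the command string."""
    return any(pattern.search(command) for pattern in DENYLIST_PATTERNS)


def run_step(command: str) -> tuple[int, str]:
    """Run one plan command in a shell, unless the denylist rejects it.

    Hypothetical helper: raises PermissionError before anything is executed
    when the command matches a denylist pattern.
    """
    if is_denied(command):
        raise PermissionError(f"blocked by denylist: {command!r}")
    proc = subprocess.Popen(
        command,
        shell=True,  # the string is interpreted by the shell, as in the audit
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    output, _ = proc.communicate()
    return proc.returncode, output
```

In this sketch, is_denied("rm -rf /") and is_denied("mkfs.ext4 /dev/sda1") both return True, while is_denied("rm -r -f build") returns False because the flags are split. That kind of gap is one reason a pattern denylist is usually treated as a partial mitigation, which fits the audit's reliance on manual approval of the plan as a second control.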
Audit Metadata
Risk Level: MEDIUM
Analyzed: Feb 17, 2026, 05:05 PM