experiment-code

Verdict: Fail

Audited by Gen Agent Trust Hub on Feb 22, 2026

Risk Level: HIGH
Tags: COMMAND_EXECUTION, REMOTE_CODE_EXECUTION
Full Analysis
  • Command Execution (HIGH): The skill explicitly provides implementation patterns (e.g., Pattern 1 in code-patterns.md) that use subprocess.run(['python', 'experiment.py', ...]) to execute code generated by the agent. This code runs directly on the host environment, with no mention of sandboxing or other safety boundaries.
  • Indirect Prompt Injection (LOW): The skill accepts research plans and ideas as input ($1), which are then interpolated into system prompts (e.g., Prompt 1 in experiment-prompts.md).
  • Ingestion points: Input parameter $1 (research plan/idea) in SKILL.md.
  • Boundary markers: None. The input is directly placed into the {idea} and {title} placeholders in prompt templates.
  • Capability inventory: File system read/write (json.load, open, notes.txt) and arbitrary command execution via subprocess.run in code-patterns.md.
  • Sanitization: There is no logic to sanitize the user input or validate the generated code before it is executed.
  • Dynamic Execution (HIGH): The skill's primary workflow involves the dynamic creation of Python scripts (experiment.py, plot.py) which are subsequently executed and iteratively modified based on runtime errors and performance scores.
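The execution loop the findings above describe can be sketched as follows. The file name `experiment.py` and the unbounded prompt interpolation come from the audit findings; the helper name, the prompt template text, and the benign demo script are illustrative assumptions, not the skill's actual code (the audited pattern invokes the literal `'python'` binary, replaced here with `sys.executable` so the sketch runs anywhere):

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Illustrative template mirroring the audited Prompt 1: the raw $1 input
# is interpolated into the {idea}/{title} placeholders with no boundary
# markers, so a hostile research plan can inject instructions verbatim.
PROMPT_TEMPLATE = "You are a research assistant.\nTitle: {title}\nIdea: {idea}\n"

def run_generated_experiment(generated_code: str) -> str:
    """Pattern 1 as audited: write agent-generated code to experiment.py
    and execute it on the host via subprocess.run -- no sandbox, no
    validation of the generated code before it runs."""
    workdir = Path(tempfile.mkdtemp())
    script = workdir / "experiment.py"
    script.write_text(generated_code)
    result = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True, text=True,
    )
    return result.stdout

# Benign stand-in for agent-generated code; in the audited skill this
# string comes from the model and would run with full host privileges.
demo_code = 'print("experiment ran")'
print(run_generated_experiment(demo_code))
```

Anything the model emits here, including file deletion or network calls, would execute with the agent's full privileges, which is what drives the HIGH rating.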
Recommendations
  • The automated analysis detected serious security threats. Based on the findings above: sandbox or otherwise isolate all subprocess execution of generated code, validate generated scripts before running them, and add boundary markers or sanitization around the $1 input before it is interpolated into prompt templates.
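As a hedged sketch of what a minimal hardening of the audited pattern could look like (this is an illustrative assumption, not the skill's code): run the generated script in a throwaway working directory with a hard timeout, a stripped environment, and Python's isolated mode. This is defense in depth only; it does not block filesystem or network access by itself, and a container, seccomp profile, or VM would be needed for a real sandbox:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_with_basic_containment(generated_code: str, timeout: int = 30) -> str:
    """Execute generated code with basic containment: throwaway cwd,
    empty environment, isolated interpreter, and a hard timeout.
    NOT a real sandbox -- the script can still touch the filesystem
    and network under the invoking user's privileges."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "experiment.py"
        script.write_text(generated_code)
        result = subprocess.run(
            # -I: isolated mode (ignores PYTHON* env vars and user site-packages)
            [sys.executable, "-I", str(script)],
            capture_output=True, text=True,
            cwd=workdir, env={}, timeout=timeout,
        )
        if result.returncode != 0:
            raise RuntimeError(f"generated script failed: {result.stderr}")
        return result.stdout
```

The timeout also bounds the iterative modify-and-rerun loop the skill uses, so a generated script that hangs cannot stall the agent indefinitely.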
Audit Metadata
Risk Level
HIGH
Analyzed
Feb 22, 2026, 05:00 AM