evaluating-code-models

Warn

Audited by Snyk on Mar 28, 2026

Risk Level: MEDIUM
Full Analysis

MEDIUM W011: Third-party content exposure detected (indirect prompt injection risk).

  • Third-party content exposure detected (high risk: 0.90). The skill explicitly instructs loading public, user-provided datasets and models from HuggingFace (e.g., DATASET_PATH and datasets.load_dataset in references/custom-tasks.md and SKILL.md references to openai_humaneval, mbpp, evalplus, and arbitrary "username/dataset-name"), so the agent ingests untrusted third‑party content that is used to build prompts, execute tests, and drive evaluation behavior.

MEDIUM W012: Unverifiable external dependency detected (runtime URL that controls agent).

  • Potentially malicious external URL detected (high risk: 0.90). The harness fetches evaluation datasets at runtime (for example the HumanEval dataset at https://huggingface.co/datasets/openai/openai_humaneval), and those remote datasets directly supply the prompts/tests that control model inputs and are required for running the evaluations.

Issues (2)

W011
MEDIUM

Third-party content exposure detected (indirect prompt injection risk).

W012
MEDIUM

Unverifiable external dependency detected (runtime URL that controls agent).

Audit Metadata
Risk Level
MEDIUM
Analyzed
Mar 28, 2026, 06:08 PM
Issues
2