evaluating-code-models

Warn

Audited by Snyk on Feb 16, 2026

Risk Level: MEDIUM
Full Analysis

MEDIUM W011: Third-party content exposure detected (indirect prompt injection risk).

  • Third-party content exposure detected (high risk: 0.90). The skill explicitly loads and consumes public, user-sourced datasets and content (e.g., datasets.load_dataset(DATASET_PATH) in references/custom-tasks.md and named public datasets like openai_humaneval and mbpp in references/benchmarks.md), so the agent ingests untrusted third‑party content as part of its evaluation/prompting workflow.

MEDIUM W012: Unverifiable external dependency detected (runtime URL that controls agent).

  • Potentially malicious external URL detected (high risk: 0.80). The skill explicitly instructs pulling and running the remote Docker image ghcr.io/bigcode-project/evaluation-harness-multiple at runtime (docker pull / docker run), which fetches and executes remote container code used for evaluation, so this URL is a runtime external dependency that executes remote code: https://ghcr.io/bigcode-project/evaluation-harness-multiple
Audit Metadata
Risk Level
MEDIUM
Analyzed
Feb 16, 2026, 01:00 AM