nemo-evaluator-sdk

Warn

Audited by Gen Agent Trust Hub on Apr 30, 2026

Risk Level: MEDIUMCOMMAND_EXECUTIONEXTERNAL_DOWNLOADSCREDENTIALS_UNSAFEPROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill implements an adapter system that supports custom interceptor discovery, allowing the launcher to load and execute arbitrary Python modules from user-specified directories or module paths provided in the configuration.
  • [COMMAND_EXECUTION]: Through the Framework Definition File (FDF) system, the skill allows for the definition of custom shell commands that are executed via subprocess to run evaluation harnesses.
  • [CREDENTIALS_UNSAFE]: The skill is designed to handle and manage sensitive credentials, including NVIDIA NGC API keys, Hugging Face tokens (HF_TOKEN), and SSH private keys required for remote Slurm cluster communication.
  • [EXTERNAL_DOWNLOADS]: The skill fetches and executes Docker containers from external registries, including NVIDIA's official container registry (nvcr.io) and potentially unverified third-party registries for custom benchmarks.
  • [PROMPT_INJECTION]: The skill acts as a harness that processes untrusted data from over 100 academic benchmarks, presenting an indirect prompt injection attack surface.
  • Ingestion points: Automated ingestion of data from academic harnesses like MMLU, HumanEval, and GPQA Diamond within SKILL.md and configuration files.
  • Boundary markers: The provided documentation and examples do not define explicit delimiters or boundary markers to prevent the model from obeying instructions embedded within benchmark samples.
  • Capability inventory: The environment has extensive capabilities, including subprocess execution (Docker/Slurm), file system access (writing results/logs), and network access (NVIDIA Build, Lepton AI, MLflow, Weights & Biases).
  • Sanitization: No evidence of sanitization or filtering of benchmark content is present in the reference documentation.
Audit Metadata
Risk Level
MEDIUM
Analyzed
Apr 30, 2026, 03:34 PM