nemo-evaluator-sdk
Warn
Audited by Gen Agent Trust Hub on Apr 30, 2026
Risk Level: MEDIUMCOMMAND_EXECUTIONEXTERNAL_DOWNLOADSCREDENTIALS_UNSAFEPROMPT_INJECTION
Full Analysis
- [COMMAND_EXECUTION]: The skill implements an adapter system that supports custom interceptor discovery, allowing the launcher to load and execute arbitrary Python modules from user-specified directories or module paths provided in the configuration.
- [COMMAND_EXECUTION]: Through the Framework Definition File (FDF) system, the skill allows for the definition of custom shell commands that are executed via subprocess to run evaluation harnesses.
- [CREDENTIALS_UNSAFE]: The skill is designed to handle and manage sensitive credentials, including NVIDIA NGC API keys, Hugging Face tokens (HF_TOKEN), and SSH private keys required for remote Slurm cluster communication.
- [EXTERNAL_DOWNLOADS]: The skill fetches and executes Docker containers from external registries, including NVIDIA's official container registry (nvcr.io) and potentially unverified third-party registries for custom benchmarks.
- [PROMPT_INJECTION]: The skill acts as a harness that processes untrusted data from over 100 academic benchmarks, presenting an indirect prompt injection attack surface.
- Ingestion points: Automated ingestion of data from academic harnesses like MMLU, HumanEval, and GPQA Diamond within SKILL.md and configuration files.
- Boundary markers: The provided documentation and examples do not define explicit delimiters or boundary markers to prevent the model from obeying instructions embedded within benchmark samples.
- Capability inventory: The environment has extensive capabilities, including subprocess execution (Docker/Slurm), file system access (writing results/logs), and network access (NVIDIA Build, Lepton AI, MLflow, Weights & Biases).
- Sanitization: No evidence of sanitization or filtering of benchmark content is present in the reference documentation.
Audit Metadata