hugging-face-evaluation

Pass

Audited by Gen Agent Trust Hub on Mar 12, 2026

Risk Level: SAFE
Tags: COMMAND_EXECUTION, EXTERNAL_DOWNLOADS, DATA_EXFILTRATION
Full Analysis
  • Subprocess Command Execution: The skill calls subprocess.run in several scripts (evaluation_manager.py, inspect_eval_uv.py, run_eval_job.py, run_vllm_eval_job.py) to execute CLI tools such as inspect, lighteval, and hf jobs.
      • These commands run evaluation frameworks and submit jobs to Hugging Face infrastructure.
      • The logic builds commands from predefined command lists and validates inputs such as hardware flavors and backend choices to prevent arbitrary command injection.
  • External API and Data Ingestion: The skill fetches data from the Artificial Analysis API (artificialanalysis.ai) and reads model cards via huggingface_hub.
      • These operations use the user-provided AA_API_KEY and HF_TOKEN, read from environment variables.
      • The implementation follows standard API interaction patterns for the Hugging Face ecosystem.
  • Remote Code Execution Surface (Indirect): Several scripts accept a --trust-remote-code flag (e.g., inspect_vllm_uv.py and lighteval_vllm_uv.py).
      • This is a standard Hugging Face Transformers feature, required for models with custom architectures.
      • The skill documentation warns about using this flag and presents it as an opt-in parameter.
  • Credential Handling: The skill handles sensitive tokens (HF_TOKEN, AA_API_KEY) primarily through environment variables and, for remote jobs, through Hugging Face's secrets mechanism.
      • Scripts verify that a token is present before attempting authenticated operations.
  • Indirect Prompt Injection Surface: The extract-readme functionality parses untrusted Markdown content from model READMEs.
      • The skill mitigates this by using the markdown-it-py parser to extract only table structures and numeric values.
      • It includes a manual review step: the extracted YAML is printed to the console for user verification before any changes are applied or pushed.
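
The command-execution and credential findings above describe a common pattern: validate inputs against fixed choices, confirm the token is present, and pass an argument list (not a shell string) to subprocess.run. The following is a minimal sketch of that pattern, not the skill's actual code. The allowlist values, the function names (require_token, build_command, run_job), and the example-eval-cli executable with its flags are all hypothetical placeholders introduced for illustration.

```python
import os
import subprocess

# Hypothetical allowlists; the values the audited scripts accept are not given in this report.
ALLOWED_BACKENDS = {"inspect", "lighteval"}
ALLOWED_FLAVORS = {"cpu-basic", "t4-small", "a10g-small"}


def require_token(name="HF_TOKEN", env=None):
    """Return the named token from the environment, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} is not set; refusing to run an authenticated operation")
    return token


def build_command(backend, flavor, model_id):
    """Build an argv list from validated inputs; no shell string is ever assembled."""
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")
    if flavor not in ALLOWED_FLAVORS:
        raise ValueError(f"unsupported hardware flavor: {flavor!r}")
    # Reject empty values and values that could be mistaken for an option.
    if not model_id or model_id.startswith("-"):
        raise ValueError(f"invalid model id: {model_id!r}")
    # "example-eval-cli" and its flags are placeholders, not a real tool.
    return ["example-eval-cli", backend, "--flavor", flavor, "--model", model_id]


def run_job(backend, flavor, model_id):
    """Check the token, then run the command without a shell."""
    require_token("HF_TOKEN")
    cmd = build_command(backend, flavor, model_id)
    # Passing a list with the default shell=False means arguments are not shell-interpreted.
    return subprocess.run(cmd, check=True)
```

Because each argument is a separate list element and no shell is involved, a value such as "t4-small; rm -rf /" is rejected by the allowlist check rather than being interpreted as a second command.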
Audit Metadata
Risk Level: SAFE
Analyzed: Mar 12, 2026, 12:34 AM