hugging-face-evaluation
Result: Pass
Audited by Gen Agent Trust Hub on Mar 12, 2026
Risk Level: SAFE
Flagged capabilities: COMMAND_EXECUTION, EXTERNAL_DOWNLOADS, DATA_EXFILTRATION
Full Analysis
- Subprocess Command Execution: The skill uses `subprocess.run` across several scripts (`evaluation_manager.py`, `inspect_eval_uv.py`, `run_eval_job.py`, `run_vllm_eval_job.py`) to execute CLI tools like `inspect`, `lighteval`, and `hf jobs`.
  - These commands run evaluation frameworks and submit jobs to Hugging Face infrastructure.
  - The logic uses predefined command lists and validates inputs such as `hardware` flavors and `backend` choices to prevent arbitrary injection.
- External API and Data Ingestion: The skill fetches data from the Artificial Analysis API (artificialanalysis.ai) and model cards via `huggingface_hub`.
  - These operations use the user's provided `AA_API_KEY` and `HF_TOKEN` from environment variables.
  - The implementation follows standard API interaction patterns for the Hugging Face ecosystem.
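The validated-subprocess pattern described above can be sketched roughly as follows; the allowlist values, function name, and argument layout are invented for illustration and are not the skill's actual code.

```python
import subprocess

# Hypothetical allowlists; the real skill defines its own valid values.
ALLOWED_BACKENDS = {"inspect", "lighteval"}
ALLOWED_FLAVORS = {"cpu-basic", "a10g-small", "a100-large"}

def run_eval(backend: str, flavor: str, task: str) -> subprocess.CompletedProcess:
    """Run an evaluation CLI with validated inputs and a predefined argv list."""
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")
    if flavor not in ALLOWED_FLAVORS:
        raise ValueError(f"unsupported hardware flavor: {flavor!r}")
    # Argument list form, no shell=True: arguments are passed directly to the
    # executable and are never re-parsed by a shell, so injection via task
    # names or flavor strings is not possible.
    cmd = [backend, "eval", task, "--flavor", flavor]
    return subprocess.run(cmd, capture_output=True, text=True)
```

Rejecting unknown values up front, rather than escaping them, is what makes the predefined-command-list approach robust: only strings from the fixed sets ever reach the command line.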
- Remote Code Execution Surface (Indirect): Several scripts include a `--trust-remote-code` flag (e.g., in `inspect_vllm_uv.py` and `lighteval_vllm_uv.py`).
  - This is a standard Hugging Face Transformers feature required for models with custom architectures.
  - The skill documentation warns about using this flag and positions it as an opt-in parameter for the user.
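A minimal sketch of how such an opt-in flag is typically wired up with `argparse` (the flag name matches the report; the surrounding CLI is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser(description="Run a vLLM-backed evaluation")
parser.add_argument("--model", required=True)
parser.add_argument(
    "--trust-remote-code",
    action="store_true",  # defaults to False, so remote code is strictly opt-in
    help="Allow execution of custom model code from the Hub (use with care)",
)

# Without the flag, the safe default applies.
args = parser.parse_args(["--model", "org/model"])
assert args.trust_remote_code is False

# The user must pass the flag explicitly to enable remote code.
args = parser.parse_args(["--model", "org/model", "--trust-remote-code"])
assert args.trust_remote_code is True
```

The key property is that `store_true` makes the dangerous behavior off by default; nothing in the script path enables it without an explicit user decision.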
- Credential Handling: The skill processes sensitive tokens (`HF_TOKEN`, `AA_API_KEY`) primarily through environment variables and Hugging Face's `secrets` mechanism for remote jobs.
  - Scripts verify token presence before attempting authenticated operations.
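A token-presence check of this kind could look like the sketch below; the helper name is invented, while the environment-variable names come from the report.

```python
import os

def require_token(name: str) -> str:
    """Fail fast with a clear message if a required credential is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it before running authenticated operations"
        )
    return value

# Hypothetical usage before an authenticated call:
# hf_token = require_token("HF_TOKEN")
# aa_key = require_token("AA_API_KEY")
```

Checking up front avoids half-completed jobs that only fail once the first authenticated request is made, and it keeps the token itself out of log output.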
- Indirect Prompt Injection Surface: The `extract-readme` functionality parses untrusted Markdown content from model READMEs.
  - The skill mitigates this by using the `markdown-it-py` parser to strictly extract table structures and numeric values.
  - A manual review step prints the extracted YAML to the console for user verification before any changes are applied or pushed.
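A hedged sketch of what strict, parser-based extraction looks like with `markdown-it-py`: walking the token stream and keeping only numeric cell values from tables, so prose, links, and any injected instructions elsewhere in the README are never interpreted. The function and README content are illustrative, not the skill's actual logic.

```python
from markdown_it import MarkdownIt

def extract_table_numbers(markdown_text: str) -> list:
    """Return numeric values found inside Markdown table cells, ignoring
    everything outside table structures."""
    md = MarkdownIt("commonmark").enable("table")
    numbers = []
    in_table = False
    for tok in md.parse(markdown_text):
        if tok.type == "table_open":
            in_table = True
        elif tok.type == "table_close":
            in_table = False
        elif in_table and tok.type == "inline":
            try:
                numbers.append(float(tok.content))
            except ValueError:
                pass  # non-numeric cell (e.g. a benchmark name): skip
    return numbers

readme = """\
| Benchmark | Score |
| --- | --- |
| MMLU | 71.2 |
| GSM8K | 84.5 |
"""
print(extract_table_numbers(readme))
```

Because only cells that parse as numbers survive, untrusted text in the README cannot smuggle instructions into the extracted YAML; the printed result then goes through the manual review step described above.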
Audit Metadata