sparse-autoencoder-training

Pass

Audited by Gen Agent Trust Hub on Feb 17, 2026

Risk Level: SAFE
Full Analysis
  • [EXTERNAL_DOWNLOADS] (SAFE): The documentation instructs the user to install sae-lens and transformer-lens using pip. These are legitimate, widely-used open-source libraries in the AI interpretability research community.
  • [DATA_EXFILTRATION] (SAFE): The skill mentions uploading models to HuggingFace and logging metrics to Weights & Biases (W&B). These operations are standard in ML development and require user-supplied authentication tokens (e.g., hf_token), with no evidence of hardcoded credentials or unauthorized data transmission.
  • [REMOTE_CODE_EXECUTION] (SAFE): The SAE.from_pretrained method downloads model weights from HuggingFace. Loading weights can involve unsafe deserialization (for example, pickle-based checkpoints), but the SAELens library defaults to the Safetensors format, which stores tensors as plain data and is designed to prevent arbitrary code execution during loading.
  • [INDIRECT_PROMPT_INJECTION] (LOW): The training runner ingests external datasets (e.g., monology/pile-uncopyrighted), which creates a potential surface for indirect prompt injection. However, the skill treats this data as a source of training activations rather than as instructions, so the risk of the model or agent following commands embedded in the dataset is minimal in this context.
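
The REMOTE_CODE_EXECUTION finding above rests on the difference between pickle-style deserialization and a data-only format. The following is a minimal sketch of that difference using only the Python standard library. It does not call SAELens or the safetensors package. The byte layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes) follows the published Safetensors format description, and the tensor name W_enc, its shape, and its values are hypothetical.

```python
import json
import pickle
import struct


class Payload:
    """Stand-in for a hostile checkpoint object (harmless here)."""

    def __reduce__(self):
        # pickle calls eval("6 * 7") when this object is unpickled.
        # A real attacker could name any importable callable instead.
        return (eval, ("6 * 7",))


# Unpickling runs the callable named in the byte stream.
loaded = pickle.loads(pickle.dumps(Payload()))


def build_blob() -> bytes:
    """Build a tiny in-memory file in the assumed Safetensors layout."""
    header = {
        # Hypothetical tensor entry: two float32 values, 8 bytes total.
        "W_enc": {"dtype": "F32", "shape": [1, 2], "data_offsets": [0, 8]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    data = struct.pack("<2f", 0.5, -1.0)
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


def read_header(blob: bytes) -> tuple[dict, bytes]:
    """Parse the JSON header; the rest of the blob is raw tensor bytes."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len].decode("utf-8"))
    return header, blob[8 + header_len :]


header, raw = read_header(build_blob())
begin, end = header["W_enc"]["data_offsets"]
# Reading the tensor is a byte slice plus a numeric decode, nothing more.
values = struct.unpack("<2f", raw[begin:end])
```

Here loaded is the result of the evaluated expression rather than a Payload instance, which shows that the pickle stream chose what code ran. The Safetensors-style path only parses JSON and slices bytes, so the file contents cannot select a callable to execute.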
Audit Metadata
Risk Level: SAFE
Analyzed: Feb 17, 2026, 04:57 PM