sparse-autoencoder-training
Pass
Audited by Gen Agent Trust Hub on Feb 17, 2026
Risk Level: SAFE
Full Analysis
- [EXTERNAL_DOWNLOADS] (SAFE): The documentation instructs the user to install `sae-lens` and `transformer-lens` using `pip`. These are legitimate, widely used open-source libraries in the AI interpretability research community.
- [DATA_EXFILTRATION] (SAFE): The skill mentions uploading models to HuggingFace and logging metrics to Weights & Biases (W&B). These operations are standard in ML development and require user-supplied authentication tokens (e.g., `hf_token`), with no evidence of hardcoded credentials or unauthorized data transmission.
- [REMOTE_CODE_EXECUTION] (SAFE): The `SAE.from_pretrained` method downloads model weights from HuggingFace. While weight loading can involve deserialization, the SAELens library defaults to using the Safetensors format, which is designed to prevent arbitrary code execution during the loading process.
- [INDIRECT_PROMPT_INJECTION] (LOW): There is a potential surface for indirect prompt injection as the training runner ingests external datasets (e.g., `monology/pile-uncopyrighted`). However, the skill treats this data as training activations rather than instructions, and the risk of the model or agent following commands embedded in the dataset is minimal in this context.
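The Safetensors point in the REMOTE_CODE_EXECUTION finding can be illustrated with a minimal sketch. It assumes the published Safetensors layout: an 8-byte little-endian header length, followed by a JSON header, followed by raw tensor bytes. It is not the SAELens loader. The function names and the tensor name `W_enc` are hypothetical and chosen only for illustration. The point is that the metadata is plain JSON, so reading it involves no pickle-style object deserialization.

```python
import json
import struct


def read_safetensors_header(data: bytes) -> dict:
    """Parse the JSON header of a Safetensors payload without touching tensor data."""
    if len(data) < 8:
        raise ValueError("payload too short to contain a header length")
    # First 8 bytes: unsigned little-endian 64-bit header size.
    (header_len,) = struct.unpack("<Q", data[:8])
    if header_len > len(data) - 8:
        raise ValueError("declared header length exceeds payload size")
    # The header is plain JSON mapping tensor names to dtype, shape, and offsets.
    return json.loads(data[8 : 8 + header_len].decode("utf-8"))


def build_example_payload() -> bytes:
    """Build a tiny in-memory payload with one hypothetical tensor (illustration only)."""
    header = {"W_enc": {"dtype": "F32", "shape": [1, 2], "data_offsets": [0, 8]}}
    header_bytes = json.dumps(header).encode("utf-8")
    tensor_bytes = struct.pack("<2f", 0.5, -1.0)  # two float32 values, 8 bytes
    return struct.pack("<Q", len(header_bytes)) + header_bytes + tensor_bytes


header = read_safetensors_header(build_example_payload())
print(header["W_enc"]["shape"])
```

Because the loader only parses JSON and then interprets fixed-width numeric bytes, a malicious file can at worst supply wrong numbers or a malformed header, which the length check above rejects.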
Audit Metadata