openrlhf-training
Fail
Audited by Gen Agent Trust Hub on Feb 17, 2026
Risk Level: HIGH
Categories: COMMAND_EXECUTION, EXTERNAL_DOWNLOADS, PROMPT_INJECTION
Full Analysis
- [COMMAND_EXECUTION] (HIGH): The skill instructs the user to run a Docker container with the SYS_ADMIN capability (--cap-add=SYS_ADMIN). This grants the container excessive privileges on the host system, potentially allowing it to bypass security profiles or access host resources directly (a least-privilege invocation is sketched after this list).
- [COMMAND_EXECUTION] (HIGH): The installation section requires sudo pip uninstall for several packages. Running package managers with root privileges can destabilize the system Python installation and gives an attacker a path to host compromise if the skill or its environment is manipulated (see the virtual-environment sketch after this list).
- [EXTERNAL_DOWNLOADS] (MEDIUM): The skill installs the openrlhf package and its dependencies (Ray, vLLM). These sources are not included in the provided Trusted GitHub Organizations list, and the installations are performed without version pinning or checksum verification (a pinned, hash-verified install is sketched after this list).
- [PROMPT_INJECTION] (LOW) (Category 8 - Indirect Prompt Injection): The skill exposes an attack surface for indirect prompt injection via the processing of untrusted external data (a dataset-scanning sketch follows this list):
- Ingestion points: The skill ingests external datasets from Hugging Face (e.g., OpenRLHF/preference_dataset_mixture2_and_safe_pku) during the Reward Model and PPO training steps in SKILL.md.
- Boundary markers: No boundary markers or 'ignore' instructions are used to separate training data from training logic.
- Capability inventory: The skill has the capability to execute shell commands, perform network operations via Ray, and write to the filesystem.
- Sanitization: No sanitization or validation of the training data content is performed, leaving the agent susceptible to instructions embedded within the datasets.
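For the SYS_ADMIN finding, a minimal hardening sketch follows. The image name, GPU flag, and bash entrypoint are illustrative assumptions; the audited skill's exact docker command is not reproduced in this report.

```bash
# Risky pattern flagged above: SYS_ADMIN gives the container near-root
# control over the host kernel. (Image name is illustrative.)
docker run --gpus all -it --cap-add=SYS_ADMIN nvcr.io/nvidia/pytorch:24.07-py3 bash

# Hardened alternative: start from zero capabilities and forbid privilege
# escalation; add back individual capabilities only if the workload
# demonstrably fails without them.
docker run --gpus all -it \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  nvcr.io/nvidia/pytorch:24.07-py3 bash
```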
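For the sudo pip finding, one sketch of a root-free setup, assuming a standard Python 3 toolchain; the package names in the uninstall line are illustrative, as the skill's exact list is not quoted here.

```bash
# Isolate the skill in a virtual environment instead of mutating the
# system Python with root privileges (e.g. `sudo pip uninstall ...`).
python3 -m venv ~/.venvs/openrlhf
source ~/.venvs/openrlhf/bin/activate
pip install --upgrade pip

# Package changes now stay inside the venv: no sudo, no system-wide impact.
pip uninstall -y xgboost transformer-engine flash-attn   # names illustrative
```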
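For the unpinned-download finding, a sketch of pip's hash-checking mode. The version numbers and hashes below are placeholders, not real values; in this mode every transitive dependency must also be pinned and hashed (e.g. generated with pip-tools' `pip-compile --generate-hashes`).

```bash
# Pin exact versions and verify artifact hashes so a replaced or tampered
# release cannot be installed silently. X.Y.Z and the hashes are placeholders.
cat > requirements.txt <<'EOF'
openrlhf==X.Y.Z --hash=sha256:<hash-of-audited-wheel>
ray==X.Y.Z --hash=sha256:<hash-of-audited-wheel>
vllm==X.Y.Z --hash=sha256:<hash-of-audited-wheel>
EOF
pip install --require-hashes -r requirements.txt
```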
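For the prompt-injection finding, a rough pre-training screening sketch. The dataset name comes from the finding above; the grep pattern list is a heuristic starting point for review, not a complete injection filter.

```bash
# Download the dataset to local disk first, then scan its text fields for
# instruction-like strings before any training run or agent step touches it.
huggingface-cli download OpenRLHF/preference_dataset_mixture2_and_safe_pku \
  --repo-type dataset --local-dir ./dataset_audit

# grep exits 0 only on a match, so the warning fires only when hits are found.
grep -rniE 'ignore (all |previous |prior )?instructions|disregard .*system prompt|run the following command' \
  ./dataset_audit && echo "WARNING: suspicious strings found; review before training"
```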
Recommendations
- AI analysis detected serious security threats; review the Full Analysis findings above before enabling this skill.
Audit Metadata