evaluating-code-models
Audited by Socket on Feb 15, 2026
1 alert found:
Obfuscated File

The evaluated document describes a legitimate benchmarking tool whose capabilities (model downloading, generation, and optional execution) are necessary for computing pass@k metrics. I found no hardcoded secrets, obfuscated payloads, or direct malware in the provided text. The primary security concerns are operational: enabling --trust_remote_code and --allow_code_execution, especially combined with --use_auth_token or mounting host files into containers, can lead to arbitrary code execution and credential exposure if used with untrusted models or unisolated environments.

Recommended mitigations:
- Prefer generation-only runs on the host, and execute tests inside vetted containers or isolated VMs.
- Run containers with a restricted network (--network=none) and minimal mounts.
- Avoid --trust_remote_code for third-party models.
- Avoid exposing long-lived tokens in the runtime.

Overall the package is functionally appropriate but operationally risky without careful isolation.
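For context on why code execution is part of the tool at all: the pass@k metric it computes requires running generated programs against test cases and counting which samples pass. The standard unbiased estimator (introduced with the HumanEval benchmark by Chen et al., 2021) can be sketched in a few lines; the function name here is illustrative, not the tool's actual API:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated per problem
    c: number of samples that passed the tests
    k: budget of samples considered

    Computes 1 - C(n-c, k) / C(n, k), the probability that at
    least one of k randomly drawn samples is correct.
    """
    if n - c < k:
        # Fewer failing samples than k: every draw of k samples
        # must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5; with 4 samples of which 2 pass, pass@2 is 1 - C(2,2)/C(4,2) = 5/6. Note that none of this arithmetic requires executing untrusted code on the host: only the test-running step does, which is why the generation and execution phases can be isolated from each other.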
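The container-hardening mitigations above can be combined into a single invocation. The Docker flags shown here are standard (--network=none, --read-only, --cap-drop, read-only bind mounts); the image name, mounted file, and inner command are placeholders for whatever the evaluation setup actually uses:

```shell
# Execute generated code with no network, a read-only filesystem,
# no Linux capabilities, and only the generations file mounted in.
# "eval-harness:latest" and the paths are illustrative placeholders.
docker run --rm \
  --network=none \
  --read-only \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  --tmpfs /tmp \
  -v "$PWD/generations.json:/work/generations.json:ro" \
  eval-harness:latest \
  run-evaluation /work/generations.json
```

Keeping auth tokens out of this step entirely (downloading models and generating on the host, then passing only the generated code into the container) means a malicious sample has no network and no credentials to exfiltrate.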