model-evaluation
Pass
Audited by Gen Agent Trust Hub on Apr 1, 2026
Risk Level: SAFECOMMAND_EXECUTIONEXTERNAL_DOWNLOADS
Full Analysis
- [COMMAND_EXECUTION]: The skill performs resource discovery by invoking AWS service commands to list model packages, describe specific model versions, and check the availability of Bedrock foundation models. It also executes a local Python script, 'validate_custom_metrics.py', to ensure user-provided JSON adheres to the required schema.
- [EXTERNAL_DOWNLOADS]: The generated notebook contains commands to install or upgrade the 'sagemaker' Python SDK from official package registries to ensure compatibility with the evaluation framework.
- [SAFE]: The skill demonstrates secure design by implementing a mandatory 'Hard stop' for user agreement to Bedrock Evaluation terms and requiring explicit consent before sampling S3 dataset records. No obfuscation, persistence mechanisms, or unauthorized data exfiltration attempts were detected.
Audit Metadata