# HyperPod Issue Report

Collect diagnostic logs from HyperPod cluster nodes via SSM and store the results in S3. Supports both EKS and Slurm clusters with auto-detection. Uses the bundled `scripts/hyperpod_issue_report.py` for reliable parallel collection.
## Prerequisites
- AWS CLI configured with permissions: `sagemaker:DescribeCluster`, `sagemaker:ListClusterNodes`, `ssm:StartSession`, `s3:PutObject`, `s3:GetObject`, `eks:DescribeCluster`
- Python 3.8+ and `uv` (see the uv installation docs for install options)
- SSM Agent running on target nodes; node IAM roles need `s3:GetObject`/`s3:PutObject` on the report bucket
- For EKS clusters: `kubectl` installed and configured (see Workflow step 2)
## Workflow
### 1. Gather Information
Collect from the user:
- Cluster identifier (required): accepts a cluster name or full cluster ARN (e.g., `arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123`)
- AWS region (required unless extractable from the ARN)
- S3 path for report storage (required, e.g. `s3://bucket/prefix`). If the user doesn't have a bucket, create one (e.g., `s3://hyperpod-diagnostics-<account-id>-<region>`)
- Issue description (optional)
- Target scope: all nodes, specific instance groups, or specific node IDs (optional)
- Additional commands to run on nodes (optional)
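Since the cluster identifier may arrive as either a bare name or a full ARN (with the region embedded in it), a small parsing helper can resolve both forms. This is a hypothetical sketch, not part of the bundled script; `parse_cluster_identifier` and its return shape are assumptions:

```python
import re
from typing import Optional, Tuple

# Matches a HyperPod cluster ARN such as:
# arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123
ARN_RE = re.compile(
    r"^arn:aws:sagemaker:(?P<region>[a-z0-9-]+):\d{12}:cluster/(?P<cluster_id>\S+)$"
)

def parse_cluster_identifier(identifier: str) -> Tuple[str, Optional[str]]:
    """Return (cluster_name_or_id, region); region is None for bare names."""
    m = ARN_RE.match(identifier)
    if m:
        return m.group("cluster_id"), m.group("region")
    # Not an ARN: treat the whole string as a cluster name, region unknown.
    return identifier, None

print(parse_cluster_identifier(
    "arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123"
))
# → ('abc123', 'us-west-2')
print(parse_cluster_identifier("my-cluster"))
# → ('my-cluster', None)
```

When the region comes back `None`, prompt the user for it before proceeding.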
### 2. Verify Environment
```bash
aws sts get-caller-identity
aws sagemaker describe-cluster --cluster-name <name-or-arn> --region <region>
```
If the S3 bucket doesn't exist, create it:
```bash
aws s3 mb s3://<bucket-name> --region <region>
```
For EKS clusters (check `Orchestrator.Eks` in the describe-cluster output):

- Ensure `kubectl` is installed (`which kubectl`). If missing, install it for the current platform.
- Configure kubeconfig using the EKS cluster name from the describe-cluster response:

  ```bash
  aws eks update-kubeconfig --name <eks-cluster-name> --region <region>
  ```
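The EKS-versus-Slurm decision above can be made programmatically from the describe-cluster response. A minimal sketch, assuming the response is the parsed JSON dict; `detect_orchestrator` is a hypothetical helper and the bundled script's actual detection logic may differ:

```python
def detect_orchestrator(describe_cluster_response: dict) -> str:
    """Classify a HyperPod cluster as 'eks' or 'slurm' from its
    describe-cluster response, using the Orchestrator.Eks key."""
    orchestrator = describe_cluster_response.get("Orchestrator") or {}
    if "Eks" in orchestrator:
        return "eks"
    # No Orchestrator.Eks entry: treat the cluster as Slurm-orchestrated.
    return "slurm"

eks_resp = {
    "Orchestrator": {
        "Eks": {"ClusterArn": "arn:aws:eks:us-west-2:123456789012:cluster/my-eks"}
    }
}
slurm_resp = {}  # Slurm clusters lack an Orchestrator.Eks entry

print(detect_orchestrator(eks_resp))    # → eks
print(detect_orchestrator(slurm_resp))  # → slurm
```

Only run the `kubectl`/kubeconfig steps when this returns `eks`.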
### 3. Run the Collection Script
```bash
uv run scripts/hyperpod_issue_report.py \
  --cluster <cluster-name-or-arn> \
  --region <region> \
  --s3-path s3://<bucket>[/prefix]
```
Use `--help` for all options, including `--instance-groups`, `--nodes`, `--command`, `--max-workers`, and `--debug`. Note: `--instance-groups` and `--nodes` are mutually exclusive. Node identifiers accept instance IDs (`i-*`), EKS node names (`hyperpod-i-*`), or Slurm node names (`ip-*`).
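The three accepted node-identifier forms can be normalized before passing them along. A hypothetical sketch under the naming convention described above; `normalize_node_identifier` is not part of the bundled script, and its real handling may differ:

```python
def normalize_node_identifier(node_id: str) -> str:
    """Map an accepted node identifier toward a bare EC2 instance ID.

    EKS node names embed the instance ID after a 'hyperpod-' prefix, so
    that prefix is stripped. Slurm names (ip-*) carry no instance ID and
    are returned unchanged for a later inventory lookup.
    """
    if node_id.startswith("hyperpod-i-"):   # EKS node name
        return node_id[len("hyperpod-"):]
    if node_id.startswith("i-"):            # already an instance ID
        return node_id
    if node_id.startswith("ip-"):           # Slurm node name
        return node_id
    raise ValueError(f"Unrecognized node identifier: {node_id}")

print(normalize_node_identifier("hyperpod-i-0123456789abcdef0"))
# → i-0123456789abcdef0
print(normalize_node_identifier("ip-10-1-2-3"))
# → ip-10-1-2-3
```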
### 4. Present Results
After collection, the script shows statistics and offers interactive download. Report the S3 location and offer to:
- Download the report locally
- Help analyze collected diagnostics (see `references/collection-details.md` for what's in each file)
- Prepare a summary for AWS Support
## Troubleshooting
See `references/troubleshooting.md` for error handling, large-cluster tuning, and known limitations.