Together GPU Clusters
Overview
Provision GPU clusters on Together AI for distributed training, large-scale inference, and HPC workloads.
- Hardware: NVIDIA H100, H200, and B200 GPUs (SXM form factor; the H100 SXM has 80 GB of memory)
- Cluster types: On-demand (pay-as-you-go) or Reserved (committed)
- Orchestration: Kubernetes or Slurm
- Management: tcloud CLI, Terraform, SkyPilot, REST API
- Networking: InfiniBand for high-bandwidth inter-node communication
Installation
# Python (recommended)
uv init # optional, if starting a new project
uv add together
# or with pip
pip install together
# TypeScript / JavaScript
npm install together-ai
Set your API key:
export TOGETHER_API_KEY=<your-api-key>
Workflow
- Choose hardware and cluster size
- Create cluster via tcloud CLI, Terraform, or API
- Configure orchestration (K8s or Slurm)
- Run workloads
- Monitor health and manage nodes
- Delete when done
Quick Start with CLI
The CLI supports two equivalent command forms. The examples below use together beta clusters, but you can also use tcloud cluster after installing tcloud.
Install
# Option A: Together CLI (included with Together Python SDK)
pip install together
# Option B: Standalone tcloud binary
# Mac (Universal)
curl -LO https://tcloud-cli-downloads.s3.us-west-2.amazonaws.com/releases/latest/tcloud-darwin-universal.tar.gz
tar xzf tcloud-darwin-universal.tar.gz
# Linux (AMD64)
curl -LO https://tcloud-cli-downloads.s3.us-west-2.amazonaws.com/releases/latest/tcloud-linux-amd64.tar.gz
tar xzf tcloud-linux-amd64.tar.gz
Authenticate
# Together CLI
together auth login
# tcloud
tcloud sso login
List Available Regions
together beta clusters list-regions
Example output:
{
  "regions": [
    {
      "driver_versions": [
        "CUDA_12_6_565",
        "CUDA_12_5_555",
        "CUDA_12_8_570",
        "CUDA_12_9_575",
        "CUDA_12_6_560",
        "CUDA_12_4_550"
      ],
      "name": "us-central-8",
      "supported_instance_types": [
        "H100_SXM",
        "H200_SXM"
      ]
    }
  ]
}
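Before creating a cluster, it helps to check that a region offers both the GPU type and the driver version you plan to request. The following is a minimal Python sketch that filters a region listing shaped like the example output above. The function name find_regions is hypothetical, and the sample data is a trimmed copy of that output, not a live response.

```python
import json
from typing import List, Optional


def find_regions(listing: dict, gpu_type: str,
                 driver_version: Optional[str] = None) -> List[str]:
    """Return names of regions offering gpu_type (and driver_version, if given)."""
    matches = []
    for region in listing.get("regions", []):
        if gpu_type not in region.get("supported_instance_types", []):
            continue
        if driver_version and driver_version not in region.get("driver_versions", []):
            continue
        matches.append(region["name"])
    return matches


# Trimmed copy of the example output shown above (not a live API response).
sample = json.loads("""
{"regions": [{"driver_versions": ["CUDA_12_6_565", "CUDA_12_6_560"],
              "name": "us-central-8",
              "supported_instance_types": ["H100_SXM", "H200_SXM"]}]}
""")

print(find_regions(sample, "H100_SXM", "CUDA_12_6_560"))  # ['us-central-8']
print(find_regions(sample, "H100_SXM", "CUDA_11_8"))      # []
```

In practice the listing would come from saving the output of together beta clusters list-regions and loading it with json.load, assuming the command emits JSON in the shape shown.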
Create a Cluster
# On-demand Kubernetes cluster
together beta clusters create \
--name my-training-cluster \
--num-gpus 8 \
--gpu-type H100_SXM \
--region us-central-8 \
--driver-version CUDA_12_6_560 \
--billing-type ON_DEMAND \
--cluster-type KUBERNETES
# Reserved Slurm cluster with shared storage
together beta clusters create \
--name my-slurm-cluster \
--num-gpus 16 \
--gpu-type H200_SXM \
--region us-central-8 \
--driver-version CUDA_12_6_560 \
--billing-type RESERVED \
--duration-days 30 \
--cluster-type SLURM \
--volume <VOLUME_ID>
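When cluster creation is scripted, assembling the argument list in code avoids shell-quoting mistakes. The sketch below builds the same together beta clusters create invocation using only the flags shown in the two examples above. The helper name build_create_command is hypothetical, and the rule that RESERVED billing needs --duration-days is an assumption inferred from the reserved example, not a documented constraint.

```python
import shlex
from typing import List, Optional


def build_create_command(name: str, num_gpus: int, gpu_type: str, region: str,
                         driver_version: str,
                         billing_type: str = "ON_DEMAND",
                         cluster_type: str = "KUBERNETES",
                         duration_days: Optional[int] = None,
                         volume_id: Optional[str] = None) -> List[str]:
    """Build the argv list for `together beta clusters create`."""
    # Assumption: reserved clusters need a duration, as in the example above.
    if billing_type == "RESERVED" and duration_days is None:
        raise ValueError("RESERVED billing expects duration_days")
    cmd = [
        "together", "beta", "clusters", "create",
        "--name", name,
        "--num-gpus", str(num_gpus),
        "--gpu-type", gpu_type,
        "--region", region,
        "--driver-version", driver_version,
        "--billing-type", billing_type,
        "--cluster-type", cluster_type,
    ]
    if duration_days is not None:
        cmd += ["--duration-days", str(duration_days)]
    if volume_id:
        cmd += ["--volume", volume_id]
    return cmd


cmd = build_create_command("my-training-cluster", 8, "H100_SXM",
                           "us-central-8", "CUDA_12_6_560")
print(shlex.join(cmd))  # shell-safe rendering of the command
```

To execute the command rather than print it, pass the list to subprocess.run(cmd, check=True).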
Check Status
together beta clusters list
together beta clusters retrieve <CLUSTER_ID>
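Provisioning is not instant, so scripts typically poll the status command until the cluster is usable. This document does not show the output of clusters retrieve, so the sketch below leaves both the fetch step and the readiness test to the caller instead of guessing field names. The helper wait_until and the state strings in the demo are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(fetch: Callable[[], T],
               is_ready: Callable[[T], bool],
               interval_s: float = 30.0,
               timeout_s: float = 1800.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() repeatedly until is_ready(state) is true or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch()
        if is_ready(state):
            return state
        # Stop if another wait would push past the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError("cluster did not become ready in time")
        sleep(interval_s)


# Demo with a fake fetcher; the state strings are placeholders only.
states = iter(["pending", "pending", "ready"])
result = wait_until(lambda: next(states), lambda s: s == "ready",
                    interval_s=0.0, sleep=lambda _: None)
print(result)  # ready
```

A real fetch function could run together beta clusters retrieve <CLUSTER_ID> through subprocess and return its output, with is_ready written against whatever status field that output actually contains.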
Scale a Cluster
together beta clusters update <CLUSTER_ID> --num-gpus 16
Get Credentials (Kubernetes)
# Write kubeconfig to default location (~/.kube/config)
together beta clusters get-credentials <CLUSTER_ID>
# Write to a specific file
together beta clusters get-credentials <CLUSTER_ID> --file ./kubeconfig.yaml
# Print to stdout
together beta clusters get-credentials <CLUSTER_ID> --file -
# Overwrite existing context and set as default
together beta clusters get-credentials <CLUSTER_ID> \
--overwrite-existing \
--set-default-context
# Then use kubectl
export KUBECONFIG=~/.kube/config
kubectl get nodes
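Once kubectl works, a quick sanity check is to confirm that every node advertises the expected number of GPUs. The sketch below reads the JSON form of kubectl get nodes and sums the nvidia.com/gpu allocatable resource per node. It assumes the cluster runs the NVIDIA device plugin, which is what publishes that resource name; the function names and sample node data are hypothetical.

```python
import json
import subprocess
from typing import Dict


def gpus_per_node(node_list: dict) -> Dict[str, int]:
    """Map node name to allocatable GPU count from `kubectl get nodes -o json`."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings.
        counts[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return counts


def fetch_nodes() -> dict:
    """Run kubectl against the current KUBECONFIG and parse the node list."""
    out = subprocess.run(["kubectl", "get", "nodes", "-o", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)


# Hypothetical sample data in the shape kubectl returns.
sample = {"items": [
    {"metadata": {"name": "node-a"},
     "status": {"allocatable": {"nvidia.com/gpu": "8", "cpu": "96"}}},
    {"metadata": {"name": "node-b"},
     "status": {"allocatable": {"cpu": "8"}}},
]}
print(gpus_per_node(sample))  # {'node-a': 8, 'node-b': 0}
```

Against a live cluster, call gpus_per_node(fetch_nodes()) and compare the total with the --num-gpus value used at creation.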
Create and Manage Shared Storage
# Create a shared volume
together beta clusters storage create \
--volume-name my-shared-data \
--size-tib 2 \
--region us-central-8
# List all volumes
together beta clusters storage list
# Get volume details
together beta clusters storage retrieve <VOLUME_ID>
# Delete a volume (must not be attached to a cluster)
together beta clusters storage delete <VOLUME_ID>
Delete a Cluster
together beta clusters delete <CLUSTER_ID>
Kubernetes vs Slurm
Choose Kubernetes when you:
- Run containerized workloads
- Need automatic scheduling and scaling
- Use cloud-native ML frameworks (Kubeflow, Ray)
Choose Slurm when you:
- Run traditional HPC workloads
- Run multi-node MPI training
- Are already familiar with Slurm job scripts
- Need fine-grained resource allocation
Key CLI Commands
| together beta clusters | tcloud cluster | Description |
|---|---|---|
| clusters create | cluster create | Create a new cluster |
| clusters list | cluster list | List all clusters |
| clusters retrieve <ID> | cluster get <ID> | Get cluster details |
| clusters update <ID> | cluster scale <ID> | Update/scale a cluster |
| clusters delete <ID> | cluster delete <ID> | Delete a cluster |
| clusters list-regions | -- | List regions and GPU types |
| clusters get-credentials <ID> | -- | Get K8s kubeconfig |
| clusters storage create | -- | Create shared volume |
| clusters storage list | -- | List shared volumes |
| clusters storage retrieve <ID> | -- | Get volume details |
| clusters storage delete <ID> | -- | Delete shared volume |
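The subcommand names differ between the two CLIs (retrieve versus get, update versus scale), which is easy to trip over in scripts. As an illustration, the sketch below encodes the table's mapping in a dictionary and rewrites a together beta clusters command into its tcloud cluster form. The helper to_tcloud is hypothetical, and passing the remaining arguments through unchanged is an assumption: the table maps subcommand names only and does not say whether flags are identical.

```python
from typing import List, Optional

# Subcommand mapping taken from the table above; rows marked "--" are omitted.
TCLOUD_EQUIVALENT = {
    "create": "create",
    "list": "list",
    "retrieve": "get",
    "update": "scale",
    "delete": "delete",
}


def to_tcloud(argv: List[str]) -> Optional[List[str]]:
    """Translate a `together beta clusters ...` argv; None if no equivalent is listed."""
    prefix = ["together", "beta", "clusters"]
    if argv[:3] != prefix or len(argv) < 4:
        raise ValueError("expected a 'together beta clusters <subcommand>' command")
    subcommand = argv[3]
    if subcommand not in TCLOUD_EQUIVALENT:
        return None
    # Assumption: trailing arguments carry over as-is.
    return ["tcloud", "cluster", TCLOUD_EQUIVALENT[subcommand]] + argv[4:]


print(to_tcloud(["together", "beta", "clusters", "retrieve", "<CLUSTER_ID>"]))
# ['tcloud', 'cluster', 'get', '<CLUSTER_ID>']
print(to_tcloud(["together", "beta", "clusters", "list-regions"]))
# None
```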
Terraform Integration
resource "together_gpu_cluster" "training" {
  name             = "my-training-cluster"
  num_gpus         = 8
  instance_type    = "H100-SXM"
  region           = "us-central-8"
  billing_type     = "prepaid"
  reservation_days = 30
  shared_volume {
    name     = "training-data"
    size_tib = 5
  }
}
terraform init
terraform plan
terraform apply
SkyPilot Integration
# sky.yaml
resources:
  accelerators: H100:8
  cloud: kubernetes
setup: |
  pip install torch transformers
run: |
  torchrun --nproc_per_node=8 train.py
sky launch sky.yaml
Health Monitoring
tcloud cluster health my-cluster
- Automatic health checks on GPU, network, and storage
- Unhealthy nodes flagged for repair or replacement
- Node repair can be triggered manually or automatically
Storage
- NFS: Shared filesystem across all nodes
- Object storage: S3-compatible for large datasets
- Persistent storage survives node restarts
Billing
- On-demand: Per-GPU-hour billing, no commitment
- Reserved: Committed capacity with discounted rates
- Billed while cluster is running (even if idle)
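Because billing runs for as long as the cluster exists, cost scales with wall-clock time rather than with GPU utilization. The arithmetic can be sketched as follows. The rate of 2.00 per GPU-hour is a made-up placeholder, not a Together AI price, and the function names are hypothetical.

```python
def gpu_hours(num_gpus: int, hours: float) -> float:
    """GPU-hours accrued by a cluster that exists for `hours` of wall-clock time."""
    return num_gpus * hours


def estimate_cost(num_gpus: int, hours: float, rate_per_gpu_hour: float) -> float:
    """Estimated charge; idle time counts because billing follows uptime."""
    return gpu_hours(num_gpus, hours) * rate_per_gpu_hour


# On-demand: 8 GPUs kept up for 36 hours, idle periods included.
print(gpu_hours(8, 36))            # 288
print(estimate_cost(8, 36, 2.00))  # 576.0 (placeholder rate)

# Reserved: 16 GPUs for a 30-day commitment.
print(gpu_hours(16, 24 * 30))      # 11520
```

Deleting a cluster as soon as a job finishes is what stops the GPU-hour count from growing.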
Resources
- tcloud CLI reference: See references/tcloud-cli.md
- Cluster management details: See references/cluster-management.md
- Official docs: GPU Clusters Overview
- Official docs: GPU Clusters Quickstart
- API reference: Clusters API
Repository: zainhas/togethe…i-skills
First Seen: Feb 27, 2026