truefoundry-deploy
Routing note: For ambiguous user intents, use the shared clarification templates in references/intent-clarification.md.
Deploy to TrueFoundry
Route user intent to the right deployment workflow. Load only the references you need.
Intent Router
| User Intent | Action | Reference |
|---|---|---|
| "deploy", "deploy my app", "ship this" | Single HTTP service | deploy-service.md |
| "mount this file", "mount config file", "mount certificate file", "mount key file" | Single service with file mounts (no image rebuild) | deploy-service.md |
| "tfy apply", "apply manifest", "deploy from yaml" | Declarative manifest apply | deploy-apply.md |
| "deploy everything", "full stack", docker-compose, "docker-compose.yaml", "compose.yaml" | Multi-service: use compose as source of truth | deploy-multi.md + compose-translation.md |
| "async service", "queue consumer", "worker" | Async/queue service | deploy-async.md |
| "deploy LLM", "serve model" | Model serving intent (may be ambiguous) | Ask user: dedicated model serving (llm-deploy) or generic service deploy (deploy) |
| "deploy helm chart" | Helm chart intent | Confirm Helm path and collect chart details, then proceed with helm workflow |
| "deploy postgres docker", "dockerized postgres", "deploy redis docker", "database in docker/container" | Containerized database intent | Proceed with deploy workflow (do not route to Helm) |
| "deploy database", "deploy postgres", "deploy redis" | Ambiguous infra intent | Ask user: Helm chart (helm) or containerized service (deploy) |
Load only the reference file matching the user's intent. Do not preload all references.
General Principle: Ask, Don't Assume
When in doubt, ask. If any deployment parameter is ambiguous or missing — branch, workspace, image, port, resources, environment — ask the user rather than picking a value and proceeding silently. A wrong assumption can deploy to the wrong environment, from the wrong branch, or with the wrong configuration. The cost of one extra question is always lower than the cost of a bad deploy.
Examples of things to ask rather than assume:
- Which workspace to deploy to (even if only one exists)
- Which branch to build from (especially if the manifest branch differs from the local branch)
- Whether to use the existing manifest as-is or update it
- Which Docker image tag or registry to use
- Whether the service should be public or internal
Do NOT silently default to the current value of anything that could have changed or that the user has not explicitly confirmed for this deployment.
Prerequisites (All Workflows)
# 1. Check credentials
grep '^TFY_' .env 2>/dev/null || true
env | grep '^TFY_' 2>/dev/null || true
# 2. Derive TFY_HOST for CLI (MUST run before any tfy command)
export TFY_HOST="${TFY_HOST:-${TFY_BASE_URL%/}}"
# 3. Check CLI
tfy --version 2>/dev/null || echo "Install: pip install 'truefoundry==0.5.0'"
# 4. Check for existing manifests
ls tfy-manifest.yaml truefoundry.yaml 2>/dev/null
- TFY_BASE_URL and TFY_API_KEY must be set (env or .env).
- TFY_HOST must be set before any tfy CLI command. The export above handles this automatically.
- TFY_WORKSPACE_FQN is required. HARD RULE: Never auto-pick a workspace. Always ask the user to confirm, even if only one workspace exists or a preference is saved. See references/prerequisites.md for the full workspace confirmation flow.
- For full credential setup, see references/prerequisites.md.
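As a minimal sketch of the credential check above, the function below lists which of the required variables are unset or empty. The name missing_tfy_vars is hypothetical and not part of the skill's scripts, and it inspects only the current environment, not .env.

```shell
# Hypothetical helper: prints the name of each required variable
# that is unset or empty in the current environment (one per line).
missing_tfy_vars() {
  for name in TFY_BASE_URL TFY_API_KEY TFY_WORKSPACE_FQN; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}
```

An empty result suggests the environment is ready; any printed name is something to ask the user about rather than guess.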
WARNING: Never use source .env. The tfy-api.sh script handles .env parsing automatically. For shell access: grep KEY .env | cut -d= -f2-
CRITICAL: tfy apply vs tfy deploy
HARD RULE: tfy apply does NOT support build_source.type: local. If the manifest has a local build source, you MUST use tfy deploy -f <manifest>. Using tfy apply with a local build source will fail with: must match exactly one schema in oneOf.
| Scenario | Command | Works? |
|---|---|---|
| Pre-built image (image.type: image) | tfy apply -f manifest.yaml | Yes |
| build_source.type: git | tfy apply -f manifest.yaml | Yes |
| build_source.type: git | tfy deploy -f manifest.yaml | Yes |
| build_source.type: local | tfy deploy -f manifest.yaml | Yes |
| build_source.type: local | tfy apply -f manifest.yaml | NO (will fail) |
Before running any deploy command, check the manifest:
- If build_source.type: local → use tfy deploy -f
- Otherwise → tfy apply -f is fine
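The check above can be sketched as a small shell function. This is a text heuristic under stated assumptions, not a YAML parser: it assumes build_source: sits on its own line with type: local on a later line of the same top-level block. The name pick_tfy_command is hypothetical.

```shell
# Hypothetical helper: prints "deploy" if the manifest appears to use a
# local build source, otherwise "apply". Heuristic only; it scans text.
pick_tfy_command() {
  if awk '
      /^[[:space:]]*build_source:/ { in_src = 1; next }
      in_src && /^[[:space:]]*type:[[:space:]]*local[[:space:]]*$/ { found = 1 }
      in_src && /^[^[:space:]]/ { in_src = 0 }   # next top-level key ends the block
      END { exit found ? 0 : 1 }
    ' "$1"; then
    echo deploy
  else
    echo apply
  fi
}
```

A caller could then run tfy "$(pick_tfy_command tfy-manifest.yaml)" -f tfy-manifest.yaml, after confirming the choice with the user.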
Pre-Flight Manifest Validation (MANDATORY)
Before attempting any deploy/apply, run these checks. Fix issues before deploying — do not deploy a known-bad manifest.
1. Exposed port requires host
If any port has expose: true, it must have a host field. Deploying without it will fail with: Host must be provided to expose port.
Auto-generate the host if missing:
TFY_API_SH=~/.claude/skills/truefoundry-deploy/scripts/tfy-api.sh
# Get cluster ID from workspace FQN (format: cluster-id:workspace-name)
CLUSTER_ID=$(echo "$TFY_WORKSPACE_FQN" | cut -d: -f1)
# Discover base domain from cluster manifest
bash $TFY_API_SH GET "/api/svc/v1/clusters/$CLUSTER_ID"
# → Response is at data.manifest.base_domains[] (array of strings)
# → Look for wildcard entry (e.g., "*.ml.example.truefoundry.cloud")
# → Strip "*." to get base domain: "ml.example.truefoundry.cloud"
# → Construct host: "{service-name}-{workspace-name}.{base_domain}"
Pattern: {service-name}-{workspace-name}.{base_domain}
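The host pattern can be illustrated with plain shell parameter expansion. This is a sketch: build_host is a hypothetical helper name, and the wildcard domain is the example value from the comments above, not a real cluster response.

```shell
# Hypothetical helper: build_host SERVICE_NAME WORKSPACE_FQN WILDCARD_DOMAIN
build_host() {
  workspace="${2#*:}"       # drop the "cluster-id:" prefix from the FQN
  base_domain="${3#\*.}"    # strip the leading "*." from the wildcard entry
  printf '%s-%s.%s\n' "$1" "$workspace" "$base_domain"
}

build_host my-service my-cluster:my-workspace '*.ml.example.truefoundry.cloud'
# → my-service-my-workspace.ml.example.truefoundry.cloud
```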
2. Local build source requires tfy deploy
If the manifest contains build_source.type: local, ensure the deploy command is tfy deploy -f, NOT tfy apply.
3. capacity_type compatibility
spot_fallback_on_demand is not supported on all clusters. If you're unsure, use on_demand or omit capacity_type entirely to let the platform decide. Valid safe values: on_demand, spot.
4. build_spec.type must be exact
Only dockerfile and tfy-python-buildpack are valid. Do NOT use docker, build, python, or any other value.
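A minimal sketch of this check, assuming the build_spec.type value has already been extracted from the manifest; the function name is hypothetical.

```shell
# Hypothetical helper: succeeds only for the two accepted build_spec.type values.
is_valid_build_spec_type() {
  case "$1" in
    dockerfile|tfy-python-buildpack) return 0 ;;
    *) return 1 ;;
  esac
}

is_valid_build_spec_type docker || echo "invalid build_spec.type: docker"
```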
5. Git branch mismatch (existing manifest + git source)
If an existing manifest has build_source.type: git with a branch_name set, compare it to the current local branch before deploying:
# Use only the specific manifest file for this deployment (not both at once)
# Use -h to suppress the filename prefix so the bare value can be compared
grep -h 'branch_name:' "$MANIFEST_FILE" 2>/dev/null | head -1 | sed 's/.*branch_name:[[:space:]]*//'
# Get current local branch
git branch --show-current 2>/dev/null
If the branches differ, stop and ask the user:
The manifest specifies branch_name: {manifest_branch}, but your current local branch is {current_branch}. Which branch should be deployed?
- Keep manifest branch: {manifest_branch} (deploy as-is, no manifest change)
- Use current branch: {current_branch} (update branch_name in the manifest)
Never silently override the manifest's branch_name with the current local branch.
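The comparison can be wrapped in a small function that only reports the outcome and leaves the decision to the user. This is a sketch: manifest_branch and branch_check are hypothetical names, and the extraction assumes an unquoted branch_name value with no trailing comment.

```shell
# Hypothetical helper: prints the first branch_name value found in a manifest.
manifest_branch() {
  grep -h 'branch_name:' "$1" 2>/dev/null | head -1 | sed 's/.*branch_name:[[:space:]]*//'
}

# Hypothetical helper: branch_check MANIFEST_FILE CURRENT_BRANCH
branch_check() {
  wanted="$(manifest_branch "$1")"
  if [ -z "$wanted" ]; then
    echo "no-branch"                                  # nothing to compare
  elif [ "$wanted" = "$2" ]; then
    echo "match"
  else
    echo "mismatch: manifest=$wanted current=$2"      # stop and ask the user
  fi
}
```

Typical use would be branch_check "$MANIFEST_FILE" "$(git branch --show-current)"; on a mismatch, ask the question above instead of editing the manifest.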
Quick Ops (Inline)
Apply a manifest (pre-built image or git source)
# tfy CLI expects TFY_HOST when TFY_API_KEY is set
export TFY_HOST="${TFY_HOST:-${TFY_BASE_URL%/}}"
# Preview changes
tfy apply -f tfy-manifest.yaml --dry-run --show-diff
# Apply
tfy apply -f tfy-manifest.yaml
Deploy from local source
# tfy CLI expects TFY_HOST when TFY_API_KEY is set
export TFY_HOST="${TFY_HOST:-${TFY_BASE_URL%/}}"
# MUST use tfy deploy (not tfy apply) for local builds
tfy deploy -f truefoundry.yaml --no-wait
Reminder: tfy apply does NOT support build_source.type: local. Use tfy deploy -f for local builds.
Minimal service manifest template
name: my-service
type: service
image:
  type: image
  image_uri: docker.io/myorg/my-api:v1.0
ports:
  - port: 8000
    expose: false  # Set true + add host for public access
    app_protocol: http
resources:
  cpu_request: 0.5
  cpu_limit: 1
  memory_request: 512
  memory_limit: 1024
  ephemeral_storage_request: 1000
  ephemeral_storage_limit: 2000
env:
  LOG_LEVEL: info
replicas: 1
workspace_fqn: "WORKSPACE_FQN_HERE"
Public access template (when expose: true)
ports:
  - port: 8000
    expose: true
    host: my-service-my-workspace.ml.your-org.truefoundry.cloud  # Auto-generate from cluster discovery
    app_protocol: http
Host is REQUIRED when expose: true. Auto-generate it: {service-name}-{workspace-name}.{base_domain}. Get base_domain from cluster discovery (see cluster-discovery.md).
Check deployment status
TFY_API_SH=~/.claude/skills/truefoundry-deploy/scripts/tfy-api.sh
bash $TFY_API_SH GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=SERVICE_NAME'
Or use the applications skill.
Post-Deploy Monitoring (MANDATORY)
HARD RULE: After every successful tfy apply or tfy deploy command, you MUST monitor the deployment to completion. Do NOT stop after the apply/deploy command returns. Do NOT ask the user "should I monitor?"; just do it. Do NOT say "you can check the status"; YOU check the status. The deployment is not done until you confirm a terminal state.
Monitoring procedure
Immediately after deploy/apply succeeds, start polling. Do not wait for the user to ask.
Poll loop — execute this yourself, do not delegate to the user:
TFY_API_SH=~/.claude/skills/truefoundry-deploy/scripts/tfy-api.sh
# Run this in a loop with sleep between checks:
# Every 15s for first 2 min, every 30s for min 2-5, every 60s after that
# Timeout after 10 minutes
bash $TFY_API_SH GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=SERVICE_NAME'
Or use MCP tool call if available:
tfy_applications_list(filters={"workspace_fqn": "WORKSPACE_FQN", "application_name": "SERVICE_NAME"})
How to check: The response is at data[0].deployment.currentStatus. Use state.isTerminalState as the authoritative check.
Terminal states (state.isTerminalState === true) — stop polling:
- DEPLOY_SUCCESS → report success, replicas, endpoint URL
- BUILD_FAILED, DEPLOY_FAILED, FAILED → fetch logs, diagnose, suggest fix (see below)
- PAUSED → report paused
- CANCELLED → report cancelled
Non-terminal states — keep polling, report progress each time:
- INITIALIZED → "Deployment initialized, waiting..."
- BUILDING (status) or transition BUILDING → "Build in progress..."
- BUILD_SUCCESS → "Build succeeded, deploying..."
- ROLLOUT_STARTED or transition DEPLOYING → "Deploying (X/Y replicas ready)..."
- DEPLOY_FAILED_WITH_RETRY → "Deploy failed, retrying..."
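The polling schedule and stop condition described above can be sketched as follows. The assumptions are loud ones: all three function names are hypothetical, FETCH_CMD stands for whatever command prints the current status name (for example a wrapper around the tfy-api.sh call), and matching on status names is a fallback illustration only, since state.isTerminalState remains the authoritative check.

```shell
# Seconds to wait before the next poll, given elapsed seconds:
# 15s for the first 2 minutes, 30s for minutes 2-5, 60s after that.
poll_interval() {
  if [ "$1" -lt 120 ]; then echo 15
  elif [ "$1" -lt 300 ]; then echo 30
  else echo 60
  fi
}

# Succeeds for the terminal status names listed above.
is_terminal_status() {
  case "$1" in
    DEPLOY_SUCCESS|BUILD_FAILED|DEPLOY_FAILED|FAILED|PAUSED|CANCELLED) return 0 ;;
    *) return 1 ;;
  esac
}

# monitor_deployment FETCH_CMD: polls until a terminal status or 10 minutes.
monitor_deployment() {
  fetch_cmd="$1"; elapsed=0; timeout=600; status=""
  while [ "$elapsed" -lt "$timeout" ]; do
    status="$($fetch_cmd)"
    echo "[${elapsed}s] status: $status"
    if is_terminal_status "$status"; then return 0; fi
    wait_s="$(poll_interval "$elapsed")"
    sleep "$wait_s"
    elapsed=$((elapsed + wait_s))
  done
  echo "Monitoring timed out after 10 minutes. Current status: $status"
  return 1
}
```

Passing the fetch command in as an argument keeps the schedule logic independent of how the status is read (REST call or MCP tool).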
On success
- Report final status and replicas (e.g., "2/2 ready")
- Show endpoint URL if service has an exposed port
- Run a quick HTTP health check if endpoint is available:
curl -sf -o /dev/null -w '%{http_code}' "https://ENDPOINT_URL" || true
On failure
- Fetch recent logs (last 5 minutes) using logs skill or direct API
- Identify root cause from logs (OOMKilled, CrashLoopBackOff, ImagePullBackOff, port mismatch, probe failure, build error)
- Follow deploy-debugging.md for diagnosis
- Apply one fix and retry once; if still failed, report to user with summary and log excerpt and stop
On timeout (10 minutes)
Report current state and elapsed time. Do NOT silently give up — tell the user:
Monitoring timed out after 10 minutes. Current status: ROLLOUT_STARTED (transition: DEPLOYING).
The deployment is still in progress. You can re-run monitoring or check the TrueFoundry dashboard.
NEVER end your response after a deploy/apply command without reporting a terminal deployment status (state.isTerminalState === true). If you are about to end your response and you have not confirmed DEPLOY_SUCCESS, DEPLOY_FAILED, BUILD_FAILED, FAILED, PAUSED, or CANCELLED, you are violating this rule: go back and poll.
Post-Deploy Configuration (Ask After Success)
After deployment succeeds (DEPLOY_SUCCESS), ask the user about the following configuration options. Do not silently skip these; present them as a checklist and let the user decide.
1. Public vs Private URL
Ask the user:
Your service is deployed. How should it be accessed?
1. **Public URL** — Accessible from the internet (expose: true with a host)
2. **Private/Internal only** — Only accessible within the cluster (expose: false)
If the user picks public and the port doesn't already have expose: true + host, update the manifest and redeploy.
2. Authentication
Ask the user:
Do you want to add authentication to your service?
1. **No auth** — Anyone with the URL can access it
2. **TrueFoundry login** — Users must log in via TrueFoundry (truefoundry_oauth)
3. **JWT auth** — Verify JWT tokens from a custom identity provider
4. **Basic auth** — Username/password protection
If the user picks an auth option, add the appropriate auth block to the port configuration and redeploy.
3. Auto-shutdown vs Always Running
Ask the user:
Should the service auto-shutdown when idle?
1. **Always running** — Keep replicas up at all times (default)
2. **Auto-shutdown after idle** — Scale to zero after no requests for a period (saves cost)
→ Recommended wait_time: 900 seconds (15 min) for dev, longer for staging
If the user picks auto-shutdown, add the auto_shutdown block to the manifest:
auto_shutdown:
  wait_time: 900  # seconds of inactivity before scaling to zero
Skip these prompts if the user explicitly said they don't want changes, or if this is a redeploy of an existing service that already has these configured.
REST API fallback (when CLI unavailable)
See references/cli-fallback.md for converting YAML to JSON and deploying via tfy-api.sh.
Auto-Detection: Single vs Multi-Service
Before creating any manifest, scan the project:
- Check for docker-compose.yml, docker-compose.yaml, or compose.yaml first. If present (or user mentions docker-compose), treat it as the primary source of truth: load deploy-multi.md and compose-translation.md, generate manifests from the compose file, wire services per service-wiring.md, then complete deployment. Do not ask the user to manually create manifests when a compose file exists.
- Look for multiple Dockerfile files across the project
- Check for service directories with their own dependency files in services/, apps/, frontend/, backend/

Then route based on what you find:
- Compose file present or user says "docker-compose" → Multi-service from compose: load deploy-multi.md + compose-translation.md
- Single service → Load references/deploy-service.md
- Multiple services (no compose) → Load references/deploy-multi.md
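A partial sketch of this scan, assuming it runs from the project root. The helper name detect_deploy_mode and its three output labels are hypothetical, and it covers only the compose-file and Dockerfile checks, not the per-directory dependency-file check.

```shell
# Hypothetical helper: prints "multi-compose", "multi", or "single"
# for the project directory given as $1 (defaults to the current directory).
detect_deploy_mode() {
  dir="${1:-.}"
  # A compose file takes priority as the source of truth.
  for f in docker-compose.yml docker-compose.yaml compose.yaml; do
    if [ -f "$dir/$f" ]; then echo "multi-compose"; return 0; fi
  done
  # Otherwise, more than one Dockerfile suggests multiple services.
  count="$(find "$dir" -type f -name Dockerfile | wc -l | tr -d ' ')"
  if [ "$count" -gt 1 ]; then echo "multi"; else echo "single"; fi
}
```

The result is a hint for which reference to load; ambiguous layouts still call for asking the user.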
Multi-Service Deployment Order (MANDATORY)
HARD RULE: When deploying multiple services, you MUST deploy in dependency order, create secrets between tiers, and wire services before deploying dependents. Never deploy all services at once.
Tier-by-tier flow:
TIER 0: Infrastructure (DB, Cache, Queue) → deploy → wait for pods ready → create TFY secrets
TIER 1: Backend (APIs, workers) → deploy with secrets + DNS wiring → verify connectivity
TIER 2: Frontend / gateway → deploy with backend URLs → verify end-to-end
Key rules:
- Create TFY secret groups with infra credentials between Tier 0 and Tier 1 — never put raw passwords in manifests
- SPA frontends (React, Vue) MUST use backend's public URL, not internal DNS
- DEPLOY_SUCCESS does NOT mean Helm pods are ready; poll actual readiness
- Present the dependency graph and deploy plan to the user before deploying
For step-by-step orchestration, examples, and common patterns, see deploy-ordering.md. For dependency graphs, DNS wiring, and compose translation, see deploy-multi.md, service-wiring.md, and dependency-graph.md.
Secrets Handling (MANDATORY: Always Use TFY Secrets)
HARD RULE: NEVER put sensitive values directly in the manifest env block. ALWAYS create a TrueFoundry secret group first, then reference the secrets using tfy-secret:// format. This is non-negotiable, even for "quick" or "test" deployments.
Workflow for any env var that looks sensitive (matches *PASSWORD*, *SECRET*, *TOKEN*, *KEY*, *API_KEY*, *DATABASE_URL*, *CONNECTION_STRING*, *CREDENTIALS*, or any value the user explicitly says is sensitive):
- Ask the user for the secret values (or confirm they want to store them)
- Create a secret group using the secrets skill:
# Use the secrets skill to create a group with the sensitive keys
# The skill will handle creating the group and individual secrets
- Reference them in the manifest with tfy-secret:// format:
env:
  LOG_LEVEL: info  # plain text OK
  DB_PASSWORD: tfy-secret://my-org:my-service-secrets:DB_PASSWORD  # sensitive: ALWAYS use tfy-secret://
  API_KEY: tfy-secret://my-org:my-service-secrets:API_KEY  # sensitive: ALWAYS use tfy-secret://
Pattern: tfy-secret://<TENANT_NAME>:<SECRET_GROUP_NAME>:<SECRET_KEY> where TENANT_NAME is the subdomain of TFY_BASE_URL.
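The reference pattern can be assembled with shell string handling. This is a sketch that assumes the tenant name is the first DNS label of TFY_BASE_URL; both helper names are hypothetical, and the URL below is a placeholder.

```shell
# Hypothetical helper: derive the tenant name from a base URL.
tenant_from_base_url() {
  host="${1#*://}"      # drop the scheme
  host="${host%%/*}"    # drop any path or trailing slash
  echo "${host%%.*}"    # keep the first DNS label (assumed to be the tenant)
}

# Hypothetical helper: tfy_secret_ref TENANT_NAME SECRET_GROUP_NAME SECRET_KEY
tfy_secret_ref() {
  printf 'tfy-secret://%s:%s:%s\n' "$1" "$2" "$3"
}

tenant="$(tenant_from_base_url 'https://my-org.truefoundry.cloud/')"
tfy_secret_ref "$tenant" my-service-secrets DB_PASSWORD
# → tfy-secret://my-org:my-service-secrets:DB_PASSWORD
```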
If the user provides a raw secret value in the manifest or asks you to put it directly in env:
- Warn them: "Secrets should not be stored as plain text in manifests."
- Offer to create a TFY secret group for them
- Only proceed with raw values if the user explicitly insists after the warning
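To spot candidates for this warning, the name patterns listed earlier in this section can be checked with a case statement. This is a sketch with a hypothetical function name; it looks at variable names only, so values the user flags as sensitive still need handling by hand.

```shell
# Hypothetical helper: succeeds if an env var name matches a sensitive pattern.
is_sensitive_name() {
  # Compare case-insensitively by upper-casing the name first.
  name="$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
  case "$name" in
    *PASSWORD*|*SECRET*|*TOKEN*|*KEY*|*DATABASE_URL*|*CONNECTION_STRING*|*CREDENTIALS*) return 0 ;;
    *) return 1 ;;
  esac
}

for var in LOG_LEVEL DB_PASSWORD OPENAI_API_KEY; do
  if is_sensitive_name "$var"; then echo "$var: use tfy-secret://"; fi
done
```

The *KEY* pattern already covers *API_KEY*, so the latter is not listed separately in the case statement.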
Use the secrets skill for guided secret group creation. For the full workflow, see references/deploy-service.md (Secrets Handling section).
File Mounts (Config, Secrets, Shared Data)
When users ask to mount files into a deployment, prefer manifest mounts over Dockerfile edits:
- type: secret for sensitive file content (keys, certs, credentials)
- type: config_map for non-sensitive config files
- type: volume for writable/shared runtime data
See references/deploy-service.md (File Mounts section) for the end-to-end workflow.
Shared References
These references are available for all workflows — load as needed:
| Reference | Contents |
|---|---|
| manifest-schema.md | Complete YAML field reference (single source of truth) |
| manifest-defaults.md | Per-service-type defaults with YAML templates |
| cli-fallback.md | CLI detection and REST API fallback pattern |
| cluster-discovery.md | Extract cluster ID, base domains, available GPUs |
| resource-estimation.md | CPU, memory, GPU sizing rules of thumb |
| health-probes.md | Startup, readiness, liveness probe configuration |
| gpu-reference.md | GPU types and VRAM reference |
| container-versions.md | Pinned container image versions |
| prerequisites.md | Credential setup and .env configuration |
| rest-api-manifest.md | Full REST API manifest reference |
Workflow-Specific References
| Reference | Used By |
|---|---|
| deploy-api-examples.md | deploy-service |
| deploy-errors.md | deploy-service |
| deploy-scaling.md | deploy-service |
| load-analysis-questions.md | deploy-service |
| codebase-analysis.md | deploy-service |
| tfy-apply-cicd.md | deploy-apply |
| tfy-apply-extra-manifests.md | deploy-apply |
| deploy-ordering.md | deploy-multi (tier-by-tier orchestration) |
| compose-translation.md | deploy-multi |
| dependency-graph.md | deploy-multi |
| multi-service-errors.md | deploy-multi |
| multi-service-patterns.md | deploy-multi |
| service-wiring.md | deploy-multi |
| deploy-debugging.md | All deploy/apply (when status is failed) |
| async-errors.md | deploy-async |
| async-queue-configs.md | deploy-async |
| async-python-library.md | deploy-async |
| async-sidecar-deploy.md | deploy-async |
Composability
- Find workspace: Use workspaces skill
- Monitor rollout: Use monitor skill to track deployment progress
- Check what's deployed: Use applications skill
- View logs: Use logs skill
- Manage secrets: Use secrets skill
- Deploy Helm charts: Use helm skill
- Deploy LLMs: Use llm-deploy skill
- Test after deploy: Use service-test skill
Success Criteria
- User confirmed service name, resources, port, and deployment source before deploying
- Deployment URL and status reported back to the user
- Deployment status verified automatically immediately after apply/deploy (no extra prompt)
- Health probes configured for production deployments
- Secrets stored securely (not hardcoded in manifests)
- For multi-service: all services wired together and working end-to-end