truefoundry-monitor
Routing note: For ambiguous user intents, use the shared clarification templates in references/intent-clarification.md.
Monitor Deployment
Track a TrueFoundry deployment rollout to completion, reporting status at each stage and diagnosing failures automatically.
When to Use
- After
tfy applyortfy deployto track rollout progress - User says "monitor", "watch deployment", "is my deploy done", "check rollout"
- Called automatically by the
deployskill after a successful apply/deploy
When NOT to Use
- User wants to deploy → prefer
deployskill; ask if the user wants another valid path - User wants to list all apps → prefer
applicationsskill; ask if the user wants another valid path - User wants to read historical logs → prefer
logsskill; ask if the user wants another valid path
CRITICAL BEHAVIOR RULES
RULE 1: Once monitoring starts, you MUST poll until a terminal state or timeout. Do NOT stop early. Do NOT ask the user "should I keep checking?" — just keep checking.
RULE 2: Do NOT end your response while the deployment is in a non-terminal state (BUILDING, INITIALIZED, ROLLOUT_STARTED). If you are about to stop and the status is non-terminal, you are violating this rule — continue polling.
RULE 3: Between each poll, briefly tell the user what you're waiting for. Do NOT silently loop, but also do NOT ask for permission to continue.
Required Information
Before monitoring, you need:
- Workspace FQN (
TFY_WORKSPACE_FQN) — HARD RULE: Never auto-pick. Always ask the user to confirm. - Application name — the service or job name being deployed
If invoked right after a deploy, both should already be known from the deploy context.
Execution Priority
For all status checks, use MCP tool calls first:
tfy_applications_list(filters={"workspace_fqn": "WORKSPACE_FQN", "application_name": "APP_NAME"})
If MCP tool calls are unavailable, fall back to direct API via tfy-api.sh.
When using direct API, set TFY_API_SH to the full path of this skill's scripts/tfy-api.sh. See references/tfy-api-setup.md for paths per agent.
Monitoring Flow
Step 1: Initial Status Check
TFY_API_SH=~/.claude/skills/truefoundry-monitor/scripts/tfy-api.sh
bash $TFY_API_SH GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=APP_NAME'
Extract from the response at data[0] (the application object):
deployment.currentStatus.status— the deployment status enumdeployment.currentStatus.transition— current transition (e.g.,BUILDING,DEPLOYING)deployment.currentStatus.state.isTerminalState— boolean, most reliable terminal checkdeployment.currentStatus.state.display— human-readable state
Step 2: Poll Until Terminal State
The API response has two key fields: status (the deployment status) and transition (what's happening now). Use state.isTerminalState as the authoritative check for whether to stop polling.
Status values (from deployment.currentStatus.status):
| Status | Terminal? | Action |
|---|---|---|
INITIALIZED |
No | Report "Deployment initialized, waiting...", continue polling |
BUILDING |
No | Report "Build in progress", continue polling |
BUILD_SUCCESS |
No | Report "Build succeeded, deploying...", continue polling |
BUILD_FAILED |
Yes | Fetch build logs, report failure |
ROLLOUT_STARTED |
No | Report "Rollout started", continue polling |
DEPLOY_SUCCESS |
Yes | Report success with endpoint URL |
DEPLOY_FAILED |
Yes | Fetch pod logs, diagnose failure |
DEPLOY_FAILED_WITH_RETRY |
No | Report "Deploy failed, retrying...", continue polling |
PAUSED |
Yes | Report paused/stopped |
FAILED |
Yes | Report general failure |
CANCELLED |
Yes | Report cancelled |
Transition values (from deployment.currentStatus.transition):
| Transition | Meaning |
|---|---|
BUILDING |
Image build is in progress |
DEPLOYING |
Pods are being created/updated |
REUSING_EXISTING_BUILD |
Skipping build, reusing cached image |
COMPONENTS_DEPLOYING |
Multi-component deployment in progress |
WAITING |
Waiting for resources |
Best practice: Always check
deployment.currentStatus.state.isTerminalState === trueto decide whether to stop polling, rather than matching individual status strings. Thestate.displayfield gives a human-friendly label.
Polling schedule:
- First 2 minutes: check every 15 seconds
- Minutes 2-5: check every 30 seconds
- After 5 minutes: check every 60 seconds
- Timeout after 10 minutes — report current state and suggest the user check manually
Between polls, tell the user what you're waiting for. Do not silently loop. Do NOT ask "should I continue?" — just continue.
Step 3: On Success
When state.isTerminalState is true and status is DEPLOY_SUCCESS:
- Report the final status
- Show replicas ready (e.g., "2/2 replicas ready")
- Show the endpoint URL if the service has an exposed port
- Optionally run a quick health check on the endpoint:
# Only if the service exposes an HTTP port
curl -sf -o /dev/null -w '%{http_code}' "https://ENDPOINT_URL/health" || true
Report the HTTP status code. Do not fail the monitor if the health check fails — just report it.
Step 4: On Failure
When status is BUILD_FAILED, DEPLOY_FAILED, FAILED, or CANCELLED:
- Fetch recent logs using the
logsskill or direct API:
# Get the app ID first from the status response
TFY_API_SH=~/.claude/skills/truefoundry-monitor/scripts/tfy-api.sh
# Fetch recent logs (last 5 minutes)
bash $TFY_API_SH GET '/api/svc/v1/logs/WORKSPACE_ID/download?applicationFqn=APP_FQN&startTs=START_TS&endTs=END_TS'
- Identify the failure cause from the logs (OOMKilled, CrashLoopBackOff, ImagePullBackOff, port mismatch, etc.)
- Suggest a fix based on the error:
| Error Pattern | Suggested Fix |
|---|---|
OOMKilled |
Increase memory_limit in manifest |
CrashLoopBackOff |
Check startup command and logs for crash reason |
ImagePullBackOff |
Verify image URI and registry credentials |
| Port mismatch | Ensure manifest port matches what the app listens on |
Readiness probe failed |
Check health probe path and startup time |
| Build error | Check Dockerfile and build logs |
- Report summary with: error type, relevant log excerpt (max 20 lines), and suggested fix
- Do NOT auto-retry. Present the diagnosis and let the user decide next steps.
Presenting Status Updates
Use a consistent format for each status update:
Monitoring: my-service in cluster:workspace
Status: ROLLOUT_STARTED | Transition: DEPLOYING
Display: Deploying (1/2 replicas ready)
Elapsed: 45s
Next check in 15s...
Final summary on success:
Deployment complete: my-service
Status: DEPLOY_SUCCESS
Replicas: 2/2 ready
Endpoint: https://my-service-ws.example.com
Health check: 200 OK
Total time: 1m 32s
Final summary on failure:
Deployment failed: my-service
Status: DEPLOY_FAILED
Error: CrashLoopBackOff — container exited with code 1
Log excerpt:
> ModuleNotFoundError: No module named 'flask'
Suggested fix: Add 'flask' to requirements.txt and redeploy
<success_criteria>
Success Criteria
- Deployment status is tracked from current state to a terminal state
- User sees clear progress updates at each polling interval
- On success: replicas, endpoint URL, and optional health check are reported
- On failure: logs are fetched, root cause is identified, and a fix is suggested
- Monitor times out gracefully after 10 minutes with a status summary
- The user is never left waiting without feedback
</success_criteria>
Composability
- Before monitoring: Use
deployskill to deploy, then monitor - On failure: Use
logsskill for deeper log analysis - Check app details: Use
applicationsskill for full app info - Fix and redeploy: Use
deployskill to apply fixes
Error Handling
Application Not Found
Application "APP_NAME" not found in workspace "WORKSPACE_FQN".
Check:
- Application name is spelled correctly
- The deploy/apply command completed successfully
- You're checking the correct workspace
Timeout
Monitoring timed out after 10 minutes.
Current status: ROLLOUT_STARTED | Transition: DEPLOYING
The deployment is still in progress. Check manually:
- TrueFoundry dashboard: TFY_BASE_URL
- Or run this skill again to resume monitoring
Permission Denied
Cannot access this application. Check your API key permissions for this workspace.
More from truefoundry/tfy-deploy-skills
truefoundry-jobs
Deploys and monitors TrueFoundry batch jobs, scheduled cron jobs, and one-time tasks. Uses YAML manifests with `tfy apply`. Use when deploying jobs, scheduling cron tasks, checking job run status, or viewing execution history. For listing job applications, use `applications` skill.
11truefoundry-helm
Deploys infrastructure components via Helm charts on TrueFoundry. Supports any public or private OCI Helm chart including databases (Postgres, MongoDB, Redis), message brokers (Kafka, RabbitMQ), and vector databases (Qdrant, Milvus). Uses YAML manifests with `tfy apply`. Use when installing Helm charts or deploying infrastructure on TrueFoundry.
10truefoundry-docs
Fetches TrueFoundry documentation, API reference, and deployment guides. Use when the user needs platform docs or how-to guidance.
10truefoundry-logs
Views, downloads, and searches application and job logs from TrueFoundry. Supports time-range filtering, pod filtering, and error search.
10truefoundry-secrets
Manages TrueFoundry secret groups and secrets. Handles listing, creating, updating, and deleting secret groups and individual key-value secrets.
10truefoundry-workflows
Builds and deploys data processing and ML training pipelines using TrueFoundry Workflows (built on Flyte). Use when creating DAGs, orchestrating multi-step tasks, scheduling ETL pipelines, or running ML training workflows.
10