truefoundry-monitor

Routing note: For ambiguous user intents, use the shared clarification templates in references/intent-clarification.md.

Monitor Deployment

Track a TrueFoundry deployment rollout to completion, reporting status at each stage and diagnosing failures automatically.

When to Use

After tfy apply or tfy deploy to track rollout progress
User says "monitor", "watch deployment", "is my deploy done", "check rollout"
Called automatically by the deploy skill after a successful apply/deploy

When NOT to Use

User wants to deploy → prefer the deploy skill; ask whether the user wants a different valid path
User wants to list all apps → prefer the applications skill; ask whether the user wants a different valid path
User wants to read historical logs → prefer the logs skill; ask whether the user wants a different valid path

CRITICAL BEHAVIOR RULES

RULE 1: Once monitoring starts, you MUST poll until a terminal state or timeout. Do NOT stop early. Do NOT ask the user "should I keep checking?" — just keep checking.

RULE 2: Do NOT end your response while the deployment is in a non-terminal state (INITIALIZED, BUILDING, BUILD_SUCCESS, ROLLOUT_STARTED, DEPLOY_FAILED_WITH_RETRY). If you are about to stop and the status is non-terminal, you are violating this rule; continue polling.

RULE 3: Between each poll, briefly tell the user what you're waiting for. Do NOT silently loop, but also do NOT ask for permission to continue.

Required Information

Before monitoring, you need:

Workspace FQN (TFY_WORKSPACE_FQN) — HARD RULE: Never auto-pick. Always ask the user to confirm.
Application name — the service or job name being deployed

If invoked right after a deploy, both should already be known from the deploy context.

Execution Priority

For all status checks, use MCP tool calls first:

tfy_applications_list(filters={"workspace_fqn": "WORKSPACE_FQN", "application_name": "APP_NAME"})

If MCP tool calls are unavailable, fall back to direct API via tfy-api.sh.

When using direct API, set TFY_API_SH to the full path of this skill's scripts/tfy-api.sh. See references/tfy-api-setup.md for paths per agent.

Monitoring Flow

Step 1: Initial Status Check

TFY_API_SH=~/.claude/skills/truefoundry-monitor/scripts/tfy-api.sh
bash "$TFY_API_SH" GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=APP_NAME'

Extract from the response at data[0] (the application object):

deployment.currentStatus.status — the deployment status enum
deployment.currentStatus.transition — current transition (e.g., BUILDING, DEPLOYING)
deployment.currentStatus.state.isTerminalState — boolean, most reliable terminal check
deployment.currentStatus.state.display — human-readable state
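The field extraction above could be sketched as follows. This is only an illustration: it assumes `jq` is installed and that the response is a JSON object with a top-level `data` array, as described in this step. `extract_status` is a hypothetical helper name, not part of the skill's scripts.

```shell
# Hypothetical helper (not part of tfy-api.sh): reads the API response JSON
# on stdin and prints status, transition, isTerminalState and display as one
# tab-separated line. Assumes jq is available on PATH.
extract_status() {
  jq -r '.data[0].deployment.currentStatus
         | [.status, .transition, (.state.isTerminalState | tostring), .state.display]
         | @tsv'
}

# Usage, with the response from the Step 1 call piped in:
#   bash "$TFY_API_SH" GET '/api/svc/v1/apps?workspaceFqn=WORKSPACE_FQN&applicationName=APP_NAME' | extract_status
```

Printing the four values on one line keeps a single poll to a single `jq` invocation; missing fields come out as empty columns rather than aborting the monitor.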

Step 2: Poll Until Terminal State

The API response has two key fields: status (the deployment status) and transition (what's happening now). Use state.isTerminalState as the authoritative check for whether to stop polling.

Status values (from deployment.currentStatus.status):

| Status | Terminal? | Action |
|---|---|---|
| `INITIALIZED` | No | Report "Deployment initialized, waiting...", continue polling |
| `BUILDING` | No | Report "Build in progress", continue polling |
| `BUILD_SUCCESS` | No | Report "Build succeeded, deploying...", continue polling |
| `BUILD_FAILED` | Yes | Fetch build logs, report failure |
| `ROLLOUT_STARTED` | No | Report "Rollout started", continue polling |
| `DEPLOY_SUCCESS` | Yes | Report success with endpoint URL |
| `DEPLOY_FAILED` | Yes | Fetch pod logs, diagnose failure |
| `DEPLOY_FAILED_WITH_RETRY` | No | Report "Deploy failed, retrying...", continue polling |
| `PAUSED` | Yes | Report paused/stopped |
| `FAILED` | Yes | Report general failure |
| `CANCELLED` | Yes | Report cancelled |

Transition values (from deployment.currentStatus.transition):

| Transition | Meaning |
|---|---|
| `BUILDING` | Image build is in progress |
| `DEPLOYING` | Pods are being created/updated |
| `REUSING_EXISTING_BUILD` | Skipping build, reusing cached image |
| `COMPONENTS_DEPLOYING` | Multi-component deployment in progress |
| `WAITING` | Waiting for resources |

Best practice: Always check deployment.currentStatus.state.isTerminalState === true to decide whether to stop polling, rather than matching individual status strings. The state.display field gives a human-friendly label.
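One way to apply this best practice is to let `isTerminalState` decide whether to stop, and consult the status string only to choose the follow-up. The sketch below assumes the two values have already been extracted as plain strings; `next_action` and the action labels it prints are hypothetical names used for illustration.

```shell
# Hypothetical helper: decide what to do after one poll.
#   $1 = deployment.currentStatus.status
#   $2 = deployment.currentStatus.state.isTerminalState ("true" or "false")
next_action() {
  # The terminal flag is authoritative: anything non-terminal keeps polling.
  if [ "$2" != "true" ]; then
    echo "continue"
    return 0
  fi
  # Terminal: the status string only selects which report to produce.
  case "$1" in
    DEPLOY_SUCCESS) echo "report_success" ;;
    BUILD_FAILED)   echo "fetch_build_logs" ;;
    DEPLOY_FAILED)  echo "fetch_pod_logs" ;;
    PAUSED)         echo "report_paused" ;;
    CANCELLED)      echo "report_cancelled" ;;
    *)              echo "report_failure" ;;  # FAILED or an unrecognised terminal status
  esac
}
```

Because the stop decision never depends on the status string, a status value missing from the table still terminates the loop correctly once the API marks it terminal.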

Polling schedule:

First 2 minutes: check every 15 seconds
Minutes 2-5: check every 30 seconds
After 5 minutes: check every 60 seconds
Timeout after 10 minutes — report current state and suggest the user check manually
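The schedule above can be expressed as a small function of elapsed time. This is a minimal sketch: `poll_interval` is a hypothetical name, and treating the exact boundaries (120 s, 300 s, 600 s) as belonging to the later tier is an assumption, since the schedule does not say which side they fall on.

```shell
# Hypothetical helper: $1 = seconds elapsed since monitoring started.
# Prints the number of seconds to wait before the next poll, or "timeout"
# once the 10-minute limit has been reached.
poll_interval() {
  if   [ "$1" -ge 600 ]; then echo "timeout"   # 10 minutes: stop and report
  elif [ "$1" -ge 300 ]; then echo 60          # after 5 minutes
  elif [ "$1" -ge 120 ]; then echo 30          # minutes 2-5
  else                        echo 15          # first 2 minutes
  fi
}
```

For example, `poll_interval 45` prints `15` and `poll_interval 600` prints `timeout`.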

Between polls, tell the user what you're waiting for. Do not silently loop. Do NOT ask "should I continue?" — just continue.

Step 3: On Success

When state.isTerminalState is true and status is DEPLOY_SUCCESS:

Report the final status
Show replicas ready (e.g., "2/2 replicas ready")
Show the endpoint URL if the service has an exposed port
Optionally run a quick health check on the endpoint:

# Only if the service exposes an HTTP port
curl -sf -o /dev/null -w '%{http_code}' "https://ENDPOINT_URL/health" || true

Report the HTTP status code. Do not fail the monitor if the health check fails — just report it.

Step 4: On Failure

When status is BUILD_FAILED, DEPLOY_FAILED, FAILED, or CANCELLED:

Fetch recent logs using the logs skill or direct API:

# Get the app ID first from the status response
TFY_API_SH=~/.claude/skills/truefoundry-monitor/scripts/tfy-api.sh

# Fetch recent logs (last 5 minutes)
bash "$TFY_API_SH" GET '/api/svc/v1/logs/WORKSPACE_ID/download?applicationFqn=APP_FQN&startTs=START_TS&endTs=END_TS'

Identify the failure cause from the logs (OOMKilled, CrashLoopBackOff, ImagePullBackOff, port mismatch, etc.)
Suggest a fix based on the error:

| Error Pattern | Suggested Fix |
|---|---|
| `OOMKilled` | Increase `memory_limit` in manifest |
| `CrashLoopBackOff` | Check startup command and logs for crash reason |
| `ImagePullBackOff` | Verify image URI and registry credentials |
| Port mismatch | Ensure manifest port matches what the app listens on |
| `Readiness probe failed` | Check health probe path and startup time |
| Build error | Check Dockerfile and build logs |

Report summary with: error type, relevant log excerpt (max 20 lines), and suggested fix
Do NOT auto-retry. Present the diagnosis and let the user decide next steps.
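The pattern-to-fix lookup in the steps above might be sketched like this. It covers only the table rows that correspond to a literal string in the logs; port mismatches and build errors have no single marker string, so they fall through to the default and still need a manual read. `suggest_fix` is a hypothetical helper name, and the fallback message is an assumption rather than wording from this skill.

```shell
# Hypothetical helper: reads log text on stdin and prints the suggestion for
# the first known error pattern found. The body runs in a subshell ( ... ),
# so the `logs` variable does not leak into the caller.
suggest_fix() (
  logs=$(cat)
  case "$logs" in
    *OOMKilled*)                echo "Increase memory_limit in manifest" ;;
    *CrashLoopBackOff*)         echo "Check startup command and logs for crash reason" ;;
    *ImagePullBackOff*)         echo "Verify image URI and registry credentials" ;;
    *"Readiness probe failed"*) echo "Check health probe path and startup time" ;;
    *)                          echo "No known pattern matched; inspect the logs manually" ;;
  esac
)
```

The function only prints a suggestion and never triggers a redeploy, in keeping with the no-auto-retry rule.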

Presenting Status Updates

Use a consistent format for each status update:

Monitoring: my-service in cluster:workspace
Status: ROLLOUT_STARTED | Transition: DEPLOYING
Display: Deploying (1/2 replicas ready)
Elapsed: 45s
Next check in 15s...
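To keep every update in the same shape, the format above can be produced by a small formatter. This is a sketch only: `format_elapsed` and `format_status` are hypothetical names, and rendering durations of a minute or more as `1m 32s` is inferred from the success summary rather than specified anywhere.

```shell
# Hypothetical helper: render a duration in seconds as "45s" or "1m 32s".
format_elapsed() {
  if [ "$1" -lt 60 ]; then
    printf '%ss' "$1"
  else
    printf '%sm %ss' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
  fi
}

# Hypothetical helper: print one status update in the format shown above.
# Args: app, workspace, status, transition, display, elapsed_s, next_check_s
format_status() {
  printf 'Monitoring: %s in %s\nStatus: %s | Transition: %s\nDisplay: %s\nElapsed: %s\nNext check in %ss...\n' \
    "$1" "$2" "$3" "$4" "$5" "$(format_elapsed "$6")" "$7"
}

# Usage:
#   format_status my-service cluster:workspace ROLLOUT_STARTED DEPLOYING \
#     "Deploying (1/2 replicas ready)" 45 15
```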

Final summary on success:

Deployment complete: my-service
Status: DEPLOY_SUCCESS
Replicas: 2/2 ready
Endpoint: https://my-service-ws.example.com
Health check: 200 OK
Total time: 1m 32s

Final summary on failure:

Deployment failed: my-service
Status: DEPLOY_FAILED
Error: CrashLoopBackOff — container exited with code 1
Log excerpt:
  > ModuleNotFoundError: No module named 'flask'
Suggested fix: Add 'flask' to requirements.txt and redeploy

<success_criteria>

Success Criteria

Deployment status is tracked from current state to a terminal state
User sees clear progress updates at each polling interval
On success: replicas, endpoint URL, and optional health check are reported
On failure: logs are fetched, root cause is identified, and a fix is suggested
Monitor times out gracefully after 10 minutes with a status summary
The user is never left waiting without feedback

</success_criteria>

Composability

Before monitoring: Use deploy skill to deploy, then monitor
On failure: Use logs skill for deeper log analysis
Check app details: Use applications skill for full app info
Fix and redeploy: Use deploy skill to apply fixes

Error Handling

Application Not Found

Application "APP_NAME" not found in workspace "WORKSPACE_FQN".
Check:
- Application name is spelled correctly
- The deploy/apply command completed successfully
- You're checking the correct workspace

Timeout

Monitoring timed out after 10 minutes.
Current status: ROLLOUT_STARTED | Transition: DEPLOYING
The deployment is still in progress. Check manually:
- TrueFoundry dashboard: TFY_BASE_URL
- Or run this skill again to resume monitoring

Permission Denied

Cannot access this application. Check your API key permissions for this workspace.

First Seen

Apr 1, 2026
