vtex-io-observability-and-ops
Observability & Operational Readiness
When this skill applies
Use this skill when a VTEX IO service needs better production visibility, troubleshooting behavior, or operational safety.
- Adding metrics to important client calls or flows
- Improving logs for routes, workers, or integrations
- Surfacing failures clearly for operations and support
- Reviewing whether a service is ready for production
- Monitoring rate-limit-sensitive integrations
Do not use this skill for:
- app policy declaration
- trust-boundary modeling
- frontend analytics or browser monitoring
- route contract design by itself
Decision rules
- Log enough structured context to debug failures, but do not log secrets or sensitive payloads.
- Use ctx.vtex.logger with appropriate log levels such as info, warn, and error instead of console.log, so logs are properly collected and searchable in the VTEX logging stack.
- Treat ctx.vtex.logger as the native platform logging mechanism. If a partner needs to forward logs to its own logging system, prefer doing that through a dedicated integration app or client instead of replacing the VTEX logger pattern inside every service.
- Use client-level metrics on important downstream calls so integration behavior is visible below the handler layer.
- Choose metric names that reflect the integration and operation, such as partner-get-order or partner-sync-catalog, so counts, latency, and error rates can be tracked over time.
- Make failures observable at the point where they happen. Do not swallow errors silently in routes, events, or workers.
- For rate-limit-sensitive APIs, combine short timeouts, backoff-aware retries, and caching of frequent reads to reduce burst pressure and avoid hitting hard limits.
- Review whether expensive or fragile flows expose enough operational signals before releasing them.
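The backoff-aware retry rule above can be sketched as a small helper. This is a minimal illustration, not the platform's own retry mechanism: backoffDelay, withBackoff, and the RetryOptions shape are hypothetical names introduced here, and the sketch leaves out jitter, timeouts, and caching.

```typescript
// Hypothetical helper names for illustration; not part of any VTEX package.
type SleepFn = (ms: number) => Promise<void>

interface RetryOptions {
  maxAttempts: number
  baseDelayMs: number
  maxDelayMs: number
  sleep?: SleepFn // injectable so tests do not need real timers
}

const defaultSleep: SleepFn = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Delay before the retry that follows attempt n (1-based):
// baseDelayMs * 2^(n-1), capped at maxDelayMs.
function backoffDelay(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1))
}

// Runs the operation, retrying retryable failures with growing delays.
// The last error is rethrown so the failure stays visible to the caller.
async function withBackoff<T>(
  operation: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  options: RetryOptions
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep
  let attempt = 1
  for (;;) {
    try {
      return await operation()
    } catch (error) {
      if (attempt >= options.maxAttempts || !isRetryable(error)) {
        throw error
      }
      await sleep(backoffDelay(attempt, options.baseDelayMs, options.maxDelayMs))
      attempt += 1
    }
  }
}
```

Capping the delay keeps a long outage from producing unbounded waits, and limiting retries to errors the caller marks as retryable avoids adding pressure on a partner API that is rejecting the request for other reasons.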
Hard constraints
Constraint: Important failures must be visible in logs, metrics, or durable state
Routes, event handlers, and workers MUST NOT hide important failures from operators.
Why this matters
If failures disappear silently, the service becomes impossible to diagnose under real traffic and retries.
Detection
If an error is caught and ignored without logging, metric emission, or explicit failure state, STOP and surface the failure.
Correct
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (error) {
  ctx.vtex.logger.error({
    message: 'Failed to send order to partner',
    orderId,
    account: ctx.vtex.account,
    routeId: ctx.vtex.route?.id,
  })
  throw error
}
Wrong
try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (_) {
  return
}
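To apply the correct pattern consistently across routes, events, and workers, the log-then-rethrow step can be factored into one helper. The sketch below is an assumption-laden illustration: runObserved and the SketchLogger interface are hypothetical, and the logger shape only mirrors the error call used in the Correct example.

```typescript
// Minimal logger shape assumed for this sketch; it only mirrors the
// error(...) call shown in the Correct example.
interface SketchLogger {
  error(data: Record<string, unknown>): void
}

// Hypothetical helper: runs a step, logs structured context on failure,
// then rethrows so the caller still sees the error.
async function runObserved<T>(
  logger: SketchLogger,
  context: Record<string, unknown>,
  step: () => Promise<T>
): Promise<T> {
  try {
    return await step()
  } catch (error) {
    logger.error({
      ...context,
      // Only the message is logged; review it for sensitive content too.
      errorMessage: error instanceof Error ? error.message : String(error),
    })
    throw error // never swallow the failure
  }
}
```

A handler could call it as runObserved(ctx.vtex.logger, { message: 'Failed to send order to partner', orderId }, () => ctx.clients.partnerApi.sendOrder(orderId)), assuming the real logger accepts the same object argument.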
Constraint: Metrics should be attached to important integration calls
Client calls that are operationally important SHOULD include a metric name in the request options so request behavior can be tracked consistently.
Why this matters
Without metrics, integration failures and latency patterns are much harder to isolate from generic route behavior.
Detection
If a key downstream integration call has no metric and operations depend on it, STOP and add a meaningful metric name.
Correct
return this.http.get(`/orders/${id}`, {
  metric: 'partner-get-order',
})
Wrong
return this.http.get(`/orders/${id}`)
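One way to keep metric names consistent across a client is to build them from the integration and operation in a single place. The following is a small sketch under that assumption; metricName and PARTNER_METRICS are hypothetical names, not part of the VTEX IO API.

```typescript
// Hypothetical helper: builds a metric name in the
// "<integration>-<operation>" form, e.g. "partner-get-order".
function metricName(integration: string, operation: string): string {
  const normalize = (part: string): string =>
    part
      .trim()
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, '-') // collapse spaces, underscores, etc.
      .replace(/^-+|-+$/g, '') // drop leading and trailing dashes

  const parts = [normalize(integration), normalize(operation)].filter(
    (part) => part.length > 0
  )
  if (parts.length !== 2) {
    throw new Error('metricName requires a non-empty integration and operation')
  }
  return parts.join('-')
}

// Declared once per client so every call site reuses the same names.
const PARTNER_METRICS = {
  getOrder: metricName('partner', 'get order'),
  syncCatalog: metricName('Partner', 'sync_catalog'),
}
```

A client method could then pass metric: PARTNER_METRICS.getOrder in the request options, in the same position as the literal string in the Correct example.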
Constraint: Logs must stay useful without leaking sensitive data
Logs MUST contain enough context to debug production behavior, but MUST NOT include secrets, tokens, or unnecessarily sensitive payloads.
Why this matters
Operational logs are only valuable if they are safe to retain and inspect. Logging sensitive data creates security risk without making diagnosis any more reliable.
Detection
If a log line includes tokens, auth headers, raw personal payloads, or entire downstream responses, STOP and sanitize the log.
Correct
ctx.vtex.logger.info({
  message: 'Partner sync started',
  orderId,
  account: ctx.vtex.account,
})
Wrong
ctx.vtex.logger.info({
  message: 'Partner sync started',
  body: ctx.request.body,
  auth: ctx.request.header.authorization,
})
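Picking explicit fields, as the Correct example does, is the safest approach. When a larger object does need to be logged, a redaction pass can act as a fallback. The sketch below assumes a key-name heuristic; redactForLog and the key pattern are hypothetical and should be adapted to the payloads a service actually handles.

```typescript
// Assumed heuristic: key names that usually carry credentials.
const SENSITIVE_KEY = /authorization|cookie|token|secret|password|api-?key|appkey/i
const REDACTED = '[REDACTED]'

// Hypothetical helper: returns a copy of the value with
// sensitive-looking keys masked at any nesting level.
function redactForLog(value: unknown, depth = 0): unknown {
  if (depth > 5) {
    return '[TRUNCATED]' // guard against very deep or cyclic structures
  }
  if (Array.isArray(value)) {
    return value.map((item) => redactForLog(item, depth + 1))
  }
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>
    const result: Record<string, unknown> = {}
    for (const key of Object.keys(source)) {
      result[key] = SENSITIVE_KEY.test(key)
        ? REDACTED
        : redactForLog(source[key], depth + 1)
    }
    return result
  }
  return value
}
```

Key-name matching does not catch personal data stored under neutral keys, so this helper complements field selection rather than replacing it.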
Preferred pattern
Operationally healthy VTEX IO services should:
- emit metrics for important client calls so counts, latency, and error rates are visible
- log failures with enough structured context such as domain IDs, account, and routeId
- avoid silent error swallowing
- sanitize sensitive data before logging
- review retries, caching, and throughput with rate-limit behavior in mind
Use observability to shorten diagnosis time, not just to create more logs.
Common failure modes
- Catching and ignoring errors in async flows.
- Logging too little context to diagnose production incidents.
- Logging too much sensitive data.
- Omitting metrics from important integration calls.
- Treating rate-limit failures as isolated bugs instead of operational signals.
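To treat rate-limit failures as a distinct operational signal, a service can classify downstream errors before logging them. This is a sketch only: classifyFailure and FailureKind are hypothetical names, and the error shape (a response.status number, as on axios-style errors) is an assumption that should be checked against the errors the client actually throws.

```typescript
type FailureKind = 'rate-limited' | 'upstream-error' | 'client-error' | 'unknown'

// Assumes an axios-style error carrying response.status;
// returns undefined when that shape is not present.
function statusOf(error: unknown): number | undefined {
  const candidate = error as { response?: { status?: unknown } } | null | undefined
  const status = candidate?.response?.status
  return typeof status === 'number' ? status : undefined
}

// Hypothetical classifier: maps an error to a coarse failure kind
// that can be logged as a structured field and counted over time.
function classifyFailure(error: unknown): FailureKind {
  const status = statusOf(error)
  if (status === undefined) return 'unknown'
  if (status === 429) return 'rate-limited'
  if (status >= 500) return 'upstream-error'
  if (status >= 400) return 'client-error'
  return 'unknown'
}
```

Logging the result as a field, for example failureKind: classifyFailure(error), lets operators filter and count rate-limit events separately from other integration errors.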
Review checklist
- Are important failures visible to operators?
- Do key integrations emit useful metrics?
- Are logs structured and safe?
- Are retries, caching, and rate-limit behavior considered together?
- Would someone on call be able to diagnose this flow from the available signals?
Reference
- Using Node Clients - Client usage patterns relevant to metrics and retries
- Best practices for avoiding rate-limit errors - Operational guidance for stable integrations