Observability & Operational Readiness

When this skill applies

Use this skill when a VTEX IO service needs better production visibility, troubleshooting behavior, or operational safety.

Adding metrics to important client calls or flows
Improving logs for routes, workers, or integrations
Surfacing failures clearly for operations and support
Reviewing whether a service is ready for production
Monitoring rate-limit-sensitive integrations

Do not use this skill for:

app policy declaration
trust-boundary modeling
frontend analytics or browser monitoring
route contract design by itself

Decision rules

Log enough structured context to debug failures, but do not log secrets or sensitive payloads.
Use ctx.vtex.logger with appropriate log levels such as info, warn, and error instead of console.log, so logs are properly collected and searchable in the VTEX logging stack.
Treat ctx.vtex.logger as the native platform logging mechanism. If a partner needs to forward logs to its own logging system, prefer doing that through a dedicated integration app or client instead of replacing the VTEX logger pattern inside every service.
Use client-level metrics on important downstream calls so integration behavior is visible below the handler layer.
Choose metric names that reflect the integration and operation, such as partner-get-order or partner-sync-catalog, so counts, latency, and error rates can be tracked over time.
Make failures observable at the point where they happen. Do not swallow errors silently in routes, events, or workers.
For rate-limit-sensitive APIs, combine short timeouts, backoff-aware retries, and caching of frequent reads to reduce burst pressure and avoid hitting hard limits.
Review whether expensive or fragile flows expose enough operational signals before releasing them.
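
The rule about backoff-aware retries can be sketched as a small helper. This is a minimal illustration, not a VTEX IO API: the names withRetry, retryDelayMs, backoffDelayMs, and isRetryable are hypothetical, and the error shape assumes an Axios-style client that rejects with an object exposing response.status and response.headers. The base delay, cap, and attempt count are placeholder values to tune per integration.

```typescript
// Hypothetical error shape: assumes the HTTP client rejects with an object
// exposing the response status and headers (Axios-style).
interface HttpLikeError {
  response?: { status?: number; headers?: Record<string, string | undefined> }
}

// Exponential backoff with a cap: base, 2x base, 4x base, ... up to maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

// Prefer the server's Retry-After hint (in seconds) when it is a positive number;
// otherwise fall back to exponential backoff.
function retryDelayMs(error: HttpLikeError, attempt: number): number {
  const retryAfter = Number(error.response?.headers?.['retry-after'])
  if (Number.isFinite(retryAfter) && retryAfter > 0) {
    return retryAfter * 1000
  }
  return backoffDelayMs(attempt)
}

// Retry only rate-limit (429) and server-side (5xx) failures.
function isRetryable(error: HttpLikeError): boolean {
  const status = error.response?.status
  return status === 429 || (status !== undefined && status >= 500)
}

async function withRetry<T>(call: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call()
    } catch (error) {
      const httpError = error as HttpLikeError
      const isLastAttempt = attempt >= maxAttempts - 1
      if (isLastAttempt || !isRetryable(httpError)) {
        throw error // surface the failure instead of swallowing it
      }
      const delay = retryDelayMs(httpError, attempt)
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
  }
}
```

A caller would wrap a single downstream call, for example withRetry(() => ctx.clients.partnerApi.sendOrder(orderId)), and still log and rethrow when the final attempt fails. If the client already retries internally, stacking this helper on top would multiply the number of attempts, so use one retry layer per call.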

Hard constraints

Constraint: Important failures must be visible in logs, metrics, or durable state

Routes, event handlers, and workers MUST NOT hide important failures from operators.

Why this matters

If failures disappear silently, the service becomes impossible to diagnose under real traffic and retries.

Detection

If an error is caught and ignored without logging, metric emission, or explicit failure state, STOP and surface the failure.

Correct

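try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (error) {
  // Log structured context plus the error message, then rethrow so the
  // failure stays visible to callers and to the platform.
  ctx.vtex.logger.error({
    message: 'Failed to send order to partner',
    error: error instanceof Error ? error.message : String(error),
    orderId,
    account: ctx.vtex.account,
    routeId: ctx.vtex.route?.id,
  })
  throw error
}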

Wrong

try {
  await ctx.clients.partnerApi.sendOrder(orderId)
} catch (_) {
  // The failure disappears here: no log, no metric, no rethrow.
  return
}
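
When logging a caught error, it helps to reduce it to a few safe fields instead of passing the whole object, which may carry the full downstream response. The sketch below is one way to do that. The helper name describeError and the SafeErrorDetails type are hypothetical, and the response.status lookup assumes an Axios-style error shape.

```typescript
interface SafeErrorDetails {
  name: string
  message: string
  status?: number
}

// Reduces an unknown thrown value to a small, loggable summary.
// Assumption: HTTP errors expose response.status (Axios-style).
function describeError(error: unknown): SafeErrorDetails {
  if (error instanceof Error) {
    const withResponse = error as Error & { response?: { status?: number } }
    const status = withResponse.response?.status
    return {
      name: error.name,
      message: error.message,
      // Only the status code is copied; response bodies and headers are left out.
      ...(status !== undefined ? { status } : {}),
    }
  }
  // Non-Error values (strings, plain objects) can also be thrown in JavaScript.
  return { name: 'NonError', message: String(error) }
}
```

The result can be spread into the logger call, for example ctx.vtex.logger.error({ message: 'Failed to send order to partner', orderId, ...describeError(error) }), where the summary's message field replaces the static one. Error messages produced by downstream systems can themselves contain sensitive values, so review what an integration puts in them before relying on this.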

Constraint: Metrics should be attached to important integration calls

Client calls that are operationally important SHOULD set the metric option so request behavior can be tracked consistently.

Why this matters

Without metrics, integration failures and latency patterns are much harder to isolate from generic route behavior.

Detection

If a key downstream integration call has no metric and operations depend on it, STOP and add a meaningful metric name.

Correct

return this.http.get(`/orders/${id}`, {
  metric: 'partner-get-order',
})

Wrong

return this.http.get(`/orders/${id}`)

Constraint: Logs must stay useful without leaking sensitive data

Logs MUST contain enough context to debug production behavior, but MUST NOT include secrets, tokens, or unnecessarily sensitive payloads.

Why this matters

Operational logs are only valuable if they are safe to retain and inspect. Logging sensitive data creates security risk, and dumping raw payloads still does not guarantee that the fields needed for diagnosis are present.

Detection

If a log line includes tokens, auth headers, raw personal payloads, or entire downstream responses, STOP and sanitize the log.

Correct

ctx.vtex.logger.info({
  message: 'Partner sync started',
  orderId,
  account: ctx.vtex.account,
})

Wrong

ctx.vtex.logger.info({
  message: 'Partner sync started',
  body: ctx.request.body,
  auth: ctx.request.header.authorization,
})
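
Selecting specific fields, as in the Correct example, is the safest approach. Where a larger object does have to be logged, a redaction pass can act as a safety net. The following is a minimal sketch under assumptions: redactForLog, isSensitiveKey, and the SENSITIVE_KEYS deny-list are hypothetical names, and the list is illustrative rather than complete.

```typescript
// Hypothetical deny-list; extend it to match the fields your payloads carry.
const SENSITIVE_KEYS = ['authorization', 'cookie', 'password', 'token', 'secret', 'apikey']

// Matches case-insensitively and ignores '-' and '_' so that
// 'X-Api-Key', 'api_key', and 'apiKey' are all treated the same way.
function isSensitiveKey(key: string): boolean {
  const normalized = key.toLowerCase().replace(/[-_]/g, '')
  return SENSITIVE_KEYS.some((sensitive) => normalized.includes(sensitive))
}

// Returns a copy of the value with sensitive fields replaced by a marker.
// The input is never mutated. Nesting deeper than 5 levels is cut off.
function redactForLog(value: unknown, depth = 0): unknown {
  if (depth > 5) return '[truncated]'
  if (Array.isArray(value)) {
    return value.map((item) => redactForLog(item, depth + 1))
  }
  if (value !== null && typeof value === 'object') {
    const result: Record<string, unknown> = {}
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      result[key] = isSensitiveKey(key) ? '[redacted]' : redactForLog(child, depth + 1)
    }
    return result
  }
  return value
}
```

Key-based redaction only catches fields whose names look sensitive. It does not detect personal data stored under neutral keys such as name or address, so it complements an explicit field selection rather than replacing it.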

Preferred pattern

Operationally healthy VTEX IO services should:

emit metrics for important client calls so counts, latency, and error rates are visible
log failures with enough structured context such as domain IDs, account, and routeId
avoid silent error swallowing
sanitize sensitive data before logging
review retries, caching, and throughput with rate-limit behavior in mind

Use observability to shorten diagnosis time, not just to create more logs.

Common failure modes

Catching and ignoring errors in async flows.
Logging too little context to diagnose production incidents.
Logging too much sensitive data.
Omitting metrics from important integration calls.
Treating rate-limit failures as isolated bugs instead of operational signals.

Review checklist

Are important failures visible to operators?
Do key integrations emit useful metrics?
Are logs structured and safe?
Are retries, caching, and rate-limit behavior considered together?
Would someone on call be able to diagnose this flow from the available signals?

Reference

Using Node Clients - Client usage patterns relevant to metrics and retries
Best practices for avoiding rate-limit errors - Operational guidance for stable integrations
