Log Analysis

When to use this skill

  • The main job is read-only log triage, not code changes or monitoring design.
  • The user wants the first actionable blocker, not a paraphrase of every line.
  • The evidence is application, API, worker, proxy, container, pod, browser, CI, or JSON logs.
  • The user needs the repeated signature or blast radius summarized after the first failure is isolated.
  • The prompt is really "which lines matter / what is the real error / where does the cascade start?" even if the user never says "triage".

Do not use this skill as the main workflow when:

  • The logs are Unity / Unreal build, cook, package, editor, or player logs → use game-build-log-triage.
  • The real job is instrumentation, dashboards, alerting, ingestion, retention, or observability coverage → use monitoring-observability.
  • The likely blocker is already known and the user now needs reproduction, hypotheses, or fixes → use debugging.
  • The main job is repeated anomaly/rule hunting across logs or telemetry families rather than first-failure triage → use pattern-detection.

Core idea

log-analysis should act like a packet router, not a giant troubleshooting encyclopedia.

  1. Normalize the request into one primary log packet.
  2. Narrow the evidence slice before interpreting it.
  3. Isolate the earliest actionable failure.
  4. Group repeated fallout into a pattern / blast radius note.
  5. Route out as soon as the work becomes debugging, observability design, anomaly hunting, or engine-specialist triage.

Read the support docs listed under References before choosing the packet.

Instructions

Step 1: Normalize the request

Convert the prompt into this intake shape first:

log_analysis_packet:
  primary_packet: app-runtime | container-runtime | browser-plus-api | ci-cascade | structured-json | security-signal
  source_shape: app | proxy | worker | browser | ci | container | pod | json | mixed | unknown
  environment: local | ci | staging | production | browser | container | pod | unknown
  failure_goal: first-blocker | cascade-start | repeated-signature | blast-radius | suspicious-access | unknown
  anchor: timestamp | request-id | trace-id | job-build-id | browser-route | none | unknown
  route_after: stay-here | debugging | monitoring-observability | pattern-detection | game-build-log-triage

Choose one primary packet for the run. If two seem plausible, pick the cheaper packet that reduces uncertainty fastest.
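The intake shape can be sketched as a small data structure whose fields default to `unknown` until evidence fills them in. The class and helper below are hypothetical, not part of the skill; the field names simply mirror the `log_analysis_packet` keys above.

```python
from dataclasses import dataclass

@dataclass
class LogAnalysisPacket:
    # Fields mirror the log_analysis_packet intake shape above.
    primary_packet: str = "unknown"
    source_shape: str = "unknown"
    environment: str = "unknown"
    failure_goal: str = "unknown"
    anchor: str = "unknown"
    route_after: str = "stay-here"

def normalize_request(hints: dict) -> LogAnalysisPacket:
    """Fill only the fields the prompt supports; leave the rest 'unknown'."""
    known = {k: v for k, v in hints.items() if v}
    return LogAnalysisPacket(**known)

packet = normalize_request({"source_shape": "ci", "failure_goal": "first-blocker"})
# primary_packet stays "unknown" until Step 2 chooses one
```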

Step 2: Choose the packet

| Packet | Use when | Best fits | Typical anchors |
|---|---|---|---|
| app-runtime | The key evidence is app/API/worker/proxy text logs | crashes, stack traces, request failures, queue poison messages | earliest fatal/error line, route, service, request ID |
| container-runtime | The evidence comes from `docker logs`, `kubectl logs`, pod output, or deploy-window restarts | container crashes, env/config mismatch, dependency connectivity, restart loops | pod/container name, deploy window, host, request ID |
| browser-plus-api | Browser console/network symptoms need server-side confirmation | 401/403/500 flows, failed fetch, CORS/auth mismatch, SSR/client divergence | route, request ID, timestamp, browser/network trace |
| ci-cascade | CI output contains many secondary failures after one blocker | install/import/test/build cascades, missing dependency/config, runner mismatch | job name, step name, stage, earliest stack trace/import error |
| structured-json | The logs are JSON or field-rich event records | grouped error families, request/trace correlation, worker/event triage | level, service, request ID, trace ID, tenant, event name |
| security-signal | Access/error logs suggest suspicious probing or auth/permission anomalies | repeated 401/403/404 probes, token misuse, rate-limit storms | IP/user/session, route family, status code, time window |

Packet rules:

  • Prefer app-runtime for plain text stack traces and server logs.
  • Prefer container-runtime when restart timing, pod identity, or env/deploy context matters.
  • Prefer browser-plus-api when frontend symptoms are not sufficient on their own.
  • Prefer ci-cascade when the visible failure may be generic abort noise.
  • Prefer structured-json when fields make grouping and correlation cheaper than free-text scanning.
  • Prefer security-signal only when suspicious access/auth behavior is the main job; otherwise keep security-looking noise inside the packet that owns the first blocker.
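The packet rules above amount to a small decision function. A minimal sketch, assuming `source_shape` comes from the Step 1 intake; the mapping is illustrative, not exhaustive:

```python
def choose_packet(source_shape: str, suspicious_access: bool = False) -> str:
    """Rule-of-thumb packet chooser following the preference rules above."""
    # security-signal only when suspicious access/auth behavior is the main job
    if suspicious_access:
        return "security-signal"
    return {
        "app": "app-runtime",
        "proxy": "app-runtime",
        "worker": "app-runtime",
        "container": "container-runtime",
        "pod": "container-runtime",
        "browser": "browser-plus-api",
        "ci": "ci-cascade",
        "json": "structured-json",
    }.get(source_shape, "app-runtime")  # default to the cheapest text packet
```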

Step 3: Narrow the slice before reading everything

Apply at least one narrowing move before interpreting the logs:

  • limit by time window
  • limit by request / trace / job / build / session / tenant identifier
  • separate fatal/actionable lines from retries and fallout
  • separate one noisy source from many affected sources
  • separate browser symptom lines from server-side blocker lines
  • in CI, locate the earliest failing step before summarizing the full transcript
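The first two narrowing moves (time window, then identifier) can be sketched as a simple filter. This assumes lines start with an ISO-8601 timestamp, which is an assumption about the log format, not a guarantee:

```python
def narrow(lines, since=None, until=None, anchor=None):
    """Keep lines inside a time window that mention an anchor
    (request/trace/job/session identifier). ISO-8601 timestamps
    sort lexicographically, so plain string comparison works."""
    out = []
    for line in lines:
        ts = line[:19]  # e.g. "2024-05-01T12:00:00"
        if since and ts < since:
            continue
        if until and ts > until:
            continue
        if anchor and anchor not in line:
            continue
        out.append(line)
    return out
```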

Useful heuristics by packet:

  • app-runtime → exception / fatal / failed / timeout / refusal first
  • container-runtime → restart window + dependency/connectivity/env mismatch first
  • browser-plus-api → backend auth/config/runtime evidence before generic client symptoms
  • ci-cascade → earliest import/config/build/test failure before abort/footer lines
  • structured-json → group by message family, exception class, request ID, or service before reading raw rows
  • security-signal → distinguish broad probing from one broken client before escalating
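The per-packet heuristics above translate into first-pass search patterns. A hedged sketch: the patterns below are illustrative starting points, not a complete taxonomy, and would need tuning per codebase:

```python
import re

# Illustrative first-pass patterns per packet.
FIRST_HIT = {
    "app-runtime": r"exception|fatal|failed|timeout|refused",
    "container-runtime": r"ECONNREFUSED|CrashLoopBackOff|no such host|missing env",
    "ci-cascade": r"ModuleNotFoundError|ImportError|command not found",
    "security-signal": r"\b(401|403|404)\b",
}

def first_hit(packet, lines):
    """Return (index, line) of the earliest line matching the
    packet's heuristic, or None if nothing matches."""
    pat = re.compile(FIRST_HIT[packet], re.IGNORECASE)
    for i, line in enumerate(lines):
        if pat.search(line):
            return i, line
    return None
```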

Step 4: Isolate the first actionable failure

Use this order:

  1. Hard stop — crash, panic, uncaught exception, process exit, build failure
  2. Dependency / environment blocker — missing config, secret, DNS, file, service, auth, or connection
  3. Request / runtime failure — 500, timeout, rejected promise, queue poison message, parser failure
  4. Fallout — retries, secondary warnings, repeated health-check failures, broad abort text

Do not report 20 repeated downstream lines as 20 different causes.
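The four-level ordering above can be sketched as a tiered scan: check the highest-priority tier across all lines before dropping to the next, and treat anything unmatched as fallout. Needle lists are hypothetical examples:

```python
# Tiers mirror the ordering above; fallout is whatever never matches.
TIERS = [
    ("hard-stop", ("panic", "uncaught", "process exited", "build failed", "crash")),
    ("dependency-blocker", ("missing config", "econnrefused", "dns", "no such file")),
    ("runtime-failure", ("status=500", "timeout", "unhandled rejection", "parse error")),
]

def first_actionable(lines):
    """Earliest line in the highest-priority matching tier."""
    for tier, needles in TIERS:
        for i, line in enumerate(lines):
            low = line.lower()
            if any(n in low for n in needles):
                return tier, i, line
    return "fallout-only", None, None
```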

Step 5: Correlate and classify

If the evidence spans more than one source, correlate instead of concatenating.

Primary classification buckets:

  • missing-config-or-secret
  • dependency-or-connection
  • auth-or-permission
  • request-or-runtime-error
  • data-shape-or-validation
  • resource-or-capacity
  • browser-network-mismatch
  • ci-build-test-failure
  • security-or-suspicious-pattern
  • unknown-needs-more-context

Correlation anchors to prefer:

  • timestamp window
  • request / trace / correlation ID
  • job/build ID or CI step
  • service / worker / pod / container name
  • route, browser action, or API endpoint
  • user / tenant / session identifier when safe to mention
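Correlating instead of concatenating means grouping records from all sources under one anchor. A minimal sketch for JSON logs, assuming a `request_id` field (a hypothetical field name; substitute whatever anchor the logs actually carry):

```python
import json
from collections import defaultdict

def correlate(streams):
    """Group JSON log records from several sources by request_id,
    so each anchor shows its cross-source story in one place."""
    by_request = defaultdict(list)
    for source, raw_lines in streams.items():
        for raw in raw_lines:
            rec = json.loads(raw)
            key = rec.get("request_id", "no-anchor")
            by_request[key].append((source, rec))
    return by_request
```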

Step 6: Return a triage brief

Default response shape:

# Log Triage

## Source
- Packet: app-runtime | container-runtime | browser-plus-api | ci-cascade | structured-json | security-signal
- Environment: local | CI | staging | production | browser | container | pod
- Confidence: high | medium | low

## First actionable failure
- Line or excerpt: `...`
- Why it matters: ...
- Why later lines look secondary: ...

## Pattern / blast radius
- Repeated signature: ...
- Scope: one request | repeated requests | one worker | one deploy window | one environment | broad

## Classification
- Primary bucket: ...
- Secondary bucket: ...

## Likely root cause
- 1-3 sentence explanation grounded in the evidence

## Next read-only checks
1. ...
2. ...
3. ...

## Route-out
- stay in `log-analysis` | `debugging` | `monitoring-observability` | `pattern-detection` | `game-build-log-triage`

Step 7: Route out aggressively

Switch when the next job is no longer first-failure log triage:

  • Reproduction, hypotheses, code/config fixes → debugging
  • Dashboards, alerts, ingestion, telemetry coverage, retention → monitoring-observability
  • Repeated signature hunting across many windows or datasets → pattern-detection
  • Unity / Unreal build/editor/package logs → game-build-log-triage

If the excerpt is too short or starts mid-cascade:

  1. mark confidence low
  2. ask for the earliest error cluster or 20-80 lines around the first blocker
  3. ask for one anchor only if needed: time window, request ID, job/build, pod/container, or browser route
  4. do not pretend certainty from a truncated excerpt

Examples

Example 1: Container dependency failure

Prompt:

kubectl logs shows Error: connect ECONNREFUSED redis:6379 and then dozens of job retry failed lines.

Good response shape:

  • choose container-runtime
  • identify the Redis connection failure as the first actionable blocker
  • group later retry lines as fallout
  • route next to debugging or monitoring-observability only after the blocker is isolated

Example 2: Browser + API mismatch

Prompt:

Browser console says Failed to fetch, the network tab shows 401 on /api/session, and the server log says JWT audience invalid.

Good response shape:

  • choose browser-plus-api
  • identify backend auth validation as the actionable blocker
  • treat browser failure as a symptom, not the cause
  • route next to debugging once the config/code suspect is clear

Example 3: CI cascade

Prompt:

CI ends with test suite aborted, but earlier there is ModuleNotFoundError: No module named 'dotenv'.

Good response shape:

  • choose ci-cascade
  • isolate the earliest import failure
  • treat the abort/footer text as fallout
  • route next to debugging after the failing dependency path is known

Example 4: Automation/webhook JSON logs

Prompt:

These JSON webhook logs show repeated status=429 retries after one invalid API key response. What actually matters?

Good response shape:

  • choose structured-json or security-signal depending on whether auth abuse or one bad credential is the primary job
  • isolate the first credential/auth failure
  • summarize retry volume separately
  • route repeated pattern hunting to pattern-detection only if the user wants broader anomaly work

Best practices

  1. Choose the smallest packet that can answer the question.
  2. Lead with the earliest blocker, not the loudest line.
  3. Group repeated fallout into one signature or blast-radius summary.
  4. Correlate browser/network/app evidence instead of summarizing each source independently.
  5. Keep all suggested checks read-only inside this skill.
  6. Treat engine-specific logs as a hard specialist boundary.
  7. Route out as soon as the work becomes debugging, observability design, or anomaly hunting.

References

  • references/intake-packets-and-route-outs.md
  • references/triage-playbook.md
  • references/source-boundaries.md