api-observability-planner
API Observability Planner Protocol
This skill plans observability so that when an API degrades or goes down, the team can see why, ideally before users notice. It shifts telemetry from "just log the errors" to a structured observability pipeline.
Core assumption: If you can't measure it, you can't manage it. Blind APIs cause prolonged outages.
1. The Three Pillars Strategy (Static)
Define exactly what your framework will emit:
- Logs (Events): Structured JSON logging. Never use raw text strings (`"User 123 failed login"` vs `{"event": "login_failed", "user_id": 123, "reason": "bad_password"}`).
- Metrics (Aggregations): Implement the RED Method:
  - Rate: Requests per second.
  - Errors: Failed request rate (4xx vs 5xx).
  - Duration: Latency percentiles (p50, p90, p99).
- Traces (Workflows): Distributed tracing (W3C Trace Context). Ensure `trace_id` and `span_id` propagate across microservices and database calls.
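As a rough illustration of the RED method above, the following Python sketch computes rate, error ratios, and latency percentiles from an in-memory sample of requests. The `Request` record, the `red_summary` helper, and the nearest-rank percentile are assumptions made for this example; in a real service a metrics library would normally do this aggregation.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the caller
    duration_ms: float  # end-to-end latency in milliseconds


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct is in (0, 100]."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests, window_seconds):
    """Summarise Rate, Errors, and Duration for one non-empty window."""
    if not requests:
        raise ValueError("no requests in window")
    total = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    return {
        "rate_rps": total / window_seconds,
        # 4xx and 5xx are kept apart so client mistakes do not mask server faults.
        "error_rate_4xx": sum(400 <= r.status <= 499 for r in requests) / total,
        "error_rate_5xx": sum(500 <= r.status <= 599 for r in requests) / total,
        "p50_ms": percentile(durations, 50),
        "p90_ms": percentile(durations, 90),
        "p99_ms": percentile(durations, 99),
    }


# Ten hypothetical requests observed over a 5-second window.
sample = [Request(200, d) for d in (10, 20, 30, 40, 50, 60, 70, 80)]
sample += [Request(404, 90), Request(503, 100)]
summary = red_summary(sample, window_seconds=5)
print(summary)
```

With so few samples the p99 is simply the slowest request, which is why percentile alerts are usually evaluated over windows with enough traffic to be meaningful.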
2. Health & Alerting Design
Define what constitutes "Healthy" and when pagers should go off.
- Deep Health Checks: `/healthz` shouldn't just return `200 OK`. It should verify DB connection, Redis reachability, and critical downstreams.
- Alert Rules:
  - Warning: p99 latency > 500ms for 5 minutes.
  - Critical: 5xx error rate > 5% for 2 minutes.
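A deep health check of the kind described above might look like the sketch below. It is framework-agnostic: `run_health_checks`, `check_db`, and `check_redis` are hypothetical names, and the probes are stubs standing in for real connectivity tests. Returning 503 with a `degraded` body when any probe fails is one possible convention, not something the protocol mandates.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run every dependency probe and return (HTTP status, response body)."""
    results = {}
    for name, probe in checks.items():
        try:
            probe()  # a probe signals failure by raising an exception
            results[name] = {"status": "ok"}
        except Exception as exc:
            # Report only the exception type so internals are not leaked.
            results[name] = {"status": "fail", "error": type(exc).__name__}
    healthy = all(r["status"] == "ok" for r in results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


# Hypothetical probes; real ones would ping the database, Redis, and so on.
def check_db() -> None:
    pass  # pretend the database answered


def check_redis() -> None:
    raise ConnectionError("redis unreachable")


status, body = run_health_checks({"db": check_db, "redis": check_redis})
print(status, body)
```

A handler for `/healthz` would serialise `body` as JSON and use `status` as the response code, so load balancers and humans read the same result.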
3. Output Generation
Required Outputs (Must write BOTH to `docs/api-report/`):
- Human-Readable Markdown (`docs/api-report/api-observability-report.md`)
### API Observability Blueprint
**Instrumentation Strategy:** OpenTelemetry (OTel)
**Log Format:** Structured JSON
#### Core Metrics (RED Method)
1. **Rate:** Tracked via Prometheus `http_requests_total`.
2. **Errors:** Alert on HTTP 500-599. 4xx responses are client problems: track them, but don't wake up on-call.
3. **Duration:** Tracked via `http_request_duration_seconds` (Buckets: 50ms, 100ms, 500ms, 1s, 5s).
#### Alert Configuration (PagerDuty / Slack)
- **High Severity:** Order Creation 5xx Rate > 1% over 5m.
- **Low Severity:** Database Disk Space < 20%.
#### Tracing Propagation
Inject `traceparent` and `tracestate` headers into all outgoing HTTP/gRPC requests to downstream services.
- Machine-Readable JSON (`docs/api-report/api-observability-output.json`)
{
  "skill": "api-observability-planner",
  "framework": "OpenTelemetry",
  "metrics": {
    "latency_thresholds_ms": {"p95": 200, "p99": 500}
  },
  "alerts": [
    {"name": "High 5xx Rate", "condition": "error_rate > 1%", "duration": "5m", "severity": "High"}
  ]
}
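To show how a consumer might act on the machine-readable output, here is a minimal Python sketch that reads an alert entry and decides whether it should fire. The grammar for `condition` (`<metric> > <number>%`) and `duration` (`<integer>` followed by `s`, `m`, or `h`) is inferred from the single example above and is an assumption, as are the helper names and the sample format.

```python
import json
import re

REPORT = json.loads("""
{
  "skill": "api-observability-planner",
  "alerts": [
    {"name": "High 5xx Rate", "condition": "error_rate > 1%",
     "duration": "5m", "severity": "High"}
  ]
}
""")

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Turn '5m' into 300 seconds (assumed grammar: integer plus s/m/h)."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def parse_condition(text):
    """Turn 'error_rate > 1%' into ('error_rate', 0.01)."""
    match = re.fullmatch(r"(\w+)\s*>\s*(\d+(?:\.\d+)?)%", text)
    if match is None:
        raise ValueError(f"unsupported condition: {text!r}")
    return match.group(1), float(match.group(2)) / 100


def should_fire(alert, samples):
    """samples: (timestamp_seconds, metrics) pairs, oldest first.

    Fires only if history covers the whole window and every sample
    inside the window breaches the threshold.
    """
    metric, threshold = parse_condition(alert["condition"])
    window = parse_duration(alert["duration"])
    if not samples:
        return False
    end = samples[-1][0]
    if samples[0][0] > end - window:
        return False  # not enough history to call the breach sustained
    recent = [metrics for ts, metrics in samples if ts >= end - window]
    return all(metrics[metric] > threshold for metrics in recent)


alert = REPORT["alerts"][0]
sustained = [(t, {"error_rate": 0.02}) for t in range(0, 301, 60)]
print(should_fire(alert, sustained))
```

Requiring the breach to hold for the full window is what keeps a single bad scrape from paging anyone.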
Guardrails
- Log Forging / Injection: Sanitize user-controlled values before logging (for example, escape CR and LF characters) to prevent multiline log spoofing.
- PII in Logs: Explicitly call out that `passwords`, `tokens`, `credit_cards`, and `emails` must be masked or scrubbed before being written to `stdout` or log aggregators.
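Both guardrails can be sketched in a few lines of Python. The field names in `SENSITIVE_KEYS`, the `[REDACTED]` placeholder, and the helper names are assumptions for illustration. The sketch relies on `json.dumps` escaping newline characters inside string values, so each event stays on one physical line; a plain-text log sink would need explicit CR/LF escaping instead.

```python
import json

# Assumed field names; extend this set to match the real payloads.
SENSITIVE_KEYS = {"password", "token", "credit_card", "email"}


def scrub(event: dict) -> dict:
    """Return a copy of the event with sensitive fields masked, recursively."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = scrub(value)
        else:
            clean[key] = value
    return clean


def to_log_line(event: dict) -> str:
    """Serialise a scrubbed event as single-line JSON."""
    return json.dumps(scrub(event), sort_keys=True)


line = to_log_line({
    "event": "login_failed",
    "user_id": 123,
    "password": "hunter2",
    # Attacker-controlled text that tries to start a forged second log line.
    "reason": 'bad_password\n{"event": "login_ok", "user_id": 1}',
})
print(line)
```

Masking by key name only catches fields that are labelled as sensitive; secrets embedded in free-text values need pattern-based scrubbing on top of this.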