Platform Engineering
Consolidates Observability and Performance Engineering to build high-performance, resilient, and observable systems.
When to Use This Skill
- Observability: Setting up Prometheus/Grafana, OpenTelemetry, logging pipelines.
- Performance: Debugging latency, optimizing DB queries, caching strategies.
- Reliability: Defining SLIs/SLOs, Load Testing, Chaos Engineering.
- Infrastructure: Incident response and post-mortems.
Core Disciplines
1. Observability (The "Three Pillars")
Logs (Events)
- Purpose: Debugging specific events. "What happened?"
- Best Practice: Structured JSON logging. Include only fields that carry application context; leave out noise.
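As a rough illustration of structured JSON logging, the sketch below uses only Python's standard `logging` module. The `JsonFormatter` class, the `checkout` logger name, and the `context` key passed through `extra` are hypothetical names chosen for this example, not part of any particular logging library.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} carry the application context.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example output in one place
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("payment accepted", extra={"context": {"order_id": "A-1001", "amount_cents": 2599}})
print(buffer.getvalue().strip())
```

Each event becomes a single machine-parseable line, so a log pipeline can filter on fields such as `order_id` without regex parsing.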
Metrics (Aggregates)
- Purpose: Trending and Alerting. "Is it healthy?"
- The RED Method:
  - Rate: Request throughput (req/sec).
  - Errors: Error throughput (errors/sec).
  - Duration: Latency (p50, p99).
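The RED numbers can be derived from raw request records. The sketch below is a minimal, assumption-laden version: the `Request` record, the nearest-rank percentile method, and the rule that status codes of 500 and above count as errors are all choices made for this example. A real setup would normally let a metrics backend such as Prometheus compute these.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(sorted_values, pct):
    """Nearest-rank percentile on an already sorted list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx means error
    return {
        "rate_per_sec": len(requests) / window_seconds,
        "errors_per_sec": errors / window_seconds,
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }


# Hypothetical sample: 100 requests in a 10 s window, durations 1..100 ms, 5 failures.
sample = [
    Request(duration_ms=float(i), status=500 if i % 20 == 0 else 200)
    for i in range(1, 101)
]
summary = red_summary(sample, window_seconds=10)
print(summary)
```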
Traces (Context)
- Purpose: Distributed transactions. "Where is the latency?"
- Standard: OpenTelemetry (OTel). Ensure context propagation (traceparent header) across all microservices.
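To make context propagation concrete, the sketch below shows what travels on the wire, assuming the version `00` layout of the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`). In practice the OpenTelemetry SDK's propagators do this for you; `parse_traceparent` and `outgoing_headers` are hypothetical helper names used only for illustration.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, flags), or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def outgoing_headers(incoming_headers):
    """Build headers for a downstream call that stays in the caller's trace."""
    parsed = parse_traceparent(incoming_headers.get("traceparent", ""))
    if parsed is None:
        trace_id, flags = secrets.token_hex(16), "01"  # no valid context: start a new trace
    else:
        trace_id, _, flags = parsed
    span_id = secrets.token_hex(8)  # this service's span becomes the new parent
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outgoing_headers(incoming))
```

The trace ID stays constant across every hop while the parent ID changes per service. That is what lets a tracing backend stitch spans from different microservices into one transaction.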
2. Performance Engineering
Optimization Workflow
- Baseline: Measure current state (RPS, Latency).
- Profile: Flame graphs for CPU/RAM. Traces for I/O.
- Optimize:
  - Frontend: Core Web Vitals (LCP, CLS, INP).
  - Backend: Fix N+1 queries, add DB indexes.
  - Caching: Memoization (L1) -> Redis (L2) -> CDN (L3).
- Verify: Load test the fix and compare against the baseline.
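The first two caching tiers from the workflow above can be sketched as a read-through lookup. This is an illustration under stated assumptions: a plain dict stands in for Redis, `TieredCache` and `load_from_origin` are hypothetical names, and there is no TTL or invalidation, which any production cache would need. The CDN tier sits in front of the service and is not modeled here.

```python
class TieredCache:
    """Read-through lookup: in-process dict (L1), shared store (L2), then the origin."""

    def __init__(self, shared_store, load_from_origin):
        self.local = {}                    # L1: per-process memoization
        self.shared = shared_store         # L2: stand-in for Redis (any dict-like object)
        self.load_from_origin = load_from_origin
        self.origin_calls = 0              # counts how often the slow path was taken

    def get(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.shared:
            value = self.shared[key]
        else:
            self.origin_calls += 1
            value = self.load_from_origin(key)
            self.shared[key] = value       # populate L2 for other processes
        self.local[key] = value            # populate L1 for this process
        return value


def slow_lookup(key):
    return key.upper()  # hypothetical stand-in for a database query


shared = {}
worker_a = TieredCache(shared, slow_lookup)
worker_b = TieredCache(shared, slow_lookup)

print(worker_a.get("user:1"))  # misses both tiers, hits the origin
print(worker_a.get("user:1"))  # served from worker_a's L1
print(worker_b.get("user:1"))  # served from the shared L2
print(worker_a.origin_calls, worker_b.origin_calls)
```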
Load Testing (Saturation)
- Smoke Test: Minimal load to verify logic.
- Stress Test: Find the breaking point (Saturation).
- Soak Test: Long duration to find memory leaks.
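The difference between these test types is mostly load level and duration. The sketch below is a toy driver that uses only the standard library. `run_stage`, `timed_call`, and `fake_service` are hypothetical names, and `fake_service` just sleeps in place of a real request. A dedicated load-testing tool would be the usual choice for real systems.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(target):
    """Call target once and return (succeeded, seconds_taken)."""
    start = time.perf_counter()
    try:
        target()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def run_stage(target, total_requests, concurrency):
    """Fire total_requests calls at target using `concurrency` worker threads."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(target), range(total_requests)))
    elapsed = time.perf_counter() - started
    latencies = sorted(duration for _, duration in results)
    return {
        "concurrency": concurrency,
        "requests": len(results),
        "failures": sum(1 for ok, _ in results if not ok),
        "throughput_rps": len(results) / elapsed,
        "max_latency_s": latencies[-1],
    }


def fake_service():
    time.sleep(0.001)  # hypothetical stand-in for a real HTTP call


# Smoke test: minimal load, just checks that the call path works.
smoke = run_stage(fake_service, total_requests=5, concurrency=1)

# Stress test: raise concurrency step by step and watch where throughput stops growing.
stress = [run_stage(fake_service, total_requests=50, concurrency=c) for c in (1, 5, 10)]

for stage in [smoke, *stress]:
    print(stage)
```

A soak test would reuse `run_stage` at a moderate, fixed concurrency for hours while memory usage is tracked externally.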
3. Reliability (SRE)
SLI / SLO / SLA
- SLI: The metric (Indicator). "P95 Latency".
- SLO: The goal (Objective). "99.9% of requests < 200ms".
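- SLA: The contract (Agreement). The consequences owed to customers if the service level is missed.
- Error Budget: The allowed failure rate (1 - SLO), e.g. 0.1% of requests for a 99.9% SLO. Spend it on riskier, faster releases; slow down once it is used up.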
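The error budget arithmetic is simple enough to work through in a few lines. The figures below (one million requests, 250 failures, a 30-day window) are made-up inputs for illustration, and the function names are hypothetical.

```python
def error_budget(slo, total_requests, failed_requests):
    """Return allowed failures, failures left, and the fraction of budget consumed."""
    allowed = (1 - slo) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "budget_consumed": failed_requests / allowed,
    }


def allowed_downtime_minutes(slo, window_days):
    """Translate an availability SLO into minutes of downtime per window."""
    return (1 - slo) * window_days * 24 * 60


report = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=250)
print({name: round(value, 3) for name, value in report.items()})
print(round(allowed_downtime_minutes(0.999, window_days=30), 1), "minutes per 30 days")
```

With a 99.9% SLO, one million requests allow about 1,000 failures, so 250 failures consume roughly a quarter of the budget. The same SLO expressed as availability allows about 43.2 minutes of downtime in 30 days.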
Repository
mileycy516-stack/skills