Platform Engineering

Consolidates observability, performance engineering, and reliability (SRE) practices for building high-performance, resilient, and observable systems.

When to Use This Skill

  • Observability: Setting up Prometheus/Grafana, OpenTelemetry, logging pipelines.
  • Performance: Debugging latency, optimizing DB queries, caching strategies.
  • Reliability: Defining SLIs/SLOs, Load Testing, Chaos Engineering.
  • Operations: Running incident response and writing post-mortems.

Core Disciplines

1. Observability (The "Three Pillars")

Logs (Events)

  • Purpose: Debugging specific events. "What happened?"
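  • Best Practice: Emit structured JSON logs, one object per event. Include only the application context a reader needs to debug the event (for example a request or order identifier) and drop everything else as noise.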
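
A minimal sketch of structured JSON logging using only the Python standard library. The names `JsonFormatter`, `build_logger`, and the `context` key passed through `extra` are illustrative assumptions, not part of any particular logging library.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Application context arrives via `extra={"context": {...}}`
        # (a convention assumed for this sketch).
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` at INFO and above."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    buf = io.StringIO()
    log = build_logger("checkout", buf)
    log.info("payment authorized", extra={"context": {"order_id": "o-123"}})
    print(buf.getvalue(), end="")
```

Because every record is a flat JSON object, a log pipeline can filter on fields such as `order_id` without parsing free text.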

Metrics (Aggregates)

  • Purpose: Trending and Alerting. "Is it healthy?"
  • The RED Method:
    1. Rate: Request throughput (req/sec).
    2. Errors: Error throughput (errors/sec).
    3. Duration: Latency (p50, p99).
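
As a rough illustration, the three RED signals can be derived from a window of request records. The `Request` type, the rule that any status of 500 or above counts as an error, and the nearest-rank percentile are assumptions made for this sketch. In production these values normally come from a metrics system rather than from raw records.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile of an empty window is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration for one observation window."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "errors_per_sec": errors / window_seconds,
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }
```

Reporting p50 next to p99 shows both the typical request and the slow tail, which an average would hide.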

Traces (Context)

  • Purpose: Distributed transactions. "Where is the latency?"
  • Standard: OpenTelemetry (OTel). Ensure context propagation (traceparent header) across all microservices.
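
To make context propagation concrete, here is a hand-rolled sketch of the W3C `traceparent` header format (`version-traceid-parentid-flags`). The function names are hypothetical. In a real service the OpenTelemetry SDK's propagators would normally do this, so treat it as an illustration of the mechanism only.

```python
import re
import secrets

# version "00", 32 hex chars of trace-id, 16 hex chars of parent-id, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with random trace and span ids."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming):
    """Keep the caller's trace-id and flags, but mint a new span id.

    Falls back to a fresh trace when the header is missing or malformed,
    or when an id is all zeros (which the spec treats as invalid).
    """
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        return new_traceparent()
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return new_traceparent()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers):
    """Headers to attach to a downstream call made while handling a request."""
    return {"traceparent": child_traceparent(incoming_headers.get("traceparent"))}
```

Every hop reuses the same trace-id, which is what lets a tracing backend stitch spans from different services into one trace.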

2. Performance Engineering

Optimization Workflow

  1. Baseline: Measure current state (RPS, Latency).
  2. Profile: Use flame graphs for CPU and memory hot spots, and traces for I/O waits.
  3. Optimize:
    • Frontend: Core Web Vitals (LCP, CLS, INP).
    • Backend: Fix N+1 queries, add DB indexes.
    • Caching: In-process memoization (L1) -> Redis (L2) -> CDN (L3).
  4. Verify: Load test the fix.
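
The backend step can be illustrated with a small N+1 example on an in-memory SQLite database. The schema, the fixture rows, and the helper names are invented for this sketch. `count_queries` uses `sqlite3`'s trace callback to play the part of the baseline and verify steps by counting the statements each version issues.

```python
import sqlite3


def make_db():
    """Build a tiny fixture: three authors, three posts, one index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        CREATE INDEX idx_posts_author ON posts(author_id);
        INSERT INTO authors VALUES (1, 'ada'), (2, 'grace'), (3, 'linus');
        INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'g1');
    """)
    return conn


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author."""
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        result[name] = [title for (title,) in rows]
    return result


def titles_single_query(conn):
    """Same result from a single LEFT JOIN."""
    result = {}
    rows = conn.execute("""
        SELECT a.name, p.title
        FROM authors a LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """).fetchall()
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts yield a NULL title
            result[name].append(title)
    return result


def count_queries(conn, fn):
    """Run fn(conn) and return (result, number of SQL statements executed)."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        value = fn(conn)
    finally:
        conn.set_trace_callback(None)
    return value, len(statements)
```

Both functions return the same mapping, but the per-author version issues more statements as the author table grows, while the join stays at one.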

Load Testing (Saturation)

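  • Smoke Test: Minimal load to verify that the system and the test script work at all.
  • Stress Test: Increase load until throughput stops scaling or errors appear, to find the breaking point (saturation).
  • Soak Test: Sustained load over a long duration to surface memory leaks and other slow resource exhaustion.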
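
A toy stress-test harness, assuming the system under test can be called as a plain Python function named `target` (hypothetical). Real load tests are usually run with a dedicated tool against a deployed endpoint. The sketch only shows the idea of stepping up concurrency and looking for the point where throughput stops growing.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_step(target, concurrency, requests_per_worker):
    """Closed-loop load: each worker calls target() back to back."""

    def worker():
        latencies = []
        for _ in range(requests_per_worker):
            start = time.perf_counter()
            target()
            latencies.append(time.perf_counter() - start)
        return latencies

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        latencies = [lat for fut in futures for lat in fut.result()]
    elapsed = time.perf_counter() - started
    return {
        "concurrency": concurrency,
        "requests": len(latencies),
        "rps": len(latencies) / elapsed,
        "max_latency_s": max(latencies),
    }


def find_saturation(steps, min_gain=0.10):
    """Return the last concurrency level that still scaled.

    `steps` is a list of run_step results ordered by rising concurrency.
    A step "scales" if its throughput beats the previous step by at least
    min_gain (10% by default, an arbitrary threshold for this sketch).
    Returns None if throughput was still growing at the final step.
    """
    for prev, cur in zip(steps, steps[1:]):
        if cur["rps"] < prev["rps"] * (1 + min_gain):
            return prev["concurrency"]
    return None
```

A smoke test corresponds to a single `run_step` call at concurrency 1 with a handful of requests. A soak test keeps one moderate step running for hours while memory usage is watched.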

3. Reliability (SRE)

SLI / SLO / SLA

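  • SLI: The metric (Indicator), e.g. "P95 latency" or "share of requests served in under 200ms".
  • SLO: The goal (Objective) for an SLI over a time window, e.g. "99.9% of requests < 200ms".
  • SLA: The contractual commitment (Agreement) made to customers, with consequences if it is missed. It is usually set looser than the internal SLO.
  • Error Budget: The allowed failure rate (1 - SLO), e.g. 0.1% for a 99.9% SLO. Spend it on releases and experiments to ship faster; when it is exhausted, shift effort to reliability work.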
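
The error-budget arithmetic is simple enough to sketch directly. The function names are illustrative, and the 30-day window is an assumed default rather than something the SLO definition requires.

```python
def error_budget(slo, total_requests):
    """Number of requests allowed to fail in the window: (1 - SLO) * total."""
    return round((1 - slo) * total_requests)


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent).

    Not defined for slo == 1.0, because a 100% objective leaves no budget.
    """
    budget = (1 - slo) * total_requests
    return 1 - failed_requests / budget


def allowed_downtime_minutes(slo, window_days=30):
    """Time-based view of the same budget, for an availability SLO."""
    return (1 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    print(error_budget(0.999, 1_000_000))             # failures allowed per million requests
    print(round(allowed_downtime_minutes(0.999), 1))  # minutes of downtime per 30 days
```

With a 99.9% SLO, one million requests leave room for 1,000 failures, and a 30-day window leaves about 43.2 minutes of full downtime.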
