# LLMOps Platform Engineering
Design and operate an internal LLM platform that supports rapid experimentation without compromising reliability, cost, or compliance.
## Outcomes
- Standardized path from experiment to production
- Safe model rollout with quality and safety gates
- Repeatable infra modules for inference, vector DB, and observability
- Clear ownership model across platform, app, and security teams
## Reference Architecture
- Control Plane: model registry, prompt/version catalog, policy checks, eval pipeline.
- Data Plane: inference gateway, vector database, cache, feature store.
- Ops Plane: telemetry, alerting, SLO dashboards, cost analytics.
- Security Plane: IAM boundaries, secret rotation, content filters, audit logs.
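The data-plane path above (gateway in front of cache, then model) can be sketched as a cache-aside lookup. All names here are illustrative, not a real gateway API; the `model_fn` stands in for a provider or self-hosted model call.

```python
import hashlib

class InferenceGateway:
    """Toy data-plane gateway: cache-aside lookup in front of a model call."""

    def __init__(self, model_fn):
        self._model_fn = model_fn  # downstream model call (provider or self-hosted)
        self._cache = {}           # stand-in for a shared cache such as Redis
        self.cache_hits = 0

    def _key(self, model_id: str, prompt: str) -> str:
        # Key on model + prompt so a model swap never serves stale completions.
        return hashlib.sha256(f"{model_id}:{prompt}".encode()).hexdigest()

    def complete(self, model_id: str, prompt: str) -> str:
        key = self._key(model_id, prompt)
        if key in self._cache:
            self.cache_hits += 1
            return self._cache[key]
        result = self._model_fn(prompt)
        self._cache[key] = result
        return result

# Usage with a fake model: the second identical request is served from cache.
gw = InferenceGateway(model_fn=str.upper)
first = gw.complete("m1", "hello")
second = gw.complete("m1", "hello")
```

Keying the cache on model ID as well as prompt means a canary or rollback never serves completions produced by a different model version.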
## Golden Delivery Workflow
- Train/fine-tune or onboard provider model.
- Register artifact and metadata (license, intended use, constraints).
- Run automated eval suite (quality + safety + latency + cost).
- Deploy canary behind gateway with strict traffic policy.
- Promote after SLO and business KPI thresholds pass.
- Keep rollback target hot for fast reversion.
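The steps above can be sketched as a gated stage progression. Stage names mirror the bullets; the transition rule (any failed gate falls back to the hot rollback target) is an assumption for illustration.

```python
# Ordered stages of the golden delivery workflow (illustrative names).
STAGES = ["onboarded", "registered", "evaluated", "canary", "promoted"]

def advance(stage: str, gate_passed: bool) -> str:
    """Advance one stage if its gate passed; otherwise fall back to rollback."""
    if not gate_passed:
        return "rollback"  # revert to the hot rollback target
    i = STAGES.index(stage)
    return STAGES[min(i + 1, len(STAGES) - 1)]
```

Modeling promotion as a one-way ladder with a single failure exit keeps every artifact's state auditable: it is always in exactly one stage, and the only way forward is through a gate.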
## CI/CD Design for AI Services
- Build immutable containers with pinned dependencies and model hashes.
- Use environment promotion: dev -> stage -> prod.
- Fail deployment if:
  - regression evals drop below baseline,
  - safety tests exceed the risk threshold,
  - p95 latency exceeds the SLO budget.
- Store deployment evidence for audits (commit SHA, eval report, approver).
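The fail conditions and audit evidence above can be combined into a single gate function; the thresholds and field names here are placeholders, not a prescribed schema.

```python
def deployment_gate(eval_score, baseline, safety_risk, risk_threshold,
                    p95_ms, slo_budget_ms, commit_sha, approver):
    """Return an audit record; 'allowed' is True only if every check passes."""
    failures = []
    if eval_score < baseline:
        failures.append("eval_regression")
    if safety_risk > risk_threshold:
        failures.append("safety_risk")
    if p95_ms > slo_budget_ms:
        failures.append("latency_slo")
    return {
        "allowed": not failures,
        "failures": failures,
        "commit_sha": commit_sha,  # retained as deployment evidence for audits
        "approver": approver,
    }

# Usage: one passing and one failing candidate.
ok = deployment_gate(0.91, 0.90, 0.01, 0.05, 800, 1200, "abc123", "alice")
bad = deployment_gate(0.85, 0.90, 0.01, 0.05, 1500, 1200, "def456", "alice")
```

Returning the full record, rather than a bare boolean, means the same object that blocks or allows the deployment is the evidence stored for auditors.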
## Operational SLOs
- Availability: 99.9% for synchronous inference endpoints.
- Latency: p95 under a product-specific target (for example, <1200 ms).
- Cost: per-request and per-tenant budget ceilings.
- Quality: task success rate and groundedness thresholds.
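A sketch of how the latency and cost SLOs might be checked: p95 is computed as a nearest-rank percentile, and the default budgets are the example values above, not fixed platform numbers.

```python
import math

def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latency samples."""
    s = sorted(samples_ms)
    return s[math.ceil(0.95 * len(s)) - 1]

def within_slo(samples_ms, cost_per_request, *,
               latency_budget_ms=1200, cost_ceiling=0.01):
    """True only if both the latency budget and the cost ceiling hold."""
    return (p95(samples_ms) <= latency_budget_ms
            and cost_per_request <= cost_ceiling)
```

Nearest-rank p95 is deliberately simple; a production dashboard would typically read the same figure from histogram metrics rather than raw samples.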
## Platform Guardrails
- Enforce tenant quotas and model allow-lists.
- Require structured output contracts for automation paths.
- Default to low-risk model settings for critical workflows.
- Disable unconstrained tool execution in production.
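The first two guardrails (tenant quotas and model allow-lists) can be sketched as a per-request check at the gateway; the dictionary shapes here are assumptions about how the platform might store this state.

```python
def authorize(tenant, model_id, allow_lists, usage, quotas):
    """Return (allowed, reason) for a tenant's request to a given model."""
    if model_id not in allow_lists.get(tenant, set()):
        return False, "model_not_allowed"
    if usage.get(tenant, 0) >= quotas.get(tenant, float("inf")):
        return False, "quota_exceeded"
    return True, "ok"

# Usage: tenant "acme" may call gpt-small, has used 99 of 100 requests.
allow_lists = {"acme": {"gpt-small"}}
usage, quotas = {"acme": 99}, {"acme": 100}
```

Checking the allow-list before the quota means a disallowed model is reported as such even when the tenant is also over budget.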
## Tooling Stack (Example)
- Orchestration: Argo Workflows / GitHub Actions / Airflow.
- Model Registry: MLflow / custom metadata DB.
- Gateway: LiteLLM / Envoy-based API gateway.
- Observability: OpenTelemetry + Prometheus + Grafana + Langfuse.
- Policy: OPA/Rego for deployment and runtime checks.
## Incident Readiness
- Runbooks for model outage, provider timeout spikes, and cost surges.
- Chaos drills for provider failover and vector DB degradation.
- Pre-approved rollback path with one-command execution.
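The one-command rollback can be sketched as a gateway route flip to a pre-provisioned (hot) target; the route table and service names are illustrative.

```python
def rollback(routes: dict, service: str, hot_target: str) -> str:
    """Point `service` at its hot rollback target; return the previous target."""
    previous = routes.get(service, "")
    routes[service] = hot_target
    return previous

# Usage: "chat" currently serves model-v2; one call reverts it to model-v1.
routes = {"chat": "model-v2"}
previous = rollback(routes, "chat", "model-v1")
```

Because the hot target is already deployed and warm, the rollback is a route change rather than a redeploy, which is what makes single-command reversion realistic.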
## Related Skills
- ai-pipeline-orchestration - Orchestrate ingestion and inference workflows
- agent-evals - Build evaluation gates for releases
- llm-gateway - Route and control LLM traffic
Repository: bagelhole/devop…t-skills