Cloud Monitoring

This skill enables the agent to design and configure comprehensive monitoring and observability solutions for cloud infrastructure and applications. The agent understands the three pillars of observability — metrics, logs, and traces — and can define SLIs, SLOs, and SLAs and set up dashboards and alerting rules using tools such as Prometheus, Grafana, CloudWatch, Datadog, and OpenTelemetry. The agent also applies alerting best practices to minimize alert fatigue while ensuring critical issues are surfaced promptly.

Workflow

  1. Identify Monitoring Objectives: The agent works with the user to define what needs to be monitored and why. This includes identifying critical services, establishing Service Level Indicators (SLIs) such as request latency, error rate, and throughput, and setting Service Level Objectives (SLOs) that define acceptable performance thresholds. SLAs (Service Level Agreements) are documented as contractual commitments to customers. A worked error-budget calculation appears after this list.

  2. Select Monitoring Tools and Instrumentation: Based on the cloud provider and application architecture, the agent recommends an appropriate monitoring stack. This may include Prometheus for metrics collection, Grafana for visualization, Loki or CloudWatch Logs for log aggregation, and Jaeger or AWS X-Ray for distributed tracing. The agent configures OpenTelemetry SDKs in application code to emit standardized telemetry data (see the OpenTelemetry sketch after this list).

  3. Configure Metrics Collection and Dashboards: The agent deploys metric scrapers and exporters and defines custom metrics (see the custom-metrics sketch after this list). It builds dashboards that visualize the golden signals (latency, traffic, errors, saturation) and infrastructure metrics (CPU, memory, disk, network). Dashboards are organized by service tier so teams can quickly triage issues.

  4. Establish Alerting Rules: The agent configures alerts that trigger on meaningful conditions — such as error budget burn rate exceeding thresholds, sustained latency spikes, or pod restarts — rather than raw metric thresholds alone. Multi-window, multi-burn-rate alerting is used to balance detection speed with false-positive suppression (a burn-rate sketch follows the list). Alert routing is configured to send critical alerts to PagerDuty or Opsgenie and warnings to Slack.

  5. Set Up Log Aggregation and Trace Correlation: The agent configures centralized log collection with structured logging formats (JSON), log retention policies, and log-based alerts for error patterns. Distributed traces are correlated with logs and metrics using shared trace IDs so that a single alert can link directly to the relevant request trace and log entries (a trace-correlated logging example follows the list).

  6. Review and Iterate: The agent periodically audits alert noise levels, dashboard usage, and SLO compliance. Unused alerts are pruned, thresholds are adjusted based on observed baselines (a baseline-threshold sketch follows the list), and new services are onboarded into the monitoring stack as the system evolves.
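
To make step 1 concrete, here is a minimal sketch of how an SLO target translates into an error budget and how an SLI measures spend against it. The 99.9% target, 30-day window, and request counts are illustrative assumptions, not fixed by the skill.

```python
# Illustrative only: the 99.9% target, 30-day window, and request counts
# are assumptions, not fixed by the skill.
SLO_TARGET = 0.999        # 99.9% of requests must succeed
WINDOW_DAYS = 30

window_minutes = WINDOW_DAYS * 24 * 60
error_budget_minutes = window_minutes * (1 - SLO_TARGET)
print(f"Error budget: {error_budget_minutes:.1f} min of downtime per {WINDOW_DAYS} days")
# -> Error budget: 43.2 min of downtime per 30 days

def sli_availability(good: int, total: int) -> float:
    """SLI: fraction of requests served successfully."""
    return good / total if total else 1.0

sli = sli_availability(good=9_990_000, total=10_000_000)
budget_spent = (1 - sli) / (1 - SLO_TARGET)   # 1.0 means the budget is gone
print(f"SLI: {sli:.3%}, error budget consumed: {budget_spent:.0%}")
```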
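
For step 2, a minimal OpenTelemetry tracing setup in Python might look like the following. The service name `checkout-service` and the console exporter are assumptions chosen to keep the sketch self-contained; a production deployment would typically swap in an OTLP exporter pointed at a collector.

```python
# Minimal OpenTelemetry tracing setup using the Python SDK.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-service"})  # assumed name
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_request(order_id: str) -> None:
    # Each request becomes a span; attributes make traces queryable.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)
        ...  # business logic goes here

handle_request("ord-123")
```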
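
For step 3, custom application metrics can be exposed with the official Python client, `prometheus_client`. The metric names, labels, and port below are assumptions; a golden-signals dashboard would then query these series, e.g. an error-rate panel with PromQL along the lines of `sum(rate(app_requests_total{status=~"5.."}[5m])) / sum(rate(app_requests_total[5m]))`.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "app_requests_total", "Total requests handled", ["method", "status"]
)
LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

@LATENCY.time()                     # records wall-clock duration of each call
def handle_request() -> None:
    time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
    REQUESTS.labels(method="GET", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)         # Prometheus scrapes <host>:8000/metrics
    while True:
        handle_request()
```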
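
Step 4's multi-window, multi-burn-rate logic follows the pattern popularized by the Google SRE Workbook. The sketch below assumes a 99.9% availability SLO over 30 days; `fetch_error_rate` and its sample values are hypothetical stand-ins for a metrics-store query, since in practice these checks are usually expressed as Prometheus alert rules rather than application code.

```python
SLO_TARGET = 0.999
BUDGET = 1 - SLO_TARGET            # allowed error ratio: 0.1%

# Hypothetical stand-in for a metrics-store query; real values would come
# from a ratio of failed to total requests over each window.
SAMPLE_ERROR_RATES = {"5m": 0.020, "1h": 0.018, "6h": 0.004, "3d": 0.0008}

def fetch_error_rate(window: str) -> float:
    return SAMPLE_ERROR_RATES[window]

def burn_rate(window: str) -> float:
    # 1.0 == spending the budget at exactly the sustainable pace.
    return fetch_error_rate(window) / BUDGET

def should_page() -> bool:
    # Fast burn: 14.4x consumes ~2% of a 30-day budget in one hour.
    # Requiring both windows suppresses pages for spikes that already ended.
    return burn_rate("1h") > 14.4 and burn_rate("5m") > 14.4

def should_ticket() -> bool:
    # Slow burn: 1x sustained over 3 days, confirmed by a 6-hour window.
    return burn_rate("3d") > 1.0 and burn_rate("6h") > 1.0

print(f"page: {should_page()}, ticket: {should_ticket()}")
```

Pairing a long window with a short confirmation window is what keeps pages fast during an active burn while staying quiet once the spike has subsided.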
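
For step 5, the sketch below emits structured JSON logs that carry the active OpenTelemetry trace and span IDs, so an alert can pivot from a log line to the matching distributed trace. The field names are illustrative; only the standard library and the `opentelemetry-api` package are used.

```python
import json
import logging

from opentelemetry import trace

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        ctx = trace.get_current_span().get_span_context()
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # 32-hex trace ID (all zeros when no span is active):
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("checkout").info("payment authorized")
```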
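
For step 6, alert thresholds can be recomputed from observed baselines rather than guessed constants. The helper name `baseline_threshold`, the percentile, the headroom multiplier, and the sample data below are all assumptions for illustration; real samples would come from the metrics store.

```python
import statistics

def baseline_threshold(samples: list[float],
                       quantile: float = 0.99,
                       headroom: float = 1.5) -> float:
    """Suggested threshold = observed p99 latency plus 50% headroom."""
    ranks = statistics.quantiles(samples, n=100)
    p = ranks[int(quantile * 100) - 1]   # ranks[98] is the 99th percentile
    return p * headroom

latencies_ms = [12, 15, 11, 90, 14, 13, 200, 16, 12, 18] * 20  # placeholder data
print(f"suggested threshold: {baseline_threshold(latencies_ms):.0f} ms")
```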
