monitoring-operations

OCI Monitoring and Observability - Expert Knowledge

NEVER Do This

NEVER debug "missing metrics" within the first 15 minutes

  • Metrics are published every 1–5 minutes
  • Processing delay adds another 5–10 minutes
  • Total lag from event to visible metric: roughly 6–15 minutes (the sum of the two ranges above), so wait the full 15
  • Premature debugging creates false investigations
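One way to enforce the waiting period is a small guard in whatever runbook tooling opens the investigation. The sketch below is illustrative only: the function names and the 15-minute constant are assumptions taken from the figures above, not part of any OCI SDK.

```python
from datetime import datetime, timedelta, timezone

# Worst-case publish + processing lag, taken from the figures above (assumption).
MAX_METRIC_LAG = timedelta(minutes=15)


def ready_to_investigate(event_time, now=None, max_lag=MAX_METRIC_LAG):
    """Return True only once the worst-case metric lag has elapsed."""
    now = now or datetime.now(timezone.utc)
    return now - event_time >= max_lag


def minutes_to_wait(event_time, now=None, max_lag=MAX_METRIC_LAG):
    """Return how many minutes remain before a missing metric is suspicious."""
    now = now or datetime.now(timezone.utc)
    remaining = max_lag - (now - event_time)
    return max(0.0, remaining.total_seconds() / 60)
```

Call `ready_to_investigate(deploy_time)` before opening a "missing metrics" ticket; if it returns False, `minutes_to_wait` says how long to hold off.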

NEVER use = for alarm thresholds with sparse metrics

# WRONG - alarm never fires when metric has data gaps
MetricName[1m].mean() = 0

# RIGHT - handle missing data explicitly
MetricName[1m]{dataMissing=zero}.mean() > 0

NEVER omit the resourceId dimension in metric queries

# WRONG - no dimension filter: returns a stream for every resource in the compartment
CpuUtilization[1m].mean()

# RIGHT - filter by instance OCID
CpuUtilization[1m]{resourceId="<instance-ocid>"}.mean()

Querying without dimensions returns data for ALL resources — usually not what's intended, and rate-limited at 1000 req/min.
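If queries are generated by scripts, the filter can be made mandatory at build time. The helper below is a hypothetical sketch (`build_mql` is not an OCI function); it only assembles the `metric[interval]{dimension="value"}.statistic()` shape used in the examples above and refuses to emit an unfiltered query.

```python
def build_mql(metric, interval, statistic, dimensions=None):
    """Assemble an MQL string; raise instead of producing an unfiltered query."""
    if not dimensions:
        raise ValueError("refusing to build an unfiltered query: pass a dimension")
    for value in dimensions.values():
        if '"' in value:
            raise ValueError("dimension values must not contain double quotes")
    # Render each dimension as name="value", comma-separated.
    dims = ", ".join(f'{name}="{value}"' for name, value in dimensions.items())
    return f"{metric}[{interval}]{{{dims}}}.{statistic}()"


# Metric names are case-sensitive; the OCID here is a placeholder.
query = build_mql(
    "CpuUtilization", "1m", "mean",
    {"resourceId": "ocid1.instance.oc1..example"},
)
```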

NEVER set alarm thresholds without a trigger delay

# BAD - fires on every transient CPU spike (alert fatigue)
CpuUtilization[1m].mean() > 80

# BETTER - fires only on sustained breach
CpuUtilization[5m].mean() > 80
# + set trigger delay: 5 minutes (5 consecutive breaches)
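The effect of a trigger delay can be seen with a simplified model. This is a sketch that assumes one evaluation per data point and a plain "N consecutive breaches" rule; it is not OCI's actual evaluation engine, and `first_firing_index` is a hypothetical name.

```python
def first_firing_index(values, threshold, trigger_delay):
    """Return the index at which the alarm would fire, or None if it never does.

    The alarm fires once `trigger_delay` consecutive values exceed `threshold`.
    """
    streak = 0
    for index, value in enumerate(values):
        # A non-breaching value resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= trigger_delay:
            return index
    return None


spike = [50, 95, 60, 55, 50]          # one transient spike
sustained = [85, 90, 88, 92, 95, 91]  # sustained breach
```

With no delay (`trigger_delay=1`) the spike fires immediately; with a delay of 5 only the sustained series fires.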

NEVER create alarms without notification destinations

# WRONG - alarm fires but nobody is notified
oci monitoring alarm create ... --destinations '[]'

# RIGHT - always link to a notification topic
oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'

Cost impact: an undetected production outage can cost $5,000–50,000+ per hour.
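When alarm creation is scripted, the empty-destination case can be rejected before the CLI is ever called. The helper below is a hypothetical pre-flight check: it only builds the JSON string passed to `--destinations` and assumes that valid topic identifiers start with `ocid1.`.

```python
import json


def destinations_arg(topic_ocids):
    """Return the JSON string for --destinations, refusing an empty list."""
    topics = [t.strip() for t in topic_ocids if t and t.strip()]
    if not topics:
        raise ValueError("alarm needs at least one notification topic OCID")
    # Assumption: every OCID begins with the literal prefix "ocid1."
    bad = [t for t in topics if not t.startswith("ocid1.")]
    if bad:
        raise ValueError(f"not an OCID: {bad[0]}")
    return json.dumps(topics)
```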

NEVER ignore Cloud Guard findings

  • Cloud Guard detects misconfigurations before they become incidents
  • Wire it: Cloud Guard → Notifications → email/Slack/PagerDuty
  • Unresolved findings fail CIS/SOC2/HIPAA audits

Metric Namespace Reference

OCI uses service-specific namespaces — using the wrong namespace returns no data with no error.

| Service | Namespace | Key Metrics |
|---|---|---|
| Compute | oci_computeagent | CpuUtilization, MemoryUtilization |
| Autonomous DB | oci_autonomous_database | CpuUtilization, StorageUtilization |
| Load Balancer | oci_lbaas | HttpRequests, UnHealthyBackendServers |
| Object Storage | oci_objectstorage | ObjectCount, BytesUploaded |

Common mistake: using oci_compute instead of oci_computeagent — the agent namespace requires the OCI Compute Agent to be running on the instance.
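Because a wrong namespace fails silently, a client-side check that fails loudly can save a debugging session. The sketch below knows only the four namespaces listed above, so it would reject other valid OCI namespaces; extend the mapping before relying on it. `check_namespace` is a hypothetical helper.

```python
import difflib

# Only the namespaces from the table above; not a complete list.
KNOWN_NAMESPACES = {
    "Compute": "oci_computeagent",
    "Autonomous DB": "oci_autonomous_database",
    "Load Balancer": "oci_lbaas",
    "Object Storage": "oci_objectstorage",
}


def check_namespace(namespace):
    """Return the namespace if known; otherwise raise with a suggestion."""
    known = list(KNOWN_NAMESPACES.values())
    if namespace in known:
        return namespace
    hint = difflib.get_close_matches(namespace, known, n=1)
    suffix = f"; did you mean {hint[0]!r}?" if hint else ""
    raise ValueError(f"unknown namespace {namespace!r}{suffix}")
```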

Alarm Missing Data Handling

| Setting | Behavior | Use When |
|---|---|---|
| treatMissingDataAsBreaching | Alarm fires if no data arrives | Critical services (silence = outage) |
| treatMissingDataAsNotBreaching | Alarm silent if no data | Optional or intermittent monitoring |
| {dataMissing=zero} in MQL | Treats gaps as 0 value | Request counters, throughput metrics |
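The three behaviors differ only in what a gap is turned into before the threshold is applied. The model below is a simplified sketch, with `None` standing in for a missing data point and hypothetical mode names; it does not call or reproduce the Monitoring service.

```python
def breach_states(datapoints, is_breach, missing):
    """Map data points (None = gap) to breach flags under a missing-data mode."""
    states = []
    for value in datapoints:
        if value is not None:
            states.append(is_breach(value))
        elif missing == "breaching":
            states.append(True)            # silence counts as an outage
        elif missing == "not_breaching":
            states.append(False)           # silence is ignored
        elif missing == "zero":
            states.append(is_breach(0.0))  # gap is evaluated as the value 0
        else:
            raise ValueError(f"unknown missing-data mode: {missing}")
    return states


requests_per_minute = [120.0, None, 3.0]


def low_traffic(value):
    return value < 5
```

For a request counter with a "traffic too low" condition, both the breaching and zero modes flag the gap, while the not-breaching mode hides it.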

Log Collection Troubleshooting

Logs not appearing in Log Analytics?
├─ Is logging enabled on the resource?
│  └─ Compute: is Oracle Cloud Agent running? (systemctl status oracle-cloud-agent)
│  └─ Functions: is logging enabled in function configuration?
├─ Is Service Connector configured and ACTIVE?
│  └─ Source: Log Group → Target: Log Analytics
│  └─ Check status: oci sch service-connector get --id <ocid>
├─ IAM policy for Service Connector?
│  └─ "Allow any-user to use log-content in tenancy"
│  └─ "Allow service loganalytics to READ log-content in tenancy"
│  └─ Missing EITHER policy causes silent failure
└─ 10–15 minute ingestion lag?
   └─ Wait before concluding logs are missing
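The tree above is an ordered checklist: stop at the first check that fails. A minimal sketch of that ordering follows; the check keys and advice strings are hypothetical labels for the steps above, and the results are assumed to be gathered manually or by separate tooling.

```python
# Ordered as in the tree above; keys are hypothetical labels.
CHECKS = [
    ("logging_enabled", "Enable logging on the resource (agent or function config)"),
    ("connector_active", "Fix the Service Connector: Log Group -> Log Analytics, state ACTIVE"),
    ("iam_policies_present", "Add the missing IAM policy for the Service Connector"),
    ("lag_elapsed", "Wait out the 10-15 minute ingestion lag before concluding"),
]


def next_step(results):
    """Return advice for the first failing check, or None if all passed."""
    for key, advice in CHECKS:
        # A check that was never run is treated as failing.
        if not results.get(key, False):
            return advice
    return None
```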

Metric Query Performance

Unfiltered queries scan ALL resources in the compartment, which is slow and consumes rate-limit budget.

# Expensive: scans all instances
CpuUtilization[1m].mean()

# Optimized: filter to specific instance
CpuUtilization[1m]{resourceId="<instance-ocid>"}.mean()

Rate limit: 1000 metric queries/minute per tenancy. A dashboard with many unfiltered widgets can exhaust this.
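A back-of-the-envelope estimate shows how quickly dashboards approach the limit. The sketch assumes each widget issues one query per refresh and uses the 1000/minute figure stated above; the function names are illustrative.

```python
RATE_LIMIT_PER_MINUTE = 1000  # figure stated above


def queries_per_minute(dashboards, widgets_per_dashboard, refresh_seconds):
    """Estimate query volume, assuming one query per widget per refresh."""
    return dashboards * widgets_per_dashboard * 60 / refresh_seconds


def budget_used(dashboards, widgets_per_dashboard, refresh_seconds,
                limit=RATE_LIMIT_PER_MINUTE):
    """Return the fraction of the per-minute limit consumed."""
    return queries_per_minute(dashboards, widgets_per_dashboard, refresh_seconds) / limit
```

Five open dashboards with 20 widgets each, refreshing every 10 seconds, already use 60% of the budget under these assumptions; a longer refresh interval is the cheapest fix.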

Progressive Loading Reference

Load references/oci-monitoring-reference.md when:

  • Need the complete list of OCI service metric namespaces and metric names
  • Writing complex MQL expressions (composites, functions, grouping)
  • Implementing composite alarm conditions
  • Setting up Log Analytics workspace, APM, or Service Connector Hub in detail

Do NOT load for alarm threshold patterns, namespace gotchas, or log troubleshooting — this file covers those.

First Seen
Mar 20, 2026