# Observe
Transient, session-scoped monitoring and ad-hoc querying. Unlike `manage-alerts` (which creates a durable saved object in Kibana), `observe` polls in-process and returns once a condition fires, once its window closes, or, in `now` / `table` mode, immediately.
## Modes

| Mode | When to pick it | Blocks? |
|---|---|---|
| `anomaly` (default) | "tell me when anything unusual fires", "watch for anomalies", open-ended monitoring | Until an anomaly fires or `max_wait` elapses |
| `metric` | user names a specific metric, either with a threshold ("wait until memory drops below 80MB") or without ("show me a live chart of X") | Polls for `max_wait` seconds (default 60s, interval 5s) |
| `now` | "what is X right now", "check X", "current value of Y": single-instance scalar read | Returns immediately |
| `table` | "list …", "which … are …", group-by / top-N queries, or any ES\|QL result with mixed-type columns | Returns immediately |
If the user wants durable alerting ("page me whenever..."), use `manage-alerts` instead.

## Prerequisites
| Mode | Requires |
|---|---|
| `anomaly` | Elastic ML anomaly detection jobs |
| `metric` | Any ES\|QL-queryable numeric field |
| `now` | Any ES\|QL-queryable numeric field |
| `table` | Any ES\|QL-queryable data |
## How to call observe

### Anomaly mode (default)

```json
{
  "mode": "anomaly",
  "min_score": 75,
  "max_wait": 600,
  "namespace": "otel-demo"
}
```

- `min_score`: 75 by default (major and above); 50 to include minor anomalies, 90 for critical-only.
- `max_wait`: be generous (600s default). The tool returns immediately on trigger, so long waits are free.
- `namespace`: set only if the user scopes to a K8s namespace.
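For instance, a critical-only watch following the `min_score` guidance above might look like this (a sketch; the values are illustrative):

```json
{
  "mode": "anomaly",
  "min_score": 90,
  "max_wait": 600
}
```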
### Metric mode: threshold condition

```json
{
  "mode": "metric",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE resource.attributes.k8s.pod.name == \"frontend-7d4b8f9c5-x2k9m\" | STATS v = AVG(metrics.k8s.pod.memory.working_set)",
  "condition": "< 80000000",
  "description": "frontend pod memory working set",
  "max_wait": 300
}
```

Condition format: `<comparator> <threshold>`. Valid comparators: `<`, `<=`, `>`, `>=`, `==`.
### Metric mode: live sample (no threshold)

Omit `condition` and the tool live-samples for the full `max_wait` window. Use this for "show me a live chart of X" prompts; the view renders an accumulating sparkline.

```json
{
  "mode": "metric",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE resource.attributes.k8s.namespace.name == \"oteldemo-esyox-default\" | STATS v = AVG(metrics.k8s.pod.memory.working_set)",
  "description": "oteldemo-esyox-default avg pod memory",
  "max_wait": 60
}
```
### Now mode: single read

```json
{
  "mode": "now",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE resource.attributes.k8s.namespace.name == \"oteldemo-esyox-default\" | STATS v = AVG(metrics.k8s.pod.memory.working_set)",
  "description": "current avg pod memory in oteldemo-esyox-default"
}
```
### Table mode: full ES|QL rows and columns

Use when the query groups, lists, or returns mixed-type rows (strings + numbers + dates). `now` mode discards everything except the first numeric cell; `table` mode preserves the whole result.

```json
{
  "mode": "table",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE metrics.k8s.pod.memory.working_set IS NOT NULL | STATS avg_mem = AVG(metrics.k8s.pod.memory.working_set) BY resource.attributes.k8s.pod.name, resource.attributes.k8s.namespace.name | SORT avg_mem DESC | LIMIT 10",
  "description": "top 10 pods by memory"
}
```
Rows are capped at 50 by default. Prefer tightening the ES|QL with `LIMIT` / `SORT` over raising `row_cap`; very wide tables clog the context window.
## Picking the right index pattern

Fields live where the data is emitted, and ES|QL rejects queries that reference a field the target index doesn't map (`verification_exception`). Before writing the query, match the user's question to the right layer:
| User asks about… | Index | Carries |
|---|---|---|
| Node / pod / namespace topology, resource usage | `metrics-kubeletstatsreceiver.otel*` or `metrics-*` | `k8s.node.name`, `k8s.pod.name`, `k8s.namespace.name`, `service.name` (via resource attrs), CPU/memory/fs gauges |
| Service behavior: latency, errors, throughput, spans | `traces-*.otel-*` or `traces-apm*` | `service.name`, `transaction.duration.us`, `event.outcome`, `span.*` |
| Log rate / log content | `logs-*` | `message`, `log.level`, `service.name` |
| ML anomalies | `.ml-anomalies-*` | `record_score`, `by_field_value`, `partition_field_value` |
Cross-layer questions ("which node runs the most services") need the index that carries both fields. That's almost always `metrics-*`: OTel resource attributes propagate through the Collector, so metrics docs carry both `k8s.node.name` and `service.name`. Trace indices (`traces-apm*`, `traces-*.otel-*`) don't carry infra attributes like `k8s.node.name`, so don't reach for them when the question is about nodes.
Example: "which node is running the most services"

```esql
FROM metrics-*
| WHERE @timestamp > NOW() - 5 minutes AND k8s.node.name IS NOT NULL AND service.name IS NOT NULL
| STATS service_count = COUNT_DISTINCT(service.name) BY k8s.node.name
| SORT service_count DESC
| LIMIT 20
```
## Common query patterns

These are the field paths this deployment's data actually uses; prefer them over guessing.
### OTel Kubernetes (kubeletstats receiver)

Index: `metrics-kubeletstatsreceiver.otel*`

Each kubeletstats scrape emits separate documents per metric: a CPU doc, a memory doc, a network doc, and so on. Always filter `WHERE <field> IS NOT NULL` for the field you're aggregating, otherwise most rows carry nulls for it.
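A minimal sketch of that filter in practice, aggregating one of the gauge fields listed below:

```esql
FROM metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 5 minutes
  AND k8s.pod.cpu.usage IS NOT NULL  // without this, memory/network docs contribute null rows
| STATS avg_cpu = AVG(k8s.pod.cpu.usage) BY k8s.pod.name
| SORT avg_cpu DESC
| LIMIT 10
```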
Gauge fields (use `AVG` / `MAX` / `MIN`, never `SUM`):

| Signal | Field | Type |
|---|---|---|
| Pod memory working set | `k8s.pod.memory.working_set` | long (bytes) |
| Pod memory RSS | `k8s.pod.memory.rss` | long (bytes) |
| Pod memory available | `k8s.pod.memory.available` | long (bytes) |
| Pod CPU usage | `k8s.pod.cpu.usage` | double (cores; 1.0 = one full core) |
| Pod filesystem usage | `k8s.pod.filesystem.usage` | long (bytes) |
| Node memory working set | `k8s.node.memory.working_set` | long (bytes) |
| Node memory available | `k8s.node.memory.available` | long (bytes) |
| Node CPU usage | `k8s.node.cpu.usage` | double (cores) |
| Node filesystem usage | `k8s.node.filesystem.usage` | long (bytes) |
For counter fields (network I/O, network errors, uptime), see the "Counter fields" section below; these require `TS` + `RATE()`.
Dimension fields (use for filtering and `BY` grouping):

| Dimension | Unprefixed | Prefixed (equivalent) |
|---|---|---|
| Pod name | `k8s.pod.name` | `resource.attributes.k8s.pod.name` |
| Namespace | `k8s.namespace.name` | `resource.attributes.k8s.namespace.name` |
| Node name | `k8s.node.name` | `resource.attributes.k8s.node.name` |
| Cluster name | `k8s.cluster.name` | (same) |
Both forms work on `metrics-kubeletstatsreceiver.otel*`. Prefer the unprefixed form: it's shorter and also works in counter-field queries via `TS`.
Common recipes:

Top pods by memory (last 5m, across all namespaces):

```esql
FROM metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 5 minutes AND k8s.pod.memory.working_set IS NOT NULL
| STATS avg_mem = AVG(k8s.pod.memory.working_set),
        max_mem = MAX(k8s.pod.memory.working_set)
  BY k8s.pod.name, k8s.namespace.name
| SORT max_mem DESC
| LIMIT 20
```
Which pods are on a specific node:

```esql
FROM metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 5 minutes
  AND k8s.node.name == "<node>" AND k8s.pod.name IS NOT NULL
| STATS last_seen = MAX(@timestamp) BY k8s.pod.name, k8s.namespace.name
| SORT last_seen DESC
```
Namespace-wide memory average (single scalar; works in `now`/`metric` mode):

```esql
FROM metrics-kubeletstatsreceiver.otel*
| WHERE k8s.namespace.name == "oteldemo-esyox-default"
  AND k8s.pod.memory.working_set IS NOT NULL
| STATS v = AVG(k8s.pod.memory.working_set)
```
Is this node under memory pressure (working set vs. available):

```esql
FROM metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 5 minutes
  AND k8s.node.name == "<node>" AND k8s.node.memory.working_set IS NOT NULL
| STATS working_set = AVG(k8s.node.memory.working_set),
        available = AVG(k8s.node.memory.available)
```
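If a single pressure number is handier than two columns, one possible extension computes working set as a share of working set plus available (an approximation of usable node memory, not an official formula):

```esql
FROM metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 5 minutes
  AND k8s.node.name == "<node>" AND k8s.node.memory.working_set IS NOT NULL
| STATS working_set = AVG(k8s.node.memory.working_set),
        available = AVG(k8s.node.memory.available)
| EVAL pressure_pct = ROUND(working_set * 100.0 / (working_set + available), 1)
| KEEP pressure_pct, working_set, available
```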
### Counter fields: require TS + RATE()

Network I/O, network errors, and uptime fields are stored as monotonically increasing counters (`counter_long`), not instantaneous gauges. `FROM` + `MAX`/`AVG`/`SUM`/`VALUES` on a counter field is a hard error; ES|QL returns `argument of [...] must be [...numeric except counter types]`.
Counter fields in this deployment:

| Field | Notes |
|---|---|
| `k8s.pod.network.io` | bytes; carries a `direction` attribute (`transmit` / `receive`), emitted as separate docs per direction |
| `k8s.pod.network.errors` | error count; also carries `direction` |
| `k8s.node.network.io`, `k8s.node.network.errors` | node-level equivalents |
| `k8s.node.uptime`, `k8s.pod.uptime` | seconds since start |
Correct pattern: use `TS` as the source command, wrap `RATE()` in an aggregation, filter the counter field `IS NOT NULL`, and group by `direction` whenever you query network fields.
Network throughput by cluster, last 15m (result in bytes/sec):
TS metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 15 minutes
AND k8s.pod.network.io IS NOT NULL
| STATS rate_bps = AVG(RATE(k8s.pod.network.io))
BY k8s.cluster.name, direction
| SORT rate_bps DESC
Rules:

- Use `TS`, not `FROM`. `FROM` will be rejected.
- Wrap `RATE()` in `AVG()` (or similar) when grouping; bare `RATE(...) BY ...` is rejected.
- Network counters are emitted as separate docs per direction. Without `BY direction` or a `direction == "..."` filter, transmit and receive aggregate into a meaningless combined number.
- Without `IS NOT NULL` the query spans the many kubeletstats docs that carry a different metric; you get nulls, not errors.
Escape hatch, raw counter snapshot: if you want the current counter value (e.g. "how long has node X been up"), cast first. `TO_LONG` strips the counter type and unlocks standard aggregations:

```esql
FROM metrics-kubeletstatsreceiver.otel*
| WHERE @timestamp > NOW() - 5 minutes AND k8s.node.uptime IS NOT NULL
| EVAL u = TO_LONG(k8s.node.uptime)
| STATS uptime_s = MAX(u) BY k8s.node.name
```
### APM traces

Primary index: `traces-*.otel-*` (OTel-native). Fallback: `traces-apm*` (classic APM; only if the OTel path returns empty).

In EDOT-ingested clusters, `traces-*.otel-*` carries both OTel-native fields (`duration`, `kind`, `status.code`) and classic-APM-compatible fields (`processor.event`, `event.outcome`, `transaction.duration.us` on transaction-level docs). The cluster's "APM-ness" isn't determined by the index; it's determined by which field shape you query.
| Signal | OTel-native (preferred) | Classic APM |
|---|---|---|
| Duration | `duration` (nanoseconds, long, populated on every span) | `transaction.duration.us` (microseconds, populated only on `processor.event == "transaction"` docs) |
| Error signal | `event.outcome == "failure"`; use this, 100% populated | `status.code == "Error"` (sparse; only set when instrumentation explicitly calls `SetStatus`) |
| Span kind | `kind`; values `Server`, `Internal`, `Client`, `Producer`, `Consumer` (title case, not SERVER/CLIENT) | `transaction.type` |
| Scope filter | `kind == "Server"` isolates incoming requests | `processor.event == "transaction"` |
| Service name | `service.name` | `service.name` |
**Unit warning.** OTel `duration` is in nanoseconds: divide by 1,000,000 for milliseconds. Classic `transaction.duration.us` is in microseconds: divide by 1,000. Mixing these in a comparison produces wildly wrong numbers.
Service p95 latency (OTel-native), last 15m, result in ms:

```esql
FROM traces-*.otel-*
| WHERE service.name == "checkout" AND @timestamp > NOW() - 15 minutes
  AND kind == "Server"
| STATS p95_ms = PERCENTILE(duration, 95) / 1000000
```
Error rate for a service (`event.outcome` is reliable here):

```esql
FROM traces-*.otel-*
| WHERE service.name == "checkout" AND @timestamp > NOW() - 15 minutes
  AND kind == "Server"
| STATS errors = COUNT(*) WHERE event.outcome == "failure", total = COUNT(*)
| EVAL error_rate_pct = ROUND(errors * 100.0 / total, 2)
| KEEP error_rate_pct, errors, total
```
If `traces-*.otel-*` returns empty, the deployment is classic-APM-only; fall back to `traces-apm*` with `processor.event == "transaction"` and `transaction.duration.us`.
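A sketch of that fallback, mirroring the p95 query above (note the microsecond divisor per the unit warning; the service name is illustrative):

```esql
FROM traces-apm*
| WHERE service.name == "checkout" AND @timestamp > NOW() - 15 minutes
  AND processor.event == "transaction"
| STATS p95_ms = PERCENTILE(transaction.duration.us, 95) / 1000
```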
Throughput trend: use the pre-aggregated rollup when possible. `metrics-service_summary.1m.otel-*` carries per-minute request counts in `service_summary` (a regular `long`, designed to `SUM`). Cheaper and faster than scanning raw traces for "how many requests/min over the last hour":

```esql
FROM metrics-service_summary.1m.otel-*
| WHERE service.name == "frontend" AND @timestamp > NOW() - 1 hour
| STATS throughput = SUM(service_summary)
  BY bucket = BUCKET(@timestamp, 1 minute)
| SORT bucket ASC
```
### Log rate

Index: `logs-*`

```esql
FROM logs-*
| WHERE service.name == "cartservice" AND @timestamp > NOW() - 5 minutes
| STATS v = COUNT(*)
```
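If the question is specifically about error-log rate, the same shape narrows on `log.level` (listed in the index table above); a sketch:

```esql
FROM logs-*
| WHERE service.name == "cartservice" AND @timestamp > NOW() - 5 minutes
  AND log.level == "error"
| STATS v = COUNT(*)
```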
## Query-construction rules

- For `now` and `metric` mode, the query must return a single row with a numeric first column; the tool reads the first numeric cell. For `table` mode this restriction doesn't apply: any shape is fine.
- Scope with `@timestamp > NOW() - <window>` when the user implies "right now" (the default 5-minute window is usually fine; let the window match the user's language).
- When the user names a namespace, match it exactly (e.g. `oteldemo-esyox-default`, not `otel-demo`). If unsure, call `apm-health-summary` first; its `namespace_candidates` field surfaces fuzzy matches.
- Match the aggregation to the field's storage shape. Three shapes to recognize:
  - Gauges (`memory.working_set`, `memory.available`, `cpu.usage`, `filesystem.usage` in `metrics-kubeletstatsreceiver.otel*`): use `AVG`/`MAX`/`MIN`. Do not `SUM` a gauge; it will add every ~15s kubelet sample over your window and inflate the value by hundreds or thousands of times.
  - Counters (`k8s.pod.network.io`, `k8s.node.uptime`, etc.; `counter_long` type): require `TS` + `RATE()`. See the "Counter fields" section above. `FROM` + `MAX`/`AVG`/`SUM` on a counter is a hard error, not a silent wrong number.
  - Pre-aggregated rollups (`service_summary` on `metrics-service_summary.1m.otel-*`, `span.destination.service.response_time.count` on `metrics-service_destination.1m.otel-*`): designed for `SUM` across the window. Each doc is already a per-minute bucket count.
## After the tool returns

The observe MCP App view renders inline in one of several modes, picked automatically from the result:

- Now mode (`status: NOW`): compact card with a big unit-formatted number, the ES|QL as subtitle, an "evaluated Xs ago" stamp, and three follow-up actions (re-check, escalate to live observation, create alert rule).
- Metric mode: area + line + dots sparkline with an optional threshold line; stat cards for current / threshold / peak / baseline. Covers `CONDITION_MET`, `TIMEOUT`, and `SAMPLED`.
- Anomaly mode: severity-scored trigger card with affected entities and click-to-send investigation prompts.
- Table mode (`status: TABLE`): styled HTML table with column headers, type-aware alignment (numeric right, text left), and zebra-striped rows. Row count and truncation notice in the subtitle.
- Error (`status: ERROR`): red-toned card with the ES|QL failure message verbatim. Surfaces instead of throwing when the query is bad (unknown field, missing index, syntax error).

All modes surface an `investigation_actions` list as buttons. Follow up in chat too; don't rely on the buttons alone.
### Status-by-status guidance

- `NOW`: State the value plainly. Offer to escalate to a live observation if the user seems to want ongoing visibility.
- `TABLE`: Summarize what the rows show (top entity, total count, any outliers). Don't just dump the full table back; the user can read the widget. If the result was truncated, say so and offer to tighten the ES|QL.
- `ERROR`: Read the error message, explain what likely went wrong (unknown field, index pattern, syntax), and propose a corrected query. Don't retry blindly.
- `ALERT` (anomaly fired): The response includes affected entities, affected services, top anomalies, and `investigation_hints` naming the next tool to reach for. Follow those hints immediately; don't just report the alert, start investigating and narrate your reasoning.
- `CONDITION_MET` (metric threshold satisfied): Confirm to the user and describe the trend from the returned history. If this was post-remediation validation, explicitly state that the fix has been validated. Offer to graduate the condition into a durable rule via `manage-alerts`.
- `SAMPLED` (live sample completed without a condition): Summarize the trend (trending up / down / flat, peak, typical value). Offer to keep observing (extend the window) or graduate to an alert rule.
- `TIMEOUT` (metric condition never met): Tell the user the metric didn't stabilize. Suggest follow-ups: check `ml-anomalies`, persist as an alert rule, re-examine the ES|QL.
- `QUIET` (anomaly mode, nothing fired): Suggest adjustments: lower `min_score`, widen `lookback`, verify ML jobs are running.
### Accumulating timelines

Every metric-mode response includes an `observe_key` derived from `esql` + `condition`. When Claude re-invokes observe with the same ES|QL (e.g. via the "Extend observation (+60s)" button), the view merges the new samples into the existing sparkline instead of resetting, so the user sees a continuous timeline across multiple tool calls. To keep this continuity, reuse the exact same ES|QL string and condition when extending. Timelines are capped at 240 points to keep the chart readable.
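Concretely, extending the live-sample call from earlier means re-sending it verbatim (same `esql`, same absent `condition`), so the view keys the new samples onto the existing timeline:

```json
{
  "mode": "metric",
  "esql": "FROM metrics-kubeletstatsreceiver.otel* | WHERE resource.attributes.k8s.namespace.name == \"oteldemo-esyox-default\" | STATS v = AVG(metrics.k8s.pod.memory.working_set)",
  "description": "oteldemo-esyox-default avg pod memory",
  "max_wait": 60
}
```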
## Tools

| Tool | Purpose |
|---|---|
| `observe` | Polls and blocks. Four modes: `anomaly`, `metric`, `now`, `table`. |
| `ml-anomalies` | Follow-up: deeper look at the anomaly that fired. |
| `apm-service-dependencies` | Follow-up: topology of affected services (if APM is available). |
| `apm-health-summary` | Follow-up: cluster-wide context; also useful for discovering which namespaces actually have data. |
| `k8s-blast-radius` | Follow-up: infra impact if a node is implicated. |
| `manage-alerts` | Graduate to persistent alerting once the pattern is well understood. |
## Key principles

- Observe is transient. Nothing is saved. If the user wants an ongoing rule, use `manage-alerts`.
- Pick the mode from the user's phrasing. "What is X right now" (scalar) → `now`. "Show me a live chart of X" or "watch X for 60s" → `metric` (no condition). "Wait until X drops below Y" → `metric` (with condition). "Tell me when anything unusual fires" → `anomaly`. "List …", "which pods are on node X", "top N by Y" → `table`. If the user asks "what is X" and X is actually a list or grouping rather than a single number, pick `table`, not `now`.
- Use the known field paths. Don't probe generic `metrics-*` patterns when the deployment indexes under `metrics-kubeletstatsreceiver.otel*`. The cheat sheet above is authoritative for this environment.
- On `ALERT`, start investigating immediately. The `investigation_hints` are suggestions; follow them and narrate your reasoning.
- Don't start with `observe` for vague triage. If the user reports a symptom without naming a specific metric ("something feels slow", "what's wrong with prod"), reach for `apm-health-summary` first; it surfaces the worst-offender services without needing a query. `observe` needs a target metric to poll, so use it to drill in after the rollup names something.
- Don't over-tune `min_score`. 75 catches the important stuff; dropping below 50 produces noise.