data-health-monitor

Data Health Monitor

Purpose

Answers "is my data flowing correctly?" with a single command. Aggregates health signals across streams, jobs, schema, and event quota into a unified, actionable report. Purely read-only.

Environment

Requires authenticated API access. See ../references/auth.md for credential resolution.

Flow

Run all four health checks, then present a unified report. If the user asks about a specific area, focus on that dimension but still show a summary of the others.

Check 1: Stream Health

curl -s "${LYTICS_API_URL:-https://api.lytics.io}/v2/stream" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

For each stream, evaluate:

| Signal | How to Detect | Severity |
|---|---|---|
| Active | last_msg_ts within last hour | HEALTHY |
| Stale (continuous) | last_msg_ts 1-24 hours ago | WARNING |
| Stale (batch) | last_msg_ts 2-7 days ago | WARNING |
| Dead | last_msg_ts > 24h ago (continuous) or > 7d (batch) | ERROR |
| Never received | ct == 0 | ERROR |

Distinguish batch from continuous streams by checking whether the stream has associated jobs with periodic schedules; if it does, treat it as batch.
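
The thresholds above can be sketched as a small classifier. This is only an illustration under stated assumptions: `classify_stream` is a hypothetical helper, `last_msg_ts` is assumed to be already parsed into a `datetime`, and `is_batch` is assumed to come from the job-schedule check. The table does not say how to rate a batch stream whose last message is between 1 hour and 2 days old, so the sketch treats it as active.

```python
from datetime import datetime, timedelta


def classify_stream(last_msg_ts, ct, is_batch, now):
    """Return (signal, severity) for one stream, following the table above."""
    if ct == 0:
        return ("Never received", "ERROR")
    age = now - last_msg_ts
    if is_batch:
        if age > timedelta(days=7):
            return ("Dead", "ERROR")
        if age >= timedelta(days=2):
            return ("Stale (batch)", "WARNING")
        # Assumption: the source leaves 1 hour to 2 days undefined for batch.
        return ("Active", "HEALTHY")
    if age > timedelta(hours=24):
        return ("Dead", "ERROR")
    if age > timedelta(hours=1):
        return ("Stale (continuous)", "WARNING")
    return ("Active", "HEALTHY")


now = datetime(2026, 4, 2, 12, 0)
print(classify_stream(now - timedelta(hours=3), 42, False, now))
# prints ('Stale (continuous)', 'WARNING')
```

Passing `now` explicitly keeps the function deterministic, which makes the boundary cases easy to check.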

For streams with issues, fetch recent stats for more detail:

curl -s "${LYTICS_API_URL:-https://api.lytics.io}/v2/stream/${STREAM_NAME}/stats" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

Check 2: Job Health

# Active jobs (default: non-terminal states)
curl -s "${LYTICS_API_URL:-https://api.lytics.io}/v2/job" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

# Also check recently failed jobs
curl -s "${LYTICS_API_URL:-https://api.lytics.io}/v2/job?show_completed=true" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

Evaluate each job:

| Status | Severity | Action |
|---|---|---|
| runnable | HEALTHY | Running normally |
| sleeping | HEALTHY | Scheduled, waiting for next run |
| paused | WARNING | Intentional but flag for awareness |
| fault | ERROR | Needs investigation -- fetch logs |
| failed | ERROR | Terminal failure -- fetch logs |
| killed | INFO | Manually stopped |

For faulted/failed jobs, fetch logs:

curl -s "${LYTICS_API_URL:-https://api.lytics.io}/v2/job/${JOB_ID}/logs" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

Also check for stale jobs: if a runnable job hasn't been updated in over 1 hour, it may be stuck.
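
One way to apply the status table and the stuck-job check is sketched below. `classify_job` and `needs_logs` are hypothetical helpers, and the `updated` argument assumes the job record exposes a last-updated timestamp that has been parsed into a `datetime`; the actual field name is not given in this document. The source does not assign a severity to a possibly stuck job, so the sketch reports it as a separate flag instead of changing the severity.

```python
from datetime import datetime, timedelta

# Status-to-severity mapping copied from the table above.
JOB_SEVERITY = {
    "runnable": "HEALTHY",
    "sleeping": "HEALTHY",
    "paused": "WARNING",
    "fault": "ERROR",
    "failed": "ERROR",
    "killed": "INFO",
}


def classify_job(status, updated, now, stale_after=timedelta(hours=1)):
    """Return (severity, possibly_stuck) for one job."""
    # Statuses not listed in the table are reported as UNKNOWN (assumption).
    severity = JOB_SEVERITY.get(status, "UNKNOWN")
    possibly_stuck = status == "runnable" and (now - updated) > stale_after
    return (severity, possibly_stuck)


def needs_logs(status):
    """Only faulted or failed jobs call for a log fetch."""
    return status in ("fault", "failed")


now = datetime(2026, 4, 2, 12, 0)
print(classify_job("runnable", now - timedelta(hours=2), now))
# prints ('HEALTHY', True)
```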

Check 3: Schema Health

# Get all fields with metadata
curl -s "${LYTICS_API_URL:-https://api.lytics.io}/v2/schema/user/field" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

Check:

  • Identity fields: Count fields where IsIdentifier == true. Flag if fewer than 2.
  • PII fields: Count fields marked IsPII == true for awareness.
  • Stale fields: Actively used fields whose Modified timestamp is older than 30 days.
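
The three checks above could be computed roughly as follows. This is a sketch under assumptions: the response is taken to be a list of field objects, the `name` key is hypothetical, and `Modified` is assumed to be already parsed into a `datetime`. Whether a field is actively used cannot be derived from this data alone, so the sketch returns stale candidates for the caller to filter.

```python
from datetime import datetime, timedelta


def summarize_schema(fields, now, stale_after=timedelta(days=30)):
    """Summarize identity, PII, and stale-candidate fields."""
    identity = [f["name"] for f in fields if f.get("IsIdentifier") is True]
    pii = [f["name"] for f in fields if f.get("IsPII") is True]
    stale = [
        f["name"]
        for f in fields
        if f.get("Modified") is not None and now - f["Modified"] > stale_after
    ]
    return {
        "identity_fields": identity,
        "too_few_identifiers": len(identity) < 2,  # flag if fewer than 2
        "pii_count": len(pii),
        "stale_candidates": stale,  # still needs the "actively used" filter
    }


now = datetime(2026, 4, 2)
sample = [
    {"name": "email", "IsIdentifier": True, "IsPII": True,
     "Modified": now - timedelta(days=1)},
    {"name": "city", "IsIdentifier": False, "IsPII": False,
     "Modified": now - timedelta(days=45)},
]
print(summarize_schema(sample, now)["stale_candidates"])
# prints ['city']
```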

For deeper coverage analysis:

curl -s "${LYTICS_API_URL:-https://api.lytics.io}/api/schema/user/fieldinfo" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

Check field presence/absence ratios. Flag fields with very low coverage that appear in segment FilterQL.

Check 4: Event Quota

curl -s "${LYTICS_API_URL:-https://api.lytics.io}/v2/control/eventquota/thresholds" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

Report current usage against thresholds (50%, 75%, 100%, 125%).
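
As a rough illustration, the highest threshold crossed can be derived from a usage percentage. The shape of the quota response is not described here, so the sketch assumes the caller has already extracted the used and allotted event counts; `usage_percent` and `highest_threshold_crossed` are hypothetical helpers.

```python
# Thresholds listed in this document, in percent.
QUOTA_THRESHOLDS = (50, 75, 100, 125)


def usage_percent(used, quota):
    """Usage as a percentage of the quota."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    return used / quota * 100


def highest_threshold_crossed(pct):
    """Return the largest threshold at or below pct, or None if under 50%."""
    crossed = [t for t in QUOTA_THRESHOLDS if pct >= t]
    return crossed[-1] if crossed else None


pct = usage_percent(300, 400)
print(pct, highest_threshold_crossed(pct))
# prints 75.0 75
```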

Optional: Metrics Deep Dive

When the user wants trends or deeper analysis:

# Stream throughput over last 24h
curl -s "${LYTICS_API_URL:-https://api.lytics.io}/v2/metric?dimension=stream&range=now-24h" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

# Job execution metrics over last 24h
curl -s "${LYTICS_API_URL:-https://api.lytics.io}/v2/metric?dimension=works&range=now-24h" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

# Segment size trends
curl -s "${LYTICS_API_URL:-https://api.lytics.io}/v2/metric?dimension=segment&range=now-24h" \
  -H "Authorization: ${LYTICS_API_TOKEN}"

Present as trends: "Stream throughput is down 40% vs yesterday" or "Segment sizes are stable."
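
A trend sentence like the ones above can be produced from two totals. The sketch below assumes the caller has already summed the metric for the current and previous 24-hour windows. `describe_trend` is a hypothetical helper, and the 5% band for "stable" is an assumption, not a value from this document.

```python
def describe_trend(label, current, previous, stable_within=5.0):
    """Describe the percent change between two windows in plain language."""
    if previous == 0:
        return f"{label} has no baseline for comparison."
    change = (current - previous) / previous * 100
    if abs(change) <= stable_within:
        return f"{label} is stable."
    direction = "up" if change > 0 else "down"
    return f"{label} is {direction} {abs(change):.0f}% vs yesterday."


print(describe_trend("Stream throughput", 600, 1000))
# prints Stream throughput is down 40% vs yesterday.
```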

Output Format

Present the report as:

## Data Health Report

### Overall: HEALTHY | NEEDS ATTENTION | UNHEALTHY

### Streams (N total)
  HEALTHY: X streams actively receiving data
  WARNING: 'stream_name' -- last event 3 days ago
  ERROR: 'stream_name' -- never received events

### Jobs (N active)
  HEALTHY: X jobs running normally
  FAULT: 'job_name' -- error message from logs
  PAUSED: 'job_name' -- paused since date

### Schema (user table, N fields)
  Identity fields: N configured (field1, field2, ...)
  Low coverage: 'field' at X%
  Stale: 'field' not updated in N days

### Event Quota
  Current usage: X% of monthly quota

### Recommendations
1. Specific actionable recommendation
2. Another recommendation
3. ...

Severity Logic

| Overall Status | Criteria |
|---|---|
| HEALTHY | All streams active, all jobs running, no faults, quota < 75% |
| NEEDS ATTENTION | Any: stale streams, paused jobs, low-coverage fields, quota 75-100% |
| UNHEALTHY | Any: faulted/failed jobs, dead streams, quota > 100% |
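
The criteria above amount to a worst-finding-wins rule, which might be sketched like this. `overall_status` and its keyword arguments are hypothetical; the counts are assumed to come from the four checks.

```python
def overall_status(*, dead_streams=0, stale_streams=0, faulted_jobs=0,
                   paused_jobs=0, low_coverage_fields=0, quota_pct=0.0):
    """Collapse per-check findings into a single overall status."""
    # Any UNHEALTHY condition takes priority over NEEDS ATTENTION conditions.
    if faulted_jobs or dead_streams or quota_pct > 100:
        return "UNHEALTHY"
    if stale_streams or paused_jobs or low_coverage_fields or quota_pct >= 75:
        return "NEEDS ATTENTION"
    return "HEALTHY"


print(overall_status(paused_jobs=1, quota_pct=40.0))
# prints NEEDS ATTENTION
```

Checking the UNHEALTHY conditions first means a single faulted job outweighs any number of warnings.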

Recommendations Engine

Generate specific, actionable recommendations based on findings:

  • Faulted job → "Investigate 'job_name' fault. Check auth credentials or bounce the job."
  • Dead stream → "Stream 'name' hasn't received data in N days. Check the source integration."
  • Zero-event stream → "Stream 'name' is configured but has never received data. Verify the integration is set up correctly."
  • Low identity fields → "Only N identity fields configured. Consider marking additional fields as identifiers for better profile resolution."
  • Quota approaching → "Event quota at X%. Consider reviewing high-volume streams or upgrading your plan."
  • Stale field → "Field 'name' hasn't been updated in N days. Check if the source integration is still active."
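
The mapping above could be held as a table of message templates keyed by finding type. In this sketch the `kind` keys and the placeholder names (`name`, `days`, `count`, `pct`) are hypothetical; the message text is taken from the list above.

```python
RECOMMENDATIONS = {
    "faulted_job": "Investigate '{name}' fault. Check auth credentials or bounce the job.",
    "dead_stream": "Stream '{name}' hasn't received data in {days} days. Check the source integration.",
    "zero_event_stream": "Stream '{name}' is configured but has never received data. Verify the integration is set up correctly.",
    "low_identity_fields": "Only {count} identity fields configured. Consider marking additional fields as identifiers for better profile resolution.",
    "quota_approaching": "Event quota at {pct}%. Consider reviewing high-volume streams or upgrading your plan.",
    "stale_field": "Field '{name}' hasn't been updated in {days} days. Check if the source integration is still active.",
}


def recommend(findings):
    """Turn a list of finding dicts into recommendation strings."""
    return [
        RECOMMENDATIONS[f["kind"]].format(**f)
        for f in findings
        if f["kind"] in RECOMMENDATIONS  # unrecognized findings are skipped
    ]


print(recommend([{"kind": "dead_stream", "name": "web", "days": 9}])[0])
# prints Stream 'web' hasn't received data in 9 days. Check the source integration.
```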

Error Handling

  • API errors on any check: Report the error for that dimension, continue with other checks. Never let one failed check block the whole report.
  • Empty responses: Report "No [streams/jobs/fields] found" -- this may indicate a new or unconfigured account.
  • Timeout: If a check takes too long, skip it with a note and proceed.
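
These rules could be implemented by isolating each check behind its own error boundary. In the sketch below, `run_checks` is a hypothetical helper, each check is assumed to be a zero-argument callable that returns its data, and a timeout is assumed to surface as the built-in `TimeoutError`.

```python
def run_checks(checks):
    """Run each check independently so one failure never blocks the report."""
    report = {}
    for name, check in checks.items():
        try:
            result = check()
        except TimeoutError:
            # Skip slow checks with a note and move on.
            report[name] = {"status": "skipped", "note": "timed out"}
            continue
        except Exception as exc:
            # Record the error for this dimension only.
            report[name] = {"status": "error", "note": str(exc)}
            continue
        if not result:
            report[name] = {"status": "empty", "note": f"No {name} found"}
        else:
            report[name] = {"status": "ok", "data": result}
    return report


print(run_checks({"streams": lambda: []}))
# prints {'streams': {'status': 'empty', 'note': 'No streams found'}}
```

`TimeoutError` is caught before the generic `Exception` handler because it is a subclass of `Exception` and would otherwise be reported as an ordinary error.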

Dependencies

  • Composes: stream-inspector skill, job-manager skill, schema-manager skill
  • References: ../references/auth.md, ../references/api-client.md
First Seen
Apr 2, 2026