# SigNoz Health Check
Perform a comprehensive health check of the SigNoz observability platform, analyzing services, logs, metrics, traces, and alerts to identify highlights and issues.
## Usage

`/signoz-health-check [timeRange]`

Examples:

- `/signoz-health-check` - Check last 24 hours (default)
- `/signoz-health-check 1h` - Check last 1 hour
- `/signoz-health-check 6h` - Check last 6 hours
- `/signoz-health-check 7d` - Check last 7 days
## What It Checks
### 1. Service Health
- List all active services
- Call rates and request volumes
- Error rates (identify services with >0% errors)
- P99 latency metrics
- Top operations per service
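
To make the arithmetic concrete, here is a minimal sketch of how call volume, error rate, and P99 latency can be derived from raw request records. The `Request` shape is an assumption for illustration, not the SigNoz span schema.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    # NOTE: illustrative record shape, not the SigNoz span schema.
    service: str
    duration_ms: float
    is_error: bool

def service_health(requests: list[Request]) -> dict[str, dict]:
    """Summarize calls, error rate, and P99 latency per service (sketch)."""
    by_service: dict[str, list[Request]] = {}
    for r in requests:
        by_service.setdefault(r.service, []).append(r)

    report = {}
    for service, reqs in by_service.items():
        errors = sum(1 for r in reqs if r.is_error)
        durations = sorted(r.duration_ms for r in reqs)
        # Nearest-rank P99: the smallest value >= 99% of observations.
        p99 = durations[max(0, math.ceil(0.99 * len(durations)) - 1)]
        report[service] = {
            "calls": len(reqs),
            "error_rate_pct": round(100 * errors / len(reqs), 2),
            "p99_ms": p99,
        }
    return report
```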
### 2. Error Analysis
- Recent ERROR/FATAL logs across all services
- Error patterns and frequency
- Services most affected by errors
- Common error messages
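
A sketch of the grouping step, assuming log entries are available as (service, message) pairs; real SigNoz log records carry many more fields. Masking variable fragments lets near-identical messages count as one recurring pattern.

```python
import re
from collections import Counter

def error_patterns(logs: list[tuple[str, str]], top_n: int = 5):
    """Group ERROR/FATAL messages into recurring patterns (sketch)."""
    patterns: Counter[str] = Counter()
    services: Counter[str] = Counter()
    for service, message in logs:
        # Mask numbers and long hex-like IDs so "job 42 failed" and
        # "job 7 failed" land in the same bucket.
        masked = re.sub(r"\b[0-9a-f-]{8,}\b|\b\d+\b", "<id>", message)
        patterns[masked] += 1
        services[service] += 1
    return patterns.most_common(top_n), services.most_common(top_n)
```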
### 3. Traces
- Error traces for high-error services
- Slow traces (P99 > threshold)
- Trace patterns and bottlenecks
### 4. Alerts
- Active alerts currently firing
- Alert frequency and patterns
- Services with no alerts configured (risk)
### 5. Metrics
- Available metric keys
- Key performance indicators
- Resource utilization patterns
### 6. Dashboards
- List of configured dashboards
- Coverage gaps
## Output Format
The check produces a structured report with:
### Highlights
- ✅ Well-performing services (0% error rate)
- 📊 High traffic endpoints
- 🎯 Key metrics and thresholds met
### Issues
- 🔴 Critical: Services with high error rates (>1%)
- 🟡 Warning: Recurring errors or performance degradation
- 🟠 Attention: Missing alerts or monitoring gaps
### Recommendations
- Immediate actions required
- Monitoring improvements needed
- Performance optimizations to consider
## Time Range Format

Supported formats:

- `30m`, `1h`, `2h`, `6h` - minutes and hours
- `24h`, `7d` - days
- Default: `24h` (24 hours)
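
Parsing this format takes only a few lines. The sketch below shows one plausible implementation; the function name and error handling are illustrative, not part of the skill itself.

```python
from datetime import timedelta

UNITS = {"m": "minutes", "h": "hours", "d": "days"}

def parse_time_range(value: str = "24h") -> timedelta:
    """Parse ranges like '30m', '1h', '6h', '24h', '7d' into a timedelta."""
    if len(value) < 2 or value[-1] not in UNITS or not value[:-1].isdigit():
        raise ValueError(f"unsupported time range: {value!r}")
    return timedelta(**{UNITS[value[-1]]: int(value[:-1])})
```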
## Prerequisites
- SigNoz MCP server must be configured and connected
- Access to SigNoz API required
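
For orientation, an MCP server entry in a Claude configuration generally follows the shape below. The package name, URL, and environment variable names here are placeholders, so substitute the values from your SigNoz MCP server's own documentation:

```json
{
  "mcpServers": {
    "signoz": {
      "command": "npx",
      "args": ["-y", "signoz-mcp-server"],
      "env": {
        "SIGNOZ_URL": "https://signoz.example.com",
        "SIGNOZ_API_KEY": "<your-api-key>"
      }
    }
  }
}
```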
## Example Output
# SigNoz Health Check - Last 24 Hours
## 📊 System Overview
- 5 active services
- 141,588 total requests
- 103 errors (0.07% overall)
## ✅ Highlights
- api: 103,878 calls, 0% errors
- ws-server: 13,268 calls, 0% errors
- 6 dashboards configured
## ⚠️ Issues
🔴 code-api: 1.98% error rate (103/5,200 calls)
- P99 latency: 56.8s
- Error: Session errors on POST /v1/sessions/:sessionId/v1/messages
🟡 Recurring errors:
- "Test case not found" - 30+ occurrences
- "Failed to ensure TTLs on job tracker keys" - Every 5 minutes
## 💡 Recommendations
1. Investigate code-api session errors immediately
2. Create alerts for error rate > 1%
3. Fix Redis TTL management issues
## How It Works

1. Query services: Get all active services for the time range
2. Check alerts: List active and historical alerts
3. Analyze logs: Query error logs (ERROR/FATAL severity)
4. Inspect traces: Sample error traces from high-error services
5. Review metrics: Check available metrics and dashboards
6. Synthesize: Combine data into actionable insights
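
In rough pseudocode, the flow looks like this. Here `signoz` stands for a hypothetical client wrapping the MCP tools, and every method name is illustrative rather than an actual tool signature:

```python
def health_check(signoz, time_range: str = "24h") -> dict:
    """Run the six health-check steps and collect the raw data (sketch).

    `signoz` is a hypothetical client; method names are illustrative.
    """
    services = signoz.list_services(time_range)            # 1. query services
    alerts = signoz.list_alerts(time_range)                # 2. check alerts
    errors = signoz.search_logs(                           # 3. analyze logs
        severity=["ERROR", "FATAL"], time_range=time_range)
    error_traces = [                                       # 4. inspect traces
        signoz.sample_error_traces(svc["name"], time_range)
        for svc in services if svc["error_rate"] > 0.01]
    metrics = signoz.list_metric_keys(time_range)          # 5. review metrics
    dashboards = signoz.list_dashboards()
    return {                                               # 6. synthesize
        "services": services, "alerts": alerts, "errors": errors,
        "error_traces": error_traces, "metrics": metrics,
        "dashboards": dashboards,
    }
```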
## Common Issues

**"No services found"**
- Check that SigNoz is running and instrumentation is active
- Verify the time range contains data

**"MCP server not connected"**
- Ensure the SigNoz MCP server is configured in Claude settings
- Check network connectivity to SigNoz

**"Incomplete data"**
- Some services may not have full instrumentation
- Check for `dataWarning` fields indicating overflow
## Integration with Other Skills
This skill complements:
- Investigation workflows (drill into specific errors)
- Alerting setup (identify missing alerts)
- Performance optimization (find bottlenecks)
## Key Metrics Tracked
| Metric | Threshold | Action |
|---|---|---|
| Error rate | >1% | Critical investigation |
| P99 latency | >10s | Performance review |
| Alert count | 0 configured | Add monitoring |
| Recurring errors | >10/hour | Root cause analysis |
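
Encoded as checks, the table above might look like the sketch below; the stat field names are assumptions for illustration:

```python
def evaluate(stats: dict) -> list[str]:
    """Apply the threshold table to one service's stats (sketch).

    `stats` is assumed to carry error_rate_pct, p99_seconds,
    alert_count, and recurring_errors_per_hour fields.
    """
    findings = []
    if stats["error_rate_pct"] > 1.0:
        findings.append("CRITICAL: error rate above 1%, investigate now")
    if stats["p99_seconds"] > 10:
        findings.append("WARNING: P99 latency above 10s, review performance")
    if stats["alert_count"] == 0:
        findings.append("ATTENTION: no alerts configured, add monitoring")
    if stats["recurring_errors_per_hour"] > 10:
        findings.append("WARNING: recurring errors, run root-cause analysis")
    return findings
```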
## When to Use
- Regular health checks: Run daily or weekly
- Incident investigation: Start with this to get context
- Performance reviews: Identify optimization opportunities
- Capacity planning: Understand traffic patterns
- Post-deployment: Verify system health after changes
## Advanced Usage
The skill collects data that can be further analyzed:
- Correlate error spikes with deployments (see the sketch after this list)
- Track error rate trends over time
- Identify cascade failures across services
- Monitor specific error patterns
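
As one concrete example, deployment correlation can be as simple as counting errors that land shortly after each deploy. Timestamps are assumed to come from the collected log data and your CI/CD system:

```python
from datetime import datetime, timedelta

def errors_near_deploys(error_times: list[datetime],
                        deploy_times: list[datetime],
                        window: timedelta = timedelta(minutes=15)) -> dict:
    """Count errors occurring within `window` after each deploy (sketch)."""
    return {
        deploy.isoformat(): sum(
            1 for t in error_times if deploy <= t <= deploy + window)
        for deploy in deploy_times
    }
```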
For deeper investigation of specific issues, use:
- `mcp__signoz__signoz_get_trace_details` for trace analysis
- `mcp__signoz__signoz_search_logs_by_service` for detailed logs
- `mcp__signoz__signoz_get_alert_history` for alert patterns