ops-inspector


Standalone Install Note

If this environment only installed the current skill, start from the CloudBase main entry and use the published cloudbase/references/... paths for sibling skills.

  • CloudBase main entry: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/SKILL.md
  • Current skill raw source: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/ops-inspector/SKILL.md

Keep local references/... paths for files that ship with the current skill directory. When this file points to a sibling skill such as cloud-functions or cloudrun-development, use the standalone fallback URL shown next to that reference.

Activation Contract

Use this first when

  • The user wants to check the health or status of CloudBase resources (cloud functions, CloudRun, databases, storage, etc.).
  • The user reports errors, failures, or abnormal behavior and wants a quick diagnosis.
  • The user asks for an "inspection", "health check", "巡检" (routine inspection), "诊断" (diagnosis), or "troubleshooting" of their CloudBase environment.
  • The user wants to review recent error logs across services.

Read before writing code if

  • The inspection reveals code-level issues in cloud functions or CloudRun services — then read the relevant implementation skill before suggesting fixes.
  • The user wants to fix a problem found during inspection rather than just diagnose it.

Then also read

  • Cloud function issues -> ../cloud-functions/SKILL.md (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloud-functions/SKILL.md)
  • CloudRun issues -> ../cloudrun-development/SKILL.md (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudrun-development/SKILL.md)
  • Database issues -> ../relational-database-tool/SKILL.md (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/relational-database-tool/SKILL.md) or ../no-sql-web-sdk/SKILL.md (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/no-sql-web-sdk/SKILL.md)
  • Platform overview -> ../cloudbase-platform/SKILL.md (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudbase-platform/SKILL.md)

Do NOT use for

  • Deploying new resources or writing application code. This skill is read-only and diagnostic.
  • Replacing proper monitoring/alerting infrastructure. It provides point-in-time inspection, not continuous monitoring.
  • Directly fixing problems — it diagnoses and recommends; actual fixes should use the appropriate implementation skill.

Common mistakes / gotchas

  • Running a full inspection without first confirming the environment is bound (auth tool must show logged-in and env-bound state).
  • Ignoring CLS log service status — if CLS is not enabled, queryLogs will fail; always check first with queryLogs(action="checkLogService").
  • Searching logs without a time range — this can return excessive or irrelevant results. Always scope searches to a relevant time window.
  • Treating a single error log as the root cause without correlating across resources. A function error may stem from a database or config issue.

Minimal checklist

  • Environment is bound and accessible (envQuery(action="info"))
  • CLS log service is enabled (queryLogs(action="checkLogService"))
  • All target resources are listed before diving into details
  • Time range is specified for any log searches
  • Findings are summarized with severity levels and actionable recommendations

How to use this skill (for a coding agent)

Inspection Modes

The skill supports two modes based on user intent:

| Mode | When to use | Scope |
|------|-------------|-------|
| Full inspection | User asks for a general health check / 巡检 (routine inspection) / 全面检查 (comprehensive check) | All resource types in the environment |
| Targeted inspection | User reports a specific error or asks about a specific resource | One resource type or a specific resource |
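
As a rough illustration of the mode choice above, the sketch below picks a mode from the user's request. The function name `choose_mode` and the keyword list are hypothetical and not part of this skill; a real agent would normally rely on its own reading of the request rather than keyword matching.

```python
from typing import Optional

# Hypothetical hint phrases; extend or replace them to suit your own prompts.
FULL_INSPECTION_HINTS = ("health check", "full inspection", "巡检", "全面检查")


def choose_mode(user_request: str, named_resource: Optional[str] = None) -> str:
    """Return "full" or "targeted" for a user request.

    A named resource always means a targeted inspection. Otherwise the
    request must contain one of the general health-check phrases to
    trigger a full inspection.
    """
    if named_resource:
        return "targeted"
    text = user_request.lower()
    if any(hint in text for hint in FULL_INSPECTION_HINTS):
        return "full"
    # Anything else (for example a specific error report) is treated as targeted.
    return "targeted"
```

Defaulting to a targeted inspection keeps the scope small when the intent is unclear; the full workflow can still be run if the targeted pass finds nothing.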

Full Inspection Workflow

Follow these steps in order for a comprehensive environment health check:

Step 1 — Environment Check

envQuery(action="info")

Confirm the environment is accessible. Record the envId for console link generation.

Step 2 — Log Service Status

queryLogs(action="checkLogService")

If CLS is not enabled, note this as a warning — log-based diagnosis will be unavailable. Recommend enabling CLS in the console: https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log
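
A minimal sketch of building console links from the recorded envId. The base URL and page fragments are copied from the console links used in this skill; the helper name `console_link` and the page keys are hypothetical.

```python
from urllib.parse import quote

CONSOLE_BASE = "https://tcb.cloud.tencent.com/dev"

# Page fragments as they appear in this skill's console links.
CONSOLE_PAGES = {
    "functions": "#/scf",
    "cloudrun": "#/platform-run",
    "logs": "#/devops/log",
}


def console_link(env_id: str, page: str) -> str:
    """Build a console URL for the given environment and page key."""
    # quote() guards against unexpected characters in the envId.
    return f"{CONSOLE_BASE}?envId={quote(env_id, safe='')}{CONSOLE_PAGES[page]}"
```

For example, `console_link(env_id, "logs")` yields the link to recommend when CLS is not enabled.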

Step 3 — Cloud Functions Inspection

queryFunctions(action="listFunctions")

For each function, check:

  • Status: Is the function in an active/deployed state?
  • Recent errors: queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<recent>")
  • Common issues:
    • Timeout errors (execution exceeded limit)
    • Memory limit exceeded
    • Runtime errors (unhandled exceptions)
    • Cold start frequency
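
The common issues above can be bucketed with a simple classifier once log messages have been fetched. This is a sketch only: the regular expressions are assumptions about how such errors are worded, not the actual CloudBase log format, and cold start frequency is not covered because it depends on invocation metadata rather than message text.

```python
import re
from collections import Counter

# Hypothetical patterns; adjust them to the wording seen in your own logs.
# Order matters: the first matching category wins.
ISSUE_PATTERNS = {
    "timeout": re.compile(r"timeout|timed out|超时", re.IGNORECASE),
    "memory": re.compile(r"out of memory|\bOOM\b|内存超限", re.IGNORECASE),
    "runtime_error": re.compile(r"exception|error", re.IGNORECASE),
}


def classify_log(message: str) -> str:
    """Return the first issue category whose pattern matches, else "other"."""
    for label, pattern in ISSUE_PATTERNS.items():
        if pattern.search(message):
            return label
    return "other"


def summarize_issues(messages):
    """Count how many messages fall into each issue category."""
    return Counter(classify_log(m) for m in messages)
```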

Step 4 — CloudRun Services Inspection

queryCloudRun(action="list")

For each service, check:

  • Status: Is the service running?
  • Detail: queryCloudRun(action="detail", detailServerName="<name>")
  • Common issues:
    • Service not running (scaled to zero or crashed)
    • Image pull failures
    • OOMKilled events
    • Health check failures

Step 5 — Error Log Aggregation (if CLS is enabled)

queryLogs(action="searchLogs", queryString="ERROR", service="tcb", startTime="<24h-ago>", limit=50)
queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", startTime="<24h-ago>", limit=50)

Look for patterns:

  • Repeated error messages (same error many times)
  • Cascading failures (errors in multiple services around the same time)
  • Timeout patterns
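
The first two patterns can be detected mechanically. The sketch below assumes each log entry has already been parsed into a dict with `time` (a naive `datetime`), `service`, and `message` keys; that shape is hypothetical, since the actual queryLogs response format is not described here.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def top_error_patterns(entries, top_n=3):
    """Return the most frequently repeated error messages with their counts."""
    return Counter(e["message"] for e in entries).most_common(top_n)


def cascading_windows(entries, window=timedelta(minutes=5)):
    """Return start times of fixed windows in which more than one service logged errors."""
    services_by_bucket = defaultdict(set)
    for e in entries:
        # Integer index of the fixed-size window this entry falls into.
        bucket = (e["time"] - EPOCH) // window
        services_by_bucket[bucket].add(e["service"])
    return sorted(
        EPOCH + bucket * window
        for bucket, services in services_by_bucket.items()
        if len(services) > 1
    )
```

Fixed windows are a simplification: two related errors that straddle a window boundary land in different buckets, so a sliding window would be more thorough at the cost of extra code.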

Step 6 — Summary Report

Generate a structured report:

# CloudBase Resource Inspection Report

**Environment**: ${envId}
**Inspection Time**: ${timestamp}

## Overall Health: ✅ Healthy / ⚠️ Warnings Found / ❌ Issues Found

### Cloud Functions
| Function | Status | Recent Errors | Severity |
|----------|--------|---------------|----------|
| ... | ... | ... | ... |

### CloudRun Services
| Service | Status | Issues | Severity |
|---------|--------|--------|----------|
| ... | ... | ... | ... |

### Error Log Summary
- Total errors in last 24h: N
- Top error patterns: ...

## Recommendations
1. ...
2. ...

## Console Links
- Cloud Functions: https://tcb.cloud.tencent.com/dev?envId=${envId}#/scf
- CloudRun: https://tcb.cloud.tencent.com/dev?envId=${envId}#/platform-run
- Logs: https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log
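
The tables in the report template can be generated rather than typed by hand. The following is a sketch; `render_table` and `render_functions_section` are hypothetical helper names, and the rows are whatever the earlier steps collected.

```python
def render_table(headers, rows):
    """Render a pipe table in the same layout as the report template."""
    lines = [
        "| " + " | ".join(headers) + " |",
        # Separator cells are as wide as the padded header cells.
        "|" + "|".join("-" * (len(h) + 2) for h in headers) + "|",
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


def render_functions_section(rows):
    """Render the Cloud Functions section of the report."""
    headers = ["Function", "Status", "Recent Errors", "Severity"]
    return "### Cloud Functions\n" + render_table(headers, rows)
```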

Targeted Inspection Workflow

When the user specifies a resource type or a specific resource:

  1. Cloud function errors: queryFunctions(action="listFunctionLogs", functionName="<name>") then queryLogs(action="searchLogs", queryString="* AND functionName:<name> AND level:ERROR", ...)
  2. CloudRun errors: queryCloudRun(action="detail", detailServerName="<name>") then queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", ...)
  3. Database issues: Check querySqlDatabase or readNoSqlDatabaseStructure depending on type
  4. General error search: queryLogs(action="searchLogs", queryString="<error-keyword>", ...)
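
One way to keep the targeted workflow consistent is to express it as data: a list of tool names and arguments that the agent then executes in order. The sketch below only builds that plan and does not call any tool; the function name `targeted_plan` and the resource-type keys are hypothetical, while the tool names and arguments are the ones listed in this skill. Time-range arguments are omitted here and should be added before running any log search.

```python
def targeted_plan(resource_type, name=None):
    """Return an ordered list of (tool, arguments) pairs for a targeted inspection."""
    if resource_type == "function":
        return [
            ("queryFunctions", {"action": "listFunctionLogs", "functionName": name}),
            ("queryLogs", {
                "action": "searchLogs",
                "queryString": f"* AND functionName:{name} AND level:ERROR",
            }),
        ]
    if resource_type == "cloudrun":
        return [
            ("queryCloudRun", {"action": "detail", "detailServerName": name}),
            ("queryLogs", {"action": "searchLogs", "queryString": "ERROR", "service": "tcbr"}),
        ]
    if resource_type == "sql":
        return [("querySqlDatabase", {"action": "getContext"})]
    if resource_type == "nosql":
        return [("readNoSqlDatabaseStructure", {"action": "listCollections"})]
    raise ValueError(f"unknown resource type: {resource_type!r}")
```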

AIOps Methodology

This skill follows AIOps principles for intelligent inspection:

  1. Data Collection: Gather logs and resource states via MCP tools
  2. Pattern Recognition: Identify recurring errors, anomaly patterns, and correlations across services
  3. Root Cause Hypothesis: Based on error patterns, suggest likely root causes (e.g., a function timeout may be caused by a database query bottleneck)
  4. Actionable Recommendations: Provide specific, prioritized remediation steps with links to relevant skills and console pages

Severity Levels

| Level | Icon | Meaning |
|-------|------|---------|
| Critical | | Service is down or data is at risk; requires immediate action |
| Warning | ⚠️ | Errors detected but service is still partially functional; investigate soon |
| Info | ℹ️ | No errors found; informational status only |
| Healthy | | Resource is operating normally |
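
A sketch of rolling per-resource severities up into the report's overall health line. The mapping used here (any Critical gives "Issues Found", any Warning gives "Warnings Found", otherwise "Healthy") is an assumption; this skill does not define the roll-up rule explicitly.

```python
# Ordered from least to most severe.
SEVERITY_ORDER = ["Healthy", "Info", "Warning", "Critical"]


def overall_health(severities):
    """Return the overall health headline for a list of per-resource severities."""
    # With no findings at all, the environment is treated as healthy.
    worst = max(severities, key=SEVERITY_ORDER.index, default="Healthy")
    if worst == "Critical":
        return "❌ Issues Found"
    if worst == "Warning":
        return "⚠️ Warnings Found"
    return "✅ Healthy"
```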

Preferred Tool Map

| Operation | MCP Tool Call |
|-----------|---------------|
| Check environment | envQuery(action="info") |
| Check CLS status | queryLogs(action="checkLogService") |
| List cloud functions | queryFunctions(action="listFunctions") |
| Get function detail | queryFunctions(action="getFunctionDetail", functionName="<name>") |
| Get function logs | queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<time>", endTime="<time>") |
| Get function log detail | queryFunctions(action="getFunctionLogDetail", requestId="<id>") |
| List CloudRun services | queryCloudRun(action="list") |
| Get CloudRun detail | queryCloudRun(action="detail", detailServerName="<name>") |
| Search CLS logs | queryLogs(action="searchLogs", queryString="<query>", service="<tcb or tcbr>", startTime="<time>", endTime="<time>") |
| Check NoSQL structure | readNoSqlDatabaseStructure(action="listCollections") |
| Check MySQL status | querySqlDatabase(action="getContext") |

Common CLS Query Patterns

| Scenario | queryString |
|----------|-------------|
| All errors | ERROR |
| Function timeout | timeout OR 超时 |
| Function OOM | OOM OR out of memory OR 内存超限 |
| CloudRun crash | crash OR OOMKilled OR Error |
| Specific function errors | functionName:<name> AND level:ERROR |
| 5xx HTTP errors | statusCode:>499 |
| Cold start issues | coldStart OR 冷启动 |
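
To avoid retyping these patterns, they can be kept in a small lookup. The query strings below are copied from the patterns listed above; the dictionary keys and helper names are hypothetical.

```python
# Query strings taken from the patterns above; the keys are illustrative.
QUERY_PATTERNS = {
    "all_errors": "ERROR",
    "function_timeout": "timeout OR 超时",
    "function_oom": "OOM OR out of memory OR 内存超限",
    "cloudrun_crash": "crash OR OOMKilled OR Error",
    "http_5xx": "statusCode:>499",
    "cold_start": "coldStart OR 冷启动",
}


def function_error_query(function_name: str) -> str:
    """Build the queryString for errors from one specific function."""
    return f"functionName:{function_name} AND level:ERROR"


def any_of(*terms: str) -> str:
    """Join alternative search terms with OR, as the patterns above do."""
    return " OR ".join(terms)
```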

Time Range Guidance

  • Quick check: Last 1 hour (startTime = 1 hour ago)
  • Standard inspection: Last 24 hours
  • Trend analysis: Last 7 days
  • Specific incident: Narrow to the reported time window

Always pass startTime/endTime as date-time strings in the ISO 8601-style form YYYY-MM-DD HH:mm:ss, e.g., "2025-01-15 00:00:00".
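
A minimal sketch of turning the guidance above into startTime/endTime strings in the same format as the example. The helper name `time_range` and the keys `quick`, `standard`, and `trend` are hypothetical, and the use of naive local time is an assumption; check which time zone the tools expect.

```python
from datetime import datetime, timedelta

# Matches the example format "2025-01-15 00:00:00".
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Look-back periods from the guidance above.
LOOKBACK = {
    "quick": timedelta(hours=1),
    "standard": timedelta(hours=24),
    "trend": timedelta(days=7),
}


def time_range(kind, now=None):
    """Return (startTime, endTime) strings for the given inspection kind."""
    now = now or datetime.now()
    start = now - LOOKBACK[kind]
    return start.strftime(TIME_FORMAT), now.strftime(TIME_FORMAT)
```

For a specific incident, pass the reported window directly instead of using a look-back period.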

Related Skills

  • cloud-functions — Cloud function development, deployment, and debugging
  • cloudrun-development — CloudRun backend deployment and management
  • cloudbase-platform — General platform knowledge and console navigation
  • relational-database-tool — MySQL database management and diagnostics