ops-inspector by tencentcloudbase/skills

Standalone Install Note

If this environment only installed the current skill, start from the CloudBase main entry and use the published cloudbase/references/... paths for sibling skills.

CloudBase main entry: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/SKILL.md
Current skill raw source: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/ops-inspector/SKILL.md

Keep local references/... paths for files that ship with the current skill directory. When this file points to a sibling skill such as cloud-functions or cloudrun-development, use the standalone fallback URL shown next to that reference.

Activation Contract

Use this first when

The user wants to check the health or status of CloudBase resources (cloud functions, CloudRun, databases, storage, etc.).
The user reports errors, failures, or abnormal behavior and wants a quick diagnosis.
The user asks for an "inspection", "health check", "巡检", "诊断", or "troubleshooting" of their CloudBase environment.
The user wants to review recent error logs across services.

Read before writing code if

The inspection reveals code-level issues in cloud functions or CloudRun services — then read the relevant implementation skill before suggesting fixes.
The user wants to fix a problem found during inspection rather than just diagnose it.

Then also read

Cloud function issues -> ../cloud-functions/SKILL.md (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloud-functions/SKILL.md)
CloudRun issues -> ../cloudrun-development/SKILL.md (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudrun-development/SKILL.md)
Database issues -> ../relational-database-tool/SKILL.md (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/relational-database-tool/SKILL.md) or ../no-sql-web-sdk/SKILL.md (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/no-sql-web-sdk/SKILL.md)
Platform overview -> ../cloudbase-platform/SKILL.md (standalone fallback: https://cnb.cool/tencent/cloud/cloudbase/cloudbase-skills/-/git/raw/main/skills/cloudbase/references/cloudbase-platform/SKILL.md)

Do NOT use for

Deploying new resources or writing application code. This skill is read-only and diagnostic.
Replacing proper monitoring/alerting infrastructure. It provides point-in-time inspection, not continuous monitoring.
Directly fixing problems — it diagnoses and recommends; actual fixes should use the appropriate implementation skill.

Common mistakes / gotchas

Running a full inspection without first confirming the environment is bound (auth tool must show logged-in and env-bound state).
Ignoring CLS log service status — if CLS is not enabled, queryLogs will fail; always check first with queryLogs(action="checkLogService").
Searching logs without a time range — this can return excessive or irrelevant results. Always scope searches to a relevant time window.
Treating a single error log as the root cause without correlating across resources. A function error may stem from a database or config issue.

Minimal checklist

Environment is bound and accessible (envQuery(action="info"))
CLS log service is enabled (queryLogs(action="checkLogService"))
All target resources are listed before diving into details
Time range is specified for any log searches
Findings are summarized with severity levels and actionable recommendations

How to use this skill (for a coding agent)

Inspection Modes

The skill supports two modes based on user intent:

Mode	When to use	Scope
Full inspection	User asks for a general health check / 巡检 / 全面检查	All resource types in the environment
Targeted inspection	User reports a specific error or asks about a specific resource	One resource type or a specific resource

Full Inspection Workflow

Follow these steps in order for a comprehensive environment health check:

Step 1 — Environment Check

envQuery(action="info")

Confirm the environment is accessible. Record the envId for console link generation.

Step 2 — Log Service Status

queryLogs(action="checkLogService")

If CLS is not enabled, note this as a warning — log-based diagnosis will be unavailable. Recommend enabling CLS in the console: https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log

Step 3 — Cloud Functions Inspection

queryFunctions(action="listFunctions")

For each function, check:

Status: Is the function in an active/deployed state?
Recent errors: queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<recent>")
Common issues:
- Timeout errors (execution exceeded limit)
- Memory limit exceeded
- Runtime errors (unhandled exceptions)
- Cold start frequency

Step 4 — CloudRun Services Inspection

queryCloudRun(action="list")

For each service, check:

Status: Is the service running?
Detail: queryCloudRun(action="detail", detailServerName="<name>")
Common issues:
- Service not running (scaled to zero or crashed)
- Image pull failures
- OOMKilled events
- Health check failures

Step 5 — Error Log Aggregation (if CLS is enabled)

queryLogs(action="searchLogs", queryString="ERROR", service="tcb", startTime="<24h-ago>", limit=50)
queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", startTime="<24h-ago>", limit=50)

Look for patterns:

Repeated error messages (same error many times)
Cascading failures (errors in multiple services around the same time)
Timeout patterns

Step 6 — Summary Report

Generate a structured report:

# CloudBase Resource Inspection Report

**Environment**: ${envId}
**Inspection Time**: ${timestamp}

## Overall Health: ✅ Healthy / ⚠️ Warnings Found / ❌ Issues Found

### Cloud Functions
| Function | Status | Recent Errors | Severity |
|----------|--------|---------------|----------|
| ... | ... | ... | ... |

### CloudRun Services
| Service | Status | Issues | Severity |
|---------|--------|--------|----------|
| ... | ... | ... | ... |

### Error Log Summary
- Total errors in last 24h: N
- Top error patterns: ...

## Recommendations
1. ...
2. ...

## Console Links
- Cloud Functions: https://tcb.cloud.tencent.com/dev?envId=${envId}#/scf
- CloudRun: https://tcb.cloud.tencent.com/dev?envId=${envId}#/platform-run
- Logs: https://tcb.cloud.tencent.com/dev?envId=${envId}#/devops/log

Targeted Inspection Workflow

When the user specifies a resource type or a specific resource:

Cloud function errors: queryFunctions(action="listFunctionLogs", functionName="<name>") then queryLogs(action="searchLogs", queryString="* AND functionName:<name> AND level:ERROR", ...)
CloudRun errors: queryCloudRun(action="detail", detailServerName="<name>") then queryLogs(action="searchLogs", queryString="ERROR", service="tcbr", ...)
Database issues: Check querySqlDatabase or readNoSqlDatabaseStructure depending on type
General error search: queryLogs(action="searchLogs", queryString="<error-keyword>", ...)

AIOps Methodology

This skill follows AIOps principles for intelligent inspection:

Data Collection: Gather logs and resource states via MCP tools
Pattern Recognition: Identify recurring errors, anomaly patterns, and correlations across services
Root Cause Hypothesis: Based on error patterns, suggest likely root causes (e.g., a function timeout may be caused by a database query bottleneck)
Actionable Recommendations: Provide specific, prioritized remediation steps with links to relevant skills and console pages

Severity Levels

Level	Icon	Meaning
Critical	❌	Service is down or data is at risk; requires immediate action
Warning	⚠️	Errors detected but service is still partially functional; investigate soon
Info	ℹ️	No errors found; informational status only
Healthy	✅	Resource is operating normally

Preferred Tool Map

Operation	MCP Tool Call
Check environment	`envQuery(action="info")`
Check CLS status	`queryLogs(action="checkLogService")`
List cloud functions	`queryFunctions(action="listFunctions")`
Get function detail	`queryFunctions(action="getFunctionDetail", functionName="<name>")`
Get function logs	`queryFunctions(action="listFunctionLogs", functionName="<name>", startTime="<time>", endTime="<time>")`
Get function log detail	`queryFunctions(action="getFunctionLogDetail", requestId="<id>")`
List CloudRun services	`queryCloudRun(action="list")`
Get CloudRun detail	`queryCloudRun(action="detail", detailServerName="<name>")`
Search CLS logs	`queryLogs(action="searchLogs", queryString="<query>", service="tcb\|tcbr", startTime="<time>", endTime="<time>")`
Check NoSQL structure	`readNoSqlDatabaseStructure(action="listCollections")`
Check MySQL status	`querySqlDatabase(action="getContext")`

Common CLS Query Patterns

Scenario	queryString
All errors	`ERROR`
Function timeout	`timeout OR 超时`
Function OOM	`OOM OR out of memory OR 内存超限`
CloudRun crash	`crash OR OOMKilled OR Error`
Specific function errors	`functionName:<name> AND level:ERROR`
5xx HTTP errors	`statusCode:>499`
Cold start issues	`coldStart OR 冷启动`

Time Range Guidance

Quick check: Last 1 hour (startTime = 1 hour ago)
Standard inspection: Last 24 hours
Trend analysis: Last 7 days
Specific incident: Narrow to the reported time window

Always use ISO 8601 format for startTime/endTime, e.g., "2025-01-15 00:00:00".

Related Skills

cloud-functions — Cloud function development, deployment, and debugging
cloudrun-development — CloudRun backend deployment and management
cloudbase-platform — General platform knowledge and console navigation
relational-database-tool — MySQL database management and diagnostics