triaging-live-sql-activity
Triaging Live SQL Activity
Diagnoses live cluster performance issues by identifying currently active long-running queries, busy sessions, and active transactions. Uses SQL-only interfaces (SHOW statements and crdb_internal views) to provide immediate triage without requiring DB Console, HTTP endpoints, or Prometheus access.
When to Use This Skill
- Users report "the cluster is slow right now"
- High CPU or memory usage on cluster nodes
- Need to identify runaway queries or stuck transactions
- Want to find which applications/users are consuming resources
- Require immediate triage without DB Console access
- Need to generate SQL to cancel problematic sessions/queries
For historical performance analysis: Use profiling-statement-fingerprints to analyze query patterns over time, identify slow fingerprints, and investigate trends without needing live queries. For transaction-level analysis: Use profiling-transaction-fingerprints to analyze historical transaction retry patterns, commit latency trends, and statement composition. For background job monitoring: Use monitoring-background-jobs to monitor schema changes, backups, and automatic jobs that don't appear in SHOW CLUSTER STATEMENTS.
Prerequisites
Required SQL access:
- Connection to any CockroachDB node
- For cluster-wide visibility:
VIEWACTIVITYorVIEWACTIVITYREDACTEDprivilegeVIEWACTIVITYREDACTED: Redacts constants in other users' queries (recommended for privacy)VIEWACTIVITY: Shows full query text for all users- Without these: Only see your own sessions/queries
- Basic understanding of SQL query execution
- (Optional)
CANCELQUERY/CANCELSESSIONprivileges for cancellation operations
Check your privileges:
SHOW GRANTS ON ROLE <username>;
See permissions reference for detailed RBAC setup.
Core Diagnostic Approach
CockroachDB provides SQL-only interfaces for live activity triage:
| Interface | Purpose | Cluster-wide? |
|---|---|---|
SHOW CLUSTER STATEMENTS |
Currently executing queries | Yes (with VIEWACTIVITY) |
SHOW CLUSTER SESSIONS |
Active client sessions | Yes (with VIEWACTIVITY) |
crdb_internal.cluster_transactions |
In-progress transactions | Yes (with VIEWACTIVITY) |
Triage workflow:
- Identify long-running queries (> 5-10 minutes)
- Correlate to sessions and applications
- Check transaction retry counts (high retries = contention)
- Drill down by app/user/client
- (Optional) Cancel runaway work
Safety: All diagnostic queries are read-only. Cancellation is opt-in with explicit warnings.
Core Diagnostic Queries
Long-Running Queries
Identify queries running longer than a specified threshold:
-- Queries running longer than 5 minutes
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT
query_id,
node_id,
session_id,
user_name,
client_address,
application_name,
start,
now() - start AS running_for,
substring(query, 1, 200) AS query_preview,
distributed,
phase
FROM q
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;
Key columns:
running_for: How long the query has been executingquery_preview: First 200 characters (protects against massive queries)phase: execution phase (preparing, executing, etc.)distributed: whether query spans multiple nodes
Customizable thresholds:
- Change
INTERVAL '5 minutes'to'10 minutes','30 seconds', etc. - Adjust
LIMITbased on cluster size and expected load
Active Sessions
Find sessions with long-running active queries:
-- Sessions with active queries running > 5 minutes
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT
node_id,
session_id,
user_name,
client_address,
application_name,
status,
active_query_start,
now() - active_query_start AS active_query_for,
substring(active_queries, 1, 200) AS active_queries_preview,
substring(last_active_query, 1, 200) AS last_query_preview
FROM s
WHERE active_query_start IS NOT NULL
AND active_query_start < now() - INTERVAL '5 minutes'
ORDER BY active_query_start
LIMIT 50;
Key columns:
active_query_for: Duration of current active queryapplication_name: Source application for drill-downclient_address: Client IP/hostname for troubleshootingstatus: Session state (Idle, Active, etc.)
Active Transactions
Identify long-running transactions (potential blockers):
-- Transactions running > 5 minutes
SELECT
id AS txn_id,
node_id,
session_id,
application_name,
start,
now() - start AS running_for,
num_stmts,
num_retries,
num_auto_retries,
substring(txn_string, 1, 200) AS txn_string_preview
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;
Key columns:
num_retries/num_auto_retries: High retry counts indicate contentionnum_stmts: Number of statements in transaction (large = potentially problematic)txn_string: Transaction fingerprint
Production safety note: crdb_internal.cluster_transactions is production-approved and safe for triage.
Drill-Down by Application, User, or Client
Once you identify suspicious activity, drill down by filtering:
Filter by Application
-- All activity from specific application
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, user_name, start, now() - start AS running_for,
substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
ORDER BY start;
Filter by User
-- All activity from specific user
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, application_name, client_address,
active_query_start, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE user_name = 'app_user'
AND active_query_start IS NOT NULL
ORDER BY active_query_start;
Filter by Client Address
-- All sessions from specific client IP
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, user_name, application_name,
status, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE client_address LIKE '10.0.1.%'
ORDER BY active_query_start;
Combined Filters
-- Long queries from specific app and user
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, node_id, start, now() - start AS running_for,
substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
AND user_name = 'app_user'
AND start < now() - INTERVAL '10 minutes'
ORDER BY start;
Safety Considerations
Read-only operations:
All diagnostic queries (SHOW statements, crdb_internal.cluster_transactions) are read-only and safe to run in production.
Cancellation operations (opt-in):
CAUTION: Canceling queries/sessions terminates user work
Only proceed if:
- You've confirmed the query/session is runaway or stuck
- You have authorization to interrupt user workloads
- You've notified stakeholders if appropriate
- You have
CANCELQUERYorCANCELSESSIONprivileges
Canceling Runaway Work (Opt-In)
Cancel a Specific Query
-- 1. Identify the query_id from triage queries above
-- 2. Cancel it
CANCEL QUERY '<query_id>';
Example:
CANCEL QUERY '15f9e0e91f072f0f0000000000000001';
Cancel an Entire Session
-- 1. Identify the session_id from triage queries above
-- 2. Cancel all queries in that session
CANCEL SESSION '<session_id>';
Example:
CANCEL SESSION '15f9e0e91f072f0f';
Verification: After canceling, re-run the triage queries to confirm the query/session is gone.
Required privileges:
CANCELQUERYsystem privilege to cancel queriesCANCELSESSIONsystem privilege to cancel sessions- Admin role has both by default
See permissions reference for granting these privileges.
Common Triage Workflows
Workflow 1: "Cluster is slow" investigation
Scenario: Users report general slowness.
-
Check for long-running queries:
-- Run the "Long-Running Queries" diagnostic -- Look for queries running > 5-10 minutes -
Identify source applications:
-- Group by application to find culprits WITH q AS (SHOW CLUSTER STATEMENTS) SELECT application_name, COUNT(*) AS num_queries, AVG(now() - start) AS avg_duration FROM q WHERE start < now() - INTERVAL '5 minutes' GROUP BY application_name ORDER BY num_queries DESC; -
Drill down into specific app:
-- Filter by top application from step 2 -- Use "Filter by Application" query -
Decide on action:
- Contact app team to investigate query patterns
- Cancel specific runaway queries if critical
- Check for schema/index issues if queries are legitimate
Workflow 2: Find high-retry transactions
Scenario: Suspect contention issues.
-
Check for high retry counts:
SELECT application_name, AVG(num_retries) AS avg_retries, MAX(num_retries) AS max_retries, COUNT(*) AS num_txns FROM crdb_internal.cluster_transactions WHERE start < now() - INTERVAL '5 minutes' GROUP BY application_name HAVING AVG(num_retries) > 5 ORDER BY avg_retries DESC; -
Investigate specific transactions:
-- Find transactions with >10 retries SELECT id, application_name, num_retries, num_stmts, substring(txn_string, 1, 200) AS txn_preview FROM crdb_internal.cluster_transactions WHERE num_retries > 10 ORDER BY num_retries DESC; -
Next steps:
- Review transaction patterns for contention
- Check for lock conflicts or hotspots
- Consider schema changes to reduce contention
Workflow 3: Identify resource hogs by user
Scenario: Need to attribute load to specific users.
-
Count active queries per user:
WITH q AS (SHOW CLUSTER STATEMENTS) SELECT user_name, COUNT(*) AS num_active_queries, AVG(now() - start) AS avg_duration FROM q GROUP BY user_name ORDER BY num_active_queries DESC; -
Drill down to specific user's activity:
-- Use "Filter by User" query -
Take action:
- Contact user if unexpected load
- Review user's query patterns
- Cancel if clearly runaway
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
SHOW CLUSTER STATEMENTS returns empty |
No active queries, or insufficient privileges | Grant VIEWACTIVITY or VIEWACTIVITYREDACTED; verify cluster has active load |
Query text shows <hidden> |
Using VIEWACTIVITYREDACTED privilege |
This is expected for privacy; use VIEWACTIVITY if full text needed |
| Can't cancel query: "permission denied" | Missing CANCELQUERY privilege |
Grant CANCELQUERY system privilege to your user |
crdb_internal.cluster_transactions slow |
High transaction volume on cluster | Add filters (application_name, time threshold) to reduce result set |
| "relation does not exist" error | Typo in table name or old CockroachDB version | Verify you're using production-approved tables; check CockroachDB version compatibility |
| Triage queries themselves are slow | Cluster under extreme load | Use more aggressive filters (shorter time window, specific apps); consider canceling obvious runaway work first |
Key Considerations
- Privacy: Use
VIEWACTIVITYREDACTEDinstead ofVIEWACTIVITYto protect sensitive query constants in multi-tenant environments - Performance impact: Triage queries are read-only and lightweight, but avoid running them in tight loops during extreme load
- LIMIT clause: Always include
LIMITto prevent overwhelming output on large clusters - Time thresholds: Adjust
INTERVALbased on your workload (5 minutes is a reasonable default, but fast OLTP may need 30 seconds) - Cancellation is disruptive: Only cancel queries/sessions after confirming they're problematic; coordinate with application teams when possible
- Not for historical analysis: These queries show current state only; for trends over time, use DB Console or Prometheus metrics
- Production-approved sources: Only use
SHOW CLUSTER STATEMENTS,SHOW CLUSTER SESSIONS, andcrdb_internal.cluster_transactionsfor production triage
References
Skill references:
Related skills:
- profiling-statement-fingerprints - For historical performance pattern analysis and trend identification
- profiling-transaction-fingerprints - For historical transaction-level analysis including retry storms and commit latency
Official CockroachDB Documentation:
More from cockroachlabs/cockroachdb-skills
cockroachdb-sql
Use when writing, generating, or optimizing SQL for CockroachDB, designing CockroachDB schemas, or when the user asks about CockroachDB-specific SQL patterns, type mappings, and distributed database best practices. Also use when encountering CockroachDB anti-patterns like missing primary keys, sequential ID hotspots, or incorrect type usage.
31analyzing-range-distribution
Analyzes CockroachDB range distribution across tables and indexes using SHOW RANGES to identify range count, size patterns, leaseholder placement, and replication health. Use when investigating hotspots, uneven data distribution, range fragmentation, or validating zone configuration effects without DB Console access.
27managing-cluster-settings
Reviews, audits, and modifies CockroachDB cluster settings. Self-Hosted has full control over all settings and start flags. Advanced/BYOC can modify most SQL-level settings but infrastructure settings are managed by CRL. Standard has limited settings access — session variables are the primary tuning mechanism. Basic has minimal settings — use session variables and Cloud Console. Use when auditing configuration, tuning performance, or troubleshooting settings-related issues.
25hardening-user-privileges
Hardens CockroachDB user privileges by auditing and tightening role-based access control, reducing admin grants, restricting PUBLIC role permissions, and applying least-privilege principles. Use when reducing excessive privileges, cleaning up admin access, or implementing RBAC best practices.
25auditing-table-statistics
Audits optimizer table statistics for staleness, missing coverage, and data quality issues using SHOW STATISTICS. Use when diagnosing poor query performance, unexpected plan changes, or after bulk data changes to identify stale statistics requiring refresh via CREATE STATISTICS.
25monitoring-background-jobs
Monitors CockroachDB background job health by identifying failed, paused, and long-running jobs using SHOW JOBS and SHOW AUTOMATIC JOBS. Surfaces schema changes, backups/restores, automatic statistics collection, and SQL stats compaction jobs without DB Console access. Use when investigating schema change delays, failed backups, or automatic job issues.
24