monitoring-background-jobs
Monitoring Background Jobs
Monitors background job health by identifying failed, paused, and long-running jobs that are distinct from user queries. Uses SQL-only interfaces (SHOW JOBS and SHOW AUTOMATIC JOBS) to surface schema changes, backups/restores, automatic statistics collection, and SQL stats compaction without requiring DB Console access.
When to Use This Skill
- Schema changes appear stuck or delayed (ALTER TABLE, CREATE INDEX, DROP operations)
- Backups or restores are failing or taking longer than expected
- Need to verify automatic statistics collection is running
- Investigating "waiting for MVCC GC" status in schema change cleanup
- Troubleshooting failed jobs without DB Console access
- Monitoring long-running operations that don't appear in query metrics
For live query monitoring: Use triaging-live-sql-activity to monitor currently executing user queries. Note that background jobs execute statements that may not appear in SHOW CLUSTER STATEMENTS.
For historical query analysis: Use profiling-statement-fingerprints for query pattern trends. Note that background jobs are excluded from statement statistics.
Prerequisites
Required SQL access:
- Connection to any CockroachDB node
- For cluster-wide job visibility:
VIEWJOBsystem privilege (read-only monitoring) - For job control operations:
CONTROLJOBrole option (pause/cancel/resume jobs) - Without these: Limited visibility into jobs you created
Check your privileges:
SHOW GRANTS ON ROLE <username>;
Look for:
VIEWJOBin theprivilege_typecolumn (system privilege)CONTROLJOBin role options (check withSHOW USERS)
See permissions reference for detailed RBAC setup.
Core Concepts
Jobs vs Statements
Key distinction:
- Statements: User-initiated SQL queries tracked by SHOW CLUSTER STATEMENTS and statement statistics
- Background jobs: Long-running operations tracked separately by SHOW JOBS
Background jobs are excluded from:
SHOW CLUSTER STATEMENTS(live query monitoring)crdb_internal.statement_statistics(historical query analysis)- Statement fingerprint metrics and DB Console Statements page
Common job types:
| Category | Job Types | Examples |
|---|---|---|
| User-initiated | SCHEMA CHANGE, BACKUP, RESTORE, IMPORT, CHANGEFEED | ALTER TABLE, CREATE INDEX, BACKUP DATABASE, RESTORE |
| Automatic | SCHEMA CHANGE GC, AUTO CREATE STATS, AUTO SQL STATS COMPACTION | Post-DROP cleanup, table statistics refresh, stats table maintenance |
See job types reference for complete catalog.
SHOW JOBS vs SHOW AUTOMATIC JOBS
| Interface | Scope | Time Window | Use Case |
|---|---|---|---|
SHOW JOBS |
User-initiated + automatic | Last 12 hours (default) | Monitor backups, schema changes, user operations |
SHOW AUTOMATIC JOBS |
Automatic only | Configurable (recommend 24h) | Monitor AUTO CREATE STATS, AUTO SQL STATS COMPACTION, SCHEMA CHANGE GC |
Time retention:
- Default retention: 14 days in
crdb_internal.jobstable SHOW JOBSdisplay window: 12 hours (configurable withSHOW JOBS SELECT * FROM [SHOW JOBS] WHERE ...)SHOW AUTOMATIC JOBSdisplay window: Configurable withWHERE created > now() - INTERVAL '...'
Job Status Values
| Status | Meaning | Action Required |
|---|---|---|
running |
Job is actively executing | Monitor progress via fraction_completed |
succeeded |
Job completed successfully | None |
failed |
Job encountered an error | Investigate error column, may need to retry |
paused |
Job manually paused | Resume with RESUME JOB if appropriate |
canceled |
Job was canceled (terminal state) | Retry operation if needed |
pending |
Job queued but not started | Monitor; may indicate resource constraints |
reverting |
Job failed and is rolling back changes | Wait for completion; check error after |
Running status sub-states:
performing backup: Backup job actively transferring datarestoring: Restore job actively applying datawaiting for MVCC GC: SCHEMA CHANGE GC waiting for garbage collection eligibility
See job states reference for detailed state transitions and "waiting for MVCC GC" explanation.
Core Diagnostic Queries
Query 1: Failed Jobs (Last 12 Hours)
Identify jobs that failed with error messages:
-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
finished,
now() - created AS total_duration,
error
FROM j
WHERE status = 'failed'
AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;
Key columns:
error: Failure reason (check for permission errors, disk space, network issues)description: Human-readable description of what the job was doingtotal_duration: How long the job ran before failing
Common failure patterns:
- Permission denied: User lacks required privileges
- Disk space: Backup destination full
- Network timeout: External storage unreachable
- Constraint violation: Restore conflicts with existing data
Query 2: Long-Running Jobs
Find jobs running longer than expected threshold:
-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
status,
running_status,
created,
now() - created AS running_for,
fraction_completed,
coordinator_id
FROM j
WHERE status = 'running'
AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;
Key columns:
running_for: Total elapsed time since job startedfraction_completed: Progress estimate (0.0 to 1.0, NULL if unavailable)running_status: Sub-state details (e.g., "waiting for MVCC GC")
Customizable thresholds:
- Schema changes: 30 minutes to several hours (depends on table size)
- Backups: 1-6+ hours (depends on data volume)
- Automatic jobs: Usually < 30 minutes
Query 3: Paused Jobs
Identify jobs that are paused and may need attention:
-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS paused_for,
coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;
Action required:
Resume with RESUME JOB <job_id> after verifying the pause reason.
Common reasons for paused jobs:
- Manual user pause for maintenance
- Resource constraints (cluster paused the job)
- Error requiring manual intervention
Query 4: Schema Changes Waiting for MVCC GC
Find SCHEMA CHANGE GC jobs waiting for garbage collection:
-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS waiting_for,
running_status
FROM j
WHERE status = 'running'
AND job_type = 'SCHEMA CHANGE GC'
AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;
Interpretation:
- Normal: SCHEMA CHANGE GC jobs wait for data to become garbage-collectable based on
gc.ttlsecondssetting (default 25 hours) - Expected duration: Up to
gc.ttlseconds+ some overhead - When to worry: Waiting > 2x
gc.ttlseconds(check setting withSHOW CLUSTER SETTING gc.ttlseconds)
Why this happens: After DROP TABLE/INDEX operations, CockroachDB must wait for all reads at older timestamps to complete before physically removing data. This prevents "time-travel" queries from failing.
See job states reference for detailed MVCC GC explanation.
Query 5: Automatic Job Health (24h Window)
Monitor automatic background jobs like statistics collection:
-- Automatic jobs in last 24 hours
SELECT
job_id,
job_type,
description,
status,
created,
finished,
COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;
Key job types:
AUTO CREATE STATS: Automatic table statistics refresh (critical for query optimizer)AUTO SQL STATS COMPACTION: Periodic cleanup of statement/transaction statistics tables
Health indicators:
- Healthy: Regular successful executions (every few hours)
- Unhealthy: No recent executions, or high failure rate
- Impact of failure: Stale statistics lead to poor query plans and slow queries
Query 6: Jobs by Type and Status
Aggregated view for pattern analysis:
-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
job_type,
status,
COUNT(*) AS job_count,
MIN(created) AS oldest,
MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;
Use case:
- Identify patterns (e.g., all BACKUP jobs failing, multiple schema changes stuck)
- Spot anomalies (e.g., unusual job type volume)
- Track job success rates by type
Query 7: Backup and Restore Progress
Track progress of backup/restore jobs:
-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS running_for,
ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
CASE
WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
((now() - created) / fraction_completed) - (now() - created)
ELSE NULL
END AS estimated_time_remaining,
running_status
FROM j
WHERE status = 'running'
AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;
Key columns:
percent_complete: Progress percentage (0-100)estimated_time_remaining: Rough estimate based on current progress raterunning_status: Detailed status (e.g., "performing backup to s3://...")
Note: fraction_completed may be NULL for some job types or early in execution.
Common Workflows
Workflow 1: Schema Change Stuck Investigation
Scenario: User reports ALTER TABLE or CREATE INDEX appears stuck.
-
Check for running schema changes:
WITH j AS (SHOW JOBS) SELECT job_id, description, created, now() - created AS running_for, fraction_completed, running_status FROM j WHERE status = 'running' AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE') ORDER BY created; -
Identify MVCC GC waits:
-- Use Query 4 to find "waiting for MVCC GC" jobs -
Interpret results:
- If
running_status= "waiting for MVCC GC": Normal for post-DROP cleanup (wait up togc.ttlseconds) - If long-running with low
fraction_completed: Check for contention, large table size, or resource constraints - If failed: Check
errorcolumn for specific failure reason
- If
-
Next steps:
- MVCC GC wait: Verify
SHOW CLUSTER SETTING gc.ttlsecondsand wait - Resource constraints: Check cluster CPU/memory usage
- Failed job: Address error (permissions, constraints) and retry operation
- MVCC GC wait: Verify
Workflow 2: Failed Backup Triage
Scenario: Scheduled backup job failed.
-
Find recent failed backups:
-- Use Query 1 filtered for BACKUP job type WITH j AS (SHOW JOBS) SELECT job_id, description, created, finished, error FROM j WHERE status = 'failed' AND job_type = 'BACKUP' AND created > now() - INTERVAL '24 hours' ORDER BY created DESC; -
Analyze error messages:
- "permission denied": Check external storage credentials
- "timeout": Network connectivity to backup destination
- "no space left": Destination storage full
- "connection refused": External storage endpoint unreachable
-
Verify backup destination:
-- Check SHOW BACKUP for successful backups to same destination SHOW BACKUP 's3://bucket/path'; -
Remediate and retry:
- Fix underlying issue (credentials, storage, network)
- Re-run backup command
- Monitor with Query 7 for progress
Workflow 3: Automatic Job Health Check
Scenario: Proactive monitoring of automatic background jobs.
-
Check AUTO CREATE STATS frequency:
-- Use Query 5 to see recent automatic statistics jobs SELECT job_type, status, COUNT(*) AS job_count, MAX(created) AS most_recent FROM [SHOW AUTOMATIC JOBS] WHERE created > now() - INTERVAL '24 hours' AND job_type = 'AUTO CREATE STATS' GROUP BY job_type, status; -
Expected pattern:
- Multiple successful AUTO CREATE STATS jobs per day (depends on table update frequency)
- Regular AUTO SQL STATS COMPACTION (typically once per hour)
-
Warning signs:
- No AUTO CREATE STATS in last 24h: Statistics collection may be disabled
- High failure rate: Check cluster resource constraints or permission issues
- No AUTO SQL STATS COMPACTION: Stats table may grow unbounded
-
Verify settings:
SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled; -- Should be true SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;
Workflow 4: Long-Running Job Monitoring
Scenario: Track progress of expected long-running operations.
-
Identify long-running jobs:
-- Use Query 2 with custom threshold WITH j AS (SHOW JOBS) SELECT job_id, job_type, description, now() - created AS running_for, fraction_completed FROM j WHERE status = 'running' AND created < now() - INTERVAL '30 minutes' ORDER BY created; -
Monitor progress over time:
-- Re-run every 10-15 minutes, track fraction_completed changes -- Example: 0.25 → 0.40 → 0.55 indicates steady progress -
Estimate completion:
-- Use Query 7 for backup/restore jobs with time estimates -
Decide on action:
- Steady progress: Continue monitoring
- Stalled progress (fraction_completed not increasing): Investigate with triaging-live-sql-activity
- Failed: Use Query 1 to check error
Safety Considerations
Read-only operations (all diagnostic queries):
All SHOW JOBS and SHOW AUTOMATIC JOBS queries are read-only and safe to run in production. No performance impact on cluster operations.
Job control operations (opt-in):
CAUTION: Pausing or canceling jobs can have data integrity implications
Only proceed with job control if:
- You have
CONTROLJOBrole option - You understand the implications (e.g., canceling a schema change mid-execution may require manual cleanup)
- You have authorization to interrupt cluster operations
- You've verified the job is truly problematic (not just slow)
Job control commands:
-- Pause a running job (can be resumed later)
PAUSE JOB <job_id>;
-- Resume a paused job
RESUME JOB <job_id>;
-- Cancel a job (terminal - cannot be resumed)
CANCEL JOB <job_id>;
Risks by job type:
- SCHEMA CHANGE: Canceling may leave schema in inconsistent state; prefer PAUSE and investigation
- BACKUP: Canceling is safe (can retry); pausing is better for temporary issues
- RESTORE: Canceling may leave database partially restored; requires cleanup
- AUTO CREATE STATS: Canceling is safe (will retry later automatically)
Best practice: Focus on monitoring and diagnosis; only use control operations when explicitly required and authorized.
See permissions reference for CONTROLJOB role option setup.
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
SHOW JOBS returns empty |
No jobs in last 12h, or insufficient privileges | Grant VIEWJOB privilege; verify cluster has recent job activity |
| "waiting for MVCC GC" for many hours | Normal behavior for SCHEMA CHANGE GC after DROP operations | Wait up to gc.ttlseconds (default 25h); check SHOW CLUSTER SETTING gc.ttlseconds |
| Can't pause/resume job: "permission denied" | Missing CONTROLJOB role option |
Use ALTER ROLE <username> WITH CONTROLJOB (not GRANT SYSTEM) |
Job stuck at same fraction_completed |
Job may be processing large batch, or actually stuck | Wait 15-30 min and re-check; if no change, investigate with live query triage |
| No AUTO CREATE STATS jobs | Automatic collection disabled | Check sql.stats.automatic_collection.enabled = true |
SHOW AUTOMATIC JOBS shows old jobs only |
Need to filter by time window | Add WHERE created > now() - INTERVAL '24 hours' |
| Failed job with "schema change GC" error | Expected for post-DROP cleanup failures | Usually safe to ignore; job will retry automatically |
| Job error: "concurrent schema change" | Multiple schema changes on same table | Wait for first schema change to complete, then retry |
Key Considerations
- Jobs vs queries: Background jobs execute statements that don't appear in SHOW STATEMENTS or statement statistics
- Time windows: SHOW JOBS default 12h retention; use
crdb_internal.jobsfor up to 14 days - MVCC GC waiting: Normal and expected for post-DROP cleanup; duration tied to
gc.ttlseconds - LIMIT clauses: Always include for production safety (prevents overwhelming output)
- Privilege model: VIEWJOB (system privilege) for read-only; CONTROLJOB (role option) for control
- Automatic job health: Regular AUTO CREATE STATS is critical for query optimizer performance
- Progress estimates:
fraction_completedmay be NULL or sparse for some job types - Job control risks: PAUSE is safer than CANCEL; some cancellations require manual cleanup
References
Skill references:
Related skills:
- triaging-live-sql-activity - For live query monitoring (job-executed statements may not appear)
- profiling-statement-fingerprints - For historical query analysis (background jobs excluded)
Official CockroachDB Documentation:
More from cockroachlabs/cockroachdb-skills
cockroachdb-sql
Use when writing, generating, or optimizing SQL for CockroachDB, designing CockroachDB schemas, or when the user asks about CockroachDB-specific SQL patterns, type mappings, and distributed database best practices. Also use when encountering CockroachDB anti-patterns like missing primary keys, sequential ID hotspots, or incorrect type usage.
31analyzing-range-distribution
Analyzes CockroachDB range distribution across tables and indexes using SHOW RANGES to identify range count, size patterns, leaseholder placement, and replication health. Use when investigating hotspots, uneven data distribution, range fragmentation, or validating zone configuration effects without DB Console access.
27managing-cluster-settings
Reviews, audits, and modifies CockroachDB cluster settings. Self-Hosted has full control over all settings and start flags. Advanced/BYOC can modify most SQL-level settings but infrastructure settings are managed by CRL. Standard has limited settings access — session variables are the primary tuning mechanism. Basic has minimal settings — use session variables and Cloud Console. Use when auditing configuration, tuning performance, or troubleshooting settings-related issues.
25hardening-user-privileges
Hardens CockroachDB user privileges by auditing and tightening role-based access control, reducing admin grants, restricting PUBLIC role permissions, and applying least-privilege principles. Use when reducing excessive privileges, cleaning up admin access, or implementing RBAC best practices.
25auditing-table-statistics
Audits optimizer table statistics for staleness, missing coverage, and data quality issues using SHOW STATISTICS. Use when diagnosing poor query performance, unexpected plan changes, or after bulk data changes to identify stale statistics requiring refresh via CREATE STATISTICS.
25analyzing-schema-change-storage-risk
Estimates storage requirements for CockroachDB online schema change backfills using SHOW RANGES WITH DETAILS, KEYS, INDEXES. Use before CREATE INDEX, ADD COLUMN with INDEX/UNIQUE, ALTER PRIMARY KEY, CREATE MATERIALIZED VIEW, CREATE TABLE AS, REFRESH, or SET LOCALITY on tables with large per-index footprints, to avoid mid-backfill disk exhaustion.
24