troubleshooting
Installation
SKILL.md
Troubleshooting Guide
When to use this skill
Load when users report errors, performance issues, or system problems.
OOM (Out of Memory)
Diagnosis
- Check
system.query_logfor queries with highmemory_usage - Look at
system.processesfor currently running memory-heavy queries system.metricsWHERE metric = 'MemoryTracking' for current usage
Solutions
- Set
max_memory_usageper query (e.g., 10GB) - Set
max_memory_usage_for_userto limit per-user consumption - Use
max_bytes_before_external_group_byfor spill-to-disk - Use
max_bytes_before_external_sortfor large sorts - Reduce JOIN sizes with pre-filtering
- Use SAMPLE for approximate aggregations
Slow Merges
Diagnosis
system.merges— checkelapsed,progress,total_size_bytes_compressedsystem.part_logWHERE event_type = 'MergeParts' for historical throughput- Too many parts: check
max_parts_count_for_partitioninsystem.asynchronous_metrics
Solutions
- Increase
background_pool_size(default: 16) - Check disk I/O with
system.asynchronous_metrics(ReadBufferFromFileDescriptorReadBytes) - Reduce insert frequency (batch larger)
- Partition strategy: ensure not too many partitions
OPTIMIZE TABLE ... FINALto force merge (expensive, use off-peak)
Replication Lag
Diagnosis
system.replicas—absolute_delay,queue_size,is_readonlysystem.replication_queue— pending tasks,num_tries,last_exception- Network: check inter-server connectivity
Solutions
- Check ZooKeeper/Keeper health and latency
- Increase
background_fetches_pool_size - Check for stuck entries in replication_queue
- If readonly: check ZooKeeper session, disk space
SYSTEM RESTART REPLICAas last resort
Disk Full
Diagnosis
system.disks—free_space,total_spacesystem.parts— find largest tables/partitions- Check for detached parts:
system.detached_parts
Solutions
- DROP old partitions:
ALTER TABLE t DROP PARTITION 'YYYYMM' - Set up TTL for automatic cleanup
- Move data to cold storage with tiered storage
- Clean detached parts:
ALTER TABLE t DROP DETACHED PART 'name' - Reduce replication factor if desperate
Stuck Mutations
Diagnosis
system.mutationsWHERE is_done = 0 — checkparts_to_do,latest_fail_reason- Mutations block new merges on affected parts
Solutions
KILL MUTATION WHERE mutation_id = '...'to cancel- Fix the underlying issue (schema mismatch, disk space)
- Re-submit the mutation after fixing
- Consider using INSERT + ReplacingMergeTree instead of UPDATE mutations
Query Failures
Diagnosis
system.query_logWHERE type = 'ExceptionWhileProcessing'- Check
exception_codeandexceptioncolumns - Common codes: 60 (table not found), 47 (unknown column), 241 (memory limit)
Solutions
- Code 60: verify table exists, check database name
- Code 47: use
get_table_schemato check column names - Code 241: reduce query scope, add LIMIT, use SAMPLE
- Code 159: timeout — add time filters, use LIMIT
- Code 252: too many parts — wait for merges or optimize
Related skills
More from duyet/clickhouse-monitoring
replication-guide
ReplicatedMergeTree operations, failover procedures, lag diagnosis, quorum writes, and Keeper management.
2query-optimization
Query optimization strategies: PREWHERE, JOIN patterns, materialized views, EXPLAIN analysis, index usage, and query profiling.
2storage-optimization
Compression codecs, TTL policies, tiered storage, part management, and disk space optimization.
2security-hardening
RBAC configuration, row policies, quotas, network security, audit logging, and access control best practices.
2migration-patterns
Schema migrations, ALTER patterns, engine changes, data backfill, and zero-downtime migration strategies.
2cluster-operations
Distributed table management, resharding, node addition/removal, and cluster topology operations.
2