Replication Guide
When to use this skill
Load when users ask about replication setup, lag, failover, or Keeper/ZooKeeper.
ReplicatedMergeTree Basics
- Engine: `ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/{table}', '{replica}')`
- Requires ZooKeeper or ClickHouse Keeper
- All replicas are equal — any replica can accept writes
- Replication is asynchronous by default
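
As a minimal sketch of the engine clause above, a replicated table could be declared as follows. The database `db`, the table `events`, and its columns are hypothetical names chosen for illustration, and the `{shard}` and `{replica}` macros are assumed to be defined in each server's configuration.

```sql
-- Hypothetical table: db.events and its columns are illustrative only.
-- Assumes database `db` exists and the {shard}/{replica} macros are configured.
CREATE TABLE db.events
(
    event_date Date,
    event_id   UInt64,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/{table}', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_id);
```

Running the same statement on every replica of a shard registers each one under the same Keeper path with its own `{replica}` name.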
Monitoring Replication
- `system.replicas`: per-table status (`absolute_delay`, `queue_size`, `is_leader`, `is_readonly`)
- `system.replication_queue`: pending operations (fetches, merges, mutations)
- Key health indicators:
  - `absolute_delay = 0`: fully caught up
  - `is_readonly = 0`: accepting writes
  - `queue_size < 10`: healthy queue
  - `active_replicas = total_replicas`: all replicas online
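
The health indicators above can be combined into a single check. The sketch below lists only tables that violate at least one of them, then shows the oldest pending queue entries. The thresholds are the ones stated in this guide, not official limits.

```sql
-- Tables that fail at least one of the health indicators listed above.
SELECT
    database,
    table,
    absolute_delay,
    queue_size,
    is_readonly,
    active_replicas,
    total_replicas
FROM system.replicas
WHERE is_readonly = 1
   OR absolute_delay > 0
   OR queue_size >= 10
   OR active_replicas < total_replicas
ORDER BY absolute_delay DESC;

-- Oldest pending operations, with retry counts and the last error (if any).
SELECT database, table, type, num_tries, last_exception, create_time
FROM system.replication_queue
ORDER BY create_time
LIMIT 20;
```

An empty result from the first query means every replicated table on this server meets all four indicators.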
Failover Procedures
- Check replica status: `SELECT * FROM system.replicas WHERE is_readonly = 1`
- Verify Keeper connectivity: `SELECT * FROM system.zookeeper WHERE path = '/'`
- If a replica is readonly because of a Keeper disconnect, it recovers automatically once the connection is restored
- For permanent failures, run on a healthy replica: `SYSTEM DROP REPLICA 'replica_name' FROM TABLE db.table` (this removes the dead replica's metadata from Keeper; it cannot drop the local replica)
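
The procedure above could be run as the following sequence. This is a sketch: `replica_name` and `db.table` are the placeholders used in this guide and must be replaced with real names, and step 3 should only be run once the failure is known to be permanent.

```sql
-- 1. Find read-only replicas and see whether the Keeper session has expired.
SELECT database, table, replica_name, is_readonly, is_session_expired
FROM system.replicas
WHERE is_readonly = 1;

-- 2. Verify Keeper connectivity: this fails if Keeper is unreachable.
SELECT name
FROM system.zookeeper
WHERE path = '/';

-- 3. Permanent failure only: from a healthy replica, remove the dead
--    replica's metadata ('replica_name' and db.table are placeholders).
SYSTEM DROP REPLICA 'replica_name' FROM TABLE db.table;
```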
Quorum Writes
- `SET insert_quorum = 2`: wait for N replicas to confirm (here N = 2)
- `SET insert_quorum_parallel = 1`: parallel quorum inserts (v21.8+)
- `SET insert_quorum_timeout = 60000`: timeout in ms
- Use for critical data that must survive node failures
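
A minimal sketch of a quorum insert using the settings above. The target table `db.events` and its columns are hypothetical, and the table is assumed to be a ReplicatedMergeTree table with at least two active replicas.

```sql
-- Session-level settings: require 2 replicas to confirm within 60 seconds.
SET insert_quorum = 2;
SET insert_quorum_timeout = 60000;

-- Hypothetical table and columns; the insert fails if the quorum
-- is not reached before the timeout.
INSERT INTO db.events (event_date, event_id, payload)
VALUES ('2024-01-01', 1, 'example');
```

Setting these per session rather than globally keeps the latency cost limited to the writes that need the durability guarantee.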
Keeper Management
- ClickHouse Keeper is the recommended replacement for ZooKeeper
- Monitor: `system.zookeeper` table for browsing the ZK tree
- Key paths: `/clickhouse/tables/` for table metadata
- Check Keeper health: `SELECT * FROM system.asynchronous_metrics WHERE metric LIKE '%Keeper%'`
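
To illustrate browsing the tree, the sketch below lists the children of the table metadata path and then shows where each local replicated table is registered. It assumes tables were created under the `/clickhouse/tables/` prefix used in this guide; the query fails if that node does not exist.

```sql
-- Children of the table metadata root (one node per shard in this layout).
SELECT name, numChildren, mtime
FROM system.zookeeper
WHERE path = '/clickhouse/tables';

-- Keeper paths of the replicated tables hosted on this server.
SELECT database, table, zookeeper_path, replica_path
FROM system.replicas;
```

`system.zookeeper` requires a `path` condition in the `WHERE` clause, so the tree has to be explored one level at a time.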
Common Issues
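- Split brain: replicas disagree about state, usually a Keeper issue; check Keeper quorum and logs, then restart Keeper if needed. Several replicas reporting `is_leader = 1` is normal on recent ClickHouse versions and is not by itself a sign of split brain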
- Readonly replica: Lost Keeper session — check network, Keeper logs
- Queue buildup: Slow fetches — check network bandwidth, disk I/O
- Diverged replicas: `SYSTEM SYNC REPLICA db.table` to force sync
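
For queue buildup and diverged replicas, a sketch like the following can show what the queue is waiting on before and after a forced sync. `db` and `table` are the placeholder names from this guide.

```sql
-- Summarize the pending queue by operation type for one table.
SELECT
    type,
    count()        AS entries,
    max(num_tries) AS max_tries
FROM system.replication_queue
WHERE database = 'db' AND table = 'table'
GROUP BY type
ORDER BY entries DESC;

-- Block until this replica has processed its replication queue.
SYSTEM SYNC REPLICA db.table;
```

A high `max_tries` on fetch entries usually points at the network or the source replica rather than at local disk.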