skills/altinity/skills/altinity-expert-clickhouse-merges

altinity-expert-clickhouse-merges

SKILL.md

Diagnostics

Run all queries from checks.sql (cluster-wide) one by one and produce a decision-ready report.

Triage Order (Mandatory)

  1. Check current active merges from system.merges.
  2. Check merge success/failure trend from system.part_log.
  3. Check table-level status (merge_ok, merge_failed, last success/failure).
  4. Check merge reason + algorithm (merge_reason, merge_algorithm).
  5. Check merge RAM now + historical peak RAM (system.merges.memory_usage, system.part_log.peak_memory_usage).
  6. Check part-count offenders with split:
    • database = 'system'
    • database != 'system'
  7. Check relevant settings, including TTL merge concurrency.

Decision Rules

Use one of these final verdicts explicitly:

  • PROVED: cluster-wide merge stop (no successful merges in the selected window).
  • DECLINED: cluster-wide stop is false.
  • PARTIAL: merges are blocked for specific table(s) while others still merge.

Additional rules:

  • If some tables still have successful merges in same timeframe, do not report global merge stop.
  • If a target table has merge_ok = 0 with repeated MEMORY_LIMIT_EXCEEDED, report table-level block.
  • If merges are 100% Horizontal, state that planner selected horizontal merges (do not say vertical is disabled unless settings prove it).
  • If max part count is driven by system.* tables, call out alert-source mismatch to avoid misattribution to business tables.

Problem-Specific Investigation

"Too Many Parts" Investigation

For a specific table, run ad-hoc checks (time-bound and limited):

select
    toStartOfMinute(event_time) as minute,
    countIf(event_type = 'NewPart') as new_parts,
    countIf(event_type = 'MergeParts') as merges,
    countIf(event_type = 'MergeParts') - countIf(event_type = 'NewPart') as net_reduction
from system.part_log
where database = '{database}'
  and table = '{table}'
  and event_time > now() - interval 1 hour
group by minute
order by minute desc
limit 60

If net_reduction is negative consistently, inserts outpace merges.

TTL merge pressure and merge-size settings snapshot

Check system.merge_tree_settings if modified Suggest changing (reducing or increasing) in case of a problem as remediation.

  • max_parts_to_merge_at_once
  • max_bytes_to_merge_at_max_space_in_pool
  • max_bytes_to_merge_at_min_space_in_pool
  • enable_vertical_merge_algorithm
  • vertical_merge_algorithm_min_rows_to_activate
  • vertical_merge_algorithm_min_columns_to_activate
  • max_number_of_merges_with_ttl_in_pool
  • max_replicated_merges_with_ttl_in_queue
  • parts_to_delay_insert
  • parts_to_throw_insert

Structural Fix Guidance (When Settings Are Not Enough)

Call out anti-patterns explicitly:

  • Single hot partition (partition_id='all')
  • Heavy TTL ... GROUP BY ... SET ... on a hot ingestion table
  • Persistent large horizontal merge attempts with OOM failures

Recommended long-term direction:

  • Add time-based partitioning
  • Move heavy rollup logic from TTL path to MV/batch table
  • Keep base-table TTL simple (delete-oriented)

Ad-Hoc Query Guidelines

Required Safeguards

-- Always include LIMIT
limit 100

-- Always time-bound historical queries
where event_time >= now() - interval 24 hour

-- For part_log, always filter event_type
where event_type in ('NewPart', 'MergeParts', 'MutatePart')

Avoid

  • select * from system.part_log
  • Unbounded scans on *_log tables
  • Large joins in-context (aggregate in SQL)

Cross-Module Triggers

Finding Load Module Reason
High memory during merges / OOM altinity-expert-clickhouse-memory Memory limits and pressure
Slow merges + normal disk altinity-expert-clickhouse-schema ORDER BY/partitioning anti-patterns
Slow merges + high disk IO altinity-expert-clickhouse-storage Storage bottleneck
Merges blocked by mutations altinity-expert-clickhouse-mutations Mutation backlog
Replication lag + merge issues altinity-expert-clickhouse-replication Queue/replica bottlenecks

Final Report Sections (Mandatory)

  1. Environment header
  2. Global vs table-specific merge status
  3. Current and peak merge RAM
  4. Merge reason and merge algorithm findings
  5. Max-part offenders (system vs non-system)
  6. Current settings and recommended deltas
  7. Immediate mitigation and structural remediation
Weekly Installs
43
Repository
altinity/skills
GitHub Stars
5
First Seen
Jan 24, 2026
Installed on
codex41
claude-code37
opencode20
gemini-cli20
github-copilot19
kimi-cli16