altinity-expert-clickhouse-kafka


Diagnostics

Run all queries from the file checks.sql and analyze the results.


Interpreting Results

Consumer Health

Check whether a consumer is stuck by comparing its last exception time with its last poll and commit times:

  • last_exception_time >= last_poll_time OR last_exception_time >= last_commit_time → consumer stuck on error, not progressing
  • Otherwise → consumer healthy

The exceptions column is a tuple of arrays with matching indices — exceptions.time[-1] and exceptions.text[-1] give the most recent error.
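
The stuck-consumer rule above can be sketched as a small predicate. This is only an illustration: `ConsumerRow` and `is_stuck` are hypothetical names, the fields mirror the columns mentioned above, and treating a missing exception time as "healthy" is an assumption rather than something checks.sql is known to do.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ConsumerRow:
    """Hypothetical container for one consumer's timestamps."""
    last_exception_time: Optional[datetime]  # None if no exception was recorded (assumption)
    last_poll_time: datetime
    last_commit_time: datetime


def is_stuck(row: ConsumerRow) -> bool:
    """Return True when the latest exception is at least as recent as the last poll or commit."""
    if row.last_exception_time is None:
        return False
    return (
        row.last_exception_time >= row.last_poll_time
        or row.last_exception_time >= row.last_commit_time
    )


# Example: the exception happened after the last poll, so the consumer is flagged.
now = datetime(2026, 1, 1, 12, 0, 0)
stuck_row = ConsumerRow(now, now - timedelta(seconds=5), now + timedelta(seconds=5))
healthy_row = ConsumerRow(now - timedelta(minutes=1), now, now)
print(is_stuck(stuck_row), is_stuck(healthy_row))
```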

Thread Pool Capacity

  • kafka_consumers > mb_pool_size → thread starvation — consumers waiting for available threads
  • Fix: increase background_message_broker_schedule_pool_size (default: 16)
  • Sizing: total number of Kafka, RabbitMQ, and NATS consumers, plus a 25% buffer
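
As a rough worked example of the sizing rule, the arithmetic can be written out as below. The function names are hypothetical, and rounding the result up to the next whole thread is an assumption; the 25% buffer and the default pool size of 16 come from the text above.

```python
import math


def recommended_pool_size(kafka_consumers: int,
                          other_broker_consumers: int = 0,
                          buffer: float = 0.25) -> int:
    """Total message-broker consumers plus a buffer, rounded up (rounding is an assumption)."""
    total = kafka_consumers + other_broker_consumers
    return math.ceil(total * (1 + buffer))


def is_starved(kafka_consumers: int, pool_size: int = 16) -> bool:
    """Mirror of the check above: more Kafka consumers than pool threads."""
    return kafka_consumers > pool_size


# 20 Kafka consumers plus 4 RabbitMQ/NATS consumers against the default pool of 16.
print(is_starved(20))                # True: 20 > 16
print(recommended_pool_size(20, 4))  # 30: (20 + 4) * 1.25
```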

Slow Materialized Views (Poll Interval Risk)

  • MV avg duration > 30s → consumer may exceed max.poll.interval.ms and get kicked from the group
  • MV executions with error status → likely consumer rebalances (consumer kicked, MV interrupted mid-batch)
  • Most common root cause for slow MVs: multiple JSONExtract calls re-parsing the same JSON blob
  • Fix: rewrite to one-pass JSONExtract(json, 'Tuple(...)') AS parsed + tupleElement() — see troubleshooting.md

Pool Utilization Trends (12h)

  • Sustained high values near pool size → capacity pressure
  • Spikes correlating with lag → temporary overload
  • Flat zero → Kafka consumers may not be active
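
One way to read a 12-hour series of pool utilization samples is sketched below. The function name, the 90% "near pool size" cutoff, and the 80% "sustained" share are all illustrative assumptions, not thresholds taken from checks.sql. The sketch also does not correlate spikes with lag; that comparison still has to be done against the lag data.

```python
from typing import Sequence


def classify_pool_trend(samples: Sequence[int],
                        pool_size: int,
                        near_ratio: float = 0.9,       # assumed cutoff for "near pool size"
                        sustained_share: float = 0.8   # assumed share of samples for "sustained"
                        ) -> str:
    """Label a utilization series using the three patterns listed above."""
    if not samples:
        raise ValueError("samples must not be empty")
    if all(s == 0 for s in samples):
        return "inactive"            # flat zero: consumers may not be active
    high = sum(1 for s in samples if s >= near_ratio * pool_size)
    if high / len(samples) >= sustained_share:
        return "capacity_pressure"   # sustained high values near pool size
    if high > 0:
        return "spikes"              # occasional peaks: compare with lag separately
    return "normal"


print(classify_pool_trend([15, 16, 15, 16, 15], pool_size=16))
print(classify_pool_trend([2, 3, 16, 2, 3], pool_size=16))
```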

Advanced Diagnostics

For deeper investigation, run queries from advanced_checks.sql:

  • Consumer exception drill-down — filter to a specific problematic Kafka table
  • Consumption speed measurement — snapshot-based rate calculation
  • Topic lag via rdkafka_stat — total lag per table and per-partition breakdown
  • Broker connection health — connection state, errors, disconnects

Important: rdkafka_stat is not enabled by default in ClickHouse. Collecting it requires setting the librdkafka option <statistics_interval_ms> in the Kafka configuration. See advanced_checks.sql for setup instructions.


Common Issues

For troubleshooting common errors and configuration guidance, see troubleshooting.md:

  • Topic authorization / ACL errors
  • Poll interval exceeded (slow MV / JSON parsing optimization)
  • Thread pool starvation
  • Parsing errors / dead letter queue
  • Data loss with multiple materialized views
  • Offset rewind / replay
  • Parallel consumption tuning

Cross-Module Triggers

| Finding | Load Module | Reason |
|---|---|---|
| Slow MV inserts | altinity-expert-clickhouse-ingestion | Insert pipeline analysis |
| High merge memory | altinity-expert-clickhouse-merges | Merge patterns |
| Query-level issues | altinity-expert-clickhouse-reporting | Query optimization |
| Schema concerns | altinity-expert-clickhouse-schema | Table design |

Settings Reference

| Setting | Scope | Notes |
|---|---|---|
| background_message_broker_schedule_pool_size | Server | Thread pool for Kafka/RabbitMQ/NATS consumers (default: 16) |
| kafka_num_consumers | Table | Parallel consumers per table (limited by cores) |
| kafka_thread_per_consumer | Table | Required for parallel inserts (= 1) |
| kafka_handle_error_mode | Table | stream (21.6+) or dead_letter (25.8+) |
| max_poll_interval_ms | librdkafka | Max time between polls before consumer is kicked (default: 300s) |
| statistics_interval_ms | librdkafka | Enable rdkafka_stat collection (disabled by default) |

Repository: altinity/skills
First seen: Feb 9, 2026