altinity-expert-clickhouse-kafka

Diagnostics

Run all queries from the file checks.sql and analyze the results.


Interpreting Results

Consumer Health

Check if consumers are stuck by comparing exception time vs activity times:

  • last_exception_time >= last_poll_time OR last_exception_time >= last_commit_time → consumer stuck on error, not progressing
  • Otherwise → consumer healthy

The exceptions column is a tuple of parallel arrays: exceptions.time[-1] and exceptions.text[-1] give the time and text of the most recent error.
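
The comparison above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of checks.sql: the function name consumer_is_stuck is hypothetical, and the three timestamps are assumed to be the values returned by the consumer health query.

```python
from datetime import datetime

def consumer_is_stuck(last_exception_time: datetime,
                      last_poll_time: datetime,
                      last_commit_time: datetime) -> bool:
    """Return True when the latest exception is at least as recent as the
    latest poll or the latest commit, i.e. the consumer is not progressing."""
    return (last_exception_time >= last_poll_time
            or last_exception_time >= last_commit_time)

# Hypothetical sample values, for illustration only.
poll = datetime(2026, 1, 1, 12, 0, 0)
commit = datetime(2026, 1, 1, 12, 0, 5)

print(consumer_is_stuck(datetime(2026, 1, 1, 12, 0, 10), poll, commit))  # True
print(consumer_is_stuck(datetime(2026, 1, 1, 11, 0, 0), poll, commit))   # False
```

Note that a consumer with no recorded activity at all may report identical default timestamps and would be flagged by this check, so that case is probably worth inspecting separately.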

Thread Pool Capacity

  • kafka_consumers > mb_pool_size → thread starvation — consumers waiting for available threads
  • Fix: increase background_message_broker_schedule_pool_size (default: 16)
  • Sizing: total number of Kafka, RabbitMQ, and NATS consumers across all tables, plus a 25% buffer
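
As a worked example of the sizing rule, the arithmetic might look like the sketch below. The function name recommended_pool_size and the consumer counts are hypothetical, and rounding up to the next whole thread is an assumption, not something the rule itself specifies.

```python
import math

DEFAULT_POOL_SIZE = 16  # default of background_message_broker_schedule_pool_size

def recommended_pool_size(kafka: int, rabbitmq: int = 0, nats: int = 0,
                          buffer: float = 0.25) -> int:
    """Sum all message-broker consumers and add a fractional buffer,
    rounding up (assumption) so the pool is never undersized."""
    total = kafka + rabbitmq + nats
    return math.ceil(total * (1 + buffer))

# Hypothetical server: 20 Kafka consumers and 4 RabbitMQ consumers.
needed = recommended_pool_size(kafka=20, rabbitmq=4)
print(needed)                      # 30
print(needed > DEFAULT_POOL_SIZE)  # True: the default pool would be too small
```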

Slow Materialized Views (Poll Interval Risk)

  • MV avg duration > 30s → consumer may exceed max.poll.interval.ms and get kicked from the group
  • MV executions with error status → likely consumer rebalances (consumer kicked, MV interrupted mid-batch)
  • Most common root cause for slow MVs: multiple JSONExtract calls re-parsing the same JSON blob
  • Fix: parse the JSON once with JSONExtract(json, 'Tuple(...)') AS parsed, then read each field with tupleElement(); see troubleshooting.md
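
The two materialized view signals above could be applied to query results roughly as follows. This is a sketch under assumptions: the function name mv_findings, its parameters, and the finding labels are hypothetical, and the 30 s threshold is the one quoted in this section.

```python
def mv_findings(avg_duration_s: float, error_count: int,
                slow_threshold_s: float = 30.0) -> list[str]:
    """Return the risk signals that apply to one materialized view."""
    findings = []
    if avg_duration_s > slow_threshold_s:
        # A slow MV delays the next poll and may exceed max.poll.interval.ms.
        findings.append("slow: poll interval risk")
    if error_count > 0:
        # Failed executions often accompany consumer rebalances.
        findings.append("errors: possible rebalances")
    return findings

# Hypothetical measurements, for illustration only.
print(mv_findings(avg_duration_s=45.0, error_count=3))
print(mv_findings(avg_duration_s=2.5, error_count=0))  # []
```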

Pool Utilization Trends (12h)

  • Sustained high values near pool size → capacity pressure
  • Spikes correlating with lag → temporary overload
  • Flat zero → Kafka consumers may not be active
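
One way to turn these three patterns into a rough classification is sketched below. The function name classify_pool_usage is hypothetical, and the cut-offs are assumptions chosen for illustration: "near pool size" is taken as at least 80% of the pool, and "sustained" as at least half of the samples.

```python
def classify_pool_usage(samples: list[int], pool_size: int,
                        high_ratio: float = 0.8,
                        sustained_share: float = 0.5) -> str:
    """Classify a series of busy-thread counts sampled over the window."""
    if not samples or max(samples) == 0:
        return "inactive"            # flat zero: consumers may not be running
    high = [s >= high_ratio * pool_size for s in samples]
    if sum(high) / len(samples) >= sustained_share:
        return "capacity pressure"   # sustained values near pool size
    if any(high):
        return "temporary overload"  # isolated spikes; compare with lag
    return "normal"

# Hypothetical samples against a pool of 16 threads.
print(classify_pool_usage([15, 16, 15, 16], 16))  # capacity pressure
print(classify_pool_usage([2, 3, 16, 2], 16))     # temporary overload
print(classify_pool_usage([0, 0, 0], 16))         # inactive
```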

Advanced Diagnostics

For deeper investigation, run queries from advanced_checks.sql:

  • Consumer exception drill-down — filter to a specific problematic Kafka table
  • Consumption speed measurement — snapshot-based rate calculation
  • Topic lag via rdkafka_stat — total lag per table and per-partition breakdown
  • Broker connection health — connection state, errors, disconnects

Important: rdkafka_stat is not enabled by default in ClickHouse. It requires setting <statistics_interval_ms>, which is a librdkafka option rather than a table-level Kafka engine setting. See advanced_checks.sql for setup instructions.
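
Once statistics are being collected, the rdkafka_stat value is a JSON document in the librdkafka statistics format. The sketch below shows how total lag, per-partition lag, and broker state might be extracted from it. The function name summarize_rdkafka_stat and the sample payload are hypothetical; the field names (topics, partitions, consumer_lag, brokers, state) are taken from the librdkafka statistics format. The queries in advanced_checks.sql remain the reference.

```python
import json

def summarize_rdkafka_stat(stat_json: str) -> dict:
    """Extract total lag, per-partition lag, and broker states."""
    if not stat_json:
        # Statistics not enabled or not yet emitted.
        return {"total_lag": 0, "lag_by_partition": {}, "broker_states": {}}
    stat = json.loads(stat_json)
    lag_by_partition = {}
    for topic, tdata in stat.get("topics", {}).items():
        for pid, pdata in tdata.get("partitions", {}).items():
            lag = pdata.get("consumer_lag", -1)
            # Partition "-1" is librdkafka's internal unassigned partition,
            # and a lag of -1 means the lag is unknown; skip both.
            if pid == "-1" or lag < 0:
                continue
            lag_by_partition[(topic, int(pid))] = lag
    broker_states = {name: b.get("state")
                     for name, b in stat.get("brokers", {}).items()}
    return {"total_lag": sum(lag_by_partition.values()),
            "lag_by_partition": lag_by_partition,
            "broker_states": broker_states}

# Hypothetical, heavily trimmed payload for illustration only.
sample = json.dumps({
    "topics": {"events": {"partitions": {
        "0": {"consumer_lag": 120},
        "1": {"consumer_lag": 30},
        "-1": {"consumer_lag": -1}}}},
    "brokers": {"broker1:9092/1": {"state": "UP", "disconnects": 0}},
})
summary = summarize_rdkafka_stat(sample)
print(summary["total_lag"])      # 150
print(summary["broker_states"])  # {'broker1:9092/1': 'UP'}
```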


Common Issues

For troubleshooting common errors and configuration guidance, see troubleshooting.md:

  • Topic authorization / ACL errors
  • Poll interval exceeded (slow MV / JSON parsing optimization)
  • Thread pool starvation
  • Parsing errors / dead letter queue
  • Data loss with multiple materialized views
  • Offset rewind / replay
  • Parallel consumption tuning

Cross-Module Triggers

| Finding | Load Module | Reason |
|---|---|---|
| Slow MV inserts | altinity-expert-clickhouse-ingestion | Insert pipeline analysis |
| High merge memory | altinity-expert-clickhouse-merges | Merge patterns |
| Query-level issues | altinity-expert-clickhouse-reporting | Query optimization |
| Schema concerns | altinity-expert-clickhouse-schema | Table design |

Settings Reference

| Setting | Scope | Notes |
|---|---|---|
| background_message_broker_schedule_pool_size | Server | Thread pool for Kafka/RabbitMQ/NATS consumers (default: 16) |
| kafka_num_consumers | Table | Parallel consumers per table (limited by cores) |
| kafka_thread_per_consumer | Table | Required for parallel inserts (= 1) |
| kafka_handle_error_mode | Table | stream (21.6+) or dead_letter_queue (25.8+) |
| max_poll_interval_ms | librdkafka | Max time between polls before consumer is kicked (default: 300s) |
| statistics_interval_ms | librdkafka | Enable rdkafka_stat collection (disabled by default) |

Repository: altinity/skills
First seen: Feb 9, 2026