scylladb-data-modeling

Installation
SKILL.md

ScyllaDB Data Modeling

CQL data modeling patterns and anti-patterns for ScyllaDB. Bad schema is the root cause of most ScyllaDB performance issues — no amount of cluster scaling can fix a fundamentally wrong data model.

When to Apply

Reference these guidelines when:

  • Designing a new ScyllaDB schema from scratch
  • Migrating from SQL/relational databases, MongoDB, or Cassandra to ScyllaDB
  • Reviewing existing table designs for performance issues
  • Troubleshooting slow queries, timeouts, or hot nodes
  • Deciding how to structure primary keys (partition key + clustering columns)
  • Modeling time-series, IoT, or event data
  • Seeing large partition warnings in logs
  • Encountering ALLOW FILTERING in queries or code reviews
  • Adding secondary indexes or materialized views

Key Principle

"Start from the queries, not from the entities."

This is ScyllaDB's core data modeling philosophy. Unlike relational databases where you normalize entities and then write queries against them, in ScyllaDB you:

  1. List your application's queries — every SELECT, UPDATE, and DELETE your application will run
  2. Design one table per query pattern — each table's primary key is crafted to serve a specific query efficiently
  3. Accept denormalization — the same data may exist in multiple tables, each optimized for a different access pattern

In ScyllaDB, the partition key determines which node holds the data and the clustering columns determine the sort order within that partition. The primary key IS your access pattern.

Quick Reference

1. Anti-Patterns — 3 rules

  • antipattern-allow-filteringALLOW FILTERING forces a full-scan. Consult this reference whenever you see it in a query, or when a query does not include the full partition key.
  • antipattern-large-partitions — Partitions that grow without bounds cause memory pressure, slow reads, and compaction issues. Consult when designing time-series tables or any table where rows accumulate per partition key.
  • antipattern-hot-partitions — Uneven partition key distribution causes some nodes/shards to be overloaded while others sit idle. Consult when choosing partition keys for high-write workloads.

2. Fundamentals — 3 rules

  • query-first-design — Design tables from queries, not entities. The methodology for translating application access patterns into CQL table definitions.
  • partition-key-design — How to choose partition keys that distribute data evenly and support your queries. Includes composite partition keys.
  • clustering-columns — How clustering columns control sort order within a partition, and how to use them for range queries and time-ordered data.

3. Patterns — 2 rules

  • pattern-bucketing — Split unbounded partitions into bounded time buckets (e.g., one partition per day/hour). Essential for time-series and IoT data.
  • secondary-indexes-and-mv — When to use secondary indexes, local secondary indexes, and materialized views — and when to prefer a denormalized table instead.

Primary Key Structure

PRIMARY KEY ((partition_key_col1, partition_key_col2), clustering_col1, clustering_col2)
            |___________________________________|   |__________________________________|
                     Partition key                         Clustering columns
                 (determines node/shard)             (determines sort order within
                                                      the partition)

Rules

  1. The partition key MUST appear in every WHERE clause — queries that don't specify the full partition key require ALLOW FILTERING (a full-scan) or a secondary index
  2. Clustering columns define sort orderASC by default, configurable with WITH CLUSTERING ORDER BY
  3. Partition key determines data distribution — all rows with the same partition key live on the same node/shard

How to Use

Each reference file listed above contains detailed explanations and CQL examples. Use the descriptions in the Quick Reference to identify which files are relevant to your current task.

Each reference file contains:

  • Brief explanation of why it matters
  • Incorrect CQL example with explanation
  • Correct CQL example with explanation
  • "When NOT to use" exceptions
  • Performance impact

Comparison with SQL / MongoDB

Concept SQL MongoDB ScyllaDB
Schema design Entity-first (normalize) Document-first (embed) Query-first (denormalize per query)
Data unit Row in a table Document in a collection Row in a partition
Joins JOIN clause $lookup aggregation Not supported — denormalize instead
Flexible schema No (strict DDL) Yes (schema-optional) No (strict CQL DDL)
Index required for queries Often optional Recommended Partition key required in every query
Unique constraint UNIQUE unique: true index Primary key only
Transactions Full ACID Multi-document ACID Lightweight Transactions (LWT) for single-partition linearizable ops
Related skills
Installs
11
GitHub Stars
4
First Seen
Apr 16, 2026