Cassandra SSTable Format Parsing

Installation

SKILL.md

Cassandra SSTable Format Parsing

This skill helps with parsing and understanding Cassandra 5.0+ SSTable file formats.

When to Use This Skill

Parsing Data.db, Index.db, Statistics.db files
Debugging binary format mismatches
Analyzing hex dumps of SSTable data
Working with compression (LZ4, Snappy, Deflate)
Investigating offset calculation errors
Understanding BTI (Big Table Index) format
Validating partition boundaries

Key SSTable Components

Data.db

Contains the actual row data with:

Partition headers
Row data (clustering + cells)
Compression blocks
Checksums

Index.db

Contains partition index entries with:

BTI (Big Table Index) format in Cassandra 5.0+
Partition key → file offset mapping
Promoted index entries

Statistics.db

Contains serialization metadata:

Encoding stats (min/max timestamps, TTLs)
Column definitions
Schema information
Compression parameters

Summary.db

Contains sampling of index entries for faster lookups

Format References

Primary Source of Truth: docs/sstables-definitive-guide/

Key chapters:

Ch.5: Data.db Format - Row layout, flags, V5CompressedLegacy
Ch.6: Index.db and Summary.db - Partition lookups
Ch.9: CompressionInfo.db - Compression metadata, chunking
Ch.17: BTI Formats - Trie-based indexes
Appendix B: Encoding Cheat Sheet - VInt, cell flags
Appendix F: Known Limitations - What doesn't work yet

Common Debugging Techniques

Hex Dump Analysis

When debugging parsing errors:

Extract hex at specific offset:
```
hexdump -C Data.db -s <offset> -n 64
```
Compare with expected format:
- Check magic bytes (if applicable)
- Verify VInt encoding
- Validate flag bytes
Look for patterns:
- Repeated byte sequences may indicate arrays/collections
- All zeros may indicate padding
- Non-zero high bytes suggest multi-byte integers

Offset Validation

Track byte consumption at each parsing stage:

Clustering prefix (may be 0 bytes)
Row sizes (2 VInts)
Liveness info (conditional)
Deletion info (conditional)
Column bitmap (conditional)
Cell data

Zero-Copy Considerations

When implementing parsers:

Use Bytes crate for buffer sharing
Avoid copying large cell values
Keep references to original buffer
Use byte slices not owned Vecs

Integration with Rust Code

Current implementation in cqlite-core/src/storage/sstable/reader/parsing/:

v5_compressed_legacy.rs - Main V5 format parser (1997 lines)
Uses zero-copy patterns with Bytes
Handles compression transparently

PRD Alignment

Supports Milestone M1 (Core Reading Library):

100% Cassandra 5 SSTable format support
All compression formats (LZ4, Snappy, Deflate)
Zero-copy deserialization
Memory target: <128MB for large files

Quick Reference

Flag Bytes (Row)

0x01: HAS_IS_MARKER
0x02: HAS_ALL_COLUMNS (inverted - 0x20 means all present)
0x04: HAS_TIMESTAMP
0x08: HAS_TTL
0x10: HAS_DELETION
0x20: HAS_ALL_COLUMNS (flag set = all columns present)
0x40: IS_STATIC
0x80: EXTENSION_FLAG

VInt Encoding

Variable-length integer encoding:

First byte indicates length
Subsequent bytes contain value
Used for row sizes, timestamps, offsets

Next Steps

When parser encounters issues:

Log byte offsets at each stage
Compare against Java source (UnfilteredSerializer.java)
Validate against sstabledump output
Check compression block boundaries
Verify delta encoding calculations

Related skills

More from pmcfadin/cqlite

Installs

–

Repository

pmcfadin/cqlite

First Seen

–

Security Audits

Gen Agent Trust HubPass

SocketPass

SnykPass

Cassandra SSTable Format Parsing

Cassandra SSTable Format Parsing

When to Use This Skill

Key SSTable Components

Data.db

Index.db

Statistics.db

Summary.db

Format References

Common Debugging Techniques

Hex Dump Analysis

Offset Validation

Zero-Copy Considerations

Integration with Rust Code

PRD Alignment

Quick Reference

Flag Bytes (Row)

VInt Encoding

Next Steps

More from pmcfadin/cqlite

napi-rs-node-bindings

pyo3-maturin-bindings

rust performance & safety patterns

cql type system & schema handling

test data generation & validation

ci/cd validation & merge workflow