Data Formats
How to work with diverse and unknown data formats.
Format Detection
Always inspect before parsing:
file <filename> # file type from magic bytes (add --mime-type for the MIME type)
xxd <filename> | head -5 # hex dump (first bytes)
head -3 <filename> # text preview
python3 -c "
with open('<filename>', 'rb') as f:
    h = f.read(16)
    print(h, h.hex())
"
Common Formats
Binary
- Magic bytes: Most binary formats start with a signature (ELF: \x7fELF, PNG: \x89PNG)
- Endianness: Check if little-endian or big-endian (struct.unpack('<I', ...) vs '>I')
- Alignment: Fields are often aligned to 4 or 8 bytes
- Offsets: Binary headers often contain offsets to other sections
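The endianness, alignment, and offset points above can be sketched with the standard struct module. The 16-byte header layout used here (magic, version, padding, data offset, record count) is hypothetical and only illustrates the idea; a real format needs its own layout string.

```python
import struct

# The same four bytes decode to different integers depending on byte order.
raw = bytes([0x01, 0x00, 0x00, 0x00])
little = struct.unpack("<I", raw)[0]  # least significant byte first
big = struct.unpack(">I", raw)[0]     # most significant byte first

# Hypothetical little-endian header layout (16 bytes in total):
#   4s  magic signature
#   H   version (2 bytes)
#   2x  padding so the next field starts on a 4-byte boundary
#   I   data_offset: where the payload starts, counted from the file start
#   I   record_count
HEADER = struct.Struct("<4sH2xII")


def parse_header(buf: bytes) -> dict:
    """Unpack the hypothetical header from the start of buf."""
    magic, version, data_offset, record_count = HEADER.unpack_from(buf, 0)
    return {
        "magic": magic,
        "version": version,
        "data_offset": data_offset,
        "record_count": record_count,
    }


# Build a synthetic file image and follow the offset to the payload.
sample = HEADER.pack(b"DEMO", 2, HEADER.size, 3) + b"payload"
header = parse_header(sample)
payload = sample[header["data_offset"]:]
```

With the "<" or ">" prefix, struct applies no implicit alignment, so any padding in the format has to be spelled out (here with 2x).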
Structured text
- CSV/TSV: Check delimiter (comma, tab, pipe), quoting, header row
- JSON: python3 -c "import json; json.load(open('f'))"
- YAML: Check indentation, anchors/aliases
- TOML: python3 -c "import tomllib; ..."
- XML: Check encoding declaration, namespaces
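For the CSV/TSV delimiter check, one option is the standard library's csv.Sniffer. The sketch below restricts it to a small candidate set; the helper name detect_dialect and the sample data are made up for illustration. Sniffer is a heuristic and raises csv.Error when it cannot decide, so treat its answer as a guess to confirm by eye.

```python
import csv


def detect_dialect(sample: str, candidates: str = ",\t|;"):
    """Guess the delimiter and quote character from a text sample."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter, dialect.quotechar


# Made-up sample: a tab-separated file with a header row.
tsv_sample = "name\tage\tcity\nAda\t36\tLondon\nAlan\t41\tManchester\n"
delimiter, quotechar = detect_dialect(tsv_sample)

# Parse the sample with the detected delimiter.
rows = list(csv.reader(tsv_sample.splitlines(), delimiter=delimiter))
```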
Checkpoints / Model files
- PyTorch: .pt, .pth → torch.load(f, map_location='cpu')
- TensorFlow: .ckpt → index + data files, use tf.train.load_checkpoint()
- NumPy: .npy, .npz → numpy.load()
- HuggingFace: config.json + model.safetensors
- ONNX: onnx.load()
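Before loading a checkpoint with a framework, it can help to inspect it with the standard library alone. As a sketch, this reads only the header of a .safetensors file, assuming the published layout: an 8-byte little-endian unsigned length followed by that many bytes of UTF-8 JSON. The function names and the size limit are illustrative choices, not part of any library.

```python
import json
import struct


def read_safetensors_header(path: str, max_header_bytes: int = 100_000_000) -> dict:
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError("file too short to be a safetensors file")
        (header_len,) = struct.unpack("<Q", prefix)
        # Sanity limit (arbitrary) so a corrupt length does not trigger a huge read.
        if header_len > max_header_bytes:
            raise ValueError(f"implausible header length: {header_len}")
        return json.loads(f.read(header_len))


def summarize(header: dict) -> dict:
    """Map tensor name to (dtype, shape), skipping the optional metadata entry."""
    return {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }
```

A call such as summarize(read_safetensors_header("model.safetensors")) lists tensor names, dtypes, and shapes without loading any weights.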
Database files
- SQLite: file says "SQLite 3.x database" → sqlite3 <file> ".tables"
- WAL files: SQLite write-ahead log (<file>-wal); recover by opening the database with its -wal file alongside and running the sqlite3 PRAGMA wal_checkpoint
- CSV dumps: Often need schema inference
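The same table listing can be done from Python with the built-in sqlite3 module. This is a minimal sketch: list_tables is a hypothetical helper, and it opens the file read-only through a URI so that inspection cannot modify the database. Paths containing characters such as "?" or "#" would need escaping before being placed in the URI.

```python
import sqlite3


def list_tables(path: str) -> dict[str, str]:
    """Return {table_name: CREATE statement} for the user tables in a SQLite file."""
    # mode=ro opens the database read-only.
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)
```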
Parsing Strategies
Unknown binary format
- Hex dump first 256 bytes: xxd file | head -16
- Look for magic bytes, version numbers, string tables
- Check file size — does it suggest a pattern? (e.g., N * record_size)
- Look for documentation of the format online
- Write a minimal parser, test on known values
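The steps above can be sketched end to end on a synthetic file. Everything specific here is an assumption made for illustration: the small signature table, the 16-byte header, and the (uint32, float32) record layout. The point is the workflow: check the signature, see which record sizes divide the payload evenly, then test a parser against values you already know.

```python
import struct

# Small, illustrative signature table; extend it for the formats you expect.
KNOWN_MAGIC = {
    b"\x7fELF": "ELF",
    b"\x89PNG": "PNG",
    b"SQLite format 3\x00": "SQLite 3",
}


def identify(header: bytes) -> str:
    """Return the name of a known signature, or 'unknown'."""
    for magic, name in KNOWN_MAGIC.items():
        if header.startswith(magic):
            return name
    return "unknown"


def candidate_record_sizes(file_size: int, header_size: int, max_size: int = 64) -> list[int]:
    """Record sizes that divide the payload evenly (file_size = header + N * size)."""
    payload = file_size - header_size
    if payload <= 0:
        return []
    return [n for n in range(2, max_size + 1) if payload % n == 0]


def read_records(buf: bytes, header_size: int, fmt: str) -> list[tuple]:
    """Parse fixed-size records after the header using a struct format string."""
    return list(struct.iter_unpack(fmt, buf[header_size:]))


# Synthetic file: 16-byte header, then three (uint32, float32) records.
blob = b"DEMO" + bytes(12) + b"".join(struct.pack("<If", i, i * 0.5) for i in range(3))

kind = identify(blob[:16])                                 # no known signature
sizes = candidate_record_sizes(len(blob), header_size=16)  # 8 is among the candidates
records = read_records(blob, header_size=16, fmt="<If")    # compare with known values
```

Divisibility only narrows the search: several sizes usually fit, so the parsed values still have to be checked against something known.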
Large structured files
- Never load the whole file; sample first: head, tail, shuf -n 10
- Check consistency: are all lines the same format?
- Count fields: head -1 file | awk -F',' '{print NF}'
- Watch for: mixed types, missing values, encoding issues
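To check consistency beyond the first line, a streaming pass can count fields per row without loading the file. A minimal sketch, with field_count_profile as a hypothetical helper name: it uses csv.reader, which, unlike splitting on commas with awk, treats a quoted field containing the delimiter as one field.

```python
import csv
from collections import Counter
from itertools import islice


def field_count_profile(path: str, delimiter: str = ",", limit: int = 10_000,
                        encoding: str = "utf-8") -> Counter:
    """Count how many of the first `limit` rows have each number of fields."""
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f, delimiter=delimiter)
        # islice stops after `limit` rows, so memory use stays flat.
        return Counter(len(row) for row in islice(reader, limit))
```

A clean file yields a single key; more than one key means some rows have a different shape and deserve a closer look.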
Multi-file datasets
- List all files and sizes
- Look for manifest/index files (often JSON or CSV)
- Check naming patterns — timestamps, sequence numbers, shards
- Process one file first, then generalize
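Listing files and spotting naming patterns can be combined in one pass. In this sketch, group_by_pattern is a hypothetical helper that replaces every run of digits in a file name with "#", so shards and timestamped files collapse into one group each while one-off files such as a manifest stand out.

```python
import re
from collections import defaultdict
from pathlib import Path


def group_by_pattern(root: str) -> dict[str, list[tuple[str, int]]]:
    """Group files under root by name pattern, keeping (name, size in bytes)."""
    groups = defaultdict(list)
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            # shard-00012.bin and shard-00013.bin both become shard-#.bin
            pattern = re.sub(r"\d+", "#", p.name)
            groups[pattern].append((p.name, p.stat().st_size))
    return dict(groups)
```

Groups whose members differ widely in size, or whose sequence numbers have gaps, are worth checking before generalizing from one file.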
Common Pitfalls
- Assuming UTF-8 when the file is Latin-1 or binary
- Assuming CSV when it's TSV (or vice versa)
- Ignoring the header row
- Not handling quoted fields with embedded delimiters
- Reading binary files as text (corrupts data)
- Endianness mismatch (x86 is little-endian, network byte order is big-endian)
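The first and fifth pitfalls can be guarded against by reading bytes first and decoding explicitly. This is a rough sketch, not a full encoding detector: decode_text is a hypothetical helper, the NUL-byte test is only a heuristic (UTF-16 text would trip it), and Latin-1 is used as the fallback because it accepts any byte sequence, which also means it can silently mis-decode other encodings.

```python
def decode_text(raw: bytes) -> tuple[str, str]:
    """Decode bytes as text, returning (text, encoding_used)."""
    # Heuristic: NUL bytes almost never appear in UTF-8 or Latin-1 text.
    if b"\x00" in raw:
        raise ValueError("data looks binary; do not read it as text")
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return raw.decode("latin-1"), "latin-1"


utf8_text, utf8_enc = decode_text("café".encode("utf-8"))
latin_text, latin_enc = decode_text("café".encode("latin-1"))
```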