sentencepiece
Originally fromovachiever/droid-tings
Installation
SKILL.md
SentencePiece - Language-Independent Tokenization
Unsupervised tokenizer that works on raw text without language-specific preprocessing.
When to use SentencePiece
Use SentencePiece when:
- Building multilingual models (no language-specific rules)
- Working with CJK languages (Chinese, Japanese, Korean)
- Need reproducible tokenization (deterministic vocabulary)
- Want to train on raw text (no pre-tokenization needed)
- Require lightweight deployment (6MB memory, 50k sentences/sec)
Performance:
- Speed: 50,000 sentences/sec
- Memory: ~6MB for loaded model
- Languages: All (language-independent)