HuggingFace Tokenizers

Fast, production-ready tokenization - Rust-powered, Python API.

When to Use

  • High-performance tokenization (under 20 s per GB of text)
  • Training custom tokenizers from scratch
  • Tracking token-to-text alignment
  • Production NLP pipelines
  • BPE, WordPiece, or Unigram tokenization

Quick Start

from tokenizers import Tokenizer

# Load pretrained
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
print(output.ids)     # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]

# Decode (special tokens are skipped by default)
text = tokenizer.decode(output.ids)  # "hello, how are you?"

Train Custom Tokenizer

BPE (GPT-2 style)

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Initialize
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()  # maps byte-level symbols back to text on decode

# Configure trainer
trainer = BpeTrainer(
    vocab_size=50000,
    special_tokens=["<|endoftext|>", "<|pad|>"],
    min_frequency=2,
    initial_alphabet=ByteLevel.alphabet(),  # seed the vocab with all 256 byte symbols
)

# Train
tokenizer.train(files=["data.txt"], trainer=trainer)

# Save
tokenizer.save("my-tokenizer.json")
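
Once training finishes, a quick sanity check is worth the two lines; the exact merges you see depend entirely on the training corpus:

# Sanity-check the freshly trained tokenizer
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)                 # subword merges learned from data.txt
print(tokenizer.decode(output.ids))  # should round-trip back to the input text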

WordPiece (BERT style)

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

tokenizer.train(files=["data.txt"], trainer=trainer)
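
Unigram (SentencePiece style)

The Unigram model trains the same way; the sketch below is a minimal setup using the Metaspace pre-tokenizer, with an illustrative vocab size and special-token list.

from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Metaspace

tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Metaspace()  # replaces spaces with ▁, SentencePiece-style

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<pad>"],
    unk_token="<unk>",
)

tokenizer.train(files=["data.txt"], trainer=trainer)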

Encoding Options

# Single text
output = tokenizer.encode("Hello world")

# Batch encoding
outputs = tokenizer.encode_batch(["Hello", "World"])

# With padding
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
outputs = tokenizer.encode_batch(texts)

# With truncation
tokenizer.enable_truncation(max_length=512)
output = tokenizer.encode(long_text)
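
encode also accepts an optional second argument for sentence pairs; sequence_ids then tells you which input each token came from (example strings are illustrative):

# Sentence pairs (e.g. question + context)
output = tokenizer.encode("What is tokenization?", "It splits text into tokens.")
print(output.sequence_ids)  # 0 for first-sequence tokens, 1 for the second, None for specials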

Access Encoding Data

output = tokenizer.encode("Hello world")

output.ids           # Token IDs
output.tokens        # Token strings
output.attention_mask  # Attention mask
output.offsets       # Character offsets (alignment)
output.word_ids      # Word indices
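
The offsets make token-to-text alignment straightforward; a small sketch:

# Map each token back to its source span with character offsets
text = "Hello world"
output = tokenizer.encode(text)
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", repr(text[start:end]))  # special tokens (if any) map to an empty span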

Pre-tokenizers

from tokenizers.pre_tokenizers import (
    Whitespace,      # Split on whitespace
    ByteLevel,       # Byte-level (GPT-2)
    BertPreTokenizer,  # BERT style
    Punctuation,     # Split on punctuation
    Sequence,        # Chain multiple
)

# Chain pre-tokenizers
from tokenizers.pre_tokenizers import Sequence, Whitespace, Punctuation
tokenizer.pre_tokenizer = Sequence([Whitespace(), Punctuation()])
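
Any pre-tokenizer can be inspected directly with pre_tokenize_str, which returns the splits along with their character offsets (input string here is illustrative):

# Preview the splits a pre-tokenizer produces, before the model sees them
from tokenizers.pre_tokenizers import Whitespace
print(Whitespace().pre_tokenize_str("Hello, how are you?"))
# [('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ...]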

Post-processing

from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] ... [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
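
With the template attached, pair inputs are wrapped and typed automatically; a sketch of what to expect (exact subwords depend on your vocabulary):

# Pair encoding after attaching the template
output = tokenizer.encode("How are you?", "I am fine.")
print(output.tokens)    # ['[CLS]', ..., '[SEP]', ..., '[SEP]']
print(output.type_ids)  # 0 for the first segment, 1 for the second (per the :1 markers)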

Normalization

from tokenizers.normalizers import (
    NFD, NFKC, Lowercase, StripAccents, Sequence
)

# BERT normalization
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
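
normalize_str previews the effect on raw text before any splitting happens:

# Preview the normalizer's effect
print(tokenizer.normalizer.normalize_str("Héllo WORLD"))  # "hello world"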

With Transformers

from transformers import PreTrainedTokenizerFast

# Wrap for transformers compatibility
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

# Now works with transformers
encoded = fast_tokenizer("Hello world", return_tensors="pt")
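
The wrapper does not infer which tokens are special, so it helps to name them explicitly when constructing it; the token strings below assume a BERT-style vocabulary, and save_pretrained then produces a directory loadable with AutoTokenizer:

# Register special tokens on the wrapper (strings assume a BERT-style vocab)
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
fast_tokenizer.save_pretrained("my-tokenizer")  # later: AutoTokenizer.from_pretrained("my-tokenizer")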

Save and Load

# Save
tokenizer.save("tokenizer.json")

# Load
tokenizer = Tokenizer.from_file("tokenizer.json")

# From HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Performance Tips

  1. Use batch encoding for multiple texts (see the sketch after this list)
  2. Enable padding/truncation once up front, not before every encode call
  3. Pre-tokenizer choice affects speed significantly
  4. Train on data representative of your domain for a better vocabulary
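
A minimal sketch of tips 1 and 2 together; the file name and batch contents are illustrative:

# Configure once, then hand the whole batch to Rust in a single call
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
tokenizer.enable_truncation(max_length=512)

texts = [line.strip() for line in open("data.txt", encoding="utf-8")]
outputs = tokenizer.encode_batch(texts)   # parallel, no per-text Python overhead
batch_ids = [enc.ids for enc in outputs]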

vs Alternatives

Tool            Best For
tokenizers      Speed, custom training, production
SentencePiece   T5/ALBERT, language-independent
tiktoken        OpenAI models (GPT)
