gpt2-codegolf
GPT-2 Code Golf
Overview
This skill provides guidance for implementing GPT-2 or similar transformer model inference in minimal code, typically for code golf challenges. These tasks require parsing binary checkpoint formats, implementing BPE tokenization, and performing forward passes through transformer architectures—all within strict size constraints.
Critical Principles
1. Verify Before Optimizing
Never optimize for size before functionality is verified. A working 10KB solution is better than a broken 4KB solution. Size optimization should be the final step, not a constraint during development.
2. Incremental Development and Testing
Build and test components independently before integration:
- Checkpoint parsing - Verify weights load correctly
- Tokenization - Verify BPE encoding/decoding works
- Forward pass - Verify matrix operations produce reasonable outputs
- Integration - Combine components only after individual verification
3. Format Analysis Before Implementation
Before writing any parsing code, analyze the actual file formats:
```bash
# Examine checkpoint structure
hexdump -C model.ckpt.data-00000-of-00001 | head -100
# Check index file format
cat model.ckpt.index | xxd | head -50
# Examine vocabulary/BPE file structure
head -50 vocab.bpe
```
Approach: Checkpoint Parsing
Understanding TensorFlow Checkpoint Format
TensorFlow checkpoints consist of multiple files:
- `.index` file: Contains tensor metadata (names, shapes, offsets)
- `.data-*` files: Contain the actual weight values
The index file uses protocol buffers. Parsing it correctly requires understanding:
- Varint encoding for integers
- String length prefixes
- Tensor shape encodings
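Both the string length prefixes and the integer fields come down to base-128 varints. A minimal decoder sketch, independent of any particular index layout (the function name is illustrative):
```c
#include <stdint.h>
#include <stddef.h>

// Decode one base-128 varint starting at buf; return the number of bytes
// consumed. The low 7 bits of each byte carry data; the high bit means
// "another byte follows".
size_t read_varint(const uint8_t *buf, uint64_t *out) {
    uint64_t v = 0;
    size_t i = 0;
    do {
        v |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
    } while (buf[i++] & 0x80);
    *out = v;
    return i;
}
```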
Verification Strategy for Weight Loading
After implementing checkpoint parsing, verify weights loaded correctly:
```c
// Print statistics for loaded weights
printf("wte sum: %f\n", sum_tensor(wte, vocab_size * n_embd));
printf("wpe[0]: %f %f %f\n", wpe[0], wpe[1], wpe[2]);
printf("First layer attn weight samples: %f %f\n",
       attn_w[0][0], attn_w[0][100]);
```
Red flags indicating parsing failure:
- All values are zero
- All values are identical
- Values are extremely large (e.g., 1e30+)
- NaN or Inf values
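A small statistics helper makes these red flags easy to spot for every tensor; a minimal sketch (the helper name is illustrative):
```c
#include <stdio.h>
#include <math.h>

// Print basic statistics for a loaded tensor; all-zero data, identical
// values (min == max), huge magnitudes, and NaN/Inf all show up here.
void check_tensor(const char *name, const float *t, long n) {
    double sum = 0.0, min = t[0], max = t[0];
    long nonfinite = 0;
    for (long i = 0; i < n; i++) {
        if (!isfinite(t[i])) nonfinite++;
        if (t[i] < min) min = t[i];
        if (t[i] > max) max = t[i];
        sum += t[i];
    }
    printf("%s: mean=%g min=%g max=%g nonfinite=%ld\n",
           name, sum / n, min, max, nonfinite);
}
```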
Common Mistakes in Checkpoint Parsing
- Using arbitrary magic offsets - Offsets depend on tensor order and checkpoint version
- Ignoring endianness - TensorFlow uses little-endian floats
- Skipping index file - The data file alone lacks tensor boundaries
- Pattern matching on binary data - Binary patterns like `0x08` are not reliable markers
Approach: BPE Tokenization
Understanding BPE Structure
GPT-2 uses byte-pair encoding with:
- A vocabulary file mapping tokens to IDs
- Merge rules defining how to combine byte sequences
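One workable representation is a rank table keyed by the pair of symbol strings, where a pair's priority is simply its line number in vocab.bpe. A minimal sketch using a plain linear-search lookup rather than anything size-optimized (array sizes and names are illustrative):
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

#define MAX_MERGES 50000          // GPT-2 ships 50,000 merge rules

static char *left[MAX_MERGES], *right[MAX_MERGES];
static int n_merges;

// Load merge rules; the rank of a pair is its line number (0 = merge first).
void load_merges(const char *path) {
    char a[128], b[128], header[256];
    FILE *f = fopen(path, "r");
    fgets(header, sizeof header, f);   // skip the "#version" header line
    while (n_merges < MAX_MERGES &&
           fscanf(f, "%127s %127s", a, b) == 2) {
        left[n_merges] = strdup(a);
        right[n_merges] = strdup(b);
        n_merges++;
    }
    fclose(f);
}

// Lower rank = higher priority; INT_MAX means the pair never merges.
int merge_rank(const char *a, const char *b) {
    for (int i = 0; i < n_merges; i++)
        if (!strcmp(a, left[i]) && !strcmp(b, right[i])) return i;
    return INT_MAX;
}
```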
Verification Strategy for Tokenization
Test with known input/output pairs:
```c
// Known tokenization for GPT-2
// "Hello" -> [15496]
// " world" -> [995]
// "Hello world" -> [15496, 995]
int tokens[MAX_TOKENS];
int n = tokenize("Hello world", tokens);
assert(n == 2);
assert(tokens[0] == 15496);
assert(tokens[1] == 995);
```
Common Mistakes in BPE Implementation
- Word-level splitting - GPT-2 BPE operates on bytes, not words
- Ignoring merge order - Merges must be applied in priority order
- Missing special handling - Spaces are encoded as part of tokens (e.g., " world" is one token)
- Reading wrong file sections - Vocabulary IDs vs merge rules are in different sections
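The merge-order point above is the core of the algorithm: repeatedly merge the adjacent pair with the lowest rank until no rule applies. A minimal sketch over one pre-split word, reusing the merge_rank lookup sketched earlier (symbols are heap-allocated strings here):
```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

int merge_rank(const char *a, const char *b);   // from the loading sketch

// Greedy BPE over one word: sym holds the byte-mapped characters of the
// word as strings; returns the number of symbols left after all merges.
int bpe_merge(char **sym, int n) {
    for (;;) {
        int best = INT_MAX, at = -1;
        for (int i = 0; i + 1 < n; i++) {         // find the lowest-rank pair
            int r = merge_rank(sym[i], sym[i + 1]);
            if (r < best) { best = r; at = i; }
        }
        if (at < 0) break;                        // nothing left to merge
        char *joined = malloc(strlen(sym[at]) + strlen(sym[at + 1]) + 1);
        strcpy(joined, sym[at]);
        strcat(joined, sym[at + 1]);
        free(sym[at]);
        free(sym[at + 1]);
        sym[at] = joined;
        memmove(&sym[at + 1], &sym[at + 2], (n - at - 2) * sizeof *sym);
        n--;
    }
    return n;
}
```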
Approach: Forward Pass
Verification Strategy for Forward Pass
- Test with simple inputs first:
```c
// Single token input, check output shape and range
float logits[VOCAB_SIZE];
forward_pass(single_token, 1, logits);
// Logits should be roughly in range [-10, 10]
// Softmax should sum to 1.0
```
- Compare against reference implementation:
```python
# Generate reference outputs with Hugging Face
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Save intermediate activations for comparison
inputs = tok("Hello world", return_tensors="pt")
out = model(**inputs, output_hidden_states=True)
print(out.logits[0, -1, :5], out.hidden_states[1][0, 0, :5])
```
- Check numerical stability:
```c
// After attention, values should be bounded
// After layer norm, mean should be ~0, std ~1
```
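Layer norm itself is small enough to get exactly right, which is what makes the mean/std check above meaningful. A minimal sketch over one embedding vector, assuming gamma/beta are the loaded ln weights and GPT-2's epsilon of 1e-5:
```c
#include <math.h>

// Normalize one n-element vector in place, then scale and shift.
void layernorm(float *x, int n, const float *gamma, const float *beta) {
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;
    for (int i = 0; i < n; i++) { double d = x[i] - mean; var += d * d; }
    var /= n;
    double inv = 1.0 / sqrt(var + 1e-5);
    for (int i = 0; i < n; i++)
        x[i] = (float)((x[i] - mean) * inv) * gamma[i] + beta[i];
}
```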
Common Numerical Issues
- Overflow in softmax - Subtract max before exponentiating
- Accumulation errors - Use double precision for reductions
- Missing layer normalization - Each transformer block requires layer norm
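The first two issues show up directly in the softmax; a minimal numerically stable sketch:
```c
#include <math.h>

// Softmax in place: subtract the max before exponentiating to avoid
// overflow, and accumulate the sum in double precision.
void softmax(float *x, int n) {
    float m = x[0];
    for (int i = 1; i < n; i++) if (x[i] > m) m = x[i];
    double s = 0.0;
    for (int i = 0; i < n; i++) { x[i] = expf(x[i] - m); s += x[i]; }
    for (int i = 0; i < n; i++) x[i] /= (float)s;
}
```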
Verification Checklist
Before declaring completion, verify each component:
- Checkpoint parsing: Print sample weights, verify non-zero and reasonable values
- Vocabulary loading: Print sample token mappings, verify expected tokens exist
- BPE encoding: Test with known strings, verify token IDs match reference
- BPE decoding: Round-trip test (encoding then decoding should return the original string)
- Forward pass: Single token produces logits in expected range
- Sampling: Top token for common prefixes matches expected continuations
- End-to-end: Generated text is coherent (not garbage or repeated patterns)
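A small harness covers the round-trip and sampling items directly; a minimal sketch, where tokenize and forward_pass follow the signatures used in the snippets above and detokenize is a placeholder name:
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Placeholders for this codebase's own helpers.
int  tokenize(const char *text, int *tokens);
void detokenize(const int *tokens, int n, char *text);
void forward_pass(const int *tokens, int n, float *logits);

void run_checks(void) {
    int toks[64];
    char text[256];
    int n = tokenize("Hello world", toks);
    detokenize(toks, n, text);
    assert(strcmp(text, "Hello world") == 0);   // encode/decode round trip

    static float logits[50257];                 // GPT-2 vocabulary size
    forward_pass(toks, n, logits);
    int best = 0;
    for (int i = 1; i < 50257; i++)             // greedy top token
        if (logits[i] > logits[best]) best = i;
    printf("top token id: %d (compare against the reference model)\n", best);
}
```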
When to Simplify vs. Ask for Alternatives
If proper implementation of a component would exceed constraints:
- Checkpoint format too complex: Ask if weights can be provided in a simpler format (raw binary floats, NumPy arrays)
- Full BPE too large: Ask if a pre-tokenized input is acceptable
- Full transformer too large: Ask about acceptable accuracy tradeoffs (fewer layers, smaller dimensions)
Never implement a non-functional simplification silently. If corners must be cut, communicate clearly which functionality is affected.
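If the raw-binary-floats route is agreed on, loading collapses to one fread per tensor. A minimal sketch, assuming the weights were re-exported beforehand as little-endian float32 in a fixed, known tensor order (that export step and the file naming are assumptions, not part of the TensorFlow checkpoint):
```c
#include <stdio.h>
#include <stdlib.h>

// Read `count` float32 values from a raw dump; returns NULL on failure.
float *load_raw(const char *path, size_t count) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    float *w = malloc(count * sizeof *w);
    if (fread(w, sizeof *w, count, f) != count) { free(w); w = NULL; }
    fclose(f);
    return w;
}
```
For GPT-2 small, for example, the token embedding would then be `load_raw("wte.bin", 50257 * 768)`, with the file name set by whichever export script is agreed on.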
Anti-Patterns to Avoid
- Claiming success without testing - Always run the program with actual inputs before completion
- Ignoring self-identified risks - If a limitation is noted, address it or communicate it
- Premature size optimization - Get it working first, then optimize
- Testing only happy path - Verify error handling and edge cases
- Assuming file format knowledge - Always examine actual files before parsing