gpt2-codegolf
GPT-2 Code Golf
Overview
This skill provides guidance for implementing GPT-2 or similar transformer model inference in minimal code, typically for code golf challenges. These tasks require parsing binary checkpoint formats, implementing BPE tokenization, and performing forward passes through transformer architectures—all within strict size constraints.
Critical Principles
1. Verify Before Optimizing
Never optimize for size before functionality is verified. A working 10KB solution is better than a broken 4KB solution. Size optimization should be the final step, not a constraint during development.
2. Incremental Development and Testing
Build and test components independently before integration:
- Checkpoint parsing - Verify weights load correctly
- Tokenization - Verify BPE encoding/decoding works
- Forward pass - Verify matrix operations produce reasonable outputs
- Integration - Combine components only after individual verification
3. Format Analysis Before Implementation
Before writing any parsing code, analyze the actual file formats:
```bash
# Examine checkpoint structure
hexdump -C model.ckpt.data-00000-of-00001 | head -100
# Check index file format
cat model.ckpt.index | xxd | head -50
# Examine vocabulary/BPE file structure
head -50 vocab.bpe
```
Approach: Checkpoint Parsing
Understanding TensorFlow Checkpoint Format
TensorFlow checkpoints consist of multiple files:
- `.index` file: Contains tensor metadata (names, shapes, offsets)
- `.data-*` files: Contain the actual weight values
The index file uses protocol buffers. Parsing it correctly requires understanding:
- Varint encoding for integers
- String length prefixes
- Tensor shape encodings
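Both the string length prefixes and the integer fields come down to base-128 varints. A minimal decoder sketch, independent of any particular index layout (the function name is illustrative):
```c
#include <stdint.h>
#include <stddef.h>

// Decode one base-128 varint starting at buf; return the number of bytes
// consumed. The low 7 bits of each byte carry data; the high bit means
// "another byte follows".
size_t read_varint(const uint8_t *buf, uint64_t *out) {
    uint64_t v = 0;
    size_t i = 0;
    do {
        v |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
    } while (buf[i++] & 0x80);
    *out = v;
    return i;
}
```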
Verification Strategy for Weight Loading
After implementing checkpoint parsing, verify weights loaded correctly:
```c
// Print statistics for loaded weights
printf("wte sum: %f\n", sum_tensor(wte, vocab_size * n_embd));
printf("wpe[0]: %f %f %f\n", wpe[0], wpe[1], wpe[2]);
printf("First layer attn weight samples: %f %f\n",
       attn_w[0][0], attn_w[0][100]);
```
Red flags indicating parsing failure:
- All values are zero
- All values are identical
- Values are extremely large (e.g., 1e30+)
- NaN or Inf values
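A small statistics helper makes these red flags easy to spot for every tensor; a minimal sketch (the helper name is illustrative):
```c
#include <stdio.h>
#include <math.h>

// Print basic statistics for a loaded tensor; all-zero data, identical
// values (min == max), huge magnitudes, and NaN/Inf all show up here.
void check_tensor(const char *name, const float *t, long n) {
    double sum = 0.0, min = t[0], max = t[0];
    long nonfinite = 0;
    for (long i = 0; i < n; i++) {
        if (!isfinite(t[i])) nonfinite++;
        if (t[i] < min) min = t[i];
        if (t[i] > max) max = t[i];
        sum += t[i];
    }
    printf("%s: mean=%g min=%g max=%g nonfinite=%ld\n",
           name, sum / n, min, max, nonfinite);
}
```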
Common Mistakes in Checkpoint Parsing
- Using arbitrary magic offsets - Offsets depend on tensor order and checkpoint version
- Ignoring endianness - TensorFlow uses little-endian floats
- Skipping index file - The data file alone lacks tensor boundaries
- Pattern matching on binary data - Binary patterns like `0x08` are not reliable markers
Approach: BPE Tokenization
Understanding BPE Structure
GPT-2 uses byte-pair encoding with:
- A vocabulary file mapping tokens to IDs
- Merge rules defining how to combine byte sequences
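One workable representation is a rank table keyed by the pair of symbol strings, where a pair's priority is simply its line number in vocab.bpe. A minimal sketch using a plain linear-search lookup rather than anything size-optimized (array sizes and names are illustrative):
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

#define MAX_MERGES 50000          // GPT-2 ships 50,000 merge rules

static char *left[MAX_MERGES], *right[MAX_MERGES];
static int n_merges;

// Load merge rules; the rank of a pair is its line number (0 = merge first).
void load_merges(const char *path) {
    char a[128], b[128], header[256];
    FILE *f = fopen(path, "r");
    fgets(header, sizeof header, f);   // skip the "#version" header line
    while (n_merges < MAX_MERGES &&
           fscanf(f, "%127s %127s", a, b) == 2) {
        left[n_merges] = strdup(a);
        right[n_merges] = strdup(b);
        n_merges++;
    }
    fclose(f);
}

// Lower rank = higher priority; INT_MAX means the pair never merges.
int merge_rank(const char *a, const char *b) {
    for (int i = 0; i < n_merges; i++)
        if (!strcmp(a, left[i]) && !strcmp(b, right[i])) return i;
    return INT_MAX;
}
```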
Verification Strategy for Tokenization
Test with known input/output pairs:
```c
// Known tokenization for GPT-2
// "Hello" -> [15496]
// " world" -> [995]
// "Hello world" -> [15496, 995]
int tokens[MAX_TOKENS];
int n = tokenize("Hello world", tokens);
assert(n == 2);
assert(tokens[0] == 15496);
assert(tokens[1] == 995);
```
Common Mistakes in BPE Implementation
- Word-level splitting - GPT-2 BPE operates on bytes, not words
- Ignoring merge order - Merges must be applied in priority order
- Missing special handling - Spaces are encoded as part of tokens (e.g., " world" is one token)
- Reading wrong file sections - Vocabulary IDs vs merge rules are in different sections
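The merge-order point above is the core of the algorithm: repeatedly merge the adjacent pair with the lowest rank until no rule applies. A minimal sketch over one pre-split word, reusing the merge_rank lookup sketched earlier (symbols are heap-allocated strings here):
```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

int merge_rank(const char *a, const char *b);   // from the loading sketch

// Greedy BPE over one word: sym holds the byte-mapped characters of the
// word as strings; returns the number of symbols left after all merges.
int bpe_merge(char **sym, int n) {
    for (;;) {
        int best = INT_MAX, at = -1;
        for (int i = 0; i + 1 < n; i++) {         // find the lowest-rank pair
            int r = merge_rank(sym[i], sym[i + 1]);
            if (r < best) { best = r; at = i; }
        }
        if (at < 0) break;                        // nothing left to merge
        char *joined = malloc(strlen(sym[at]) + strlen(sym[at + 1]) + 1);
        strcpy(joined, sym[at]);
        strcat(joined, sym[at + 1]);
        free(sym[at]);
        free(sym[at + 1]);
        sym[at] = joined;
        memmove(&sym[at + 1], &sym[at + 2], (n - at - 2) * sizeof *sym);
        n--;
    }
    return n;
}
```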
Approach: Forward Pass
Verification Strategy for Forward Pass
- Test with simple inputs first:
```c
// Single token input, check output shape and range
float logits[VOCAB_SIZE];
forward_pass(single_token, 1, logits);
// Logits should be roughly in range [-10, 10]
// Softmax should sum to 1.0
```
- Compare against reference implementation:
```python
# Generate reference outputs with Hugging Face
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Save intermediate activations for comparison
inputs = tok("Hello world", return_tensors="pt")
out = model(**inputs, output_hidden_states=True)
print(out.logits[0, -1, :5], out.hidden_states[1][0, 0, :5])
```
- Check numerical stability:
```c
// After attention, values should be bounded
// After layer norm, mean should be ~0, std ~1
```
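Layer norm itself is small enough to get exactly right, which is what makes the mean/std check above meaningful. A minimal sketch over one embedding vector, assuming gamma/beta are the loaded ln weights and GPT-2's epsilon of 1e-5:
```c
#include <math.h>

// Normalize one n-element vector in place, then scale and shift.
void layernorm(float *x, int n, const float *gamma, const float *beta) {
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;
    for (int i = 0; i < n; i++) { double d = x[i] - mean; var += d * d; }
    var /= n;
    double inv = 1.0 / sqrt(var + 1e-5);
    for (int i = 0; i < n; i++)
        x[i] = (float)((x[i] - mean) * inv) * gamma[i] + beta[i];
}
```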
Common Numerical Issues
- Overflow in softmax - Subtract max before exponentiating
- Accumulation errors - Use double precision for reductions
- Missing layer normalization - Each transformer block requires layer norm
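The first two issues show up directly in the softmax; a minimal numerically stable sketch:
```c
#include <math.h>

// Softmax in place: subtract the max before exponentiating to avoid
// overflow, and accumulate the sum in double precision.
void softmax(float *x, int n) {
    float m = x[0];
    for (int i = 1; i < n; i++) if (x[i] > m) m = x[i];
    double s = 0.0;
    for (int i = 0; i < n; i++) { x[i] = expf(x[i] - m); s += x[i]; }
    for (int i = 0; i < n; i++) x[i] /= (float)s;
}
```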
Verification Checklist
Before declaring completion, verify each component:
- Checkpoint parsing: Print sample weights, verify non-zero and reasonable values
- Vocabulary loading: Print sample token mappings, verify expected tokens exist
- BPE encoding: Test with known strings, verify token IDs match reference
- BPE decoding: Round-trip test (encoding then decoding should return the original string)
- Forward pass: Single token produces logits in expected range
- Sampling: Top token for common prefixes matches expected continuations
- End-to-end: Generated text is coherent (not garbage or repeated patterns)
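A small harness covers the round-trip and sampling items directly; a minimal sketch, where tokenize and forward_pass follow the signatures used in the snippets above and detokenize is a placeholder name:
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Placeholders for this codebase's own helpers.
int  tokenize(const char *text, int *tokens);
void detokenize(const int *tokens, int n, char *text);
void forward_pass(const int *tokens, int n, float *logits);

void run_checks(void) {
    int toks[64];
    char text[256];
    int n = tokenize("Hello world", toks);
    detokenize(toks, n, text);
    assert(strcmp(text, "Hello world") == 0);   // encode/decode round trip

    static float logits[50257];                 // GPT-2 vocabulary size
    forward_pass(toks, n, logits);
    int best = 0;
    for (int i = 1; i < 50257; i++)             // greedy top token
        if (logits[i] > logits[best]) best = i;
    printf("top token id: %d (compare against the reference model)\n", best);
}
```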
When to Simplify vs. Ask for Alternatives
If proper implementation of a component would exceed constraints:
- Checkpoint format too complex: Ask if weights can be provided in a simpler format (raw binary floats, NumPy arrays)
- Full BPE too large: Ask if a pre-tokenized input is acceptable
- Full transformer too large: Ask about acceptable accuracy tradeoffs (fewer layers, smaller dimensions)
Never implement a non-functional simplification silently. If corners must be cut, communicate clearly which functionality is affected.
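If the raw-binary-floats route is agreed on, loading collapses to one fread per tensor. A minimal sketch, assuming the weights were re-exported beforehand as little-endian float32 in a fixed, known tensor order (that export step and the file naming are assumptions, not part of the TensorFlow checkpoint):
```c
#include <stdio.h>
#include <stdlib.h>

// Read `count` float32 values from a raw dump; returns NULL on failure.
float *load_raw(const char *path, size_t count) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    float *w = malloc(count * sizeof *w);
    if (fread(w, sizeof *w, count, f) != count) { free(w); w = NULL; }
    fclose(f);
    return w;
}
```
For GPT-2 small, for example, the token embedding would then be `load_raw("wte.bin", 50257 * 768)`, with the file name set by whichever export script is agreed on.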
Anti-Patterns to Avoid
- Claiming success without testing - Always run the program with actual inputs before completion
- Ignoring self-identified risks - If a limitation is noted, address it or communicate it
- Premature size optimization - Get it working first, then optimize
- Testing only happy path - Verify error handling and edge cases
- Assuming file format knowledge - Always examine actual files before parsing