Write Compressor

Overview

This skill provides strategies for implementing encoders that produce output compatible with existing decoders. It applies to tasks involving compression algorithms, arithmetic coding, entropy encoding, or any scenario requiring the construction of an encoder as the mathematical inverse of a decoder.

Core Principle: Encoder as Decoder Inverse

When implementing an encoder for an existing decoder, the encoder must be the exact mathematical inverse of the decoder operations. Every decoder operation has a corresponding encoder operation that must produce the exact values the decoder expects to read.

Mathematical Derivation First

Before writing any code:

Document the decoder's state machine - Identify all state variables (e.g., low, high, range, fraction) and how they evolve
Derive encoder operations algebraically - For each decoder read operation, derive what the encoder must write
Verify the inverse relationship on paper - Prove mathematically that encoder output → decoder input produces the original data

Recommended Approach

Phase 1: Understand the Decoder Completely

Read the entire decoder implementation thoroughly
Trace through the decoder manually with simple inputs
Document every state variable and its valid ranges
Identify the bit/byte reading patterns and what values they produce
Map out conditional branches and what triggers each path

Phase 2: Start with Minimal Cases

Build the encoder incrementally, verifying each step:

Zero operations - Encode an empty/minimal input and verify it decodes correctly
Single simple operation - Encode one basic element (e.g., one literal character)
Two operations - Verify state carries correctly between operations
Gradually increase complexity - Only after simpler cases work

Phase 3: Side-by-Side State Verification

Create a verification harness that:

Runs encoder and decoder in parallel (or simulates this)
Compares state variables after every single operation
Immediately identifies the first point of divergence
Prints both encoder and decoder state at each step for debugging

Phase 4: Full Implementation

Only after phases 1-3 succeed, proceed to full file encoding.

Verification Strategies

Unit Testing Individual Components

For arithmetic coding or similar algorithms, test each component independently:

Bit encoding/decoding in isolation
Integer encoding/decoding in isolation
Symbol encoding/decoding in isolation
Back-reference or special token encoding in isolation

Round-Trip Testing

original_data → encoder → compressed → decoder → recovered_data
assert original_data == recovered_data

Run round-trip tests at each complexity level before proceeding.

State Trace Comparison

Build a debugging mode that outputs encoder state at each step. Feed the compressed output to the decoder with similar tracing. Compare traces to find divergence.

Common Pitfalls

1. Renormalization Formula Errors

In arithmetic coding, the renormalization step is critical. The formula for outputting bytes during renormalization must exactly match how the decoder reconstructs the fraction from bytes.

Prevention: Trace through specific numeric examples by hand. If the decoder reads bytes as fraction += read_byte() - 1, derive exactly what the encoder must output.

2. Off-by-One Errors

Common in:

Range calculations
Byte output values (e.g., +1, -1, % 256 adjustments)
Loop bounds for flush/finalization

Prevention: Use concrete numeric examples with known expected outputs.

3. Flushing/Finalization Errors

The final bytes to flush the encoder state are often implemented incorrectly.

Prevention: Test the flush procedure separately with known encoder states.

4. Premature Optimization

Worrying about output size before achieving correctness.

Prevention: First make it work, then make it small. A working 3KB output is infinitely better than a broken 2KB output.

5. Trial-and-Error Implementation

Making random changes to formulas hoping something works.

Prevention: Every change should be justified by mathematical reasoning about why the previous version was wrong and why the new version is correct.

6. Parallel Implementation Attempts

Creating multiple encoder files (encoder.py, encoder2.py, encoder_v3.py) spreads effort thin.

Prevention: Work on one implementation. Use version control to track changes. Debug deeply rather than rewriting from scratch.

Debugging Strategy

When the decoder crashes or produces wrong output:

Identify the first failure point - Where exactly does decoding first go wrong?
Compare states at that point - What did the encoder think the state was vs. what the decoder computed?
Trace backward - Find the operation that caused the divergence
Fix with mathematical justification - Don't just try random changes

For Segmentation Faults in Decoder

A segfault typically means:

Invalid memory access from corrupted indices
The compressed stream is structurally invalid
The encoder produced bytes the decoder interprets as impossible values

Debug by:

Adding bounds checking to the decoder (temporarily)
Printing decoder state before the crash
Identifying what impossible state was reached
Tracing back to what encoder output caused this

Decision Checklist

Before claiming the encoder is complete:

Does the simplest possible input (empty/zero) encode and decode correctly?
Does a single-element input encode and decode correctly?
Have edge cases been tested (empty strings, maximum values, boundary conditions)?
Has a side-by-side state trace been performed for at least one non-trivial input?
Does the full input encode and decode correctly?
If there are size constraints, does the output meet them?

write-compressor