---
name: llm-inference-batching-scheduler
---

# LLM Inference Batching Scheduler
This skill provides guidance for solving LLM inference batching and scheduling optimization problems, where requests must be grouped into batches while minimizing cost, padding waste, and latency.
## Problem Understanding
Before implementation, thoroughly analyze the problem structure:
### Constraint Analysis
- Identify all hard constraints - Extract exact limits for:
  - Maximum unique shapes allowed (e.g., ≤ 8 shapes across all buckets)
  - Latency thresholds (P95, P99)
  - Cost budget thresholds
  - Padding ratio limits
- Compute hard bounds early - Before coding, calculate:
  - Minimum possible padding from alignment requirements
  - Minimum number of batches required for coverage
  - Maximum achievable efficiency given constraints
- Decompose the cost function - Understand each component:
  - Per-batch overhead (fixed cost per batch)
  - Shape compilation costs (often quadratic in sequence length)
  - Prefill/decode costs (variable per request)
  - Document it as: `Cost ≈ overhead × num_batches + shape_compile_cost + prefill_cost + decode_cost`
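A minimal sketch of this decomposition in Python. The constants, the quadratic compile-cost model, and the batch format are illustrative assumptions, not values from any particular problem statement:

```python
# Sketch of the cost decomposition; constants and the quadratic
# compile-cost model are assumptions to replace with the real ones.
OVERHEAD_PER_BATCH = 1.0  # fixed per-batch overhead (assumed)
COMPILE_COEFF = 1e-6      # compile cost assumed quadratic in shape length

def total_cost(batches):
    """batches: list of (shape, requests) pairs, where requests is a list
    of (request_id, prompt_len, gen_len) tuples (format assumed)."""
    overhead = OVERHEAD_PER_BATCH * len(batches)
    compile_cost = COMPILE_COEFF * sum(s * s for s in {shape for shape, _ in batches})
    prefill = sum(shape * len(reqs) for shape, reqs in batches)  # padded prefill
    decode = sum(gen for _, reqs in batches for _, _, gen in reqs)
    return overhead + compile_cost + prefill + decode
```

Once the breakdown is explicit, it becomes obvious which term a given parameter change will move.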
### Data Analysis
- Profile the request distribution (see the sketch after this list) - Examine:
  - The distribution of prompt lengths (prompt_len)
  - The distribution of generation lengths (gen_len)
  - Outliers that may disproportionately impact metrics
- Verify coverage requirements - Ensure:
  - The largest prompt_len in each bucket is covered by the chosen shapes
  - Edge cases with extreme gen_len values are handled
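A sketch of such profiling using only the standard library; the (request_id, prompt_len, gen_len) tuple layout is an assumption to adapt to the real input format:

```python
import statistics

def profile_lengths(requests):
    """Print percentile summaries for prompt and generation lengths.
    Assumes each request is a (request_id, prompt_len, gen_len) tuple."""
    for name, idx in (("prompt_len", 1), ("gen_len", 2)):
        values = sorted(r[idx] for r in requests)
        cuts = statistics.quantiles(values, n=100)  # 99 cut points: cuts[49] ~ p50
        print(f"{name}: min={values[0]} p50={cuts[49]} p95={cuts[94]} "
              f"p99={cuts[98]} max={values[-1]}")
```

The max values also tell you immediately whether a candidate shape set can cover every prompt.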
## Implementation Approach
### Build Reusable Evaluation Infrastructure
Before iterating on parameters, create a systematic evaluation harness:
1. Write a function that takes parameters (shape_list, gen_bucket_sizes) and returns all metrics
2. Include automatic constraint verification with assertions
3. Enable rapid parameter comparison without manual re-runs
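A sketch of such a harness. The packer is passed in as a function, the batch format matches the cost sketch above, and MAX_SHAPES is a placeholder for the actual limit:

```python
MAX_SHAPES = 8  # placeholder; use the limit from the problem statement

def evaluate(requests, shape_list, gen_bucket_sizes, pack_fn):
    """Run one parameter combination and return all metrics.

    pack_fn(requests, shape_list, gen_bucket_sizes) is your packer; it is
    assumed to return a list of (shape, [(request_id, prompt_len, gen_len), ...]).
    """
    batches = pack_fn(requests, shape_list, gen_bucket_sizes)

    # Automatic constraint verification: fail fast before reporting metrics.
    packed = sorted(rid for _, reqs in batches for rid, _, _ in reqs)
    assert packed == sorted(rid for rid, _, _ in requests), "coverage violated"
    assert len({shape for shape, _ in batches}) <= MAX_SHAPES, "shape limit exceeded"

    padded = sum(shape * len(reqs) for shape, reqs in batches)
    real = sum(p for _, reqs in batches for _, p, _ in reqs)
    return {
        "num_batches": len(batches),
        "padding_ratio": 1.0 - real / padded,
        "cost": total_cost(batches),  # from the cost sketch above
    }
```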
### Parameter Search Strategy
Avoid random trial-and-error. Instead:
- Grid search for small parameter spaces - When parameters are bounded (e.g., gen_bucket_size in [15, 50]), systematically evaluate all combinations (see the sweep sketch after this list)
- Binary search for single parameters - When optimizing one parameter while holding the others fixed, use binary search to find the optimal value
- Document the optimization landscape - Track which parameter combinations produce which metric values to understand the trade-offs
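For example, a grid sweep built on the harness above. The candidate grids are illustrative, and requests and my_packer stand in for your data and packing routine:

```python
import itertools

# Illustrative candidate grids; derive the real ones from the length analysis.
shape_candidates = [
    (128, 256, 512, 1024, 2048),
    (128, 384, 768, 1536, 2048),
]
gen_bucket_candidates = [15, 20, 25, 32, 40, 50]

results = []
for shapes, gen_bucket in itertools.product(shape_candidates, gen_bucket_candidates):
    try:
        m = evaluate(requests, shapes, gen_bucket, pack_fn=my_packer)
    except AssertionError:
        continue  # infeasible combination; record nothing and move on
    results.append((m["cost"], m["padding_ratio"], shapes, gen_bucket))

for row in sorted(results)[:5]:  # five cheapest feasible combinations
    print(row)
```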
### Shape Selection Guidelines
When selecting shapes for sequence length bucketing:
- Analyze the length distribution - Choose shapes that minimize padding for the most common lengths (see the sketch after this list)
- Consider power-of-two or geometric progressions - These often balance coverage against shape count
- Account for both buckets jointly - If shapes are shared across buckets, optimize globally, not independently
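A sketch for scoring candidate shape sets against the observed prompt lengths; it assumes the largest shape covers the longest prompt:

```python
import bisect

def padding_for_shapes(prompt_lens, shapes):
    """Total padding tokens when each prompt is rounded up to the smallest
    covering shape. Assumes max(shapes) >= max(prompt_lens)."""
    shapes = sorted(shapes)
    total = 0
    for p in prompt_lens:
        covering = shapes[bisect.bisect_left(shapes, p)]  # smallest shape >= p
        total += covering - p
    return total
```

Scoring a power-of-two ladder against a few quantile-derived sets with this function usually reveals which family fits the distribution.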
### Generation Length Bucketing
The gen_len bucketing parameter significantly impacts padding ratio:
- Smaller buckets = lower padding ratio but more batches (higher cost)
- Larger buckets = fewer batches but higher padding from variance in gen_len
- Find the sweet spot by computing the padding budget and working backwards, as in the sketch below
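A sketch of working backwards from the budget: compute the gen-side padding at each candidate bucket size and keep the largest bucket that stays within budget. gen_lens and PADDING_BUDGET are placeholder data:

```python
def gen_padding(gen_lens, bucket_size):
    """Padding from rounding each gen_len up to the next bucket boundary."""
    return sum(-g % bucket_size for g in gen_lens)  # -g % b == ceil(g/b)*b - g

# Placeholder data; substitute your real gen_len list and padding budget.
gen_lens = [12, 47, 33, 90, 18]
PADDING_BUDGET = 100

feasible = [b for b in range(15, 51, 5) if gen_padding(gen_lens, b) <= PADDING_BUDGET]
print(max(feasible) if feasible else "no bucket size fits the budget")
```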
## Verification Strategies

### Early Constraint Checks
Immediately after generating output, verify:
1. All request IDs appear exactly once
2. Number of unique shapes ≤ limit
3. Each request's prompt_len ≤ assigned shape
4. No missing shapes that would cause alignment failures
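A sketch of checks 1-3 as assertions, using the same batch and request formats assumed above; run it the moment the output is generated:

```python
from collections import Counter

def verify_output(batches, requests, max_shapes):
    """Hard-constraint checks to run immediately after producing output."""
    # 1. Every request ID appears exactly once.
    ids = Counter(rid for _, reqs in batches for rid, _, _ in reqs)
    assert all(c == 1 for c in ids.values()), "duplicate request IDs"
    assert set(ids) == {rid for rid, _, _ in requests}, "missing request IDs"
    # 2. The number of unique shapes is within the limit.
    shapes = {shape for shape, _ in batches}
    assert len(shapes) <= max_shapes, f"{len(shapes)} shapes exceeds {max_shapes}"
    # 3. Every request fits its assigned shape.
    for shape, reqs in batches:
        for rid, prompt_len, _ in reqs:
            assert prompt_len <= shape, f"request {rid} overflows shape {shape}"
    # 4. Shape coverage/alignment rules are problem-specific; add them here.
```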
### Metric Validation
Before considering a solution complete:
- Run the official evaluation script (if provided)
- Compare all metrics against all thresholds
- Check each bucket independently - passing one bucket does not guarantee passing others
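A sketch of a per-bucket check; the threshold names and values are placeholders for the limits in the actual problem statement:

```python
# Placeholder thresholds; substitute the real limits.
THRESHOLDS = {"padding_ratio": 0.15, "cost": 1_000_000}

def check_all_buckets(metrics_by_bucket):
    """metrics_by_bucket: {bucket_name: {metric_name: value}} (format assumed)."""
    all_ok = True
    for bucket, metrics in sorted(metrics_by_bucket.items()):
        for name, limit in THRESHOLDS.items():
            ok = metrics[name] <= limit
            all_ok = all_ok and ok
            print(f"{bucket}.{name}: {metrics[name]} (limit {limit}) "
                  f"{'OK' if ok else 'FAIL'}")
    return all_ok
```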
### Common Verification Failures
Watch for these issues:
- Missing shapes that cause coverage gaps (e.g., shape 2048 missing when needed)
- Single-request batches that waste per-batch overhead
- Shape constraints violated when optimizing buckets independently
## Common Pitfalls

### Premature Optimization
- Mistake: Jumping into implementation before understanding the mathematical constraints
- Fix: Spend time upfront computing exact budgets (e.g., "bucket_1 can tolerate at most 25,735 padding tokens")

### Insufficient Cost Analysis
- Mistake: Not understanding which cost component dominates
- Fix: Compute and document the full cost breakdown before optimizing

### Independent Bucket Optimization
- Mistake: Optimizing each bucket separately when constraints span both
- Fix: Consider joint optimization, especially for shared shape constraints

### Manual Parameter Tuning Loops
- Mistake: Repeatedly changing parameters by hand and re-running
- Fix: Write a parameter sweep script that tests multiple combinations automatically

### Late Discovery of Constraint Violations
- Mistake: Only checking constraints after extensive iteration
- Fix: Add assertion checks immediately after output generation

### Ignoring the Baseline Implementation
- Mistake: Not analyzing the provided baseline code to understand its inefficiencies
- Fix: Study baseline_packer.py (or its equivalent) to identify what specifically makes it suboptimal
## Optimization Trade-offs
Document these trade-offs explicitly before optimizing:
| Lever | Decreases | Increases |
|---|---|---|
| More shapes | Padding ratio | Shape compilation cost |
| Smaller gen buckets | Padding ratio | Batch count (cost) |
| Larger batch sizes | Per-batch overhead | Potential padding |
| Tighter shape intervals | Padding | Number of shapes needed |
## Recommended Workflow

1. Analyze - Parse the problem, identify the constraints, compute hard bounds
2. Instrument - Build an evaluation harness with automatic constraint checking
3. Baseline - Run the baseline solution and understand its deficiencies
4. Decompose - Break the cost into components and identify the dominant terms
5. Search - Use systematic parameter search (grid/binary), not random exploration
6. Verify - Check all constraints and metrics for all buckets
7. Iterate - If failing, identify which metric is furthest from its threshold and focus there