---
name: llm-inference-batching-scheduler
---

# LLM Inference Batching Scheduler
This skill provides guidance for solving LLM inference batching and scheduling optimization problems, where requests must be grouped into batches while minimizing cost, padding waste, and latency.
## Problem Understanding
Before implementation, thoroughly analyze the problem structure:
### Constraint Analysis
- Identify all hard constraints - Extract exact limits for:
  - Maximum unique shapes allowed (e.g., ≤ 8 shapes across all buckets)
  - Latency thresholds (P95, P99)
  - Cost budget thresholds
  - Padding ratio limits
- Compute hard bounds early - Before coding, calculate:
  - Minimum possible padding from alignment requirements
  - Minimum number of batches required for coverage
  - Maximum achievable efficiency given constraints
- Decompose the cost function - Understand each component:
  - Per-batch overhead (fixed cost per batch)
  - Shape compilation costs (often quadratic in sequence length)
  - Prefill/decode costs (variable per request)
  - Document it as: `Cost ≈ overhead × num_batches + shape_compile_cost + prefill_cost + decode_cost`
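A minimal sketch of this decomposition in Python. The constants, the quadratic compile-cost model, and the batch format are illustrative assumptions, not values from any particular problem statement:

```python
# Sketch of the cost decomposition; constants and the quadratic
# compile-cost model are assumptions to replace with the real ones.
OVERHEAD_PER_BATCH = 1.0  # fixed per-batch overhead (assumed)
COMPILE_COEFF = 1e-6      # compile cost assumed quadratic in shape length

def total_cost(batches):
    """batches: list of (shape, requests) pairs, where requests is a list
    of (request_id, prompt_len, gen_len) tuples (format assumed)."""
    overhead = OVERHEAD_PER_BATCH * len(batches)
    compile_cost = COMPILE_COEFF * sum(s * s for s in {shape for shape, _ in batches})
    prefill = sum(shape * len(reqs) for shape, reqs in batches)  # padded prefill
    decode = sum(gen for _, reqs in batches for _, _, gen in reqs)
    return overhead + compile_cost + prefill + decode
```

Once the breakdown is explicit, it becomes obvious which term a given parameter change will move.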
### Data Analysis
- Profile the request distribution (see the sketch after this list) - Examine:
  - The distribution of prompt lengths (prompt_len)
  - The distribution of generation lengths (gen_len)
  - Outliers that may disproportionately impact metrics
- Verify coverage requirements - Ensure:
  - The largest prompt_len in each bucket is covered by the chosen shapes
  - Edge cases with extreme gen_len values are handled
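A sketch of such profiling using only the standard library; the (request_id, prompt_len, gen_len) tuple layout is an assumption to adapt to the real input format:

```python
import statistics

def profile_lengths(requests):
    """Print percentile summaries for prompt and generation lengths.
    Assumes each request is a (request_id, prompt_len, gen_len) tuple."""
    for name, idx in (("prompt_len", 1), ("gen_len", 2)):
        values = sorted(r[idx] for r in requests)
        cuts = statistics.quantiles(values, n=100)  # 99 cut points: cuts[49] ~ p50
        print(f"{name}: min={values[0]} p50={cuts[49]} p95={cuts[94]} "
              f"p99={cuts[98]} max={values[-1]}")
```

The max values also tell you immediately whether a candidate shape set can cover every prompt.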
## Implementation Approach
### Build Reusable Evaluation Infrastructure
Before iterating on parameters, create a systematic evaluation harness:
1. Write a function that takes parameters (shape_list, gen_bucket_sizes) and returns all metrics
2. Include automatic constraint verification with assertions
3. Enable rapid parameter comparison without manual re-runs
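A sketch of such a harness. The packer is passed in as a function, the batch format matches the cost sketch above, and MAX_SHAPES is a placeholder for the actual limit:

```python
MAX_SHAPES = 8  # placeholder; use the limit from the problem statement

def evaluate(requests, shape_list, gen_bucket_sizes, pack_fn):
    """Run one parameter combination and return all metrics.

    pack_fn(requests, shape_list, gen_bucket_sizes) is your packer; it is
    assumed to return a list of (shape, [(request_id, prompt_len, gen_len), ...]).
    """
    batches = pack_fn(requests, shape_list, gen_bucket_sizes)

    # Automatic constraint verification: fail fast before reporting metrics.
    packed = sorted(rid for _, reqs in batches for rid, _, _ in reqs)
    assert packed == sorted(rid for rid, _, _ in requests), "coverage violated"
    assert len({shape for shape, _ in batches}) <= MAX_SHAPES, "shape limit exceeded"

    padded = sum(shape * len(reqs) for shape, reqs in batches)
    real = sum(p for _, reqs in batches for _, p, _ in reqs)
    return {
        "num_batches": len(batches),
        "padding_ratio": 1.0 - real / padded,
        "cost": total_cost(batches),  # from the cost sketch above
    }
```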
### Parameter Search Strategy
Avoid random trial-and-error. Instead:
- Grid search for small parameter spaces - When parameters are bounded (e.g., gen_bucket_size in [15, 50]), systematically evaluate all combinations (see the sweep sketch after this list)
- Binary search for single parameters - When optimizing one parameter while holding the others fixed, use binary search to find the optimal value
- Document the optimization landscape - Track which parameter combinations produce which metric values to understand the trade-offs
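For example, a grid sweep built on the harness above. The candidate grids are illustrative, and requests and my_packer stand in for your data and packing routine:

```python
import itertools

# Illustrative candidate grids; derive the real ones from the length analysis.
shape_candidates = [
    (128, 256, 512, 1024, 2048),
    (128, 384, 768, 1536, 2048),
]
gen_bucket_candidates = [15, 20, 25, 32, 40, 50]

results = []
for shapes, gen_bucket in itertools.product(shape_candidates, gen_bucket_candidates):
    try:
        m = evaluate(requests, shapes, gen_bucket, pack_fn=my_packer)
    except AssertionError:
        continue  # infeasible combination; record nothing and move on
    results.append((m["cost"], m["padding_ratio"], shapes, gen_bucket))

for row in sorted(results)[:5]:  # five cheapest feasible combinations
    print(row)
```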
### Shape Selection Guidelines
When selecting shapes for sequence length bucketing:
- Analyze the length distribution - Choose shapes that minimize padding for the most common lengths (see the sketch after this list)
- Consider power-of-two or geometric progressions - These often balance coverage against shape count
- Account for both buckets jointly - If shapes are shared across buckets, optimize globally, not independently
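A sketch for scoring candidate shape sets against the observed prompt lengths; it assumes the largest shape covers the longest prompt:

```python
import bisect

def padding_for_shapes(prompt_lens, shapes):
    """Total padding tokens when each prompt is rounded up to the smallest
    covering shape. Assumes max(shapes) >= max(prompt_lens)."""
    shapes = sorted(shapes)
    total = 0
    for p in prompt_lens:
        covering = shapes[bisect.bisect_left(shapes, p)]  # smallest shape >= p
        total += covering - p
    return total
```

Scoring a power-of-two ladder against a few quantile-derived sets with this function usually reveals which family fits the distribution.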
### Generation Length Bucketing
The gen_len bucketing parameter significantly impacts padding ratio:
- Smaller buckets = lower padding ratio but more batches (higher cost)
- Larger buckets = fewer batches but higher padding from variance in gen_len
- Find the sweet spot by computing the padding budget and working backwards, as in the sketch below
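A sketch of working backwards from the budget: compute the gen-side padding at each candidate bucket size and keep the largest bucket that stays within budget. gen_lens and PADDING_BUDGET are placeholder data:

```python
def gen_padding(gen_lens, bucket_size):
    """Padding from rounding each gen_len up to the next bucket boundary."""
    return sum(-g % bucket_size for g in gen_lens)  # -g % b == ceil(g/b)*b - g

# Placeholder data; substitute your real gen_len list and padding budget.
gen_lens = [12, 47, 33, 90, 18]
PADDING_BUDGET = 100

feasible = [b for b in range(15, 51, 5) if gen_padding(gen_lens, b) <= PADDING_BUDGET]
print(max(feasible) if feasible else "no bucket size fits the budget")
```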
## Verification Strategies

### Early Constraint Checks
Immediately after generating output, verify:
1. All request IDs appear exactly once
2. Number of unique shapes ≤ limit
3. Each request's prompt_len ≤ assigned shape
4. No missing shapes that would cause alignment failures
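A sketch of checks 1-3 as assertions, using the same batch and request formats assumed above; run it the moment the output is generated:

```python
from collections import Counter

def verify_output(batches, requests, max_shapes):
    """Hard-constraint checks to run immediately after producing output."""
    # 1. Every request ID appears exactly once.
    ids = Counter(rid for _, reqs in batches for rid, _, _ in reqs)
    assert all(c == 1 for c in ids.values()), "duplicate request IDs"
    assert set(ids) == {rid for rid, _, _ in requests}, "missing request IDs"
    # 2. The number of unique shapes is within the limit.
    shapes = {shape for shape, _ in batches}
    assert len(shapes) <= max_shapes, f"{len(shapes)} shapes exceeds {max_shapes}"
    # 3. Every request fits its assigned shape.
    for shape, reqs in batches:
        for rid, prompt_len, _ in reqs:
            assert prompt_len <= shape, f"request {rid} overflows shape {shape}"
    # 4. Shape coverage/alignment rules are problem-specific; add them here.
```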
### Metric Validation
Before considering a solution complete:
- Run the official evaluation script (if provided)
- Compare all metrics against all thresholds
- Check each bucket independently - passing one bucket does not guarantee passing others
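A sketch of a per-bucket check; the threshold names and values are placeholders for the limits in the actual problem statement:

```python
# Placeholder thresholds; substitute the real limits.
THRESHOLDS = {"padding_ratio": 0.15, "cost": 1_000_000}

def check_all_buckets(metrics_by_bucket):
    """metrics_by_bucket: {bucket_name: {metric_name: value}} (format assumed)."""
    all_ok = True
    for bucket, metrics in sorted(metrics_by_bucket.items()):
        for name, limit in THRESHOLDS.items():
            ok = metrics[name] <= limit
            all_ok = all_ok and ok
            print(f"{bucket}.{name}: {metrics[name]} (limit {limit}) "
                  f"{'OK' if ok else 'FAIL'}")
    return all_ok
```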
### Common Verification Failures
Watch for these issues:
- Missing shapes that cause coverage gaps (e.g., shape 2048 missing when needed)
- Single-request batches that waste per-batch overhead
- Shape constraints violated when optimizing buckets independently
## Common Pitfalls

### Premature Optimization
- Mistake: Jumping into implementation before understanding the mathematical constraints
- Fix: Spend time upfront computing exact budgets (e.g., "bucket_1 can tolerate at most 25,735 padding tokens")

### Insufficient Cost Analysis
- Mistake: Not understanding which cost component dominates
- Fix: Compute and document the full cost breakdown before optimizing

### Independent Bucket Optimization
- Mistake: Optimizing each bucket separately when constraints span both
- Fix: Consider joint optimization, especially for shared shape constraints

### Manual Parameter Tuning Loops
- Mistake: Repeatedly changing parameters by hand and re-running
- Fix: Write a parameter sweep script that tests multiple combinations automatically

### Late Discovery of Constraint Violations
- Mistake: Only checking constraints after extensive iteration
- Fix: Add assertion checks immediately after output generation

### Ignoring the Baseline Implementation
- Mistake: Not analyzing the provided baseline code to understand its inefficiencies
- Fix: Study baseline_packer.py (or its equivalent) to identify what specifically makes it suboptimal
## Optimization Trade-offs
Document these trade-offs explicitly before optimizing:
| Lever | Decreases | Increases |
|---|---|---|
| More shapes | Padding ratio | Shape compilation cost |
| Smaller gen buckets | Padding ratio | Batch count (cost) |
| Larger batch sizes | Per-batch overhead | Potential padding |
| Tighter shape intervals | Padding | Number of shapes needed |
## Recommended Workflow

1. Analyze - Parse the problem, identify the constraints, compute hard bounds
2. Instrument - Build an evaluation harness with automatic constraint checking
3. Baseline - Run the baseline solution and understand its deficiencies
4. Decompose - Break the cost into components and identify the dominant terms
5. Search - Use systematic parameter search (grid/binary), not random exploration
6. Verify - Check all constraints and metrics for all buckets
7. Iterate - If failing, identify which metric is furthest from its threshold and focus there