NYC
skills/letta-ai/skills/llm-inference-batching-scheduler

llm-inference-batching-scheduler

SKILL.md

LLM Inference Batching Scheduler

This skill provides guidance for solving LLM inference batching and scheduling optimization problems, where requests must be grouped into batches while minimizing cost, padding waste, and latency.

Problem Understanding

Before implementation, thoroughly analyze the problem structure:

Constraint Analysis

  1. Identify all hard constraints - Extract exact limits for:

    • Maximum unique shapes allowed (e.g., ≤ 8 shapes across all buckets)
    • Latency thresholds (P95, P99)
    • Cost budget thresholds
    • Padding ratio limits
  2. Compute hard bounds early - Before coding, calculate:

    • Minimum possible padding from alignment requirements
    • Minimum number of batches required for coverage
    • Maximum achievable efficiency given constraints
  3. Decompose the cost function - Understand each component:

    • Per-batch overhead (fixed cost per batch)
    • Shape compilation costs (often quadratic in sequence length)
    • Prefill/decode costs (variable per request)
    • Document as: Cost ≈ overhead × num_batches + shape_compile_cost + prefill_cost + decode_cost

Data Analysis

  1. Profile the request distribution - Examine:

    • Distribution of prompt lengths (prompt_len)
    • Distribution of generation lengths (gen_len)
    • Identify outliers that may disproportionately impact metrics
  2. Verify coverage requirements - Ensure:

    • The largest prompt_len in each bucket is covered by chosen shapes
    • Edge cases with extreme gen_len values are handled

Implementation Approach

Build Reusable Evaluation Infrastructure

Before iterating on parameters, create a systematic evaluation harness:

1. Write a function that takes parameters (shape_list, gen_bucket_sizes) and returns all metrics
2. Include automatic constraint verification with assertions
3. Enable rapid parameter comparison without manual re-runs

Parameter Search Strategy

Avoid random trial-and-error. Instead:

  1. Grid search for small parameter spaces - When parameters are bounded (e.g., gen_bucket_size in [15, 50]), systematically evaluate combinations

  2. Binary search for single parameters - When optimizing one parameter while holding others fixed, use binary search to find optimal values

  3. Document the optimization landscape - Track which parameter combinations produce which metric values to understand trade-offs

Shape Selection Guidelines

When selecting shapes for sequence length bucketing:

  1. Analyze the length distribution - Choose shapes that minimize padding for the most common lengths
  2. Consider power-of-two or geometric progressions - These often balance coverage vs. shape count
  3. Account for both buckets jointly - If shapes are shared across buckets, optimize globally not independently

Generation Length Bucketing

The gen_len bucketing parameter significantly impacts padding ratio:

  1. Smaller buckets = Lower padding ratio but more batches (higher cost)
  2. Larger buckets = Fewer batches but higher padding from variance in gen_len
  3. Find the sweet spot by computing the padding budget and working backwards

Verification Strategies

Early Constraint Checks

Immediately after generating output, verify:

1. All request IDs appear exactly once
2. Number of unique shapes ≤ limit
3. Each request's prompt_len ≤ assigned shape
4. No missing shapes that would cause alignment failures

Metric Validation

Before considering a solution complete:

  1. Run the official evaluation script (if provided)
  2. Compare all metrics against all thresholds
  3. Check each bucket independently - passing one bucket does not guarantee passing others

Common Verification Failures

Watch for these issues:

  • Missing shapes that cause coverage gaps (e.g., shape 2048 missing when needed)
  • Single-request batches that waste per-batch overhead
  • Shape constraints violated when optimizing buckets independently

Common Pitfalls

Premature Optimization

  • Mistake: Jumping into implementation before understanding mathematical constraints
  • Fix: Spend time upfront computing exact budgets (e.g., "bucket_1 can tolerate at most 25,735 padding tokens")

Insufficient Cost Analysis

  • Mistake: Not understanding which cost component dominates
  • Fix: Compute and document the full cost breakdown before optimizing

Independent Bucket Optimization

  • Mistake: Optimizing each bucket separately when constraints span both
  • Fix: Consider joint optimization, especially for shared shape constraints

Manual Parameter Tuning Loops

  • Mistake: Repeatedly changing parameters manually and re-running
  • Fix: Write a parameter sweep script that tests multiple combinations automatically

Late Discovery of Constraint Violations

  • Mistake: Only checking constraints after extensive iteration
  • Fix: Add assertion checks immediately after output generation

Ignoring Baseline Implementation

  • Mistake: Not analyzing provided baseline code to understand inefficiencies
  • Fix: Study baseline_packer.py (or equivalent) to identify what specifically makes it suboptimal

Optimization Trade-offs

Document these trade-offs explicitly before optimizing:

Lever Decreases Increases
More shapes Padding ratio Shape compilation cost
Smaller gen buckets Padding ratio Batch count (cost)
Larger batch sizes Per-batch overhead Potential padding
Tighter shape intervals Padding Number of shapes needed

Recommended Workflow

  1. Analyze - Parse problem, identify constraints, compute hard bounds
  2. Instrument - Build evaluation harness with automatic constraint checking
  3. Baseline - Run baseline solution, understand its deficiencies
  4. Decompose - Break cost into components, identify dominant terms
  5. Search - Use systematic parameter search (grid/binary), not random exploration
  6. Verify - Check all constraints and metrics for all buckets
  7. Iterate - If failing, identify which metric is furthest from threshold and focus there
Weekly Installs
22
Repository
letta-ai/skills
First Seen
Jan 24, 2026
Installed on
gemini-cli15
claude-code15
antigravity14
opencode14
codex13
openclaw11