h100
Overview
Use this skill for SGLang development on the H100 box, reached through the h100_sglang SSH host.
The default container is sglang_bbuf and the repo lives at /sgl-workspace/sglang.
Prefer it whenever local validation is insufficient for CUDA, Triton, diffusion pipelines, or other GPU-backed SGLang behavior.
This environment is already prepared:
- sglang_bbuf is running on lmsysorg/sglang:dev
- the repo is cloned at /sgl-workspace/sglang
- editable installs for python[all] and python[diffusion] are already done
- /root/.cache is mounted as the cache path
- Infiniband paths are mounted into the container for RDMA-aware workflows: /sys/class/infiniband, /dev/infiniband, and /usr/sbin/show_gids
Hugging Face cache is already mounted, but do not assume HF_TOKEN is visible in
every docker exec context. Interactive shells and non-interactive docker exec ... bash -lc "<cmd>" can behave differently. Always verify with
echo ${HF_TOKEN:+set} before gated-model or Hub-backed runs.
Quick Start
- Check the host, container, and GPU state.
ssh h100_sglang 'hostname && whoami'
ssh h100_sglang 'docker ps --format "table {{.Names}}\t{{.Status}}" | sed -n "1,20p"'
ssh h100_sglang 'nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits'
- Enter the container and repo.
ssh h100_sglang 'docker exec -it sglang_bbuf /bin/zsh'
cd /sgl-workspace/sglang
echo ${HF_TOKEN:+set}
If HF_TOKEN is unexpectedly missing in the current shell, export it manually before Hub-backed workflows:
export HF_TOKEN=<your-hf-token>
export HUGGINGFACE_HUB_TOKEN="$HF_TOKEN"
For non-interactive docker exec ... bash -lc "<cmd>" runs, prefer exporting both
variables inside the command itself instead of assuming the shell startup path
will populate them.
- Pick a free GPU.
Use a GPU with 0 utilization and only a few MiB allocated.
Set CUDA_VISIBLE_DEVICES=<gpu_id> for every GPU-backed validation command.
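As a quick sketch for shortlisting candidates (the memory threshold is illustrative; confirm against the full nvidia-smi output before pinning):
ssh h100_sglang "nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits | awk -F', ' '\$2==0 && \$3<100 {print \$1}'"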
- This host currently does not provide the kill-idle helper.
Do not assume you can reclaim other users' idle allocations automatically.
If the free GPU list is tight, re-check nvidia-smi, choose another GPU, or coordinate before proceeding.
- If the container is not running, start it first.
ssh h100_sglang 'docker start sglang_bbuf'
Safe Remote Workflow
- Inspect the default repo before editing it.
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /sgl-workspace/sglang && git branch --show-current && git status --short"'
- Fast-forward /sgl-workspace/sglang to the latest clean main before creating any validation worktree.
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /sgl-workspace/sglang && git fetch origin && git checkout main && git pull --ff-only origin main"'
- Avoid writing directly into /sgl-workspace/sglang when it is dirty or when the local snapshot differs from the remote HEAD.
- Prefer one of these isolation strategies.
Create a detached worktree for remote-only experiments:
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /sgl-workspace/sglang && git worktree add --detach /tmp/sglang_validate_h100 HEAD"'
Stream the exact local working tree into the container when validating the current local snapshot:
COPYFILE_DISABLE=1 tar --exclude=.git -cf - . | \
ssh h100_sglang 'docker exec -i sglang_bbuf sh -lc "rm -rf /tmp/sglang_local_validate && mkdir -p /tmp/sglang_local_validate && tar -xf - -C /tmp/sglang_local_validate"'
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "find /tmp/sglang_local_validate -name '\''._*'\'' -delete"'
Use the streamed copy when the goal is "validate exactly what is in the local repo right now". For patch-oriented remote validation, another good option is:
- update remote main
- create a detached worktree from that clean commit
- stream or apply a focused local patch diff into the worktree only
That keeps /sgl-workspace/sglang clean while still validating the exact local delta, as sketched below.
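A hedged sketch of that patch flow (the patch filename is illustrative):
git diff > /tmp/sglang_local.patch
scp /tmp/sglang_local.patch h100_sglang:/tmp/sglang_local.patch
ssh h100_sglang 'docker cp /tmp/sglang_local.patch sglang_bbuf:/tmp/sglang_local.patch'
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_validate_h100 && git apply --stat /tmp/sglang_local.patch && git apply /tmp/sglang_local.patch"'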
Validation Workflow
- Start with import or syntax-level checks.
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_local_validate && python -m compileall python/sglang"'
For diffusion-specific edits, prefer a narrower first pass:
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_local_validate && python -m compileall python/sglang/jit_kernel/diffusion/triton python/sglang/multimodal_gen/runtime/layers"'
- Run targeted tests for the changed area.
ssh h100_sglang 'docker exec sglang_bbuf env PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && pytest -q path/to/test.py"'
For diffusion changes, start with the fused modulation regression:
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && pytest -q python/sglang/jit_kernel/tests/diffusion/test_qwen_image_modulation.py -q"'
- For GPU-backed changes, pin a free GPU explicitly.
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && pytest -q path/to/gpu_test.py -q"'
- For kernel-heavy diffusion work, run a targeted smoke script for the changed primitives before attempting a model-level run.
Cover at least these when relevant: rms_norm_fn, RMSNorm under torch.compile, norm_infer, apply_rotary_embedding.
Pipe the script through docker exec -i ... python for pure kernel smoke, as in the sketch below.
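One way to drive it, as a minimal sketch; the Python body is a placeholder, substitute calls to the changed primitives:
cat <<'EOF' | ssh h100_sglang 'docker exec -i sglang_bbuf env CUDA_VISIBLE_DEVICES=0 python -'
import torch
# placeholder workload; replace with the changed primitives, e.g. rms_norm_fn
x = torch.randn(8, 64, device="cuda")
print("smoke ok:", x.pow(2).mean(-1).sqrt().shape)
EOF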
- Use a real .py file with if __name__ == "__main__": when calling DiffGenerator.from_pretrained(..., local_mode=True) or any flow that relies on multiprocessing.spawn.
multiprocessing.spawn will fail if the script is executed from stdin or from unguarded top-level code.
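A minimal sketch of the guard; the import path and model path here are assumptions, adjust them to the real module:
# /tmp/diffgen_smoke.py
from sglang.multimodal_gen import DiffGenerator  # hypothetical import path

def main():
    # local_mode=True per the flow above; the model path is a placeholder
    gen = DiffGenerator.from_pretrained("<model-path>", local_mode=True)
    print("generator ready:", type(gen).__name__)

if __name__ == "__main__":
    main()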
- Attempt model-level or server-level smoke only after unit, kernel, or targeted regression checks pass.
Treat checkpoint, dependency, and environment failures separately from code regressions.
If a workflow reads from Hugging Face Hub, verify HF_TOKEN first and re-export it
explicitly in the current shell or command when needed.
Torch Compile Attribution
When a benchmark compares eager vs torch.compile, do not stop at the speedup number.
Capture matching eager and compile perf dumps or profile dirs. Compare structured perf dumps with:
python python/sglang/multimodal_gen/benchmarks/compare_perf.py eager.json compile.json
Then use llm-torch-profiler-analysis on the matching profile dirs to explain whether the gain came from fewer launches, fewer copies, or fused kernels replacing eager ATen ops.
Cleanup
Remove temporary validation directories when finished.
ssh h100_sglang 'docker exec sglang_bbuf rm -rf /tmp/sglang_local_validate /tmp/sglang_validate_h100'
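If a detached worktree was registered, also prune its stale entry in the main repo, since rm -rf alone leaves the registration behind:
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /sgl-workspace/sglang && git worktree prune"'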