# GPU Kernel AKO4ALL

Use this skill to run a disciplined GPU-kernel optimization loop with AKO4ALL as the outer framework and stack-specific bundled references for Triton, CUDA C++/PTX, CUTLASS/CuTe C++, and CuTe DSL.

This is a derivative synthesis, not original material. Every upstream skill or document referenced is copied into this skill under references/ and templates/. Do not go to .claude/skills, .copilot/skills, temporary clones, or source repositories to read the upstream skills. Read the bundled materials. Preserve the attribution in references/source-attribution.md when copying, publishing, or adapting this skill.

## Use This Skill When

- optimizing an existing kernel after a real hotspot has been proven
- writing a new Triton, CUDA C++/PTX, CUTLASS/CuTe C++, or CuTe DSL kernel for an AI-infra path
- creating an AKO4ALL microbench harness for a kernel family
- interpreting nsys or ncu results before changing tiling, memory movement, pipeline, or epilogue structure
- porting a kernel between implementation styles while preserving correctness and performance evidence

Do not start here from a vague "make it faster" request. First establish the target kernel, shape family, dtype/layout contract, hardware, and baseline runtime.
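A minimal sketch of what "establish a baseline runtime" can look like in practice. All names here are hypothetical (the real AKO4ALL harness templates are the bundled ones under templates/), and the pure-Python dot product is only a stand-in so the harness shape is clear without a GPU; on device, the timing would use CUDA events or `triton.testing.do_bench` so kernel work is synchronized before the clock stops.

```python
import statistics
import time

def bench(kernel, args, warmup=5, iters=50):
    """Median wall-clock time in milliseconds for one kernel invocation.

    Hypothetical helper: on a real GPU path, replace perf_counter with
    CUDA-event timing so asynchronous device work is actually measured.
    """
    for _ in range(warmup):          # discard cold-start iterations
        kernel(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        kernel(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Placeholder "kernel": a pure-Python dot product standing in for the
# real Triton/CUDA kernel under test.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Shape family: record one baseline per shape before changing anything.
baseline = {}
for n in (256, 1024, 4096):
    a = [1.0] * n
    b = [2.0] * n
    baseline[n] = bench(dot, (a, b))
```

The point is that the baseline is recorded per shape in the family, against a fixed dtype/layout contract, before any optimization is attempted; later measurements are compared against this table rather than against memory.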

## Mandatory Reference Gate

Read the bundled AKO loop reference before any implementation work.
