# GPU Kernel AKO4ALL

Use this skill to run a disciplined GPU-kernel optimization loop with AKO4ALL as the outer framework and stack-specific bundled references for Triton, CUDA C++/PTX, CUTLASS/CuTe C++, and CuTe DSL.

This is a derivative synthesis, not original material. Every upstream skill or document referenced is copied into this skill under references/ and templates/. Do not go to .claude/skills, .copilot/skills, temporary clones, or source repositories to read the upstream skills. Read the bundled materials. Preserve the attribution in references/source-attribution.md when copying, publishing, or adapting this skill.

## Use This Skill When

- optimizing an existing kernel after a real hotspot has been proven
- writing a new Triton, CUDA C++/PTX, CUTLASS/CuTe C++, or CuTe DSL kernel for an AI-infra path
- creating an AKO4ALL microbench harness for a kernel family
- interpreting nsys or ncu results before changing tiling, memory movement, pipeline, or epilogue structure
- porting a kernel between implementation styles while preserving correctness and performance evidence

Do not start here from a vague "make it faster" request. First establish the target kernel, shape family, dtype/layout contract, hardware, and baseline runtime.
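A minimal sketch of what "establish a baseline runtime" can look like in practice. All names here are hypothetical (the real AKO4ALL harness templates are the bundled ones under templates/), and the pure-Python dot product is only a stand-in so the harness shape is clear without a GPU; on device, the timing would use CUDA events or `triton.testing.do_bench` so kernel work is synchronized before the clock stops.

```python
import statistics
import time

def bench(kernel, args, warmup=5, iters=50):
    """Median wall-clock time in milliseconds for one kernel invocation.

    Hypothetical helper: on a real GPU path, replace perf_counter with
    CUDA-event timing so asynchronous device work is actually measured.
    """
    for _ in range(warmup):          # discard cold-start iterations
        kernel(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        kernel(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Placeholder "kernel": a pure-Python dot product standing in for the
# real Triton/CUDA kernel under test.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Shape family: record one baseline per shape before changing anything.
baseline = {}
for n in (256, 1024, 4096):
    a = [1.0] * n
    b = [2.0] * n
    baseline[n] = bench(dot, (a, b))
```

The point is that the baseline is recorded per shape in the family, against a fixed dtype/layout contract, before any optimization is attempted; later measurements are compared against this table rather than against memory.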

## Mandatory Reference Gate

Read the bundled AKO loop reference before any implementation work.
