Optimization Catalog Router

Use this skill as a dispatcher. The shared root knowledge base remains the canonical location for algorithmic patterns that transfer across implementation languages. Language-specific overlays capture implementation details tied to a specific DSL or programming model.

Load Order

Read the shared root knowledge under docs/knowledge/optimizations/ and docs/knowledge/anti-patterns/ when the pattern is algorithmic.
Load the language-specific optimization catalog skill for the chosen implementation language.
Read language overlays in docs/knowledge/languages/<language>/... only when the implementation details depend on that surface.

Language-Specific Catalog Skills

Language key	Load this skill	Language-specific knowledge root
`cutile-dsl`	`/optimization-catalog-cutile-dsl`	`docs/knowledge/languages/cutile-dsl/`
`cute-dsl`	`/optimization-catalog-cute-dsl`	`docs/knowledge/languages/cute-dsl/`

Classification Rule

Shared root catalogs are for algorithmic patterns, workload-shape rules, and anti-patterns that transfer across DSLs.
Language overlays are for implementation details that depend on a specific API, compiler behavior, code-generation surface, or scheduling model.
When in doubt, keep the reusable mechanism in the shared root and add a language-specific overlay only for the implementation surface.

Current Migration State

The shared root catalogs in docs/knowledge/optimizations/ and docs/knowledge/anti-patterns/ remain the compatibility-preserving baseline.
New language-specific overlays are being added under docs/knowledge/languages/.
Existing references to the legacy root paths continue to work during the migration.

Knowledge Base Principles

The goal of this catalog is to build reusable cuTile kernel design knowledge, not to accumulate device folklore.

Every entry must therefore be written as:

a pattern or anti-pattern that can transfer across kernels,
with a clear context describing when it applies,
with explicit performance metrics affected,
with evidence separated into local validation, and generalized takeaway when relevant.

Device and architecture facts are encouraged when they sharpen the rule:

architecture families such as Blackwell or Hopper,
architecture capabilities such as Tensor Cores, TMA, thread block clusters, scheduler behavior, or shared-memory limits,
device-level specs such as peak FP16/BF16 TFLOP/s, peak memory bandwidth, ridge point, SM count, registers/SM, or SMEM capacity.

But those facts must be converted into reusable guidance. Entries should avoid collapsing into "kernel X on device Y liked config Z" unless that observation is only being used as evidence for a broader pattern.

How This Catalog Works

Index below maps trigger conditions → optimization → detail file
Orchestrator reads index to pick optimizations based on profiling results
Kernel designer reads detail files to implement specific optimizations
New optimizations: create detail file in docs/knowledge/optimizations/ + add row to index
Failed optimizations: create anti-pattern in docs/knowledge/anti-patterns/ + add row below

Optimization Index

ID	Optimization	Trigger Conditions	Est. Impact	Pitfalls	Detail File
O1	Split-KV proportional block allocation	Dual-path kernel has strongly unbalanced latent vs decompressed work; equal block split leaves one path starved	Very High	Requires a reduction kernel and intermediate tensors	`docs/knowledge/optimizations/split-kv.md`
O2	Head grouping for shared-operand reuse	Same operand is reused across heads (for example latent `Z`) and `Bh=1` produces skinny per-head MMAs or near-zero TC utilization	Very High	Raises register pressure; not appropriate for per-head-only operands like decompressed `K/V`	`docs/knowledge/optimizations/head-grouping.md`
O3	Independent path scheduling	After Split-KV, the latent and decompressed paths still want different streaming granularities or load scheduling	Low-Med	Gains are modest if register pressure and occupancy do not improve	`docs/knowledge/optimizations/independent-tiles.md`
O4	Swizzle scheduling	Tiled kernel reuses one operand across neighboring CTAs and block order measurably affects L2 locality	Low	Compute-bound kernels may see only marginal benefit; 1D remap adds index overhead	`docs/knowledge/optimizations/swizzle.md`
O5	Latency hints	Load/compute overlap tuning is needed after the kernel structure is already sound, and the per-path DRAM pressure differs enough that compiler guidance may help scheduling	Low	Difficult to isolate; hints are suggestions rather than guarantees	`docs/knowledge/optimizations/latency-hints.md`
O6	CGA thread block clusters	Hopper+ kernel has genuine cross-CTA data sharing opportunity or needs to reason correctly about `num_ctas` semantics and cluster launch behavior	Medium	Hardware-specific and easy to misuse when there is no real sharing benefit	`docs/knowledge/optimizations/cga.md`
O7	Fast math for online softmax	Online-softmax loop is heavy on `exp2`/division, larger tiles collapse compute throughput, and reduced-precision inference math is acceptable	High	Not a substitute for sane tiling; gains shrink if spills or bandwidth dominate	`docs/knowledge/optimizations/fast-math-online-softmax.md`
O8	Causal K-loop split	Causal attention has many fully unmasked K-tiles, only diagonal/tail tiles need mask logic, and future tiles can be skipped	Medium-High	Can be neutral or negative if masking is no longer material on the target bottleneck	`docs/knowledge/optimizations/causal-k-loop-split.md`
O9	Causal ProgramId remap	Triangular work causes wave-tail imbalance across CTAs and a simple launch-order reversal is available	Low-Med	Small gain if work distribution is already uniform or the grid is tiny	`docs/knowledge/optimizations/causal-program-id-remap.md`
O10	Adaptive tile and occupancy autotuning	Best tile/occupancy point changes with sequence length or device envelope; fixed manual choice underfills short shapes or overcommits long ones	High	Search space can explode if not curated; cannot rescue a structurally broken kernel	`docs/knowledge/optimizations/adaptive-tile-autotuning.md`
O11	Pipeline-driven low-occupancy scheduling	Large unavoidable per-CTA state forces one CTA/SM or very low occupancy, and the kernel must win via overlap and scheduling rather than more resident warps	High	Requires architecture/DSL support for fine-grained scheduling; cannot be faked with a monolithic CTA	`docs/knowledge/optimizations/pipeline-driven-low-occupancy.md`

Anti-Pattern Index

ID	Anti-Pattern	Failure Mode	Source	Detail File
A1	Equal block split across heterogeneous paths	50/50 block allocation for unequal latent vs decompressed work produces sequential bottlenecks instead of overlap	MLA-var6+ V0	`docs/knowledge/anti-patterns/equal-block-split.md`
A2	`TILE_M=1` combine/decompression	Single-row combine GEMMs disable Tensor Cores and leave a persistent tail at small `s`	MLA-var6+ V0/V2/V3	`docs/knowledge/anti-patterns/tile-m-one-combine.md`
A3	Underfilled persistent concurrency	Resident block budgets below the SM count leave the GPU idle and make overlap ineffective	MLA-var6+ V4	`docs/knowledge/anti-patterns/underfilled-persistent.md`
A4	Register-ceiling persistent stage	Persistent kernel lands at the register ceiling (`255` regs/thread), collapses occupancy, and becomes latency-bound	MLA-var6+ V4	`docs/knowledge/anti-patterns/spill-heavy-persistent.md`
A5	Blind large-tile port	Copying a tile shape from another device without revalidating registers, SMEM, local-memory traffic, and grid size crosses a resource cliff	FlashAttention learning chain	`docs/knowledge/anti-patterns/blind-large-tile-port.md`
A6	Uniform causal masking	Paying full causal-mask logic on every K-tile wastes work even though most tiles are fully valid or fully skipped	FlashAttention learning chain	`docs/knowledge/anti-patterns/uniform-causal-masking.md`
A7	Sub-16 MMA dimension	Shrinking any effective MMA dimension (`M`, `N`, or `K`) below `16` usually breaks Tensor Core coverage and explodes latency instead of fixing occupancy	FlashMLA V1 manual probes + Tensor Core shape constraints	`docs/knowledge/anti-patterns/sub-16-mma-dimension.md`
A8	Blind dataflow flip	Rewriting a resource-sensitive Tensor Core kernel into an algebraically equivalent operand orientation changes codegen enough to reintroduce spills or worsen SMEM behavior	FlashMLA V1 → V2	`docs/knowledge/anti-patterns/blind-dataflow-flip.md`

Cross-Referencing: Metrics → Optimizations

Profiling Metric State	Likely Optimizations
Equal-duration kernels with obviously unequal work	O1 (Split-KV proportional block allocation)
`Bh=1` or per-head skinny MMAs on a shared operand	O2 (Head grouping)
Dual-path kernel plateaus near the ridge point after Split-KV	O3 (Independent path scheduling)
L2 hit < 90% and neighboring CTAs reload the same operand tiles	O4 (Swizzle scheduling)
Same kernel structure but different path-level memory pressure suggests compiler scheduling changes may matter	O5 (Latency hints)
Cross-CTA data sharing on Hopper+ or confusion about what `num_ctas` actually controls	O6 (CGA thread block clusters)
Tensor Core/MMA structure is sound, but online-softmax special functions dominate the hot loop	O7 (Fast math for online softmax)
Causal kernel still spends meaningful time in mask/control flow because many K-tiles are fully valid and future tiles are fully skipped	O8 (Causal K-loop split)
Causal/triangular workload shows tail imbalance across CTAs or waves finish unevenly	O9 (Causal ProgramId remap)
Short shapes underfill the GPU while long shapes prefer different tiles or occupancy hints	O10 (Adaptive tile and occupancy autotuning)
Registers/SMEM force one CTA/SM, eligible warps stay very low, and the kernel is compute-bound but still under-utilizes Tensor Cores	O11 (Pipeline-driven low-occupancy scheduling)
Combine stage has `TILE_M=1` and `TC Util = 0%`	A2 (`TILE_M=1` combine/decompression)
Persistent resident blocks < SM count or waves/SM < 1	A3 (Underfilled persistent concurrency)
Persistent stage at `255` regs/thread with single-digit occupancy	A4 (Register-ceiling persistent stage)
Imported large tile causes register/SMEM cliffs, local-memory traffic, or abrupt grid shrinkage on the target device	A5 (Blind large-tile port)
Causal masking logic is paid uniformly across all K-tiles despite obvious fully-valid and fully-skipped regions	A6 (Uniform causal masking)
Candidate tile drives any effective MMA dimension below `16`, or Tensor Core FLOPs collapse far below algorithmic FLOPs after a "smaller tile" change	A7 (Sub-16 MMA dimension)
Proposed optimization is only an algebraic operand/dataflow flip, especially near a register or SMEM cliff	A8 (Blind dataflow flip)

Knowledge Base Update Protocol

For the full update protocol, detail file template, and step-by-step instructions, load the /orchestration-workflow skill.

optimization-catalog

Optimization Catalog Router

Load Order

Language-Specific Catalog Skills

Classification Rule

Current Migration State

Knowledge Base Principles

How This Catalog Works

Optimization Index

Anti-Pattern Index

Cross-Referencing: Metrics → Optimizations

Knowledge Base Update Protocol

More from pepperu96/hyper-mla

profile-kernel

mla-analysis

design-kernel

learn-cute-dsl

cute-dsl-ref

orchestration-workflow