MoonMath AI group has launched a bf16 ahead consideration kernel for AMD’s MI300X GPU. It’s written in HIP, not hand-written meeting. The code is open-source below the MIT license. The MoonMath.ai group reviews it beats AITER v3, AMD’s personal optimized kernel, on each examined form. Naked-metal entry got here from HotAisle, an AMD cloud supplier.
Consideration is the fused softmax(QKᵀ/√d)·V operation inside each transformer. The MI300X is AMD’s CDNA3 data-center GPU, with the ISA goal (gfx942). This kernel runs on that {hardware} solely.
TL;DR
- MoonMath.ai open-sources a bf16 ahead consideration kernel for AMD MI300X, written in HIP, not meeting (MIT).
- It beats AMD’s AITER v3 on each form and rounding mode — geomean 1.18×/1.15×/1.08×, as much as 1.26×.
- The core trick: one-instruction asm wrappers allow you to decide the opcode whereas the compiler allocates registers.
- A lot of the speedup is reminiscence placement — Okay in LDS, V scorching in L1, Q and accumulators in registers.
- An actual SGLang PR used it to hurry up Wan2.1 video diffusion by 1.23×, with no high quality regression.
Understanding Kernel
A kernel is a small program that runs straight on the GPU’s many cores to carry out one particular computation—right here, the eye math—as quick because the {hardware} permits. The kernel computes ahead consideration in bf16 on MI300X solely. It takes inputs in both BSHD or BHSD format, with no transpose. Head dimension is fastened at 128. It helps any sequence size, together with cross-attention.
There are actual limits. There isn’t any causal masks, no GQA, and no varlen batching. Outputs are bf16, and it runs on gfx942 {hardware} solely.
Numerics are tightly managed. All three rounding modes match AITER’s per-mode rounding rule. Each finite output sits inside 1 bf16 ULP of AITER. NaN and Inf dealing with is bit-identical, and outcomes are deterministic.
The Core Trick: One-Instruction asm Wrappers
The core method avoids a well-known dilemma. Compiler intrinsics preserve code tidy however let the compiler reorder or rename operands. Uncooked inline meeting provides management however forces handbook register and handle administration.
MoonMath wraps precisely one instruction in a __device__ __forceinline__ perform. Prolonged asm constraints describe the operands. The analysis group picks the opcode. The compiler nonetheless allocates registers and tracks knowledge circulate.
// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.
__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {
asm risky("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"
: "+v"(c) : "v"(a), "v"(b));
}
The "+v"(c) constraint ties the accumulator enter and output to the identical VGPR. No copy instruction is emitted. This retains the kernel near strange HIP. It nonetheless steers the machine one instruction at a time.
The Structure: Eight Waves, Two Teams, Two Obstacles
A CDNA3 compute unit has 4 SIMD models. The textbook block is 4 waves. MoonMath as an alternative runs eight waves per block, in two teams of 4.
The 2 teams run the identical Q*Okay, softmax, O += P*V sequence. They’re offset by a section. Whereas one group saturates the matrix core, the opposite runs softmax and points masses. Then they swap, so the matrix core by no means idles.
There are two s_barriers per iteration. One sits on the section handoff. One sits on the iteration boundary. Per-counter waits deal with the remainder of the synchronization.
This echoes FlashAttention-3’s matmul and softmax alternation. It doesn’t copy FA3’s producer and client warp cut up. On CDNA3, each reminiscence transfer is already asynchronous, so a devoted producer wave is pointless.
The place Knowledge Lives, and Why 16×16×16
A lot of the speedup comes from reminiscence placement. Okay streams from HBM into LDS, double-buffered, shared by all eight waves. V stays scorching in L1, learn on each PV matmul. Q and accumulators reside in registers.
The analysis group picked the 16×16×16 MFMA over 32×32×8. Each shapes have an identical throughput. The smaller tile accumulates into 4 fp32 parts per lane, in opposition to 16. Decrease accumulator stress leaves room for deeper prefetch and a 3rd Q tile.
| Determination | Selection | Purpose |
|---|---|---|
| Waves per block | 8 (two teams of 4) | Plan the pipeline straight; share one Okay copy |
| MFMA form | 16×16×16 bf16 | Similar throughput, decrease VGPR stress, higher energy effectivity |
| Okay placement | LDS, double-buffered, 32 KiB | Shared by all 8 waves, swapped per iteration |
| V placement | L1, resident, prefetched | Reread throughout PV, stored scorching intentionally |
| Q + accumulators | VGPRs | Learn each iteration, by no means reloaded |
Two later wins shut the hole. A 3rd Q tile (3Q) raises knowledge reuse per loaded Okay and V tile. A Flash-Decoding-style tail KV cut up rescues the stranded fractional spherical throughout MI300X’s 304 CUs. These wins cascade. Shifting V to L1 freed the LDS that the third Q tile then fills.
Benchmark
Assessments ran on MI300X in bf16, head dimension 128. Every form was measured at three rounding modes. RTNE rounds to nearest even. RTNA rounds to nearest, ties away from zero. RTZ truncates towards zero.
| Form (B, H, S, D) | Spherical | Ours (ms) | AITER v3 (ms) | vs AITER | vs MAX |
|---|---|---|---|---|---|
| (2, 24, 8192, 128) | RTNE | 3.083 | 3.792 | 1.23× | 1.37× |
| (2, 24, 16384, 128) | RTNE | 11.670 | 14.691 | 1.26× | 1.54× |
| (4, 16, 16384, 128) | RTZ | 15.055 | 16.183 | 1.07× | 1.47× |
| (2, 24, 32768, 128) | RTNA | 44.440 | 52.363 | 1.18× | 1.57× |
| (1, 16, 131072, 128) | RTNE | 232.517 | 269.278 | 1.16× | 1.46× |
Geomeans throughout the sweep favor MoonMath. Versus AITER, it scores 1.18× (RTNE), 1.15× (RTNA), and 1.08× (RTZ). Versus Modular MAX, geomeans run 1.44× to 1.49×, and per-shape speedups attain 1.59×.
RTZ is AITER’s personal quickest mode and the tightest race. The (4, 16, 16384) RTZ form moved from 0.95× to 1.07×. The tail KV cut up is what closed that ultimate hole.
Interactive Explainer
/* —- reminiscence map —- */
var MEM={
q:{t:’Q tile — VGPRs, persistent’,b:’The Q tile is learn each iteration and by no means reloaded, so it stays resident within the vector register file. Two of three Q tiles per wave keep register-resident and scorching.’},
acc:{t:’scores · O — fp32 accumulators in VGPRs’,b:’Matrix-core outputs (the rating matrix and the working output) by no means depart registers till the ultimate retailer. The 16×16×16 MFMA accumulates into simply 4 fp32 parts per lane, protecting accumulator stress low.’},
ok:{t:’Okay tile — LDS, double-buffered, 32 KiB’,b:’One copy of Okay is shared by all eight waves and swapped per iteration by way of a double buffer. Okay streams from HBM straight into LDS by direct DMA, by no means passing via a VGPR. An XOR swizzle breaks financial institution conflicts with zero padding.’},
q3:{t:’third Q tile — LDS, 32 KiB, streamed’,b:’Shifting V to L1 freed 32 KiB of LDS. The kernel spends it on a 3rd Q tile (48 q-rows per wave). It’s parked in LDS and streamed via a ping-pong buffer in the course of the QK matmul, elevating Okay/V reuse.’},
v:{t:’V_t tile — L1, resident’,b:’The pre-transposed V tile is stored scorching in L1 and reread on each PV matmul. L1 shouldn’t be addressable, so residency is engineered by prefetching the following iterationu2019s strains right into a throwaway register — the information lands in L1 as a aspect impact.’},
src:{t:’Okay / V supply — HBM, staged by way of L2′,b:’A head-first chiplet swizzle maps all of a (batch, head)u2019s Q blocks onto a single XCD, so its Okay and V keep resident in that XCDu2019s slice of L2 as an alternative of beating throughout all eight.’}
};
perform showMem(ok){
$all(‘.chip’).forEach(perform(c){c.setAttribute(‘aria-pressed’, c.getAttribute(‘data-k’)===ok ? ‘true’:’false’);});
$(‘#mm-detail .dt’).textContent=MEM[k].t;
$(‘#mm-detail .db’).textContent=MEM[k].b;
reportHeight();
}
$all(‘.chip’).forEach(perform(c){c.addEventListener(‘click on’,perform(){showMem(c.getAttribute(‘data-k’));});});
/* —- init —- */
drawBench();showMem(‘q’);
/* —- auto-resize for WordPress embed —- */
perform reportHeight(){
var h=root.offsetHeight+40;
if(window.guardian){window.guardian.postMessage({sort:’mm-cdna3-height’,top:h},’*’);}
}
window.addEventListener(‘load’,reportHeight);
window.addEventListener(‘resize’,reportHeight);
setTimeout(reportHeight,300);setTimeout(reportHeight,900);
})();
