Monday, June 22, 2026

MoonMath AI Open-Sources a HIP Consideration Kernel for AMD MI300X That Beats AITER v3 on Each Form and Rounding Mode

MoonMath AI group has launched a bf16 ahead consideration kernel for AMD’s MI300X GPU. It’s written in HIP, not hand-written meeting. The code is open-source below the MIT license. The MoonMath.ai group reviews it beats AITER v3, AMD’s personal optimized kernel, on each examined form. Naked-metal entry got here from HotAisle, an AMD cloud supplier.

Consideration is the fused softmax(QKᵀ/√d)·V operation inside each transformer. The MI300X is AMD’s CDNA3 data-center GPU, with the ISA goal (gfx942). This kernel runs on that {hardware} solely.

TL;DR

  • MoonMath.ai open-sources a bf16 ahead consideration kernel for AMD MI300X, written in HIP, not meeting (MIT).
  • It beats AMD’s AITER v3 on each form and rounding mode — geomean 1.18×/1.15×/1.08×, as much as 1.26×.
  • The core trick: one-instruction asm wrappers allow you to decide the opcode whereas the compiler allocates registers.
  • A lot of the speedup is reminiscence placement — Okay in LDS, V scorching in L1, Q and accumulators in registers.
  • An actual SGLang PR used it to hurry up Wan2.1 video diffusion by 1.23×, with no high quality regression.

Understanding Kernel

A kernel is a small program that runs straight on the GPU’s many cores to carry out one particular computation—right here, the eye math—as quick because the {hardware} permits. The kernel computes ahead consideration in bf16 on MI300X solely. It takes inputs in both BSHD or BHSD format, with no transpose. Head dimension is fastened at 128. It helps any sequence size, together with cross-attention.

There are actual limits. There isn’t any causal masks, no GQA, and no varlen batching. Outputs are bf16, and it runs on gfx942 {hardware} solely.

Numerics are tightly managed. All three rounding modes match AITER’s per-mode rounding rule. Each finite output sits inside 1 bf16 ULP of AITER. NaN and Inf dealing with is bit-identical, and outcomes are deterministic.

The Core Trick: One-Instruction asm Wrappers

The core method avoids a well-known dilemma. Compiler intrinsics preserve code tidy however let the compiler reorder or rename operands. Uncooked inline meeting provides management however forces handbook register and handle administration.

MoonMath wraps precisely one instruction in a __device__ __forceinline__ perform. Prolonged asm constraints describe the operands. The analysis group picks the opcode. The compiler nonetheless allocates registers and tracks knowledge circulate.

// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.
__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {
    asm risky("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"
                 : "+v"(c) : "v"(a), "v"(b));
}

The "+v"(c) constraint ties the accumulator enter and output to the identical VGPR. No copy instruction is emitted. This retains the kernel near strange HIP. It nonetheless steers the machine one instruction at a time.

The Structure: Eight Waves, Two Teams, Two Obstacles

A CDNA3 compute unit has 4 SIMD models. The textbook block is 4 waves. MoonMath as an alternative runs eight waves per block, in two teams of 4.

The 2 teams run the identical Q*Okay, softmax, O += P*V sequence. They’re offset by a section. Whereas one group saturates the matrix core, the opposite runs softmax and points masses. Then they swap, so the matrix core by no means idles.

There are two s_barriers per iteration. One sits on the section handoff. One sits on the iteration boundary. Per-counter waits deal with the remainder of the synchronization.

This echoes FlashAttention-3’s matmul and softmax alternation. It doesn’t copy FA3’s producer and client warp cut up. On CDNA3, each reminiscence transfer is already asynchronous, so a devoted producer wave is pointless.

The place Knowledge Lives, and Why 16×16×16

A lot of the speedup comes from reminiscence placement. Okay streams from HBM into LDS, double-buffered, shared by all eight waves. V stays scorching in L1, learn on each PV matmul. Q and accumulators reside in registers.

The analysis group picked the 16×16×16 MFMA over 32×32×8. Each shapes have an identical throughput. The smaller tile accumulates into 4 fp32 parts per lane, in opposition to 16. Decrease accumulator stress leaves room for deeper prefetch and a 3rd Q tile.

Determination Selection Purpose
Waves per block 8 (two teams of 4) Plan the pipeline straight; share one Okay copy
MFMA form 16×16×16 bf16 Similar throughput, decrease VGPR stress, higher energy effectivity
Okay placement LDS, double-buffered, 32 KiB Shared by all 8 waves, swapped per iteration
V placement L1, resident, prefetched Reread throughout PV, stored scorching intentionally
Q + accumulators VGPRs Learn each iteration, by no means reloaded

Two later wins shut the hole. A 3rd Q tile (3Q) raises knowledge reuse per loaded Okay and V tile. A Flash-Decoding-style tail KV cut up rescues the stranded fractional spherical throughout MI300X’s 304 CUs. These wins cascade. Shifting V to L1 freed the LDS that the third Q tile then fills.

Benchmark

Assessments ran on MI300X in bf16, head dimension 128. Every form was measured at three rounding modes. RTNE rounds to nearest even. RTNA rounds to nearest, ties away from zero. RTZ truncates towards zero.

Form (B, H, S, D) Spherical Ours (ms) AITER v3 (ms) vs AITER vs MAX
(2, 24, 8192, 128) RTNE 3.083 3.792 1.23× 1.37×
(2, 24, 16384, 128) RTNE 11.670 14.691 1.26× 1.54×
(4, 16, 16384, 128) RTZ 15.055 16.183 1.07× 1.47×
(2, 24, 32768, 128) RTNA 44.440 52.363 1.18× 1.57×
(1, 16, 131072, 128) RTNE 232.517 269.278 1.16× 1.46×

Geomeans throughout the sweep favor MoonMath. Versus AITER, it scores 1.18× (RTNE), 1.15× (RTNA), and 1.08× (RTZ). Versus Modular MAX, geomeans run 1.44× to 1.49×, and per-shape speedups attain 1.59×.

RTZ is AITER’s personal quickest mode and the tightest race. The (4, 16, 16384) RTZ form moved from 0.95× to 1.07×. The tail KV cut up is what closed that ultimate hole.

Interactive Explainer


Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles