whisper.cpp/ggml/src
Johannes Gäßler e62d5893f4 HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (llama/22880)
Adds RDNA3 support to the CUDA mma FA kernel. For the RDNA3 tensor cores to work with FP16 accumulation for VKQ, the tiles need to be 32 logical units long in the direction of the attention head; for head sizes 80 and 112, which are not evenly divisible by 32, the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32, which is why they are also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout.

I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 for CDNA. For RDNA3/4 I was not able to get better performance than the tile kernel for head sizes > 128.
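
The tile-length rule described in the commit message can be illustrated with a minimal sketch. This is not code from ggml: `AccType`, `TileChoice`, and `choose_vkq_tile` are hypothetical names introduced here, and the sketch only restates the stated rule (32-long tiles with FP16 accumulation when the head size is divisible by 32, otherwise 16-long tiles with FP32 accumulation).

```cpp
// Hypothetical illustration only; none of these names exist in ggml.
enum class AccType { FP16, FP32 };

struct TileChoice {
    int     length; // tile length in logical units along the attention head
    AccType acc;    // accumulation type used for VKQ
};

// Rule as stated in the commit message for RDNA3:
// head sizes divisible by 32 use 32-long tiles with FP16 accumulation,
// all others (e.g. 80 and 112) fall back to 16-long tiles with FP32.
constexpr TileChoice choose_vkq_tile(int head_size) {
    return head_size % 32 == 0 ? TileChoice{32, AccType::FP16}
                               : TileChoice{16, AccType::FP32};
}

// Compile-time checks for the head sizes named in the commit message.
static_assert(choose_vkq_tile(80).length == 16, "80 is not divisible by 32");
static_assert(choose_vkq_tile(112).length == 16, "112 is not divisible by 32");
static_assert(choose_vkq_tile(128).length == 32, "128 is divisible by 32");
```

In the actual kernel the choice is presumably made through compile-time configuration rather than a helper like this; the sketch only shows which head sizes land on which path.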
2026-05-25 12:26:07 +03:00
..
ggml-blas vulkan: add get/set tensor 2d functions (llama/22514) 2026-05-01 13:07:35 +03:00
ggml-cann vulkan: add get/set tensor 2d functions (llama/22514) 2026-05-01 13:07:35 +03:00
ggml-cpu ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863) 2026-05-25 12:26:07 +03:00
ggml-cuda HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (llama/22880) 2026-05-25 12:26:07 +03:00
ggml-hexagon hexagon: add unary tanh op (llama/22999) 2026-05-14 21:26:48 +03:00
ggml-hip ggml: backend-agnostic tensor parallelism (experimental) (llama/19378) 2026-04-30 11:29:05 +03:00
ggml-metal logs : reduce (llama/23021) 2026-05-25 12:26:07 +03:00
ggml-musa ggml-cuda: native bf16 flash attention for vec kernel (llama/20525) 2026-03-29 15:04:36 +03:00
ggml-opencl opencl: add q5_0 and q5_1 MoE for Adreno (llama/22985) 2026-05-14 21:26:48 +03:00
ggml-openvino openvino: driver setup, CI split, thread safety, and NPU optimizations (llama/21944) 2026-04-30 11:29:15 +03:00
ggml-rpc rpc : use graph uid instead of graph cache (llama/22701) 2026-05-14 21:26:48 +03:00
ggml-sycl SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (llama/21597) 2026-05-25 12:26:07 +03:00
ggml-virtgpu ggml-virtgpu : include missing mutex header (llama/22810) 2026-05-14 21:26:48 +03:00
ggml-vulkan vulkan: fix matmul integer pipeline selection (llama/23005) 2026-05-25 12:26:07 +03:00
ggml-webgpu ggml-webgpu: makes the flash attn vec path subgroup-aware (llama/23040) 2026-05-25 12:26:07 +03:00
ggml-zdnn vulkan: add get/set tensor 2d functions (llama/22514) 2026-05-01 13:07:35 +03:00
ggml-zendnn ggml-zendnn : adaptive fallback to CPU backend for small batch sizes (llama/22681) 2026-05-14 21:26:48 +03:00
CMakeLists.txt ggml : revert to -lm linking instead of find_library (llama/22355) 2026-04-30 11:29:21 +03:00
ggml-alloc.c ggml : remove ggml-ext.h (llama/21869) 2026-04-30 11:29:09 +03:00
ggml-backend-dl.cpp hexagon: enable offloading to Hexagon on Windows on Snapdragon (llama/19150) 2026-01-30 15:56:40 +02:00
ggml-backend-dl.h hexagon: enable offloading to Hexagon on Windows on Snapdragon (llama/19150) 2026-01-30 15:56:40 +02:00
ggml-backend-impl.h ggml: backend-agnostic tensor parallelism (experimental) (llama/19378) 2026-04-30 11:29:05 +03:00
ggml-backend-meta.cpp vulkan: add get/set tensor 2d functions (llama/22514) 2026-05-01 13:07:35 +03:00
ggml-backend-reg.cpp ggml : skip already registered backends and devices (llama/22296) 2026-04-30 11:29:21 +03:00
ggml-backend.cpp ggml: update SCHED_DEBUG output to use ggml_op_desc() (llama/22825) 2026-05-14 21:26:48 +03:00
ggml-common.h ggml: add Q1_0 1-bit quantization support (CPU) (llama/21273) 2026-04-30 11:29:01 +03:00
ggml-impl.h ggml: add graph_reused (llama/21764) 2026-04-30 11:29:11 +03:00
ggml-opt.cpp fix: free ctx_copy in ggml_opt_free to plug per-training-session leak (llama/21592) 2026-04-30 11:29:03 +03:00
ggml-quants.c ggml: add Q1_0 1-bit quantization support (CPU) (llama/21273) 2026-04-30 11:29:01 +03:00
ggml-quants.h ggml: add Q1_0 1-bit quantization support (CPU) (llama/21273) 2026-04-30 11:29:01 +03:00
ggml-threading.cpp ggml : build backends as libraries (llama/10256) 2024-11-20 21:00:08 +02:00
ggml-threading.h remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (llama/10797) 2024-12-18 12:52:16 +02:00
ggml.c ggml : implement fast walsh-hadamard transform for kv rotation (#21352) (llama/22631) 2026-05-14 21:26:48 +03:00
ggml.cpp ggml : Print backtrace on uncaught C++ exceptions (ggml/1232) 2025-05-29 09:56:26 +03:00
gguf.cpp llama: fix llama-model-saver (llama/20503) 2026-03-29 15:04:36 +03:00