whisper.cpp/ggml
Aman Gupta d9bf63cfb8
CUDA: add a fused top-K MoE kernel (llama/16130)
* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
2025-09-29 15:18:10 +03:00
..
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (llama/15094) 2025-08-18 20:30:45 +03:00
include llama: print memory breakdown on exit (llama/15860) 2025-09-29 15:18:10 +03:00
src CUDA: add a fused top-K MoE kernel (llama/16130) 2025-09-29 15:18:10 +03:00
.gitignore whisper : reorganize source code + improve CMake (#2256) 2024-06-26 19:34:09 +03:00
CMakeLists.txt ggml : remove -dev suffix from release version (ggml/1355) 2025-09-29 15:18:10 +03:00