whisper.cpp/ggml
Gaurav Garg ae6a9bb9a5 CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
- Query the number of active blocks per SM with the cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer the vector flash attention kernels over the MMA kernel for BS=1
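
The occupancy-based choice of parallel_blocks described above can be sketched on the host side roughly as follows. This is a minimal illustration, not the code from the commit: the helper `pick_parallel_blocks` and its parameters (`n_sm`, `max_blocks_per_sm`, `n_tiles`, `max_kv_splits`) are hypothetical names, and the rule "fill one wave of resident blocks, then clamp" is an assumption about what "optimal" means here. In real CUDA code, `max_blocks_per_sm` would be the value written by cudaOccupancyMaxActiveBlocksPerMultiprocessor for the chosen kernel, block size, and dynamic shared memory size.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not the actual ggml implementation).
//   n_sm              number of streaming multiprocessors on the device
//   max_blocks_per_sm blocks of the kernel that can be resident on one SM
//                     (assumed to come from cudaOccupancyMaxActiveBlocksPerMultiprocessor)
//   n_tiles           independent blocks the launch has before the KV dimension is split
//   max_kv_splits     upper bound on how many chunks the KV dimension can be split into
int pick_parallel_blocks(int n_sm, int max_blocks_per_sm, int n_tiles, int max_kv_splits) {
    assert(n_sm > 0 && max_blocks_per_sm > 0 && n_tiles > 0 && max_kv_splits > 0);

    // Number of blocks the whole GPU can keep resident at once (one "wave").
    const int blocks_per_wave = n_sm * max_blocks_per_sm;

    // Split each tile into enough parallel blocks to fill one wave.
    const int wanted = blocks_per_wave / n_tiles;

    // Use at least 1 block per tile and no more than the KV dimension allows.
    return std::min(std::max(wanted, 1), max_kv_splits);
}
```

With BS=1 the number of tiles is small, so without splitting most SMs would sit idle. Under these assumptions, a device with 108 SMs and 2 resident blocks per SM running 32 tiles would split each tile into 6 parallel blocks.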

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-27 11:06:03 +02:00
cmake cmake: Comment out GGML_BIN_DIR for now (ggml/1139) 2025-03-27 11:06:03 +02:00
include llama: Add support for RWKV v7 architecture (llama/12412) 2025-03-27 11:06:03 +02:00
src CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183) 2025-03-27 11:06:03 +02:00
.gitignore whisper : reorganize source code + improve CMake (#2256) 2024-06-26 19:34:09 +03:00
CMakeLists.txt SYCL: using graphs is configurable by environment variable and compile option (llama/12371) 2025-03-27 11:06:03 +02:00