whisper.cpp/ggml
Matt Corallo 982533fc0c vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (llama/23056)
Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even
though they're only 2-byte aligned, and Q3_K still wins on
NVIDIA as well.

mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, we can do
subtraction directly on a full int32_t rather than an i8vec4
with bit twiddling because the high bit is always free to start.
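
The 32-bit subtraction described above can be illustrated on the host side
with a small C sketch. This is not the shader code from this commit. It
assumes four unsigned quant values are packed one per byte into a uint32_t
and that each is below 0x80 (e.g. 6-bit Q6_K-style values with a bias of
32). The names sub_bytes_ref, sub_bytes_swar and swar_matches_ref are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: subtract `bias` from each of the four bytes separately,
 * keeping each result as an 8-bit two's-complement value. */
static uint32_t sub_bytes_ref(uint32_t packed, uint8_t bias) {
    uint32_t out = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint8_t q = (uint8_t)(packed >> (8 * lane));
        uint8_t r = (uint8_t)(q - bias);          /* wraps modulo 256 */
        out |= (uint32_t)r << (8 * lane);
    }
    return out;
}

/* One 32-bit subtract for all four lanes. Because bit 7 of every byte is
 * free, setting it first guarantees no lane borrows from its neighbour;
 * the final XOR flips it back to give the correct sign bit. */
static uint32_t sub_bytes_swar(uint32_t packed, uint8_t bias) {
    const uint32_t high = 0x80808080u;
    assert((packed & high) == 0);                 /* high bits must be free */
    assert(bias < 0x80);
    uint32_t bias4 = 0x01010101u * bias;          /* broadcast bias to 4 lanes */
    return ((packed | high) - bias4) ^ high;
}

/* Compare both versions for lane values 0..max_q; returns 1 on agreement. */
static int swar_matches_ref(uint8_t bias, uint8_t max_q) {
    for (uint32_t a = 0; a <= max_q; ++a) {
        for (uint32_t b = 0; b <= max_q; ++b) {
            uint32_t packed = a | (b << 8) | (a << 16) | (b << 24);
            if (sub_bytes_swar(packed, bias) != sub_bytes_ref(packed, bias)) {
                return 0;
            }
        }
    }
    return 1;
}
```

For example, sub_bytes_swar(0x3F20011Fu, 32) yields 0x1F00E1FF, i.e. the
signed lanes 31, 0, -31 and -1 from the highest byte down.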

On Intel BMG on mesa, the switch to MMVQ provides an immediate
~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and
~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

The further switch to block loads leads to a ~24% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA
override for K quants on Xe2 as well.
2026-06-08 14:36:36 +03:00
..
cmake ggml : Parallelize quant LUT init (llama/23595) 2026-05-25 12:26:07 +03:00
include ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (llama/22341) 2026-05-25 12:44:04 +03:00
src vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (llama/23056) 2026-06-08 14:36:36 +03:00
.gitignore
CMakeLists.txt ggml : bump version to 0.13.1 (ggml/1523) 2026-05-29 09:47:30 +03:00