whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Jeff Bolz	85bbc82209	vulkan: Support F16 OP_FILL (llama/22177)	2026-04-30 11:29:14 +03:00
Ruben Ortlam	7fe6b8e171	vulkan: optimize im2col (llama/21713) * vulkan: improve im2col memory write layout * cap workgroups * minimal device tuning * use vendor_id instead of subgroup size	2026-04-30 11:29:10 +03:00
Jeff Bolz	45365fa111	vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it (llama/21572) * vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it * use FetchContent to get SPIRV-Headers * Fetch spirv-headers unconditionally * remove fetchcontent, rely on installed headers * fix ubuntu job * Update docs/build.md	2026-04-30 11:29:08 +03:00
Jeff Bolz	cdeaa34174	vulkan: Support GGML_TYPE_NVFP4 (llama/21455) This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For mul_mat, it does not add support for the dp4/q8_1 path, it's all via fp16/fp32.	2026-04-30 11:29:07 +03:00
Ruben Ortlam	0f99a47177	vulkan: Flash Attention DP4A shader for quantized KV cache (llama/20797) * use integer dot product for quantized KV flash attention * small improvements * fix SHMEM_STAGING indexing * add missing KV type quants * fixes * add supported quants to FA tests * readd fast paths for <8bit quants * fix mmq gate and shmem checks	2026-04-30 11:29:07 +03:00
Jeff Bolz	458ad1d93e	vulkan: Support Q1_0 (llama/21539) * vulkan: Support Q1_0 * use get_dm	2026-04-30 11:29:05 +03:00
Johannes Gäßler	bb895c843d	ggml: backend-agnostic tensor parallelism (experimental) (llama/19378) * ggml: backend-agnostic tensor parallelism * support for GPT-OSS, Qwen 3 MoE * partial Vulkan fix * add support for 4/8 GPUs * unconditional peer access * re-use buffers + ggml contexts * fix output pattern * NCCL support * GGML: HIP: add RCCL support * Remove shfl and AllReduce from backend interface * move allocation workaround out of ggml-alloc.c * 2d tensor set/get support * Fix the seg fault without NCCL * Apply suggestion from JohannesGaessler * support for tensor dims % n_devs != 0 * fix view_offs scaling * arbitrary num. of GPUs/tensor split * fix compilation * better granularity estimate * Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors. * partial Qwen 3 Next support * Fix qwen3 30b (llama/8) * Fix crash with Qwen-30B-A3B Q4_0 Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation. * Decide block size based on tensor quantization type * Fix crashes due to KV cache serialization (llama/9) KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset. * metal : fix build (llama/7) * static memory allocations, fix usage count * fix tensor granularity * more even memory distribution * use BF16 for allreduce * rebase fixup * better error message for unsupported architectures * Fix device mismatch during scatter of allReduce. (llama/11) There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies * Enable the previous allreduce implementation. 
It is better in both perf and stability (llama/12) * delay AllReduce for Moe for less I/O * build : clean-up compile warnings * backend : move most of the meta backend API to ggml-backend-impl.h * cont : hide unused public API in the implementation * llama : use llama_device + remove ggml_backend_dev_is_meta() * ggml-backend : remove unused alloc include * minor : remove regex include * ggml : introduce ggml-ext.h for staging new APIs * rebase fixup * fix tests * llama : more robust logic for determining Meta devices (llama/16) * llama : more robust logic for determining Meta devices * cont : fix devs size check Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cont : fix log type Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * disable roundtrip for meta backend * fix arch selection * Qwen 3.5 support * fix Gemma 4 MoE * fix OpenVino, SYCL * fix test-llama-archs for CPU-only builds * Fix Qwen 3.5 MoE * disable meta backend tests for WebGPU * tests : filter CPU-based devices from the Meta backend tests (llama/17) * meta : formatting, naming, indentation (llama/18) * formatting : llama-model.cpp * formatting : ggml-ext.h * formatting : ggml-backend-meta.cpp * meta : add TODO * add documentation * better error messages * fix GPT-OSS --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-30 11:29:05 +03:00
Ruben Ortlam	1d555510de	vulkan: unify type macros to use Vx instead of _VECx (llama/21605)	2026-04-30 11:29:04 +03:00
Tom Overlund	78b4fd85e1	ggml: Vulkan build, Linux -- output error string for errno on fork failure (#20868) (llama/20904)	2026-04-30 11:29:02 +03:00
mkoker	18c98ffaf7	vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (llama/21029) Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL in the flash attention base shader. Register them in the shader generator, pipeline creation, and enable in the scalar/coopmat1 FA support check.	2026-04-30 11:29:02 +03:00
Ruben Ortlam	759f0084b4	vulkan: add noncontiguous GLU support (llama/21081) * vulkan: add noncontiguous GLU support * fix compile issue	2026-03-29 15:04:36 +03:00
Matt Corallo	22710fdb82	Add shader count for Intel Arc Pro B60 (llama/20818)	2026-03-29 15:04:36 +03:00
Jeff Bolz	49b505bcc5	vulkan: change gated_delta_net to shard a column across a subgroup (llama/20662) * vulkan: change gated_delta_net to shard a column across a subgroup This is based on https://github.com/ggml-org/llama.cpp/pull/20391, I used an LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of subgroup to invocation id, using subgroupAdd optionally, etc.). This fixes a perf regression from the transposing of the values in memory (!20443). * vulkan: Spread columns across fewer lanes to reduce the number of workgroups	2026-03-29 15:04:36 +03:00
Eve	43c7c0f86c	vulkan: dequantize iq4_xs 4 at a time (llama/20657)	2026-03-29 15:04:36 +03:00
Ruben Ortlam	16ca5e6fb1	vulkan: disable mmvq on Intel Windows driver (llama/20672) * vulkan: disable mmvq on Intel Windows driver * improve comment	2026-03-29 15:04:36 +03:00
Ruben Ortlam	0ad6ceef59	vulkan: async and event fixes (llama/20518) * vulkan: fix event wait submission, event command buffer reset * fix event command buffer reset validation error * also reset command buffers before reuse * use timeline semaphores instead of fences for event_synchronize * don't use initializer list for semaphore wait info * use multiple events to avoid reset issues * fix event reuse issue with multiple vectors * add semaphore wait condition also if compute_ctx already exists * remove event pending stage	2026-03-29 15:04:36 +03:00
Ruben Ortlam	49adc8b470	vulkan: allow graphics queue only through env var (llama/20599) * vulkan: avoid graphics queue on non-RADV AMD drivers * avoid graphics queues on small GPUs * change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE * reenable transfer queue if graphics queue is not used	2026-03-29 15:04:36 +03:00
Ruben Ortlam	724ea71cf9	vulkan: fix flash attention dot product precision (llama/20589)	2026-03-29 15:04:36 +03:00
Ruben Ortlam	cd02195b8f	vulkan: use graphics queue on AMD (llama/20551) * vulkan: use graphics queue on AMD for slightly better performance * disable async transfer queue on AMD	2026-03-16 13:10:15 +02:00
Georgi Gerganov	c7abcd577b	graph : remove redundant GDN state transposes (llama/20443) * ggml : transpose fused GDN state access for coalesced memory reads (llama/20436) The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path. Transpose the state indexing so threads read contiguously: - Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v) - CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced) - CPU: restructured loops for row-wise transposed access Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so users can control fused GDN independently of auto-detection. All GATED_DELTA_NET backend-ops tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags - Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output) - Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent state layout mismatch between transposed (fused) and non-transposed (unfused) formats Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * llama : rever fgdn argument changes * graph : remove GDN state transposes * vulkan : adapt * cuda : remove obsolete smem code --------- Co-authored-by: Paul Flynn <paul@arkavo.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Oliver Simons <osimons@nvidia.com>	2026-03-16 13:10:15 +02:00
ProgenyAlpha	2450919665	vulkan: add GATED_DELTA_NET op support (llama/20334) * vulkan: add GATED_DELTA_NET op support Implements the fused gated delta net recurrence as a Vulkan compute shader with full support for scalar gate, KDA vector gate, GQA broadcast, multi-token sequences, and permuted (non-contiguous) q/k inputs. Specialization constants select head size (32/64/128) and KDA mode at pipeline creation time. Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: optimize GATED_DELTA_NET shader (Phase 1) - vec4 dot products on all inner loops (dp4 hardware intrinsic) - Cache exp(g) in shared memory for KDA path, eliminating ~32K redundant global reads and ~16K redundant exp() calls per token - vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops) - Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops KDA TG: +5.4% throughput. Non-KDA: no regressions. 13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: address review feedback for GATED_DELTA_NET Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros, scale in push constants, supports_op fix, dispatch restructuring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: add explicit FLOAT_TYPE casts for buffer loads Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts to ensure correct behavior across all Vulkan configurations. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: fix Q/K broadcast for interleaved head layout Adapt to the interleaved broadcast convention from #20340: head_id / rq1 → head_id % neq1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 13:10:15 +02:00
ProgenyAlpha	44c12c642e	vulkan: fix SSM_CONV PP scaling with large ubatch sizes (llama/20379) * vulkan: optimize SSM_CONV workgroup dispatch for large ubatch Tile tokens into 2D workgroups (32x16) to reduce workgroup launch overhead at large ubatch sizes. Add vec4 fast path for nc=4 (common d_conv size). Fixes PP performance degradation with ubatch > 512. Ref: ggml-org/llama.cpp#18725 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: remove unused shared memory declaration in SSM_CONV Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 13:10:15 +02:00
Jeff Bolz	86e312d61d	vulkan: fix l2_norm epsilon handling (llama/20350)	2026-03-16 13:10:15 +02:00
Jeff Bolz	6c5e3aac3e	vulkan: fix OOB check in flash_attn_mask_opt (llama/20296)	2026-03-16 13:10:15 +02:00
Masato Nakasaka	26ee4f7362	vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (llama/20059) * Changed to reuse command buffers to fix crashing on Intel GPU * Removed unused parameter * Fixed compile error and minor mistake * Fix logging * Changing to use usage flag per command buffer * fixed style * added buffer reset * Removed cmd_buffer_idx for reuse consistency * Fixed style	2026-03-16 13:10:15 +02:00
Bertay Eren	65dbf3c31a	ggml-vulkan: add SGN operator, auto-generate Vulkan.csv and ops.md (llama/20219)	2026-03-16 13:10:15 +02:00
Ruben Ortlam	890c047e30	vulkan: skip zero size tensors in backend copies (llama/20233)	2026-03-16 13:10:15 +02:00
GiantPrince	8d97f59639	ggml-vulkan: Add ELU op support (llama/20183) * ggml-Vulkan: add ELU support * ggml-Vulkan: remove extra spaces and variables * ggml-Vulkan: fix format issue * ggml-Vulkan: fix format issue * fix whitespace issue * Update Vulkan.csv and ops.md	2026-03-16 13:10:15 +02:00
Jeff Bolz	4b0653a792	vulkan: Fix data races in coopmat1 mul_mat(_id) (llama/20084) * vulkan: Fix data races in coopmat1 mul_mat(_id) Add barriers between coopmat store and regular loads. We sort of got away with this because it was the same subgroup accessing the values, but it's still a race and may not work. * switch to subgroup control barriers	2026-03-16 13:10:15 +02:00
Marcel Petrick	67abc63e9d	chore : correct typos [no ci] (llama/20041) * fix(docs): correct typos found during code review Non-functional changes only: - Fixed minor spelling mistakes in comments - Corrected typos in user-facing strings - No variables, logic, or functional code was modified. Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> * Update docs/backend/CANN.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8" This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256. * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> Co-authored-by: Aaron Teo <taronaeo@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-16 13:10:15 +02:00
Ruben Ortlam	923a292429	vulkan: tune MMVQ for Intel Windows (llama/19988)	2026-03-16 13:10:15 +02:00
Ruben Ortlam	2a9649c420	vulkan: improve partial offloading performance on AMD (llama/19976) * vulkan: fix and enable cpy_tensor_async function * use transfer_queue for async transfers on AMD, synchronize with timeline semaphore * update offload_op logic * fix missing transfer submission * disable async transfer queue on AMD GCN * revert op batch size change * fix cpy_tensor_async checks	2026-03-16 13:10:15 +02:00
Ruben Ortlam	e722ee1bf5	vulkan: fix fp16 Flash Attention on Windows AMD RDNA2 and below (llama/19921)	2026-02-27 20:57:58 +02:00
Jeff Bolz	fb55b2654b	vulkan: check for memory overlap before doing fusion (llama/19768) * vulkan: check for memory overlap before doing fusion * Update ggml/src/ggml-vulkan/ggml-vulkan.cpp * address feedback	2026-02-27 20:57:58 +02:00
Ruben Ortlam	90800b5aa5	Vulkan Scalar Flash Attention Refactor (llama/19625) * vulkan: allow using fp16 in scalar flash attention shader * split rows inside of subgroups for faster synchronization * use row_split when Br >= 4, change reductions to use shared memory if row_split == 1 * use f32 scalar FA if f16 is not supported by device * fix amd workgroup size issue * optimize masksh use * add medium rows FA shader Br size * fixes * add padding to mask shmem buffer * cache q values into registers for KQ * fuse lf accumulation, pf and v accumulation into a loop * stage K loads through shmem * stage V loads through shmem * only stage through shmem on Nvidia * default to Bc 32 * also stage V through shmem when this is done for K * dynamic subgroups for intel * use vectorized stores * use float_type for dequantize4 functions * use smaller scalar rows size for smaller rows count * relax flash attention split_k condition to allow non-gqa use * use minimal subgroup size on Intel * fix shmem support function * fix rebase issues * fixes * Bc 4 for scalar FA is not a valid configuration * Use wave32 on AMD RDNA for scalar FA * add Intel shader core count lookup-table * fix regressions * device tuning * tmpsh size fix * fix editorconfig * refactor fa tuning logic into a single place * fix gqa opt logic * fix block_rows with small n_rows * amd tuning * fix hsk=72/80 issue * tuning * allow condition skipping for column check * use float16 for Of if available * address feedback * fix bad RDNA performance on head size <= 128 by limiting occupancy * allow printing pipeline stats * cleanup and fixes * limit occupancy for GCN for small batch FA with large HSK * disable f16 FA for GCN AMD GPUs on the proprietary driver	2026-02-27 20:57:58 +02:00
Jeff Bolz	dcc877688d	vulkan: fix coopmat1 without bf16 support (llama/19793)	2026-02-27 20:57:58 +02:00
Jeff Bolz	344eae3d22	vulkan: fix data race in mul_mat_id shader (llama/19790)	2026-02-27 20:57:58 +02:00
Ruben Ortlam	3f68f30907	vulkan: fix MMQ shader push constants and multi-dispatch (llama/19732)	2026-02-27 20:57:58 +02:00
Jeff Bolz	f1da0a26f5	vulkan: split mul_mat into multiple dispatches to avoid overflow (llama/19509) * vulkan: split mul_mat into multiple dispatches to avoid overflow The batch dimensions can be greater than the max workgroup count limit, in which case we need to split into multiple dispatches and pass the base index through a push constant. Fall back for the less common p021 and nc variants. * address feedback	2026-02-27 20:57:58 +02:00
Jeff Bolz	cc448def01	vulkan: support L2_NORM with contiguous rows (llama/19604)	2026-02-15 21:44:37 +02:00
Jeff Bolz	197e9ab6eb	vulkan: support GGML_OP_SET (llama/19584)	2026-02-15 21:44:37 +02:00
Sophon	fc6bbab817	vulkan: Add vendor id for Qualcomm drivers (llama/19569) This commit allows Qualcomm native vulkan driver to be used on Windows instead of Mesa Dozen.	2026-02-15 21:44:37 +02:00
Jeff Bolz	ec57bf407c	vulkan: restore -inf check in FA shaders (llama/19582)	2026-02-15 21:44:37 +02:00
ymcki	628b545b7e	fix vulkan ggml_acc only works in 3d but not 4d (llama/19426) * fix vulkan ggml_acc only works in 3d but not 4d * removed clamp in test_acc_block * use the correct stride and its test case * cuda : fix "supports op" condition * change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check * version without boundary check * revert back to boundary check version --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-02-15 21:44:37 +02:00
Jeff Bolz	cea22b3075	vulkan: For coopmat2 FA, use fp16 accumulators for the final result (llama/19376) The cpu and cuda backends use fp16 for the VKQ accumulator type, this change does the same for vulkan. This helps particularly with large head sizes which are very register-limited. I tried this for the coopmat1 path and it slowed down a bit. I didn't try for scalar. I applied the softmax bias that the cuda backend uses to avoid overflow, although I was not able to reproduce the original bug without it.	2026-02-08 09:29:10 +02:00
Jeff Bolz	c1b63354bb	vulkan: make FA mask/softcap enables spec constants (llama/19309) * vulkan: make FA mask/softcap enables spec constants * don't specialize for sinks * bump timeout a little bit	2026-02-08 09:29:10 +02:00
Jeff Bolz	a567c140a3	vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281) Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases. Apply this optimization when the mask is relatively large (i.e. prompt processing).	2026-02-08 09:29:10 +02:00
Oleksandr Kuvshynov	932def3198	vulkan: fix GPU deduplication logic. (llama/19222) * vulkan: fix GPU deduplication logic. As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the (same uuid, same driver) logic is problematic for windows+intel igpu. Let's just avoid filtering for MoltenVK which is apple-specific, and keep the logic the same as before 88d23ad5 - just dedup based on UUID. Verified that MacOS + 4xVega still reports 4 GPUs with this version. * vulkan: only skip dedup when both drivers are moltenVk	2026-02-08 09:29:10 +02:00
Jeff Bolz	5a786f7648	vulkan: Set k_load_shmem to false when K is too large (llama/19301)	2026-02-08 09:29:10 +02:00
Jeff Bolz	e0a3f393ad	vulkan: fix non-contig rope (llama/19299)	2026-02-08 09:29:10 +02:00
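
Commit f1da0a26f5 above describes splitting mul_mat into several dispatches when the batch dimension exceeds the maximum workgroup count, passing a base index to each dispatch through a push constant. The Python sketch below illustrates only that chunking arithmetic. It is not the code from ggml-vulkan.cpp: the function name `split_dispatches` and both parameter names are hypothetical, chosen for illustration.

```python
def split_dispatches(total_batches: int, max_workgroups: int) -> list[tuple[int, int]]:
    """Return (base_index, count) pairs that together cover range(total_batches).

    Hypothetical helper: each pair stands for one dispatch, where base_index
    is the value that would be passed to the shader (e.g. as a push constant)
    and count is the number of workgroups launched along the batch dimension.
    """
    if max_workgroups <= 0:
        raise ValueError("max_workgroups must be positive")

    chunks = []
    base = 0
    while base < total_batches:
        # Never launch more workgroups than the assumed device limit allows.
        count = min(max_workgroups, total_batches - base)
        chunks.append((base, count))
        base += count
    return chunks


if __name__ == "__main__":
    # 10 batches with an assumed limit of 4 workgroups per dispatch.
    for base, count in split_dispatches(10, 4):
        print(f"dispatch: base={base} count={count}")
```

Inside each dispatch, a shader would add the base index to its local workgroup index to recover the global batch index, so the set of dispatches covers every batch exactly once.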

463 Commits