whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Jeff Bolz	273e4fe7ae	vulkan: Use fewer rows for scalar FA when HS is not a multiple of 16 (llama/17455)	2025-12-12 17:53:07 +02:00
Jeff Bolz	553d57a4e7	vulkan: more FA details in vk_perf_logger (llama/17443)	2025-12-12 17:53:07 +02:00
Jeff Bolz	deb4958add	vulkan: remove a couple unnecessary switches (llama/17419)	2025-12-12 17:53:06 +02:00
Jeff Bolz	cdc1a776be	vulkan: disable async for older Intel devices (llama/17369) * vulkan: disable async for older Intel devices * update detection logic * use name string for detection	2025-12-12 17:53:05 +02:00
Giuseppe Scrivano	24b14cad87	vulkan: implement ADD1, ARANGE, FILL, SOFTPLUS, STEP, ROUND, CEIL, FLOOR, TRUNC (llama/17319) * vulkan: initialize array * vulkan: implement ADD1 * vulkan: implement ARANGE * vulkan: implement FILL * vulkan: implement SOFTPLUS * vulkan: implement STEP * vulkan: implement ROUND * vulkan: implement CEIL * vulkan: implement FLOOR * vulkan: implement TRUNC * docs: update Vulkan ops Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-12-12 17:53:04 +02:00
Jeff Bolz	95d0b0b0cf	vulkan: support larger argsort (llama/17313) * vulkan: support larger argsort This is an extension of the original bitonic sorting shader that puts the temporary values in global memory and when more than 1024 threads are needed it runs multiple workgroups and synchronizes through a pipelinebarrier. To improve the memory access pattern, a copy of the float value is kept with the index value. I've applied this same change to the original shared memory version of the shader, which is still used when ncols <= 1024. * Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost * reduce loop overhead * run multiple cols per invocation, to reduce barrier overhead	2025-12-12 17:53:04 +02:00
Jeff Bolz	ae8865c6e6	vulkan: Add copy_transpose shader (llama/17371)	2025-12-12 17:53:04 +02:00
Ruben Ortlam	2097a9c1bd	vulkan: force full subgroups for flash attention to fix intel subgroup crash (llama/17356)	2025-12-12 17:53:03 +02:00
Jeff Bolz	24b981eff7	vulkan: support noncontig i32 copy (llama/17328)	2025-12-12 17:53:03 +02:00
Ruben Ortlam	b7dfced37f	vulkan: add log RTE support to fix Nvidia CI (llama/17320) * vulkan: add log RTE support to fix Nvidia CI * actually use the rte shader	2025-12-12 17:53:02 +02:00
Pavels Zaicenkovs	9d95d9a1ee	vulkan: add LOG operation support for F32 and F16 (llama/17183) * vulkan: add LOG operation support for F32 and F16 Part of #14909. * vulkan: Fix LOG operation types * docs: Update operation support documentation for Vulkan LOG operation * vulkan: fix log_f16 shader * docs: restore missing LOG test cases and regenerate ops.md	2025-11-17 21:05:46 +02:00
Ruben Ortlam	f571655e8e	vulkan: fix MMQ quantize_y condition (llama/17301)	2025-11-17 21:05:46 +02:00
Jeff Bolz	ea3ebd8b0d	vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (llama/17287) These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.	2025-11-17 21:05:46 +02:00
Giuseppe Scrivano	4c4e663da0	vulkan: implement ABS and NEG (llama/17245) * docs: update Vulkan ops * vulkan: add NEG op * vulkan: add ABS op --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-17 21:05:46 +02:00
Jeff Bolz	e1846fc599	vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (llama/17244) * vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths * set allow_misalign	2025-11-17 21:05:46 +02:00
Jeff Bolz	9614a56314	vulkan: skip all-negative-inf blocks in FA (llama/17186)	2025-11-17 21:05:46 +02:00
Jeff Bolz	37d4bba152	vulkan: change graph_compute to be async and enable get_tensor_async (llama/17158) * vulkan: change graph_compute to be async and enable get_tensor_async This allows some additional CPU/GPU overlap for large pp workloads. Also seems to help a bit for token gen, maybe getting rid of a small bubble between graph_compute and get_tensor. Async set and copy functions seem to be very rarely used, so I didn't enable them because I didn't have a good way to test them. The async commands need to be ordered against each other, so put them all on the compute queue. The non-async commands still use the transfer queue. The fence for graph_compute/get_tensor_async is submitted and waited on in ggml_vk_synchronize. * fix thread safety errors * teardown context cleanly * Handle async read to non-pinned dst	2025-11-17 21:05:46 +02:00
Eve	559091005a	disable rms norm mul rope for chips with no fp16 rte (llama/17134)	2025-11-17 21:05:46 +02:00
Ruben Ortlam	43f2c1ff54	vulkan: fix validation issue introduced by #16868 (llama/17145)	2025-11-17 21:05:46 +02:00
Acly	58a97d988f	cuda/vulkan : bicubic interpolation (llama/17022) * vulkan : implement upscale with bicubic interpolation * cuda : implement upscale with bicubic interpolation * tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests * adapt OpenCL backend to not support the OP in that case so tests don't fail * print scale mode & flags in test-backend-ops	2025-11-17 21:05:46 +02:00
Ruben Ortlam	2e04e7a906	vulkan: fix memory allocations (llama/17122)	2025-11-17 21:05:46 +02:00
Ruben Ortlam	1993e397bb	vulkan: iGPU memory reporting fix (llama/17110) * vulkan: use all device-local heaps for memory availability reporting Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com> * use all available heaps for iGPU memory reporting * Allow multiple memory types per buffer request for devices with split heaps --------- Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-09 23:38:03 +02:00
Ruben Ortlam	ee8349cf10	vulkan: fix mmq out of bounds reads (llama/17108) * vulkan: fix mmq out of bounds reads, streamline outdated matmul host code * fix mul_mat_id quantization call * Fix compiler warnings	2025-11-09 23:38:03 +02:00
Jeff Bolz	db98e8c5b4	vulkan: fuse mul_mat_id + mul (llama/17095) * vulkan: fuse mul_mat_id + mul This comes up in qwen3 moe. * split mul_mat_id fusion tests into a separate class	2025-11-09 23:38:03 +02:00
Jeff Bolz	6de3404773	vulkan: Use spec constants for conv2d s/d/p and kernel W/H (llama/16978) * vulkan: Use spec constants for conv2d s/d/p and kernel W/H Also add some additional unroll hints, which seems to help. * lock around map lookup	2025-11-09 23:38:03 +02:00
Jeff Bolz	257ce2f5c0	vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (llama/16977) This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.	2025-11-09 23:38:03 +02:00
Jeff Bolz	4eef518167	vulkan: Fix test-thread-safety crashes (llama/17024) The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the same time, which needs to hold the lock. To be safe, hold the lock for all of ggml_vk_load_shaders.	2025-11-09 23:38:03 +02:00
Acly	11543bf446	vulkan : refactor buffer handling in vk_op_f32 (llama/16840) * vulkan : refactor/simplify buffer handling in vk_op_* functions * Combine UMA handling into ggml_vk_tensor_subbuffer	2025-11-09 23:38:03 +02:00
Jeff Bolz	558a04c9c7	vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (llama/16919)	2025-11-09 23:38:03 +02:00
Jeff Bolz	1672d41ab0	vulkan: remove the need for the dryrun (llama/16826) * vulkan: remove the need for the dryrun Allocate pipelines and descriptor sets when requested. Reallocate the prealloc buffers when needed, and flush any pending work before reallocating. For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed. * remove dryrun parameters	2025-11-09 23:38:03 +02:00
Jeff Bolz	2001457367	vulkan: Fix multi_add invalid descriptor usage (llama/16899)	2025-11-09 23:38:03 +02:00
Jeff Bolz	90be9c9de1	vulkan: fuse mul_mat+add and mul_mat_id+add_id (llama/16868) * vulkan: fuse mul_mat+add and mul_mat_id+add_id The fusion is only applied for the mat-vec mul paths. * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix 32b build --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
Masato Nakasaka	e2b3eca0dc	vulkan: Fix crash when FP16 mul_mat accumulation is not supported (llama/16796) * Experimenting crash fix * added assert for aborting and fixed comment * changed to check if a pipeline is empty or not * Moved function in class definition * replaced with is_empty * Modified is_empty to check only unaligned pipelines	2025-11-09 23:38:03 +02:00
JJJYmmm	e1780b209d	model: add support for qwen3vl series (llama/16780) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit f321b9fdf13e59527408152e73b1071e19a87e71. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
Jeff Bolz	887d984558	vulkan: Handle argsort with a large number of rows (llama/16851)	2025-11-09 23:38:03 +02:00
Jeff Bolz	efe8099268	vulkan: Fuse rope+set_rows (llama/16769) This pattern appears in a lot of models, the rope operation is applied right before storing into the KV cache (usually on the K tensor). Add a path to some of the rope shaders that computes the destination address based on the set_rows tensor. Compile variants of the shader with D_TYPE of f16 (the usual KV cache type). Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs the fourth for the row indices. Add fused_ops_write_mask to indicate which intermediate tensors need to write their results to memory. Skipping writing the roped K value helps to allow more nodes to run concurrently. Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It rarely starts out that way in the graph. Add new backend tests.	2025-11-09 23:38:03 +02:00
Jeff Bolz	35a3fda240	vulkan: Update topk_moe fusion to handle gpt's late softmax (llama/16656) * vulkan: Update topk_moe fusion to handle gpt's late softmax Based on #16649. * Add ggml_check_edges * Add sync logging to show fusion effects * handle clamp added in #16655 * Update ggml/src/ggml-impl.h Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-11-09 23:38:03 +02:00
Ruben Ortlam	bc944bddc8	Vulkan MMQ Integer Dot Refactor and K-Quant support (llama/16536) * vulkan: add mmq q2_k integer dot support * Refactor mmq caching * Reduce mmq register use * Load 4 quant blocks into shared memory in one step * Pack q2_k blocks into caches of 32 * Use 32-bit accumulators for integer dot matmul * Add q4_k mmq * Add q3_k mmq * Add q5_k mmq * Add q6_k mmq * Add mxfp4 mmq, enable MMQ MUL_MAT_ID * Fix mmv dm loads	2025-11-09 23:38:03 +02:00
Jeff Bolz	82a23ca9c4	vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (llama/16793) This lets the copy to the destination device use the host-visible vidmem optimization.	2025-11-09 23:38:03 +02:00
Acly	bcda7c3e58	ggml : fix interpolate with align-corners and ne=1 (llama/16700) * ggml : fix interpolate with align-corners and ne=1 * avoid division by zero if one of the spatial dimensions is 1 * cpu, cuda, opencl returned correct result anyway due to clamp * vulkan didn't clamp for align-corners so results were broken * fix clang warning	2025-11-09 23:38:03 +02:00
Gilad S	c00ab7e5e6	vulkan: deduplicate Microsoft Direct3D12 devices (llama/16689) * fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver * style: indent * fix: decrease priority * fix: switch to `||`	2025-11-09 23:38:03 +02:00
Giuseppe Scrivano	d0b544da70	vulkan: delete dead code (llama/16732) ggml_vk_create_buffer_temp is not used anywhere, and it is the only caller for ggml_vk_pool_malloc. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-09 23:38:03 +02:00
Jeff Bolz	070b24f65c	vulkan: Optimize SSM_SCAN (llama/16645)	2025-11-09 23:38:03 +02:00
Jeff Bolz	414901a42c	vulkan: Implement topk_moe fused shader, ported from CUDA (llama/16641) This is similar to the CUDA shader from #16130, but doesn't use shared memory and handles different subgroup sizes.	2025-10-22 12:58:11 +03:00
Giuseppe Scrivano	d22008b631	vulkan: Add State Space Model (SSM) Operations Support (llama/16463) * vulkan: implement SSM scan operation Add State Space Model scan operation to the Vulkan backend. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> * vulkan: implement SSM conv operation Add State Space Model conv operation to the Vulkan backend. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-22 12:58:11 +03:00
Jeff Bolz	393fbbc80b	vulkan: Support FA with K/V in F32 (llama/16543)	2025-10-15 09:29:17 +03:00
Jeff Bolz	2e6888089f	vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (llama/16354) * vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers. The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed beyond that limit. This allows > 4GB buffers to be allocated on some implementations (e.g. NVIDIA) and tensors this large can be used for im2col and mul_mat. For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange. I'm not sure this check is ideal, but we always use these buffers as a single full size binding and the limit may be smaller than maxMemoryAllocationSize or maxBufferSize, so I think this is reasonable. Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range. The maxStorageBufferRange may be smaller than the maxBufferSize or maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and it's invalid usage if VK_WHOLE_SIZE computes a range larger than maxStorageBufferRange. With this change, it should be possible to generate videos using wan networks in stable-diffusion.cpp. * vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull	2025-10-12 11:16:23 +03:00
Jeff Bolz	fd11cd97ab	vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (llama/16316)	2025-10-12 11:16:23 +03:00
Eve	b0560310aa	vulkan: make ggml_vk_default_dispatcher support older vulkan headers (llama/16345) * make ggml_vk_default_dispatcher support older vulkan headers * simplify with using	2025-10-12 11:16:23 +03:00
Jeff Bolz	55d45edf6d	vulkan: 64-bit im2col (llama/16135) * vulkan: 64-bit im2col Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp. * fix validation error for large im2col	2025-09-29 15:18:12 +03:00

255 Commits