whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Jeff Bolz	67473fef57	vulkan: handle rope with large number of rows (llama/18306)	2025-12-31 17:52:09 +02:00
Jeff Bolz	f863735caa	vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (llama/18302)	2025-12-31 17:52:09 +02:00
Ruben Ortlam	1356600679	vulkan: use fewer FA rows for small cache runs (llama/18280)	2025-12-31 17:52:09 +02:00
Jeff Bolz	dbbe6c11b5	vulkan: Extend rope fusions to allow mrope (llama/18264) Extend the test-backend-ops tests as well.	2025-12-31 17:52:09 +02:00
Jeff Bolz	98e59a43d1	vulkan: Implement set_tensor_async and the event interfaces (llama/18047) The goal is to enable the async loading code paths in llama_model_loader::load_all_data, originally from #7896. This works and the loads themselves are faster, but with host visible vidmem I think the cost of allocating/mapping vidmem moves and becomes more expensive, and I don't see a benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a significant improvement in model loading time.	2025-12-31 17:52:09 +02:00
Jeff Bolz	b893e0813a	vulkan: fix im2col overflowing maxworkgroupcount (llama/18180)	2025-12-31 17:52:09 +02:00
Jeff Bolz	f407c5e562	vulkan/cuda: fix topk_moe with exp_probs_b (llama/18071) I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn and added coverage for exp_probs_b and some other missing combinations. This exposed a bug in both CUDA and Vulkan backends where they were assuming the input to argsort and the input to get_rows are the same. I'd like to optimize this graph in another change, but for now just get it functional. CUDA also had a bug where it got n_experts from the wrong place, leading to GGML_ASSERT failures in some of the new tests.	2025-12-31 17:52:09 +02:00
Jeff Bolz	ad6ee3865d	vulkan: support GGML_UNARY_OP_XIELU (llama/18062)	2025-12-31 17:52:09 +02:00
Jeff Bolz	3cd141f1a9	vulkan: in graph_optimize, try to group ADD operations (llama/18060) I saw the adds not staying together in the new nemotron 3 nano model.	2025-12-31 17:52:09 +02:00
lovedheart	449fc7c024	Vulkan: some improvement on mul_mat_iq2_xs (llama/18031) * Some improvement on mul_mat_iq2_xs Refactor calculations for db values and grid data to optimize performance and reduce redundancy. * Fix trailing whitespace	2025-12-31 17:52:09 +02:00
Jeff Bolz	195d8d0c65	vulkan: Add perf logger mode with concurrency (llama/17944) This implements a variation of the perf logger where rather than timing each operation individually with effectively a barrier in between, we put the timing boundaries where we already synchronize and time the groups of work that normally overlap. This can be useful to help understand whether individual operations need to be optimized, or if the group is already running efficiently. GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when GGML_VK_PERF_LOGGER is also set). GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.	2025-12-31 17:52:09 +02:00
Ruben Ortlam	3bb4e1e0ac	vulkan: fix mul_mat_vec_iq1_s formatting (llama/18026)	2025-12-18 08:20:56 +02:00
Jeff Bolz	af2c8cba6f	vulkan: Fix data race/hang in scalar/cm1 flash attention (llama/17887)	2025-12-18 08:20:56 +02:00
lovedheart	7e5df2975e	vulkan: improve mul_mat_vec_iq1_s speed (llama/17874)	2025-12-18 08:20:56 +02:00
Eve	cdadfc3b72	vulkan: faster q6_k matmul (llama/17813) * q6_k faster mul mat * 8 values * fix comment * switch to two at a time * start ci for .glsl files	2025-12-18 08:20:56 +02:00
Jeff Bolz	b901ebe4a3	vulkan: support get_rows for i32 (llama/17941)	2025-12-18 08:20:56 +02:00
Jeff Bolz	f33446643e	vulkan: support GGML_OP_DIAG (llama/17893)	2025-12-18 08:20:56 +02:00
Jeff Bolz	939d3085e9	vulkan: Multi-pass softmax for large number of cols (llama/17892) When the number of cols is large, split each row across multiple workgroups. There are three phases that communicate partial results through temp buffers: (1) compute max partials (2) take max of partials, compute sum(exp(x-max)) partials (3) sum partials, compute scaled result	2025-12-18 08:20:56 +02:00
Jeff Bolz	13bb296dbf	vulkan: Allow non-pow2 n_experts in topk_moe (llama/17872)	2025-12-18 08:20:56 +02:00
lovedheart	d6d44fac69	Vulkan: improve mul_mat_vec_iq1_m (llama/16907) * Optimize Vulkan shader for matrix-vector multiplication * Revert changes on compute_outputs and main Refactor compute_outputs to handle remaining rows correctly. * Fix trailing whitespace	2025-12-12 17:53:21 +02:00
Jeff Bolz	898f876fe2	vulkan: perf_logger improvements (llama/17672) * vulkan: perf_logger improvements - Move perf_logger from device to ctx. - Add an env var to control the frequency we dump the stats. If you set a very large value, it just dumps when the ctx is destroyed. - Add a fusion info string to the tracking, only log one item per fused op. - Fix MUL_MAT_ID flops calculation. * fix vector sizes	2025-12-12 17:53:21 +02:00
Phylliida Dev	c5e1807071	ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (llama/16985) * Feat: Added vulkan circular tiling support * Feat: Added cpu circular * Feat: Added cuda kernels * Added tests * Added tests * Removed non-pad operations * Removed unneded changes * removed backend non pad tests * Update test-backend-ops.cpp * Fixed comment on pad test * removed trailing whitespace * Removed unneded test in test-backend-ops * Removed removed test from calls * Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp Co-authored-by: Ruben Ortlam <picard12@live.de> * Fixed alignment * Formatting Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Format pad * Format * Clang format * format * format * don't change so much stuff * clang format and update to bool * fix duplicates * don't need to fix the padding * make circular bool * duplicate again * rename vulkan to wrap around * Don't need indent * moved to const expr * removed unneded extra line break * More readable method calls * Minor wording changes * Added final newline * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Added circular pad ext tests * Gate non circular pad devices * Cleaned gating of non-circular pad devices --------- Co-authored-by: Phylliida <phylliidadev@gmail.com> Co-authored-by: Ruben Ortlam <picard12@live.de> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:20 +02:00
Jeff Bolz	c66c71e9f4	vulkan: Use one row per workgroup for f32 mmv (llama/17711) The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs well. I think even for larger m, f32 is so bandwidth-limited that running multiple rows doesn't help.	2025-12-12 17:53:20 +02:00
Jeff Bolz	875d861473	vulkan: support solve_tri with larger N/K values (llama/17781) Split N into chunks to fit into shared memory. If K > 128, use a larger workgroup with enough invocations. Add perf tests matching qwen3next.	2025-12-12 17:53:20 +02:00
Masato Nakasaka	a8d02735f7	vulkan: Replace deprecated VK_EXT_validation_features (llama/17637) * replaced deprecated VK_EXT_validation_features * forgot to remove old code	2025-12-12 17:53:19 +02:00
Masato Nakasaka	191e5f46a2	vulkan: Fix mismatch in TOPK_MOE unit test (llama/17541) * Fix shader to support 2D workgroup mapping to a single subgroup * Set required_subgroup_size topk_moe shader requires static WARP_SIZE and actual subgroup size to match	2025-12-12 17:53:19 +02:00
Jeff Bolz	64a3f573e0	vulkan: add more num_blocks instantiations in rms_norm (llama/17701)	2025-12-12 17:53:19 +02:00
Jeff Bolz	0484147ab2	vulkan: fix top_k bug when there are ties in the input (llama/17659) * vulkan: Reduce temporary memory usage for TOP_K - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size - Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k" For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB. * vulkan: fix top_k bug when there are ties in the input I noticed by inspection a bug in the vulkan top_k shader where if the least value in the top_k appears multiple times we could end up writing those extra copies out rather than some larger values (if the larger values are on higher numbered threads). I rewrote the test verification to handle this case, where the final index set is not necessarily the same. * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:19 +02:00
Acly	0b53759b29	vulkan : support conv-2d with large output size (llama/17685)	2025-12-12 17:53:19 +02:00
Jeff Bolz	7e97d3b069	vulkan: enable mmvq for q2_k on NVIDIA (llama/17675)	2025-12-12 17:53:18 +02:00
Jeff Bolz	32ba1ec8e0	vulkan: set all memory allocations to high priority (llama/17624) * vulkan: set all memory allocations to high priority * gate by env var	2025-12-12 17:53:18 +02:00
Jeff Bolz	86cb5ab93f	vulkan: Reduce temporary memory usage for TOP_K (llama/17623) - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size - Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k" For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB.	2025-12-12 17:53:15 +02:00
Tarek Dakhran	0defeee679	model: LFM2-VL fixes (llama/17577) * Adjust to pytorch * Add antialiasing upscale * Increase number of patches to 1024 * Handle default marker insertion for LFM2 * Switch to flag * Reformat * Cuda implementation of antialias kernel * Change placement in ops.cpp * consistent float literals * Pad only for LFM2 * Address PR feedback * Rollback default marker placement changes * Fallback to CPU implementation for antialias implementation of upscale	2025-12-12 17:53:14 +02:00
Acly	2258930c2e	vulkan : fix FA mask load with bounds check (coopmat2) (llama/17606)	2025-12-12 17:53:13 +02:00
Ruben Ortlam	2fcc0a3a9f	Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (llama/16900) * vulkan: split mul_mmq_funcs for mul_mat_vecq use * add mxfp4 mmvq * add q2_k mmvq * add q3_k mmvq * add q4_k and q5_k mmvq * add q6_k mmvq * handle 4x4 quants per mmvq thread * enable MUL_MAT_ID mmvq support * enable subgroup optimizations for mul_mat_vec_id shaders * device tuning * request prealloc_y sync after quantization * fix indentation * fix llvmpipe test failures * fix mul_mat_id mmvq condition * fix unused variable warning	2025-12-12 17:53:12 +02:00
Jeff Bolz	dbf8766ffa	vulkan: improve topk perf for large k, fix overflow in unit tests (llama/17582)	2025-12-12 17:53:12 +02:00
Jeff Bolz	7a20963140	vulkan: Implement GGML_OP_TRI (llama/17503) * vulkan: Implement GGML_OP_TRI * check types match	2025-12-12 17:53:11 +02:00
Jeff Bolz	3727a36c48	vulkan: Implement SOLVE_TRI (llama/17486) * vulkan: Implement SOLVE_TRI * load B matrix through shared memory * use FLOAT_TYPE	2025-12-12 17:53:10 +02:00
Acly	ac92424b59	vulkan : move contiguous checks to device_supports_op (llama/17490) * vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op * im2col: remove contraints on src0 (kernel input)	2025-12-12 17:53:10 +02:00
Jeff Bolz	310db24fca	vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (llama/17514)	2025-12-12 17:53:10 +02:00
Jeff Bolz	c8050e5fdc	vulkan: allow graph_optimize for prompt processing workloads (llama/17475)	2025-12-12 17:53:09 +02:00
Jeff Bolz	d8b61e05f8	vulkan: Implement top-k (llama/17418) * vulkan: Implement top-k Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10) and discards all but the top K. Repeat until only K are left. And there's a fast path when K==1 to just find the max value rather than sorting. * fix pipeline selection * vulkan: Add N-ary search algorithm for topk * microoptimizations	2025-12-12 17:53:09 +02:00
Jeff Bolz	208450048c	vulkan: Implement GGML_OP_CUMSUM (llama/17479)	2025-12-12 17:53:08 +02:00
Jeff Bolz	273e4fe7ae	vulkan: Use fewer rows for scalar FA when HS is not a multiple of 16 (llama/17455)	2025-12-12 17:53:07 +02:00
Jeff Bolz	553d57a4e7	vulkan: more FA details in vk_perf_logger (llama/17443)	2025-12-12 17:53:07 +02:00
Jeff Bolz	deb4958add	vulkan: remove a couple unnecessary switches (llama/17419)	2025-12-12 17:53:06 +02:00
Jeff Bolz	cdc1a776be	vulkan: disable async for older Intel devices (llama/17369) * vulkan: disable async for older Intel devices * update detection logic * use name string for detection	2025-12-12 17:53:05 +02:00
Giuseppe Scrivano	24b14cad87	vulkan: implement ADD1, ARANGE, FILL, SOFTPLUS, STEP, ROUND, CEIL, FLOOR, TRUNC (llama/17319) * vulkan: initialize array * vulkan: implement ADD1 * vulkan: implement ARANGE * vulkan: implement FILL * vulkan: implement SOFTPLUS * vulkan: implement STEP * vulkan: implement ROUND * vulkan: implement CEIL * vulkan: implement FLOOR * vulkan: implement TRUNC * docs: update Vulkan ops Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-12-12 17:53:04 +02:00
Jeff Bolz	95d0b0b0cf	vulkan: support larger argsort (llama/17313) * vulkan: support larger argsort This is an extension of the original bitonic sorting shader that puts the temporary values in global memory and when more than 1024 threads are needed it runs multiple workgroups and synchronizes through a pipelinebarrier. To improve the memory access pattern, a copy of the float value is kept with the index value. I've applied this same change to the original shared memory version of the shader, which is still used when ncols <= 1024. * Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost * reduce loop overhead * run multiple cols per invocation, to reduce barrier overhead	2025-12-12 17:53:04 +02:00
Jeff Bolz	ae8865c6e6	vulkan: Add copy_transpose shader (llama/17371)	2025-12-12 17:53:04 +02:00

380 Commits
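
The message for commit 939d3085e9 (multi-pass softmax) describes three phases that pass partial results through temporary buffers. The sketch below is a CPU-side illustration of that phase structure in plain Python, not the Vulkan shader itself. The function name `multipass_softmax` and the `chunk` parameter are hypothetical; each chunk stands in for one workgroup, the row is assumed to be non-empty, and any scale or mask inputs the real operator may take are left out.

```python
import math

def multipass_softmax(row, chunk=4):
    """Softmax of one row, computed in three phases over fixed-size chunks.

    Illustrative only: `chunk` plays the role of the slice of a row that a
    single workgroup would handle; the lists of partials play the role of
    the temp buffers mentioned in the commit message.
    """
    # Split the row into chunks (the last one may be shorter).
    chunks = [row[i:i + chunk] for i in range(0, len(row), chunk)]

    # Phase 1: each chunk computes its own max partial.
    max_partials = [max(c) for c in chunks]

    # Phase 2: take the max of the partials, then each chunk computes
    # its partial sum of exp(x - max).
    row_max = max(max_partials)
    sum_partials = [sum(math.exp(x - row_max) for x in c) for c in chunks]

    # Phase 3: sum the partials and compute the normalized result.
    total = sum(sum_partials)
    return [math.exp(x - row_max) / total for x in row]
```

Subtracting the row maximum before exponentiating keeps every exponent at or below zero, so the partial sums cannot overflow even for large inputs. That is why the maximum has to be known for the whole row before the second phase can start.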