whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Jeff Bolz	dc794303d8	vulkan: reduce iq1 shared memory usage for mul_mm (llama/24287)	2026-06-15 10:33:53 +03:00
Ruben Ortlam	686bc802d1	vulkan: add `v_dot2_f32_f16` support in matrix-matrix multiplication and Flash Attention (llama/24123) * vulkan: add support for valve fp16 dot2 extension * use macro for dot2 path choice * properly check for the feature * add dot_product abstraction to reduce preprocessor branching	2026-06-15 10:33:53 +03:00
Pascal	28c7ed3db7	ggml : add GGML_OP_COL2IM_1D (llama/24206) * cpu: add GGML_OP_COL2IM_1D Add the overlap-add (scatter-add) step of a 1D transposed convolution. A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d scatters those columns back into the [T_out, OC] signal, with T_out = (T_in - 1)*s0 + K - 2*p0. Keeping the contraction as a plain mul_mat leaves the heavy work on the optimized (and quantizable) matmul kernels, so col2im_1d only does the cheap overlap-add. CPU uses a gather formulation parallelized over output channels, supporting F32, F16 and BF16 with an F32 accumulator. * tests: add backend coverage for GGML_OP_COL2IM_1D Add test_col2im_1d next to the conv_transpose_1d cases, covering F32, F16 and BF16 across eight geometries: the canonical kernel = 2*stride DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and p0 = stride/2), kernel < stride with zeroed gaps, kernel not a multiple of stride, and a single column unfold. Perf mode gets three real vocoder stage shapes reporting memory bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16. cpu: harden GGML_OP_COL2IM_1D ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph build time, before the oc division, protecting every backend at once. The kernel asserts the contiguity its flat indexing assumes and its doc states the full output length including the crop term. The kernel parallelizes over the time axis: the split stays balanced down to OC = 1, where the previous channel split was single threaded. Values are bit identical on the three real vocoder chains, two out of three improve. * tests: extend the GGML_OP_COL2IM_1D grid The eval grid grows to eleven geometries: OC = 1 (mono output stage), K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and a crop down to T_out = 2 where all the gather bounds act at once. * tests: add col2im_1d equivalence test tests/test-col2im-1d.cpp proves mul_mat + col2im_1d matches the native ggml_conv_transpose_1d on the CPU backend, F32 bit exact, F16 and BF16 through casts of the column matrix. test-backend-ops cannot cover this for a CPU only op since the CPU backend is its own reference there. * rpc: bump protocol patch version for GGML_OP_COL2IM_1D GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the op is appended and no existing op code shifts.	2026-06-15 10:33:53 +03:00
Yash Raj Pandey	2d68a3066f	ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (llama/24305) * ggml-cpu : fix rms_norm_back wrong output under in-place aliasing * cont : clean-up comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-06-15 10:33:53 +03:00
ravel7524	72894aa250	Remove case for GGML_TYPE_Q4_K in mvvq.cu (llama/23528)	2026-06-15 10:33:53 +03:00
Reese Levine	e69e5138fe	ggml-webgpu: Add clang-format job (llama/24308) * Add clang-format job * try local formatting	2026-06-15 10:33:53 +03:00
Masashi Yoshimura	aa42b48312	ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants (llama/24225) * ggml-webgpu: Improve prefill speeds + refactor matmul for quants * Fixes for editorconfig checker	2026-06-15 10:33:53 +03:00
Nikhil Jain	15e5d401d1	Handle buffer overlap / buffer aliasing for concat operator (llama/24000) * Only run webgpu CI on my fork * Add webgpu only workflow * handle buffer overlap case for concat operator * restore build-webgpu.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Run clang-format * Update ggml/src/ggml-webgpu/wgsl-shaders/concat.wgsl --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-06-15 10:33:53 +03:00
Nikhil Jain	490e50056c	Implement 2D workgroups for scale, binary, and unary ops (llama/24044) * Only run webgpu CI on my fork * Add webgpu only workflow * Implement 2d workgroups for more operations * fix * Fix type * Move back to global_invocation_id	2026-06-15 10:33:53 +03:00
Jeff Bolz	fbf720dc9f	vulkan: Use cm2 decode_vector for mul_mat_id B matrix loads (llama/23991) This allows vec4 loads of the B elements. Also increase BK to 64 when this is enabled. Neither of these alone is consistently faster, but together these give a nice speedup. In ggml-vulkan.cpp, we need to make sure the B matrix alignment and stride are multiples of 4.	2026-06-15 10:33:53 +03:00
Ruben Ortlam	782f1226c8	cuda: reset cuda context after reading memory size (llama/23935) * cuda: reset device in get_memory function if no backend is active * also count device and host buffers * exclude hip and musa from counting and device reset * use device mutex instead of atomic * undo backend_free function move	2026-06-15 10:33:53 +03:00
Daniel Bevenius	df7638d822	ci : pin github actions to commit sha's (#3865)	2026-06-09 12:51:00 +02:00
Christopher Albert	ba573929cd	coreml : fix --quantize crash for mlprogram format; fix --optimize-ane label (#3868) commit `8b92060` switched ct.convert() to mlprogram, but did not update the --quantize path. quantize_weights() from neural_network.quantization_utils only works with the legacy neuralnetwork format. Running with --quantize crashed with: Exception: MLModel of type mlProgram cannot be loaded just from the model spec object. It also needs the path to the weights file. Fix: pass compute_precision=ct.precision.FLOAT16 into ct.convert() when --quantize is set. This matches the original intent of nbits=16 (F16 storage) without changing the quantization scheme or model accuracy. Also fix the three boolean CLI flags (--encoder-only, --quantize, --optimize-ane) to use a _str_to_bool helper so that both --flag True and --flag False parse correctly. The type=bool form accepted "False" as True because bool("False") == True. Remove the "currently broken" label from --optimize-ane: the ANE path (WhisperANE with Conv2d attention and LayerNormANE) converts and loads correctly with both PyTorch 2.x and coremltools 9.x.	2026-06-09 08:34:31 +02:00
Georgi Gerganov	84bd03a438	talk-llama : sync llama.cpp	2026-06-08 14:36:36 +03:00
Georgi Gerganov	4df9a57df2	sync : ggml	2026-06-08 14:36:36 +03:00
Georgi Gerganov	b31466b4a1	ggml : bump version to 0.14.0 (ggml/1533)	2026-06-08 14:36:36 +03:00
Georgi Gerganov	b932ec5529	sync : ggml	2026-06-08 14:36:36 +03:00
Harkirat Gill	4669631d20	HIP: add gfx1152 and gfx1153 to RDNA3.5 (llama/24129)	2026-06-08 14:36:36 +03:00
Xuan-Son Nguyen	2c139c2e5e	metal : fix im2col 1D case (audio models) (llama/24220)	2026-06-08 14:36:36 +03:00
Ruben Ortlam	1777deff4c	vulkan: check coopmat2 features before reporting support (llama/24186)	2026-06-08 14:36:36 +03:00
lhez	a87e950a06	opencl: improve get_rows, cpy, concat and q6_k flat gemv (llama/24160) * opencl: allow multiple workgroups for large rows * opencl: improve small cpy * opencl: packed concat for small input * opencl: tweak flat q6_K gemv, increase N_DST and remap threads	2026-06-08 14:36:36 +03:00
Ruben Ortlam	5a1feed8ca	vulkan: add fwht support for Intel with shmem reduction (llama/23964) * vulkan: add fwht support for Intel with shmem reduction * don't use N as workgroup size * disable subgroup shuffle on MoltenVK AMD * disable fwht shader on Intel Windows due to driver bug	2026-06-08 14:36:36 +03:00
Charles Xu	facb02c4c3	kleidiai : dynamic chunk-based scheduling for hybrid execution (llama/23819)	2026-06-08 14:36:36 +03:00
Oliver Simons	4fa1e0687e	CUDA: enroll mul_mat_vec_q_moe into pdl (llama/24087) * Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW Data collected on a B4500: Before ``` (llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8 code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8 explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4 summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6 qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1 translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5 creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2 stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2 long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9 ``` After ``` (llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9 code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6 explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8 summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2 qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5 translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4 creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8 stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7 long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7 ``` Server launched with: ``` ➜ llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \ -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \ --spec-type draft-mtp \ --spec-draft-n-max 2 \ -ngl all \ -fa on \ --host 0.0.0.0 \ --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" ``` * LC to overlap with following kernels	2026-06-08 14:36:36 +03:00
Mason Milburn	4ecede8c8b	sycl : port multi-column MMVQ from CUDA backend (llama/21845) mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to incompatible vec_dot signatures. ggml-sycl: The weight reorder was only bootstrapped on single-token mat-vec (ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec, so it never triggered the reorder and ran on the slower non-reorder kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.	2026-06-08 14:36:36 +03:00
Kartik Sirohi	991b5a8b4a	ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (llama/22209) * ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 Optimize the inner loop of ggml_vec_dot_q4_1_q8_1_generic using WASM SIMD128 intrinsics, gated behind #ifdef __wasm_simd128__ so non-wasm builds are completely unaffected. Approach: - single wasm_v128_load covers all 32 packed 4-bit weights - nibbles unpacked via AND/SHR into two u8x16 registers - widened to i16 before multiply (WASM SIMD has no i8i8 instruction) - 4x wasm_i32x4_dot_i16x8 calls accumulate all 32 element pairs - horizontal reduce via 4x wasm_i32x4_extract_lane Benchmark (node v25, emcc -O3 -msimd128, 64 blocks x QK8_1=32, 200k iterations): \| impl \| ns/call \| speedup \| \|--------\|---------\|---------\| \| scalar \| 880.7 \| 1.00x \| \| simd \| 257.8 \| 3.42x \| Correctness verified against scalar reference across 10 random seeds with exact output match. ggml: move q4_1_q8_1 WASM SIMD implementation to wasm backend Relocate the SIMD128 implementation of ggml_vec_dot_q4_1_q8_1 to ggml/src/ggml-cpu/arch/wasm/quants.c to follow architecture-specific layout. Restore the generic implementation in ggml/src/ggml-cpu/quants.c. Move for loop in the else block. * ggml: use generic q4_1_q8_1 fallback in wasm backend	2026-06-08 14:36:36 +03:00
Georgi Gerganov	9d6e561f69	metal : reduce rset heartbeat from 500ms -> 5ms (llama/24074)	2026-06-08 14:36:36 +03:00
Reese Levine	e9dbd0c18a	ggml-webgpu: FlashAttention refactor + standardize quantization support (llama/23834) * Start work on flash_attn refactor * Refactor * Split k/v quantization * Refactor and abstract quantization logic for flash_attn and mul_mat * Add quantization support to tile path * formatting * Move to functions, add a check	2026-06-08 14:36:36 +03:00
rehan-10xengineer	a1a3186887	ggml-cpu: extend RVV quantization vec dot to higher VLENs (llama/22754) * ggml-cpu: add rvv 512b,1024b impls for iq4_xs * ggml-cpu: refactor; add rvv 512b, 1024b impls for q6_K, i-quants * ggml-cpu: refactor; add 512 and 1024 implementations of tq3_s, iq3_xxs, iq2_s, iq2_xs, iq2_xxs improve iq2_xs impl for rvv 256 Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> --------- Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai> Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>	2026-06-08 14:36:36 +03:00
Andreas Kieslinger	00a9728de3	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030) * Removes __restrict__ from PDL kernel headers due to incompatibility with PDL. Adds preprocessor directives based on arch in kernel body to add __restrict__ to retain performance on older architectures. * Simplifies new __restrict__ usage via macro * Add hopper to PDL __restrict__ fix. Co-authored-by: Oliver Simons <osimons@nvidia.com> --------- Co-authored-by: Oliver Simons <osimons@nvidia.com>	2026-06-08 14:36:36 +03:00
Charles Xu	750fa4ca35	ggml-cpu: use runtime SVE width in FWHT (llama/24059)	2026-06-08 14:36:36 +03:00
Aman Gupta	d5a49ebec8	cuda: reserve space for quantize kv-cache at startup (llama/23907) * cuda: reserve space for quantize kv-cache at startup * address review comments * remove forward decl Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * remove assert in ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-06-08 14:36:36 +03:00
lhez	f110ff540c	opencl: use flat variants of q4_K and q6_K gemv for very large M (llama/24006)	2026-06-08 14:36:36 +03:00
Max Krasnyansky	d31cb20b25	hexagon: profiler output fix and script updates (llama/24042) * hex-ops: fix profiler output (ie remove the redundant NONEs) * hex-prof: update profiling script to support tot.usec column	2026-06-08 14:36:36 +03:00
Max Krasnyansky	8d61a9edf0	hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (llama/23989) * hex-mm: initial support for F32 * F32 -> F32 matmuls * hex-rms-norm: fix src1 stride use in fused rms_norm_mul * hex-ops: clear spad pointers in the ops that clobber it This fixes an odd case where fused rms-norm-mul was failing but only in qwen3.5-2B and only at certain op-batch sizes. * hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX Decided to use Q4_0 * F32 -> F32 matmul for this. Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16. Super simple and pretty efficient. * hmx-mm: route f16 2D matmuls through the same kernel used for all other types * hmx-mm: re-introduce pipelined vs non-pipelined mode that we used to have but in a much more generic way This update further improves matmul performance and at the same time removes most of the redundant logic we had in different paths. * hmx-fa: slightly improved pipeline similar to matmul updates * hmx-mm: initial version of MAT_MUL_ID support for HMX * hmx-mm: fixed mxfp4 handling for MUL_MAT_ID * hex-gdn: optimize GATED_DELTA_NET DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :) * hmx-mm: missed one more case where we can use fastmod * hexagon: update DCVS settings for a slight perf bump * hmx-fa: use fastdiv in hmx-flash-attn * hmx-fa: precompute slope values to avoid disrupting the inner loop * hvx-utils/fa: new HVX helpers for powf and logf and using those to speed up FA alibi * hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty * hex-fa: correctly fall back to HVX if we have sinks or the dims are not quite right	2026-06-08 14:36:36 +03:00
Todor Boinovski	754247f28b	hexagon: add gelu_quick (llama/24007)	2026-06-08 14:36:36 +03:00
Anav Prasad	79223704a1	clean up unused variables warnings (llama/23975)	2026-06-08 14:36:36 +03:00
lhez	9a0265d13b	opencl: fix compiler warnings for non-adreno path (llama/23922) * opencl: fix compiler warnings for non-adreno path * opencl: fix const cast warning	2026-06-08 14:36:36 +03:00
Masashi Yoshimura	db2a39507c	revert to using global_invocation_id for cpy shader (llama/23955)	2026-06-08 14:36:36 +03:00
shaofeiqi	e728bae159	opencl: add basic support for q5_0 and q5_1 (llama/23548) * opencl: add general q5_0 support * opencl: add general q5_1 support * opencl: support non-uniform workgrp size --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-06-08 14:36:36 +03:00
Shrivas Shankar	050b8567a0	metal: template GLU kernels to support f16/f32 (llama/23882) Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.	2026-06-08 14:36:36 +03:00
Jeff Bolz	71d80aa49e	vulkan: don't hold the device mutex while compiling pipelines (llama/23641) * vulkan: don't hold the device mutex while compiling pipelines We need to hold a lock while we traverse all pipelines and lazily initialize them, but we don't need to hold it while the pipeline is being compiled. And it doesn't need to be the same lock as the device mutex. We call load_shaders each time a pipeline is needed, so we only need to compile that one pipeline (and, for example, don't want to end up compiling a pipeline that another thread should be compiling). * remove 'needed'	2026-06-08 14:36:36 +03:00
Winston Ma	c471bcce1b	vulkan: reduce host memory lock contention (llama/23376) * vulkan: reduces lock contention * replace unique_lock with lock_guard	2026-06-08 14:36:36 +03:00
Johannes Gäßler	e815b264eb	TP: quantized KV cache support (llama/23792) * TP: quantized KV cache support * fix partial view * remove overly strict assert	2026-06-08 14:36:36 +03:00
Matt Corallo	982533fc0c	vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (llama/23056) Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned, and Q3_K still wins on NVIDIA as well. mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, we can do subtraction directly on a full int32_t rather than an i8vec4 with bit twiddling because the high bit is always free to start. On Intel BMG on mesa, the switch to MMVQ provides an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K. The further switch to block loads leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K. Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA override for K quants on Xe2 as well.	2026-06-08 14:36:36 +03:00
Winston Ma	aea93ada61	vulkan: Removed unused functions (llama/23175)	2026-06-08 14:36:36 +03:00
Neo Zhang	ec0c661950	Support Q4_1, Q5_0, Q5_1 in Flash-attention (llama/23812) * support Q4_1, Q5_0, Q5_1 * update ut case	2026-06-08 14:36:36 +03:00
Neo Zhang	20323e48c4	Add more types in GET_ROWS OP (llama/23710) * add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP * correct the link	2026-06-08 14:36:36 +03:00
Neo Zhang	687fbcb149	sycl : Optimize Q3_K mul_mat by reorder (llama/23725)	2026-06-08 14:36:36 +03:00
lhez	1c0d1f0f7c	opencl: support bf16 by converting to f16 (llama/23839)	2026-06-08 14:36:36 +03:00
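
Note on 28c7ed3db7 (GGML_OP_COL2IM_1D): the commit message describes a ConvTranspose1d as a GEMM followed by an overlap-add, with T_out = (T_in - 1)*s0 + K - 2*p0. The following is a minimal pure-Python sketch of that factorization, not ggml code. The function names, the nested-list layout (x as [IC][T_in], w as [IC][OC][K], columns as [T_in][OC*K]) and the example values are illustrative assumptions and do not reflect ggml's API or memory layout.

```python
def conv_transpose_1d_direct(x, w, stride, pad):
    """Reference transposed convolution by direct scatter-add.

    x: [IC][T_in], w: [IC][OC][K] (hypothetical layout). Returns [OC][T_out].
    """
    ic, t_in = len(x), len(x[0])
    oc, k = len(w[0]), len(w[0][0])
    t_out = (t_in - 1) * stride + k - 2 * pad
    y = [[0.0] * t_out for _ in range(oc)]
    for c in range(ic):
        for t in range(t_in):
            for o in range(oc):
                for j in range(k):
                    pos = t * stride + j - pad  # padding crops both ends
                    if 0 <= pos < t_out:
                        y[o][pos] += x[c][t] * w[c][o][j]
    return y


def gemm_columns(x, w):
    """GEMM step: contract over IC, giving cols[t][o*K + j]."""
    ic, t_in = len(x), len(x[0])
    oc, k = len(w[0]), len(w[0][0])
    return [
        [sum(x[c][t] * w[c][o][j] for c in range(ic))
         for o in range(oc) for j in range(k)]
        for t in range(t_in)
    ]


def col2im_1d(cols, oc, k, stride, pad):
    """Overlap-add step in gather form: each output sample sums the
    column entries (t, j) that satisfy t*stride + j - pad == pos."""
    t_in = len(cols)
    t_out = (t_in - 1) * stride + k - 2 * pad
    y = [[0.0] * t_out for _ in range(oc)]
    for o in range(oc):
        for pos in range(t_out):
            acc = 0.0
            for j in range(k):
                num = pos + pad - j
                if num >= 0 and num % stride == 0 and num // stride < t_in:
                    acc += cols[num // stride][o * k + j]
            y[o][pos] = acc
    return y


if __name__ == "__main__":
    # Made-up example: IC = 2, OC = 2, K = 4, stride = 2, pad = 1.
    x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
    w = [[[1.0, 0.5, -1.0, 2.0], [0.0, 1.0, 1.0, -0.5]],
         [[2.0, -1.0, 0.5, 1.0], [1.0, 1.0, -2.0, 0.25]]]
    direct = conv_transpose_1d_direct(x, w, 2, 1)
    factored = col2im_1d(gemm_columns(x, w), 2, 4, 2, 1)
    print("match:", direct == factored)
```

The gather form writes each output sample exactly once, which is why it can be split across threads without atomics; the heavy contraction stays in the GEMM step, as the commit message notes.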


4662 Commits
