whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Charles Xu	750fa4ca35	ggml-cpu: use runtime SVE width in FWHT (llama/24059)	2026-06-08 14:36:36 +03:00
Aman Gupta	d5a49ebec8	cuda: reserve space for quantize kv-cache at startup (llama/23907) * cuda: reserve space for quantize kv-cache at startup * address review comments * remove forward decl Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * remove assert in ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-06-08 14:36:36 +03:00
lhez	f110ff540c	opencl: use flat variants of q4_K and q6_K gemv for very large M (llama/24006)	2026-06-08 14:36:36 +03:00
Max Krasnyansky	d31cb20b25	hexagon: profiler output fix and script updates (llama/24042) * hex-ops: fix profiler output (ie remove the redundant NONEs) * hex-prof: update profiling script to support tot.usec column	2026-06-08 14:36:36 +03:00
Max Krasnyansky	8d61a9edf0	hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (llama/23989) * hex-mm: initial support for F32 * F32 -> F32 matmuls * hex-rms-norm: fix src1 stride use in fused rms_norm_mul * hex-ops: clear spad pointers in the ops that clobber it This fixes an odd case where fused rms-norm-mul was failing but only in qwen3.5-2B and only at certain op-batch sizes. * hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX Decided to use Q4_0 * F32 -> F32 matmul for this. Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16. Super simple and pretty efficient. * hmx-mm: route f16 2D matmuls through the same kernel used for all other types * hmx-mm: re-introduce pipelined vs non-pipelined mode that we used to have but in a much more generic way This update further improves matmul performance and at the same time removes most of the redundant logic we had in different paths. * hmx-fa: slightly improved pipeline similar to matmul updates * hmx-mm: initial version of MUL_MAT_ID support for HMX * hmx-mm: fixed mxfp4 handling for MUL_MAT_ID * hex-gdn: optimize GATED_DELTA_NET DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :) * hmx-mm: missed one more case where we can use fastmod * hexagon: update DCVS settings for a slight perf bump * hmx-fa: use fastdiv in hmx-flash-attn * hmx-fa: precompute slope values to avoid disrupting the inner loop * hvx-utils/fa: new HVX helpers for powf and logf and using those to speed up FA alibi * hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty * hex-fa: correctly fall back to HVX if we have sinks or the dims are not quite right	2026-06-08 14:36:36 +03:00
Todor Boinovski	754247f28b	hexagon: add gelu_quick (llama/24007)	2026-06-08 14:36:36 +03:00
Anav Prasad	79223704a1	clean up unused variables warnings (llama/23975)	2026-06-08 14:36:36 +03:00
lhez	9a0265d13b	opencl: fix compiler warnings for non-adreno path (llama/23922) * opencl: fix compiler warnings for non-adreno path * opencl: fix const cast warning	2026-06-08 14:36:36 +03:00
Masashi Yoshimura	db2a39507c	revert to using global_invocation_id for cpy shader (llama/23955)	2026-06-08 14:36:36 +03:00
shaofeiqi	e728bae159	opencl: add basic support for q5_0 and q5_1 (llama/23548) * opencl: add general q5_0 support * opencl: add general q5_1 support * opencl: support non-uniform workgrp size --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-06-08 14:36:36 +03:00
Shrivas Shankar	050b8567a0	metal: template GLU kernels to support f16/f32 (llama/23882) Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.	2026-06-08 14:36:36 +03:00
Jeff Bolz	71d80aa49e	vulkan: don't hold the device mutex while compiling pipelines (llama/23641) * vulkan: don't hold the device mutex while compiling pipelines We need to hold a lock while we traverse all pipelines and lazily initialize them, but we don't need to hold it while the pipeline is being compiled. And it doesn't need to be the same lock as the device mutex. We call load_shaders each time a pipeline is needed, so we only need to compile that one pipeline (and, for example, don't want to end up compiling a pipeline that another thread should be compiling). * remove 'needed'	2026-06-08 14:36:36 +03:00
Winston Ma	c471bcce1b	vulkan: reduce host memory lock contention (llama/23376) * vulkan: reduces lock contention * replace unique_lock with lock_guard	2026-06-08 14:36:36 +03:00
Johannes Gäßler	e815b264eb	TP: quantized KV cache support (llama/23792) * TP: quantized KV cache support * fix partial view * remove overly strict assert	2026-06-08 14:36:36 +03:00
Matt Corallo	982533fc0c	vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (llama/23056) Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned, and Q3_K still wins on NVIDIA as well. mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, we can do subtraction directly on a full int32_t rather than an i8vec4 with bit twiddling because the high bit is always free to start. On Intel BMG on mesa, the switch to MMVQ provides an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K. The further switch to block loads leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K. Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA override for K quants on Xe2 as well.	2026-06-08 14:36:36 +03:00
Winston Ma	aea93ada61	vulkan: Removed unused functions (llama/23175)	2026-06-08 14:36:36 +03:00
Neo Zhang	ec0c661950	Support Q4_1, Q5_0, Q5_1 in Flash-attention (llama/23812) * support Q4_1, Q5_0, Q5_1 * update ut case	2026-06-08 14:36:36 +03:00
Neo Zhang	20323e48c4	Add more types in GET_ROWS OP (llama/23710) * add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP * correct the link	2026-06-08 14:36:36 +03:00
Neo Zhang	687fbcb149	sycl : Optimize Q3_K mul_mat by reorder (llama/23725)	2026-06-08 14:36:36 +03:00
lhez	1c0d1f0f7c	opencl: support bf16 by converting to f16 (llama/23839)	2026-06-08 14:36:36 +03:00
Georgi Gerganov	bf74b557d2	metal : restore im2col implementation for large kernels (llama/23901)	2026-06-08 14:36:36 +03:00
Jinyang He	64b0d6b7fc	ggml : add some lsx support (llama/23798) * loongarch : optimize LSX fp16 load/store with native intrinsics Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in __lsx_f16x4_load and __lsx_f16x4_store. * loongarch : add LSX implementation for q8_0 dot product * loongarch : add LSX implementation for q6_K dot product * loongarch : add LSX implementation for iq4_xs dot product * Improve reduce ops when summing int16 pairs to int32	2026-06-08 14:36:36 +03:00
Ruben Ortlam	4317ddbe2b	vulkan: add Flash Attention support for BFloat16 KV cache (llama/23420) * vulkan: add flash attention bf16 kv support * vulkan: bf16 FA coopmat1 support * vulkan: bf16 FA coopmat2 support * fix FA bf16 f32 fallback * fix FA bf16 coopmat1 shader * fix FA bf16 coopmat2 shader * code cleanup * cleanup comment change * address feedback * add O_TYPE for cm2 FA * use O_TYPE for gqaStore function * reduce BFLOAT16 ifdefs	2026-06-08 14:36:36 +03:00
Reese Levine	9147a9676b	ggml-webgpu: Check earlier for WebGPU required features (llama/23879)	2026-06-08 14:36:36 +03:00
Reese Levine	acd91d2c38	ggml-webgpu: add q4_0/q8_0 SET_ROWS (llama/23760) * Add q8_0 and q4_0 set_rows * Add fast(er) quantization set_rows path * formatting/naming * a little more naming * Remove unused constant * Don't override other override * Avoid bitcast * Narrow relaxation	2026-06-08 14:36:36 +03:00
Oliver Simons	f7aad4ed7e	CUDA: Check PTX version on host side to guard PDL dispatch (llama/23530) * CUDA: Check PTX version on host side to guard PDL dispatch Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this variable doesn't differentiate between compiling for say sm_90, sm_90a or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX). Thus, one can have a bug when compiling with `DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly dispatch to PDL on sm_90/sm_120 in forward-JIT mode. This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of the incoming kernel at runtime. A check on ptxVersion alone is sufficient, as device-codes will always be >= ptxVersion (and any violation of this would be a severe bug in CUDA/nvcc), see: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code * Implement MurmurHash3 mixer for better hash distribution Magic constants were taken from boost: `2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)` * Update ggml/src/ggml-cuda/common.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments, make seed non-zero * Apply code-formatting * Replace std::size_t -> size_t for consistency --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-06-08 14:36:36 +03:00
fairydreaming	c50e951afd	model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (llama/23346) * llama : support DeepSeek V3.2 model family (with DSA lightning indexer) * convert : handle DeepseekV32ForCausalLM architecture * ggml : support for f16 GGML_OP_FILL * memory : separate hparams argument in llama_kv_cache constructor * memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache) * llama : support for LLM_ARCH_DEEPSEEK32 * model : llama_model_deepseek32 implementation * model : merge two scale operations into one in DSA lightning indexer implementation * chore : remove unused code * model : support NVFP4 in DeepSeek V3.2 Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * memory : refactoring TODO Co-authored-by: ggerganov <ggerganov@users.noreply.github.com> --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>	2026-06-08 14:36:36 +03:00
Georgi Gerganov	92fc3f2a58	ggml : bump version to 0.13.1 (ggml/1523)	2026-05-29 09:47:30 +03:00
Andreas Kieslinger	e90501e179	cuda : disables launch_fattn PDL enrollment due to compiler bug (llama/23825)	2026-05-29 09:47:30 +03:00
Matt Corallo	f1b687da28	meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (llama/23480) Without this at least the vulkan backend will skip the `* 0` for !COMPUTE tensors, causing corrupt output.	2026-05-29 09:47:30 +03:00
Max Krasnyansky	442be1789d	hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (llama/23835) Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.	2026-05-29 09:47:30 +03:00
lhez	94922ce12c	opencl: move backend info printing into its own function (llama/23702) * opencl: move backend info print into its own function * opencl: move new log line * opencl: fix for non adreno path	2026-05-29 09:47:30 +03:00
fl0rianr	e1faa7cb4d	ggml: auto apply iGPU flag CUDA/HIP if integrated device (llama/23007)	2026-05-29 09:47:30 +03:00
redfox	4e8af441e5	mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729) * mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING * avoid a mismatch for JIT compilation of Turing device code for Ampere or newer Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 09:47:30 +03:00
Jaden_Mach	04795e6272	CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (llama/23227) * CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8) to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs substantially by quant family because the per-row GEMV cost is dominated by dequantisation, not the dot-product itself: K-quants pay a heavier super-block decode and so MMQ wins sooner; legacy and IQ quants have lean decode and stay ahead until the batch fully populates an MFMA tile. This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool, mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant thresholds on amd_mfma_available(cc): Q3_K, Q4_K, Q5_K : MMVQ <= 3 (MMQ wins from batch=4: +5% .. +76%) Q2_K, Q6_K : MMVQ <= 5 (MMQ wins from batch=6: +8% .. +35%) others : MMVQ <= 8 (legacy & IQ regress under MMQ; unchanged) Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold for A/B testing. Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct, llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps. Full table in PR description. Selected pp512 throughput (tok/s, ub=8): Q4_K_S: 559 -> 940 (+68%) Q5_K_S: 503 -> 884 (+76%) Q3_K_S: 629 -> 879 (+40%) Q2_K : 615 -> 809 (+32%) Q6_K : 582 -> 776 (+33%) Selected pp512 throughput (tok/s, ub=4): Q4_K_S: 444 -> 480 (+ 8%) Q4_0 : 682 -> 685 (+ 0%) (no regression - retains MMVQ) IQ4_XS: 706 -> 698 (- 1%) (no regression - retains MMVQ) * CUDA: address review — inline MMVQ batch table, drop env hatch & doc block * tune kernel selection logic for CDNA1 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 09:47:30 +03:00
Max Krasnyansky	1b241b879c	hexagon: minor refresh for HMX FA and MM (llama/23796) * hex-fa: clean up qf32/fp32 handling and stride handling * hex-fa: fix corner case fp NAN issues that were causing bad output from gemma4 on v79 * hex-fa: vectorize leftover handling * hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity * hmx-mm: remove dead code * hmx-mm: use fastdiv in x4x2 dequant * hmx-mm: sandwich dequant and scatter to improve perf * hmx-mm: fixed rebase conflicts * hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv * hmx-mm: an even earlier dispatch for per-type dequant * hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs This is a bit faster than LUT. * hex-cmake: one more tweak for lto --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>	2026-05-29 09:47:30 +03:00
Jeff Bolz	b896e91f18	vulkan: fast path for walsh-hadamard transform (llama/23687) * vulkan: fast path for walsh-hadamard transform * disable for intel due to segfault	2026-05-29 09:47:30 +03:00
Winston Ma	816c3029bc	vulkan: fix wrong index variable in inner loop (llama/23665)	2026-05-29 09:47:30 +03:00
Winston Ma	5db94bac04	vulkan: Fix memory logger unsafe iterator access (llama/23667)	2026-05-29 09:47:30 +03:00
fairydreaming	60e420ff6a	cuda : fix KQ mask offset integer overflow in fattn MMA kernel (llama/23610) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-05-29 09:47:30 +03:00
Martin Klacer	8e40325876	ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (llama/22841) * Updated vec.h/vec.cpp code to accumulate to F32 rather than F16 Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8 Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>	2026-05-29 09:47:30 +03:00
ymcki	d284e1c3aa	Hexagon: OP_GATED_DELTA_NET K>1 support (llama/23531) * K>1 state snapshot support * removed picky indent multiple of 4 fixes	2026-05-29 09:47:30 +03:00
ymcki	7e843a80e1	opencl: OP_GATED_DELTA_NET (llama/23312) * OP_GATED_DELTA_NET impl * add back lanes_per_column declaration * removed has_subgroup_arithmetic and has_subgroup_clustered_reduce * removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot * support for K>1 state snapshot * removed picky indent multiple of 4 fixes * removed return that won't be executed	2026-05-29 09:47:30 +03:00
Reese Levine	8c8f213dac	ggml-webgpu: remove legacy constants (llama/23672)	2026-05-29 09:47:30 +03:00
Max Krasnyansky	3bbe93378c	hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (llama/23647) * hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now * hmx-mm: add support for Q4_1 * hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot * hexagon: fix repack scratch buffer overflow * hex-mm: fix Q4_1 repack buffer sizing * hexagon: flip the build order for mm and fa (seems to help LTO) * hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1 * hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output * hexagon: resurrect early-wake and add support for polling for op-batch completions With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax. This is a good thing! But it does add extra latency for the pure benchmark runs. Early wakeup helps recover the latency a bit in the normal runs and op-batch polling is just for benchmarking. --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2026-05-29 09:47:30 +03:00
Masashi Yoshimura	a52bd385d6	ggml-webgpu: Fix how to dispatch WG to some ops (llama/23750)	2026-05-29 09:47:30 +03:00
Matt Corallo	8bce478ee8	vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (llama/22887) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 Against mesa git, this shows a 4.8% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. Note that this breaks some tests until the last commit which fixes OOB A reads. * vulkan: Use aligned loads in mul_mat_vec when available Against mesa git, this shows a 3.3% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. * Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec Mesa's UUB logic can't see through conditionals, limiting its ability to understand the bounds on the `num_rows` field in the cleanup run. Making it explicit that `num_rows` is, indeed, always <= `NUM_ROWS` helps mesa make slightly better codegen. Against mesa git, this currently shows a 1% performance improvement in tg128 on Qwen3.5-9B:BF16 on Intel BMG. * vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes There was a TODO to fix the OOB reads from the A matrix which we do here. It is within performance noise (+<0.1%) in tg128 for Qwen3.5-9B:BF16 on Intel BMG.	2026-05-29 09:47:30 +03:00
Jeff Bolz	1b590bbb9a	vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (llama/23541)	2026-05-29 09:47:30 +03:00
l8bloom	c5cde8c717	vulkan: add REPEAT op support for f16 to f16. (llama/23298) * feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op * chore: rename repeat_f* to repeat_i* * chore: rename repeat vulkan pipelines	2026-05-29 09:47:30 +03:00
Oliver Simons	98c6722fec	CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (llama/23742)	2026-05-29 09:47:30 +03:00

2563 Commits