whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Jinyang He	64b0d6b7fc	ggml : add some lsx support (llama/23798) * loongarch : optimize LSX fp16 load/store with native intrinsics Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in __lsx_f16x4_load and __lsx_f16x4_store. * loongarch : add LSX implementation for q8_0 dot product * loongarch : add LSX implementation for q6_K dot product * loongarch : add LSX implementation for iq4_xs dot product * Improve reduce ops when summing int16 pairs to int32	2026-06-08 14:36:36 +03:00
Ruben Ortlam	4317ddbe2b	vulkan: add Flash Attention support for BFloat16 KV cache (llama/23420) * vulkan: add flash attention bf16 kv support * vulkan: bf16 FA coopmat1 support * vulkan: bf16 FA coopmat2 support * fix FA bf16 f32 fallback * fix FA bf16 coopmat1 shader * fix FA bf16 coopmat2 shader * code cleanup * cleanup comment change * address feedback * add O_TYPE for cm2 FA * use O_TYPE for gqaStore function * reduce BFLOAT16 ifdefs	2026-06-08 14:36:36 +03:00
Reese Levine	9147a9676b	ggml-webgpu: Check earlier for WebGPU required features (llama/23879)	2026-06-08 14:36:36 +03:00
Reese Levine	acd91d2c38	ggml-webgpu: add q4_0/q8_0 SET_ROWS (llama/23760) * Add q8_0 and q4_0 set_rows * Add fast(er) quantization set_rows path * formatting/naming * a little more naming * Remove unused constant * Don't override other override * Avoid bitcast * Narrow relaxation	2026-06-08 14:36:36 +03:00
Oliver Simons	f7aad4ed7e	CUDA: Check PTX version on host side to guard PDL dispatch (llama/23530) * CUDA: Check PTX version on host side to guard PDL dispatch Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this variable doesn't differentiate between compiling for say sm_90, sm_90a or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX). Thus, one can have a bug when compiling with `DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly dispatch to PDL on sm_90/sm_120 in forward-JIT mode. This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of the incoming kernel at runtime. A check on ptxVersion alone is sufficient, as device-codes will always be >= ptxVersion (and any violation of this would be a severe bug in CUDA/nvcc), see: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code * Implement MurmurHash3 mixer for better hash distribution Magic constants were taken from boost: `2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)` * Update ggml/src/ggml-cuda/common.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments, make seed non-zero * Apply code-formatting * Replace std::size_t -> size_t for consistency --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-06-08 14:36:36 +03:00
fairydreaming	c50e951afd	model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (llama/23346) * llama : support DeepSeek V3.2 model family (with DSA lightning indexer) * convert : handle DeepseekV32ForCausalLM architecture * ggml : support for f16 GGML_OP_FILL * memory : separate hparams argument in llama_kv_cache constructor * memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache) * llama : support for LLM_ARCH_DEEPSEEK32 * model : llama_model_deepseek32 implementation * model : merge two scale operations into one in DSA lightning indexer implementation * chore : remove unused code * model : support NVFP4 in DeepSeek V3.2 Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * memory : refactoring TODO Co-authored-by: ggerganov <ggerganov@users.noreply.github.com> --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>	2026-06-08 14:36:36 +03:00
Georgi Gerganov	92fc3f2a58	ggml : bump version to 0.13.1 (ggml/1523)	2026-05-29 09:47:30 +03:00
Andreas Kieslinger	e90501e179	cuda : disables launch_fattn PDL enrollment due to compiler bug (llama/23825)	2026-05-29 09:47:30 +03:00
Matt Corallo	f1b687da28	meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (llama/23480) Without this at least the vulkan backend will skip the `* 0` for !COMPUTE tensors, causing corrupt output.	2026-05-29 09:47:30 +03:00
Max Krasnyansky	442be1789d	hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (llama/23835) Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.	2026-05-29 09:47:30 +03:00
lhez	94922ce12c	opencl: move backend info printing into its own function (llama/23702) * opencl: move backend info print into its own function * opencl: move new log line * opencl: fix for non adreno path	2026-05-29 09:47:30 +03:00
fl0rianr	e1faa7cb4d	ggml: auto apply iGPU flag CUDA/HIP if integrated device (llama/23007)	2026-05-29 09:47:30 +03:00
redfox	4e8af441e5	mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729 ) * mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING * avoid a mismatch for JIT compilation of Turing device code for Ampere or newer Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 09:47:30 +03:00
Jaden_Mach	04795e6272	CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (llama/23227) * CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8) to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs substantially by quant family because the per-row GEMV cost is dominated by dequantisation, not the dot-product itself: K-quants pay a heavier super-block decode and so MMQ wins sooner; legacy and IQ quants have lean decode and stay ahead until the batch fully populates an MFMA tile. This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool, mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant thresholds on amd_mfma_available(cc): Q3_K, Q4_K, Q5_K : MMVQ <= 3 (MMQ wins from batch=4: +5% .. +76%) Q2_K, Q6_K : MMVQ <= 5 (MMQ wins from batch=6: +8% .. +35%) others : MMVQ <= 8 (legacy & IQ regress under MMQ; unchanged) Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold for A/B testing. Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct, llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps. Full table in PR description. Selected pp512 throughput (tok/s, ub=8): Q4_K_S: 559 -> 940 (+68%) Q5_K_S: 503 -> 884 (+76%) Q3_K_S: 629 -> 879 (+40%) Q2_K : 615 -> 809 (+32%) Q6_K : 582 -> 776 (+33%) Selected pp512 throughput (tok/s, ub=4): Q4_K_S: 444 -> 480 (+ 8%) Q4_0 : 682 -> 685 (+ 0%) (no regression - retains MMVQ) IQ4_XS: 706 -> 698 (- 1%) (no regression - retains MMVQ) * CUDA: address review — inline MMVQ batch table, drop env hatch & doc block * tune kernel selection logic for CDNA1 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 09:47:30 +03:00
Max Krasnyansky	1b241b879c	hexagon: minor refresh for HMX FA and MM (llama/23796) * hex-fa: clean up qf32/fp32 handling and stride handling * hex-fa: fix corner case fp NAN issues that were causing bad output from gemma4 on v79 * hex-fa: vectorize leftover handling * hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity * hmx-mm: remove dead code * hmx-mm: use fastdiv in x4x2 dequant * hmx-mm: sandwich dequant and scatter to improve perf * hmx-mm: fixed rebase conflicts * hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv * hmx-mm: an even earlier dispatch for per-type dequant * hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs This is a bit faster than LUT. * hex-cmake: one more tweak for lto --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>	2026-05-29 09:47:30 +03:00
Jeff Bolz	b896e91f18	vulkan: fast path for walsh-hadamard transform (llama/23687) * vulkan: fast path for walsh-hadamard transform * disable for intel due to segfault	2026-05-29 09:47:30 +03:00
Winston Ma	816c3029bc	vulkan: fix wrong index variable in inner loop (llama/23665)	2026-05-29 09:47:30 +03:00
Winston Ma	5db94bac04	vulkan: Fix memory logger unsafe iterator access (llama/23667)	2026-05-29 09:47:30 +03:00
fairydreaming	60e420ff6a	cuda : fix KQ mask offset integer overflow in fattn MMA kernel (llama/23610) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-05-29 09:47:30 +03:00
Martin Klacer	8e40325876	ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (llama/22841) * Updated vec.h/vec.cpp code to accumulate to F32 rather than F16 Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8 Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>	2026-05-29 09:47:30 +03:00
ymcki	d284e1c3aa	Hexagon: OP_GATED_DELTA_NET K>1 support (llama/23531) * K>1 state snapshot support * removed picky indent multiple of 4 fixes	2026-05-29 09:47:30 +03:00
ymcki	7e843a80e1	opencl: OP_GATED_DELTA_NET (llama/23312) * OP_GATED_DELTA_NET impl * add back lanes_per_column declaration * removed has_subgroup_arithmetic and has_subgroup_clustered_reduce * removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot * support for K>1 state snapshot * removed picky indent multiple of 4 fixes * removed return that won't be executed	2026-05-29 09:47:30 +03:00
Reese Levine	8c8f213dac	ggml-webgpu: remove legacy constants (llama/23672)	2026-05-29 09:47:30 +03:00
Max Krasnyansky	3bbe93378c	hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (llama/23647) * hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now * hmx-mm: add support for Q4_1 * hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot * hexagon: fix repack scratch buffer overflow * hex-mm: fix Q4_1 repack buffer sizing * hexagon: flip the build order for mm and fa (seems to help LTO) * hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1 * hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output * hexagon: resurrect early-wake and add support for polling for op-batch completions With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax. This is a good thing! But it does add extra latency for the pure benchmark runs. Early wakeup helps recover the latency a bit in the normal runs and op-batch polling is just for benchmarking. --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2026-05-29 09:47:30 +03:00
Masashi Yoshimura	a52bd385d6	ggml-webgpu: Fix how to dispatch WG to some ops (llama/23750)	2026-05-29 09:47:30 +03:00
Matt Corallo	8bce478ee8	vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (llama/22887) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 Against mesa git, this shows a 4.8% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. Note that this breaks some tests until the last commit which fixes OOB A reads. * vulkan: Use aligned loads in mul_mat_vec when available Against mesa git, this shows a 3.3% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. * Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec Mesa's UUB logic can't see through conditionals, limiting its ability to understand the bounds on the `num_rows` field in the cleanup run. Making it explicit that `num_rows` is, indeed, always <= `NUM_ROWS` helps mesa make slightly better codegen. Against mesa git, this currently shows a 1% performance improvement in tg128 on Qwen3.5-9B:BF16 on Intel BMG. * vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes There was a TODO to fix the OOB reads from the A matrix which we do here. It is within performance noise (+<0.1%) in tg128 for Qwen3.5-9B:BF16 on Intel BMG.	2026-05-29 09:47:30 +03:00
Jeff Bolz	1b590bbb9a	vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (llama/23541)	2026-05-29 09:47:30 +03:00
l8bloom	c5cde8c717	vulkan: add REPEAT op support for f16 to f16. (llama/23298) * feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op * chore: rename repeat_f* to repeat_i* * chore: rename repeat vulkan pipelines	2026-05-29 09:47:30 +03:00
Oliver Simons	98c6722fec	CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (llama/23742)	2026-05-29 09:47:30 +03:00
Winston Ma	80e87ec453	vulkan: avoid preferring transfer queue on AMD UMA devices (llama/22455)	2026-05-29 09:47:30 +03:00
Vladislav	6a249cd640	ggml-zendnn : fixed naming of matmul function (llama/20964) * ggml-zendnn: fixed naming of matmul function * ggml-zendnn: fixed naming of mul_mat_id function * ggml-zendnn: fixed print in mul_mat_id --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>	2026-05-29 09:47:30 +03:00
Jeff Bolz	a0efd13f0f	vulkan: optimize conv2d and implement coopmat1 support (llama/22620) * vulkan: add CONV_SHAPE_64x128 for medium-K conv2d * vulkan: skip conv2d bounds checks when shapes align with tile sizes * vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d * vulkan: stage cm2 conv2d accumulator through shmem before global store * vulkan: add coopmat1 conv2d path * fallback when using too much shared memory. clean up comments * Require 16x16x16 and subgroup size 32 or 64 * check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values	2026-05-29 09:47:30 +03:00
Max Krasnyansky	f8df28d331	hexagon: add support for CONCAT op (llama/23648) * hexagon: add support for CONCAT with optimized concat_2d_transposed qwen3.5 models are quite heavy on the CONCAT with large and transposed src1. * hex-concat: use fastdiv in generic version * hex-concat: make checks for transposed a bit more readable * hex-concat: reorder dma ops for better pipelining * hex-cont/cpy: optimize CPY and CONT ops The primary change is to avoid scalar divs in the inner loops. We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr. This causes runtime divs by that value which is normally just 4 or 2 (f32/f16). * hex-get-rows: optimize GET_ROWS for large rows We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models that do lots of GET_ROWS with huge (2MB+) rows. Also bump the DMA queue depth now that we can take advantage of it. * hex-concat: unroll the inner loops of concat_2d * hex-concat: more updates to concat_2d to improve perf a bit further * hex-cpy: fixed n_rows per thread checks in the copy ops * hmx-fa: fix alignment issues while computing dma sizes * hex-set-rows: add early returns for idle threads * hvx-rope: minor optimization to replace loops with fastdiv logic * hex-rope: replace scalar tail processing with HVX * hex-rope: optimize rope cache init with HVX Add hvx-utils sin/cos helpers that use an approx method (similar to rsqrt, inverse, etc) Use the helpers to optimize ROPE.	2026-05-29 09:47:30 +03:00
Alexey Kopytko	049f0af339	SYCL: implement ggml_sycl_pool_vmm (llama/22862) * SYCL: implement ggml_sycl_pool_vmm * Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM * Clean up debugging logging * document GGML_SYCL_DISABLE_VMM * Multi-stream MoE optimization * Revert "Multi-stream MoE optimization" This reverts commit 938929c3f13a562ec67c59e87cc5d38595444cce. * Update common.hpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM * add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro) * Apply suggestions from code review Co-authored-by: Alexey Kopytko <alexey@kopytko.com> * Apply suggestion from @sanmai * Apply suggestion from @sanmai --------- Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-05-29 09:47:30 +03:00
Masashi Yoshimura	00a5110b19	ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (llama/23594) * ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K * Fix to editorconfig checking pass * Remove mul-mat-legacy pipeline * Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx	2026-05-29 09:47:30 +03:00
Nikhil Jain	bc77933c2d	Check batch_compute_passes before sending passes when not doing GPU profiling (llama/23457) * Only run webgpu CI on my fork * Add webgpu only workflow * refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled * restore build.yml	2026-05-29 09:47:30 +03:00
Johannes Gäßler	2307712d32	CUDA: missing PDL sync for FWHT, better fallback (llama/23690)	2026-05-29 09:47:30 +03:00
forforever73	1c477d4056	metal : add apple device id (llama/23566) Co-authored-by: lvyichen <lvyichen@stepfun.com>	2026-05-29 09:47:30 +03:00
Aman Gupta	205ee5a189	CUDA: add fast walsh-hadamard transform (llama/23615) * CUDA: add fast walsh-hadamard transform * review: add unrolls + change size_t -> int * warp size 64 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 09:47:30 +03:00
Georgi Gerganov	1cf8e3a903	ggml : bump version to 0.13.0 (ggml/1510)	2026-05-25 12:44:04 +03:00
Johannes Gäßler	bcff515150	TP: fix ggml context size calculation (llama/22616) * TP: fix ggml context size calculation, memory leak * move split state cache back into the context * revert to constant ggml context size for cgraphs * increase headroom for statically allocated tensors * remove obsolete include	2026-05-25 12:44:04 +03:00
Gilad S	2979e5f95f	ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (llama/22341) * ggml: implement `gguf_init_from_buffer` * test: `gguf_init_from_buffer` * fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer * fix: use `GGML_UNUSED` Co-authored-by: Copilot <copilot@github.com> * fix: remove `total_size` from `gguf_reader` * fix: file offset calculation, rename `offset` to `data_offset` Co-authored-by: Copilot <copilot@github.com> * refactor: extract model loader bug fixes to another PR * feat: add `gguf_init_from_callback` * fix: always require a max expected size * fix: change `gguf_reader_callback_t`'s `output` type to `void `, change `max_expected_size` and offsets to `uint64_t` fix: harden against offset overflow in buffer read * fix: remove seek behavior from the callback * feat: `max_chunk_read == 0` means `SIZE_MAX` * fix: seeking in a gguf file with no tensors --------- Co-authored-by: Copilot <copilot@github.com>	2026-05-25 12:44:04 +03:00
Georgi Gerganov	946d6813b9	ggml : bump version to 0.12.1 (ggml/1508)	2026-05-25 12:26:07 +03:00
Jeff Bolz	a369b3949c	ggml : Parallelize quant LUT init (llama/23595) - Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl. - Move the OpenMP detection from ggml-cpu to ggml-base. - Update OpenMP dependencies in ggml-config.cmake.in.	2026-05-25 12:26:07 +03:00
Johannes Gäßler	3306af62b1	TP: fix entirely zero-sized slices per device (llama/23525)	2026-05-25 12:26:07 +03:00
shaofeiqi	1435988ab3	opencl: batch profiling to improve speed and prevent memory leaks (llama/23495)	2026-05-25 12:26:07 +03:00
Yiwei Shao	b84d03487c	hexagon: apply repl optimization in flash attn softmax as #22993 (llama/23455)	2026-05-25 12:26:07 +03:00
dskwe	511f8602b1	ggml : Check the right iface method before using the fallback 2d get (llama/23514)	2026-05-25 12:26:07 +03:00
Jeff Bolz	6b85d73b33	vulkan: fix windows find_package of SPIRV-Headers (llama/23215) * vulkan: fix windows find_package of SPIRV-Headers * not windows-only	2026-05-25 12:26:07 +03:00
Shawn Gu	aefffa1fa5	opencl: generalize Adreno MoE kernels on M (llama/23449)	2026-05-25 12:26:07 +03:00

2542 Commits