whisper.cpp

Commit Graph

Author	SHA1	Message	Date
redfox	4e8af441e5	mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729) * mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING * avoid a mismatch for JIT compilation of Turing device code for Ampere or newer Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 09:47:30 +03:00
Jaden_Mach	04795e6272	CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (llama/23227) * CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8) to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs substantially by quant family because the per-row GEMV cost is dominated by dequantisation, not the dot-product itself: K-quants pay a heavier super-block decode and so MMQ wins sooner; legacy and IQ quants have lean decode and stay ahead until the batch fully populates an MFMA tile. This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool, mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant thresholds on amd_mfma_available(cc): Q3_K, Q4_K, Q5_K : MMVQ <= 3 (MMQ wins from batch=4: +5% .. +76%) Q2_K, Q6_K : MMVQ <= 5 (MMQ wins from batch=6: +8% .. +35%) others : MMVQ <= 8 (legacy & IQ regress under MMQ; unchanged) Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold for A/B testing. Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct, llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps. Full table in PR description. Selected pp512 throughput (tok/s, ub=8): Q4_K_S: 559 -> 940 (+68%) Q5_K_S: 503 -> 884 (+76%) Q3_K_S: 629 -> 879 (+40%) Q2_K : 615 -> 809 (+32%) Q6_K : 582 -> 776 (+33%) Selected pp512 throughput (tok/s, ub=4): Q4_K_S: 444 -> 480 (+ 8%) Q4_0 : 682 -> 685 (+ 0%) (no regression - retains MMVQ) IQ4_XS: 706 -> 698 (- 1%) (no regression - retains MMVQ) * CUDA: address review — inline MMVQ batch table, drop env hatch & doc block * tune kernel selection logic for CDNA1 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 09:47:30 +03:00
Max Krasnyansky	1b241b879c	hexagon: minor refresh for HMX FA and MM (llama/23796) * hex-fa: clean up qf32/fp32 handling and stride handling * hex-fa: fix corner case fp NAN issues that were causing bad output from gemma4 on v79 * hex-fa: vectorize leftover handling * hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity * hmx-mm: remove dead code * hmx-mm: use fastdiv in x4x2 dequant * hmx-mm: sandwich dequant and scatter to improve perf * hmx-mm: fixed rebase conflicts * hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv * hmx-mm: an even earlier dispatch for per-type dequant * hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs This is a bit faster than LUT. * hex-cmake: one more tweak for lto --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>	2026-05-29 09:47:30 +03:00
Jeff Bolz	b896e91f18	vulkan: fast path for walsh-hadamard transform (llama/23687) * vulkan: fast path for walsh-hadamard transform * disable for intel due to segfault	2026-05-29 09:47:30 +03:00
Winston Ma	816c3029bc	vulkan: fix wrong index variable in inner loop (llama/23665)	2026-05-29 09:47:30 +03:00
Winston Ma	5db94bac04	vulkan: Fix memory logger unsafe iterator access (llama/23667)	2026-05-29 09:47:30 +03:00
fairydreaming	60e420ff6a	cuda : fix KQ mask offset integer overflow in fattn MMA kernel (llama/23610) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-05-29 09:47:30 +03:00
Martin Klacer	8e40325876	ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (llama/22841) * Updated vec.h/vec.cpp code to accumulate to F32 rather than F16 Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8 Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>	2026-05-29 09:47:30 +03:00
ymcki	d284e1c3aa	Hexagon: OP_GATED_DELTA_NET K>1 support (llama/23531) * K>1 state snapshot support * removed picky indent multiple of 4 fixes	2026-05-29 09:47:30 +03:00
ymcki	7e843a80e1	opencl: OP_GATED_DELTA_NET (llama/23312) * OP_GATED_DELTA_NET impl * add back lanes_per_column declaration * removed has_subgroup_arithmetic and has_subgroup_clustered_reduce * removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot * support for K>1 state snapshot * removed picky indent multiple of 4 fixes * removed return that won't be executed	2026-05-29 09:47:30 +03:00
Reese Levine	8c8f213dac	ggml-webgpu: remove legacy constants (llama/23672)	2026-05-29 09:47:30 +03:00
Max Krasnyansky	3bbe93378c	hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (llama/23647) * hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now * hmx-mm: add support for Q4_1 * hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot * hexagon: fix repack scratch buffer overflow * hex-mm: fix Q4_1 repack buffer sizing * hexagon: flip the build order for mm and fa (seems to help LTO) * hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1 * hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output * hexagon: resurrect early-wake and add support for polling for op-batch completions With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax. This is a good thing! But it does add extra latency for the pure benchmark runs. Early wakeup helps recover the latency a bit in the normal runs and op-batch polling is just for benchmarking. --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2026-05-29 09:47:30 +03:00
Masashi Yoshimura	a52bd385d6	ggml-webgpu: Fix how to dispatch WG to some ops (llama/23750)	2026-05-29 09:47:30 +03:00
Matt Corallo	8bce478ee8	vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (llama/22887) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 Against mesa git, this shows a 4.8% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. Note that this breaks some tests until the last commit which fixes OOB A reads. * vulkan: Use aligned loads in mul_mat_vec when available Against mesa git, this shows a 3.3% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. * Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec Mesa's UUB logic can't see through conditionals, limiting its ability to understand the bounds on the `num_rows` field in the cleanup run. Making it explicit that `num_rows` is, indeed, always <= `NUM_ROWS` helps mesa make slightly better codegen. Against mesa git, this currently shows a 1% performance improvement in tg128 on Qwen3.5-9B:BF16 on Intel BMG. * vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes There was a TODO to fix the OOB reads from the A matrix which we do here. It is within performance noise (+<0.1%) in tg128 for Qwen3.5-9B:BF16 on Intel BMG.	2026-05-29 09:47:30 +03:00
Jeff Bolz	1b590bbb9a	vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (llama/23541)	2026-05-29 09:47:30 +03:00
l8bloom	c5cde8c717	vulkan: add REPEAT op support for f16 to f16. (llama/23298) * feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op * chore: rename repeat_f* to repeat_i* * chore: rename repeat vulkan pipelines	2026-05-29 09:47:30 +03:00
Oliver Simons	98c6722fec	CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (llama/23742)	2026-05-29 09:47:30 +03:00
Winston Ma	80e87ec453	vulkan: avoid preferring transfer queue on AMD UMA devices (llama/22455)	2026-05-29 09:47:30 +03:00
Vladislav	6a249cd640	ggml-zendnn : fixed naming of matmul function (llama/20964) * ggml-zendnn: fixed naming of matmul function * ggml-zendnn: fixed naming of mul_mat_id function * ggml-zendnn: fixed print in mul_mat_id --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>	2026-05-29 09:47:30 +03:00
Jeff Bolz	a0efd13f0f	vulkan: optimize conv2d and implement coopmat1 support (llama/22620) * vulkan: add CONV_SHAPE_64x128 for medium-K conv2d * vulkan: skip conv2d bounds checks when shapes align with tile sizes * vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d * vulkan: stage cm2 conv2d accumulator through shmem before global store * vulkan: add coopmat1 conv2d path * fallback when using too much shared memory. clean up comments * Require 16x16x16 and subgroup size 32 or 64 * check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values	2026-05-29 09:47:30 +03:00
Max Krasnyansky	f8df28d331	hexagon: add support for CONCAT op (llama/23648) * hexagon: add support for CONCAT with optimized concat_2d_transposed qwen3.5 models are quite heavy on the CONCAT with large and transposed src1. * hex-concat: use fastdiv in generic version * hex-concat: make checks for transposed a bit more readable * hex-concat: reorder dma ops for better pipelining * hex-cont/cpy: optimize CPY and CONT ops The primary change is to avoid scalar divs in the inner loops. We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr. This causes runtime divs by that value which is normally just 4 or 2 (f32/f16). * hex-get-rows: optimize GET_ROWS for large rows We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models that do lots of GET_ROWS with huge (2MB+) rows. Also bump the DMA queue depth now that we can take advantage of it. * hex-concat: unroll the inner loops of concat_2d * hex-concat: more updates to concat_2d to improve perf a bit further * hex-cpy: fixed n_rows per thread checks in the copy ops * hmx-fa: fix alignment issues while computing dma sizes * hex-set-rows: add early returns for idle threads * hvx-rope: minor optimization to replace loops with fastdiv logic * hex-rope: replace scalar tail processing with HVX * hex-rope: optimize rope cache init with HVX Add hvx-utils sin/cos helpers that use an approx method (similar to rsqrt, inverse, etc) Use the helpers to optimize ROPE.	2026-05-29 09:47:30 +03:00
Alexey Kopytko	049f0af339	SYCL: implement ggml_sycl_pool_vmm (llama/22862) * SYCL: implement ggml_sycl_pool_vmm * Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM * Clean up debugging logging * document GGML_SYCL_DISABLE_VMM * Multi-stream MoE optimization * Revert "Multi-stream MoE optimization" This reverts commit 938929c3f13a562ec67c59e87cc5d38595444cce. * Update common.hpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM * add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro) * Apply suggestions from code review Co-authored-by: Alexey Kopytko <alexey@kopytko.com> * Apply suggestion from @sanmai * Apply suggestion from @sanmai --------- Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-05-29 09:47:30 +03:00
Masashi Yoshimura	00a5110b19	ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (llama/23594) * ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K * Fix to editorconfig checking pass * Remove mul-mat-legacy pipeline * Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx	2026-05-29 09:47:30 +03:00
Nikhil Jain	bc77933c2d	Check batch_compute_passes before sending passes when not doing GPU profiling (llama/23457) * Only run webgpu CI on my fork * Add webgpu only workflow * refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled * restore build.yml	2026-05-29 09:47:30 +03:00
Johannes Gäßler	2307712d32	CUDA: missing PDL sync for FWHT, better fallback (llama/23690)	2026-05-29 09:47:30 +03:00
forforever73	1c477d4056	metal : add apple device id (llama/23566) Co-authored-by: lvyichen <lvyichen@stepfun.com>	2026-05-29 09:47:30 +03:00
Aman Gupta	205ee5a189	CUDA: add fast walsh-hadamard transform (llama/23615) * CUDA: add fast walsh-hadamard transform * review: add unrolls + change size_t -> int * warp size 64 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 09:47:30 +03:00
Daniel Bevenius	c932729a30	ci : add ignore for bindings/{ruby, go} in build.yml [no ci] (#3837) This commit adds an ignore for bindings-ruby and bindings-go in build.yml as these are handled by separate .yml file (separate jobs) and don't need to trigger a full CI build.	2026-05-28 18:06:04 +02:00
Daniel Bevenius	e47a3eeb04	ci : fix include paths for bindings-go job [no ci] (#3835)	2026-05-28 14:53:34 +02:00
Daniel Bevenius	f41562bdd6	ci : add on push/pull_request paths ruby job (#3833) * ci : add on push/pull_request paths ruby job This commit adds paths to bindings-ruby to only build if changes were made to bindings/ruby or to include/whisper.h. * ci : add additional paths [no ci]	2026-05-28 14:41:48 +02:00
Daniel Bevenius	9186e2453b	ci : renable arm64 docker builds (#3832) This commit re-enables the arm64 docker image builds which were removed in Commit `9366544991` ("ci : fix arm builds"). It also uses ubuntu-24.04-arm as the runner which enables us to avoid QEMU. Resolves: https://github.com/ggml-org/whisper.cpp/issues/2859	2026-05-28 12:09:13 +02:00
Daniel Bevenius	f6e617bab7	ci : set GGML_NATIVE=OFF for bindings-java (#3830) * ci : set GGML_NATIVE=OFF for bindings-java This commit attempts to address an issue with the bindings-java job which is currently failing. I've not been able to reproduce this locally on my Windows machine and I suspect that what might be happening is that the Windows job compiles on a runner with different CPU features, for example AVX512, and when this dll is used on a different runner that does not have that feature it will crash. Refs: https://github.com/ggml-org/whisper.cpp/actions/runs/26496174929/job/78059073255?pr=3829 * ci : also disable BMI2	2026-05-28 07:21:25 +02:00
Daniel Bevenius	6dcdd65364	ci : only run docker jobs when pushed to master [no ci] (#3828)	2026-05-27 08:46:23 +02:00
Daniel Bevenius	ee540bf0be	docs : add AGENTS.md and CONTRIBUTING.md [no ci] (#3826) * docs : add AGENTS.md and CONTRIBUTING.md [no ci] This commit adds AGENTS.md and CONTRIBUTING.md which are based on the same files in llama.cpp. They have been modified slightly to fit with whisper.cpp. The motivation for this is to clarify the contribution policy in whisper.cpp so that contributors can have a better understanding of the expectations and requirements for contributing to the project.	2026-05-27 06:22:38 +02:00
texasich	27101c01dc	cli : merge tokens split across UTF-8 boundaries in JSON output (#3751) * cli : merge tokens split across UTF-8 boundaries in JSON output When a multi-byte UTF-8 codepoint (most commonly a CJK character, 3 bytes) is split across multiple whisper tokens, the -ojf/--output-json-full writer emitted each token's partial bytes as its own JSON string, producing invalid UTF-8 that chokes downstream parsers. Merge adjacent tokens in output_json whenever the accumulated text still ends on an incomplete UTF-8 sequence. The merged entry keeps the first token's id/p/t_dtw and extends t1 to the last absorbed token, which matches how segment text is assembled elsewhere. Refs #1798 * fix: address review — add braces for consistency, use full issue URL - Add braces to if/else chain for codebase consistency - Use full URL for issue #1798 reference Review: @danbev --------- Co-authored-by: texasich <texasich@users.noreply.github.com> Co-authored-by: texasich <texasich@gmail.com>	2026-05-26 06:23:41 +02:00
Georgi Gerganov	e0fd1f6787	release : v1.8.5	2026-05-25 13:06:33 +03:00
Georgi Gerganov	c245b3ec23	benches : update	2026-05-25 13:05:30 +03:00
Georgi Gerganov	f14ae77f40	sync : ggml	2026-05-25 12:44:07 +03:00
Georgi Gerganov	1cf8e3a903	ggml : bump version to 0.13.0 (ggml/1510)	2026-05-25 12:44:04 +03:00
Johannes Gäßler	bcff515150	TP: fix ggml context size calculation (llama/22616) * TP: fix ggml context size calculation, memory leak * move split state cache back into the context * revert to constant ggml context size for cgraphs * increase headroom for statically allocated tensors * remove obsolete include	2026-05-25 12:44:04 +03:00
Gilad S	2979e5f95f	ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (llama/22341) * ggml: implement `gguf_init_from_buffer` * test: `gguf_init_from_buffer` * fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer * fix: use `GGML_UNUSED` Co-authored-by: Copilot <copilot@github.com> * fix: remove `total_size` from `gguf_reader` * fix: file offset calculation, rename `offset` to `data_offset` Co-authored-by: Copilot <copilot@github.com> * refactor: extract model loader bug fixes to another PR * feat: add `gguf_init_from_callback` * fix: always require a max expected size * fix: change `gguf_reader_callback_t`'s `output` type to `void `, change `max_expected_size` and offsets to `uint64_t` fix: harden against offset overflow in buffer read * fix: remove seek behavior from the callback * feat: `max_chunk_read == 0` means `SIZE_MAX` * fix: seeking in a gguf file with no tensors --------- Co-authored-by: Copilot <copilot@github.com>	2026-05-25 12:44:04 +03:00
Kaihui-AMD	44a50ca41a	readme : add AMD ROCm/HIP GPU build instructions (#3823) Signed-off-by: Kaihui-AMD <Kaihui.Tang@amd.com>	2026-05-25 12:27:42 +03:00
Georgi Gerganov	865ec171aa	talk-llama : sync llama.cpp	2026-05-25 12:26:07 +03:00
Georgi Gerganov	0a62a579cc	sync : ggml	2026-05-25 12:26:07 +03:00
Georgi Gerganov	946d6813b9	ggml : bump version to 0.12.1 (ggml/1508)	2026-05-25 12:26:07 +03:00
Jeff Bolz	a369b3949c	ggml : Parallelize quant LUT init (llama/23595) - Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl. - Move the OpenMP detection from ggml-cpu to ggml-base. - Update OpenMP dependencies in ggml-config.cmake.in.	2026-05-25 12:26:07 +03:00
Johannes Gäßler	3306af62b1	TP: fix entirely zero-sized slices per device (llama/23525)	2026-05-25 12:26:07 +03:00
shaofeiqi	1435988ab3	opencl: batch profiling to improve speed and prevent memory leaks (llama/23495)	2026-05-25 12:26:07 +03:00
Yiwei Shao	b84d03487c	hexagon: apply repl optimization in flash attn softmax as #22993 (llama/23455)	2026-05-25 12:26:07 +03:00
dskwe	511f8602b1	ggml : Check the right iface method before using the fallback 2d get (llama/23514)	2026-05-25 12:26:07 +03:00
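
Note on commit 27101c01dc (cli : merge tokens split across UTF-8 boundaries in JSON output): the message describes merging adjacent tokens while the accumulated text still ends on an incomplete UTF-8 sequence, keeping the first token's fields and extending t1 to the last absorbed token. The following is a minimal sketch of that idea, not the code from the commit. The names `token_span`, `ends_with_incomplete_utf8` and `merge_split_tokens` are hypothetical, and the struct carries only text and timestamps rather than the id/p/t_dtw fields mentioned in the message.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a token entry; not the whisper.cpp type.
struct token_span {
    std::string text;
    long long   t0;
    long long   t1;
};

// Returns true when `s` ends in the middle of a multi-byte UTF-8 sequence.
// Only the tail is inspected; earlier bytes are not validated.
static bool ends_with_incomplete_utf8(const std::string & s) {
    size_t i    = s.size();
    size_t cont = 0;
    // Walk back over at most 3 continuation bytes (bit pattern 10xxxxxx).
    while (i > 0 && cont < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++cont;
    }
    if (i == 0) {
        return false; // empty, or continuation bytes only: no lead byte to complete
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    size_t need = 0; // continuation bytes announced by the lead byte
    if      ((lead & 0xE0) == 0xC0) { need = 1; }
    else if ((lead & 0xF0) == 0xE0) { need = 2; }
    else if ((lead & 0xF8) == 0xF0) { need = 3; }
    return cont < need;
}

// Merges each token into the previous entry while that entry is still incomplete.
// The merged entry keeps the first token's t0 and takes t1 from the last absorbed token.
static std::vector<token_span> merge_split_tokens(const std::vector<token_span> & tokens) {
    std::vector<token_span> out;
    bool pending = false; // true while out.back() ends on an incomplete sequence
    for (const token_span & tok : tokens) {
        if (pending) {
            out.back().text += tok.text;
            out.back().t1    = tok.t1;
        } else {
            out.push_back(tok);
        }
        pending = ends_with_incomplete_utf8(out.back().text);
    }
    return out;
}
```

For example, a 3-byte CJK character split as `"\xE6\x97"` followed by `"\xA5"` would come out as a single entry spanning both tokens' time range, while a following ASCII token stays separate.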

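Note on commit 04795e6272 (CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware): the message lists per-quant MMVQ batch limits that apply only when MFMA is available (Q3_K/Q4_K/Q5_K: 3, Q2_K/Q6_K: 5, others: 8). The sketch below restates that table as a small dispatch helper. It is an illustration under assumptions, not the ggml implementation: `quant_type`, `mmvq_max_batch` and `should_use_mmvq` are hypothetical names, the boolean `amd_mfma` stands in for the `amd_mfma_available(cc)` check, and the later review and CDNA1 tuning changes mentioned in the same commit are not modelled.

```cpp
// Hypothetical stand-in for the quantization type enum; not the ggml type.
enum class quant_type { Q4_0, Q8_0, IQ4_XS, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K };

// Largest batch size (ne11) that is still routed to MMVQ.
static int mmvq_max_batch(quant_type type, bool amd_mfma) {
    const int default_max = 8; // global threshold named in the commit message
    if (!amd_mfma) {
        return default_max;    // non-MFMA paths keep the global threshold
    }
    switch (type) {
        case quant_type::Q3_K:
        case quant_type::Q4_K:
        case quant_type::Q5_K:
            return 3;          // MMQ takes over from batch 4
        case quant_type::Q2_K:
        case quant_type::Q6_K:
            return 5;          // MMQ takes over from batch 6
        default:
            return default_max; // legacy and IQ quants are unchanged
    }
}

// True when the matmul should use the per-row GEMV kernel (MMVQ) rather than MMQ.
static bool should_use_mmvq(quant_type type, bool amd_mfma, int ne11) {
    return ne11 <= mmvq_max_batch(type, amd_mfma);
}
```

With this table, `should_use_mmvq(quant_type::Q4_K, true, 4)` is false (MMQ is chosen), while the same call with `amd_mfma == false` stays on MMVQ up to batch 8.
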
4577 Commits