Commit Graph

4079 Commits

Author SHA1 Message Date
Taimur Ahmad 0c10a15447 ggml-cpu: add RVV vec dot kernels for quantization types (llama/18784)
* ggml-cpu: add rvv vec_dot for iq2_s

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add rvv vec_dot for iq3_s

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add rvv vec_dot for tq1_0, tq2_0

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add rvv vec_dot for iq1_s, iq1_m

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add vlen switch for rvv vec_dot

---------

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2026-02-27 20:57:58 +02:00
Masashi Yoshimura 0158795ebc ggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support. (llama/19700)
* ggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support.

* Fix: cast the src value to f32 before computing sin/cos.
2026-02-27 20:57:58 +02:00
Ruben Ortlam 3f68f30907 vulkan: fix MMQ shader push constants and multi-dispatch (llama/19732) 2026-02-27 20:57:58 +02:00
Johannes Gäßler ade724fced CUDA: fix kernel selection logic for tile FA (llama/19686)
* CUDA: fix kernel selection logic for tile FA

* add comment
2026-02-27 20:57:58 +02:00
shalinib-ibm cc9e5cf89d llamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (llama/19709)
Avoid xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8
inputs to FP16 and using FP16×FP16→FP32 MMA. This removes
post-processing overhead and improves performance.
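
As a rough illustration of the idea, here is a minimal scalar sketch (not the actual POWER10 MMA code; the layout follows ggml's Q4_0, with the fp16 block scale simplified to a float for brevity):

```c
#include <stdint.h>

#define QK4_0 32

// ggml-style Q4_0 block: one scale plus 32 packed 4-bit quants
// (the scale is fp16 in ggml; a float is used here for brevity)
typedef struct {
    float   d;
    uint8_t qs[QK4_0 / 2];
} block_q4_0_ref;

// Dequantize first, then multiply-accumulate in floating point.
// Working in fp16/fp32 space is what lets the MMA path skip the
// signed->unsigned bias correction that int8 xvi8ger4pp requires.
static float vec_dot_q4_0_ref(const block_q4_0_ref *x, const float *y) {
    float sum = 0.0f;
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int v0 = (x->qs[j] & 0x0F) - 8; // low nibbles: elems 0..15
        const int v1 = (x->qs[j] >>   4) - 8; // high nibbles: elems 16..31
        sum += x->d * v0 * y[j] + x->d * v1 * y[j + QK4_0 / 2];
    }
    return sum;
}
```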

Performance Impact:
1.5 ~ 2x improvement in PP_Speed for Q4 and Q8 Models,
measured with llama-bench and llama-batched-bench.
Q8 Model: granite-4.0-h-micro-Q8_0.gguf (from huggingface)
Q4 Model: Meta-Llama3-8b Q4 model (generated with llama-quantize from
f32 model)

llama-bench Q8 Model Results:
| model | size | params | backend | threads | test | Base t/s | Patch t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | -------- | --------- |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp8 | 64.48 ± 4.72 | 73.99 ± 0.27 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp16 | 80.11 ± 0.32 | 112.53 ± 0.40 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp32 | 89.10 ± 0.27 | 152.95 ± 0.68 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp64 | 93.65 ± 0.25 | 187.83 ± 0.83 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp128 | 99.93 ± 0.02 | 201.32 ± 0.11 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp256 | 102.32 ± 0.40 | 208.32 ± 0.41 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp512 | 103.42 ± 0.40 | 209.98 ± 0.14 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | tg128 | 20.35 ± 0.01 | 19.57 ± 0.01 |

llama-bench Q4 Model Results:
| model | size | params | backend | threads | test | Base t/s | Patch t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | -------- | --------- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp8 | 34.77 ± 0.10 | 41.23 ± 0.08 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 40.81 ± 0.04 | 64.55 ± 0.15 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 44.65 ± 0.05 | 90.84 ± 0.22 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 47.49 ± 0.03 | 114.39 ± 0.11 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 49.29 ± 0.24 | 120.13 ± 0.19 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 49.77 ± 0.23 | 121.51 ± 0.11 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 49.89 ± 0.23 | 117.52 ± 0.10 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.40 ± 0.01 | 13.37 ± 0.00 |

Llama perplexity Results:

| Model | Base Final PPL Estimate | Patch Final PPL Estimate |
| ----- | ----------------------- | ------------------------ |
| granite-4.0-h-micro-Q8_0 | 1.3862 ± 0.04424 | 1.3868 ± 0.04432 |
| Meta-Llama3-8b Q4 | 1.3801 ± 0.04116 | 1.3803 ± 0.04116 |

Signed-off-by: Shalini.Salomi.Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2026-02-27 20:57:58 +02:00
Reese Levine 8b3a52ba87 ggml webgpu: Fix bug in dispatching large matrix-vector multiplication (llama/19535)
* Fix bug in dispatching large matrix-vector multiplication
2026-02-27 20:57:58 +02:00
Reese Levine fc7a78f4d8 ggml webgpu: shader library organization (llama/19530)
* Basic JIT compilation for mul_mat, get_rows, and scale (ggml/17)

* scale jit working

* preliminary working jit for getrows and mulmat, needs refining

* simplified mul_mat preprocessing switch statement

* get_rows fixes, mul_mat refinement

* formatted + last edits

* removed some extraneous prints

* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish

* small fix

* some changes, working

* get_rows and mul_mat jit fixed and working

* Update formatting

* formatting

* Add header

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on all-encompassing shader library

* refactor argmax, set_rows

* Refactor all but flashattention, mat mul

* flashattention and matrix multiplication moved to new format

* clean up preprocessing

* Formatting

* remove duplicate constants

* Split large shaders into multiple static strings

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
2026-02-27 20:57:58 +02:00
Jeff Bolz f1da0a26f5 vulkan: split mul_mat into multiple dispatches to avoid overflow (llama/19509)
* vulkan: split mul_mat into multiple dispatches to avoid overflow

The batch dimensions can be greater than the max workgroup count limit,
in which case we need to split into multiple dispatches and pass the base
index through a push constant.

Fall back for the less common p021 and nc variants (see the sketch after this list).

* address feedback
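
For illustration, a minimal host-side sketch of the split (hypothetical names; dispatch_mul_mat stands in for vkCmdPushConstants followed by vkCmdDispatch):

```c
#include <stdint.h>

typedef struct {
    uint32_t batch_base; // base index the shader adds to its workgroup id
    // ... remaining mul_mat push constants ...
} mm_push_constants;

// hypothetical stand-in for vkCmdPushConstants + vkCmdDispatch
void dispatch_mul_mat(uint32_t wg_x, uint32_t wg_y, uint32_t wg_z,
                      const mm_push_constants *pc);

// Split the batch dimension into chunks that fit the device's
// maxComputeWorkGroupCount limit, passing the base via push constant.
void dispatch_mul_mat_batched(uint32_t wg_x, uint32_t wg_y,
                              uint32_t total_batches,
                              uint32_t max_wg_count) {
    for (uint32_t base = 0; base < total_batches; base += max_wg_count) {
        uint32_t n = total_batches - base;
        if (n > max_wg_count) {
            n = max_wg_count;
        }
        mm_push_constants pc = { .batch_base = base };
        dispatch_mul_mat(wg_x, wg_y, n, &pc);
    }
}
```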
2026-02-27 20:57:58 +02:00
shaofeiqi 51ce7de94c opencl: refactor expm1 and softplus (llama/19404)
* opencl: refactor expm1

* opencl: refactor softplus

* opencl: use h for half literals

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-02-27 20:57:58 +02:00
shaofeiqi 6fadc749a9 opencl: optimize mean and sum_row kernels (llama/19614)
* opencl: optimize mean and sum_row kernels

* opencl: add comment for max subgroups

* opencl: format

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-02-27 20:57:58 +02:00
Talha Can Havadar 58855d08c2 ggml: ggml-cpu: force-no-lto-for-cpu-feats (llama/19609)
When LTO is enabled in the build environment, it forces all builds to
have LTO in place. But the feature detection logic is fragile and can
cause Illegal-instruction errors with LTO. This disables LTO for the
feature detection code to prevent cross-module optimization from
inlining architecture-specific instructions into the score function.
Without this, LTO can cause SIGILL when loading backends on older CPUs
(e.g., loading the power10 backend on power9 crashes before the
feature check runs).
2026-02-27 20:57:58 +02:00
Georgi Gerganov cf4bd07028 cuda : enable CUDA graphs for MMID 1 <= BS <= 4 (llama/19645)
* cuda : enable CUDA graphs for MMID BS <= 4

* cont : add stream capture check

Co-authored-by: Oliver Simons <osimons@nvidia.com>

* cont : add MMVQ_MMID_MAX_BATCH_SIZE

---------

Co-authored-by: Oliver Simons <osimons@nvidia.com>
2026-02-27 20:57:58 +02:00
Judd 5ee5748722 ggml : make `ggml_is_view` an API (llama/19539)
* make `ggml_is_view` an API

* introduce `ggml_aux_is_view` as an inline version for internal use

* change `ggml_aux_is_view` to `ggml_impl_is_view`
2026-02-27 20:57:58 +02:00
Mario Limonciello 5d9d72ec12 Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (llama/19591)
Avoids issues with ROCm 6.4.4.

Closes: https://github.com/ggml-org/llama.cpp/issues/19580
Fixes: 6845f7f87 ("Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)")

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
2026-02-27 20:57:58 +02:00
abhijain1204fujitsu f8f7c1d891 ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (llama/19132)
* Updated repack.cpp

* Updated repack.cpp

* Updated repack.cpp

* Added if condition to support only vector length 256.

* Changed the format, removed comments and a duplicate variable

* If SVE 256 is not present, the generic function was used for the
computation, which slowed performance.

So code was added to fall back to the NEON path when SVE 256 is not
present (see the sketch after this list).

* Code format change suggestion
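
A minimal sketch of that dispatch, with hypothetical kernel names (svcntb() is the SVE ACLE intrinsic returning the vector length in bytes, so 32 means 256-bit vectors):

```c
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

// hypothetical kernel entry points
void gemm_q4_K_8x8_q8_K_sve256(void);
void gemm_q4_K_8x8_q8_K_neon(void);

void gemm_q4_K_8x8_q8_K(void) {
    if (svcntb() == 32) {
        gemm_q4_K_8x8_q8_K_sve256(); // 256-bit SVE vectors available
    } else {
        gemm_q4_K_8x8_q8_K_neon();   // NEON instead of the slow generic path
    }
}
#endif
```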

---------

Co-authored-by: Vithule, Prashant <Prashant.Vithule@fujitsu.com>
2026-02-27 20:57:58 +02:00
David Friehs 02a9f660b8 cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (llama/19624)
* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to Nsight.
AFAICT no overflow can occur here, as iq2xxs values are far too small;
both rewrites are double-checked in the sketch after this list.

* uint -> uint32_t

error: identifier "uint" is undefined
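
A quick self-contained check of the two rewrites (illustrative; it sweeps only the small non-negative ranges relevant to iq2_xxs):

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    // (aux32 >> 28) * 2 + 1  ==  (aux32 >> 27) | 1
    // only the top bits matter, so sweep them via the top byte
    for (uint32_t b = 0; b < 256; ++b) {
        const uint32_t aux32 = b << 24;
        assert(((aux32 >> 28) * 2 + 1) == ((aux32 >> 27) | 1));
    }
    // (sum * scale + sum / 2) / 4  ==  (sum * (scale * 2 + 1)) / 8
    // holds for non-negative integers under C's truncating division
    for (int scale = 0; scale < 16; ++scale) {
        for (int sum = 0; sum < 4096; ++sum) {
            assert((sum * scale + sum / 2) / 4 ==
                   (sum * (scale * 2 + 1)) / 8);
        }
    }
    return 0;
}
```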
2026-02-27 20:57:58 +02:00
Daniel Bevenius df2f8d3bc4 cmake : check if KleidiAI API has been fetched (llama/19640)
This commit addresses a build issue with the KleidiAI backend when
building multiple CPU backends. Commit
3a00c98584e42a20675b6569d81beadb282b0952 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable;
the latter does handle this case (it is idempotent, but
FetchContent_Populate is not).

I missed this during my review and should not have committed without
verifying the CI failure, sorry about that.
2026-02-27 20:57:58 +02:00
Georgi Gerganov 22f0861efc ggml : avoid UB in gemm ukernel (llama/19642) 2026-02-27 20:57:58 +02:00
Aaron Teo 7b5a1ebaa6 ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (llama/19399) 2026-02-27 20:57:58 +02:00
Aman Gupta 76f769d06f ggml-cpu: FA add GEMM microkernel (llama/19422)
* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR !=0

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations
2026-02-27 20:57:58 +02:00
SamareshSingh 7ee772ab2b cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (llama/19581)
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments
2026-02-27 20:57:58 +02:00
Georgi Gerganov 4bea3cd329 ggml : bump version to 0.9.7 (ggml/1425) 2026-02-27 20:57:58 +02:00
Dmitry Atamanov cec1dd9d12 examples : update miniaudio library to 0.11.24 (#3672) 2026-02-27 11:15:15 +01:00
Maxime Grenu 21411d81ea docs : fix duplicate word typo in VAD section (#3670)
The VAD section contained a spurious 'the' at the end of a sentence,
creating the run-on 'Using this information the / only the speech
segments...'. Replace the orphaned 'the' with a comma so the sentence
reads correctly: 'Using this information, only the speech segments...'.
2026-02-19 16:18:42 +01:00
Georgi Gerganov 364c77f4ca talk-llama : sync llama.cpp 2026-02-15 21:44:37 +02:00
Georgi Gerganov 83f2ed19e1 sync : ggml 2026-02-15 21:44:37 +02:00
Georgi Gerganov 4ac70ce791 models : optimize qwen3next graph (llama/19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
2026-02-15 21:44:37 +02:00
Adrien Gallouët 226e8c041c ggml : fix GGML_DEBUG with OpenMP (llama/19599)
last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-15 21:44:37 +02:00
Georgi Gerganov fbdac5119c metal : fix ACC op (llama/19427) 2026-02-15 21:44:37 +02:00
Jeff Bolz cc448def01 vulkan: support L2_NORM with contiguous rows (llama/19604) 2026-02-15 21:44:37 +02:00
Jeff Bolz 197e9ab6eb vulkan: support GGML_OP_SET (llama/19584) 2026-02-15 21:44:37 +02:00
Sophon fc6bbab817 vulkan: Add vendor id for Qualcomm drivers (llama/19569)
This commit allows the Qualcomm native Vulkan driver to be used on
Windows instead of Mesa Dozen.
2026-02-15 21:44:37 +02:00
Max Krasnyansky e6476d4c12 hexagon: further optimizations and refactoring for flash attention (llama/19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.
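
For reference, a scalar version of the online softmax that the HVX code vectorizes with hvx_vec_expf (a standard single-pass formulation, not the actual Hexagon kernel):

```c
#include <math.h>
#include <stddef.h>

// Single-pass ("online") softmax: keep a running max m and running
// denominator s, rescaling s by exp(m_old - m_new) whenever the max grows.
void softmax_online(const float *x, float *y, size_t n) {
    float m = -INFINITY;
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float m_new = x[i] > m ? x[i] : m;
        s = s * expf(m - m_new) + expf(x[i] - m_new);
        m = m_new;
    }
    for (size_t i = 0; i < n; ++i) {
        y[i] = expf(x[i] - m) / s;
    }
}
```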

---------

Co-authored-by: chraac <chraac@gmail.com>
2026-02-15 21:44:37 +02:00
Jeff Bolz ec57bf407c vulkan: restore -inf check in FA shaders (llama/19582) 2026-02-15 21:44:37 +02:00
Alberto Cabrera Pérez e8a25654b2 Fix wrong memcpy length for block_interleave == 4 (llama/19575) 2026-02-15 21:44:37 +02:00
ymcki 628b545b7e fix vulkan ggml_acc only works in 3d but not 4d (llama/19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Aman Gupta 58e3d5a42d CUDA: loop over ne2*ne3 in case it overflows (llama/19538)
* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
2026-02-15 21:44:37 +02:00
Oliver Simons 3eb4905af1 CUDA: Do not mutate cgraph for fused ADDs (llama/19566)
* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
   step as we store the properties before modifying the graph in-place
   in the cuda-backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 0e94faa19c metal : improve concurrency (llama/19555) 2026-02-15 21:44:37 +02:00
Georgi Gerganov c5325e50fc metal : support GGML_OP_SET (llama/19548) 2026-02-15 21:44:37 +02:00
Shupei Fan 195af60a8b hexagon: fix typo in vtcm_needs_release (llama/19545) 2026-02-15 21:44:37 +02:00
lhez 9f87eeccdf opencl: add basic support for q4_1 (llama/19534)
* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm
2026-02-15 21:44:37 +02:00
Georgi Gerganov d8e3e2ef08 metal : update sum_rows kernel to support float4 (llama/19524) 2026-02-15 21:44:37 +02:00
Mario Limonciello 39b5f414a3 Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (llama/19461)
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).

The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
 2549 |             using ARegsT = typename Impl::ARegsT;
```

Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2]. Once this is actually fixed
upstream, version guards can be added so the workaround is applied
only where necessary.

Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2026-02-15 21:44:37 +02:00
Max Krasnyansky 304205679c hexagon: further optimization and tuning of matmul and dot kernels (llama/19407)
* ggml-hexagon: implement 2x2 matmul kernel

* hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4

* hexagon: fix editor config failures

* hexagon: refactor matmul ops to use context struct and remove wrappers

Also implement vec_dot_f16 2x2

* hexagon: refactor dyn quantizers to use mmctx

* hexagon: remove mm fastdiv from op_ctx

* hexagon: refactor matmul entry point to reduce code duplication

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
lhez 0326fd37dd opencl: add general Q6_K mm and Q4_K mv (llama/19347)
* opencl: add general q6_k mm

* opencl: refine condition for q6_K mm

* opencl: add general q4_K mv

* opencl: fix whitespace
2026-02-15 21:44:37 +02:00
Georgi Gerganov f3e78985be ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/19511)
* ggml : unary ops support non-cont src0

* metal : support F16 unary ops + fix ELU
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3ffa1fd84e metal : extend l2_norm support for non-cont src0 (llama/19502) 2026-02-15 21:44:37 +02:00
Max Krasnyansky 09587ceb12 hexagon: Add ARGSORT, DIV, SQR, SQRT, SUM_ROWS, GEGLU (llama/19406)
* hexagon: add ARGSORT op

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>

* hexagon: argsort reject tensors with huge rows for now

* Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend

* hexagon : Add GEGLU op

* hexagon: fix editor config check

* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA

---------

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3504358056 ggml : extend bin bcast for permuted src1 (llama/19484)
* tests : extend bin bcast for permuted src1

* cont : extend bin support

* cont : s0 is always 1

* tests : simplify
2026-02-15 21:44:37 +02:00