Commit Graph

2032 Commits

Author SHA1 Message Date
Daniel Bevenius df2f8d3bc4 cmake : check if KleidiAI API has been fetched (llama/19640)
This commit addresses a build issue with the KleidiAI backend when
building multiple CPU backends. Commit
3a00c98584e42a20675b6569d81beadb282b0952 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable;
the latter handles this case correctly (it is idempotent, but
FetchContent_Populate is not).

I missed this during my review and should not have committed without
verifying the CI failure; sorry about that.
2026-02-27 20:57:58 +02:00
Georgi Gerganov 22f0861efc ggml : avoid UB in gemm ukernel (llama/19642) 2026-02-27 20:57:58 +02:00
Aaron Teo 7b5a1ebaa6 ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (llama/19399) 2026-02-27 20:57:58 +02:00
Aman Gupta 76f769d06f ggml-cpu: FA add GEMM microkernel (llama/19422)
* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR !=0

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations
2026-02-27 20:57:58 +02:00
SamareshSingh 7ee772ab2b cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (llama/19581)
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments
2026-02-27 20:57:58 +02:00
Georgi Gerganov 4bea3cd329 ggml : bump version to 0.9.7 (ggml/1425) 2026-02-27 20:57:58 +02:00
Georgi Gerganov 4ac70ce791 models : optimize qwen3next graph (llama/19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
2026-02-15 21:44:37 +02:00
Adrien Gallouët 226e8c041c ggml : fix GGML_DEBUG with OpenMP (llama/19599)
last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-15 21:44:37 +02:00
Georgi Gerganov fbdac5119c metal : fix ACC op (llama/19427) 2026-02-15 21:44:37 +02:00
Jeff Bolz cc448def01 vulkan: support L2_NORM with contiguous rows (llama/19604) 2026-02-15 21:44:37 +02:00
Jeff Bolz 197e9ab6eb vulkan: support GGML_OP_SET (llama/19584) 2026-02-15 21:44:37 +02:00
Sophon fc6bbab817 vulkan: Add vendor id for Qualcomm drivers (llama/19569)
This commit allows the Qualcomm native Vulkan driver to be used on
Windows instead of Mesa Dozen.
2026-02-15 21:44:37 +02:00
Max Krasnyansky e6476d4c12 hexagon: further optimizations and refactoring for flash attention (llama/19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.

---------

Co-authored-by: chraac <chraac@gmail.com>
2026-02-15 21:44:37 +02:00
Jeff Bolz ec57bf407c vulkan: restore -inf check in FA shaders (llama/19582) 2026-02-15 21:44:37 +02:00
Alberto Cabrera Pérez e8a25654b2 Fix wrong memcpy length for block_interleave == 4 (llama/19575) 2026-02-15 21:44:37 +02:00
ymcki 628b545b7e fix vulkan ggml_acc only works in 3d but not 4d (llama/19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Aman Gupta 58e3d5a42d CUDA: loop over ne2*ne3 in case it overflows (llama/19538)
* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
2026-02-15 21:44:37 +02:00
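The overflow-avoidance pattern above can be sketched in plain C++. This is an illustrative host-side model only, not the actual CUDA kernel: `process_chunk`, `max_chunk`, and `process_all` are hypothetical names standing in for a kernel launch and a 32-bit grid limit.

```cpp
// Hedged sketch: the flattened outer extent ne2*ne3 is computed in 64 bits
// and processed in chunks that each fit a 32-bit launch dimension, so the
// product can never overflow a 32-bit index.
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-chunk worker standing in for a kernel launch.
static void process_chunk(int64_t start, int32_t count, std::vector<int64_t> & seen) {
    for (int32_t i = 0; i < count; ++i) {
        seen.push_back(start + i);
    }
}

std::vector<int64_t> process_all(int64_t ne2, int64_t ne3, int32_t max_chunk) {
    std::vector<int64_t> seen;
    const int64_t total = ne2 * ne3;  // 64-bit product, safe from 32-bit overflow
    for (int64_t start = 0; start < total; start += max_chunk) {
        const int32_t count = (int32_t) std::min<int64_t>(max_chunk, total - start);
        process_chunk(start, count, seen);
    }
    return seen;
}
```

The real kernel additionally uses fastdiv to recover the (i2, i3) coordinates from the flattened index without an expensive integer division.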
Oliver Simons 3eb4905af1 CUDA: Do not mutate cgraph for fused ADDs (llama/19566)
* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph-capture
   step, since the CUDA backend stores the graph properties before the
   in-place modification

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 0e94faa19c metal : improve concurrency (llama/19555) 2026-02-15 21:44:37 +02:00
Georgi Gerganov c5325e50fc metal : support GGML_OP_SET (llama/19548) 2026-02-15 21:44:37 +02:00
Shupei Fan 195af60a8b hexagon: fix typo in vtcm_needs_release (llama/19545) 2026-02-15 21:44:37 +02:00
lhez 9f87eeccdf opencl: add basic support for q4_1 (llama/19534)
* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm
2026-02-15 21:44:37 +02:00
Georgi Gerganov d8e3e2ef08 metal : update sum_rows kernel to support float4 (llama/19524) 2026-02-15 21:44:37 +02:00
Mario Limonciello 39b5f414a3 Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (llama/19461)
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).

The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
 2549 |             using ARegsT = typename Impl::ARegsT;
```

Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2]. Once this is actually fixed
upstream, version guards can be added so the workaround is applied only
when necessary.

Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2026-02-15 21:44:37 +02:00
Max Krasnyansky 304205679c hexagon: further optimization and tuning of matmul and dot kernels (llama/19407)
* ggml-hexagon: implement 2x2 matmul kernel

* hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4

* hexagon: fix editor config failures

* hexagon: refactor matmul ops to use context struct and remove wrappers

Also implement vec_dot_f16 2x2

* hexagon: refactor dyn quantizers to use mmctx

* hexagon: remove mm fastdiv from op_ctx

* hexagon: refactor matmul entry point to reduce code duplication

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
lhez 0326fd37dd opencl: add general Q6_K mm and Q4_K mv (llama/19347)
* opencl: add general q6_k mm

* opencl: refine condition for q6_K mm

* opencl: add general q4_K mv

* opencl: fix whitespace
2026-02-15 21:44:37 +02:00
Georgi Gerganov f3e78985be ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/19511)
* ggml : unary ops support non-cont src0

* metal : support F16 unary ops + fix ELU
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3ffa1fd84e metal : extend l2_norm support for non-cont src0 (llama/19502) 2026-02-15 21:44:37 +02:00
Max Krasnyansky 09587ceb12 hexagon: Add ARGSORT, DIV, SQR, SQRT, SUM_ROWS, GEGLU (llama/19406)
* hexagon: add ARGSORT op

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>

* hexagon: argsort reject tensors with huge rows for now

* Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend

* hexagon : Add GEGLU op

* hexagon: fix editor config check

* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA

---------

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3504358056 ggml : extend bin bcast for permuted src1 (llama/19484)
* tests : extend bin bcast for permuted src1

* cont : extend bin support

* cont : s0 is always 1

* tests : simplify
2026-02-15 21:44:37 +02:00
Georgi Gerganov de949fb1db metal : consolidate unary ops (llama/19490) 2026-02-15 21:44:37 +02:00
Oliver Simons 57c620b4b1 CUDA : Update CCCL-tag for 3.2 to final release from RC (llama/19486)
CCCL 3.2 has been released since it was added to llama.cpp as part of
the backend-sampling PR, so it makes sense to update from the RC to the
final release.

https://github.com/NVIDIA/cccl/releases/tag/v3.2.0
2026-02-15 21:44:37 +02:00
Nikhil Jain 562255fd77 Plug memory leaks and free resources on shutdown (llama/19315)
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool

* Free pools

* Cleanup

* More cleanup

* Run clang-format

* Fix arg-parser and tokenizer test errors that free an unallocated buffer

* Fix device lost callback to not print on device teardown

* Fix include and run clang-format

* remove unused unused

* Update binary ops

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-02-15 21:44:37 +02:00
Alberto Cabrera Pérez d77265c818 ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (llama/19360)
* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention
2026-02-15 21:44:37 +02:00
k4ss4n b0fe2e84fa ggml : use noexcept overload for is_regular_file in backend registration (llama/19452)
Using the noexcept overload of std::filesystem::directory_entry::is_regular_file
prevents abnormal termination from an uncaught exception
(as caused by symlinks to non-existent folders on Linux)

Resolves: #18560
2026-02-15 21:44:37 +02:00
Raul Torres 2de2fc9270 CANN: Remove unnecessary wrapper for `ggml_backend_buft_is_cann` (llama/18968) 2026-02-15 21:44:37 +02:00
hipudding 6a74f56212 CANN: implement quantized MUL_MAT_ID for MoE models (llama/19228)
Implement the ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on the CANN backend.

Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes

Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
2026-02-15 21:44:37 +02:00
Georgi Gerganov a36210c836 cuda : extend GGML_OP_PAD to work with non-cont src0 (llama/19429)
* cuda : extend GGML_OP_PAD to work with non-cont src0

* tests : add permuted pad
2026-02-15 21:44:37 +02:00
Oliver Simons 808904277e CUDA: Fix non-contig rope (llama/19338)
* Rename variables + fix rope_neox

Seems memory layout is shared with Vulkan so we can port fix from
https://github.com/ggml-org/llama.cpp/pull/19299

* Fix rope_multi

* Fix rope_vision

* Fix rope_norm

* Rename ne* to ne0* for consistent variable naming

* cont : consistent stride names

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 55d7cb2e93 metal : consolidate bin kernels (llama/19390)
* metal : refactor bin kernels

* cont

* cont : fix cv
2026-02-08 09:29:10 +02:00
Georgi Gerganov a9a0a51fba metal : fix event synchronization in cpy_tensor_async (llama/19402) 2026-02-08 09:29:10 +02:00
Abhijit Ramesh 1739af663a ggml-webgpu: JIT compile binary operators and handle binding overlaps (llama/19310)
* ggml webgpu: port binary operators to use pre-wgsl

* Add binary.wgsl: unified shader with conditionals for all 4 ops

* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor

* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)

* Update CMake to generate binary operator shaders at build time

* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling

* port binary operators from AOT to pre-wgsl JIT compilation

* add src1=dst overlap handling for binary ops

* use compile-time workgroup size defines instead of runtime overrides

* ggml-webgpu: complete overlap handling for binary ops

* add support for inplace & overlap case in binding setup

* restructure conditional logic to handle all overlap cases

* ensure all buffer bindings are correctly assigned for edge cases

* ggml-webgpu: remove unused binary overlap cases

Remove src0==src1 binary overlap case that never occurs in practice.

* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT

* remove unused src0==src1 and all-same variant

* refactor wgsl to eliminate duplication
2026-02-08 09:29:10 +02:00
Nechama Krashinski f2f7320817 sycl: add F16 support for GGML_OP_CEIL (llama/19306)
* Fix SYCL CEIL operator

* sycl: implement GGML_OP_CEIL
2026-02-08 09:29:10 +02:00
Jeff Bolz cea22b3075 vulkan: For coopmat2 FA, use fp16 accumulators for the final result (llama/19376)
The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change
does the same for Vulkan. This helps particularly with large head sizes, which
are very register-limited.

I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.

I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
2026-02-08 09:29:10 +02:00
Jeff Bolz c1b63354bb vulkan: make FA mask/softcap enables spec constants (llama/19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-08 09:29:10 +02:00
Georgi Gerganov 776cf61857 metal : skip loading all-zero mask (llama/19337)
* metal : skip loading all-zero mask

* cont : minor
2026-02-08 09:29:10 +02:00
Georgi Gerganov 2a7d5490f1 cuda : cuda graphs now compare all node params (llama/19383) 2026-02-08 09:29:10 +02:00
Georgi Gerganov 34d332aca5 metal : adaptive CPU/GPU interleave based on number of nodes (llama/19369) 2026-02-08 09:29:10 +02:00
Jeff Bolz a567c140a3 vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-08 09:29:10 +02:00
Georgi Gerganov 0781df2518 metal : add diag (llama/19330) 2026-02-08 09:29:10 +02:00