whisper.cpp

Commit Graph

Author	SHA1	Message	Date
hipudding	5c583f3c02	CANN: Fix rename for get_env (llama/18652) In #18624, get_env in ggml-cann was renamed to get_env_as_lowercase to accurately reflect the function’s behavior and reduce the chance of misuse. However, the update missed renaming call sites in other files. This commit fixes that oversight.	2026-01-14 09:11:59 +02:00
Raul Torres	47671c81db	CANN: Rename `get_env` to `get_env_as_lowercase` (llama/18624)	2026-01-14 09:11:59 +02:00
Max Krasnyansky	a5f51ac75b	Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (llama/18611) * hexagon: improve fp16 matmul and add fp32/fp16 flash-attention * hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx * hexagon: add support for SCALE fp32 * hexagon: replace scalar fp32 -> fp16 copy with HVX * hexagon: optimize flash_atten_ext with aligned VTCM buffers and DMA - Implements double-buffered DMA prefetching for K, V, and Mask tensors. - Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations. - Correctly synchronizes DMA transfers to prevent race conditions. - Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking. * hexagon: use aligned mad_f16 * hexagon: flash_atten more aligned ops * hexagon: optimize scale_f32 hvx helpers * hexagon: unroll fa loops * hexagon: remove unused set-rows log * hexagon: flash_attn_ext add support for DMAing Q - Update `op_flash_attn_ext` to include Q row size in scratchpad allocation. - Pad Q row size to 128 bytes for alignment. - Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`. - Update dot product computations to use VTCM-buffered Q data. * hexagon: fix handling of NANs hvx dotproducts * hexagon: cleanup spad allocation in flash-atten * hexagon: improve fp16/fp32 matmul - Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics. - Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM - Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible. - Implemented fallback logic to the original implementation for complex broadcasting scenarios. * hexagon: fix HVX_ARCH check * hexagon: matmul cleanup and fp16 fixes Use aligned vec_dot_f16 for 2d matmuls and unaligned version for 4d. * hexagon: fix fp16 x fp16 matmuls and some minor refactoring * hexagon: add support for GET_ROWS f32 -> f32 Also optimize SET_ROWS threading a bit when we have just a few rows to process. * hexagon: optimize set-rows threading * hexagon: update adb/run-bench.sh to properly support experimental and verbose options * hexagon: flash_atten use aligned vectors for dot products	2026-01-14 09:11:59 +02:00
Aadeshveer Singh	436f30d05f	ggml : optimize cuda ssm_scan using warp-level reduction (llama/18505) * ggml : optimize cuda ssm_scan using warp-level reduction * ggml : apply code review suggestions (style, const, constexpr) * ggml : add TODO regarding stride consistency	2026-01-14 09:11:59 +02:00
Jeff Bolz	dbec71f6cf	vulkan: support buffer_from_host_ptr (llama/18467) * vulkan: support buffer_from_host_ptr * hacky use of buffer_from_host_ptr for directio * disable buffer_from_host_ptr cap * use external memory for ggml_vk_host_malloc, revert model loader changes * disable external_memory_host for MoltenVK * take buffer memory types into account * don't use external_memory_host for ggml_vk_host_malloc	2026-01-14 09:11:59 +02:00
Aman Gupta	575d894603	ggml-cuda: refactor cuda graph usage (llama/18637) * ggml-cuda: refactor cuda graph usage * use is_enabled() instead of enabled	2026-01-14 09:11:59 +02:00
Beinsezii	ed674cfc10	mmq.cu: tune mmq/rocblas switching for RDNA (llama/18537) * Patch perf regression for mmq kernels in ROCm recover performance regression for https://github.com/ggml-org/llama.cpp/issues/17917 * add n_experts branch like the cdna path * mmq.cu: tune mmq/wmma switching for RDNA * mmq.cu: move amd wmma mmq/wmma switching behind IS_RDNA3 * Update ggml/src/ggml-cuda/mmq.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Jiacheng (Jason) Chen <76919340+jiachengjason@users.noreply.github.com> Co-authored-by: jiachengjason <jasonchen.jiacheng@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-01-14 09:11:59 +02:00
Adrien Gallouët	5520f27363	ggml : fix avx512bf16 build (llama/18623) - include `immintrin.h` when required - remove unused m512bh Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-14 09:11:59 +02:00
Raul Torres	9a1a6685ba	CANN: Make `valid_values` variable `static const` (llama/18627)	2026-01-14 09:11:59 +02:00
nwyin	e563e239a7	ggml webgpu: add CEIL operation support (llama/18605) * ggml-webgpu: add CEIL operation support Add support for the CEIL unary operation in the WebGPU backend: - Add CEIL_FUNC shader template in unary_op.wgsl - Add 4 shader variants (f32, f16, inplace versions) - Initialize CEIL pipelines in ggml-webgpu.cpp - Register CEIL in supports_op function * docs: update WebGPU ops support for CEIL	2026-01-14 09:11:59 +02:00
Johannes Gäßler	9956333361	CUDA: fix FA FP16 accumulator overflow for Granite (llama/18614)	2026-01-14 09:11:59 +02:00
Aman Gupta	804f545454	ggml-cuda: check for srcs outside the cgraph (llama/18583) * ggml-cuda: check for srcs outside the cgraph * review: use leafs instead	2026-01-14 09:11:59 +02:00
Jeff Bolz	52ba45e2b8	vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (llama/18582)	2026-01-14 09:11:59 +02:00
Jeff Bolz	0a99b4c377	vulkan: handle quantize_q8_1 overflowing the max workgroup count (llama/18515) * vulkan: handle quantize_q8_1 overflowing the max workgroup count * vulkan: Fix small tile size matmul on lavapipe * fix mul_mat_id failures	2026-01-14 09:11:59 +02:00
Chenguang Li	1d657effe3	CANN: add operator fusion support for ADD + RMS_NORM (llama/17512) This commit implements operator fusion for ADD + RMS_NORM operations in the CANN backend to reduce memory access overhead and improve performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION environment variable (default: false). Changes: - Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm - Add ggml_cann_can_fuse() to check fusion eligibility - Integrate fusion logic into computation graph evaluation - Add test cases for ADD + RMS_NORM fusion - Update documentation with new environment variable The fusion combines ADD and RMS_NORM into a single kernel call, which is more efficient than executing them separately.	2026-01-14 09:11:59 +02:00
Daniel Bevenius	4d6a3fb00d	sampling : add support for backend sampling (llama/17004) * sampling : add support for backend sampling This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend. For example, the backend sampler chain might select/sample a token directly in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilites need to be transferred back to system memory for further processing by CPU samplers. Currently the backend sampling works in a similar manner to how pooling works, it is a function that is called by build_graph and the sampler operations become part of the models computation graph. * llama-cli : add backend sampler configuration * server : add backend sampling options/configuration * webui : add backend sampling options * ggml : add initial cumsum implementation for CUDA * sampling : enable all backend sampler tests This commit enables all exisiting backend sampler tests in the test-backend-sampler. Previously, some tests were disabled because there were missing ggml operation implementations. * graph : do not include llama-model.h * sampling : always expose sampled_ids This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that this enables both common/sampling.cpp and src/llama-sampling.cpp can simplify their logic. Not all backends samplers that process logits need to set the sampled_tokens_id as they may not change the order of the logits, for example the temperature sampler only scales the logits but does not change their order. Simliar the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful. * sampling : ensure at most one output token per seq This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence. * CUDA: Optimize argsort for gpu-based token sampling Argsort is used for top-k currently. WE optimize argsort by 2 things: 1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs 2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths, it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics. https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview Some perf numbers for a RTX PRO 6000: On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf` Before: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run ``` After: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run ```	2026-01-14 09:11:59 +02:00
Aman Gupta	f0bf5b8cc3	CUDA: disable cuda graph when using n-cpu-moe (llama/18593) * CUDA: disable cuda graph when using n-cpu-moe * call ggml_cuda_set_device	2026-01-14 09:11:59 +02:00
Aman Gupta	88f5765c82	ggml-cuda: remove unused params in ggml_cuda_graph (llama/18579)	2026-01-14 09:11:59 +02:00
Aman Gupta	1e725546b0	ggml-cuda: fixes for concurrent streams (llama/18496)	2026-01-14 09:11:59 +02:00
Johannes Gäßler	60d178cee9	CUDA: only allocate FA tmp buffer if needed (llama/18564)	2026-01-14 09:11:59 +02:00
pl752	304e780e5f	(Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (llama/18559) * CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta) * CUDA: Explicitly casted some of the int alloc counts before multiplication in argsort --------- Co-authored-by: pl752 <maximpl752@gmail.com>	2026-01-14 09:11:59 +02:00
Shouyu	c9e9f083c2	ggml-hexagon: optimize activation function (llama/18393) * refactor: refactor silu * refactor: optimize swiglu * refactor: remove unncessary if in swiglu * refactor: refactor swiglu_oai * chore: fix formatting issue	2026-01-14 09:11:59 +02:00
Jeff Bolz	9d83865607	vulkan: Optimize GGML_OP_CUMSUM (llama/18417) * vulkan: Optimize GGML_OP_CUMSUM There are two paths: The preexisting one that does a whole row per workgroup in a single shader, and one that splits each row into multiple blocks and does two passes. The first pass computes partials within a block, the second adds the block partials to compute the final result. The multipass shader is used when there are a small number of large rows. In the whole-row shader, handle multiple elements per invocation. * use 2 ELEM_PER_THREAD for AMD/Intel * address feedback	2026-01-14 09:11:59 +02:00
Jeff Bolz	b7ff521e71	vulkan: Implement mmvq for iq1_s/iq1_m (llama/18450)	2026-01-14 09:11:59 +02:00
Georgi Gerganov	b99c911c49	metal : adjust extra size for FA buffer to avoid reallocations (llama/18545)	2026-01-14 09:11:59 +02:00
Chris Rohlf	f328b13d5c	rpc : use unordered_map::reserve and emplace (llama/18513)	2026-01-14 09:11:59 +02:00
MeeMin	fbde389665	cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (llama/18433) * ggml-cuda: fixed assertion in ggml_cuda_cpy (llama/18140) * ggml-cuda: changes in data types to int64_t * ggml-cuda: added asserts for CUDA block numbers * ggml-cuda: changed the condition for y and z dimension	2026-01-14 09:11:59 +02:00
Aman Gupta	f22c1ccbe4	ggml-cuda: remove unneccesary prints on ggml_cuda_init (llama/18502)	2026-01-14 09:11:59 +02:00
Jeff Bolz	b1f65a4a7e	vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (llama/18295) * vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron Also handle GGML_OP_SCALE at the end (nemotron, deepseek2). Fewer pipeline variants and spec constants, just use push constants. In test_topk_moe, change exp_probs_b to be 1D, matching real networks. Update test-backend-ops and ggml-backend to allow verifying multiple outputs in a fusion test (topk_moe has two outputs). Previously only the final node was verified. * change test_topk_moe to allow results in arbitrary order * disable sigmoid fusion for moltenvk	2026-01-14 09:11:59 +02:00
Peter A.	a96310871a	examples : fix executable example targets (#3600 ) * cmake: - added `whisper-` prefix to unprefixed targets: `quantize`, `lsp`, `vad-speech-segments` - added `install(TARGETS ${TARGET} RUNTIME)` where it was missing Signed-off-by: Peter A. <ink.splatters@pm.me> * .github/workflows/build.yml: quantize -> whisper-quantize Signed-off-by: Peter A. <ink.splatters@pm.me> --------- Signed-off-by: Peter A. <ink.splatters@pm.me>	2026-01-13 08:08:18 +01:00
KITAITI Makoto	679bdb53db	ruby : fix segmentation fault (#3591 ) * Mark long live variable * Fix test for Whisper::Token#deconstruct_keys(nil) * Don't use long live variable * Fix indentation	2026-01-05 17:41:22 +09:00
Georgi Gerganov	e9898ddfb9	sync : ggml	2025-12-31 18:27:20 +02:00
Georgi Gerganov	ce03f8e759	ggml : bump version to 0.9.5 (ggml/1410)	2025-12-31 18:27:20 +02:00
Georgi Gerganov	7359ac94d5	talk-llama : sync llama.cpp	2025-12-31 17:52:09 +02:00
Georgi Gerganov	54fa8216ea	sync : ggml	2025-12-31 17:52:09 +02:00
gatbontonpc	8189f2cb65	metal : add count_equal op (llama/18314) * add count equal for metal * remove trailing whitespace * updated doc ops table * changed shmem to i32 * added multi tg and templating * removed BLAS support from Metal docs * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add memset to set dst to 0 * metal : cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-31 17:52:09 +02:00
Johannes Gäßler	2d250f8049	CUDA: fix KQ max calculation (llama/18487)	2025-12-31 17:52:09 +02:00
Georgi Gerganov	5deaf8f2a3	metal : remove BF16 x F16 kernels (llama/18456)	2025-12-31 17:52:09 +02:00
Aman Gupta	467933199a	sycl: add newline at the end of CMakeLists.txt (llama/18503)	2025-12-31 17:52:09 +02:00
Rahul Sathe	a3635494da	Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (llama/18345) * cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x * [AI] sycl: auto-detect and skip incompatible IntelSYCL package Automatically detect compiler versions with incompatible IntelSYCL CMake configuration files and fall back to manual SYCL flags instead of requiring users to set options manually. Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake has SYCL_FEATURE_TEST_EXTRACT invocation errors. * refactor: improve SYCL provider handling and error messages in CMake configuration * refactor: enhance SYCL provider validation and error handling in CMake configuration * ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes	2025-12-31 17:52:09 +02:00
Charles Xu	c9955367d4	kleidiai: add and integrate SVE 256-bit vector-length kernel (llama/18458) * kleidiai: add and integrate SVE 256-bit vector-length kernel * updated for review comments	2025-12-31 17:52:09 +02:00
Aman Gupta	6d4aa96bfa	CUDA: add log line when mxfp4 acceleration is used (llama/18483) * CUDA: add log line when mxfp4 acceleration is used * add in backend_get_features	2025-12-31 17:52:09 +02:00
Johannes Gäßler	5765c5b04e	CUDA: fix replacment of bad archs in CMake (llama/18457)	2025-12-31 17:52:09 +02:00
Johannes Gäßler	d6cb2407b7	CUDA: Blackwell features for non-native builds (llama/18436)	2025-12-31 17:52:09 +02:00
Aman Gupta	e49e88b2d8	cuda: fix race condition in cumsum (llama/18448) * ggml-cuda: fix race condition in cumsum * remove unneccesary sync_threads	2025-12-31 17:52:09 +02:00
uvos	20f5729921	HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (llama/18202)	2025-12-31 17:52:09 +02:00
Aman Gupta	b8d209f55c	Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413 )" (llama/18426)	2025-12-31 17:52:09 +02:00
o7si	54fe9a645d	rpc: fix segfault on invalid endpoint format (llama/18387) * rpc: fix segfault on invalid endpoint format * rpc: add error log for failed endpoint connection	2025-12-31 17:52:09 +02:00
Boian Berberov	b3788ef729	cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (llama/18186) * minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h` * cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` - `ivybridge` - `piledriver` - `cannonlake` - `cascadelake` - `cooperlake` - `zen4` Resolves: #17966	2025-12-31 17:52:09 +02:00
QDelta	31fc2c37c8	ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (llama/18413)	2025-12-31 17:52:09 +02:00

1 2 3 4 5 ...

3922 Commits All Branches Search

3922 Commits

All Branches