whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Johannes Gäßler	f21d0cbb1a	CUDA: fix alignment check for FA (llama/19023)	2026-01-30 15:56:40 +02:00
lhez	0e030b852a	opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (llama/18970) * opencl: add `copy_to_contiguous` and utilize mm kernels * opencl: only copy to cont for f32 and f16 tensors * opencl: use cont mm for fallback when dst is large * opencl: use nb local to copy-to-cont * opencl: use local offset as well	2026-01-30 15:56:40 +02:00
Aman Gupta	d4fafcfc6f	CUDA: add gqa_ratio 4 for GLM 4.7 flash (llama/18953)	2026-01-30 15:56:40 +02:00
shaofeiqi	167fec69d5	opencl: add TRI op support (llama/18979)	2026-01-30 15:56:40 +02:00
Aleksei Nikiforov	55927d42ef	ggml-zdnn : mark zDNN buffers as non-host (llama/18967) While buffers reside in host memory, additional transformation is needed to use buffers with zDNN. Fixes #18848	2026-01-30 15:56:40 +02:00
Jeff Bolz	b7e323f40b	vulkan: Remove transfer_ctx, do everything in compute_ctx. (llama/18945) * vulkan: Remove transfer_ctx, do everything in compute_ctx. We had a bug where a set_tensor_async (using transfer_ctx) didn't get submitted before the graph_compute (using compute_ctx) that came after it. To avoid this sort of issue, just do everything in compute_ctx. Remove transfer_cmd_pool, which was already unused. * fix crash with perf logger	2026-01-30 15:56:40 +02:00
Jeff Bolz	b2bc4d810b	vulkan: support flash attention GQA/split_k with small batches (llama/18938)	2026-01-30 15:56:40 +02:00
Masato Nakasaka	3bbf4ced47	Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (llama/18831) This reverts commit 980b7cd17e055c8c587f79ffda7eb4fddf405566.	2026-01-30 15:56:40 +02:00
Jeff Bolz	660d943ff8	vulkan: Use mul_mat_vec_id for small values of n (llama/18918) Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and update the indexing calculations in get_offsets. Mat-vec is faster than mat-mat for small values of n. We don't get the same reuse of the weights as in the non-ID path, but with this the cost is linear in n rather than n>1 being far slower than n==1.	2026-01-30 15:56:40 +02:00
Oliver Simons	924a9e292c	CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (llama/18964) * CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator Strided iterator was added in [CCCL 3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into [CTK 13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5) * Unindent as per code review request	2026-01-30 15:56:40 +02:00
Oliver Simons	fdc83ee3c0	CUDA: Replace init_offsets kernel with iterators in cub-based argsort (llama/18930) * CUDA: Replace `init_offsets` with iterators in argsort This is a QOL improvement, saving us the cost of materializing the iterator * Remove unnecessary include from top-k.cu	2026-01-30 15:56:40 +02:00
Adrien Gallouët	bf71ffa6b3	ggml : cleanup path_str() (llama/18928) - Remove pragmas as `std::codecvt_utf8` is not used. - Avoid implicit `strlen()`. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-30 15:56:40 +02:00
Georgi Gerganov	b0517d6912	metal : enable FA for MLA heads (llama/18950)	2026-01-30 15:56:40 +02:00
Georgi Gerganov	47f3e3b927	ggml : add ggml_build_forward_select (llama/18550) * ggml : add ggml_build_forward_select * cuda : adapt CUDA graph compat to new feature * vulkan : update logic to handle command buffer closing * ggml : check compute for fusion * ggml : add comment	2026-01-30 15:56:40 +02:00
lhez	62a09b106d	opencl: fix q6_K mv for m=1 (llama/18893)	2026-01-30 15:56:40 +02:00
Reese Levine	389dafc7c2	ggml webgpu: support for backend sampling (llama/18880)	2026-01-30 15:56:40 +02:00
Thore Koritzius	511ca7a1f4	ggml : extend ggml_pool_1d + metal (llama/16429) * chore: resolve conflicts * feat: ggml metal impl * fix: ggml_metal_kargs_pool_1d struct * fix: require contiguous input * chore: test pool_1d * chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts * chore: add p0 and s0 to testing * fix: allow padding for cpu and metal * Update ggml/src/ggml-metal/ggml-metal.metal * fix: correct single-threaded loop * ggml : cleanup * tests : add ne[1] != 1 tests * fix: ne[1] handling in np * cont : fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-30 15:56:40 +02:00
Perry Naseck	ecb4b80c35	ggml-blas: hide warnings from included BLAS headers (llama/18818) * fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set * ggml-blas: hide warnings from included BLAS headers	2026-01-30 15:56:40 +02:00
Raul Torres	42960b6073	CANN: Remove unused `ggml_cann_get_device` function (llama/18625)	2026-01-30 15:56:40 +02:00
Chenguang Li	2fceb5a80f	CANN: fix an issue where get_env was not fully renamed (llama/18796) * CANN: fix an issue where get_env was not fully renamed * ci: add cann with acl group * ci: define use_acl_graph using GitHub Action * ci: update cann dockerfile with acl graph	2026-01-30 15:56:40 +02:00
hipudding	854274a297	CANN: support gated linear attn (llama/18653) * CANN: support gated linear attn This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator. The feature was implemented by YushengZhao. Because the previous submission was based on an outdated codebase, this PR was rebased to merge. Co-authored-by: YushengZhao <yusheng.chao@outlook.com> Co-authored-by: hipudding <huafengchun@gmail.com> * CANN: optimize OP gla Optimize gla for high preformance * Remove unused comments --------- Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain> Co-authored-by: YushengZhao <yusheng.chao@outlook.com>	2026-01-30 15:56:40 +02:00
shaofeiqi	ed6004d051	OpenCL: add SOLVE_TRI op support (llama/18846)	2026-01-30 15:56:40 +02:00
Georgi Gerganov	290ff3d28d	cuda : print less debug logs when disabling cuda graphs (llama/18868)	2026-01-30 15:56:40 +02:00
Johannes Gäßler	f2f0ba0384	CUDA: fix alignment on register spill for FA (llama/18815)	2026-01-30 15:56:40 +02:00
shalinib-ibm	78a23d4830	ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (llama/18837)	2026-01-30 15:56:40 +02:00
Max Krasnyansky	50b7ab3d46	hexagon: support for OP_CPY, host buffers now optional (llama/18822)	2026-01-30 15:56:40 +02:00
Oliver Simons	bc09047405	CUDA: Factor out and re-use `block_reduce` function (llama/18785) * CUDA: Refactor and expose two_stage_warp_reduce_* function * Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it Moving smem out of `__device__` function to `__global__` function allows for explicit smem reuse, as either compiler or cuda rt seem to not free it afterwards (`cudaFuncSetAttribute` fails when not accounting for it once for each call to two_stage_warp_reduce) * Update ggml/src/ggml-cuda/common.cuh Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Use two_stage_warp_reduce in group_norm_f32 * Use two_stage_warp_reduce in rms_norm_f32 * Fix smem calculation which expects bytes * Make `two_stage_warp_reduce` accept all values warp_reduce accepts Also integrate it into norm_f32 function * Use two_stage_warp_reduce in l2_norm_f32 * Use type traits for block reduction for better legibility Also adresss other requests by @am17an such as variable renaming * Make norm tests cover all cuda paths * Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK Unit-tests passed locally, let's see if they pass in the CI as well * Use `enum class` for `block_reduce_method` This is more type-safe than plain enum * Rename variables as suggested in code review by @am17an * Rename two_stage_warp_reduce -> block_reduce * Fix trailing whitespace in common.cuh * Make condition of static_assert type-dependent This delays evaluation until the template is actually instantiated. Otherwise, some compilers may evaluate the assert when parsing the template, resulting in build errors as observed here: https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785 * Inline definitions --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2026-01-30 15:56:40 +02:00
Jeff Bolz	4b155e9bfb	vulkan: Check maxStorageBufferRange in supports_op (llama/18709) * vulkan: Check maxStorageBufferRange in supports_op * skip maxStorageBufferRange check when shader64BitIndexing is enabled	2026-01-30 15:56:40 +02:00
Daniel Bevenius	25aeb66a4a	CUDA : fix typo in clang pragma comment [no ci] (llama/18830)	2026-01-30 15:56:40 +02:00
Ruben Ortlam	49762e8fb3	vulkan: work around Intel fp16 bug in mmq (llama/18814)	2026-01-30 15:56:40 +02:00
Perry Naseck	17656e56dc	ggml-metal: do not copy headers for embedded, use current binary dir for embedded (llama/18705)	2026-01-30 15:56:40 +02:00
yulo	c6a495ae5d	HIP: add fattn-mma-f16 for RDNA4 (llama/18481) * finish VQ mma * flash_attn_ext_f16_iter * KQ_rowsum * correct exp * fix scale error * fix softmax scale * fix softmax scale * enable fattn on cpu side * fix random error * disable fattn-mma-f16 on rdna3 * fix wrong col for rdna * use identity mat to transpose * resolve conflicts * basic tuning for DeepSeek-R1-Distill-Qwen-1.5B * fix volta compile error * align rdna4 policy for fattn * adjust fattn policy * adjust kernel selection logic * update as the review comments * keep fattn-wmma logic * adjust kernel selection logic --------- Co-authored-by: zhang hui <you@example.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-01-30 15:56:40 +02:00
Bráulio Oliveira	7aa8818647	examples : use -dev/--device and WHISPER_ARG_DEVICE (#3557) Align device selection naming with llama.cpp.	2026-01-21 08:40:30 +01:00
Yshtola	f53dc74843	whisper : Fix UTF-8 character boundary issue in segment wrapping (max_len) (#3592) The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break multi-byte UTF-8 characters, resulting in invalid sequences displayed as `�` (U+FFFD replacement character).	2026-01-16 14:16:05 +02:00
Georgi Gerganov	2eeeba56e9	release : v1.8.3	2026-01-15 11:54:31 +02:00
Georgi Gerganov	21c1765fcb	benches : update	2026-01-15 11:53:09 +02:00
Georgi Gerganov	47af2fb70f	sync : ggml	2026-01-14 09:11:59 +02:00
Georgi Gerganov	6ee0eaf531	CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (llama/18800)	2026-01-14 09:11:59 +02:00
Jeff Bolz	ab1828dc1c	vulkan: change memory_logger to be controlled by an env var (llama/18769)	2026-01-14 09:11:59 +02:00
Jeff Bolz	aedf332ec5	vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678) This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128. This should work when the number of blocks in the A matrix is less than 2^32 (for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like 2^32*LOAD_VEC_A elements. - Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b. - Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle variants. So far this change just adds a single use case for this, compiling with the e64BitIndexingEXT flag. - Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange. 64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort to avoid enabling it unconditionally.	2026-01-14 09:11:59 +02:00
Ruben Ortlam	716d68aca9	vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (llama/18763) * vulkan: Disable large coopmat matmul configuration on proprietary AMD driver * Also disable the large tile size	2026-01-14 09:11:59 +02:00
Ruben Ortlam	c0433783c3	Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (llama/18749) * vulkan: Enable and optimize large matmul parameter combination for AMD * limit tuning to AMD GPUs with coopmat support * use tx_m values instead of _l	2026-01-14 09:11:59 +02:00
Georgi Gerganov	ecfcc65fbf	talk-llama : sync llama.cpp	2026-01-14 09:11:59 +02:00
Georgi Gerganov	13dc9a912b	sync : ggml	2026-01-14 09:11:59 +02:00
shaofeiqi	d4ce2e554f	opencl: add SOFTPLUS op support (llama/18726)	2026-01-14 09:11:59 +02:00
Johannes Gäßler	3a1ea96373	HIP: adjust RDNA3.5 MMQ kernel selection logic (llama/18666)	2026-01-14 09:11:59 +02:00
Perry Naseck	484b17053a	cmake : update blas logic (llama/18205)	2026-01-14 09:11:59 +02:00
Michael Wand	45be2cd27a	Corrected: changed s13 = src1->nb[3] instead of nb[2] (llama/18724)	2026-01-14 09:11:59 +02:00
shaofeiqi	4af27bf2da	opencl: add EXPM1 op (llama/18704)	2026-01-14 09:11:59 +02:00
Reese Levine	4ac8c3b478	Updates to webgpu get_memory (llama/18707)	2026-01-14 09:11:59 +02:00
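
The UTF-8 boundary issue described in commit f53dc74843 can be illustrated with a minimal sketch. This is not the code from that commit: `is_utf8_continuation`, `utf8_length`, and `utf8_safe_cut` are hypothetical helper names introduced here, and the sketch assumes the input is valid UTF-8. It shows why a byte count differs from a character count, and how a byte offset can be moved back to the nearest character boundary before splitting.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; not taken from whisper.cpp.

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
static bool is_utf8_continuation(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Count code points by counting every byte that is NOT a continuation byte.
// For valid UTF-8 this differs from the byte length whenever the string
// contains multi-byte characters.
static std::size_t utf8_length(const std::string & s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if (!is_utf8_continuation(c)) {
            ++n;
        }
    }
    return n;
}

// Return the largest byte offset <= max_bytes that does not fall inside a
// multi-byte sequence, so s.substr(0, offset) remains valid UTF-8.
static std::size_t utf8_safe_cut(const std::string & s, std::size_t max_bytes) {
    if (max_bytes >= s.size()) {
        return s.size();
    }
    std::size_t pos = max_bytes;
    // Step back while the byte at the cut position continues a character.
    while (pos > 0 && is_utf8_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    assert(pos <= s.size());
    return pos;
}
```

For the string `"h\xC3\xA9llo"` ("héllo"), the byte length is 6 but `utf8_length` gives 5. A naive cut at byte offset 2 would land between the two bytes of "é", whereas `utf8_safe_cut` moves the cut back to offset 1.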

3934 Commits