whisper.cpp

Commit Graph

Author	SHA1	Message	Date
shaofeiqi	db9c88744d	opencl: add optimized q8_0 mm kernel for adreno (llama/18871) * Add Q8_0 OpenCL kernel Co-authored-by: yunjie <yunjie@qti.qualcomm.com> * opencl: fix build for non-adreno * opencl: refactor q8_0 * opencl: enforce subgroup size of 64 for adreno for q8_0 * For A750 and older generations, subgroup size can be 64 or 128. This kernel assumes subgroup size 64. * opencl: suppress warning when adreno kernels are disabled --------- Co-authored-by: yunjie <yunjie@qti.qualcomm.com> Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-02-08 09:29:10 +02:00
Simon Redman	efd6344939	Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b (llama/19194)	2026-02-08 09:29:10 +02:00
Georgi Gerganov	06e3750407	ggml : bump version to 0.9.6 (ggml/1423)	2026-02-08 09:29:10 +02:00
Georgi Gerganov	fc1a3e579e	cmake : remove unused file (ggml/1419)	2026-02-08 09:29:10 +02:00
Georgi Gerganov	acbace0571	cuda : fix compile warnings (#0)	2026-01-30 15:56:40 +02:00
bssrdf	5dca0db99c	add tensor type checking as part of cuda graph properties (llama/19186)	2026-01-30 15:56:40 +02:00
s8322	2a16e7a67f	sycl: implement GGML_UNARY_OP_SOFTPLUS (llama/19114) * sycl: add softplus unary op implementation * sycl: add softplus unary op implementation * docs(ops): mark SYCL SOFTPLUS as supported * docs: update SYCL status for SOFTPLUS	2026-01-30 15:56:40 +02:00
RachelMantel	1b3c27efae	sycl: implement GGML_OP_TRI (llama/19089) * sycl: implement GGML_OP_TRI * docs: update ops.md for SYCL TRI * docs: regenerate ops.md * docs: update SYCL support for GGML_OP_TRI	2026-01-30 15:56:40 +02:00
Zheyuan Chen	829e70044b	ggml-webgpu: improve flastAttention performance by software pipelining (llama/19151) * webgpu : pipeline flash_attn Q/K loads in WGSL * ggml-webgpu: unroll QK accumlation inner loop ggml-webgpu: vectorization * ggml-webgpu: unrolling * ggml-webgpu: remove redundant unrolling * ggml-webgpu: restore the config * ggml-webgpu: remove redundant comments * ggml-webgpu: formatting * ggml-webgpu: formatting and remove vectorization * ggml-webgpu: remove unnecessary constants * ggml-webgpu: change QKV buffer to read_write to pass validation * ggml-webgpu: add explanation for the additional bracket around Q K accumulate * Indentation and for -> if for tail * Kick off CI on wgsl only commits --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-01-30 15:56:40 +02:00
Todor Boinovski	2a89a3f35c	hexagon: enable offloading to Hexagon on Windows on Snapdragon (llama/19150) * hexagon: updates to enable offloading to HTP on WoS * Update windows.md * Update windows.md * hexagon: enable -O3 optimizations * hexagon: move all _WINDOWS conditional compilation to _WIN32 * hexagon: updates to enable offloading to HTP on WoS * hexagon: use run-time vs load-time dynamic linking for cdsp driver interface * refactor htp-drv * hexagon: add run-bench.ps1 script * hexagon: htdrv refactor * hexagon: unify Android and Windows build readmes * hexagon: update README.md * hexagon: refactor htpdrv * hexagon: drv refactor * hexagon: more drv refactor * hexagon: fixes for android builds * hexagon: factor out dl into ggml-backend-dl * hexagon: add run-tool.ps1 script * hexagon: merge htp-utils in htp-drv and remove unused code * wos: no need for getopt_custom.h * wos: add missing CR in htpdrv * hexagon: ndev enforecement applies only to the Android devices * hexagon: add support for generating and signing .cat file * hexagon: add .inf file * hexagon: working auto-signing and improved windows builds * hexagon: futher improve skel build * hexagon: add rough WoS guide * hexagon: updated windows guide * hexagon: improve cmake handling of certs and logging * hexagon: improve windows setup/build doc * hexagon: more windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * Update windows.md * Update windows.md * snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon Also added a power shell script to simplify build env setup. * hexagon: remove trailing whitespace and move cmake requirement to user-presets * hexagon: fix CMakeUserPresets path in workflow yaml * hexagon: introduce local version of libdl.h * hexagon: fix src1 reuse logic gpt-oss needs a bigger lookahead window. The check for src[1] itself being quantized was wrong. --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-01-30 15:56:40 +02:00
Georgi Gerganov	b997e690ef	cuda : fix nkvo, offload and cuda graph node properties matching (llama/19165) * cuda : fix nkvo * cont : more robust cuda graph node property matching * cont : restore pre-leafs implementation * cont : comments + static_assert	2026-01-30 15:56:40 +02:00
yulo	34a3e28a08	HIP: add mmf for CDNA (llama/18896) * refactor mmf rows_per_block * speed up compile * pass cdna compile * fix cuda error * clean up mmf * f32 mmf * clean float mma * fix mmf error * faster mmf * extend tile k * fix compile error * Revert "extend tile k" This reverts commit 4d2ef3d483932659801a59a5af0b6b48f6ffd5c7. * fix smem overflow * speed up compiling mmf * speed up compile for hip * 512 block for cdna * config pad size * fix as comment * update select logic * move some code to cuh * fix as comment * correct cdna3 config --------- Co-authored-by: zhang hui <you@example.com>	2026-01-30 15:56:40 +02:00
Vishal Singh	e0a2182970	ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (llama/19159)	2026-01-30 15:56:40 +02:00
Aman Gupta	62ba8b537f	CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (llama/19126)	2026-01-30 15:56:40 +02:00
Neo Zhang	f0e85bb142	sycl: fix norm kernels: l2_norm, group_norm, rms_norm by remove assert to support more cases (llama/19154) Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-30 15:56:40 +02:00
Ruben Ortlam	33148bb523	Vulkan Flash Attention Coopmat1 Refactor (llama/19075) * vulkan: use coopmat for flash attention pv matrix multiplication fix P loading issue * fix barrier position * remove reduction that is no longer needed * move max thread reduction into loop * remove osh padding * add bounds checks and padding * remove unused code * fix shmem sizes, loop duration and accesses * don't overwrite Qf, add new shared psh buffer instead * add missing bounds checks * use subgroup reductions * optimize * move bounds check, reduce barriers * support other Bc values and other subgroup sizes * remove D_split * replace Of register array with shared memory Ofsh array * parallelize HSV across the rowgroups * go back to Of in registers, not shmem * vectorize sfsh * don't store entire K tile in shmem * fixes * load large k tiles to shmem on Nvidia * adapt shared memory host check function to shader changes * remove Bc 32 case * remove unused variable * fix missing mask reduction tmspsh barrier * fix mask bounds check * fix rowmax f16 under/overflow to inf * fix flash_attn_cm2 BLOCK_SIZE preprocessor directives	2026-01-30 15:56:40 +02:00
Patryk Kaminski	cc0c103b5d	ggml-sycl: remove unused syclcompat header (llama/19140) The syclcompat/math.hpp is not used anymore. The change that intrduced it was successfuly reverted (https://github.com/ggml-org/llama.cpp/pull/17826). This include path will become obsolete and dropped in oneAPI 2026.0 effectively breaking ggml-sycl builds.	2026-01-30 15:56:40 +02:00
Oleksandr Kuvshynov	dda7d9cd1c	vulkan: handle device dedup on MacOS + Vega II Duo cards (llama/19058) Deduplication here relied on the fact that vulkan would return unique UUID for different physical GPUs. It is at the moment not always the case. On Mac Pro 2019 running Mac OS, with 2 Vega II Duo cards (so, 4 GPU total), MotlenVK would assign same UUID to pairs of GPUs, unless they are connected with Infinity Fabric. See more details here: KhronosGroup/MoltenVK#2683. The right way is to fix that in MoltenVK, but until it is fixed, llama.cpp would only recognize 2 of 4 GPUs in such configuration. The deduplication logic here is changed to only filter GPUs if UUID is same but driver is different.	2026-01-30 15:56:40 +02:00
Kevin Pouget	531d7b6781	ggml: new backend for Virglrenderer API Remoting acceleration (v2) (llama/18718)	2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez	3701413a71	ggml-cpu: arm64: Q4_K scale unroll and vectorization (llama/19108)	2026-01-30 15:56:40 +02:00
Georgi Gerganov	7fb0f823de	cuda : fix "V is K view" check for non-unified KV cache (llama/19145)	2026-01-30 15:56:40 +02:00
Georgi Gerganov	f28a733025	CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) (llama/19142)	2026-01-30 15:56:40 +02:00
Nikhil Jain	dfdd2fee83	ggml webgpu: Split shared state (webgpu_context) into global state and per-thread state (llama/18976) * Squashed commit of the following: commit b3c6bf4b0450d8d452b934df27a0fb7cb53cd755 Author: Abhijit Ramesh <abhijitramesh2k@gmail.com> Date: Mon Dec 1 18:29:00 2025 -0800 ggml webgpu: fix xielu parameter passing (llama/11) The XIELU operation was incorrectly using static_cast to convert float parameters to uint32_t, which converted numeric values instead of preserving IEEE 754 bit patterns. This caused incorrect values to be interpreted by the GPU shader. * Use reinterpret_cast to preserve float bit patterns when passing through uint32_t params buffer * Update WGSL shader parameter types from u32 to f32 * Re-enable XIELU support (was disabled due to numerical issues) Fixes NMSE test failures for XIELU operation on WebGPU backend. commit 5ca9b5e49ea7cddc9ab7c8b43a11a9c76a4dff4a Author: neha-ha <137219201+neha-ha@users.noreply.github.com> Date: Tue Nov 18 12:17:00 2025 -0800 Refactored pipelines and workgroup calculations (llama/10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 29 23:13:06 2025 -0700 formatted embed wgsl and ggml-webgpu.cpp commit e1f6baea31645e5d96ad53664acae856f74b96f4 Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 29 23:08:37 2025 -0700 implemented REPL_Template support and removed bug in unary operators kernel commit 8c70b8fece445cdc9a8c660dbddbf201e52da2bb Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 15 16:14:20 2025 -0700 responded and dealt with PR comments commit f9282c660c10dec4487d434549bdb707a9cd9f37 Author: James Contini <jamescontini@gmail.com> Date: Sun Oct 12 13:41:41 2025 -0700 removed unnecesarry checking if node->src[1] exists for unary operators commit 4cf28d7dec41c29186d66152735b244c5699f9dc Author: James Contini <jamescontini@gmail.com> Date: Sun Oct 12 13:32:45 2025 -0700 All operators (inlcluding xielu) working commit 74c6add1761a59d2c2ff60b60e8ad3c8300f6d3e Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 13:16:48 2025 -0700 fixed autoconfig commit 362749910be4f0120c8ffb21ceddeb7d2c088e51 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 13:10:46 2025 -0700 removed vestigial files commit cb0858333785757804c5104e59c4981843207c16 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 12:59:32 2025 -0700 abides by editor-config commit 5360e2852a4b51197d7d67d0a5d42e908b02d7ed Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 12:45:57 2025 -0700 rms_norm double declaration bug atoned commit 7b09baa4aa53711be5a126043670cc182c78bfcd Merge: 8a6ec843 74b8fc17 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 11:50:03 2025 -0700 resolving merge conflicts commit 8a6ec843a50ab82f8cef59b4558eb63f318ba02d Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 8 18:06:47 2025 -0700 unary operators pass ggml tests commit c3ae38278a2db236adc5912c9140e4f0d63f2c19 Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 1 16:22:40 2025 -0700 neg passes backend test commit aa1c9b2f8877a405470ca56709c42a1fd43713de Author: James Contini <jamescontini@gmail.com> Date: Tue Sep 30 23:55:27 2025 -0700 neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though Co-authored-by: James Contini <jamescontini@gmail.com> Co-authored-by: Neha Abbas <neabbas@ucsc.edu> Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com> * Remove extra code and format * Add ops documentation (finally) * ggml webgpu: add SOFTPLUS unary operator Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32 precision for intermediate calculations to prevent f16 overflow. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * Follow Vulkan backend numerical stability pattern * ggml webgpu: add EXPM1 unary operator Implements EXPM1 (exp(x) - 1) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add FLOOR unary operator Implements FLOOR (rounds down to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add CEIL unary operator Implements CEIL (rounds up to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add ROUND unary operator Implements ROUND (rounds to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add TRUNC unary operator Implements TRUNC (truncates towards zero) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS) * Updates to webgpu get_memory * Move shared state (webgpu_context) and device creation out of registration context, device context, and buffer context, and move into backend context * Small cleanup * Move Instance, Device, Adapter, Device creation, and capabilities to global state while moving Queue, pipelines, and buffers to per-thread state. * Cleanups * More cleanup * Move staging_buf mutex to global context * Resolve merge * Resolve merge * Resolve merge * Clean up merge errors, delete forward declaration, and run clang-format * Rename device_init to backend_init * Move webgpu_context to backend_context * Move buffer context members into global context and refactor function calls * Run clang-format * Remove commends * Move parameter buffers to per-thread, add single memset_tensor param buf * Fix CI compilation issue * Fix builds for emscripten not supporting subgroups * cleanup * cleanup --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-01-30 15:56:40 +02:00
Vishal Singh	9c75c793a6	ggml-zendnn : update ZenDNN git tag to main branch (llama/19133)	2026-01-30 15:56:40 +02:00
Johannes Gäßler	9d94d0f782	CUDA: tune GLM 4.7 Flash FA kernel selection logic (llama/19097)	2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez	00885e08e2	ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (llama/18888) * Boilerplate for q6_K repack * q6_K repack to q6_Kx8 implementation Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * q6_K generic gemv and gemm * wip, gemm_q6_K 8x8 * Still WIP: loading of q8s, q6h and q6l * first working version of q6_K gemm * Moved q6 loads outside of sb block, Unrolled inner loop * Replaced modulo with mask * First implementation of GEMV * ggml_vdotq_s32 -> vdotq_s32 * Reduce width of accumulators in q6_K gemv * Bsums instead of calc bias. Preload scales to use vget_lane. Unroll. * Reuse scales in GEMM (same GEMV opt) * Added todos for bsum and different qh repack * Arch fallback * VSLIQ for merging qh adn ql * Removed TODO, already tested * Apply suggestions Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Removed unused import --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-30 15:56:40 +02:00
Gaurav Garg	5fcbbdc0dd	Reduce CPU-side stalls due to the CUDA command buffer being full (llama/19042) * [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline. Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size. * Set the env variable in the CUDA backend registry allocation * Add link to PR in code comment * Remove warning logs and update documentation	2026-01-30 15:56:40 +02:00
shalinib-ibm	b2e2032856	ggml-cpu: Enable FP16 MMA kernels on PPC (llama/19060)	2026-01-30 15:56:40 +02:00
lhez	56f82a9f33	opencl: add flattened q6_K mv (llama/19054)	2026-01-30 15:56:40 +02:00
Johannes Gäßler	41d5d7bb0e	CUDA: fix padding of GQA to power of 2 in FA (llama/19115)	2026-01-30 15:56:40 +02:00
Johannes Gäßler	f63848eada	CUDA: faster FA for GQA > 1 but not power of 2 (llama/19092)	2026-01-30 15:56:40 +02:00
ccbinn	4372b87b8e	metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (llama/19088) Co-authored-by: chenbin11 <chenbin11@kuaishou.com>	2026-01-30 15:56:40 +02:00
Aman Gupta	1642a4fb60	ggml-cpu: Use tiled FA for prompt-processing (llama/19012) * ggml-cpu: Use tiled FA for prompt-processing the FA performance is gimped on CPU on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on a AMD EPYC single-socket 64-c machine. * fix out of bounds for mask * skip rows where there are all masks * skip tile if mask is inf * store mask in worksize * check inf tile earlier	2026-01-30 15:56:40 +02:00
Georgi Gerganov	d2b51404e4	kv-cache : support V-less cache (llama/19067) * kv-cache : support V-less cache * cuda : better check for V_is_K_view * cuda : improve V_is_K_view check * graph : add comments * hparams : refactor	2026-01-30 15:56:40 +02:00
Johannes Gäßler	f53eafd745	CUDA: re-use MLA K data for V in MMA FA (llama/19057)	2026-01-30 15:56:40 +02:00
Aman Gupta	13577a6ce4	ggml-cuda: enable cuda-graphs for `n-cpu-moe` (llama/18934) * ggml-cuda: add split-wise cuda graph * add n-cpu-moe compare_llama_bench.py * fix hip/musa builds	2026-01-30 15:56:40 +02:00
nullname	79f1bb3d35	ggml-hexagon: flash-attn opt (llama/19025) * optimize flash attention kernel by improving score computation and online softmax update * wip * Refactor online softmax update in flash attention kernel for improved performance * Optimize flash attention kernel by replacing float array with HVX_Vector for score computation * wip	2026-01-30 15:56:40 +02:00
Neo Zhang	0d9dda5a99	use malloc to support both iGPU and dGPU in same time (llama/18992) * use malloc to support both iGPU and dGPU in same time * support windows --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez	e090d91f5e	ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (llama/18860) * Boilerplate for q5_Kx8 REPACK on ARM and fallback Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Implements make_block_q5_Kx8 by extending make_block_q4_Kx8 Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * q5_K repack gemm and gemv generics * Gemm and Gemv ARM implementations (i8mm) * Improved qh manipulation looking at non-repack vec_dot implementation * Full unroll * Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments. Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fix wrong fallback definitions of Q5_K Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed comments. Reverted unnecessary formatting Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed typo in generic definitions * Switching AND + Shift with Shift Insert. Better op interleaving. * Vectorize + unroll the block scales * Apply gemm optimizations to gemv * Improve bias calculation --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>	2026-01-30 15:56:40 +02:00
Georgi Gerganov	3f96a1da0e	mla : make the V tensor a view of K (llama/18986) * mla : pass V as a view of K to the FA op * cuda : adjust mla logic to new layout * kv-cache : fix rope shift * tests : remove comment * cuda : fix reusable_cutoff Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-01-30 15:56:40 +02:00
Johannes Gäßler	f21d0cbb1a	CUDA: fix alignment check for FA (llama/19023)	2026-01-30 15:56:40 +02:00
lhez	0e030b852a	opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (llama/18970) * opencl: add `copy_to_contiguous` and utilize mm kernels * opencl: only copy to cont for f32 and f16 tensors * opencl: use cont mm for fallback when dst is large * opencl: use nb local to copy-to-cont * opencl: use local offset as well	2026-01-30 15:56:40 +02:00
Aman Gupta	d4fafcfc6f	CUDA: add gqa_ratio 4 for GLM 4.7 flash (llama/18953)	2026-01-30 15:56:40 +02:00
shaofeiqi	167fec69d5	opencl: add TRI op support (llama/18979)	2026-01-30 15:56:40 +02:00
Aleksei Nikiforov	55927d42ef	ggml-zdnn : mark zDNN buffers as non-host (llama/18967) While buffers reside in host memory, additional transformation is needed to use buffers with zDNN. Fixes #18848	2026-01-30 15:56:40 +02:00
Jeff Bolz	b7e323f40b	vulkan: Remove transfer_ctx, do everything in compute_ctx. (llama/18945) * vulkan: Remove transfer_ctx, do everything in compute_ctx. We had a bug where a set_tensor_async (using transfer_ctx) didn't get submitted before the graph_compute (using compute_ctx) that came after it. To avoid this sort of issue, just do everything in compute_ctx. Remove transfer_cmd_pool, which was already unused. * fix crash with perf logger	2026-01-30 15:56:40 +02:00
Jeff Bolz	b2bc4d810b	vulkan: support flash attention GQA/split_k with small batches (llama/18938)	2026-01-30 15:56:40 +02:00
Masato Nakasaka	3bbf4ced47	Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (llama/18831) This reverts commit 980b7cd17e055c8c587f79ffda7eb4fddf405566.	2026-01-30 15:56:40 +02:00
Jeff Bolz	660d943ff8	vulkan: Use mul_mat_vec_id for small values of n (llama/18918) Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and update the indexing calculations in get_offsets. Mat-vec is faster than mat-mat for small values of n. We don't get the same reuse of the weights as in the non-ID path, but with this the cost is linear in n rather than n>1 being far slower than n==1.	2026-01-30 15:56:40 +02:00
Oliver Simons	924a9e292c	CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (llama/18964) * CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator Strided iterator was added in [CCCL 3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into [CTK 13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5) * Unindent as per code review request	2026-01-30 15:56:40 +02:00
Oliver Simons	fdc83ee3c0	CUDA: Replace init_offsets kernel with iterators in cub-based argsort (llama/18930) * CUDA: Replace `init_offsets` with iterators in argsort This is a QOL improvement, saving us the cost of materializing the iterator * Remove unnecessary include from top-k.cu	2026-01-30 15:56:40 +02:00
Adrien Gallouët	bf71ffa6b3	ggml : cleanup path_str() (llama/18928) - Remove pragmas as `std::codecvt_utf8` is not used. - Avoid implicit `strlen()`. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-30 15:56:40 +02:00
Georgi Gerganov	b0517d6912	metal : enable FA for MLA heads (llama/18950)	2026-01-30 15:56:40 +02:00
Georgi Gerganov	47f3e3b927	ggml : add ggml_build_forward_select (llama/18550) * ggml : add ggml_build_forward_select * cuda : adapt CUDA graph compat to new feature * vulkan : update logic to handle command buffer closing * ggml : check compute for fusion * ggml : add comment	2026-01-30 15:56:40 +02:00
lhez	62a09b106d	opencl: fix q6_K mv for m=1 (llama/18893)	2026-01-30 15:56:40 +02:00
Reese Levine	389dafc7c2	ggml webgpu: support for backend sampling (llama/18880)	2026-01-30 15:56:40 +02:00
Thore Koritzius	511ca7a1f4	ggml : extend ggml_pool_1d + metal (llama/16429) * chore: resolve conflicts * feat: ggml metal impl * fix: ggml_metal_kargs_pool_1d struct * fix: require contiguous input * chore: test pool_1d * chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts * chore: add p0 and s0 to testing * fix: allow padding for cpu and metal * Update ggml/src/ggml-metal/ggml-metal.metal * fix: correct single-threaded loop * ggml : cleanup * tests : add ne[1] != 1 tests * fix: ne[1] handling in np * cont : fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-30 15:56:40 +02:00
Perry Naseck	ecb4b80c35	ggml-blas: hide warnings from included BLAS headers (llama/18818) * fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set * ggml-blas: hide warnings from included BLAS headers	2026-01-30 15:56:40 +02:00
Raul Torres	42960b6073	CANN: Remove unused `ggml_cann_get_device` function (llama/18625)	2026-01-30 15:56:40 +02:00
Chenguang Li	2fceb5a80f	CANN: fix an issue where get_env was not fully renamed (llama/18796) * CANN: fix an issue where get_env was not fully renamed * ci: add cann with acl group * ci: define use_acl_graph using GitHub Action * ci: update cann dockerfile with acl graph	2026-01-30 15:56:40 +02:00
hipudding	854274a297	CANN: support gated linear attn (llama/18653) * CANN: support gated linear attn This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator. The feature was implemented by YushengZhao. Because the previous submission was based on an outdated codebase, this PR was rebased to merge. Co-authored-by: YushengZhao <yusheng.chao@outlook.com> Co-authored-by: hipudding <huafengchun@gmail.com> * CANN: optimize OP gla Optimize gla for high preformance * Remove unused comments --------- Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain> Co-authored-by: YushengZhao <yusheng.chao@outlook.com>	2026-01-30 15:56:40 +02:00
shaofeiqi	ed6004d051	OpenCL: add SOLVE_TRI op support (llama/18846)	2026-01-30 15:56:40 +02:00
Georgi Gerganov	290ff3d28d	cuda : print less debug logs when disabling cuda graphs (llama/18868)	2026-01-30 15:56:40 +02:00
Johannes Gäßler	f2f0ba0384	CUDA: fix allignment on register spill for FA (llama/18815)	2026-01-30 15:56:40 +02:00
shalinib-ibm	78a23d4830	ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (llama/18837)	2026-01-30 15:56:40 +02:00
Max Krasnyansky	50b7ab3d46	hexagon: support for OP_CPY, host buffers now optional (llama/18822)	2026-01-30 15:56:40 +02:00
Oliver Simons	bc09047405	CUDA: Factor out and re-use `block_reduce` function (llama/18785) * CUDA: Refactor and expose two_stage_warp_reduce_* function * Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it Moving smem out of `__device__` function to `__global__` function allows for explicit smem reuse, as either compiler or cuda rt seem to not free it afterwards (`cudaFuncSetAttribute` fails when not accounting for it once for each call to two_stage_warp_reduce) * Update ggml/src/ggml-cuda/common.cuh Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Use two_stage_warp_reduce in group_norm_f32 * Use two_stage_warp_reduce in rms_norm_f32 * Fix smem calculation which expects bytes * Make `two_stage_warp_reduce` accept all values warp_reduce accepts Also integrate it into norm_f32 function * Use two_stage_warp_reduce in l2_norm_f32 * Use type traits for block reduction for better legibility Also adresss other requests by @am17an such as variable renaming * Make norm tests cover all cuda paths * Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK Unit-tests passed locally, let's see if they pass in the CI as well * Use `enum class` for `block_reduce_method` This is more type-safe than plain enum * Rename variables as suggested in code review by @am17an * Rename two_stage_warp_reduce -> block_reduce * Fix trailing whitespace in common.cuh * Make condition of static_assert type-dependent This delays evaluation until the template is actually instantiated. Otherwise, some compilers may evaluate the assert when parsing the template, resulting in build errors as observed here: https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785 * Inline definitions --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2026-01-30 15:56:40 +02:00
Jeff Bolz	4b155e9bfb	vulkan: Check maxStorageBufferRange in supports_op (llama/18709) * vulkan: Check maxStorageBufferRange in supports_op * skip maxStorageBufferRange check when shader64BitIndexing is enabled	2026-01-30 15:56:40 +02:00
Daniel Bevenius	25aeb66a4a	CUDA : fix typo in clang pragma comment [no ci] (llama/18830)	2026-01-30 15:56:40 +02:00
Ruben Ortlam	49762e8fb3	vulkan: work around Intel fp16 bug in mmq (llama/18814)	2026-01-30 15:56:40 +02:00
Perry Naseck	17656e56dc	ggml-metal: do not copy headers for embedded, use current binary dir for embedded (llama/18705)	2026-01-30 15:56:40 +02:00
yulo	c6a495ae5d	HIP: add fattn-mma-f16 for RDNA4 (llama/18481) * finish VQ mma * flash_attn_ext_f16_iter * KQ_rowsum * correct exp * fix scale error * fix softmax scale * fix softmax scale * enable fattn on cpu side * fix random error * disable fattn-mma-f16 on rdna3 * fix wrong col for rdna * use identity mat to transpose * resolve conflicts * basic tuning for DeepSeek-R1-Distill-Qwen-1.5B * fix volta compile error * align rdna4 policy for fattn * adjust fattn policy * adjust kernel selection logic * update as the review comments * keep fattn-wmma logic * adjust kernel selection logic --------- Co-authored-by: zhang hui <you@example.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-01-30 15:56:40 +02:00
Georgi Gerganov	6ee0eaf531	CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (llama/18800)	2026-01-14 09:11:59 +02:00
Jeff Bolz	ab1828dc1c	vulkan: change memory_logger to be controlled by an env var (llama/18769)	2026-01-14 09:11:59 +02:00
Jeff Bolz	aedf332ec5	vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678) This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128. This should work when the number of blocks in the A matrix is less than 2^32 (for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like 2^32*LOAD_VEC_A elements. - Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b. - Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle variants. So far this change just adds a single use case for this, compiling with the e64BitIndexingEXT flag. - Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange. 64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort to avoid enabling it unconditionally.	2026-01-14 09:11:59 +02:00
Ruben Ortlam	716d68aca9	vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (llama/18763) * vulkan: Disable large coopmat matmul configuration on proprietary AMD driver * Also disable the large tile size	2026-01-14 09:11:59 +02:00
Ruben Ortlam	c0433783c3	Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (llama/18749) * vulkan: Enable and optimize large matmul parameter combination for AMD * limit tuning to AMD GPUs with coopmat support * use tx_m values instead of _l	2026-01-14 09:11:59 +02:00
shaofeiqi	d4ce2e554f	opencl: add SOFTPLUS op support (llama/18726)	2026-01-14 09:11:59 +02:00
Johannes Gäßler	3a1ea96373	HIP: adjust RDNA3.5 MMQ kernel selection logic (llama/18666)	2026-01-14 09:11:59 +02:00
Perry Naseck	484b17053a	cmake : update blas logic (llama/18205)	2026-01-14 09:11:59 +02:00
Michael Wand	45be2cd27a	Corrected: changed s13 = src1->nb[3] instead of nb[2] (llama/18724)	2026-01-14 09:11:59 +02:00
shaofeiqi	4af27bf2da	opencl: add EXPM1 op (llama/18704)	2026-01-14 09:11:59 +02:00
Reese Levine	4ac8c3b478	Updates to webgpu get_memory (llama/18707)	2026-01-14 09:11:59 +02:00
Aaron Teo	fff3ebd93d	llama: use host memory if device reports 0 memory (llama/18587)	2026-01-14 09:11:59 +02:00
Masashi Yoshimura	a71127dfd8	ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (llama/18628) * Fix GGML_MEM_ALIGN to 8 for emscripten. * Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten	2026-01-14 09:11:59 +02:00
Reese Levine	1bb903f599	ggml webgpu: initial flashattention implementation (llama/18610) * FlashAttention (llama/13) * Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * neg f16xf32xip builds and runs, haven't actually run a model that uses neg kernel yet though * neg passes backend test * unary operators pass ggml tests * rms_norm double declaration bug atoned * abides by editor-config * removed vestigial files * fixed autoconfig * All operators (including xielu) working * removed unnecessary checking if node->src[1] exists for unary operators * responded and dealt with PR comments * implemented REPL_Template support and removed bug in unary operators kernel * formatted embed wgsl and ggml-webgpu.cpp * Faster tensors (llama/8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings * Wasm (llama/9) * webgpu : fix build on emscripten * more debugging stuff * test-backend-ops: force single thread on wasm * fix single-thread case for init_tensor_uniform * use jspi * add pthread * test: remember to set n_thread for cpu backend * Add buffer label and enable dawn-specific toggles to turn off some checks * Intermediate state * Fast working f16/f32 vec4 * Working float fast mul mat * Clean up naming of mul_mat to match logical model, start work on q mul_mat * Setup for subgroup matrix mat mul * Basic working subgroup matrix * Working subgroup matrix tiling * Handle weirder sg matrix sizes (but still % sg matrix size) * Working start to gemv * working f16 accumulation with shared memory staging * Print out available subgroup matrix configurations * Vectorize dst stores for sg matrix shader * Gemv working scalar * Minor set_rows optimization (llama/4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas 
<nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Working subgroup matrix code for (semi)generic sizes * Remove some comments * Cleanup code * Update dawn version and move to portable subgroup size * Try to fix new dawn release * Update subgroup size comment * Only check for subgroup matrix configs if they are supported * Add toggles for subgroup matrix/f16 support on nvidia+vulkan * Make row/col naming consistent * Refactor shared memory loading * Move sg matrix stores to correct file * Working q4_0 * Formatting * Work with emscripten builds * Fix test-backend-ops emscripten for f16/quantized types * Use emscripten memory64 to support get_memory * Add build flags and try ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * Remove extra whitespace * Move wasm single-thread logic out of test-backend-ops for cpu backend * Disable multiple threads for emscripten single-thread builds in ggml_graph_plan * Refactored pipelines and workgroup calculations (llama/10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on flash attention * Shader structure set up (many bugs still) * debugging * Working first test * Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32 * Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling * Start work on integrating pre-wgsl * Separate structs/initial shader compilation library into separate files * Work on compilation choices for flashattention * Work on subgroup matrix/tile size portability * subgroup size agnostic online softmax * Cleanups, quantization types * more cleanup * fix wasm build * 
Refactor flashattention to increase parallelism, use direct loads for KV in some cases * Checkpoint * formatting * Update to account for default kv cache padding * formatting shader * Add workflow for ggml-ci webgpu * Try passing absolute path to dawn in ggml-ci * Avoid error on device destruction, add todos for proper cleanup * Fix unused warning * Forgot one parameter unused * Move some flashattn computation to f32 for correctness	2026-01-14 09:11:59 +02:00
Jeff Bolz	0bc0e5616e	vulkan: fix push constant size for quantize_q8_1 (llama/18687) I added an assert to catch further mismatches, and it found several. Fix those, too.	2026-01-14 09:11:59 +02:00
Jeff Bolz	678c660e62	vulkan: optimize ssm_scan (llama/18630) * vulkan: optimize ssm_scan * fix warp vs subgroup naming	2026-01-14 09:11:59 +02:00
도로로도로또	f2d8588229	metal : add MoE kernel specialization for ne20=5 (llama/18667) Add template specialization for kernel_mul_mm_id_map0 with ne20=5 to support models using 5 active experts (e.g., VAETKI).	2026-01-14 09:11:59 +02:00
Doctor Shotgun	b9965c89a1	ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (llama/18535) * ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH * makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32 * ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx * cann: forward declaration of device context struct * cann: move offload op check after device context declaration * cuda: fix whitespace Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2026-01-14 09:11:59 +02:00
shaofeiqi	85a329cb08	opencl: add FILL op support (llama/18682)	2026-01-14 09:11:59 +02:00
Oliver Walsh	4f2ca7c163	cuda : fix build on cuda 12.8 (llama/18672) compute121 requires 12.9 Signed-off-by: Oliver Walsh <owalsh@redhat.com>	2026-01-14 09:11:59 +02:00
Jeff Bolz	a91ab72bd9	vulkan: reject ops when a tensor is too large to allocate (llama/18646)	2026-01-14 09:11:59 +02:00
virajwad	096e7e911a	vulkan: Warptile tuning for Intel Xe2/Xe3 (llama/18178) * modify warptile tuning for xe3 * intel vendor check w/ coopmat support * fix back formatting * fix formatting change 2 * move intel check to chip specific tuning part * Change to support both windows and linux * modify m_warptile to l_warptile for intel * modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile) * Code style changes * Code style changes (2) * Code style changes (3)	2026-01-14 09:11:59 +02:00
Eve	a576ed944a	vulkan: more mul mat optimizations (llama/18533) * q4_k * q5_k * q2_k * q4_1 * q5_1 * better buf index	2026-01-14 09:11:59 +02:00
hipudding	5c583f3c02	CANN: Fix rename for get_env (llama/18652) In #18624, get_env in ggml-cann was renamed to get_env_as_lowercase to accurately reflect the function’s behavior and reduce the chance of misuse. However, the update missed renaming call sites in other files. This commit fixes that oversight.	2026-01-14 09:11:59 +02:00
Raul Torres	47671c81db	CANN: Rename `get_env` to `get_env_as_lowercase` (llama/18624)	2026-01-14 09:11:59 +02:00
Max Krasnyansky	a5f51ac75b	Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (llama/18611) * hexagon: improve fp16 matmul and add fp32/fp16 flash-attention * hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx * hexagon: add support for SCALE fp32 * hexagon: replace scalar fp32 -> fp16 copy with HVX * hexagon: optimize flash_atten_ext with aligned VTCM buffers and DMA - Implements double-buffered DMA prefetching for K, V, and Mask tensors. - Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations. - Correctly synchronizes DMA transfers to prevent race conditions. - Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking. * hexagon: use aligned mad_f16 * hexagon: flash_atten more aligned ops * hexagon: optimize scale_f32 hvx helpers * hexagon: unroll fa loops * hexagon: remove unused set-rows log * hexagon: flash_attn_ext add support for DMAing Q - Update `op_flash_attn_ext` to include Q row size in scratchpad allocation. - Pad Q row size to 128 bytes for alignment. - Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`. - Update dot product computations to use VTCM-buffered Q data. * hexagon: fix handling of NANs hvx dotproducts * hexagon: cleanup spad allocation in flash-atten * hexagon: improve fp16/fp32 matmul - Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics. - Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM - Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible. - Implemented fallback logic to the original implementation for complex broadcasting scenarios. * hexagon: fix HVX_ARCH check * hexagon: matmul cleanup and fp16 fixes Use aligned vec_dot_f16 for 2d matmuls and unaligned version for 4d. 
* hexagon: fix fp16 x fp16 matmuls and some minor refactoring * hexagon: add support for GET_ROWS f32 -> f32 Also optimize SET_ROWS threading a bit when we have just a few rows to process. * hexagon: optimize set-rows threading * hexagon: update adb/run-bench.sh to properly support experimental and verbose options * hexagon: flash_atten use aligned vectors for dot products	2026-01-14 09:11:59 +02:00
Aadeshveer Singh	436f30d05f	ggml : optimize cuda ssm_scan using warp-level reduction (llama/18505) * ggml : optimize cuda ssm_scan using warp-level reduction * ggml : apply code review suggestions (style, const, constexpr) * ggml : add TODO regarding stride consistency	2026-01-14 09:11:59 +02:00
Jeff Bolz	dbec71f6cf	vulkan: support buffer_from_host_ptr (llama/18467) * vulkan: support buffer_from_host_ptr * hacky use of buffer_from_host_ptr for directio * disable buffer_from_host_ptr cap * use external memory for ggml_vk_host_malloc, revert model loader changes * disable external_memory_host for MoltenVK * take buffer memory types into account * don't use external_memory_host for ggml_vk_host_malloc	2026-01-14 09:11:59 +02:00
Aman Gupta	575d894603	ggml-cuda: refactor cuda graph usage (llama/18637) * ggml-cuda: refactor cuda graph usage * use is_enabled() instead of enabled	2026-01-14 09:11:59 +02:00
Beinsezii	ed674cfc10	mmq.cu: tune mmq/rocblas switching for RDNA (llama/18537) * Patch perf regression for mmq kernels in ROCm recover performance regression for https://github.com/ggml-org/llama.cpp/issues/17917 * add n_experts branch like the cdna path * mmq.cu: tune mmq/wmma switching for RDNA * mmq.cu: move amd wmma mmq/wmma switching behind IS_RDNA3 * Update ggml/src/ggml-cuda/mmq.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Jiacheng (Jason) Chen <76919340+jiachengjason@users.noreply.github.com> Co-authored-by: jiachengjason <jasonchen.jiacheng@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-01-14 09:11:59 +02:00
Adrien Gallouët	5520f27363	ggml : fix avx512bf16 build (llama/18623) - include `immintrin.h` when required - remove unused m512bh Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-14 09:11:59 +02:00
Raul Torres	9a1a6685ba	CANN: Make `valid_values` variable `static const` (llama/18627)	2026-01-14 09:11:59 +02:00
nwyin	e563e239a7	ggml webgpu: add CEIL operation support (llama/18605) * ggml-webgpu: add CEIL operation support Add support for the CEIL unary operation in the WebGPU backend: - Add CEIL_FUNC shader template in unary_op.wgsl - Add 4 shader variants (f32, f16, inplace versions) - Initialize CEIL pipelines in ggml-webgpu.cpp - Register CEIL in supports_op function * docs: update WebGPU ops support for CEIL	2026-01-14 09:11:59 +02:00
Johannes Gäßler	9956333361	CUDA: fix FA FP16 accumulator overflow for Granite (llama/18614)	2026-01-14 09:11:59 +02:00
Aman Gupta	804f545454	ggml-cuda: check for srcs outside the cgraph (llama/18583) * ggml-cuda: check for srcs outside the cgraph * review: use leafs instead	2026-01-14 09:11:59 +02:00
Jeff Bolz	52ba45e2b8	vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (llama/18582)	2026-01-14 09:11:59 +02:00
Jeff Bolz	0a99b4c377	vulkan: handle quantize_q8_1 overflowing the max workgroup count (llama/18515) * vulkan: handle quantize_q8_1 overflowing the max workgroup count * vulkan: Fix small tile size matmul on lavapipe * fix mul_mat_id failures	2026-01-14 09:11:59 +02:00
Chenguang Li	1d657effe3	CANN: add operator fusion support for ADD + RMS_NORM (llama/17512) This commit implements operator fusion for ADD + RMS_NORM operations in the CANN backend to reduce memory access overhead and improve performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION environment variable (default: false). Changes: - Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm - Add ggml_cann_can_fuse() to check fusion eligibility - Integrate fusion logic into computation graph evaluation - Add test cases for ADD + RMS_NORM fusion - Update documentation with new environment variable The fusion combines ADD and RMS_NORM into a single kernel call, which is more efficient than executing them separately.	2026-01-14 09:11:59 +02:00
Daniel Bevenius	4d6a3fb00d	sampling : add support for backend sampling (llama/17004) * sampling : add support for backend sampling This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend. For example, the backend sampler chain might select/sample a token directly in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers. Currently the backend sampling works in a similar manner to how pooling works: it is a function that is called by build_graph and the sampler operations become part of the model's computation graph. * llama-cli : add backend sampler configuration * server : add backend sampling options/configuration * webui : add backend sampling options * ggml : add initial cumsum implementation for CUDA * sampling : enable all backend sampler tests This commit enables all existing backend sampler tests in the test-backend-sampler. Previously, some tests were disabled because there were missing ggml operation implementations. * graph : do not include llama-model.h * sampling : always expose sampled_ids This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that this lets both common/sampling.cpp and src/llama-sampling.cpp simplify their logic. 
Not all backend samplers that process logits need to set the sampled_tokens_id as they may not change the order of the logits; for example, the temperature sampler only scales the logits but does not change their order. Similarly, the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful. * sampling : ensure at most one output token per seq This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence. * CUDA: Optimize argsort for gpu-based token sampling Argsort is used for top-k currently. We optimize argsort in two ways: 1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs 2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths, it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics). https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview Some perf numbers for an RTX PRO 6000: On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf` Before: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run ``` After: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run ```	2026-01-14 09:11:59 +02:00
Aman Gupta	f0bf5b8cc3	CUDA: disable cuda graph when using n-cpu-moe (llama/18593) * CUDA: disable cuda graph when using n-cpu-moe * call ggml_cuda_set_device	2026-01-14 09:11:59 +02:00
Aman Gupta	88f5765c82	ggml-cuda: remove unused params in ggml_cuda_graph (llama/18579)	2026-01-14 09:11:59 +02:00
Aman Gupta	1e725546b0	ggml-cuda: fixes for concurrent streams (llama/18496)	2026-01-14 09:11:59 +02:00
Johannes Gäßler	60d178cee9	CUDA: only allocate FA tmp buffer if needed (llama/18564)	2026-01-14 09:11:59 +02:00
pl752	304e780e5f	(Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (llama/18559) * CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta) * CUDA: Explicitly casted some of the int alloc counts before multiplication in argsort --------- Co-authored-by: pl752 <maximpl752@gmail.com>	2026-01-14 09:11:59 +02:00
Shouyu	c9e9f083c2	ggml-hexagon: optimize activation function (llama/18393) * refactor: refactor silu * refactor: optimize swiglu * refactor: remove unnecessary if in swiglu * refactor: refactor swiglu_oai * chore: fix formatting issue	2026-01-14 09:11:59 +02:00
Jeff Bolz	9d83865607	vulkan: Optimize GGML_OP_CUMSUM (llama/18417) * vulkan: Optimize GGML_OP_CUMSUM There are two paths: The preexisting one that does a whole row per workgroup in a single shader, and one that splits each row into multiple blocks and does two passes. The first pass computes partials within a block, the second adds the block partials to compute the final result. The multipass shader is used when there are a small number of large rows. In the whole-row shader, handle multiple elements per invocation. * use 2 ELEM_PER_THREAD for AMD/Intel * address feedback	2026-01-14 09:11:59 +02:00
Jeff Bolz	b7ff521e71	vulkan: Implement mmvq for iq1_s/iq1_m (llama/18450)	2026-01-14 09:11:59 +02:00
Georgi Gerganov	b99c911c49	metal : adjust extra size for FA buffer to avoid reallocations (llama/18545)	2026-01-14 09:11:59 +02:00
Chris Rohlf	f328b13d5c	rpc : use unordered_map::reserve and emplace (llama/18513)	2026-01-14 09:11:59 +02:00
MeeMin	fbde389665	cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (llama/18433) * ggml-cuda: fixed assertion in ggml_cuda_cpy (llama/18140) * ggml-cuda: changes in data types to int64_t * ggml-cuda: added asserts for CUDA block numbers * ggml-cuda: changed the condition for y and z dimension	2026-01-14 09:11:59 +02:00
Aman Gupta	f22c1ccbe4	ggml-cuda: remove unnecessary prints on ggml_cuda_init (llama/18502)	2026-01-14 09:11:59 +02:00
Jeff Bolz	b1f65a4a7e	vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (llama/18295) * vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron Also handle GGML_OP_SCALE at the end (nemotron, deepseek2). Fewer pipeline variants and spec constants, just use push constants. In test_topk_moe, change exp_probs_b to be 1D, matching real networks. Update test-backend-ops and ggml-backend to allow verifying multiple outputs in a fusion test (topk_moe has two outputs). Previously only the final node was verified. * change test_topk_moe to allow results in arbitrary order * disable sigmoid fusion for moltenvk	2026-01-14 09:11:59 +02:00
Georgi Gerganov	ce03f8e759	ggml : bump version to 0.9.5 (ggml/1410)	2025-12-31 18:27:20 +02:00
gatbontonpc	8189f2cb65	metal : add count_equal op (llama/18314) * add count equal for metal * remove trailing whitespace * updated doc ops table * changed shmem to i32 * added multi tg and templating * removed BLAS support from Metal docs * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add memset to set dst to 0 * metal : cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-31 17:52:09 +02:00
Johannes Gäßler	2d250f8049	CUDA: fix KQ max calculation (llama/18487)	2025-12-31 17:52:09 +02:00
Georgi Gerganov	5deaf8f2a3	metal : remove BF16 x F16 kernels (llama/18456)	2025-12-31 17:52:09 +02:00
Aman Gupta	467933199a	sycl: add newline at the end of CMakeLists.txt (llama/18503)	2025-12-31 17:52:09 +02:00
Rahul Sathe	a3635494da	Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (llama/18345) * cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x * [AI] sycl: auto-detect and skip incompatible IntelSYCL package Automatically detect compiler versions with incompatible IntelSYCL CMake configuration files and fall back to manual SYCL flags instead of requiring users to set options manually. Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake has SYCL_FEATURE_TEST_EXTRACT invocation errors. * refactor: improve SYCL provider handling and error messages in CMake configuration * refactor: enhance SYCL provider validation and error handling in CMake configuration * ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes	2025-12-31 17:52:09 +02:00
Charles Xu	c9955367d4	kleidiai: add and integrate SVE 256-bit vector-length kernel (llama/18458) * kleidiai: add and integrate SVE 256-bit vector-length kernel * updated for review comments	2025-12-31 17:52:09 +02:00
Aman Gupta	6d4aa96bfa	CUDA: add log line when mxfp4 acceleration is used (llama/18483) * CUDA: add log line when mxfp4 acceleration is used * add in backend_get_features	2025-12-31 17:52:09 +02:00
Johannes Gäßler	5765c5b04e	CUDA: fix replacement of bad archs in CMake (llama/18457)	2025-12-31 17:52:09 +02:00
Johannes Gäßler	d6cb2407b7	CUDA: Blackwell features for non-native builds (llama/18436)	2025-12-31 17:52:09 +02:00
Aman Gupta	e49e88b2d8	cuda: fix race condition in cumsum (llama/18448) * ggml-cuda: fix race condition in cumsum * remove unnecessary sync_threads	2025-12-31 17:52:09 +02:00
uvos	20f5729921	HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (llama/18202)	2025-12-31 17:52:09 +02:00
Aman Gupta	b8d209f55c	Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (llama/18426)	2025-12-31 17:52:09 +02:00
o7si	54fe9a645d	rpc: fix segfault on invalid endpoint format (llama/18387) * rpc: fix segfault on invalid endpoint format * rpc: add error log for failed endpoint connection	2025-12-31 17:52:09 +02:00
Boian Berberov	b3788ef729	cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (llama/18186) * minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h` * cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` - `ivybridge` - `piledriver` - `cannonlake` - `cascadelake` - `cooperlake` - `zen4` Resolves: #17966	2025-12-31 17:52:09 +02:00
QDelta	31fc2c37c8	ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (llama/18413)	2025-12-31 17:52:09 +02:00
lhez	a800a3acd1	opencl: allow resizing transpose buffers (llama/18384) * opencl: allow resizing transpose buffers instead of using fixed sizes * opencl: remove commented code	2025-12-31 17:52:09 +02:00
Aman Gupta	29f8155445	ggml-cuda: Use same regex for GGML_NATIVE=OFF (llama/18407)	2025-12-31 17:52:09 +02:00
Jeff Bolz	015b618d96	vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (llama/18352) Run a preprocess to count how many times each expert is used, and use this to quickly discard workgroups that aren't needed.	2025-12-31 17:52:09 +02:00
Jeff Bolz	e37c8ed94e	vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (llama/18349) * vulkan: Use BK=32 for coopmat2 mul_mat_id * vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader Disable robustness, remove the OOB check in decodeFuncB, and initialize the row_ids to zero to avoid OOB access. Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of zero and remove the '& (BN - 1)'. This allows the compiler to common some of the shared memory loads.	2025-12-31 17:52:09 +02:00
Jeff Bolz	331c6ccd31	vulkan: Use BK=32 for coopmat2 mul_mat_id (llama/18332)	2025-12-31 17:52:09 +02:00
Eve	35cb4abb67	vulkan: small dequantization improvements (llama/18380) * iq4_xs * quants	2025-12-31 17:52:09 +02:00
Jeff Bolz	181e36f194	vulkan: Support UPSCALE w/antialias (llama/18327)	2025-12-31 17:52:09 +02:00
Jeff Bolz	67473fef57	vulkan: handle rope with large number of rows (llama/18306)	2025-12-31 17:52:09 +02:00
0Marble	33f75a88ac	CANN: implement the SSM_CONV operator (llama/17737) * CANN: implement SSM_CONV operator Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com> Co-authored-by: Sujin Kang, <waterjin326@gmail.com> * CANN: remove custom error limit for SSM_CONV * CANN: merge SSM_CONV tensor shape/strides into one line --------- Co-authored-by: Sujin Kang, <waterjin326@gmail.com>	2025-12-31 17:52:09 +02:00
Aman Gupta	51778354ce	ggml-cuda: fix regex for arch list (llama/18371) * ggml-cuda: fix regex for arch list * make regex exact	2025-12-31 17:52:09 +02:00
Aman Gupta	8e02f0919d	cuda: optimize cumsum cub path (llama/18362) * cuda: optimize cumsum cub path * remove heavy perf test	2025-12-31 17:52:09 +02:00
Aman Gupta	ea07c5d3b7	ggml-cuda: fix blackwell native builds (llama/18361) * ggml-cuda: fix blackwell native builds Replace 12x in native architectures by 12xa * replace for GGML_NATIVE=OFF too * only replace for native * remove 120f-virtual for default compilation --------- Co-authored-by: Aman Gupta <aman>	2025-12-31 17:52:09 +02:00
Penglin Cai	5f0488f012	CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (llama/17934) * CONV_TRANSPOSE_1D kernel_size>255 * remove condition check * fix the bug of type conversion * removing trailing whitespaces * fix: return true in the switch case	2025-12-31 17:52:09 +02:00
Aadeshveer Singh	db75fff539	ggml : optimize cuda cumsum fallback kernel (llama/18343)	2025-12-31 17:52:09 +02:00
Aman Gupta	41e578ec8a	CUDA: experimental native mxfp4 support for blackwell (llama/17906) * CUDA: experimental native mxfp4 support for blackwell * optimize load_tiles * optimize quantize_mxfp4 * cleanup * first pass review: formatting * use interleaved layout for mma * mmq: add assert for size * use __nv_fp4x4_e2m1 * use iter_k as 512, cleanup * Use 1200 as blackwell instead of 1000 * address review comments * mmq: fix stride * quantize.cu: use reference impl of e8m0 scale * address review comments * add 120f-virtual + minor fixes --------- Co-authored-by: Aman Gupta <aman>	2025-12-31 17:52:09 +02:00
Jeff Bolz	f863735caa	vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (llama/18302)	2025-12-31 17:52:09 +02:00
Wang Weixuan	bab2c02da5	CANN : refactor ACL graph cache (llama/17752) Move the graph property checking code into methods of LRU cache. Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>	2025-12-31 17:52:09 +02:00
Ruben Ortlam	1356600679	vulkan: use fewer FA rows for small cache runs (llama/18280)	2025-12-31 17:52:09 +02:00
TianHao324	ec9239d3b7	CANN: Uses yarn_ramp cache in ROPE (llama/17725)	2025-12-31 17:52:09 +02:00
Chris Rohlf	9bdd4658f4	rpc : add check for rpc buffer type (llama/18242)	2025-12-31 17:52:09 +02:00
nullname	e4c89612cd	ggml-hexagon: create generalized functions for cpu side op (llama/17500) * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility * refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility * refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity * add comment * refactor: remove redundant buffer checks in hexagon supported operations * wip * add missing include to fix weak symbol warning * add ggml_hexagon_op_generic * refactor: simplify tensor operation initialization and buffer management in hexagon implementation * refactor: streamline hexagon operation initialization and buffer management * refactor: update function signatures and streamline request handling in hexagon operations * wip * ggml-hexagon: clean up code formatting and improve unary operation handling * wip * rename * fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity refactor: remove redundant buffer checks in hexagon supported operations add missing include to fix weak symbol warning add ggml_hexagon_op_generic refactor: simplify tensor operation initialization and buffer management in hexagon implementation refactor: streamline hexagon operation initialization and buffer management refactor: update function signatures and streamline request handling in hexagon operations ggml-hexagon: clean up code formatting and improve unary operation handling fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations # Conflicts: # ggml/src/ggml-hexagon/ggml-hexagon.cpp * hexagon: fix merge 
conflicts * hexagon: minor cleanup for buffer support checks * hexagon: factor out op_desc and the overall op logging * hexagon: further simplify and cleanup op dispatch logic * snapdragon: update adb scripts to use llama-cli and llama-completion * fix pipeline failure --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2025-12-31 17:52:09 +02:00
Shouyu	2f33395197	ggml-hexagon: gelu optimization (llama/18151) * feat: working gelu with src0 put on vtcm * feat: gelu ping-pong for both in and out * fix: fix compile error * break: distinguish dma ddr->vtcm and vtcm->ddr operation * fix: fix dma queue size * break: update dma api to either pop src or dst ptr * fix: fix activation vtcm allocation issue for src1 when swapped * refactor: ping-pong gelu logic to avoid unnecessary if else * dma: improved queue interface and prefetch handling * gelu: fix N+2 block prefetch --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2025-12-31 17:52:09 +02:00
Taimur Ahmad	5b0c1c1580	llamafile: add rvv support for sgemm kernels (llama/18199) Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>	2025-12-31 17:52:09 +02:00
lhez	f2fe1e5baf	opencl: unpack q4_0 for adreno in get_tensor (llama/18278)	2025-12-31 17:52:09 +02:00
Jeff Bolz	dbbe6c11b5	vulkan: Extend rope fusions to allow mrope (llama/18264) Extend the test-backend-ops tests as well.	2025-12-31 17:52:09 +02:00
Jeff Bolz	98e59a43d1	vulkan: Implement set_tensor_async and the event interfaces (llama/18047) The goal is to enable the async loading code paths in llama_model_loader::load_all_data, originally from #7896. This works and the loads themselves are faster, but with host visible vidmem I think the cost of allocating/mapping vidmem moves and becomes more expensive, and I don't see a benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a significant improvement in model loading time.	2025-12-31 17:52:09 +02:00
Johannes Gäßler	b68b12f2d5	llama: fix RPC for -fit on (llama/18233)	2025-12-31 17:52:09 +02:00
Jeff Bolz	b893e0813a	vulkan: fix im2col overflowing maxworkgroupcount (llama/18180)	2025-12-31 17:52:09 +02:00
Jeff Bolz	f407c5e562	vulkan/cuda: fix topk_moe with exp_probs_b (llama/18071) I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn and added coverage for exp_probs_b and some other missing combinations. This exposed a bug in both CUDA and Vulkan backends where they were assuming the input to argsort and the input to get_rows are the same. I'd like to optimize this graph in another change, but for now just get it functional. CUDA also had a bug where it got n_experts from the wrong place, leading to GGML_ASSERT failures in some of the new tests.	2025-12-31 17:52:09 +02:00
Jeff Bolz	ad6ee3865d	vulkan: support GGML_UNARY_OP_XIELU (llama/18062)	2025-12-31 17:52:09 +02:00
Jeff Bolz	3cd141f1a9	vulkan: in graph_optimize, try to group ADD operations (llama/18060) I saw the adds not staying together in the new nemotron 3 nano model.	2025-12-31 17:52:09 +02:00
lovedheart	449fc7c024	Vulkan: some improvement on mul_mat_iq2_xs (llama/18031) * Some improvement on mul_mat_iq2_xs Refactor calculations for db values and grid data to optimize performance and reduce redundancy. * Fix trailing whitespace	2025-12-31 17:52:09 +02:00
Aadeshveer Singh	0983985f06	Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context (llama/18212)	2025-12-31 17:52:09 +02:00
Alfred	17a4cb15b8	ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (llama/17977) * feat: implement real Q8_0 * feat: adding cmake option for configuring FP32 quantize group size * typo: set() shall be used --------- Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>	2025-12-31 17:52:09 +02:00
Jeff Bolz	195d8d0c65	vulkan: Add perf logger mode with concurrency (llama/17944) This implements a variation of the perf logger where rather than timing each operation individually with effectively a barrier in between, we put the timing boundaries where we already synchronize and time the groups of work that normally overlap. This can be useful to help understand whether individual operations need to be optimized, or if the group is already running efficiently. GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when GGML_VK_PERF_LOGGER is also set). GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.	2025-12-31 17:52:09 +02:00
Xuan-Son Nguyen	fea481f412	model : add ASR support for LFM2-Audio-1.5B (conformer) (llama/18106) * ASR with LFM2-Audio-1.5B * Set rope_theta * Fix comment * Remove rope_theta setting * Address PR feedback * rename functions to conformer * remove some redundant ggml_cont * fix missing tensor * add prefix "a." for conv tensors * remove redundant reshape * clean up * add test model --------- Co-authored-by: Tarek Dakhran <tarek@liquid.ai>	2025-12-31 17:52:09 +02:00
Taimur Ahmad	956fac433b	ggml-cpu: extend support for RVV floating-point kernels (llama/17318) * cmake: add BF16 RVV flag for ggml-cpu * ggml-cpu: add floating-point conversion kernels * ggml: add floating-point kernels Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> * ggml-cpu: fix lmul in vec_dot_bf16 * ggml-cpu: change redsum to lmul 4, fix leftover --------- Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>	2025-12-31 17:52:09 +02:00
yulo	325a9b739c	remove i_major_dual (llama/18157) Co-authored-by: zhang hui <you@example.com>	2025-12-31 17:52:09 +02:00
Shouyu	c3a16089e3	ggml-hexagon: swiglu_oai operation (llama/18114) * snapshot: debug ggml-hexagon swiglu-oai * fix: fix hvx_min_scalar_f32 * feat: working swiglu-oai * chore: fix formatting issue	2025-12-31 17:52:09 +02:00
Shouyu	c7ccedb5ba	ggml-hexagon: gelu operation (llama/17921) * feat: initial support for gelu using sigmoid approximation * snapshot: faster gelu using polynomial approximation * test: disable l2-block prefetch in polynomial approximation * Revert "test: disable l2-block prefetch in polynomial approximation" This reverts commit 72339994d45b2bed887e79994403c378d90b62b5. * Revert "snapshot: faster gelu using polynomial approximation" This reverts commit 2a787a61d11f9e63e5943a2e6d134b2f0c402ace. * debug: temporarily disable unnecessary log message for debug purpose * Feat: optimized unaligned sigmoid_f32 * Feat: larger l2prefetch block * feat: apply unaligned-load optimization on mul and mul_scalar * Revert "debug: temporarily disable unnecessary log message for debug purpose" This reverts commit 84f2f23aa9f17e2fa826db969cd825d0ab192995. * refactor: cleanup commented unused code * chore: reformat code with clang-formatter to pass cli test * Revert "chore: reformat code with clang-formatter to pass cli test" This reverts commit 952877ec24732b12010c7fa7ed3fc8de4b74e718. * fix: fix loop overflow * chore: fix formatting ci error	2025-12-31 17:52:09 +02:00
Alberto Cabrera Pérez	1f72f00542	ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) (llama/18096) * wip: skeleton for q8_0 repack * q8_0 repack GEMV implementations * GEMM implementations * Formatting * Fixed format consistency of repack gemm and gemv declarations * gemv and gemm generic location consistent with declarations * Removed non-correct unused variables statements * Cleanup, consistent style * Missing generic fallbacks for x86 and powerpc	2025-12-31 17:52:09 +02:00
yulo	9118c05dc4	HIP: Refactor mma for RDNA and CDNA (llama/17990) * mma.cuh for rdna4 * mma for rdna3 * mmq for rdna4 * mmq for rdna3 * align i-major and j-major * cdna * fix cuda error * add missing tile of mfma * fix j-major wrong ne on CDNA * fix grammar and empty spaces --------- Co-authored-by: zhang hui <you@example.com>	2025-12-31 17:52:09 +02:00
Naco Siren	00108bb713	llama.android : Rewrite Android binding (w/o cpu_features dep) (llama/17413) * UI: implement basic UI components * util: implement performance monitor; wrap it with a viewmodel * util: implement user preferences utility * UI: implement core flow's screens * UI: add a new MainActivity; update manifest * [WIP] DI: implement simple local vm factory provider * UI: disable triggering drawer via gesture; enable alert dialog on back navigation inside conversation and benchmark * UI: allow drawer's gesture control only on Home and Settings screens; enable alert dialog on back navigation inside conversation and benchmark * UI: split a nested parent settings screen into separate child settings screens * UI: polish system prompt setup UI * Deps: bump Kotlin plugin; introduce KSP; apply in :app subproject * DB: setup Room database * data: introduce repo for System Prompt; flow data from Room to VM * bugfix: properly handle user's quitting conversation screen while tokens in generation * UI: rename `ModeSelection` to `ModelLoading` for better clarity * UI: update app name to be more Arm * UI: polish conversation screen * data: code polish * UI: code polish * bugfix: handle user quitting on model loading * UI: locks user in alert dialog when model is unloading * vm: replace token metrics stubs with actual implementation * UI: refactor top app bars * nit: combine temperatureMetrics and useFahrenheit * DI: introduce Hilt plugin + processor + lib dependencies * DI: make app Hilt injectable * DI: make viewmodels Hilt injectable * DI: replace manual DI with Hilt DI * UI: optimize AppContent's composing * bugfix: wait for model to load before navigating to benchmark screen; use NavigationActions instead of raw navController * UI: navigation with more natural animated transitions * DI: Optimize AppModule * Feature: Introduce ModelRepository and ModelsManagementViewModel; update AppModule * UI: polish UI for ModelsManagementScreen; inject ModelsManagementVieModel * 
DI: abstract the protocol of SystemPromptRepository; update AppModule * data: [WIP] prepare for ModelRepository refactor & impl * data: introduce Model entity and DAO; update DI module * UI: replace Models Management screen's stubbing with instrumentation * UI: polish sort order menu * data: import local model with file picker * bugfix: use List instead of Collection for ModelDao's deletion * data: add a util file for extracting file name & size and model metadata * UI: enrich ModelManagementState; extract filename to show correct importing UI * UI: implement multiple models deletion; update Models Management screen * UI: handle back navigation when user is in multi-selection mode * util: extract file size formatting into ModelUtils * UI: add a confirmation step when user picks a file; refactor model import overlay into AlertDialog * UI: extract a shared ModelCard component * UI: replace model selection screen's data stubbing; add empty view * nit: tidy SystemPromptViewModel * Util: split FileUtils from ModelUtils; extract copy methods into FileUtils * data: pass through getModelById from ModelDao into ModelRepository * core: extract conversation and benchmark logics into InferenceManager; add logs and missing state updates in stub InferenceEngine * vm: split mono MainViewModel into separate individual ViewModels * vm: merge SystemPromptViewModel into ModelLoadingViewModel * core: break down InferenceManager due to Interface Segregation Principle * UI: show model card in Model Loading screen * UI: show model card in Conversation screen * UI: unify Model Card components * core: swap in LLamaAndroid and mark stub engine for testing only * data: allow canceling the ongoing model import * UI: update UI ongoing model import's cancellation * LLama: update engine state after handling the cancellation of sendUserPrompt * VM: handle the cancellation of ongoing token generation * LLama: refactor loadModel by splitting the system prompt setting into a separate method * 
feature: check for available space before copying local model * UI: centralize the AppScaffold and modularize its configs * UI: refactor BottomBarConfig.ModelsManagement APIs * UI: combine TopBarConfig and BottomBarConfig into each route's ScaffoldConfig * UI: replace ugly optional as casts in AppScaffold with extension functions * UI: fix the typo `totalGb` in `StorageMetrics` * UI: remove code duplication in sort menu * LLama: add ModelUnloadingState to engine State; add missing state checks in stub engine; fix instrumentation engine's error messages * UI: refactor back handling by removing centralized BackHandlerSetup and UnloadModelConfirmationDialog from AppContent * UI: implement BenchmarkScreen's individual back handling * LLama: add a new Initializing state; ; add two extension properties; rename LibraryLoaded state to Initialized * UI: Introduce an abstract ViewModel to handle additional model unloading logics * UI: expose a single facade ModelUnloadDialogHandler; move UnloadModelState into ModelUnloadingViewModel.kt * UI: migrate ModelLoadingScreen onto ModelLoadingViewModel; update & refine ModelLoadingScreen * UI: migrate ConversationViewModel onto ModelLoadingViewModel; update & refine ConversationScreen * nit: extract app name into a constant value; remove unused onBackPressed callbacks * UI: update AppContent to pass in correct navigation callbacks * nit: polish ModelLoadingScreen UI * core: throw Exception instead of returning null if model fails to load * navigation: sink model loading state management from AppContent down into ModelLoadingScreen; pass ModelLoadingMetrics to Benchmark and Conversation screens * gguf: add GGUF metadata data holder and its corresponding extractor implementation * DB: introduce Kotlin serialization extension's library and plugin; add Room runtime library * GGUF: make GgufMetadata serializable in order to be compatible with Room * nit: refactor data.local package structure * nit: rename lastUsed field to dateLastUsed; 
add dateAdded field * UI: refactor ModelCard UI to show GGUF metadata * UI: update ModelSelectionScreen with a preselect mechanism * UI: polish model card * nit: allow deselect model on Model Selection screen * nit: revert accidental committing of debug code * UI: polish ModelLoading screen * util: extract formatting helper functions from FileUtils into a new FormatUtils * UI: polish model cards on Benchmark and Conversation screens to show model loading metrics * UI: show a Snack bar to warn user that system prompt is not always supported * UI: handle back press on Model Selection screen * UI: finally support theme modes; remove hardcoded color schemes, default to dynamic color scheme implementation * feature: support searching on Model Selection screen * nit: move scaffold related UI components into a separate package * UI: extract InfoView out into a separate file for reusability * data: move Model related actions (query, filter, sort) into ModelInfo file * UI: animate FAB on model preselection states * feature: support filtering in Model Management screen * ui: show empty models info in Model Management screen * ui: add filter off icon to "Clear filters" menu item * [WIP] ui: polish Benchmark screen; implement its bottom app bar * ui: polish Benchmark screen; implement its bottom app bar's rerun and share * nit: disable mode selection's radio buttons when loading model * feature: implement Conversation screen's bottom app bar * pkg: restructure BottomAppBars into separate files in a child package * pkg: restructure TopBarApps into separate files in a child package * pkg: restructure system metrics into a separate file * UI: polish Conversation screen * data: update system prompt presets * UI: allow hide or show model card on Conversation & Benchmark screens; fix message arrangement * data: update & enhance system prompt presets * deps: introduce Retrofit2 * data: implement HuggingFace data model, data source with Retrofit API * data: update Model data 
repository to support fetching HuggingFace models * [WIP] UI: replace the HuggingFace stub in Model Management screen with actual API call * UI: map language codes into country Emojis * ui: add "clear results" action to Benchmark screen * nit: print current pp & tg in llama-bench * UI: disable landscape mode; prevent duplicated benchmark running * llama: migrate C/CXX flags into CMakeList * [WIP] llama: ABI split builds five .so artifacts. However, all .so are performing on SVE level * [WIP] llama: ABI split where five tiers are built sequentially. * [WIP] llama: disable OpenMP in ABI split since most SoCs are big.LITTLE * [WIP] llama: enable KleidiAI and disable tier 4 due to `+sve+sve2` bug caused by `ggml_add_cpu_backend_variant_impl` as explained below ```CMake if (NOT SME_ENABLED MATCHES -1) ... set(PRIVATE_ARCH_FLAGS "-fno-tree-vectorize;${PRIVATE_ARCH_FLAGS}+sve+sve2") ... ``` * core: add Google's cpu_features as a submodule * core: implement cpu_detector native lib * core: swap out hardcoded LlamaAndroid library loading * core: add back OpenMP due to huge perf loss on TG128 * misc: reorg the pkg structure * misc: rename LlamaAndroid related class to InferenceEngine prefixes * [WIP] lib: move GgufMetadata into the lib submodule * lib: expose GgufMetadataReader as interface only * lib: replace the naive & plain SharedPreferences with DataStore implementation * lib: hide the internal implementations, only expose a facade and interfaces * lib: expose Arm features * di: add a stub TierDetection; provide both actual impl and stub in AppModule * UI: add visualizer UI for Arm features * misc: UI polish * lib: refactored InferenceEngineLoader; added a `NONE` Llama Tier * UI: support `NONE` Llama Tier in general settings * lib: optimize engine loader; always perform a fresh detection when cache is null * remote: add HuggingFaceModelDetails data class * remote: refine HuggingFaceModel data class * nit: remove `trendingScore` field from HuggingFace model entities, 
weird... * remote: refactor HuggingFaceApiService; implement download feature in HuggingFaceRemoteDataSource * remote: fix the incorrect parse of HuggingFace's inconsistent & weird JSON response * UI: scaffold Models Management screen and view model * UI: implement a dialog UI to show fetched HuggingFace models. * UI: use a broadcast receiver to listen for download complete events and show local import dialog. * data: handle network exceptions elegantly * pkg: restructure `data`'s packages * data: extract local file info, copy and cleanup logics into LocalFileDataSource * nit: minor UI patch; add missing comments * bugfix: tapping "Home" in navigation drawer should simply close it without any navigation action. * UI: improve autoscroll during token generation * lib: tested on JFrog Artifactory for Maven publishing * UI: show RAM warning if model too large * UI: polish model management screen's error dialog * util: add more items into the mapping table of ISO 639-1 language code to ISO 3166-1 country code * llm: properly propagate error to UI upon failing to load selected model * UI: avoid duplicated calculation of token metrics * lib: read & validate the magic number from the picked source file before executing the import * UI: add "Learn More" hyperlinks to Error dialog upon model import failures * lib: refactor the GgufMetadataReader to take InputStream instead of absolute path as argument * lib: fix the `SIMD` typo in Tier description * core: verify model file path is readable * lib: add UnsupportedArchitectureException for triaged error message * util: split FormatUtils into multiple utils for better readability * UI: change benchmark screen from raw markdown to table view * bugfix: reset preselection upon running the preselected model * misc: linter issue * bugfix: fix the malfunctioning monitoring switch * UI: update Arm features indicator; fix the broken hyperlinks * UI: add quick action buttons to benchmark screen's result card * UI: hide share fab after 
clearing all benchmark results * UI: fix the model unload dialog message; elevate the model card and hide it by default on Conversation screen; * UI: hide the stubbing actions in Conversation screen * UI: add show/hide stats control to conversation screen's assistant message bubble; fix placeholder * UI: add a info button to explain token metrics * misc: remove the redundant `Companion` added due to refactoring * UI: show corresponding system metrics detailed info upon tapping RAM / storage / temperature indicator * UI: add info button to System Prompt switch; expand the model card by default * UI: disable tag & language chips; add section headers to explain what they are * misc: replace top bar indicator's spacer with padding * UI: merge the Model Selection and Model Management into a unified Models screen * UI: split the ModelsManagementViewModel from a unified ModelsViewModel due to huge complexity * UI: add model loading in progress view; polish the empty model info view * UI: polish the bottom bars and info view when no models found; show loading in progress while fetching models * build: [BREAKING] bump the versions of libraries and plugins * UI: fix the breaking build * UI: add Tooltip on Import FAB for user onboarding * UI: adds AppPreferences to track user onboarding status * UI: tracks user's first success on importing a model * data: add hand crafted rules to filter the models fetched from HuggingFace API * UI: update app name & about; polish top bars' indicators & buttons * UI: polish Hugging Face download dialog UI * UX: implement onboarding tooltips for model import and onboarding * misc: use sentence case for CTA button labels * [WIP] UI: add Arm color palette from Philip.Watson3 * UI: address Rojin's UX feedbacks * UI: address Rojin's UX feedbacks - part 2 * UI: update Arm color palette from Philip.Watson3 * data: make sure fetch preselected models in the same order of their IDs * UI: fix UI issues in the generic settings screen and navigation 
drawer * nit: address Rojin's feedbacks on model import message again * nit: append `®` to all `Arm` labels * UI: extract a reusable InfoAlertDialog * core: support GGML_CPU_ALL_VARIANTS on Android! * core: restructure Kleidi-Llama library * core: organizing cmake arguments * data: sort preselected models according to device's available RAM * app: update adaptive + themed + legacy icons and app name * UI: fix the font size auto scaling for ArmFeaturesVisualizer * core: further improve the performance on native methods * UI: minor color palette changes; emphasize the bottom bar FABs; fix Settings Screen menu item label * UI: make more room for assistant message bubble's width * UI: better usage of tertiary colors to highlight model cards but not for warnings * UI: fix the layout issue on large font sizes * lib: support x86-64 by dynamically set Arm related definitions * lib: replace the factory pattern for deprecated tiered lib loading with single instance pattern * llama: update the library name in JNI and CMake project * llama: update the library's package name and namespace * llama: update the app's package name and namespace * app: bump ksp version * app: remove deprecated SystemUIController from accompanist by migrating to EdgeToEdge * app: extract AppContent from MainActivity to a separate file in ui package * lib: add File version for GGUF Magic number verification * lib: perform engine state check inclusively instead of exclusively * lib: change `LlamaTier` to `ArmCpuTier` * lib: remove kleidi-llama related namings * cleanup: remove Arm AI Chat/Playground app source code; replace with the basic sample app from https://github.com/hanyin-arm/Arm-AI-Chat-Sample Note: the full Google Play version of AI Chat app will be open will be open sourced in another repo soon, therefore didn't go through the trouble of pruning the history using `git filter-repo` here. 
* [WIP] doc: update main and Android README docs; add self to code owners * lib: revert System.load back to System.loadLibrary * jni: introduce a logging util to filter different logging levels on different build types * lib: enable app optimization * doc: replace stub Google Play app URL with the actual link add screenshots; add my GitHub ID to maintainer list * Remove cpu_features * Fix linters issues in editorconfig-checker job https://github.com/ggml-org/llama.cpp/actions/runs/19548770247/job/55974800633?pr=17413 * Remove unnecessary Android CMake flag * purge include/cpu_features directory --------- Co-authored-by: Han Yin <han.yin@arm.com>	2025-12-18 08:20:56 +02:00
Aadeshveer Singh	41a95b8ba7	ggml : use WARP_SIZE/2 for argmax reduction offset (llama/18092)	2025-12-18 08:20:56 +02:00
Shouyu	8dd70bdc85	ggml-hexagon: mm for mtmd (llama/17894) * feat: add run_mtmd script for hexagon * fix: fix issue in fp16xfp32 mm * fix: remove opt_experiment for fp16xfp32 mm * fix: ggml-hexagon: matmul fp16xfp32 support non-contiguous src0 * fix: fix syntax check for run-mtmd.sh for cli	2025-12-18 08:20:56 +02:00
Jeremy Demeule	b90ec07aba	metal: use shared buffers on eGPU (llama/17866) * metal: use shared buffers on eGPU With #15906, I noticed an important regression when using the Metal backend on eGPU. This commit restores the previous behavior and adds an option to force its activation. * metal: use shared buffers on eGPU * metal: use shared buffers on eGPU	2025-12-18 08:20:56 +02:00
Johannes Gäßler	aaf3f39b4a	llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (llama/16653) * llama: automatically fit args to free memory llama-fit-params tool * fix CI * hints for bug reports, ensure no reallocation * fix segfault with Vulkan * add llama-fit-params to CI * fix CI * fix CI * fix CI * minor adjustments * fix assignment of 1 dense layer * fix logger not being reset on model load failure * remove --n-gpu-layer hint on model load failure * fix llama-fit-params verbosity * fix edge case * fix typo [no ci]	2025-12-18 08:20:56 +02:00
Neo Zhang Jianyu	b5e352a52f	Support gpt-oss by OPs add-id, mul_mat for mxfp4, swiglu_oai (llama/17826) * support gpt-oss GPU by OP add-id, mul_mat for mxfp4, swiglu_oai, fix warning * fix fault ut case, update ops.md * rebase, fix format issue	2025-12-18 08:20:56 +02:00
Ruben Ortlam	3bb4e1e0ac	vulkan: fix mul_mat_vec_iq1_s formatting (llama/18026)	2025-12-18 08:20:56 +02:00
Jeff Bolz	af2c8cba6f	vulkan: Fix data race/hang in scalar/cm1 flash attention (llama/17887)	2025-12-18 08:20:56 +02:00
lovedheart	7e5df2975e	vulkan: improve mul_mat_vec_iq1_s speed (llama/17874)	2025-12-18 08:20:56 +02:00
Eve	cdadfc3b72	vulkan: faster q6_k matmul (llama/17813) * q6_k faster mul mat * 8 values * fix comment * switch to two at a time * start ci for .glsl files	2025-12-18 08:20:56 +02:00
Georgi Gerganov	b62ef9af7a	ggml : arm repack fix build (llama/0)	2025-12-18 08:20:56 +02:00
Jeff Bolz	b901ebe4a3	vulkan: support get_rows for i32 (llama/17941)	2025-12-18 08:20:56 +02:00
Jeff Bolz	f33446643e	vulkan: support GGML_OP_DIAG (llama/17893)	2025-12-18 08:20:56 +02:00
Jeff Bolz	939d3085e9	vulkan: Multi-pass softmax for large number of cols (llama/17892) When the number of cols is large, split each row across multiple workgroups. There are three phases that communicate partial results through temp buffers: (1) compute max partials (2) take max of partials, compute sum(exp(x-max)) partials (3) sum partials, compute scaled result	2025-12-18 08:20:56 +02:00
Jeff Bolz	13bb296dbf	vulkan: Allow non-pow2 n_experts in topk_moe (llama/17872)	2025-12-18 08:20:56 +02:00
Johannes Gäßler	feb856d4a1	CUDA: fix overflow in MMA kernel without stream-k (llama/17939)	2025-12-18 08:20:56 +02:00
Sigbjørn Skjæret	db1fcd958f	cann : fix ops broken by circular padding guard (llama/17825)	2025-12-18 08:20:56 +02:00
ixgbe	2c782ec325	ggml-cpu : fix RISC-V Q4_0 repack select and RVV feature reporting (llama/17951) * ggml-cpu:fix RISC-V Q4_0 repack select and RVV feature reporting Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> * using the name VLEN instead of CNT * Update ggml/include/ggml-cpu.h --------- Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-18 08:20:56 +02:00
yulo	25d99e9135	HIP: enable mmf for RDNA3 (llama/17879) * enable mmf for RDNA3 * disable mmf for some shape * move some mmvf to mmf * more mmfv to mmf * 3 is good in mmvf --------- Co-authored-by: zhang hui <you@example.com>	2025-12-18 08:20:56 +02:00
Piotr Wilkin (ilintar)	e0af519a61	SOLVE_TRI extension to more dimensions (llama/17793) * Extended TRI * Fix whitespace * chore: update webui build output * Just use cuBLAS for everything... * Merge both versions * Remove incorrect imports causing failures for CI * Still failing... remove all direct cublas imports and rely on common imports from "common.cuh" * Defines for hipBlas * Aaaand MUSA defines... * I hate this job... * Stupid typo... * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-18 08:20:56 +02:00
Georgi Gerganov	f0c9017a2f	ggml : arm repack fix build (#0)	2025-12-13 08:04:09 +02:00
Congcong Cai	324dd21d3c	cmake : set `CMAKE_RUNTIME_OUTPUT_DIRECTORY` for non standalone build (ggml/1394) Some backends, such as the Metal backend, depend on CMAKE_RUNTIME_OUTPUT_DIRECTORY to create temporary files. A missing CMAKE_RUNTIME_OUTPUT_DIRECTORY causes CMake errors such as permission denied (an attempt to copy a file to the root directory). This PR sets a default path for CMAKE_RUNTIME_OUTPUT_DIRECTORY when it is not defined.	2025-12-12 17:53:24 +02:00
Georgi Gerganov	1da1a6865c	ggml-alloc : fix reuse-parent logic for misaligned sizes (llama/17884)	2025-12-12 17:53:24 +02:00
nullname	0c88de5c69	ggml-hexagon: fix `rope` failure at `test-backend-ops` (llama/17565) * fix test failure * fix: correct scaling calculations in rope_cache_init * fix: optimize element copying in rope_hex_f32 using memcpy * fix: optimize loop boundaries in rope_hex_f32 for better performance * feat: add profiling macros for performance measurement in operations	2025-12-12 17:53:24 +02:00
Max Krasnyansky	a2886fba48	Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes (llama/17748) * tests: update barrier test to check for race condition in active threads * cpu: combine n_graph and n_threads into a single atomic update * tests: add multi-graph test for test_barrier	2025-12-12 17:53:24 +02:00
Georgi Gerganov	cd9b8c6d18	ggml : remove GGML_KQ_MASK_PAD constant (llama/17910) * ggml : remove GGML_KQ_MASK_PAD constant * cont : remove comment	2025-12-12 17:53:24 +02:00
Sigbjørn Skjæret	ca8ea18d06	cuda : add missing support check for xielu (llama/17895)	2025-12-12 17:53:23 +02:00
Johannes Gäßler	ea1829134f	CUDA: fix unpadded strides in MMA FA kernel (llama/17891)	2025-12-12 17:53:23 +02:00
Neo Zhang Jianyu	c10b4f9a01	fix softmax for iGPU (llama/17838)	2025-12-12 17:53:23 +02:00
Gabe Goodhart	307dc525bb	metal: SSM kernel improvements (llama/17876) * feat: Add a batched version of ssm_conv This was done using Claude Code. It found a number of optimizations around how the threads were organized, resulting in a huge performance boost! Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Optimized SSM_SCAN kernel for metal This used Claude Code and resulted in a modest performance improvement while maintaining correctness. Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Add test-backend-ops perf tests for SSM_CONV Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Real representative tests for SSM_CONV Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use function constant for ssm_conv batch size Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: backend op tests for ssm_scan from granite4 1b-h Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: remove commented out templates Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: float4 version of ssm_conv_batched Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing ggml_metal_cv_free Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:23 +02:00
Piotr Wilkin (ilintar)	2817582be2	Add DIAG for CUDA (llama/17873) * Add DIAG for CUDA * Refactor parameters	2025-12-12 17:53:23 +02:00
Gabe Goodhart	41bbc034f0	ggml : Provide macos-specific backtrace printing to avoid terminal death (llama/17869) * fix: Provide macos-specific backtrace printing to avoid terminal death Branch: MacOSSafeBacktrace Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add GGML_BACKTRACE_LLDB env var to enable using lldb for backtrace Branch: MacOSSafeBacktrace Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-12-12 17:53:22 +02:00
Georgi Gerganov	b6ae0b29d1	metal : print node names for debugging (llama/17882)	2025-12-12 17:53:22 +02:00
Sigbjørn Skjæret	ba463fb577	ggml : allow fill node alloc inplace (llama/17870)	2025-12-12 17:53:22 +02:00
Chenguang Li	79d86a5c2c	CANN: add support for partial RoPE and Vision mode (llama/17543) * cann: add support for partial RoPE and Vision mode Add support for two important RoPE variants: partial rotation (rope_dims < ne0) and Vision mode rotation. 1. Support for partial RoPE (rope_dims < ne0): - Split tensor into head (first rope_dims dimensions) and tail portions - Apply rotation only to head portion using RotaryPositionEmbedding operator - Copy unrotated tail portion directly from source to destination - Handle both contiguous and non-contiguous tensor layouts 2. Support for Vision mode (GGML_ROPE_TYPE_VISION): - Set rope_dims = ne0 for Vision mode to rotate entire tensor - Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2) - No tail handling needed since entire tensor is rotated Implementation details: - Use has_tail flag to determine execution path: head/tail splitting when rope_dims < ne0, or full tensor rotation when rope_dims == ne0 - Support both F32 and F16 data types with intermediate F32 conversion - Copy non-contiguous tensors to contiguous buffers before calling RotaryPositionEmbedding operator for compatibility - Improve cache invalidation logic to include rope_dims and indep_sects parameters These enhancements enable CANN backend to handle various RoPE configurations used in modern vision-language models and models with partial rotation. * cann: fix review comment	2025-12-12 17:53:22 +02:00
Johannes Gäßler	bef1f5a57e	CUDA: fix FP16 overflow in tile FA kernel (llama/17875)	2025-12-12 17:53:22 +02:00
Jay Zenith	821c2071ab	cuda : add FILL op support (llama/17851) * cuda : add FILL op support * cuda : add missing FILL op files	2025-12-12 17:53:22 +02:00
wsbagnsv1	e1562e85fc	cuda: optimize SOLVE_TRI using registers and FMAF (llama/17703) * ggml-cuda: optimize solve_tri_f32_fast and fix stride handling - Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts. - Implement explicit `fmaf` instructions for the reduction loop. - Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition). - Remove unused `MAX_K_FAST` definition. Small cleanup * Remove comments in solve_tri.cu * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Use const for variables in solve_tri.cu * Replace fmaf with more readable code * remove last fmaf --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-12 17:53:21 +02:00
ixgbe	c8d0ee2f9f	ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support (llama/17784) * ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> * cmake: enable RISC-V zihintpause extension for Spacemit builds * readme : add ZIHINTPAUSE support for RISC-V --------- Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-12-12 17:53:21 +02:00
lovedheart	d6d44fac69	Vulkan: improve mul_mat_vec_iq1_m (llama/16907) * Optimize Vulkan shader for matrix-vector multiplication * Revert changes on compute_outputs and main Refactor compute_outputs to handle remaining rows correctly. * Fix trailing whitespace	2025-12-12 17:53:21 +02:00
Law Po Ying	447ef8633b	sycl: add missing BF16 conversion support for Intel oneAPI (llama/17780) * sycl: add missing BF16 conversion support for Intel oneAPI * Fix Line 645: Trailing whitespace	2025-12-12 17:53:21 +02:00
Jeff Bolz	898f876fe2	vulkan: perf_logger improvements (llama/17672) * vulkan: perf_logger improvements - Move perf_logger from device to ctx. - Add an env var to control the frequency we dump the stats. If you set a very large value, it just dumps when the ctx is destroyed. - Add a fusion info string to the tracking, only log one item per fused op. - Fix MUL_MAT_ID flops calculation. * fix vector sizes	2025-12-12 17:53:21 +02:00
Vishal Singh	ebff8f9db9	ggml-zendnn : add ZenDNN backend for AMD CPUs (llama/17690) * ggml-zennn: add ZenDNN backend support * ggml-zendnn : address ZenDNN backend review fixes and suggestions * docs : apply blockquote syntax to ZenDNN docs --------- Co-authored-by: Manoj Kumar <mkumar@zettabolt.com>	2025-12-12 17:53:21 +02:00
Phylliida Dev	c5e1807071	ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (llama/16985) * Feat: Added vulkan circular tiling support * Feat: Added cpu circular * Feat: Added cuda kernels * Added tests * Added tests * Removed non-pad operations * Removed unneded changes * removed backend non pad tests * Update test-backend-ops.cpp * Fixed comment on pad test * removed trailing whitespace * Removed unneded test in test-backend-ops * Removed removed test from calls * Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp Co-authored-by: Ruben Ortlam <picard12@live.de> * Fixed alignment * Formatting Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Format pad * Format * Clang format * format * format * don't change so much stuff * clang format and update to bool * fix duplicates * don't need to fix the padding * make circular bool * duplicate again * rename vulkan to wrap around * Don't need indent * moved to const expr * removed unneded extra line break * More readable method calls * Minor wording changes * Added final newline * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Added circular pad ext tests * Gate non circular pad devices * Cleaned gating of non-circular pad devices --------- Co-authored-by: Phylliida <phylliidadev@gmail.com> Co-authored-by: Ruben Ortlam <picard12@live.de> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:20 +02:00
Johannes Gäßler	94be71911f	HIP: fix RDNA3 FP16/BF16 matrix multiplication (llama/17817)	2025-12-12 17:53:20 +02:00
Sky	b67e3abdb2	ggml : improve error handling for search path existence checks (llama/17653) * Improve error handling for search path existence checks Refactor existence checks for search paths using std::error_code to handle potential errors. * Improve cache file existence check with error code Update fs::exists to use std::error_code for error handling. * Simplify existence check for search paths Simplify existence check for search paths * Fix logging path in error message for posix_stat * Update ggml/src/ggml-backend-reg.cpp Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Adapt to the coding standard --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2025-12-12 17:53:20 +02:00
Jeff Bolz	c66c71e9f4	vulkan: Use one row per workgroup for f32 mmv (llama/17711) The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs well. I think even for larger m, f32 is so bandwidth-limited that running multiple rows doesn't help.	2025-12-12 17:53:20 +02:00
Jeff Bolz	875d861473	vulkan: support solve_tri with larger N/K values (llama/17781) Split N into chunks to fit into shared memory. If K > 128, use a larger workgroup with enough invocations. Add perf tests matching qwen3next.	2025-12-12 17:53:20 +02:00
Georgi Gerganov	41cf229d72	metal : fix build (#17799) * metal : fix build * tests : fix context destruction	2025-12-12 17:53:20 +02:00
Masato Nakasaka	a8d02735f7	vulkan: Replace deprecated VK_EXT_validation_features (llama/17637) * replaced deprecated VK_EXT_validation_features * forgot to remove old code	2025-12-12 17:53:19 +02:00
Masato Nakasaka	191e5f46a2	vulkan: Fix mismatch in TOPK_MOE unit test (llama/17541) * Fix shader to support 2D workgroup mapping to a single subgroup * Set required_subgroup_size topk_moe shader requires static WARP_SIZE and actual subgroup size to match	2025-12-12 17:53:19 +02:00
Jeff Bolz	64a3f573e0	vulkan: add more num_blocks instantiations in rms_norm (llama/17701)	2025-12-12 17:53:19 +02:00
Jeff Bolz	0484147ab2	vulkan: fix top_k bug when there are ties in the input (llama/17659) * vulkan: Reduce temporary memory usage for TOP_K - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size - Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k" For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB. * vulkan: fix top_k bug when there are ties in the input I noticed by inspection a bug in the vulkan top_k shader where if the least value in the top_k appears multiple times we could end up writing those extra copies out rather than some larger values (if the larger values are on higher numbered threads). I rewrote the test verification to handle this case, where the final index set is not necessarily the same. * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:19 +02:00
Acly	0b53759b29	vulkan : support conv-2d with large output size (llama/17685)	2025-12-12 17:53:19 +02:00
Reese Levine	23984be4da	ggml webgpu: unary op suppport, code refactoring, ops support (llama/17764) * Squashed commit of the following: commit b3c6bf4b0450d8d452b934df27a0fb7cb53cd755 Author: Abhijit Ramesh <abhijitramesh2k@gmail.com> Date: Mon Dec 1 18:29:00 2025 -0800 ggml webgpu: fix xielu parameter passing (llama/11) The XIELU operation was incorrectly using static_cast to convert float parameters to uint32_t, which converted numeric values instead of preserving IEEE 754 bit patterns. This caused incorrect values to be interpreted by the GPU shader. * Use reinterpret_cast to preserve float bit patterns when passing through uint32_t params buffer * Update WGSL shader parameter types from u32 to f32 * Re-enable XIELU support (was disabled due to numerical issues) Fixes NMSE test failures for XIELU operation on WebGPU backend. commit 5ca9b5e49ea7cddc9ab7c8b43a11a9c76a4dff4a Author: neha-ha <137219201+neha-ha@users.noreply.github.com> Date: Tue Nov 18 12:17:00 2025 -0800 Refactored pipelines and workgroup calculations (llama/10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 29 23:13:06 2025 -0700 formatted embed wgsl and ggml-webgpu.cpp commit e1f6baea31645e5d96ad53664acae856f74b96f4 Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 29 23:08:37 2025 -0700 implemented REPL_Template support and removed bug in unary operators kernel commit 8c70b8fece445cdc9a8c660dbddbf201e52da2bb Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 15 16:14:20 2025 -0700 responded and dealt with PR comments commit f9282c660c10dec4487d434549bdb707a9cd9f37 Author: James Contini <jamescontini@gmail.com> Date: Sun Oct 12 13:41:41 2025 -0700 removed unnecesarry checking if node->src[1] 
exists for unary operators commit 4cf28d7dec41c29186d66152735b244c5699f9dc Author: James Contini <jamescontini@gmail.com> Date: Sun Oct 12 13:32:45 2025 -0700 All operators (inlcluding xielu) working commit 74c6add1761a59d2c2ff60b60e8ad3c8300f6d3e Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 13:16:48 2025 -0700 fixed autoconfig commit 362749910be4f0120c8ffb21ceddeb7d2c088e51 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 13:10:46 2025 -0700 removed vestigial files commit cb0858333785757804c5104e59c4981843207c16 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 12:59:32 2025 -0700 abides by editor-config commit 5360e2852a4b51197d7d67d0a5d42e908b02d7ed Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 12:45:57 2025 -0700 rms_norm double declaration bug atoned commit 7b09baa4aa53711be5a126043670cc182c78bfcd Merge: 8a6ec843 74b8fc17 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 11:50:03 2025 -0700 resolving merge conflicts commit 8a6ec843a50ab82f8cef59b4558eb63f318ba02d Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 8 18:06:47 2025 -0700 unary operators pass ggml tests commit c3ae38278a2db236adc5912c9140e4f0d63f2c19 Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 1 16:22:40 2025 -0700 neg passes backend test commit aa1c9b2f8877a405470ca56709c42a1fd43713de Author: James Contini <jamescontini@gmail.com> Date: Tue Sep 30 23:55:27 2025 -0700 neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though Co-authored-by: James Contini <jamescontini@gmail.com> Co-authored-by: Neha Abbas <neabbas@ucsc.edu> Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com> * Remove extra code and format * Add ops documentation (finally) * Update ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: James Contini <jamescontini@gmail.com> Co-authored-by: Neha Abbas 
<neabbas@ucsc.edu> Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-12 17:53:18 +02:00
Jeff Bolz	7e97d3b069	vulkan: enable mmvq for q2_k on NVIDIA (llama/17675)	2025-12-12 17:53:18 +02:00
Jeff Bolz	32ba1ec8e0	vulkan: set all memory allocations to high priority (llama/17624) * vulkan: set all memory allocations to high priority * gate by env var	2025-12-12 17:53:18 +02:00
Georgi Gerganov	aefcd75f4f	rpc : fix alloc size logic (llama/17116) * rpc : fix alloc size logic * rpc : bump version	2025-12-12 17:53:18 +02:00
Georgi Gerganov	322903fa67	metal : add residency sets keep-alive heartbeat (llama/17766) * examples : add idle * metal : attach residency sets to queue * idle : add link * idle : adjust intervals * metal : add residency sets keep-alive heartbeat * cont : adjust default keep-alive time	2025-12-12 17:53:18 +02:00
Johannes Gäßler	4170159dcd	HIP : fix RDNA4 build (llama/17792)	2025-12-12 17:53:18 +02:00
shalinib-ibm	d30b744047	Q4/Q8 Tiled Gemm Optimization. (llama/16999)	2025-12-12 17:53:17 +02:00
Johannes Gäßler	14502d6561	CUDA: fix FA VKQ accumulator overflow (llama/17746)	2025-12-12 17:53:17 +02:00
Jiacheng (Jason) Chen	e3f3c6ead1	HIP: enable WMMA-MMQ INT kernels for RDNA 3 (llama/17576) * enabled wmma instructions for most quantizations other than q2k * fixed the last q2_k test case failure * address comments: fix out of bound write for RDNA4, add comments after #endif * clean up rebase: fix ne error in half2 * fix the EditorConfig CI	2025-12-12 17:53:17 +02:00
Piotr Wilkin (ilintar)	8d44d6181a	Add support for CUMSUM and TRI for CUDA. (llama/17584) * Add support for CUMSUM and TRI for CUDA. * Minor optimizations. * Correct warp_prefix_inclusive_sum in float2 variant to return float2 * Optimize TRI * Whitespace * Fix strides. * Implement double loop * Whitespace * Fix HIP compilation bugs * Optimizations + big case performance tests * Implement using CUB with fallback to custom kernel * Remove error message. * Fixes from code review * Comment out CPU-unsupported F16/BF16 cases to fix CI * Fine, you win :P * Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS * Vary warp-size based on physical warp size * Add GGML_UNUSED_VARS in tri as well * Use constexpr and call prefix_inclusive with warp_size template param * Update ggml/src/ggml-cuda/cumsum.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Change to tid % warp_size * Fix strides; hardcode mask; add ggml_lane_mask_t * Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info() * Too hasty... --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-12 17:53:17 +02:00
Gabe Goodhart	8902c9d976	metal: TRI, FILL, EXPM1, SOFTPLUS (llama/16623) * feat(wip): Port initial TRI impl from pervious work The kernel does not work and is not optimized, but the code compiles and runs, so this will be the starting point now that the core op has been merged. Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove argument for constant val override This was added in the original draft, but later removed. With this, the kernel now passes tests. Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Move the ttype conditional to templating to avoid conditional in kernel Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Type fixes Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * feat: Add softplus for metal Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add EXPM1 for metal Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add FILL for metal Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Branchless version of tri using _ggml_vec_tri_cmp as a mask Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unused arguments Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use select instead of branch for softplus non-vec Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:17 +02:00
Alberto Cabrera Pérez	f96ebc92d2	ggml-cpu : remove asserts always evaluating to false (llama/17728)	2025-12-12 17:53:17 +02:00
Georgi Gerganov	194d016456	metal : use params per pipeline instance (llama/17739)	2025-12-12 17:53:16 +02:00
Adrien Gallouët	92e50155c9	build : move _WIN32_WINNT definition to headers (llama/17736) Previously, cmake was forcing `_WIN32_WINNT=0x0A00` for MinGW builds, This caused "macro redefined" warnings with toolchains that define the version. This also removes the `GGML_WIN_VER` variable as it is no longer needed. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:16 +02:00

2159 Commits