Commit Graph

4210 Commits

Author SHA1 Message Date
Neo Zhang f0e85bb142 sycl: fix norm kernels: l2_norm, group_norm, rms_norm by remove assert to support more cases (llama/19154)
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-30 15:56:40 +02:00
Ruben Ortlam 33148bb523 Vulkan Flash Attention Coopmat1 Refactor (llama/19075)
* vulkan: use coopmat for flash attention p*v matrix multiplication

* fix P loading issue

* fix barrier position

* remove reduction that is no longer needed

* move max thread reduction into loop

* remove osh padding

* add bounds checks and padding

* remove unused code

* fix shmem sizes, loop duration and accesses

* don't overwrite Qf, add new shared psh buffer instead

* add missing bounds checks

* use subgroup reductions

* optimize

* move bounds check, reduce barriers

* support other Bc values and other subgroup sizes

* remove D_split

* replace Of register array with shared memory Ofsh array

* parallelize HSV across the rowgroups

* go back to Of in registers, not shmem

* vectorize sfsh

* don't store entire K tile in shmem

* fixes

* load large k tiles to shmem on Nvidia

* adapt shared memory host check function to shader changes

* remove Bc 32 case

* remove unused variable

* fix missing mask reduction tmspsh barrier

* fix mask bounds check

* fix rowmax f16 under/overflow to inf

* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
2026-01-30 15:56:40 +02:00
Patryk Kaminski cc0c103b5d ggml-sycl: remove unused syclcompat header (llama/19140)
The syclcompat/math.hpp is not used anymore. The change that intrduced it was successfuly reverted (https://github.com/ggml-org/llama.cpp/pull/17826).
This include path will become obsolete and dropped in oneAPI 2026.0 effectively breaking ggml-sycl builds.
2026-01-30 15:56:40 +02:00
Oleksandr Kuvshynov dda7d9cd1c vulkan: handle device dedup on MacOS + Vega II Duo cards (llama/19058)
Deduplication here relied on the fact that vulkan would return unique
UUID for different physical GPUs. It is at the moment not always the case.
On Mac Pro 2019 running Mac OS, with 2 Vega II Duo cards (so, 4 GPU total),
MotlenVK would assign same UUID to pairs of GPUs, unless they
are connected with Infinity Fabric.

See more details here: KhronosGroup/MoltenVK#2683.

The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of 4 GPUs in such configuration.

The deduplication logic here is changed to only filter GPUs if UUID is
same but driver is different.
2026-01-30 15:56:40 +02:00
Kevin Pouget 531d7b6781 ggml: new backend for Virglrenderer API Remoting acceleration (v2) (llama/18718) 2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez 3701413a71 ggml-cpu: arm64: Q4_K scale unroll and vectorization (llama/19108) 2026-01-30 15:56:40 +02:00
Georgi Gerganov 7fb0f823de cuda : fix "V is K view" check for non-unified KV cache (llama/19145) 2026-01-30 15:56:40 +02:00
Georgi Gerganov f28a733025 CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) (llama/19142) 2026-01-30 15:56:40 +02:00
Nikhil Jain dfdd2fee83 ggml webgpu: Split shared state (webgpu_context) into global state and per-thread state (llama/18976)
* Squashed commit of the following:

commit b3c6bf4b0450d8d452b934df27a0fb7cb53cd755
Author: Abhijit Ramesh <abhijitramesh2k@gmail.com>
Date:   Mon Dec 1 18:29:00 2025 -0800

    ggml webgpu: fix xielu parameter passing (llama/11)

    The XIELU operation was incorrectly using static_cast to convert
    float parameters to uint32_t, which converted numeric values instead
    of preserving IEEE 754 bit patterns. This caused incorrect values
    to be interpreted by the GPU shader.

    * Use reinterpret_cast to preserve float bit patterns when passing
      through uint32_t params buffer
    * Update WGSL shader parameter types from u32 to f32
    * Re-enable XIELU support (was disabled due to numerical issues)

    Fixes NMSE test failures for XIELU operation on WebGPU backend.

commit 5ca9b5e49ea7cddc9ab7c8b43a11a9c76a4dff4a
Author: neha-ha <137219201+neha-ha@users.noreply.github.com>
Date:   Tue Nov 18 12:17:00 2025 -0800

    Refactored pipelines and workgroup calculations (llama/10)

    * refactored pipelines

    * refactored workgroup calculation

    * removed commented out block of prior maps

    * Clean up ceiling division pattern

    ---------

    Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
    Co-authored-by: Reese Levine <reeselevine1@gmail.com>

Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:13:06 2025 -0700

    formatted embed wgsl and ggml-webgpu.cpp

commit e1f6baea31645e5d96ad53664acae856f74b96f4
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:08:37 2025 -0700

    implemented REPL_Template support and removed bug in unary operators kernel

commit 8c70b8fece445cdc9a8c660dbddbf201e52da2bb
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 15 16:14:20 2025 -0700

    responded and dealt with PR comments

commit f9282c660c10dec4487d434549bdb707a9cd9f37
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:41:41 2025 -0700

    removed unnecesarry checking if node->src[1] exists for unary operators

commit 4cf28d7dec41c29186d66152735b244c5699f9dc
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:32:45 2025 -0700

    All operators (inlcluding xielu) working

commit 74c6add1761a59d2c2ff60b60e8ad3c8300f6d3e
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:16:48 2025 -0700

    fixed autoconfig

commit 362749910be4f0120c8ffb21ceddeb7d2c088e51
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:10:46 2025 -0700

    removed vestigial files

commit cb0858333785757804c5104e59c4981843207c16
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:59:32 2025 -0700

    abides by editor-config

commit 5360e2852a4b51197d7d67d0a5d42e908b02d7ed
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:45:57 2025 -0700

    rms_norm double declaration bug atoned

commit 7b09baa4aa53711be5a126043670cc182c78bfcd
Merge: 8a6ec843 74b8fc17
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 11:50:03 2025 -0700

    resolving merge conflicts

commit 8a6ec843a50ab82f8cef59b4558eb63f318ba02d
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 8 18:06:47 2025 -0700

    unary operators pass ggml tests

commit c3ae38278a2db236adc5912c9140e4f0d63f2c19
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 1 16:22:40 2025 -0700

    neg passes backend test

commit aa1c9b2f8877a405470ca56709c42a1fd43713de
Author: James Contini <jamescontini@gmail.com>
Date:   Tue Sep 30 23:55:27 2025 -0700

    neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though

Co-authored-by: James Contini <jamescontini@gmail.com>
Co-authored-by: Neha Abbas <neabbas@ucsc.edu>
Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>

* Remove extra code and format

* Add ops documentation (finally)

* ggml webgpu: add SOFTPLUS unary operator

Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern

* ggml webgpu: add EXPM1 unary operator

Implements EXPM1 (exp(x) - 1) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add FLOOR unary operator

Implements FLOOR (rounds down to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add CEIL unary operator

Implements CEIL (rounds up to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add ROUND unary operator

Implements ROUND (rounds to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add TRUNC unary operator

Implements TRUNC (truncates towards zero) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)

* Updates to webgpu get_memory

* Move shared state (webgpu_context) and device creation out of registration context, device context, and buffer context, and move into backend context

* Small cleanup

* Move Instance, Device, Adapter, Device creation, and capabilities to global state while moving Queue, pipelines, and buffers to per-thread state.

* Cleanups

* More cleanup

* Move staging_buf mutex to global context

* Resolve merge

* Resolve merge

* Resolve merge

* Clean up merge errors, delete forward declaration, and run clang-format

* Rename device_init to backend_init

* Move webgpu_context to backend_context

* Move buffer context members into global context and refactor function calls

* Run clang-format

* Remove commends

* Move parameter buffers to per-thread, add single memset_tensor param buf

* Fix CI compilation issue

* Fix builds for emscripten not supporting subgroups

* cleanup

* cleanup

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-01-30 15:56:40 +02:00
Vishal Singh 9c75c793a6 ggml-zendnn : update ZenDNN git tag to main branch (llama/19133) 2026-01-30 15:56:40 +02:00
Johannes Gäßler 9d94d0f782 CUDA: tune GLM 4.7 Flash FA kernel selection logic (llama/19097) 2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez 00885e08e2 ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (llama/18888)
* Boilerplate for q6_K repack

* q6_K repack to q6_Kx8 implementation

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q6_K generic gemv and gemm

* wip, gemm_q6_K 8x8

* Still WIP: loading of q8s, q6h and q6l

* first working version of q6_K gemm

* Moved q6 loads outside of sb block, Unrolled inner loop

* Replaced modulo with mask

* First implementation of GEMV

* ggml_vdotq_s32 -> vdotq_s32

* Reduce width of accumulators in q6_K gemv

* Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.

* Reuse scales in GEMM (same GEMV opt)

* Added todos for bsum and different qh repack

* Arch fallback

* VSLIQ for merging qh adn ql

* Removed TODO, already tested

* Apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Removed unused import

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-30 15:56:40 +02:00
Gaurav Garg 5fcbbdc0dd Reduce CPU-side stalls due to the CUDA command buffer being full (llama/19042)
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full

With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.

* Set the env variable in the CUDA backend registry allocation

* Add link to PR in code comment

* Remove warning logs and update documentation
2026-01-30 15:56:40 +02:00
shalinib-ibm b2e2032856 ggml-cpu: Enable FP16 MMA kernels on PPC (llama/19060) 2026-01-30 15:56:40 +02:00
lhez 56f82a9f33 opencl: add flattened q6_K mv (llama/19054) 2026-01-30 15:56:40 +02:00
Johannes Gäßler 41d5d7bb0e CUDA: fix padding of GQA to power of 2 in FA (llama/19115) 2026-01-30 15:56:40 +02:00
Johannes Gäßler f63848eada CUDA: faster FA for GQA > 1 but not power of 2 (llama/19092) 2026-01-30 15:56:40 +02:00
ccbinn 4372b87b8e metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (llama/19088)
Co-authored-by: chenbin11 <chenbin11@kuaishou.com>
2026-01-30 15:56:40 +02:00
Aman Gupta 1642a4fb60 ggml-cpu: Use tiled FA for prompt-processing (llama/19012)
* ggml-cpu: Use tiled FA for prompt-processing

the FA performance is gimped on CPU on long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes done on a AMD EPYC single-socket 64-c machine.

* fix out of bounds for mask

* skip rows where there are all masks

* skip tile if mask is inf

* store mask in worksize

* check inf tile earlier
2026-01-30 15:56:40 +02:00
Georgi Gerganov d2b51404e4 kv-cache : support V-less cache (llama/19067)
* kv-cache : support V-less cache

* cuda : better check for V_is_K_view

* cuda : improve V_is_K_view check

* graph : add comments

* hparams : refactor
2026-01-30 15:56:40 +02:00
Johannes Gäßler f53eafd745 CUDA: re-use MLA K data for V in MMA FA (llama/19057) 2026-01-30 15:56:40 +02:00
Aman Gupta 13577a6ce4 ggml-cuda: enable cuda-graphs for `n-cpu-moe` (llama/18934)
* ggml-cuda: add split-wise cuda graph

* add n-cpu-moe compare_llama_bench.py

* fix hip/musa builds
2026-01-30 15:56:40 +02:00
nullname 79f1bb3d35 ggml-hexagon: flash-attn opt (llama/19025)
* optimize flash attention kernel by improving score computation and online softmax update

* wip

* Refactor online softmax update in flash attention kernel for improved performance

* Optimize flash attention kernel by replacing float array with HVX_Vector for score computation

* wip
2026-01-30 15:56:40 +02:00
Neo Zhang 0d9dda5a99 use malloc to support both iGPU and dGPU in same time (llama/18992)
* use malloc to support both iGPU and dGPU in same time

* support windows

---------

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez e090d91f5e ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (llama/18860)
* Boilerplate for q5_Kx8 REPACK on ARM and fallback

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q5_K repack gemm and gemv generics

* Gemm and Gemv ARM implementations (i8mm)

* Improved qh manipulation looking at non-repack vec_dot implementation

* Full unroll

* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix wrong fallback definitions of Q5_K

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed comments. Reverted unnecessary formatting

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed typo in generic definitions

* Switching AND + Shift with Shift Insert. Better op interleaving.

* Vectorize + unroll the block scales

* Apply gemm optimizations to gemv

* Improve bias calculation

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
2026-01-30 15:56:40 +02:00
Georgi Gerganov 3f96a1da0e mla : make the V tensor a view of K (llama/18986)
* mla : pass V as a view of K to the FA op

* cuda : adjust mla logic to new layout

* kv-cache : fix rope shift

* tests : remove comment

* cuda : fix reusable_cutoff

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-30 15:56:40 +02:00
Johannes Gäßler f21d0cbb1a CUDA: fix alignment check for FA (llama/19023) 2026-01-30 15:56:40 +02:00
lhez 0e030b852a opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (llama/18970)
* opencl: add `copy_to_contiguous` and utilize mm kernels

* opencl: only copy to cont for f32 and f16 tensors

* opencl: use cont mm for fallback when dst is large

* opencl: use nb local to copy-to-cont

* opencl: use local offset as well
2026-01-30 15:56:40 +02:00
Aman Gupta d4fafcfc6f CUDA: add gqa_ratio 4 for GLM 4.7 flash (llama/18953) 2026-01-30 15:56:40 +02:00
shaofeiqi 167fec69d5 opencl: add TRI op support (llama/18979) 2026-01-30 15:56:40 +02:00
Aleksei Nikiforov 55927d42ef ggml-zdnn : mark zDNN buffers as non-host (llama/18967)
While buffers reside in host memory,
additional transformation is needed to use buffers with zDNN.

Fixes #18848
2026-01-30 15:56:40 +02:00
Jeff Bolz b7e323f40b vulkan: Remove transfer_ctx, do everything in compute_ctx. (llama/18945)
* vulkan: Remove transfer_ctx, do everything in compute_ctx.

We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.

Remove transfer_cmd_pool, which was already unused.

* fix crash with perf logger
2026-01-30 15:56:40 +02:00
Jeff Bolz b2bc4d810b vulkan: support flash attention GQA/split_k with small batches (llama/18938) 2026-01-30 15:56:40 +02:00
Masato Nakasaka 3bbf4ced47 Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (llama/18831)
This reverts commit 980b7cd17e055c8c587f79ffda7eb4fddf405566.
2026-01-30 15:56:40 +02:00
Jeff Bolz 660d943ff8 vulkan: Use mul_mat_vec_id for small values of n (llama/18918)
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
2026-01-30 15:56:40 +02:00
Oliver Simons 924a9e292c CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (llama/18964)
* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator

Strided iterator was added in [CCCL
3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into
[CTK
13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5)

* Unindent as per code review request
2026-01-30 15:56:40 +02:00
Oliver Simons fdc83ee3c0 CUDA: Replace init_offsets kernel with iterators in cub-based argsort (llama/18930)
* CUDA: Replace `init_offsets` with iterators in argsort

This is a QOL improvement, saving us the cost of materializing the
iterator

* Remove unnecessary include from top-k.cu
2026-01-30 15:56:40 +02:00
Adrien Gallouët bf71ffa6b3 ggml : cleanup path_str() (llama/18928)
- Remove pragmas as `std::codecvt_utf8` is not used.
- Avoid implicit `strlen()`.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-30 15:56:40 +02:00
Georgi Gerganov b0517d6912 metal : enable FA for MLA heads (llama/18950) 2026-01-30 15:56:40 +02:00
Georgi Gerganov 47f3e3b927 ggml : add ggml_build_forward_select (llama/18550)
* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment
2026-01-30 15:56:40 +02:00
lhez 62a09b106d opencl: fix q6_K mv for m=1 (llama/18893) 2026-01-30 15:56:40 +02:00
Reese Levine 389dafc7c2 ggml webgpu: support for backend sampling (llama/18880) 2026-01-30 15:56:40 +02:00
Thore Koritzius 511ca7a1f4 ggml : extend ggml_pool_1d + metal (llama/16429)
* chore: resolve conflicts

* feat: ggml metal impl

* fix: ggml_metal_kargs_pool_1d struct

* fix: require contiguous input

* chore: test pool_1d

* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts

* chore: add p0 and s0 to testing

* fix: allow padding for cpu and metal

* Update ggml/src/ggml-metal/ggml-metal.metal

* fix: correct single-threaded loop

* ggml : cleanup

* tests : add ne[1] != 1 tests

* fix: ne[1] handling in np

* cont : fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-30 15:56:40 +02:00
Perry Naseck ecb4b80c35 ggml-blas: hide warnings from included BLAS headers (llama/18818)
* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set

* ggml-blas: hide warnings from included BLAS headers
2026-01-30 15:56:40 +02:00
Raul Torres 42960b6073 CANN: Remove unused `ggml_cann_get_device` function (llama/18625) 2026-01-30 15:56:40 +02:00
Chenguang Li 2fceb5a80f CANN: fix an issue where get_env was not fully renamed (llama/18796)
* CANN: fix an issue where get_env was not fully renamed

* ci: add cann with acl group

* ci: define use_acl_graph using GitHub Action

* ci: update cann dockerfile with acl graph
2026-01-30 15:56:40 +02:00
hipudding 854274a297 CANN: support gated linear attn (llama/18653)
* CANN: support gated linear attn

This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased to
merge.

Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: optimize OP gla

Optimize gla for high preformance

* Remove unused comments

---------

Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain>
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
2026-01-30 15:56:40 +02:00
shaofeiqi ed6004d051 OpenCL: add SOLVE_TRI op support (llama/18846) 2026-01-30 15:56:40 +02:00
Georgi Gerganov 290ff3d28d cuda : print less debug logs when disabling cuda graphs (llama/18868) 2026-01-30 15:56:40 +02:00
Johannes Gäßler f2f0ba0384 CUDA: fix allignment on register spill for FA (llama/18815) 2026-01-30 15:56:40 +02:00