Commit Graph

2159 Commits

Author SHA1 Message Date
Jeff Bolz 37d4bba152 vulkan: change graph_compute to be async and enable get_tensor_async (llama/17158)
* vulkan: change graph_compute to be async and enable get_tensor_async

This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.

Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.

The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.

The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.

* fix thread safety errors

* teardown context cleanly

* Handle async read to non-pinned dst
2025-11-17 21:05:46 +02:00
Georgi Gerganov 523a6c27ea metal : support argsort for ne00 > 1024 (llama/17247)
* metal : refactor argsort

* cont : sort chunks

* cont : merge sorted buckets

* cont : cleanup
2025-11-17 21:05:46 +02:00
Georgi Gerganov b4d7df3ba2 metal : make the FA extra sizes consistent (llama/17143) 2025-11-17 21:05:46 +02:00
Alberto Cabrera Pérez a81fbfc78e ggml-cpu: handle 3d tensors in repack mat_mul (llama/17241)
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries

* Address performance regression in Qwen and llama.cpp due to chunking
2025-11-17 21:05:46 +02:00
Piotr Wilkin (ilintar) 3e684f26c1 ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (llama/17063)
* Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Code review

* Whitespace

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* This is actually sigmoid, duh.

* Add CONST, remove TRI_KEEP, other changes from review

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Remove extra script

* Update ggml/src/ggml.c

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* moving changes from laptop [no ci]

* pre-rebase

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Refactor tests

* ggml : cleanup

* cont : fix ggml_fill srcs

* tests : add note

* ggml : add ggml_fill_inplace

* ggml : add asserts

* ggml : fix ggml_fill constant cast

* cont : ggml_tri minor

* Use TENSOR_LOCALS

* Fix regression from #14596, regenerate

* Don't make commits at night...

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-17 21:05:46 +02:00
Ruben Ortlam e8e0004fe5 vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (llama/17219)
* vulkan: remove shell call from vulkan-shaders-gen tool

* use string vector for command execution

* Fix condition

* use string, remove const_cast

* Fix dependency file quotation on Windows

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-11-17 21:05:46 +02:00
Diego Devesa 210f0f860b sched : fix reserve ignoring user tensor assignments (llama/17232) 2025-11-17 21:05:46 +02:00
ixgbe 91fa5b5cac ggml-cpu : add RISC-V vector intrinsic support for silu and cvar operations (llama/17227)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-17 21:05:46 +02:00
bagheera 265d326fa8 metal: accelerated conv2d (llama/17175)
* metal: accelerated conv2d

* cont : cleanup

---------

Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-17 21:05:46 +02:00
Georgi Gerganov 6a1d830dfd Revert "ggml-cpu: handle 3d tensors in repack mat_mul (llama/17030)" (llama/17233)
This reverts commit 1c398dc9eca9c366ce98deb0e6f3538e444ebc8a.
2025-11-17 21:05:46 +02:00
Diego Devesa 6a91780c3b ggml-cpu : use template for argsort (llama/17222) 2025-11-17 21:05:46 +02:00
TecJesh 726912d1cb CANN: Add cross_entropy_loss op support (llama/16886)
* update L2_NORM op support

* update L2_NORM op support

* remove extra whitespace

* cann: update cross_entropy_loss op support

* remove trailing whitespaces

* rebase the latest code in the main repository and remove the l2_norm operator that already exists in another pull request.

* undo the l2_norm operator deletion
2025-11-17 21:05:46 +02:00
Aman Gupta 84275fc493 CUDA: fuse rope + set_rows (llama/16884)
* CUDA: add fused rope

* move k forward_expand up

* create helper function instead of re-using params

* make assert statement more in line with comment

* rope_norm: coalesced writes to global mem
2025-11-17 21:05:46 +02:00
Johannes Gäßler 566c4c4469 CUDA: static assert to prevent misuse of memcpy_1 (llama/17198) 2025-11-17 21:05:46 +02:00
Georgi Gerganov 3810a6180b ggml : use std::sort in ggml_argsort CPU implementation (llama/17211)
* ggml : use std::sort in ggml_argsort CPU implementation

* cont : add missing header
2025-11-17 21:05:46 +02:00
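Note on 3810a6180b: the idea of building an argsort on top of std::sort can be sketched as sorting an index array with a comparator that looks up the underlying values. This is only an illustration; the function name `argsort_row` and the `ascending` flag are hypothetical and not taken from the ggml sources.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not the ggml implementation): returns the indices
// that would sort `values` in ascending or descending order.
std::vector<int32_t> argsort_row(const std::vector<float> & values, bool ascending = true) {
    std::vector<int32_t> idx(values.size());
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...

    // Sort the indices, comparing the values they point at.
    std::sort(idx.begin(), idx.end(), [&](int32_t a, int32_t b) {
        return ascending ? values[a] < values[b] : values[a] > values[b];
    });
    return idx;
}
```

std::sort is not stable, so equal values may come out in any relative order; std::stable_sort would be the drop-in alternative if tie order matters.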
Alberto Cabrera Pérez 7df8515824 ggml-cpu: handle 3d tensors in repack mat_mul (llama/17030)
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries
2025-11-17 21:05:46 +02:00
TecJesh e8b66d9f94 CANN: Add L2_NORM op support (llama/16856)
* update L2_NORM op support

* update L2_NORM op support

* remove extra whitespace
2025-11-17 21:05:46 +02:00
Neo Zhang Jianyu 8388350c66 fix ci crash about SSM_CONV (llama/17169)
* fix ci crash

* Update ggml-sycl.cpp

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-17 21:05:46 +02:00
Max Krasnyansky 6748d27f55 hexagon: various Op fixes (llama/17135)
* hexagon: explicitly check for ops with zero nrows

llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows.
Somehow other backends seem to handle this without obvious explicit checks.
In the hexagon case we need to check explicitly and skip them.

* hexagon: introduce fastdiv, fix test-backend-ops for ADD/SUB/MUL

Co-authored-by: chraac <chraac@gmail.com>

* hexagon: use fastdiv in ADD_ID

* hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs

---------

Co-authored-by: chraac <chraac@gmail.com>
2025-11-17 21:05:46 +02:00
Eve 559091005a disable rms norm mul rope for chips with no fp16 rte (llama/17134) 2025-11-17 21:05:46 +02:00
ixgbe cd8f64d1b5 ggml-cpu : add RISC-V RVV (Zvfh) optimization for FP16 to FP32 conversion (llama/17161)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-17 21:05:46 +02:00
duduta 1cefb03571 ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (llama/16805)
* extract rotate_pairs logic from ggml_compute_forward_rope_f32

* templateify ggml_compute_forward_rope_f32 and _f16

* abort when rope type not supported, remove GLM from test-rope

* add imrope branch to switch

* add rope tests for perf

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-17 21:05:46 +02:00
Charles Xu 3920ecce3a kleidiai: add optimized per-channel kernels for Q8_0 (llama/16993) 2025-11-17 21:05:46 +02:00
Mike Abbott c01bf73dd1 cmake : add version to all shared object files (llama/17091)
When compiling llama.cpp in Yocto, it fails QA checks because the generated so files aren't versioned.  This applies a version to all generated so files, allowing the package to build without errors.
2025-11-17 21:05:46 +02:00
lhez 46615d74d3 opencl: add fastdiv and use it in set_rows, ported from cuda (llama/17090)
* opencl: add fastdiv for mm q8_0

* opencl: use uint4 for fastdiv vals

* opencl: use fastdiv for set_rows

* opencl: do not use fastdiv for q8_0 mm
2025-11-17 21:05:46 +02:00
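Note on 46615d74d3: "fastdiv" generally refers to replacing division by a fixed divisor with a multiply-high, an add and a shift, using constants precomputed on the host. The sketch below shows that general technique in portable C++; the struct and function names are hypothetical, and the actual OpenCL/CUDA kernels may lay out the constants differently.

```cpp
#include <cstdint>

// Precomputed constants for dividing by a fixed 32-bit divisor d (d >= 1).
struct FastDiv {
    uint32_t mp;  // magic multiplier
    uint32_t L;   // shift amount, ceil(log2(d))
};

// Hypothetical host-side setup, done once per divisor.
inline FastDiv fastdiv_init(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        ++L;
    }
    // mp = floor(2^32 * (2^L - d) / d) + 1, which always fits in 32 bits.
    const uint64_t mp = ((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1;
    return { static_cast<uint32_t>(mp), L };
}

// n / d computed as one multiply-high, one add and one shift.
inline uint32_t fastdiv(uint32_t n, FastDiv f) {
    const uint64_t hi = (static_cast<uint64_t>(n) * f.mp) >> 32;  // high 32 bits of n * mp
    return static_cast<uint32_t>((hi + n) >> f.L);                 // 64-bit sum cannot overflow
}
```

On a GPU the sum is typically kept in 32 bits, which is only safe when the operands are known to be small enough; this sketch uses a 64-bit sum instead so it holds for every 32-bit input.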
Max Krasnyansky ccf525baf0 cpu: skip NOPs to avoid barriers (llama/17133)
* cpu: skip NOPs to avoid barriers

* cpu: use ggml_op_is_empty
2025-11-17 21:05:46 +02:00
Georgi Gerganov 40aebfe8bf metal : cap threadgroups size of set_rows (llama/17146) 2025-11-17 21:05:46 +02:00
Adrien Gallouët 86be60093e ggml-cpu : inspect -march and -mcpu to find the CPU (llama/16333)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-17 21:05:46 +02:00
Ruben Ortlam ef71d83b76 vulkan: check glslc executable string (llama/17144) 2025-11-17 21:05:46 +02:00
Ruben Ortlam 43f2c1ff54 vulkan: fix validation issue introduced by #16868 (llama/17145) 2025-11-17 21:05:46 +02:00
Georgi Gerganov bb92c79f56 metal : enable tensor API for A19 (llama/17087) 2025-11-17 21:05:46 +02:00
fj-y-saito 4fea91f06e arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… (#15277)
* add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_q8_K

* Surround SVE function with compiler directive

* fix compile switch

* fix coding style

* ggml : fix indent

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-17 21:05:46 +02:00
Acly 58a97d988f cuda/vulkan : bicubic interpolation (llama/17022)
* vulkan : implement upscale with bicubic interpolation

* cuda : implement upscale with bicubic interpolation

* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests

* adapt OpenCL backend to not support the OP in that case so tests don't fail

* print scale mode & flags in test-backend-ops
2025-11-17 21:05:46 +02:00
Ruben Ortlam 2e04e7a906 vulkan: fix memory allocations (llama/17122) 2025-11-17 21:05:46 +02:00
Ruben Ortlam 1993e397bb vulkan: iGPU memory reporting fix (llama/17110)
* vulkan: use all device-local heaps for memory availability reporting

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>

* use all available heaps for iGPU memory reporting

* Allow multiple memory types per buffer request for devices with split heaps

---------

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-09 23:38:03 +02:00
Ruben Ortlam ee8349cf10 vulkan: fix mmq out of bounds reads (llama/17108)
* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code

* fix mul_mat_id quantization call

* Fix compiler warnings
2025-11-09 23:38:03 +02:00
Jeff Bolz db98e8c5b4 vulkan: fuse mul_mat_id + mul (llama/17095)
* vulkan: fuse mul_mat_id + mul

This comes up in qwen3 moe.

* split mul_mat_id fusion tests into a separate class
2025-11-09 23:38:03 +02:00
Georgi Gerganov a4339e2ea7 metal : retain src and dst buffers during async ops (llama/17101) 2025-11-09 23:38:03 +02:00
Jeff Bolz 6de3404773 vulkan: Use spec constants for conv2d s/d/p and kernel W/H (llama/16978)
* vulkan: Use spec constants for conv2d s/d/p and kernel W/H

Also add some additional unroll hints, which seems to help.

* lock around map lookup
2025-11-09 23:38:03 +02:00
Aman Gupta 8967c9ad9b Revert "CUDA: add expert reduce kernel (ggml/16857)" (llama/17100) 2025-11-09 23:38:03 +02:00
Aman Gupta 522b9bce33 CUDA: skip fusion for repeating adds in bias (llama/17080) 2025-11-09 23:38:03 +02:00
SavicStefan 0caa32c772 vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (llama/16636)
Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>
2025-11-09 23:38:03 +02:00
Aleksei Nikiforov 3c975ad523 ggml: disable vxe for cross-compilation by default (llama/16966)
Otherwise compilation will fail due to enabling -mvx -mzvector
and not setting corresponding -march options.
2025-11-09 23:38:03 +02:00
Jeff Bolz 257ce2f5c0 vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (llama/16977)
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
2025-11-09 23:38:03 +02:00
Jeff Bolz 4eef518167 vulkan: Fix test-thread-safety crashes (llama/17024)
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the
same time, which needs to hold the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
2025-11-09 23:38:03 +02:00
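Note on 4eef518167: the general pattern of guarding a std::map that is searched and inserted into from several threads might look like the sketch below. `Pipeline` and `PipelineCache` are hypothetical stand-ins for illustration and do not mirror the Vulkan backend's actual types.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a compiled pipeline object.
struct Pipeline {
    std::string name;
};

class PipelineCache {
public:
    // Looks up a pipeline by key and creates it on a miss. The lock is held
    // across both the find and the insert, so no other thread can modify
    // the map in between.
    Pipeline & get_or_create(int key) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = pipelines_.find(key);
        if (it == pipelines_.end()) {
            it = pipelines_.emplace(key, Pipeline{"pipeline_" + std::to_string(key)}).first;
        }
        return it->second;  // std::map references stay valid across later inserts
    }

    std::size_t size() {
        std::lock_guard<std::mutex> guard(mutex_);
        return pipelines_.size();
    }

private:
    std::mutex mutex_;
    std::map<int, Pipeline> pipelines_;
};
```

Holding one lock for the whole lookup-or-insert is the conservative choice, in the same spirit as the commit's decision to hold the lock for all of ggml_vk_load_shaders.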
Johannes Gäßler 358f77aca7 CUDA: fix MMQ stream-k fixup ne1 indices (llama/17089) 2025-11-09 23:38:03 +02:00
Reese Levine 78ea6c5b67 ggml webgpu: faster matrix multiplication/matrix-vector multiplication (llama/17031)
* Faster tensors (llama/8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings
2025-11-09 23:38:03 +02:00
bssrdf 547724b0a5 CUDA: properly handle nb00=nb02 case for cpy (llama/17081) 2025-11-09 23:38:03 +02:00
Acly 11543bf446 vulkan : refactor buffer handling in vk_op_f32 (llama/16840)
* vulkan : refactor/simplify buffer handling in vk_op_* functions

* Combine UMA handling into ggml_vk_tensor_subbuffer
2025-11-09 23:38:03 +02:00
Johannes Gäßler af8a88792f CUDA: fix should_use_mmvf for ne11 == 1 (llama/17085)
* CUDA: fix should_use_mmvf for ne11 == 1

* Apply suggestion from @am17an

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2025-11-09 23:38:03 +02:00
Adrien Gallouët a1746097bc Revert "ggml-cpu: detect correct cpu flags for arm64 (llama/16229) (#16239)" (llama/17084)
This reverts commit 7c23f3f0d4b9f5d6ea140756eb694b562d5acebb.
2025-11-09 23:38:03 +02:00
iron 512592513c ggml-cpu: detect correct cpu flags for arm64 (ggml/16229) (llama/16239)
When using GCC 9 and GCC 12 on the arm64 platform of ubuntu 2004,
the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags,
which results in compilation failures for certain extended instructions,
but the correct CPU flags can be obtained by using gcc -march.

Signed-off-by: lizhenneng <lizhenneng@kylinos.cn>
Co-authored-by: lizhenneng <lizhenneng@kylinos.cn>
2025-11-09 23:38:03 +02:00
xctan 5bce732795 ggml-cpu : optimize RVV q2_k and q3_k kernels (llama/16887) 2025-11-09 23:38:03 +02:00
Johannes Gäßler b5d6fa438f CUDA: fix crash on uneven context without FA (llama/16988) 2025-11-09 23:38:03 +02:00
Georgi Gerganov 32ed574370 metal : initial Metal4 tensor API support (llama/16634)
* metal : rework mat-mat multiplication

* metal : initial Metal4 support

* cont

* metal : detect tensor support

* cont : better ifdefs

* metal : support tensors in mul_mm_id

* metal : add env for disabling tensor API

* tests : restore

* metal : remove unused constants

* metal : fix check for bfloat tensor support

* cont : handle API incompatibilities

* cont : handle even more incompatibilities

* metal : use tensor API only on M5 and later
2025-11-09 23:38:03 +02:00
YehuditE 45588b272e sycl: add CONCAT operator support (llama/16047)
* sycl: add CONCAT operator support

* cleanup: remove stray lines added by mistake

* fix: code format issues in concat.cpp and tests/test-backend-ops.cpp

* chore: fix editorconfig violations

* cleanup: drop unnecessary i16 type support

* docs: update sycl-csv and regenerate ops.md

* update docs/ops.md

* fix: adapt to upstream master changes after rebase

* fix: remove empty files

* fix: drop whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-09 23:38:03 +02:00
l3utterfly b3324ae7d1 ggml-hexagon: graceful fallback for older socs where rpcmem_alloc2 and FASTRPC_GET_URI is unsupported (llama/16987)
* support older socs where FASTRPC_GET_URI is unsupported

* added graceful fallback when FASTRPC_GET_URI call fails

* use weak symbols instead of loading libcdsprpc.so dynamically

* Add weak pragma for rpcmem_alloc2

* Remove weak declaration for rpcmem_alloc2 in ggml-hexagon.cpp

Removed weak declaration for rpcmem_alloc2.

* Enforce ndev to 1 for archs below v75

Force ndev to 1 for SoCs architectures lower than v75.
2025-11-09 23:38:03 +02:00
bssrdf 13cd906501 improve CUDA cpy memory bandwidth when copying transposed tensor (llama/16841)
* WIP

* added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added showing improved memory bandwidth

* added BF16 support

* more strict check to make sure src0 is a transpose

* reformulated to handle more complicated transpose cases

* bring back 2D transpose for higher performance

* allow build on windows

* transpose copy more shapes

* minor tweak

* final clean up

* restore some test cases

* keep only the kernel for true transposed case; updated with review suggestions

* make CI happy

* remove headers not needed

* reduced bank conflicts for fp16 and bf16

* add missing const*

* now bank conflicts free

* use padding instead of swizzling

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
2025-11-09 23:38:03 +02:00
Jeff Bolz 558a04c9c7 vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (llama/16919) 2025-11-09 23:38:03 +02:00
Reese Levine e734b5d6ef ggml webgpu: minor set rows optimization (llama/16810)
* Add buffer label and enable dawn-specific toggles to turn off some checks

* Minor set_rows optimization (ggml/4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Remove some comments

* Implement overlap binary operators

* Revert "Implement overlap binary operators"

This reverts commit ed710b36f51ab3f53fa13db15c1685dc8678a32a.

* Disable support for non-contiguous binary_op tensors and leave note for future support

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
2025-11-09 23:38:03 +02:00
nullname 44e77ccee6 refactor: replace sprintf with snprintf for safer string handling in dump functions (llama/16913) 2025-11-09 23:38:03 +02:00
Jeff Bolz 1672d41ab0 vulkan: remove the need for the dryrun (llama/16826)
* vulkan: remove the need for the dryrun

Allocate pipelines and descriptor sets when requested.

Reallocate the prealloc buffers when needed, and flush any pending work
before reallocating.

For rms_partials and total_mul_mat_bytes, use the sizes computed the last time
the graph was executed.

* remove dryrun parameters
2025-11-09 23:38:03 +02:00
Acly 997fdde0c4 ggml-cpu : bicubic interpolation (llama/16891) 2025-11-09 23:38:03 +02:00
Noah 52e43a2fa5 Fix garbled output with REPACK at high thread counts (llama/16956)
* Fix garbled output with REPACK at high thread counts

Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (model-dependent threshold). The issue occurred because with high thread counts, the code forced chunk count to equal thread count, creating many small chunks. After aligning these chunks to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces minimum chunk sizes based on NB_COLS and caps maximum chunk count to prevent creating too many tiny chunks, ensuring proper alignment without overlaps.

* Update ggml/src/ggml-cpu/repack.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/repack.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-09 23:38:03 +02:00
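Note on 52e43a2fa5: one way to picture the fix is a splitter that caps the chunk count so every chunk holds at least one NB_COLS-wide block and every boundary falls on a block edge. The sketch below illustrates that idea under stated assumptions; `split_aligned`, its parameters and the `Range` type are hypothetical and are not the code in repack.cpp.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Range {
    int64_t begin;  // first column of the chunk (inclusive)
    int64_t end;    // one past the last column (exclusive)
};

// Hypothetical splitter: divides n_cols columns into at most n_threads chunks
// whose boundaries are multiples of nb_cols, so adjacent chunks never overlap.
std::vector<Range> split_aligned(int64_t n_cols, int64_t nb_cols, int64_t n_threads) {
    const int64_t n_blocks = (n_cols + nb_cols - 1) / nb_cols;
    // Cap the chunk count: never more chunks than blocks, so no chunk is
    // smaller than one block.
    const int64_t n_chunks = std::max<int64_t>(1, std::min(n_threads, n_blocks));

    std::vector<Range> chunks;
    for (int64_t i = 0; i < n_chunks; ++i) {
        const int64_t b0 = (i * n_blocks) / n_chunks;        // first block of chunk i
        const int64_t b1 = ((i + 1) * n_blocks) / n_chunks;  // first block of chunk i + 1
        chunks.push_back({ b0 * nb_cols, std::min(b1 * nb_cols, n_cols) });
    }
    return chunks;
}
```

With 64 columns, a block width of 8 and 26 threads, this yields 8 chunks of 8 columns each rather than 26 tiny chunks whose aligned edges could collide.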
Aman Gupta e51a2f90fe CUDA: avoid mul + bias fusion when doing fusion (llama/16935) 2025-11-09 23:38:03 +02:00
lhez f856023f46 opencl: support imrope (llama/16914)
* opencl: support imrope

* opencl: fix whitespace
2025-11-09 23:38:03 +02:00
theo77186 82ede64cd0 ggml: CUDA: add head size 72 for flash-attn (llama/16962) 2025-11-09 23:38:03 +02:00
Jinyang He 79801188f7 ggml : LoongArch fixes (llama/16958)
* Fix test-quantize-fns f16 and q4_0 failures when using LSX

* Fix LoongArch set float intrinsic when using LSX/LASX
2025-11-09 23:38:03 +02:00
shani-f f1da026bb8 SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster) Feature/sycl repeat back opt (#16869)
* SYCL repeat_back v1 — add core op + switch case

* Implement repeat_back SYCL operation and minor fixes

* SYCL: optimize repeat_back kernel

* Remove Hebrew comment from repeat_back.cpp

* Remove comments for code clarity

Removed comments to clean up the code.

* Fix formatting in ggml-sycl.cpp

* Formatted lambda according to legacy style. No logic changes

* Remove blank line in repeat_back.cpp

Remove unnecessary blank line before assigning acc to dst_dd.
2025-11-09 23:38:03 +02:00
Georgi Gerganov 39834fde1b clip : use FA (llama/16837)
* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-11-09 23:38:03 +02:00
mnehete32 5ed97df483 CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (llama/16917) 2025-11-09 23:38:03 +02:00
Aaron Teo 84854d246a ggml: add s390x cpu-feats (llama/16774) 2025-11-09 23:38:03 +02:00
Jeff Bolz 2001457367 vulkan: Fix multi_add invalid descriptor usage (llama/16899) 2025-11-09 23:38:03 +02:00
Jeff Bolz 90be9c9de1 vulkan: fuse mul_mat+add and mul_mat_id+add_id (llama/16868)
* vulkan: fuse mul_mat+add and mul_mat_id+add_id

The fusion is only applied for the mat-vec mul paths.

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix 32b build

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-09 23:38:03 +02:00
Oliver Simons 7d55fba06f CUDA: Remove unneeded bias/gate dims in fused mmvq (llama/16858)
* CUDA: Remove unneeded bias/gate dims in fused mmvq

Pointed out
[here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989)
that only a single value is needed per target col per thread

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Fix "Error 991-D: extra braces are nonstandard" during compilation

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-11-09 23:38:03 +02:00
Johannes Gäßler 52e1bbb554 CUDA: Volta tensor core support for MMF (llama/16843)
* CUDA: Volta tensor core support for MMF

* more generic checks for hardware support

* Update ggml/src/ggml-cuda/mmf.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2025-11-09 23:38:03 +02:00
Georgi Gerganov addda802dd ggml : fix conv2d_dw SVE path (ggml/1380)
* Fix test-conv2d-dw failure on ARM SVE by using runtime vector length

The ggml_compute_forward_conv_2d_dw_cwhn function was using a hardcoded GGML_F32_EPR (8) for SIMD vectorization, but on ARM SVE the actual vector length varies by hardware. This caused incorrect computation when processing CWHN layout tensors on ARM machines.

Fix by using svcntw() to get the runtime SVE vector length instead of the compile-time constant.

Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>

* ci : reduce sam score threshold

* ci : update bbox checks for sam test

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>
2025-11-09 23:38:03 +02:00
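Note on addda802dd: the difference between a compile-time lane count and the run-time SVE vector length can be sketched as below. `svcntw()` is the ACLE intrinsic named in the commit; the `f32_lanes` and `split_channels` helpers and the fallback width of 8 are assumptions made for this illustration only.

```cpp
#include <cstdint>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
// Number of 32-bit lanes in an SVE vector, queried at run time.
static int64_t f32_lanes() { return static_cast<int64_t>(svcntw()); }
#else
// Builds without SVE: a fixed width, used here only so the sketch compiles everywhere.
static int64_t f32_lanes() { return 8; }
#endif

struct Split {
    int64_t body;  // channels covered by full vectors
    int64_t tail;  // leftover channels handled one element at a time
};

// Splits a channel count into a vectorized body and a scalar tail, based on
// the lane count actually available rather than a hardcoded constant.
static Split split_channels(int64_t c) {
    const int64_t lanes = f32_lanes();
    const int64_t body  = (c / lanes) * lanes;
    return { body, c - body };
}
```

If the loop bounds were derived from a fixed 8 while the hardware vectors were wider or narrower, the body and tail would no longer line up with what each vector instruction really processes, which is the class of bug the commit describes.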
Aman Gupta 7d60b431a5 CUDA: add expert reduce kernel (llama/16857)
* CUDA: add expert reduce kernel

* contiguous checks, better formatting, use std::vector instead of array

* use vector empty instead of size

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-11-09 23:38:03 +02:00
Jeff Bolz a9ba988e56 vulkan: disable spirv-opt for rope shaders (llama/16872) 2025-11-09 23:38:03 +02:00
Masato Nakasaka e2b3eca0dc vulkan: Fix crash when FP16 mul_mat accumulation is not supported (llama/16796)
* Experimenting crash fix

* added assert for aborting and fixed comment

* changed to check if a pipeline is empty or not

* Moved function in class definition

* replaced with is_empty

* Modified is_empty to check only unaligned pipelines
2025-11-09 23:38:03 +02:00
Ruben Ortlam 7ed570ee94 vulkan: fix shmem overrun in mmq id shader (llama/16873)
* vulkan: fix shmem overrun in mmq id shader

* metal : fix mul_mm_id

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-09 23:38:03 +02:00
l3utterfly 486d39c2cb ggml-hexagon: respect input size when getting/setting tensor data (llama/16836)
* respect input size when getting/setting tensor data

allows partial repacking/copying when get tensor size is smaller than the actual tensor

* Removed duplicate repack_mxfp4_mxfp4x4x2 function
2025-11-09 23:38:03 +02:00
lhez 7fdd53ac0d opencl: fix boundary handling for mul_mm (llama/16875) 2025-11-09 23:38:03 +02:00
Max Krasnyansky ffe1c832bd cpu: introduce chunking for repack matmuls and enable matmul-id chunking on ARM64 (llama/16833)
Very similar implementation to the flash-attention chunking, with similar benefits.
2025-11-09 23:38:03 +02:00
JJJYmmm e1780b209d model: add support for qwen3vl series (llama/16780)
* support qwen3vl series.

Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>

* bugfix: fix the arch check for qwen3vl-moe.

* use build_ffn

* optimize deepstack structure

* optimize deepstack feature saving

* Revert "optimize deepstack feature saving" for temporal fix

This reverts commit f321b9fdf13e59527408152e73b1071e19a87e71.

* code clean

* use fused qkv in clip

* clean up / rm is_deepstack_layers for simplification

* add test model

* move test model to "big" section

* fix imrope check

* remove trailing whitespace

* fix rope fail

* metal : add imrope support

* add imrope support for sycl

* vulkan: add imrope w/o check

* fix vulkan

* webgpu: add imrope w/o check

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix tensor mapping

---------

Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com>
Co-authored-by: yairpatch <yairpatch@users.noreply.github.com>
Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-09 23:38:03 +02:00
Max Krasnyansky f1fdb91e95 cpu: introduce chunking for flash attention (llama/16829)
Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop
on top that handles the chunks.
2025-11-09 23:38:03 +02:00
Sigbjørn Skjæret f7dfa39104 cuda : fix argsort with 64k+ rows (llama/16849) 2025-11-09 23:38:03 +02:00
Jeff Bolz 887d984558 vulkan: Handle argsort with a large number of rows (llama/16851) 2025-11-09 23:38:03 +02:00
Oliver Simons 41f4daca57 Hide latency of bias and gate-loading (llama/16847)
This is realised by loading them into registers before computation of
the dot-product, effectively batching them together with said
dot-product. As a lot of threads are alive here, the warp scheduler has
enough threads available to effectively hide the cost of additionally
loading those two floats.
2025-11-09 23:38:03 +02:00
Jeff Bolz efe8099268 vulkan: Fuse rope+set_rows (llama/16769)
This pattern appears in a lot of models, the rope operation is applied right
before storing into the KV cache (usually on the K tensor).

Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).

Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.

Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping writing the roped K value helps to allow more
nodes to run concurrently.

Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.

Add new backend tests.
2025-11-09 23:38:03 +02:00
Jeff Bolz 35a3fda240 vulkan: Update topk_moe fusion to handle gpt's late softmax (llama/16656)
* vulkan: Update topk_moe fusion to handle gpt's late softmax

Based on #16649.

* Add ggml_check_edges

* Add sync logging to show fusion effects

* handle clamp added in #16655

* Update ggml/src/ggml-impl.h

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-11-09 23:38:03 +02:00
Ruben Ortlam bc944bddc8 Vulkan MMQ Integer Dot Refactor and K-Quant support (llama/16536)
* vulkan: add mmq q2_k integer dot support

* Refactor mmq caching

* Reduce mmq register use

* Load 4 quant blocks into shared memory in one step

* Pack q2_k blocks into caches of 32

* Use 32-bit accumulators for integer dot matmul

* Add q4_k mmq

* Add q3_k mmq

* Add q5_k mmq

* Add q6_k mmq

* Add mxfp4 mmq, enable MMQ MUL_MAT_ID

* Fix mmv dm loads
2025-11-09 23:38:03 +02:00
Max Krasnyansky 4d74160c9a Hexagon Op queue & dispatch optimizations (llama/16820)
* hexagon: remove dspqueue callbacks and do all read processing inplace

* hexagon: there is no need to ref/deref the buffers at this point

We're not going to release the buffers without flushing the session queue.
So there is no need to inc/dec the refcounts for every request.
We also don't need to include those bufs in the response.

* hexagon: bump the thread count in the adb wrapper scripts

We can use more CPU cores now that the dedicated dspqueue polling threads are not used (i.e. no contention).
Also enable more aggressive polling for now since we still map Flash Attention (and a few other kernels) to
the CPU and those dspqueue threads were keeping the CPU cores at higher clock freqs.

* hexagon: add lhez as the second code owner
2025-11-09 23:38:03 +02:00
Aman Gupta 6051c704a0 CUDA: use fastdiv in set-rows (llama/16834)
* CUDA: use fastdiv in set-rows

* add assert about value fitting in u32
2025-11-09 23:38:03 +02:00
Jeff Bolz 82a23ca9c4 vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (llama/16793)
This lets the copy to the destination device use the host-visible
vidmem optimization.
2025-11-09 23:38:03 +02:00
Aman Gupta 5c316c48f7 CUDA: Fix bug in topk-moe for gpt-oss (llama/16821)
* CUDA: Fix bug in topk-moe for gpt-oss

When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse nodes (because the nodes are not used elsewhere in the graph),
but the fusion doesn't actually happen in the real gpt-oss graph.

* fix for qwen3 too

* change ifndef to ifdef
2025-11-09 23:38:03 +02:00
YaelLogic 5850c952e5 sycl: add RMS_NORM_BACK operation support (llama/16808)
* sycl: add RMS_NORM_BACK operation support

* sycl: rms_norm_back: add dual reduction paths (FP64 and FP32) and savepoint before further changes

* sycl: add RMS_NORM_BACK support

Implement RMS_NORM_BACK for the SYCL backend using FP32 compensated parallel reduction. Minimal docs updates (ops.md / SYCL.csv).

* revert: restore .gitignore and tools/run/CMakeLists.txt to upstream

* revert: restore tests/CMakeLists.txt to upstream

* sycl: optimize rms_norm_back

* fix: restore SYCL.csv to correct state with RMS_NORM_BACK support

* Update ggml/src/ggml-sycl/norm.cpp

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

* fix: remove trailing whitespace and add missing newline (EditorConfig)

---------

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2025-11-09 23:38:03 +02:00
YaelGitAccount a983c9219d cuda: add SET operation support (llama/16804)
* feat(cuda): add GGML_OP_SET support

Implement CUDA kernel for SET operation with f32 support.

All tests passing (14598/14598).

* cuda(set): add I32 support; keep F32

* refactor(cuda): use ggml_cuda_cpy to unify SET operator logic and remove code duplication

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update ggml/src/ggml-cuda/set.cu

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-09 23:38:03 +02:00
l3utterfly f863a42d97 initialise buffer.device in ggml_hexagon_session (llama/16816) 2025-11-09 23:38:03 +02:00
Chenguang Li cb39359e7f CANN: Improve device ID handling and aclnnArange checks (llama/16752)
* cann: improve device ID handling and aclnnArange checks

- Stop relying on CANN's internal device ID retrieval; use a global variable instead.
- Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions.

* cann: use thread local var
2025-11-09 23:38:03 +02:00
Aman Gupta 0c8ff48103 CUDA: add unused vars to mmvf and mmvq (llama/16807) 2025-11-09 23:38:03 +02:00
tamarPal 9664420a54 sycl: add SSM_CONV operation support (llama/16800)
* feat: Add SYCL backend support for SSM_CONV operator

* Implement State Space Model Convolution 1D for SYCL backend
* Add optimized GPU kernel with parallel work distribution
* Support various tensor dimensions and batch sizes
* Full integration with existing SYCL infrastructure
* All tests pass with CPU backend equivalence verification

* feat: Implement SYCL backend support for SSM_CONV operation

- Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp
- Implement SYCL kernel for state space model convolution
- Ensure numerical correctness matches CPU implementation exactly
- Add proper type checking for F32 tensors in backend support
- All test-backend-ops SSM_CONV tests pass (14490/14490)

* Perfect SSM_CONV SYCL implementation - 100% CPU parity

- Flawless numerical accuracy - matches CPU bit-for-bit
- Optimal SYCL kernel design - efficient parallel execution
- Complete tensor layout compatibility - handles all strides correctly
- Robust error handling - comprehensive assertions and validation
- All official tests pass - 14,490/14,490 backend operations verified
- Production-ready code - clean, documented, maintainable

Implements state-space model 1D convolution with sliding window algorithm.
Eliminates blocking queue.wait() for better async performance.

* Clean SSM_CONV code - remove all comments for production

Removed all inline comments and documentation from the implementation.
Clean, minimal code ready for production merge.

* fix: Final formatting corrections for CI compliance

- Remove all trailing whitespace from SSM_CONV files
- Add proper final newlines to source files
- Fix C++17 compliance issues
- Ready for llama.cpp CI validation

* sycl: fix trailing whitespace and minor safety casts in ssm_conv

* fix: Clean up duplicated content in ssm_conv.hpp header file

---------

Co-authored-by: tamarPal <tamarPal@example.com>
2025-11-09 23:38:03 +02:00
Acly bcda7c3e58 ggml : fix interpolate with align-corners and ne=1 (llama/16700)
* ggml : fix interpolate with align-corners and ne=1

* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken

* fix clang warning
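
The division-by-zero case in the align-corners fix above can be illustrated with a small sketch. It assumes the usual align-corners convention (first and last samples of source and destination coincide, so the step is `(ne_src - 1) / (ne_dst - 1)`); the function names are hypothetical and this is not the backend code.

```c
#include <assert.h>
#include <stdint.h>

/* Source-coordinate step per destination index under align-corners. */
static float align_corners_step(int64_t ne_src, int64_t ne_dst) {
    assert(ne_src >= 1 && ne_dst >= 1);
    if (ne_dst == 1) {
        /* Only destination index 0 exists; avoid dividing by ne_dst - 1 == 0. */
        return 0.0f;
    }
    return (float)(ne_src - 1) / (float)(ne_dst - 1);
}

/* Lower source index for destination index i, clamped to the valid range. */
static int64_t align_corners_src_index(int64_t i, int64_t ne_src, int64_t ne_dst) {
    int64_t x = (int64_t)((float)i * align_corners_step(ne_src, ne_dst));
    if (x < 0) {
        x = 0;
    }
    if (x > ne_src - 1) {
        x = ne_src - 1;
    }
    return x;
}
```

The clamp is what hid the problem on backends that already had it: an infinite or NaN step still ends up at a valid index after clamping, whereas a backend without the clamp reads from the wrong place.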
2025-11-09 23:38:03 +02:00
Johannes Gäßler 1471b1fda7 HIP: fix AMDGPU_TARGETS, update documentation (llama/16803) 2025-11-09 23:38:03 +02:00
tamarPal 0e1b6c5fc4 sycl: add ROLL operation support (llama/16665)
* sycl: add ROLL operation support

- Implement ggml_sycl_roll function for F32 tensors
- Add multi-axis roll operation with SYCL kernel
- Support all 4 tensor dimensions with proper shift normalization
- Add roll.cpp and roll.hpp to SYCL backend
- Update backend dispatch and supports_op for GGML_OP_ROLL
- Tests: 17662/17662 pass with identical CPU reference results

* fix: remove trailing whitespace from roll.cpp

- Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp
- Remove trailing spaces from lines 6, 11, 28, 47, 58, 60

* ci: retrigger

* sycl: remove wait() calls from ROLL operation

* fix: editorconfig — LF endings + final newline for roll.hpp

---------

Co-authored-by: tamarPal <tamarPal@example.com>
2025-11-09 23:38:03 +02:00
shani-f 543221d824 sycl: add REPEAT_BACK operation support (llama/16734)
* SYCL repeat_back v1 — add core op + switch case

* Implement repeat_back SYCL operation and minor fixes

* Update ggml/src/ggml-sycl/repeat_back.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update ggml/src/ggml-sycl/repeat_back.hpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-09 23:38:03 +02:00
Aman Gupta 97c3285cc4 CUDA: support for weight clamp in top-k norm (llama/16702) 2025-11-09 23:38:03 +02:00
Acly bd8734c050 ggml-alloc : make gallocr prefer chunks that allow memory reuse (llama/16788) 2025-11-09 23:38:03 +02:00
Sigbjørn Skjæret e6ff2bceed cuda : use fast copy when src and dst are of different type and contiguous (llama/16789)
* use fast copy when src and dst are contiguous and same shape

* use int64_t ne and ignore shape
2025-11-09 23:38:03 +02:00
leejet 4f4246dcb4 ggml: fix cuda kernel launch configuration for k_compute_batched_ptrs to support large batch (llama/16744)
* fix k_compute_batched_ptrs

* add backend ops test

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* reduce the batch size

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-11-09 23:38:03 +02:00
Aman Gupta 9f75cc7eef CUDA: General GEMV fusion (llama/16715) 2025-11-09 23:38:03 +02:00
Gilad S c00ab7e5e6 vulkan: deduplicate Microsoft Direct3D12 devices (llama/16689)
* fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver

* style: indent

* fix: decrease priority

* fix: switch to `||`
2025-11-09 23:38:03 +02:00
Giuseppe Scrivano d0b544da70 vulkan: delete dead code (llama/16732)
ggml_vk_create_buffer_temp is not used anywhere, and it is the only
caller for ggml_vk_pool_malloc.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-09 23:38:03 +02:00
Jeff Bolz 070b24f65c vulkan: Optimize SSM_SCAN (llama/16645) 2025-11-09 23:38:03 +02:00
leejet 5166efa7f0 ggml: fix CUDA grid launch condition for large block_nums.y in binbcast (llama/16742)
* Fix CUDA grid launch condition for large block_nums.y

* add backend ops test

* reduce test repetitions
2025-11-09 23:38:03 +02:00
Aman Gupta 524046d4d1 CUDA: use CUB for arbitrary size argsort (llama/16754) 2025-11-09 23:38:03 +02:00
Aman Gupta 47efc4f115 ggml-cuda: use passed ops instead of hardcoded ops (llama/16712) 2025-11-09 23:38:03 +02:00
Matthew Michel 0a5b4c2e9b sycl: use async memory allocation to fix crashes during graph recording (llama/16644)
* sycl: use async memory allocation to fix graph recording failures

GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because:
  - Host waits are currently unsupported in graph recording mode.
  - SYCL malloc / free calls are unsupported in graph recording mode.

The following changes are made to fix SYCL graph functionality:
  - When graphs are enabled, use the SYCL async memory extension for temp
    buffers which is supported with SYCL graphs.
  - For compiler versions that do not support this extension, skip
    graphs with the affected op.
  - Switch from USM shared to device memory as the async extension
    currently just supports device allocations.

* Address reviewer feedback

* Use global async variable to decide path in sycl_ext_[malloc_device|free]
2025-11-09 23:38:03 +02:00
Max Krasnyansky 8bb12395fe Add experimental ggml-hexagon backend for the Hexagon NPU (llama/16547)
* model: add support for extra bufs for all devices

* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU

This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.

Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com>

* hexagon: fix format checker errors

* hexagon: update readme and cmake presets

* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions

* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input

* hexagon: move ADB helper scripts into scripts/snapdragon/adb

* hexagon: replace all f/printfs with GGML_LOG_...

* readme: add hexagon to the list of supported backends

* hexagon: stack matmuls with quantized inputs only

* hexagon: add TODO for fixing issues in hexagon_graph_optimize

* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC

* scripts: fix lint errors

* scripts: update qdc pytest script to make linter happy

* hexagon: add reduce sum in fp32

* hexagon: reduce number of vector stores in matmul output

* hexagon: remove the need for vdelta in reduce-multiply-x8

* hexagon: consistent use of reduce_sum_fp32 for row_sums

* hexagon: some more matmul optimizations and comments

Optimize cases where tensor dims are not a multiple of 1024 (e.g. in Qwen models).
We've handled those cases already but at a higher overhead.

* hexagon: update cmake presets

* hexagon: add OPMASK support for run-bench.sh wrapper

* hexagon: update to use GGML_BACKEND_API

* hexagon: remove unused logic for setting tensor flags for the views

* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors

Same asserts as the CPU backend.

* hexagon: use cpy_tensor slow path for non-host buffers

* hexagon: error checks in the buffer allocator

* cmake: move include(extProj) under ggml-hexagon

* hexagon: don't forget to delete the backend on free

* hexagon: set/get_tensor size assert apply only to quantized tensors

* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now

GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way.
Ideally we need somewhat finer log levels.

* docs: typos in hexagon developer docs (libggm-...)

* hexagon: overhaul error handling in the session/device allocation

this should handle all failure paths in the session allocation.

* hexagon: update cmake presets to enable fp16 vectors

* hexagon: remove unused time_usec function

* hexagon: don't forget to release buffer contexts

* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)

* hexagon: remove custom can_repeat function and use ggml_can_repeat

---------

Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2025-11-09 23:38:03 +02:00
Diego Devesa a2130ac501 Revert "ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_v…" (#16723)
This reverts commit 19a5a3edfd306516cc419679d69d6435943b6816.
2025-11-09 23:38:03 +02:00
sirus20x6 773041e336 ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_vec_set_f32 for faster fills (llama/16522)
* Leverage the existing GGML_F32_VEC helpers to broadcast the fill value across SIMD registers and store in vector-sized chunks, while retaining the scalar tail for leftover elements and non-SIMD builds.

* Vectorize additional f32 helper loops

* Normalize f32 helper tails for ggml vec ops
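
The "vector-sized chunks plus scalar tail" structure described in the first item can be sketched in portable C. This is only an illustration of the loop shape: it uses a plain inner loop where the real change used the GGML_F32_VEC helpers, and `VEC_LANES` and the function name are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_LANES 8u /* assumed vector width in floats; illustrative only */

/* Fill x[0..n) with v: full chunks first, then the leftover elements. */
static void vec_set_f32_chunked(size_t n, float *x, float v) {
    assert(x != NULL || n == 0);
    const size_t n_vec = n - (n % VEC_LANES); /* largest multiple of VEC_LANES <= n */
    size_t i = 0;
    for (; i < n_vec; i += VEC_LANES) {
        /* stands in for one broadcast + one vector store */
        for (size_t k = 0; k < VEC_LANES; ++k) {
            x[i + k] = v;
        }
    }
    for (; i < n; ++i) {
        x[i] = v; /* scalar tail, also the whole loop when n < VEC_LANES */
    }
}
```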

---------

Co-authored-by: Aaron <shelhamer.aaron@gmail.com>
2025-11-09 23:38:03 +02:00
Aman Gupta 431aaf56f0 CUDA: fix bug in topk-moe softmax (llama/16711) 2025-11-09 23:38:03 +02:00
Aman Gupta ba41a6ca6a CUDA: topk-moe: add optional parameter for gpt-oss (llama/16649) 2025-11-09 23:38:03 +02:00
Johannes Gäßler 99cea274e5 CUDA: better error for FA kernel with 0 occupancy (llama/16643) 2025-11-09 23:38:03 +02:00
Aman Gupta 9a8cfb040c ggml: add ggml_can_fuse_subgraph (llama/16662)
* ggml: add ggml_can_fuse_subgraph

* ggml-cuda: use ggml_can_fuse_subgraph for topk-moe

* format

* 1. remove inputs from signature as they are transient nodes
2. add check for views: view_src should be part of the subgraph

* - combine check into one loop
- check all view_src parents
- other minor review comments

* remove redundant if test

* - rename and other minor review comments

* add assert about count < 32
2025-10-22 12:58:11 +03:00
lhez 5c4c477d00 opencl: fix warnings and clean up profiling (llama/16688)
* opencl: remove unused headers, fix warnings

* opencl: clean up profiling, only keep kernel time
2025-10-22 12:58:11 +03:00
Jeff Bolz 7f16c71068 vulkan: Handle FA with all -inf mask values (llama/16447) 2025-10-22 12:58:11 +03:00
YehuditE 55cf00c20a sycl : add PAD_REFLECT_1D operator support (llama/16145)
* sycl: add PAD_REFLECT_1D operator support

* docs(ops): regenerate docs/ops.md

* remove trailing whitespaces

* style: fix editorconfig issues — trim trailing spaces and normalize EOLs

* fix: move PAD_REFLECT_1D case outside of fall-through block
2025-10-22 12:58:11 +03:00
Diego Devesa 70b4d22f01 ggml-alloc : fix leak when reusing a tensor with a larger size (llama/16679) 2025-10-22 12:58:11 +03:00
safranowith bb76672081 SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators (llama/16613)
* SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators

Clean up unrelated changes from previous commit

* Chore: remove empty lines and fix indentation

* Clean up: remove leftover blank lines and fix spacing

* chore: fix trailing whitespace and ensure final newline

* Cleanup: remove redundant declarations already defined in header

* Sync docs/ops.md with updated backend operation support

* docs: update ops.md after rebase

* docs: update ops.md - Vulkan supports SSM_CONV and SSM_SCAN
2025-10-22 12:58:11 +03:00
Aaron Teo 82bdf31267 ci : fix binaries release failure for s390x (binaries may not work yet) (llama/16664)
* devops: initial patch

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: forgot the z15 suffix

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: attempt at impl GGML_CPU_ALL_VARIANTS for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: rm baseline version

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-10-22 12:58:11 +03:00
Johannes Gäßler 72d98011db HIP: fix GPU_TARGETS (llama/16642) 2025-10-22 12:58:11 +03:00
Jeff Bolz 414901a42c vulkan: Implement topk_moe fused shader, ported from CUDA (llama/16641)
This is similar to the CUDA shader from #16130, but doesn't use shared memory
and handles different subgroup sizes.
2025-10-22 12:58:11 +03:00
Aman Gupta 08345f15ec CUDA: use registers instead of smem in topk-moe (llama/16647)
Uses the technique used in the vulkan PR #16641. Neat trick!
2025-10-22 12:58:11 +03:00
Shawn Gu 8ffdf4bd96 opencl: transposed gemm/gemv moe kernel with mxfp4,f32 (llama/16602)
* opencl: transposed gemm/gemv moe kernel with mxfp4,f32

* add restore kernel for moe transpose

* fix trailing whitespaces

* resolve compilation warnings
2025-10-22 12:58:11 +03:00
Radoslav Gerganov 6aa18cccd8 rpc : report actual free memory (llama/16616)
* rpc : report actual free memory

Start reporting the free memory on every device instead of using
fixed values. Now llama-cli users can get a nice memory breakdown
when using RPC devices.

* drop --mem in rpc-server
2025-10-22 12:58:11 +03:00
Giuseppe Scrivano d22008b631 vulkan: Add State Space Model (SSM) Operations Support (llama/16463)
* vulkan: implement SSM scan operation

Add State Space Model scan operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* vulkan: implement SSM conv operation

Add State Space Model conv operation to the Vulkan backend.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-10-22 12:58:11 +03:00
muggle-stack 328263f8fd ggml : fix SpaceMit IME array out-of-bounds in task assignment (llama/16629)
Fix incorrect task-to-batch index calculation in the quantization phase.

The bug caused out-of-bounds access to qnbitgemm_args array when
compute_idx exceeded per_gemm_block_count_m, leading to invalid
pointer dereferences and SIGBUS errors.

Correctly map tasks to batches by dividing compute_idx by
per_gemm_block_count_m instead of block_size_m.

Example:
  batch_feature=1, gemm_m=30, block_size_m=4
  per_gemm_block_count_m = 8, task_count = 8

  Old: gemm_idx = 4/4 = 1 (out of bounds)
  New: gemm_idx = 4/8 = 0 (correct)

Tested on SpaceMit K1 RISC-V64 with qwen2.5:0.5b model.
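
The index arithmetic in the example above can be reproduced with a short sketch. The helper names are hypothetical, and the ceiling division for `per_gemm_block_count_m` is an assumption inferred from the numbers in the example (30 rows in blocks of 4 giving 8 blocks).

```c
#include <assert.h>
#include <stddef.h>

/* Blocks needed to cover gemm_m rows in tiles of block_size_m (ceiling division). */
static size_t per_gemm_block_count(size_t gemm_m, size_t block_size_m) {
    assert(block_size_m > 0);
    return (gemm_m + block_size_m - 1) / block_size_m;
}

/* Batch (GEMM) index for a flat task index:
 * divide by the number of blocks per GEMM, not by the block size. */
static size_t gemm_index_for_task(size_t compute_idx, size_t block_count_m) {
    assert(block_count_m > 0);
    return compute_idx / block_count_m;
}
```

With a single batch and 8 tasks, every task index from 0 to 7 maps to batch 0. Dividing by the block size (4) instead would send tasks 4 to 7 to batch 1, which does not exist.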

Co-authored-by: muggle <mingjun.rong@spacemit.com>
2025-10-22 12:58:11 +03:00
Jeff Bolz 4a384826a8 vulkan: fix debug build (add_rms_len/data not found) (llama/16624) 2025-10-22 12:58:11 +03:00
Ilia Ilmer 0ae492641c metal : add `CONV_TRANSPOSE_2D` (llama/16542)
* initial: headers and metal-device.cpp updates

* adding conv_transpose_2d

* fix type

* fix type: int32->int64

* Update ggml/src/ggml-metal/ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-metal/ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-metal/ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add checks for src[0] and src[1]; add type checks

* Update ggml-metal.metal

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add more tests, add optimization to threading

* add dynamic memory allocation in metal

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-22 12:58:11 +03:00
GittyBurstein 82332cea27 SYCL SET operator optimized for F32 tensors (llama/16350)
* SYCL/SET: implement operator + wire-up; docs/ops updates; element_wise & ggml-sycl changes

* sycl(SET): re-apply post-rebase; revert manual docs/ops.md; style cleanups

* move SET op to standalone file, GPU-only implementation

* Update SYCL SET operator for F32

* ci: fix editorconfig issues (LF endings, trailing spaces, final newline)

* fixed ggml-sycl.cpp

---------

Co-authored-by: Gitty Burstein <gitty@example.com>
2025-10-22 12:58:11 +03:00
GittyBurstein 7bb53032b3 sycl : add ARANGE operator (llama/16362)
* SYCL: update element-wise ops and presets

* clean arange

* Re-trigger CI

---------

Co-authored-by: Gitty Burstein <gitty@example.com>
2025-10-22 12:58:11 +03:00
Chenguang Li fe965613c0 CANN: format code using .clang-format (llama/15863)
This commit applies .clang-format rules to all source files under the
ggml-cann directory to ensure consistent coding style and readability.
The .clang-format option `SortIncludes: false` has been set to disable
automatic reordering of include directives.
No functional changes are introduced.

Co-authored-by: hipudding <huafengchun@gmail.com>
2025-10-22 12:58:11 +03:00
takuya kodama 3c136d699a ggml-cpu: replace putenv with setenv for const-correctness (llama/16573)
## Why it failed

When compiling with strict compiler flags (-Wwrite-strings -Werror=discarded-qualifiers),
the build fails with the following error:

```
cmake \
  -S . \
  -B ../llama.cpp.build \
  --preset=x64-linux-gcc-debug \
  -DCMAKE_INSTALL_PREFIX=/tmp/local \
  -DCMAKE_C_FLAGS="-Wwrite-strings -Werror=discarded-qualifiers" && \
cmake --build ../llama.cpp.build/
...
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c: In function ‘ggml_cpu_init’:
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3572:24: error: passing argument 1 of ‘putenv’ discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers]
 3572 |                 putenv("KMP_BLOCKTIME=200"); // 200ms
      |                        ^~~~~~~~~~~~~~~~~~~
In file included from /home/otegami/work/cpp/llama.cpp/ggml/src/./ggml-impl.h:10,
                 from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:6,
                 from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/traits.h:3,
                 from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:6:
/usr/include/stdlib.h:786:26: note: expected ‘char *’ but argument is of type ‘const char *’
  786 | extern int putenv (char *__string) __THROW __nonnull ((1));
      |                    ~~~~~~^~~~~~~~
cc1: some warnings being treated as errors
ninja: build stopped: subcommand failed.
```

The issue is that putenv() expects a non-const char * but receives a string literal (const char *).

## How to fix

This PR replaces putenv("KMP_BLOCKTIME=200") with setenv("KMP_BLOCKTIME", "200", 0).

Benefits of setenv():
- Accepts const char * parameters (no qualifier warnings)
- Makes copies of the strings (safer memory handling)
- The third parameter (0) ensures we don't overwrite if already set
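
A minimal sketch of the replacement described above, assuming a POSIX system. The helper names are hypothetical; only the `setenv("KMP_BLOCKTIME", "200", 0)` call comes from the PR description.

```c
#define _POSIX_C_SOURCE 200112L /* exposes setenv() under strict -std=c99/c11 */
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: set a default KMP_BLOCKTIME without clobbering the user's value. */
static int set_default_kmp_blocktime(void) {
    /* setenv() takes const char * arguments and copies both strings.
     * overwrite == 0: leave the variable untouched if it already exists. */
    return setenv("KMP_BLOCKTIME", "200", 0);
}

/* Small self-check of the overwrite == 0 semantics. */
static void check_default_is_not_forced(void) {
    setenv("KMP_BLOCKTIME", "50", 1);            /* simulate a user-provided value */
    assert(set_default_kmp_blocktime() == 0);
    assert(atoi(getenv("KMP_BLOCKTIME")) == 50); /* still the user's value */
}
```

Note that `putenv()` keeps a pointer to the string it is given, whereas `setenv()` copies its arguments, which is why the string literals are safe here.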
2025-10-22 12:58:11 +03:00
yael-works f7b5ecf195 SYCL: Add GGML_OP_MEAN operator support (llama/16009)
* SYCL: Add GGML_OP_MEAN operator support

* SYCL: Fix formatting for GGML_OP_MEAN case

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-22 12:58:11 +03:00
safranowith 757d51d21d cpu : add FLOOR, CEIL, ROUND and TRUNC unary operators (llama/16083)
* CPU: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators

- Added the operators to unary op enum
- Implemented API functions
- Implemented forward and unary-op logic in CPU backend
- Updated ggml_get_n_tasks
- Updated operators names array and static_assert
- Updated docs and enabled automatic tests

* docs: add documentation for ggml_trunc and ggml_trunc_inplace in ggml.h

* chore: remove trailing whitespace from ggml.h

* Remove unresolved merge markers

* Apply review suggestions: cleanup formatting, enum order and leftover artifacts

* Regenerate ops.md using create_ops_docs.py
2025-10-22 12:58:11 +03:00
lhez bef9f74553 opencl: add q8_0 mm support (llama/16469)
* opencl: add mm_q8_0_f32

* opencl: fix data loading for incomplete tile

* opencl: use q8_0 mm for larger matrix

* opencl: add some tests to cover the path
2025-10-22 12:58:11 +03:00
lhez 16dab3d122 opencl: fix FA for f32 (llama/16584) 2025-10-22 12:58:11 +03:00
Sam/Samuel d8a146b0f9 metal: optimise `GGML_OP_SUM` (llama/16559)
* optimise GGML_OP_SUM

* add non-contiguous tests by permuting the input

* change tests to require full contiguity of OP_SUM

* cuda : add check GGML_OP_SUM

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-22 12:58:11 +03:00
Julius Tischbein 0c9d49927c CUDA: Changing the CUDA scheduling strategy to spin (llama/16585)
* CUDA set scheduling strategy to spinning for cc121

* Using prop.major and prop.minor, include HIP and MUSA

* Exclude HIP and MUSA

* Remove trailing whitespace

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Remove empty line

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-10-22 12:58:11 +03:00
Georgi Gerganov 8ed913da0e metal : avoid using Metal's gpuAddress property (llama/16576)
* metal : avoid using Metal's gpuAddress property

* metal : fix rope kernels buffer check
2025-10-22 12:58:11 +03:00
SavicStefan 499f183e75 vulkan: Add ACC_TYPE_VEC2 implementation (llama/16203)
Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>
2025-10-15 09:29:17 +03:00
Aman Gupta 2eb9119754 CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (llama/16577) 2025-10-15 09:29:17 +03:00
Jeff Bolz 393fbbc80b vulkan: Support FA with K/V in F32 (llama/16543) 2025-10-15 09:29:17 +03:00
Jeff Bolz 73e200ee85 vulkan: Improve build time for MSVC (llama/16545)
Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel.

Enable /MP so source files are compiled in parallel.
2025-10-15 09:29:17 +03:00
Johannes Gäßler 1bdd746bc8 CUDA: enable FA for FP32 KV cache (llama/16546) 2025-10-15 09:29:17 +03:00
Aman Gupta f2075667fa CUDA: use fastdiv + ggml_cuda_mad for mmvf (llama/16557)
* CUDA: use fastdiv + ggml_cuda_mad for mmvf

* use bf16 directly + fix formatting

* Add exception for HIP code
2025-10-15 09:29:17 +03:00
Aman Gupta b4c5c6f71f CUDA: add fp kernel for larger batch size MoE (llama/16512)
* CUDA: kernel for larger batch sizes for MoE

* WIP

* WIP

* WIP

* WIP

* WIP

* WIP

* fixup

* tests

* Move mmq_ids_helper to mmid

* cleanup

* Remove redundant checks
2025-10-15 09:29:17 +03:00
Anav Prasad a12848e8e9 cuda : remove legacy copy-op pointer indirection code (llama/16485)
* remove legacy copy-op pointer indirection code

* further removal of copy-op indirection code

* renamed check_node_graph_compatibility_and_refresh_copy_ops function
2025-10-15 09:29:17 +03:00
Georgi Gerganov 25ac94a6cb metal : FA support F32 K and V and head size = 32 (llama/16531)
* metal : FA support F32 K and V and head size = 32

* graph : remove obsolete comment [no ci]
2025-10-15 09:29:17 +03:00
lhez 66b0fc2fb7 opencl: fix build targeting CL 2 (llama/16554) 2025-10-15 09:29:17 +03:00
Johannes Gäßler 77272fe0df CUDA: fix numerical issues in tile FA kernel (llama/16540) 2025-10-15 09:29:17 +03:00
Jie Fu (傅杰) 8a9c2ba6a1 ggml : fix build broken with -march=armv9-a on MacOS (llama/16520)
* ggml : fix build broken with -march=armv9-a on MacOS

Signed-off-by: Jie Fu <jiefu@tencent.com>

* Add #pragma message

Signed-off-by: Jie Fu <jiefu@tencent.com>

* Address review comment.

Signed-off-by: Jie Fu <jiefu@tencent.com>

* Update ggml/src/ggml-cpu/ggml-cpu.c

---------

Signed-off-by: Jie Fu <jiefu@tencent.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-10-15 09:29:17 +03:00
Chenguang Li 417ecdddc5 CANN: fix CPU memory leak in CANN backend (llama/16549)
This commit fixes a CPU-side memory leak issue in the CANN backend,
which occurred when intermediate aclTensorList objects were not properly
released after operator execution. The leak happened during repeated
invocations of CANN ops (e.g., FlashAttention), leading to increasing
host memory usage over time.

Proper resource cleanup (aclDestroyTensorList and related release logic)
has been added to ensure that all temporary tensors are correctly freed.
2025-10-15 09:29:17 +03:00
Sam/Samuel bfd88b8b6e metal: add support for opt_step_sgd (llama/16539)
* metal: add support for opt_step_sgd

* add newline to pass EditorConfig check
2025-10-15 09:29:17 +03:00
Georgi Gerganov ccac1b4772 ggml : fix scalar path for computing norm (llama/16558) 2025-10-15 09:29:17 +03:00
hipudding 53e21364a6 CANN: Update several operators to support FP16 data format (llama/16251)
Many Ascend operators internally use FP16 precision for computation.
If input data is in FP32, it must first be cast to FP16 before
computation, and then cast back to FP32 after computation, which
introduces unnecessary cast operations. Moreover, FP16 computation
requires significantly less workload compared to FP32, leading to
noticeable efficiency improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended
to support multiple data types. Validation on the Qwen2 0.5b model shows
correct accuracy and about 10% performance gain in concurrent scenarios.

Co-authored-by: noemotiovon <757486878@qq.com>
2025-10-15 09:29:17 +03:00
Sam/Samuel 7f22fe5d8f metal : add opt_step_adamw and op_sum (llama/16529)
* scaffold to support opt step adamw on metal (not written so far)

* add opt-step-adamw kernel for metal

* pass op->src[4] as a separate buffer to the pipeline

* add bounds check to opt-step-adamw kernel

* complete scaffold for GGML_OP_SUM

* naive GGML_OP_SUM kernel

* remove unwanted comment

* change OP_SUM capability gate

* Add has_simdgroup_reduction to both ops to pass CI
2025-10-15 09:29:17 +03:00
Neo Zhang Jianyu be778c992f fix UT fault cases: count-equal, argsort, pad OPs (llama/16521)
* fix/refactor OP argsort, pad

* fix count-equal op

* update SYCL OP list

* fix format issue

---------

Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
2025-10-15 09:29:17 +03:00
sirus20x6 70eb30f28e ggml : Fix FP16 ELU positive branch (llama/16519)
Co-authored-by: Aaron <shelhamer.aaron@gmail.com>
2025-10-15 09:29:17 +03:00
sirus20x6 53721d6309 ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (llama/16518)
The previous SVE implementation for `ggml_vec_dot_f16_unroll` contained a bug due to a copy-paste error. The wrong variable was used in an FMA instruction, leading to incorrect results. This commit corrects the variable usage and improves the clarity of the code by renaming variables to avoid confusion.

Co-authored-by: Aaron <shelhamer.aaron@gmail.com>
2025-10-15 09:29:17 +03:00
Johannes Gäßler b5fb9b9f58 CUDA: faster tile FA, add oob checks, more HSs (llama/16492) 2025-10-15 09:29:17 +03:00
Georgi Gerganov d201705e71 metal : fix mul-mm condition + fix mul-mv permuted kernels (llama/16494) 2025-10-12 11:16:23 +03:00
Diego Devesa 1cc342427b cuda : avoid initializing unused devices (llama/16510) 2025-10-12 11:16:23 +03:00
Prajwal B Mehendarkar d8f1aa4e1d cmake : Don't define XOPENSOURCE on AIX (llama/16481) 2025-10-12 11:16:23 +03:00
duduta d83fef35df cpu : optimize the ggml NORM operation (llama/15953)
* ggml-cpu: optimize norm operation to use intrinsics or Accelerate

          rename function

          add endif macro comment

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* implement s390x SIMD suggested by @taronaeo

* add TODO comment

* tidy up spaces

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2025-10-12 11:16:23 +03:00
Chenguang Li b9eac9419c CANN: Improve ACL graph matching (llama/16166)
* CANN: improve ACL graph matching

Record `ne` and `nb` information for src tensors and include them in the
graph matching check. This enhances the robustness of ACL graph matching
by preventing incorrect matches when src tensors share the same data
address but differ in shape or stride.

* CANN: add op_params match
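
The stricter matching described above can be sketched as follows. The struct and function names are hypothetical and the four-dimension limit is an assumption; the point is only that two recorded sources match when address, shape (`ne`) and strides (`nb`) all agree, not the address alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DIMS 4 /* assumed number of tensor dimensions */

/* Hypothetical record of one source tensor as captured for graph matching. */
typedef struct {
    const void *data;         /* data address */
    int64_t     ne[MAX_DIMS]; /* elements per dimension */
    size_t      nb[MAX_DIMS]; /* stride in bytes per dimension */
} src_record;

/* True only if address, shape and strides are all identical. */
static bool src_record_matches(const src_record *a, const src_record *b) {
    assert(a != NULL && b != NULL);
    return a->data == b->data
        && memcmp(a->ne, b->ne, sizeof a->ne) == 0
        && memcmp(a->nb, b->nb, sizeof a->nb) == 0;
}
```

A view that reshapes the same buffer from 4x4 to 16x1 keeps the data address but changes `ne` and `nb`, so an address-only comparison would wrongly reuse a captured graph.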
2025-10-12 11:16:23 +03:00
Charles Xu c8b2c56fd2 kleidiai: kernel interface refactoring (llama/16460) 2025-10-12 11:16:23 +03:00
Neo Zhang Jianyu 7df6766b63 refactor soft_max, add soft_max_back (llama/16472)
* refactor to support soft_max_ext

* fix error and support soft_max_back

* rm unused functions

* fix format issue

---------

Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
2025-10-12 11:16:23 +03:00
ai-fonsi 21e6e72a2f Disable CUDA host buffers on integrated GPUs (llama/16308) 2025-10-12 11:16:23 +03:00
Georgi Gerganov 7ef78a72e1 metal : mark FA blocks (llama/16372)
* metal : better unroll in the FA kernels

* metal : index FA blocks

* tests : restore [no ci]

* metal : prevent division by zero in FA kernels

* metal : fix -INF detection logic
2025-10-12 11:16:23 +03:00
Reese Levine 4eea3efc49 ggml webgpu: profiling, CI updates, reworking of command submission (llama/16452)
* Add profiling

* More detailed profiling

* Rework command submission to avoid global locks

* Update wait handling

* try new method of waiting on futures

* Add serializing of command submission in some cases

* Add new pool for timestamp queries and clean up logging

* Serialize command submission in CI and leave a TODO note

* Update webgpu CI

* Add myself as WebGPU codeowner

* Deadlock avoidance

* Leave WebGPU/Vulkan CI serialized

* Fix divide by 0

* Fix logic in division by inflight_threads

* Update CODEOWNERS and remove serialize submit option
2025-10-12 11:16:23 +03:00
Georgi Gerganov 4bce4fa5e9 metal : add support for non-padded FA KV (llama/16148)
* metal : pad K, V and Mask when needed

* cont : simplify

* cuda : add TODO about KV padding requirement

* metal : add comments

* metal : remove mask padding requirement
2025-10-12 11:16:23 +03:00
Georgi Gerganov 6cf0c21b09 tests : add -INF blocks to the KQ mask in the FA tests (llama/16380)
* tests : add -INF blocks to the KQ mask in the FA tests

* cont : bump -INF block size to 64

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* ggml : prevent division by zero in FA CPU op

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-10-12 11:16:23 +03:00
Georgi Gerganov 1a4116f942 metal : various optimizations + refactoring (llama/16446)
* metal : ssm_scan minor opts

* metal : get_rows optimize

* metal : cpy optimize

* metal : ssm_conv opt

* metal : ssm_scan simplify

* metal : ssm_Scan opt
2025-10-12 11:16:23 +03:00
Georgi Gerganov 0e431b3cea ggml : fix unaligned access in AMX code (llama/16315) 2025-10-12 11:16:23 +03:00
Daniel Bevenius 0f29d7c3fa ggml-cpu : fix leftover handling in ggml_vec_scale_f32 for SVE (llama/16443)
This commit updates the leftover handling in ggml_vec_scale_f32.

The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration,
there can be up to (2*ggml_f32_epr - 1) leftover elements.

The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.

Example scenario with 256-bit SVE:
```
ggml_f32_epr  = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n             = 25
np            = 16
leftovers     = 9 elements (16-24)

Original    : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
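
The leftover arithmetic above can be checked with a small scalar model. The sketch below is Python rather than the SVE C code, and `scale_old` / `scale_new` are hypothetical names. It only mimics the loop structure described in this message, treating each "register" as a run of `epr` elements.

```python
def scale_old(x, v, epr):
    """Model of the original code: one leftover pass of at most epr elements."""
    y = list(x)
    step = 2 * epr
    np_ = (len(y) // step) * step            # elements covered by the main loop
    for i in range(0, np_, step):            # two registers per iteration
        for j in range(i, i + step):
            y[j] *= v
    # single pass over the leftovers: covers at most epr elements
    for j in range(np_, min(np_ + epr, len(y))):
        y[j] *= v
    return y


def scale_new(x, v, epr):
    """Model of the fix: keep looping over the leftovers in epr-sized passes."""
    y = list(x)
    step = 2 * epr
    np_ = (len(y) // step) * step
    for i in range(0, np_, step):
        for j in range(i, i + step):
            y[j] *= v
    for i in range(np_, len(y), epr):        # repeat until nothing is left
        for j in range(i, min(i + epr, len(y))):
            y[j] *= v
    return y
```

With `epr = 8` and `n = 25`, the model of the original code leaves index 24 unscaled, while the looped version scales all 25 elements.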
2025-10-12 11:16:23 +03:00
Reese Levine b8bdf06182 ggml webgpu: actually add softmax, fix rms_norm offset (llama/16400)
* implement soft_max

* Fix soft_max data race

* Temporary fix, wait on each submit
2025-10-12 11:16:23 +03:00
Eve 2ca8fa37fa vulkan: use a more appropriate amount of threads when generating shaders (llama/16418)
* use a more flexible amount of threads

* fix windows compile and 0 thread case

* nominmax
2025-10-12 11:16:23 +03:00
Radoslav Gerganov 93882335a8 rpc : check src buffer when copying tensor (llama/16421)
Only dst buffer is guaranteed to be an RPC buffer. Add check for the src
one.
2025-10-12 11:16:23 +03:00
Radoslav Gerganov af51bbab88 rpc : add support for multiple devices (llama/16276)
* rpc : add support for multiple devices

Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.

closes: #15210

* fixes

* use ggml_backend_reg_t

* address review comments

* fix llama-bench backend report

* address review comments, change device naming

* fix cmd order
2025-10-12 11:16:23 +03:00
Acly 49e0a426f3 vulkan : incremental shader builds (llama/16341)
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times

* support dep-files so shaders are recompiled if their included files change

* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled

* vulkan : only write embedded shader .hpp/.cpp when they change

* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier

* fix hang in vulkan-shaders-gen when there are compilation errors

* early out did not decrement compile_count

* clean up

* fix glslc integer dot product test

* unconditionally write the embedded shader cpp output

* replace output filepath in generated dep-files to match output in CMakeLists

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-10-12 11:16:23 +03:00
Georgi Gerganov 93c1305565 metal : fix loop bound in ggml_mem_ranges (llama/16412) 2025-10-12 11:16:23 +03:00
Acly a70144a873 ggml : fix graph reallocation with multiple chunks (llama/16396)
reallocation is needed if a single chunk grows in size,
even if total allocation size stays the same or is lower
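
A minimal sketch of the condition, in Python rather than the allocator's C code. The function names are hypothetical, and the rule that a larger chunk count also forces reallocation is an assumption added for completeness.

```python
def needs_realloc_by_total(old_sizes, new_sizes):
    # Check that only looks at the total allocation size.
    return sum(new_sizes) > sum(old_sizes)


def needs_realloc_per_chunk(old_sizes, new_sizes):
    # Check that looks at every chunk individually.
    if len(new_sizes) > len(old_sizes):      # assumption: extra chunks need new buffers
        return True
    return any(new > old for old, new in zip(old_sizes, new_sizes))
```

For chunk sizes going from `[100, 50]` to `[80, 60]` the total shrinks from 150 to 140, yet the second chunk grows, so only the per-chunk check reports that reallocation is needed.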
2025-10-12 11:16:23 +03:00
Jeff Bolz 2e6888089f vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (llama/16354)
* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE

Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.

For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.

Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.

With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.

* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
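
One plausible way to compute such a range by hand is to clamp the remaining bytes of the buffer to the device limit. The sketch below is a Python model under that assumption, not the code from this change; `descriptor_range` is a hypothetical helper.

```python
def descriptor_range(buffer_size, offset, max_storage_buffer_range):
    """Bytes to bind starting at `offset`, never exceeding the device limit."""
    if offset > buffer_size:
        raise ValueError("offset lies outside the buffer")
    # Remaining bytes after the offset, clamped to maxStorageBufferRange.
    return min(buffer_size - offset, max_storage_buffer_range)
```

A 6 GiB buffer on a device whose `maxStorageBufferRange` is `2**32 - 1` would therefore be bound with a range of `2**32 - 1` bytes rather than the whole buffer.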
2025-10-12 11:16:23 +03:00
Jeff Bolz 90bdcf2ef6 vulkan: Fix FA coopmat1 invalid array indexing (llama/16365)
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
2025-10-12 11:16:23 +03:00
Jeff Bolz fd11cd97ab vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (llama/16316) 2025-10-12 11:16:23 +03:00
Reese Levine 27ebde6afd ggml webgpu: add support for soft_max, optimize rms_norm (llama/16357)
* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-12 11:16:23 +03:00
Piotr Wilkin (ilintar) 33ca8355c4 model : Apertus model implementation (llama/15852)
* First attempt

* No permute during convert (fixes qk tensors), proper norm application.

* RoPE = NeoX

* Coherence!

* Migrate xielu params from tensors to hyperparameters

* Simple CUDA kernel

* Revert stupid LLM refactorings

* Chat template support

* configchecker / flake8 errors

* Reorder unary.cu

* I do conclude that LLMs are, in fact, stupid.

* Fix after merge

* Final newline

* Make xIELU an UNARY_OP

* Final newline

* Correctly account for parameter shift

* Argh.

* Update ggml/src/ggml-cpu/unary-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Refactor: remove unused methods, inline and factorize softplus, add const modifiers

* Revert CUDA changes, implement xIELU as a separate OP

* Pesky newline

* Add float2half / half2float for F16 inputs/outputs

* CUDA variants, attempt 2

* Actually, attempt 3

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Missing convert header

* Proper formula and reference for xIELU in the comments.

* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add tensor mappings for Apertus to global list instead

* Fix lazy on scalars

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Add comment about the constraints on positive/negative alpha

* Change `softplus` to `ggml_softplus`

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-10-12 11:16:23 +03:00
R0CKSTAR e29508be8b musa: update compile flags (llama/16265)
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
2025-10-12 11:16:23 +03:00
uvos b73f67d3f6 HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.0 (llama/16221)
* HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.0

rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA

* CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn
2025-10-12 11:16:23 +03:00
Eve b0560310aa vulkan: make ggml_vk_default_dispatcher support older vulkan headers (llama/16345)
* make ggml_vk_default_dispatcher support older vulkan headers

* simplify with using
2025-10-12 11:16:23 +03:00
lhez 31bb869929 opencl: support pad_ext (llama/15888) 2025-10-12 11:16:23 +03:00
Reese Levine 8208cea829 ggml webgpu: support for rope,div,sub,glu,scale,cont operators (llama/16187)
* Work on rope

* Simplify inplace operation generation and combine mul/add generation

* Work on rope variants

* implement neox rope

* rope complete

* Add sub,div,glu operators

* implement scale op

* Update cpy shader to handle cont/more types

* formatting

* Update test vars printing for rope,rms_norm

* Avoid ROPE hardcoded constants

* Add TODO to change ROPE constants to enum

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix TODO comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-12 11:16:23 +03:00
lhez 199626d79e opencl: support ne3 in get_rows (llama/15866) 2025-10-12 11:16:23 +03:00
Georgi Gerganov 527ff158d0 ggml : bump version to 0.9.4 (ggml/1363) 2025-09-30 13:54:08 +03:00
anavp-nvidia 62b3b86e3f
cuda : Enable CUDA Graph usage for Nemotron Nano v2 (NemotronH) (llama/16328)
* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs

* fix to ensure test-backend-ops check passes
2025-09-30 12:31:04 +03:00
Georgi Gerganov 78f85f2b92
metal : dynamic simdgroups for MV kernels (llama/16340)
* metal : dynamic simdgroups for MV kernels

* cont : minor
2025-09-30 12:31:04 +03:00
Charles Xu 01e86b69ab
kleidiai : fix work size and threads sync for fp16 (llama/16246) 2025-09-30 12:31:04 +03:00
alex-spacemit 35ebdf7304
ggml: riscv: add riscv spacemit backend (llama/15288)
* ggml: add spacemit backend

Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23

* add new line at end of file

Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2

* add riscv zba extension limit

Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce

* fixed for review comments, file renamed and format

Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce

* fixed for code format, after clang-format

Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2

* use _Float16 instead of __fp16

Change-Id: I039fb02bb95270e641bc4442204e658735859d43

* add ci for riscv64-spacemit-ime-native

Change-Id: I711c1033061df1a289ea77891b2997599dfe8279

* update debian-13-riscv64-spacemit-ime-native ci label

Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a

* remove license comment for spacemit ime

Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3

* upgrade binutils for gcc ime

Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45

* add spacemit ime cross jobs

Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6

* remove native compile for riscv64-spacemit-ime

Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e

* ci : add caching for spacemit ime cross toolchain

Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de

* ci: bug fixed for cache path and env

Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a

* Update .github/workflows/build-linux-cross.yml for cache path

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* bugfix for build-linux-cross.yml, syntax error

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: cailinxi <linxi.cai@spacemit.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-30 12:31:03 +03:00
Rafal Lewczuk 94fe9bbe2b
ggml-backend : add root cause in error message if loading backend library fails (llama/16172)
This PR adds additional information to the error message when loading a backend library via ld_load_library() fails. This makes it easier to spot why a backend library did not load (missing library, missing dependency, unresolved symbol, etc.).
2025-09-30 12:31:00 +03:00
Georgi Gerganov 22c12ee86d
ggml : remove obsolete files (#0) 2025-09-29 16:47:30 +03:00
Georgi Gerganov 3201382792
cmake : remove metal flag (llama/0) 2025-09-29 15:18:13 +03:00
Sigbjørn Skjæret 112e10f2e4
ggml : check cuda and metal argsort limits and add test (llama/16323)
* check cuda argsort limits and add test

* add metal check
2025-09-29 15:18:12 +03:00
Georgi Gerganov 7ce0a7bcd0
ggml : fix dependencies for ggml_set_rows (llama/16318) 2025-09-29 15:18:12 +03:00
Jeff Bolz a375e4c4d2
vulkan: Fix validation failure in quantized flash attention (llama/16292) 2025-09-29 15:18:12 +03:00
Sigbjørn Skjæret 5c6e795607
ggml : fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 (llama/16307)
* fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32

* add test that fails on simd
2025-09-29 15:18:12 +03:00
Jeff Bolz 55d45edf6d
vulkan: 64-bit im2col (llama/16135)
* vulkan: 64-bit im2col

Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.

* fix validation error for large im2col
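
Why 64-bit address calculations matter for large convolutions can be shown with plain integer arithmetic. This is a Python sketch; the index expression `row * row_elems + col` is an assumed stand-in for the shader's real addressing, and both helpers are hypothetical.

```python
def offset_u32(row, row_elems, col):
    # Element offset computed with 32-bit wraparound.
    return (row * row_elems + col) & 0xFFFFFFFF


def offset_u64(row, row_elems, col):
    # Same offset computed with 64-bit arithmetic.
    return (row * row_elems + col) & 0xFFFFFFFFFFFFFFFF
```

Once the product exceeds `2**32`, the 32-bit result wraps around and points at the wrong element, while the 64-bit result stays correct.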
2025-09-29 15:18:12 +03:00
Georgi Gerganov 0102733cca
metal : extend mat-mat multiplication support (llama/16225)
* metal : support mul_mm with src1->type == GGML_TYPE_F16

* metal : support mul_mm_id with src1->type == GGML_TYPE_F16

[no ci]

* metal : mul_mm support ne00 % 32 != 0

* metal : support mul_mm_id with ne00 % 32 != 0

* cont : remove unnecessary unrolls

* cont : simplify data loading

* metal : optimize mul_mm when output bounds checks are not needed
2025-09-29 15:18:12 +03:00
Georgi Gerganov 45976f2857
metal : fuse non-sequential nodes (llama/16102)
* metal : fuse non-sequential nodes

* cont : add comment

* cont : simplify bounds checks
2025-09-29 15:18:12 +03:00
Jeff Bolz 91ab93b756
vulkan: handle mat_mul with A matrix > 4GB (llama/16176)
* vulkan: handle mat_mul with A matrix > 4GB

This change splits mat_mul operations with huge A matrix into chunks in the M
dimension. This works well for stable-diffusion use cases where the im2col
matrix has very large M.

Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so stride should be set after.

* build fixes
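
The chunking along M can be sketched as follows. This is a Python model, not the Vulkan backend code; `split_rows` and its parameters are hypothetical, and the rule of one row minimum per chunk is an assumption.

```python
def split_rows(m, row_bytes, max_bytes):
    """Split m rows of A into (first_row, row_count) chunks of at most max_bytes."""
    rows_per_chunk = max(1, max_bytes // row_bytes)   # assumption: at least one row
    return [(start, min(rows_per_chunk, m - start))
            for start in range(0, m, rows_per_chunk)]
```

Each chunk is then multiplied separately, and because the split is along M, every chunk writes a disjoint set of output rows.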
2025-09-29 15:18:12 +03:00
Jeff Bolz eb982dd786
vulkan: support arbitrary KV dimension in flash attention (llama/16160)
The "Clamp" spec constant is already based on whether KV is a multiple of Bc,
so use that to control whether bounds checking is performed. Add bounds checking
to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V
tensors are already optionally clamped, nothing else needed to be changed).
2025-09-29 15:18:12 +03:00
Acly bc1ac13c2f
vulkan : make the vulkan.hpp dynamic dispatcher instance private (llama/16224)
* don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same
2025-09-29 15:18:12 +03:00
Aman Gupta 85e4455cd3
CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 (llama/16277)
* CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32

This commit adds mul_mat_id support for ncols_dst >= 16. It does this by
packing ncols_dst tiles into the blockDim.y.

My tests on an RTX 3090 show that this is faster than the cuBLAS fallback
for f16 up to bs=64, and for f32 up to bs=32

* Review: refactor if statement
2025-09-29 15:18:11 +03:00
Johannes Gäßler e856483cd6
CUDA: refactor and deduplicate vector FA kernels (llama/16208)
* CUDA: refactor and deduplicate vector FA kernels
2025-09-29 15:18:11 +03:00
Dmytro Minochkin 88dd9e0d45
vulkan: throw system error instead of SIGABRT during init on older devices (llama/16156)
* Throw system error on old Vulkan driver rather than SIGABRT

* Optionally handle any potential error in vulkan init
2025-09-29 15:18:11 +03:00
Jeff Bolz 97bd65f90f
vulkan: support GET_ROWS for k-quants (llama/16235)
The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few
changes - add a_offset and divide iqs by 2. It's probably possible to call
these functions from mul_mm_funcs and avoid the duplication, but I didn't go
that far in this change.
2025-09-29 15:18:11 +03:00
Aaron Teo 23b3598952
devops: add s390x & ppc64le CI (llama/15925)
* devops: move s390x and ppc64le ci build

we have access to ubuntu-24.04-s390x and ppc64le images now

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: disable ppc64le for now since they have compiler errors

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: stop warnings as errors

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: switch to non-macro flag

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: going the llama macro route

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add big-endian gguf test models

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: disable ppc64le to test s390x, check test build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: dup .gguf.inp files for big-endian tests

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: dup .gguf.out files for big-endian too

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add python setup and endian byteswap

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: poor thing does not have s390x python3

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add missing rust compiler for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: try rust actions runner

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "devops: try rust actions runner"

This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: try a different path for rust

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: dump home directory and user info

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: install gguf-py only

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: missed relative path

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove big-endian files since local swapping is working

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: revert test-tokenizer-0 cmakelists

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fix unicode flags conversion from and to uint16_t

Bitfields are allocated in different order on s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Simplify byteswap command

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fix endianness detection in vocab loader

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Disable test-thread-safety on s390x

In this test a model is downloaded,
then immediately loaded to check if more downloads are needed,
and then used for test.

There is no clean way to separate all those steps
 to add byteswapping between them, so just skip this test.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fix q8_0 test in test-quantize-fns

vec_signed uses unexpected rounding mode.
Explicitly use different rounding function.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add big-endian stories260K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add s390x test-eval-callback

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix test does not exist

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix model not found llama-eval-callback

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fix q3_K dot product error in test-quantize-fns on s390x

Array q8bytes had only 4 elements allocated, but 8 elements were accessed.
This led to an out-of-bounds write, a later out-of-bounds read of the
overwritten values, and an incorrect result.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: re-enable ppc64le for testing

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: activate test-thread-safety for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: disable ppc64le tests

for some reason it keeps failing test-thread-safety tests and I do not
    have a machine that is able to replicate the tests.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: LLAMA_FATAL_WARNINGS=ON

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Correct repository URL for s390x for test-thread-safety model

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fix fs_get_cache_directory

Ensure it works even if both XDG_CACHE_HOME and HOME are unset.
This might happen in containers.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Re-enable CI for ppc64le

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fortify ggml_rope_impl

Only memcpy data from sections argument if it's non-NULL.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way

* Update URL for big-endian model

* Update .github/workflows/build.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update remaining mentions of BE models to ggml-org/models repo

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com>
Co-authored-by: Aleksei Nikiforov <103434461+AlekseiNikiforovIBM@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-29 15:18:11 +03:00
Georgi Gerganov 670d54ef5d
metal : report OOM errors (llama/16274) 2025-09-29 15:18:11 +03:00
Adrien Gallouët 9823c5cc51
common : use cpp-httplib as a cURL alternative for downloads (llama/16185)
* vendor : update httplib

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* common : use cpp-httplib as a cURL alternative for downloads

The existing cURL implementation is intentionally left untouched to
prevent any regressions and to allow for safe, side-by-side testing by
toggling the `LLAMA_CURL` CMake option.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml : Bump to Windows 10

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-29 15:18:11 +03:00
Aaron Teo 89a7b4d22c
ggml-cpu: implement MXFP4 SIMD for s390x (llama/16193)
* ggml-cpu: impl mxfp4 s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: missing s = sumf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix incorrect kval_mxfp4 type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rework mxfp4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: missing delta calc

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix typo

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix typo for vec_splats

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: expand to 2 blocks per loop

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add unroll to boost perf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: back to 1 block per loop to test perf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: back to 1 block per loop to test perf"

This reverts commit 1fe55724e2dc295701101bf838bdd4a512237492.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rm unroll from single block

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-29 15:18:11 +03:00
R0CKSTAR 98ac209ae1
musa: fix build warnings (llama/15611)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-29 15:18:10 +03:00
Aman Gupta d9bf63cfb8
CUDA: add a fused top-K MoE kernel (llama/16130)
* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
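
The three steps listed above can be written as a scalar reference for a single token. This is a Python sketch, not the CUDA kernel; `topk_moe` is a hypothetical name. The lowest-index tie-break and the meaning of the optional norm (rescaling the selected weights to sum to 1) are assumptions, since the commit text does not spell them out.

```python
import math


def topk_moe(logits, k, normalize=False):
    """Return (expert ids, weights) for one token's expert logits."""
    # 1. softmax over the logits (max subtracted for numerical stability)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. repeated argmax to pick the k largest probabilities
    remaining = list(probs)
    ids, weights = [], []
    for _ in range(k):
        # assumption: ties go to the lowest expert index
        best = max(range(len(remaining)), key=lambda i: (remaining[i], -i))
        ids.append(best)
        weights.append(probs[best])
        remaining[best] = -math.inf

    # optional norm (assumption: selected weights are rescaled to sum to 1)
    if normalize:
        s = sum(weights)
        weights = [w / s for w in weights]

    # 3. the kernel would write weights + ids to global memory; here we return them
    return ids, weights
```

A reference like this is mainly useful for comparing against the fused kernel's output on small inputs.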
2025-09-29 15:18:10 +03:00
junchao-zhao 24ea5476de
ggml : fix loongarch lsx compilation error (llama/15864) 2025-09-29 15:18:10 +03:00
Daniel Bevenius 611ff19f20
ggml : remove -dev suffix from release version (ggml/1355)
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
formatted as just `MAJOR.MINOR.PATCH`.
2025-09-29 15:18:10 +03:00
Daniel Bevenius 06d7b3d124
ggml : bump version to 0.9.3 (ggml/1353) 2025-09-29 15:18:10 +03:00
Georgi Gerganov ac678efb35
metal : fuse NORM + MUL + ADD, support non-multiples of 4 (llama/16220)
* metal : fuse NORM + MUL + ADD

* metal : support norms of non-multiple of 4

* cont : fix comment [no ci]
2025-09-29 15:18:10 +03:00
Georgi Gerganov 268f1c961b
metal : relax reorder conditions (llama/16216) 2025-09-29 15:18:10 +03:00
Georgi Gerganov 0a5b811f2e
metal : restore im2col perf (llama/16219) 2025-09-29 15:18:10 +03:00
Radoslav Gerganov 0946619662
rpc : use ggml logging facilities
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
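
The early check can be sketched as below. This is Python rather than the C++ macro; `make_log_dbg` and `sink` are hypothetical. Treating the variable as enabled whenever it is set, regardless of its value, is an assumption.

```python
import os


def make_log_dbg(sink, env=os.environ):
    """Build a debug logger that is a no-op unless RPC_DEBUG is set."""
    enabled = env.get("RPC_DEBUG") is not None   # checked once, up front

    def log_dbg(fmt, *args):
        if not enabled:                          # early out before any formatting
            return
        sink(fmt % args)

    return log_dbg
```

Checking the flag before formatting means a disabled logger costs only one branch per call.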
2025-09-29 15:18:10 +03:00
Johannes Gäßler cd431223e0
llama: print memory breakdown on exit (llama/15860)
* llama: print memory breakdown on exit
2025-09-29 15:18:10 +03:00
Acly 5069c08034
ggml : split graph allocations according to backend max buffer size (llama/15815)
* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
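
The (chunk, offset) bookkeeping can be modelled in a few lines. This is a heavily simplified Python sketch, not gallocr itself: `assign_chunks` is hypothetical, tensors are never freed, and alignment is ignored. It only shows first-fit placement into existing chunks, opening a new chunk when nothing fits, and letting an oversized tensor exceed the maximum in its own chunk.

```python
def assign_chunks(sizes, max_chunk_size):
    """Return ([(chunk, offset), ...], bytes used per chunk) for the given sizes."""
    chunk_used = []      # bytes handed out so far in each chunk
    placements = []
    for size in sizes:
        # reuse leftover space of any existing chunk first
        for idx, used in enumerate(chunk_used):
            if used + size <= max_chunk_size:
                break
        else:
            # nothing fits: open a new chunk (it may exceed max_chunk_size)
            chunk_used.append(0)
            idx = len(chunk_used) - 1
        placements.append((idx, chunk_used[idx]))
        chunk_used[idx] += size
    return placements, chunk_used
```

Tracking a pair per tensor avoids translating between chunk-local and "global" offsets, which is the simplification the message describes.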
2025-09-29 15:18:09 +03:00
Xiangyan Sun 41245891c1
ggml-cpu: Respect cpumask settings (llama/16164) 2025-09-29 15:18:09 +03:00
Sigbjørn Skjæret 73e8f3acb8
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (llama/15928)
* fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl

* change initialization to true
2025-09-29 15:18:09 +03:00
Aaron Teo c706a50746
zdnn: refactor codebase + add docs (llama/16178)
* zdnn: initial matmul refactor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm static from funcs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update ggml-zdnn.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: change header files to hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch to common.hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move mulmat forward around

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm inline from utils

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: add zDNN docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-29 15:18:09 +03:00
Daniel Bevenius d8d31e3638
ggml-cpu : fix typo in gemm comments [no ci] (llama/16189) 2025-09-29 15:18:09 +03:00
Sigbjørn Skjæret 4e32ee733b
ggml : implement set_rows with i32 index (llama/16159)
* implement set_rows with i32 index

* template fix

* test quantized path

warnings--

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* forgotten name change

* deduplicate cuda/sycl and test-fix

* indent++

* vulkan: support set_rows with i32 index type (llama/16162)

* disable i32 index for webgpu for now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-09-29 15:18:09 +03:00
Georgi Gerganov df672c6372
ggml : extend ggml_can_fuse to work with non-sequential nodes (llama/16123)
* ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph

* cont : fix wrong bounds check condition

* cont : remove unnecessary overload
2025-09-29 15:18:09 +03:00
Georgi Gerganov 973054a8cd
ggml : add ggml_op_is_empty (llama/16122)
* ggml : add ggml_op_is_empty

* ggml : move to ggml-impl.h
2025-09-29 15:18:09 +03:00
Shin-myoung-serp 9f673df08d
Vulkan: add conv_transpose_2d operation (llama/16022)
* Vulkan: add conv_transpose_2d operation

* Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L)

* Vulkan: fix incorrect indentation in conv_transpose_2d shader

* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation

* Vulkan: revert the order of the index calculation and bound check in conv_2d shader

* Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation.

* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
2025-09-29 15:18:09 +03:00
Jeff Bolz 14723f25a1
vulkan: add RTE variants of exp shader (llama/16165)
This fixes some failures on Turing where "round to zero" rounds to the max f16
value but the CPU reference value is infinite.
2025-09-29 15:18:08 +03:00
Ruben Ortlam 95b29fab78
vulkan: vec dot matrix multiplication fix (llama/16151)
* vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching

* add odd m/n + odd k test with batching
2025-09-29 15:18:08 +03:00
lhez 4b7f09ac0b
opencl: fix concat crash on win arm64 with Adreno (llama/15944) 2025-09-29 15:18:08 +03:00
lhez 0a7096f4f3
opencl: initial `q8_0` mv support (llama/15732) 2025-09-29 15:18:08 +03:00
Giuseppe Scrivano eae2be0ca2
vulkan: optimize UMA buffer operations and fix driver hangs (llama/16059)
* vulkan: optimize UMA buffer operations and fix driver hangs

The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.

[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang

* vulkan: implement deferred_memset on UMA

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-09-29 15:18:08 +03:00
Jeff Bolz 9a6c2036a9
vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (llama/16086) 2025-09-29 15:18:08 +03:00
Georgi Gerganov 8d10ded025
ggml : prepare for development of 0.9.2-dev 2025-09-29 15:18:08 +03:00
Georgi Gerganov d89164a08d
ggml : bump version to 0.9.1 2025-09-29 15:18:05 +03:00
Ruben Ortlam 76d0934287
vulkan: use vec dot for matrix matrix multiplications (llama/16056)
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions

* use fma instead of dot to fix Nvidia and Apple performance issues
2025-09-20 13:46:39 +03:00
Xuan-Son Nguyen 2ad00d5586
ggml : refactor forward_dup for cpu backend (llama/16062)
* ggml : refactor forward_dup for cpu backend

* clean up a bit

* add quant/dequant perf test
2025-09-20 13:46:39 +03:00
Adrien Gallouët 4d8cd07825
ggml-amx : fix ggml_amx_init() on generic Linux (llama/16049)
Generalize Linux check to `__linux__` to support non-glibc systems (like musl).
Also, return `false` on unknown/untested OS.

Without this commit, the code compiles (with warnings) but fails:

    register_backend: registered backend CPU (1 devices)
    register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
    build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
    system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
    ....
    print_info: n_ctx_orig_yarn  = 262144
    print_info: rope_finetuned   = unknown
    print_info: model type       = 4B
    Illegal instruction (core dumped)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-20 13:46:39 +03:00
Adrien Gallouët 4575f96873
cmake : fix static linking for OpenMP on Unix-like systems (llama/16031)
When compiling with GGML_STATIC=ON, the build process would produce a
binary that was still dynamically linked to OpenMP. This defeats the
purpose of a static build:

    $ cmake -B build \
            -DBUILD_SHARED_LIBS=OFF \
            -DLLAMA_CURL=OFF \
            -DGGML_CCACHE=OFF \
            -DGGML_NATIVE=OFF \
            -DGGML_STATIC=ON

    $ ldd llama-server
            linux-vdso.so.1 (0x0000e1a434e3b000)
            libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
            libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
            libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
            libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
            libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
            /lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)

This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES`
to prioritize `.a` files, forcing CMake to link the static version of
the library.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-20 13:46:39 +03:00
Shawn Gu f4a225cea6
opencl: optimize mxfp4 kernels (llama/16037)
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2025-09-20 13:46:39 +03:00
Jeff Bolz 7fcb7e83ec
rename optimize_graph to graph_optimize (llama/16082) 2025-09-20 13:46:39 +03:00
Bowen Han fce6354e0f
CUDA: Optimize PAD_REFLECT_1D (llama/15957)
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D

* use fast_div to improve performance

* Apply suggestion from JohannesGaessler

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Apply suggestion from JohannesGaessler

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* optimize

* use a concise expression to further speedup the cuda kernel

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:46:38 +03:00
Johannes Gäßler 05bdfd4380
CUDA: fix compilation on CC 6.0 (llama/16091) 2025-09-20 13:46:38 +03:00
Georgi Gerganov 960aaa9904
metal : use function constants for mul_mv_ext kernels (llama/16074)
* metal : use function constants for mul_mv_ext kernels

ggml-ci

* metal : remove NW template argument

ggml-ci

* metal : adjust constants

ggml-ci
2025-09-20 13:46:38 +03:00
Sigbjørn Skjæret 225d7c1d5a
cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (llama/16060) 2025-09-20 13:46:38 +03:00
Georgi Gerganov d37f590a77
metal : improve F32, F16 and BF16 mat-vec multiplication (llama/16057)
* metal : improve F32, F16 and BF16 mat-vec multiplication

ggml-ci

* metal : make the NSG a function constant in mul_mv kernels

ggml-ci
2025-09-20 13:46:38 +03:00
Jhen-Jie Hong 32b6d9c134
metal : avoid call free for non-owned buffer (llama/16067) 2025-09-20 13:46:38 +03:00
Georgi Gerganov 1f24b1df4d
metal : handle nil cv during pipeline creation (llama/16065)
ggml-ci
2025-09-20 13:46:38 +03:00
Chenguang Li c46adc0817
CANN: Remove print (llama/16044)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:46:38 +03:00
Reese Levine 1361f679cc
GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (llama/16018)
* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* some f32 tests passing

* Disable set_rows until it's implemented

* f32 add all tests passing

* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Add templated addition, clean up code

* Get addition and multiplication working

* Implement rms_norm

* Add get_rows implementation

* Add new get_rows files

* Refactor use of wg size entry

* Fix compilation

* Try manually unrolled q4_0 quant

* Revert "Try manually unrolled q4_0 quant"

This reverts commit 77f8b96515f7e640ae4b0e44f066321fbc4a6166.

* Move to constant max wg size

* Check for tensor size in supports_op

* Vectorize f32 and change default workgroup size

* Move f32 get_rows from < 4 to % 4 != 0

* fix linter errors

* Add in-place tests

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
2025-09-20 13:46:37 +03:00
Georgi Gerganov eb2c01f92e
metal : refactor + optimize v2 (llama/15995) 2025-09-20 13:46:10 +03:00
Johannes Gäßler d452f0cf8c
CUDA: fix FA occupancy, optimize tile kernel (llama/15982) 2025-09-20 13:45:30 +03:00
Eve e96b285011
vulkan: automatically remove unsupported devices (llama/15976)
* remove unsupported vulkan devices

* make this happen during selection instead

* pass by reference
2025-09-20 13:45:30 +03:00
Chenguang Li e32c3b0fd3
CANN: Optimize ggml_cann_set_device (llama/15935)
* CANN: Fix ggml_cann_set_device to avoid redundant device switches

- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.

* CANN: add device default id
2025-09-20 13:45:30 +03:00
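The redundant-switch check described in the commit above can be sketched roughly as follows. This is a minimal Python illustration, not the CANN backend code: `set_device`, `_tls`, `switch_log`, and `fake_runtime_set_device` are hypothetical names (the commit itself refers to `aclrtSetDevice`), and tracking the current device per thread is an assumption based on the "thread/device consistency" remark.

```python
import threading

# Hypothetical stand-in for the runtime call (aclrtSetDevice in the commit).
switch_log = []

def fake_runtime_set_device(device_id):
    switch_log.append(device_id)

# Per-thread record of the device that is currently active (assumption).
_tls = threading.local()

def set_device(device_id):
    """Switch devices only when the requested one is not already current."""
    if getattr(_tls, "current", None) == device_id:
        return False  # redundant switch skipped
    fake_runtime_set_device(device_id)
    _tls.current = device_id
    return True
```

Calling `set_device(0)` twice in a row reaches the runtime only once; the second call returns early.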
Daniel Bevenius 5c524bb879
ggml : fix padding in timestep embedding kernels (llama/15932)
* ggml : remove adding extra dim timestep embedding

This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.

The motivation for this change is that the extra dimension was
unnecessary and caused an issue in the kernels, which were not expecting
it, resulting in uninitialized memory for the second to last dimension.

* ggml-cuda : fix padding in timestep embedding kernel

This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.

* ggml-metal : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-opencl : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-sycl : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-vulkan : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-cpu : fix padding in timestep embedding function

This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
2025-09-20 13:45:30 +03:00
Jake Karnes f72ec185fb
CUDA: fix im2col_3d to respect non-contiguous inputs (views) (llama/15956)
* fix im2col_3d to respect non-contiguous inputs (views)

The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides.

This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.

* use ggml_element_size() for src strides

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:45:30 +03:00
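The difference between compact indexing and stride-based indexing that the im2col_3d fix above relies on can be illustrated with a small sketch. It is written in Python rather than CUDA, and `compact_offset`, `strided_offset`, and the 4 x 3 example buffer are hypothetical; only the idea (divide the byte strides `nb[]` by the element size and use them instead of products of dims) comes from the commit message.

```python
def compact_offset(idx, ne):
    """Element offset assuming a densely packed layout (products of dims)."""
    offset, stride = 0, 1
    for i, n in zip(idx, ne):
        offset += i * stride
        stride *= n
    return offset

def strided_offset(idx, nb, elem_size):
    """Element offset computed from byte strides nb[], as a view requires."""
    return sum(i * (b // elem_size) for i, b in zip(idx, nb))

# Parent buffer: 4 x 3 floats (dimension 0 is the fastest-moving one).
elem_size = 4
parent = list(range(12))
parent_nb = (elem_size, 4 * elem_size)

# View of the first two columns: shape 2 x 3, but rows are still 4 elements apart.
view_ne = (2, 3)
view_nb = parent_nb

idx = (1, 2)  # column 1, row 2 of the view
right = parent[strided_offset(idx, view_nb, elem_size)]  # reads element 9
wrong = parent[compact_offset(idx, view_ne)]             # reads element 5
```

For a contiguous tensor both functions agree, which is why the problem only shows up with views.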
yael-works a642b533a4
SYCL: Add COUNT_EQUAL operator support (llama/15991)
* SYCL: Add COUNT_EQUAL operator support (rebased on master)

* SYCL: remove duplicate op_count_equal definition

* tests: remove test_count_equal_typed and use test_count_equal for all cases

* tests: keep only I32 case for COUNT_EQUAL as suggested

* tests: keep only I32 case for COUNT_EQUAL as requested
2025-09-20 13:45:30 +03:00
Aman Gupta 10bd5d3626
CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (llama/15926) 2025-09-20 13:45:30 +03:00
Georgi Gerganov 82a8c141ea
metal : remove memory pools (llama/15966)
* metal : remove mem pool usage

ggml-ci

* metal : remove mem pool implementation

ggml-ci

* metal : take into account the actual allocated memory of the tensor

ggml-ci

* cont : use ggml_backend_buft_get_alloc_size

ggml-ci

* cont : improve, comments

ggml-ci

* cont : add functions for the extra tensor sizes

* metal : add comments

ggml-ci

* metal : implement .get_alloc_size for the rest of the buffer types

ggml-ci

* metal : remove ggml_metal_heap

ggml-ci
2025-09-20 13:45:29 +03:00
Ruben Ortlam c36358cb3c
Vulkan: Clean up mul_mm shader (llama/15987)
* vulkan: move mul_mm dequantization steps into a separate file and functions

* improve mul_mm vector load code

* fix debug mode issues and warnings
2025-09-20 13:45:29 +03:00
Georgi Gerganov 2d3f15607f
metal : fix kernel requirements (llama/15983)
* metal : fix kernel requirements

ggml-ci

* cont : fix supports_op

* cont : fix supports_op for ARGMAX
2025-09-20 13:45:29 +03:00
Aaron Teo 7dca05ca77
ggml-zdnn: rm user mapped buffers (llama/15965)
* ggml-zdnn: rm user mapped buffers

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm dead code

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt to fix missing extra data buffer free

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-20 13:45:29 +03:00
Jeff Bolz 1789ed3f2c
vulkan: fix failing dequant shaders (llama/15862)
* vulkan: fix failing dequant shaders

* add missing const
2025-09-20 13:45:29 +03:00
Jeff Bolz a3defb0a3b
vulkan: initialize vulkan-hpp to allow using extension function pointers (llama/15705)
Use this to query register count for shader compiles on NVIDIA. Currently
this is only for performance debug, but it could eventually be used in some
heuristics like split_k.
2025-09-20 13:45:29 +03:00
Georgi Gerganov 2caf15d68a
metal : refactor kernel loading (llama/15964)
* metal : refactor bin kernels loading

ggml-ci

* metal : refactor rms kernel loading

ggml-ci

* ci : try to add memory leaks check

ggml-ci

* ci : try to enable memory leak detection for Mac

* cont : seems to be working
2025-09-20 13:45:29 +03:00
Georgi Gerganov 0d36ba9e1a
metal : allow ops to run concurrently (llama/15929)
* metal : run graphs ops concurrently

ggml-ci

* cont : add flags for debugging and disabling concurrency

ggml-ci

* cont : refactor and handle fusing

ggml-ci

* cont : simplify - no need to use GPU address

ggml-ci

* cont : prepare mem ranges for reuse + add ggml-metal-common.cpp

ggml-ci

* cont : avoid redundant keywords in cpp [no ci]

* metal : reorder graph for better concurrency

ggml-ci

* metal : fix race on mem pool buffers

ggml-ci

* cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE

ggml-ci

* cont : refactor, optimize, add comments

ggml-ci

* cont : refactor ggml-metal.m

ggml-ci

* minor : update logs [no ci]
2025-09-20 13:45:29 +03:00
Georgi Gerganov 20a930ec94
metal : fix memory leaks (llama/15962)
ggml-ci
2025-09-20 13:45:28 +03:00
Aaron Teo e902731ccc
ggml-zdnn: fix #15414, activate FP16 and BF16 acceleration and incorrect zTensor free (llama/15839) 2025-09-20 13:45:28 +03:00
Ruben Ortlam 424c85f22a
Vulkan iGPU device selection overhaul and PCI ID API support (llama/15947)
* vulkan: implement ggml igpu device type, implement pci id support

* fix compiler warning

* prevent printf overflow warning
2025-09-20 13:45:28 +03:00
Mathieu Baudier 5a752bab84
vulkan: Make device memory check more portable (llama/15939) 2025-09-20 13:45:28 +03:00
Neo Zhang Jianyu cd764eaf2b
Revert "sycl: add usage of enqueue_functions extension (llama/14244)" (llama/15910)
* Revert "sycl: add usage of enqueue_functions extension (#14244)"

This reverts commit 8308f98c7fb778e54bf75538f5234d8bd20915e9.

* fix missed revert code, format the code
2025-09-20 13:45:28 +03:00
Diego Devesa 555dcb3e01
ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (llama/15797)
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type

ggml-backend : add device id to device props

llama : only use iGPU devices if there are no GPU devices

llama : do not use multiple devices from different backends with the same device id
2025-09-20 13:45:28 +03:00
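The two selection rules listed in the commit above (use iGPUs only when no GPU is present, and do not use the same device id through several backends) could look roughly like this. It is a hedged Python sketch, not llama.cpp code: the `Device` class, its fields, and `select_devices` are hypothetical, and keeping the first device seen for a given id is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Device:
    backend: str    # illustrative label, e.g. "CUDA" or "Vulkan"
    kind: str       # "GPU" or "IGPU"
    device_id: str  # e.g. a PCI bus id; empty if unknown

def select_devices(devices):
    """Prefer discrete GPUs, fall back to iGPUs, drop duplicate device ids."""
    gpus = [d for d in devices if d.kind == "GPU"]
    pool = gpus if gpus else [d for d in devices if d.kind == "IGPU"]
    seen, chosen = set(), []
    for d in pool:
        if d.device_id and d.device_id in seen:
            continue  # same physical device already exposed by another backend
        if d.device_id:
            seen.add(d.device_id)
        chosen.append(d)
    return chosen
```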
Johannes Gäßler f0768eb575
CUDA: larger SRAM reads for tile FA, AMD FP16 dot (llama/15927)
* CUDA: larger SRAM reads for tile FA, AMD FP16 dot

* fix logic for availability of v_dot2_f32_f16
2025-09-20 13:45:28 +03:00
Daniel Bevenius 020eb19eb3
ggml-cpu : add check for ARM MATMUL_INT8/i8mm support (llama/15922)
This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling
MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when
the target hardware actually supports them.

The motivation for this is to fix ggml CI build failures where the
feature detection correctly identifies that i8mm is not supported,
adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are
still enabled, causing the compiler to attempt to use vmmlaq_s32
intrinsics without i8mm support.

Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499
2025-09-20 13:45:28 +03:00
Charles Xu b079d9c8b0
kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed (llama/15614)
* kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed

* removes the Whisper-specific check for GET_ROWS support
2025-09-20 13:45:27 +03:00
hipudding dadf73665a
CANN: Disable acl_graph for prefill stage (llama/15933)
Since the prefill length is not fixed, graphs constructed for the
prefill stage cannot be reused. For this reason, ACL graph
execution is disabled by default during prefill.
2025-09-20 13:45:27 +03:00
Oliver Simons f5ef0e25e2
CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance (llama/15872)
* Add fastdiv and fastmodulo to k_bin_bcast kernel

* Address review comments

* `prod_` instead of `prod` suffix

* Add test case for `k_bin_bcast_unravel` in CUDA backend
2025-09-20 13:45:27 +03:00
Daniel Bevenius 3617008c37
ggml-cpu : fix padding in ggml_timestep_embedding (llama/15917)
This commit fixes the zero padding for odd dimensions in
ggml_compute_forward_timestep_embedding_f32.
The motivation for this is that currently if an odd dimension is used,
the padding check incorrectly uses the dimension value for indexing.
For example, with dim=15:

Elements 0-6 are set to cosine values
Elements 7-13 are set to sine values
Element 14 is left uninitialized (contains garbage)
Element 15 is correctly set to zero

This fix changes embed_data[dim] to embed_data[2 * half] so that
element 14 (the first unused element) is properly set to zero as well
as the last element.

Resolves: https://github.com/ggml-org/ggml/issues/1324
2025-09-20 13:45:27 +03:00
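The dim=15 case from the commit above can be reproduced with a small sketch. This is Python, not the ggml CPU code; `timestep_embedding_row`, the `fixed` flag, and the `max_period` default are hypothetical, and the frequency formula is the usual sinusoidal embedding, assumed here for illustration. NaN stands in for uninitialized memory.

```python
import math

def timestep_embedding_row(timestep, dim, fixed, max_period=10000):
    """Fill one embedding row; NaN marks memory that was never written."""
    half = dim // 2
    # For odd dim the row was allocated with one extra element (dim + 1).
    row = [math.nan] * (dim + dim % 2)
    for j in range(half):
        arg = timestep * math.exp(-math.log(max_period) * j / half)
        row[j] = math.cos(arg)          # elements 0 .. half-1
        row[j + half] = math.sin(arg)   # elements half .. 2*half-1
    if dim % 2 != 0:
        row[dim] = 0.0                  # old behaviour: only the last element
        if fixed:
            row[2 * half] = 0.0         # fix: also zero the first unused element
    return row
```

With `dim=15` and `fixed=False`, element 14 stays NaN while element 15 is zero; with `fixed=True` both are zero.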
Georgi Gerganov 7eae055e61
metal : make the backend async (llama/15906) 2025-09-20 13:44:27 +03:00
Chenguang Li 4d453b14a9
CANN: Add ROPE sin/cos cache for reuse (llama/15912)
* CANN: Add ROPE sin/cos cache for reuse

Introduce sin/cos caching mechanism in ROPE to avoid redundant
computation across layers. The cache is built on the first layer
per device and reused by subsequent layers if parameters match.

- Added sin_cache / cos_cache pointers and position_length tracking
- Introduced cache validity flags and properties:
  (ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
- Accelerates ROPE by eliminating repeated sin/cos generation

This change reduces overhead in multi-layer scenarios while
preserving correctness by verifying parameter consistency.

Co-authored-by: hipudding <huafengchun@gmail.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-20 13:42:53 +03:00
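The caching idea in the commit above (build the sin/cos tables once, reuse them while the parameters match) can be sketched as below. This is a hypothetical Python illustration, not the CANN implementation: `RopeSinCosCache`, `get`, and `builds` are invented names, and the angle formula is a placeholder since the commit does not show the real one. Only the list of compared parameters is taken from the commit message.

```python
import math

class RopeSinCosCache:
    """Rebuild sin/cos tables only when the parameters or length change."""

    def __init__(self):
        self.key = None
        self.sin = []
        self.cos = []
        self.builds = 0  # counts how often the tables were regenerated

    def get(self, position_length, ext_factor, theta_scale,
            freq_scale, attn_factor, is_neox):
        key = (position_length, ext_factor, theta_scale,
               freq_scale, attn_factor, is_neox)
        if key != self.key:
            # Placeholder table; the real kernel's angle formula is not shown here.
            angles = [p * freq_scale for p in range(position_length)]
            self.sin = [math.sin(a) * attn_factor for a in angles]
            self.cos = [math.cos(a) * attn_factor for a in angles]
            self.key = key
            self.builds += 1
        return self.sin, self.cos
```

Repeated calls with identical arguments, as subsequent layers would make, return the cached tables without regenerating them.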
Chenguang Li 9b773acac0
CANN: implement LRU cache for ACL graphs (llama/15814)
* CANN: implement LRU cache for ACL graphs in CANN backend

- Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
- Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded.
- Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
- Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend.

* fix typo

* The LRU cache capacity can be configured via an env variable

Signed-off-by: noemotiovon <757486878@qq.com>

* refactor acl graph

* refactor && fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:42:53 +03:00
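An LRU cache with the `push`, `move_to_front`, and `clear` operations named in the commit above might be sketched like this. It is a Python illustration only: `GraphLruCache`, `find`, the environment variable `EXAMPLE_GRAPH_CACHE_CAPACITY`, and the default of 12 are all hypothetical, since the commit message does not give the real variable name or default.

```python
import os
from collections import OrderedDict

class GraphLruCache:
    """Keep the most recently used graphs; evict the oldest beyond capacity."""

    def __init__(self, capacity=None):
        if capacity is None:
            # Hypothetical variable name and default value.
            capacity = int(os.environ.get("EXAMPLE_GRAPH_CACHE_CAPACITY", "12"))
        self.capacity = capacity
        self.graphs = OrderedDict()  # first item = most recently used

    def push(self, key, graph):
        self.graphs[key] = graph
        self.graphs.move_to_end(key, last=False)
        while len(self.graphs) > self.capacity:
            self.graphs.popitem(last=True)  # drop the least recently used

    def move_to_front(self, key):
        self.graphs.move_to_end(key, last=False)

    def find(self, key):
        graph = self.graphs.get(key)
        if graph is not None:
            self.move_to_front(key)
        return graph

    def clear(self):
        self.graphs.clear()
```

A cache hit moves the graph to the front, so frequently reused graphs survive eviction and do not need to be reconstructed.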
Ruben Ortlam 7abe187860
vulkan: throw the oom error instead of no memory type found (llama/15905) 2025-09-20 13:42:53 +03:00
Jeff Bolz d0e98656c3
vulkan: Fix OOB accesses in soft_max_back (llama/15861) 2025-09-20 13:42:52 +03:00
Johannes Gäßler e35d1375ee
HIP: use v_dot2_f32_f16 instruction for FA (llama/15884) 2025-09-20 13:42:52 +03:00
lksj92hs 7fbbb67b47
Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (llama/15886) 2025-09-20 13:42:52 +03:00
Aman Gupta 621764b1a5
CUDA: Add mul_mat_id support for the mmf kernel (llama/15767)
* CUDA: Add mul_mat_id support for the mmf

Add support for mul_mat_id for bs < 16

* Review: use warp_size, fix should_use_mmf condition

* Launch one block per expert, stride along n_expert_used

* templatize mul_mat_id

* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids

* Reduce compile times by dividing mmf into f16, bf16 and f32 variants

* Divide mmf by ncols_dst

* Add missing files

* Fix MUSA/HIP builds
2025-09-20 13:42:52 +03:00
Johannes Gäßler 260982232c
CUDA: fix GET_ROWS for large tensors (llama/15882) 2025-09-20 13:42:52 +03:00
Jeff Bolz c29cd54818
vulkan: sort graph to allow more parallel execution (llama/15850)
* vulkan: sort graph to allow more parallel execution

Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.

With #15489, this reduces the number of synchronizations needed.

* call optimize_graph per-split
2025-09-20 13:42:52 +03:00
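The reordering described in the commit above (group together nodes that do not depend on each other, without changing the nodes themselves) can be sketched as a simple greedy layering. This is a Python illustration under stated assumptions, not the Vulkan backend's algorithm: `reorder_for_parallelism` and the `deps` representation are hypothetical, and the input is assumed to be a valid graph already in dependency order.

```python
def reorder_for_parallelism(deps):
    """deps[i] lists the nodes that node i reads; returns groups of node ids.

    Nodes inside one group do not depend on each other, so a backend could
    run them without a synchronization in between.
    """
    remaining = list(range(len(deps)))
    done = set()
    groups = []
    while remaining:
        # A node joins the group only if everything it reads finished earlier.
        group = [n for n in remaining if all(d in done for d in deps[n])]
        groups.append(group)
        done.update(group)
        remaining = [n for n in remaining if n not in done]
    return groups
```

Flattening the groups gives the new node order; only the order changes, and every node still runs after the nodes it reads.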
Aman Gupta 70ee808f3d
CUDA: generate_cu_files.py - add missing mxfp4 (llama/15880) 2025-09-20 13:42:52 +03:00
Georgi Gerganov ae6cc6a386
cuda : fix supports_op condition for get_rows when number of blocks is too large (llama/15868)
* cuda : fix supports_op condition for get_rows when src1->ne2 > 1

ggml-ci

* ggml : add comment about ggml_get_rows

ggml-ci

* cuda : add FIXME [no ci]

* cuda : update support condition

ggml-ci
2025-09-20 13:42:52 +03:00
Georgi Gerganov e9cb59e970
metal : refactor + optimize (llama/15857) 2025-09-20 13:42:51 +03:00
Xuan-Son Nguyen 40bcd1a469
ggml: allow casting between f32 and i32 (llama/15783)
* ggml: allow casting between f32 and i32

* fix cuda

* add vulkan

* fix CPU non-cont

* add non-cont test case

* add note

* extend test number range

* correct note

* add cont version for vulkan
2025-09-20 13:42:51 +03:00
Sigbjørn Skjæret 0175a1df8d
CUDA: non-contiguous src0 not supported for PAD (llama/15869) 2025-09-20 13:42:51 +03:00
Chenguang Li d9c0ead2ab
CANN: Stream sync between devices for acl_graph (llama/15809)
* CANN: Switch to stream synchronization

Switch to stream synchronization because events are not effective.

Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: add Comments

---------

Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-20 13:42:51 +03:00
Jeff Bolz dfa7722e2e
vulkan: support im2col_3d (llama/15795) 2025-09-20 13:42:51 +03:00
Aaron Teo db4f504b69
ggml-cpu: clean up s390x SIMD (llama/15855)
* ggml-cpu: clean up s390x simd

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0da4b6aa07d96b758812d17b2c82267632fa4ba5)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix hsum data types

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-20 13:42:51 +03:00
Jeff Bolz 9523fd8de6
vulkan: Support pad_ext (llama/15794) 2025-09-20 13:42:51 +03:00
Jeff Bolz 647e2d7de5
vulkan: Use larger loads in scalar/coopmat1 matmul (llama/15729)
I think glslang will translate an access like x[i][1].z to
OpAccessChain ... x, i, 1, 2
OpLoad float16_t ...

rather than loading all of x[i] in a single OpLoad. Change the
code to explicitly load the vector/matrix.
2025-09-20 13:42:51 +03:00
Daniel Bevenius cda7d4e5ac
ggml WebGPU: remove userdata from request adapter callback (llama/15527)
* ggml WebGPU: remove userdata from request adapter callback

This commit removes the `userdata` parameter from the WebGPU request
adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function
captures the `webgpu_context` directly.

The motivation for this change is to simplify the code and improve
readability.

* inline the callback lambda into the RequestAdapter call

This commit removes the callback lambda variable and inlines it directly
into the RequestAdapter call.
2025-09-20 13:42:50 +03:00
Johannes Gäßler cd70d89628
CUDA: faster tile FA (Pascal/AMD), headsize 256 (llama/15769) 2025-09-20 13:42:50 +03:00
Charles Xu be2676bb1c
kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16 (llama/15817) 2025-09-20 13:42:50 +03:00
Johannes Gäßler 69400f16f1
ggml-cpu: document use of "free" memory [no ci] (llama/15834) 2025-09-20 13:42:50 +03:00
Aaron Teo f499271c4e
ggml-cpu: drop support for nnpa intrinsics (llama/15821) 2025-09-20 13:42:50 +03:00
Johannes Gäßler 6ff468cfaa
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (llama/15802)
* CUDA: fastdiv, launch bounds for mmvq + q8_1 quant
2025-09-20 13:42:50 +03:00
Daniel Bevenius 4d6e1144b1
ggml : introduce semantic versioning (ggml/1336)
* ggml : introduce semantic versioning

This commit introduces semantic versioning for the GGML library.

The motivation for this is that the current versioning, using build
numbers, makes it difficult to track changes and releases for projects
that use ggml.

The release steps are the following:
1. Sync the changes from llama.cpp using sync-llama-am.sh and after the
   PR has been approved and merged move to step 2.
2. Run scripts/release.sh and specify the type of release, major, minor,
   or patch. This script will handle incrementing the version
   (major|minor|patch), create a new commit with the version change,
   create a tag for the version, and prepare for the next development
   iteration.
3. Inspect the commits/tag and push to master. This triggers the GitHub
   release workflow, which runs for new tags and then publishes a new
   release on GitHub.

Example usage:
```console
$ ./scripts/release.sh major --dry-run
[dry-run] - No changes will be made

Step 1: Reading current version...
Current version: 0.9.0-dev
New release version: 1.0.0

Step 2: Updating version in CMakeLists.txt...
  [dry-run] Would update GGML_VERSION_MAJOR to 1
  [dry-run] Would update GGML_VERSION_MINOR to 0
  [dry-run] Would update GGML_VERSION_PATCH to 0
  [dry-run] Would remove -dev suffix

Step 3: Committing version bump...
  [dry-run] Would commit: 'ggml : bump version to 1.0.0'

Step 4: Creating git tag...
  [dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0'

Step 5: Preparing for next development cycle...
  [dry-run] Would update GGML_VERSION_MINOR to 1
  [dry-run] Would add -dev suffix back

Step 6: Committing development version...
  [dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev'

[dry-run] Summary (no changes were made):
  • Would have released version: 1.0.0
  • Would have created tag: v1.0.0
  • Would have set next development version: 1.1.0-dev
```

Refs: https://github.com/ggml-org/ggml/issues/1333

* ggml: create branch for release candidate and check master

* ggml : sign the git tag
2025-09-20 13:42:50 +03:00
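The version arithmetic shown in the dry run above can be sketched for the major-release case. This is a Python illustration, not `scripts/release.sh`: `plan_major_release` is a hypothetical helper, and it covers only the `major` path because that is the only one the dry-run output documents.

```python
import re

def plan_major_release(current):
    """Return (release, tag, next_dev) for a major release, per the dry run."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(-dev)?", current)
    if m is None:
        raise ValueError(f"unrecognized version: {current!r}")
    major = int(m.group(1)) + 1
    release = f"{major}.0.0"       # minor and patch reset, -dev suffix removed
    next_dev = f"{major}.1.0-dev"  # minor bumped, -dev suffix added back
    return release, f"v{release}", next_dev
```

Starting from `0.9.0-dev`, this yields the release version, tag, and next development version listed in the dry-run summary.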
Gregor Jasny c80f78cc7b
CUDA : conditionally add cuda architectures (ggml/1341) 2025-09-20 13:42:50 +03:00
Gabe Goodhart ffe560cbb1
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (llama/15799)
Branch: GGMLMetalNE20

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-09-20 13:42:49 +03:00
Chenguang Li 3780a3c917
CANN: Refactor ND to NZ workspace to be per-device (llama/15763)
* CANN: Refactor ND to NZ workspace to be per-device in Ascend backend

- Replaced the previous single global ND→NZ workspace with a per-device
  cache using unordered_map keyed by device ID.
- Functions `release_nz_workspace`, `relloc_nz_workspace`, and
  `get_nz_workspace` now manage workspace independently for each device,
  preventing memory conflicts in multi-device / pipeline parallel scenarios.
- This change fixes potential precision issues caused by workspace
  overwrites when multiple devices perform ND→NZ conversions concurrently.

Co-authored-by: hipudding <huafengchun@gmail.com>

* refactor

Signed-off-by: noemotiovon <757486878@qq.com>

* rename

Signed-off-by: noemotiovon <757486878@qq.com>

* fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-20 13:42:49 +03:00
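The per-device workspace from the commit above, a map keyed by device ID instead of one global buffer, can be sketched as follows. This is a hypothetical Python illustration: `NzWorkspaceCache`, `get`, and `release` are invented names that only loosely mirror the functions mentioned in the commit, and a `bytearray` stands in for device memory.

```python
class NzWorkspaceCache:
    """One scratch buffer per device id instead of a single global buffer."""

    def __init__(self):
        self.buffers = {}  # device id -> bytearray

    def get(self, device, size):
        buf = self.buffers.get(device)
        if buf is None or len(buf) < size:
            buf = bytearray(size)  # create or grow this device's buffer
            self.buffers[device] = buf
        return buf

    def release(self, device):
        self.buffers.pop(device, None)
```

Because each device owns its buffer, a conversion running on one device cannot overwrite the scratch data of another.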
leejet 2228462b19
ggml: add ops for WAN video model (cuda && cpu) (llama/15669)
* add conv3d support

* add ggml_pad_ext for cpu & cuda backend

* cuda/cpu: add im2col_3d support

* cuda: make im2col a little faster

* fix cuda pad/scale/im2col3d

* make im2col_3d faster

* gguf: support loading tensors which n_dims > GGML_MAX_DIMS

* fix cuda get_rows

* avoid ggml_conv_3d conflict

* correct GGML_OP_COUNT assertion

* avoid build failure

* avoid build failure on MacOS

* cuda: remove unnecessary MIN define

* fix cpu im2col_3d

* adjust the code style

* cuda: use simpler loop in get_rows

* add test_im2col_3d to test-backend-ops

* test-backend-ops.cpp: remove trailing whitespace

* cpu: im2col_3d support non-contiguous src

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* fix test_im2col_3d

* remove unused variables

* cuda: get_rows: dfloat2 -> float2

* add test_pad_ext to test-backend-ops.cpp

* add gguf_init_from_file_ext impl

* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS"

This reverts commit d8377a0a37f314bd3713fe043b4333ad661610c1.

* Revert "add gguf_init_from_file_ext impl"

This reverts commit d9f1d13208c68ef83b3538201ac7f31614fb1994.

* update ggml_backend_vk_device_supports_op

* fix ggml_backend_vk_device_supports_op

* update other backend supports op for ggml_pad_ext

* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-09-20 13:42:49 +03:00
hipudding 96efb472b4
CANN: Fix precision issue on 310I DUO multi-devices (llama/15784) 2025-09-20 13:42:49 +03:00
rmatif 1569daf524
opencl: add hs=40 to FA (llama/15758) 2025-09-20 13:42:49 +03:00
Chenguang Li 5c860e94c6
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (llama/15760)
Fixes #15330

Adjust the allocation size of acl_rstd. The parameter `dims` is set to 3 according to the CANN documentation.

Co-authored-by: Yuchuan <yuchuan-cao@users.noreply.github.com>
2025-09-20 13:42:49 +03:00
Ruben Ortlam 719a05c665
vulkan: fix mmv subgroup16 selection (llama/15775) 2025-09-20 13:42:49 +03:00
Jeff Bolz 4a702a867c
vulkan: don't use std::string in load_shaders, to improve compile time (llama/15724)
* vulkan: don't use std::string in load_shaders, to improve compile time

* keep the string version for those calls that use it
2025-09-20 13:42:49 +03:00
Daniel Bevenius 4144ae10e9
vulkan : update ggml_vk_instance_validation_ext_available (llama/15666)
* vulkan : update ggml_vk_instance_validation_ext_available

This commit updates ggml_vk_instance_validation_ext_available() to
check for VK_EXT_validation_features instead of
VK_KHR_portability_enumeration.

Based on how the returned boolean is used later in the code (to enable
both the validation layer and the VK_EXT_validation_features extension),
it appears the function may have been intended to check for the
validation layer features extension.

* remove try/catch

This was a leftover from a previous iteration where I was explicitly
querying for a specific validation layer first, which would throw.

* update warning message about validation layers
2025-09-20 13:42:48 +03:00
Shin-myoung-serp 85c7aa3750
ggml vulkan: add hardsigmoid and hardswish operations (llama/15762) 2025-09-20 13:42:48 +03:00
Oliver Simons 9eef377330
CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E (llama/15715)
* Add fastdiv, use it in modulo and use modulo in rms_norm_f32

Fastdiv is a much faster way to do integer division, which was identified
as a bottleneck in rms_norm_f32

* Support more `block_size` values in `rms_norm_f32`

This makes us more flexible in selecting the optimal threads w.r.t
parallelizing across a col vs. launch-overheads of threads and mio
throttles

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Replace modulo with fastmodulo in `rms_norm_f32`

* Use `BinPackArguments=true` for formatting function calls

Will file a separate PR to adjust .clang-format file

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Use uint3 for both `fastdiv` and `fastmodulo`

The compiler seems to reliably optimize away the unused .z component in
the fastdiv use-case, see https://godbolt.org/z/rx8KPrKr3

* More constrained type declarations

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Rename fastdiv and fastmodulo variables to shared variable name

As suggested by JohannesGaessler, this increases clarity of the intended
use

* Pack fastdiv/fastmodulo constants into uint2/uint3 objects

By packing constants to be used together into a struct, we are less
likely to make errors.

* Rename function parameter of fastmodulo

`modulo_consts` is more fitting/descriptive
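
The fastdiv idea referred to in this commit can be illustrated with a minimal Python sketch of division by a precomputed multiplier and shift (the classic invariant-divisor scheme). This is not the code from `common.cuh`: the helper name `init_fastdiv_values` is hypothetical, the tuple only stands in for the packed `uint3`, and Python integers do not wrap, so 32-bit overflow behaviour on the GPU is not modelled.

```python
def init_fastdiv_values(d):
    """Precompute (multiplier, shift, divisor) for a fixed divisor d >= 1.

    Hypothetical helper for illustration; stands in for the packed constants.
    """
    assert 1 <= d < 2**32
    shift = 0
    while (1 << shift) < d:          # shift = ceil(log2(d))
        shift += 1
    multiplier = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return (multiplier, shift, d)


def fastdiv(n, consts):
    """Compute n // d with one multiply-high, one add and one shift."""
    multiplier, shift, _ = consts
    hi = (n * multiplier) >> 32      # upper 32 bits of the 64-bit product
    return (hi + n) >> shift


def fastmodulo(n, consts):
    """Compute n % d by reusing fastdiv: n - (n // d) * d."""
    return n - fastdiv(n, consts) * consts[2]
```

The constants are computed once per divisor on the host, so the per-element cost inside a kernel is a multiply and a shift rather than a hardware integer division.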

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:42:48 +03:00
hipudding 51bc843f3a
CANN: Add RoPE contiguous check for 310I DUO device (llama/15735) 2025-09-20 13:42:48 +03:00
xctan 75f739c7c8
ggml-cpu : optimize RVV kernels (llama/15720)
* ggml-cpu : optimize rvv ggml_vec_dot_f32

* ggml-cpu : optimize 128-bit rvv ggml_vec_dot_q4_K_q8_K

* ggml-cpu : fix riscv arch flags

* ggml-cpu : add more rvv ops

* ggml-cpu : optimize rvv ggml_vec_dot_q4_K_q8_K

* ggml-cpu : optimize rvv ggml_vec_dot_q6_K_q8_K

* ggml-cpu : minor rvv adjustments

* ggml-cpu : fix riscv include
2025-09-20 13:42:48 +03:00
hipudding 91e9e72ecd
CANN: Mask unsupported TRANSPOSE_1D operator (llama/15733)
CANN currently does not support kernels larger than 255.
This change disables such cases.
2025-09-20 13:42:48 +03:00
Chenguang Li d84b96d9d0
CANN: Fix type float_t to float (llama/15736)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:42:48 +03:00
Ruben Ortlam e584edb5ba
vulkan: fix shaders gen when no integer dot is available (llama/15740) 2025-09-20 13:42:48 +03:00
hipudding 5aee53c40f
CANN: Resolve soft_max precision issue (llama/15730)
Previously, the slope tensor was set to fp16 to improve efficiency.
While this worked correctly in FA, it caused precision issues in soft_max.
This change applies different data types for different operators
to balance both accuracy and performance.
2025-09-20 13:42:47 +03:00
Jeff Bolz 1e03aa66f7
vulkan: Fix macro parameter order for f32 matmul shaders (llama/15716) 2025-09-20 13:42:47 +03:00
rmatif fb37f91163
opencl: add attn sinks support for FA kernels (llama/15706) 2025-09-20 13:42:47 +03:00
Chenguang Li 3db49c1c26
CANN: Support eager execution mode under ACL graph compilation (llama/15712)
* [CANN] Support eager execution mode under ACL graph compilation

Add support for running operators in eager mode while ACL graph
compilation is enabled. This allows bypassing graph execution
and directly submitting ops, which is useful for debugging and
reducing graph build overhead in certain scenarios.

Signed-off-by: noemotiovon <757486878@qq.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

* rename to acl_graph_mode

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:42:47 +03:00
hipudding 13d3963f71
CANN: Support ext_factor in rope (llama/15710) 2025-09-20 13:42:47 +03:00
Johannes Gäßler f20a7b0e99
ggml-backend: raise GGML_MAX_SPLIT_INPUTS (llama/15722) 2025-09-20 13:42:47 +03:00
Gilad S 9e3600e569
vulkan: use memory budget extension to read memory usage (llama/15545)
* vulkan: use memory budget extension to read memory usage

* fix: formatting and names

* formatting

* fix: detect and cache memory budget extension availability on init

* fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available

* style: lints
2025-09-20 13:42:47 +03:00
Jeff Bolz 7a5e7368a3
vulkan: add missing clamps in new mul_mat_id paths (llama/15702)
This is a missing interaction between #15546 and #15652
2025-09-20 13:42:46 +03:00
Ruben Ortlam d5f80a2982
vulkan: disable large mmv subgroups on older Nvidia GPUs (llama/15717) 2025-09-20 13:42:46 +03:00
s-goto-11 8218dc609c
ggml: SVE support for exponential functions (llama/15145)
* SVE support for exponential functions

Add const notation to variable pg

* Update ggml/src/ggml-cpu/vec.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add const

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-20 13:42:46 +03:00
Prashant Vithule 31840a3a56
ggml: aarch64: Implement SVE F16 kernels for vector functions (llama/15115)
* Added sve implementation for vec_dot_fp16 Kernel

* removed white spaces

* Added comment

* removed white spaces

* changed GGML_F16x_VEC_FMA for code consistency

* Update vec.h

---------

Co-authored-by: vithulep <p.m.vithule1517@gmail.com>
2025-09-20 13:42:46 +03:00
Ruben Ortlam 5e70d901b0
Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants (llama/14903)
* vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants

* vulkan: use subgroup operations for quantize_q8_1 shader

* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader

* vulkan: use q8_1_x4 blocks in mul_mmq shader

* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec

* vulkan: tune mul_mat_vecq performance for Intel

* vulkan: fix quantizing issue when tensor is not divisible by 128

* vulkan: adapt integer dot mmv to mmv small m optimization (llama/15355)

* vulkan: allow all subgroup modes for mmv and mmvq

* vulkan: use prealloc intermediate reuse for mmvq path

* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090

* vulkan: adapt mmv quantize_y path to conditional sync logic

* vulkan: disable q8_0 mmvq on Nvidia

* vulkan: enable q8_0 on Nvidia pre-turing

* fix prealloc sync condition

* fix llvmpipe subgroup 8 issue
2025-09-20 13:42:46 +03:00
Daniel Bevenius c5f511e697
ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops (llama/15695)
* ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops

This commit adds support for the TRANSPOSE and RESHAPE operations in the
ggml webgpu backend.

Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-20 13:42:46 +03:00
Akarshan Biswas 2ba5e0cb47
CUDA: fix build error from ambiguous __half conversions in conv2d (llama/15690)
* CUDA: fix build error from ambiguous __half conversions in conv2d

Building conv2d with half precision failed because `__half` defines
multiple implicit conversion operators (to float, int, short, etc.),
causing ambiguous overload resolution when multiplying with float.

Introduce a templated `to_float` helper that explicitly converts
`__half` via `__half2float`, while passing through float unchanged.
Use this helper in conv2d accumulation to ensure unambiguous and
correct promotion to float.

Fixes some build errors with half-precision kernels on CUDA.

ggml-ci

* CUDA: Replace custom to_float helper with unified ggml_cuda_cast and add half->float conversion

* CUDA: Add missing convert.cuh header

* CUDA: remove unnecessary extension in ggml_cuda_cast

* CUDA: Address review comment, remove second type template argument
2025-09-20 13:42:46 +03:00
hipudding bb5f844ec7
CANN: Optimize MUL_MAT_ID (llama/15658) 2025-09-20 13:42:46 +03:00
hipudding ed7ebdc757
CANN: fix RoPE cache issue on multi-device (llama/15629)
* CANN: fix RoPE cache issue on multi-device

RoPE cache only needs to be computed once per token.
However, in multi-device scenarios, not every device starts
computation from layer 0, which may lead to unallocated memory
issues and precision errors.

This commit records the first layer of each device to avoid
the above issues.

* CANN: Optimize first-layer detection method

* CANN: Remove trailing whitespace

* CANN: Only cache the data that can be determined as unchanged through the parameters.

* CANN: Update function comment
2025-09-20 13:42:45 +03:00
Georgi Gerganov 3d470687de
metal : fix checks for available FA kernels (llama/15700)
* metal : fix checks for available FA kernels

ggml-ci

* cont : fix comment [no ci]
2025-09-20 13:42:45 +03:00
Diego Devesa b11c972b88
llama : separate compute buffer reserve from fattn check (llama/15696)
Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.
2025-09-20 13:42:45 +03:00
Jeff Bolz db7ecfb61d
vulkan: handle large sizes for get_rows (llama/15686) 2025-09-20 13:42:45 +03:00
Jeff Bolz 191def71ce
vulkan: mul_mat_id coopmat2 optimizations (llama/15546)
* vulkan: mul_mat_id coopmat2 optimizations

Add a path for when the tile fits in BN/2, similar to what we have for mul_mat.

Only call fetch_scales/store_scales once per QUANT_K block, and once at the
beginning in case start_k is not aligned.

* Also add a path for BN/4 - worth a couple more percent
2025-09-20 13:42:45 +03:00
Daniel Bevenius b092e95aaa
vulkan : remove unused portability_enumeration_ext variable (llama/15679)
This commit removes the portability_enumeration_ext variable from the
ggml_vk_instance_portability_enumeration_ext_available function as it
is initialized to false but never modified, making it redundant.
2025-09-20 13:42:45 +03:00
Jeff Bolz 20ce6fcf6a
vulkan: Allow fallback to sysmem memory when vidmem is full (llama/15649)
* vulkan: Allow fallback to sysmem memory when vidmem is full

* vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK
2025-09-20 13:42:45 +03:00
Jeff Bolz 71f0ee70bf
vulkan: clamp matmul and FA results to the max finite value (llama/15652)
* vulkan: clamp matmul and FA results to the max finite value

* only clamp for fp16
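
A rough illustration of why clamping matters for fp16 results, written as a Python sketch rather than shader code. The function names are hypothetical, and `store_fp16` only emulates writing a value to half-precision storage, where anything beyond the largest finite value ends up as infinity.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 binary16 value


def clamp_to_fp16_range(x):
    """Clamp x into the finite range representable in fp16."""
    return max(-FP16_MAX, min(FP16_MAX, x))


def store_fp16(x):
    """Emulate storing x as fp16 (hypothetical helper, not ggml code)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Python refuses to pack out-of-range values; hardware would give inf.
        return math.copysign(math.inf, x)
```

Without the clamp, a large accumulator value turns into infinity on store and can poison later operations; with it, the value saturates at the largest finite magnitude.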
2025-09-20 13:42:45 +03:00
Charles Xu 74583845b6
ggml: update kleidiai to v1.13.0 (llama/15663) 2025-09-20 13:42:44 +03:00
Johannes Gäßler f6ba3949b6
llama: use FA + max. GPU layers by default (llama/15434)
* llama: use max. GPU layers by default, auto -fa

* ggml-backend: abort instead of segfault
2025-09-20 13:42:44 +03:00
Johannes Gäßler b7809c401b
CUDA: use FP32 arithmetic for conv2d (llama/15683) 2025-09-20 13:42:44 +03:00
Jeff Bolz a6dec4f49d
vulkan: Skip syncing for prealloc_y when it is reused (llama/15544) 2025-09-20 13:42:44 +03:00
Chenguang Li d629af157e
CANN: Fix compiler warnings (llama/15661)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:42:44 +03:00
Aman Gupta 82ce91e7d2
CUDA: fix bug in rms_norm fusion (llama/15660)
* CUDA: fix bug in rms_norm fusion

* Fix bug for OP_REPEAT

* Fix index for add
2025-09-20 13:42:44 +03:00
Aman Gupta 6d7ddaf793
CUDA: fuse adds, fuse add with rms norm (llama/15631)
* CUDA: fused add with rms_norm_mul

* Non-broadcast fuse works

* Add fused adds

* format

* Remove n_fuse from template params

* Address review comments

* Move template inside binbcast
2025-09-20 13:42:44 +03:00
mnehete32 dc9f55bbb0
CUDA: add conv2d (llama/15635)
* CUDA: add conv2d

* CUDA: conv2d - correct formatting and added const
2025-09-20 13:42:44 +03:00
Aaron Teo 6287027a2c
ggml-cpu: fix invalid hsum build in debug s390x (llama/15634)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-20 13:42:43 +03:00
compilade 6dffbaa0cb
ggml : fix SSM_SCAN for n_groups > 1 (llama/15625) 2025-09-20 13:42:43 +03:00
Georgi Gerganov cac6253744
kv-cache : remove LLAMA_SET_ROWS checks (llama/15505)
ggml-ci
2025-09-20 13:42:43 +03:00
matiaslin 88c0582b61
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (llama/15622)
Prior to this change, we faced undefined cublasLt references when
attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux.

We add linking with CUDA::cublasLt_static when CUDA version is greater
than 10.1.
2025-09-20 13:42:43 +03:00
uvos 65fa2c0c1a
HIP: Enable support for ggml_backend_cuda_register_host_buffer (llama/15615) 2025-09-20 13:42:43 +03:00
Chenguang Li 02e8b23137
CANN: refactor mask handling and improve performance in FA (llama/15561)
* CANN(flash-attn): refactor mask handling and improve performance

1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: fix review

Signed-off-by: noemotiovon <757486878@qq.com>

* [CANN]: Optimization FA BNSD to BSND

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:42:43 +03:00
xctan ece1bdfe7e
ggml-cpu : add basic RVV support for vector f32 ops (llama/15057)
* ggml-cpu : add basic RVV support for vector f32 ops

* ggml-cpu : add RVV support for f32 softmax
2025-09-20 13:42:43 +03:00
rmatif a6ec224efa
OpenCL: add fused group_norm/norm, mul, add (llama/15314)
* add fused group_norm/norm, mul, add

* fix spacing

* revert rms_norm logic

* fix trailing whitespace
2025-09-20 13:42:43 +03:00
Akarshan Biswas 94fa9f63b3
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (llama/15592)
The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.

This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
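
The shape of the check described above can be sketched in a few lines of Python. This is only a model of the predicate, not the SYCL backend code; `supports_rms_norm` is a hypothetical name and the `WARP_SIZE` value of 32 is an assumption.

```python
WARP_SIZE = 32  # assumed subgroup width; the real value depends on the build


def supports_rms_norm(ne0, warp_size=WARP_SIZE):
    """Report support only when the first dimension is a multiple of warp_size."""
    return ne0 % warp_size == 0
```

Reporting the op as unsupported lets the scheduler fall back to another backend instead of hitting the assertion at run time.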
2025-09-20 13:42:42 +03:00
shalinib-ibm 31c7784e09
llamafile: PowerPC Sgemm Optimization (llama/15558)
This patch improves GEMM for FP32 Data Type on PowerPC

Implements GEMM on large blocks with configurable block size mc, nc, kc
(default: 256, 256, 256).
Packing Function optimized to access blocks as per memory layout.
GEMM Optimized to work on larger blocks.
Isolated Packing from GEMM Operations for better MMA utilization.
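
The blocking scheme described above can be sketched in plain Python. This is a generic cache-blocked GEMM under assumed conventions (row-major lists of lists, blocks copied out before the inner loops as a stand-in for packing); it does not reproduce the PowerPC MMA kernel, and `blocked_gemm` is a hypothetical name.

```python
def blocked_gemm(A, B, mc=256, nc=256, kc=256):
    """Compute C = A @ B, walking the operands in mc x kc and kc x nc blocks."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, mc):
        for p0 in range(0, k, kc):
            # "Pack" the A block once; it is reused for every B block below.
            a_block = [row[p0:p0 + kc] for row in A[i0:i0 + mc]]
            for j0 in range(0, n, nc):
                # "Pack" the matching B block, separate from the compute loops.
                b_block = [row[j0:j0 + nc] for row in B[p0:p0 + kc]]
                for i, a_row in enumerate(a_block):
                    c_row = C[i0 + i]
                    for p, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b_block[p]):
                            c_row[j0 + j] += a_val * b_val
    return C
```

Keeping packing separate from the multiply loops mirrors the commit's stated goal: the compute loops only ever touch small, contiguous blocks.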

Verified functionality and correctness using llama-cli and a standalone
test case (performs matmul and compares the final matrix C result with base).

Minor code refactoring changes:
Replace macro with inline function
Code Indent made consistent with 4 spaces

Performance Testing:

Observed 50% ~ 70% improvement in Prompt Processing Speed measured using
llama-bench with Meta-Llama3-8B FP32 Model.  Similar gains observed with
Mistral-7b-Instruct-v0.3 Model.

model              Size        Params   Backend   Threads   Test     Patch   Base
llama 8B all F32   29.92 GiB   8.03 B   CPU       20        pp512    98.58   60.3
llama 8B all F32   29.92 GiB   8.03 B   CPU       20        pp1024   95.88   57.36
llama 8B all F32   29.92 GiB   8.03 B   CPU       20        pp2048   85.46   53.26
llama 8B all F32   29.92 GiB   8.03 B   CPU       20        pp4096   68.66   45.78
llama 8B all F32   29.92 GiB   8.03 B   CPU       20        pp6144   57.35   40.44

25 ~ 30% improvement in llama-batched-bench with Meta-Llama3-8B in
Prompt Processing Speed for large prompts (256, 512, 1024, 2048, 4096 tokens) with various batch
sizes (1, 2, 4, 8, 16)

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-09-20 13:42:42 +03:00
Johannes Gäßler 53010199a1
CUDA: return -1 for nonexistent compiled arch (llama/15587) 2025-09-20 13:42:42 +03:00
Georgi Gerganov 1c21a850be
metal : optimize FA vec for large sequences and BS <= 8 (llama/15566)
* metal : optimize FA vec for large heads and sequences

* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci
2025-09-20 13:42:42 +03:00
Georgi Gerganov dc693ca8c9
metal : improve `MUL_MAT_ID` (llama/15541)
* metal : mul_mm_id remove hdst

* metal : remove mul_mm_id hsrc1

* metal : mul_mm_id simplify + add test

* metal : opt mul_mm_id map0

* metal : optimize mul_mm_id id gathering

* metal : mul/div opt

* metal : optimize mul_mm_id_map0

ggml-ci
2025-09-20 13:42:42 +03:00
Sigbjørn Skjæret 3bb52acb46
metal : remove contiguous assertion for src0 in IM2COL (llama/15577)
* remove contiguous assertion for src0 in IM2COL

* add contiguous check in supports_op
2025-09-20 13:42:42 +03:00
Yoshi_likes_e4 9828caafb5
Add a warning for special devices (llama/15563)
* Add warning

* Print the devices names

* Add newlines

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Fix vector names

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:42:42 +03:00
Jeff Bolz 79e2bd5ea8
vulkan: Remove splitting for mul_mat_id (llama/15568)
row_ids only needs to hold the BN rows for the current tile.
2025-09-20 13:42:42 +03:00
Qeeweew 2468074e91
CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (llama/15451)
* CUDA: optimize get_int_from_table_16

* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs

* revise documentation

---------

Co-authored-by: xix <xiapc@outlook.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:42:41 +03:00
lhez 582ef379ab
opencl: fix support ops condition for `rms_norm` (llama/15560) 2025-09-20 13:42:41 +03:00
Ruben Ortlam 335d2a5405
vulkan: fix min subgroup 16 condition for mmid subgroup optimization (llama/15565) 2025-09-20 13:42:41 +03:00
Ihar Hrachyshka 8851ef5463
metal: fix regression when no metal devices are present (llama/15531) 2025-09-20 13:42:41 +03:00
Johannes Gäßler 1e856b2919
CUDA: MoE helper in device code, better tile sizes (llama/15525)
* CUDA: MoE helper in device code, better tile sizes

* reduce superfluous CUDA blocks
2025-09-20 13:42:41 +03:00
Georgi Gerganov 54be54f4ce
metal : add FA kernels for HS=40 (llama/15559)
ggml-ci
2025-09-20 13:42:41 +03:00
Chenguang Li 86331f74e0
CANN: ROPE cache sin/cos repeat (llama/15501)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:42:41 +03:00
Ruben Ortlam ee11ed42a9
vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (llama/15524)
* vulkan: use subgroup function for mul_mat_id shader even without coopmat

* vulkan: fix compile warnings

* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id

* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
2025-09-20 13:42:41 +03:00
Jeff Bolz 85d4d2c875
vulkan: Support FA with any multiple of 8 head sizes (llama/15537)
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.

Store the FA pipelines in a map, indexed by the pipeline state.
2025-09-20 13:42:40 +03:00
Ruben Ortlam 8c7872d6ed
vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (llama/15526) 2025-09-20 13:42:40 +03:00
Jeff Bolz 27817867cc
vulkan: workaround MoltenVK compile failure in multi_add (llama/15506)
* vulkan: workaround MoltenVK compile failure in multi_add

* Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp

Co-authored-by: 0cc4m <picard12@live.de>
2025-09-20 13:42:40 +03:00
Johannes Gäßler b0d15e1eb6
CUDA: fix half2 -> half conversion for HIP (llama/15529) 2025-09-20 13:42:40 +03:00
Jeff Bolz 2f6288c33c
vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (llama/15281)
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs

There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.

The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.

* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.

* complete rebase against fused adds - multi_add shader can also compute partial sums

* fix validation errors

* disable add_rms_fusion for Intel due to possible driver bug

* resolve against #15489, sync after clearing partial sums
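
The add + rms_norm fusion with partial sums can be modelled in Python as follows. This is a sketch under assumptions, not the shader code: the function names are hypothetical, `group_size` stands in for the amount of work covered by one partial sum, and `eps` is an assumed parameter.

```python
import math


def add_with_partial_sums(a, b, group_size):
    """Add two vectors and also emit one sum of squares per group of outputs."""
    out = [x + y for x, y in zip(a, b)]
    partial = [
        sum(v * v for v in out[i:i + group_size])
        for i in range(0, len(out), group_size)
    ]
    return out, partial


def rms_norm_from_partials(x, partial, eps=1e-6):
    """Finish rms_norm using the precomputed partial sums of squares."""
    mean_sq = sum(partial) / len(x)          # reduce the partial sums
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]            # simple per-element multiply
```

Because the partial sums are written to fixed slots and reduced in a fixed order, the result does not depend on the order in which groups finish, which is the determinism argument made for replacing the atomic add.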
2025-09-20 13:42:40 +03:00
Jeff Bolz d8eb9f7d67
vulkan: Rewrite synchronization to allow some overlap between nodes (llama/15489)
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.

The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether they were in use and need to be synced. This should be
checked before they are written to, and set to true after they are done being
consumed.
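
The dependency-tracking idea can be shown with a toy Python model. It is a simplification under stated assumptions: each node is identified by the name of the buffer it writes, `SyncTracker` and `submit` are hypothetical names, and the extra per-allocation flags for dequantization or split_k buffers are not modelled.

```python
class SyncTracker:
    """Toy model of tracking nodes that have not been synchronized yet."""

    def __init__(self):
        self.unsynced = set()  # outputs written since the last barrier
        self.barriers = 0      # number of barriers actually issued

    def submit(self, dst, srcs):
        """Record a node writing dst from srcs; return True if a barrier was needed."""
        needs_sync = dst in self.unsynced or any(s in self.unsynced for s in srcs)
        if needs_sync:
            self.barriers += 1
            self.unsynced.clear()   # everything before the barrier is now visible
        self.unsynced.add(dst)
        return needs_sync
```

Two independent nodes submitted back to back issue no barrier and may overlap; a barrier appears only when a later node reads or overwrites one of their outputs.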
2025-09-20 13:42:40 +03:00
Acly 5094171c37
vulkan : support ggml_mean (llama/15393)
* vulkan : support ggml_mean

* vulkan : support sum, sum_rows and mean with non-contiguous tensors

* vulkan : fix subbuffer size not accounting for misalign offset

* tests : add backend-op tests for non-contiguous sum_rows

* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support

* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
2025-09-20 13:42:40 +03:00
Jeff Bolz 485c5c3b3b
vulkan: optimize mul_mat_id loading row ids into shared memory (llama/15427)
- Spread the work across the whole workgroup. Using more threads seems to
far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
2025-09-20 13:42:40 +03:00
Reese Levine bb5d7e2c31
ggml WebGPU: add support for quantization types (llama/15440)
* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Work on templating for different types in shaders

* Work on shader type generation

* Working q4_0 mul_mat and some templating for different types

* Add q4_0_f16 matmul and fix device init

* Add matmul support for basic quantization types

* Add q2_k and q3_k quantization

* Add rest of k-quants

* Get first i-quant working

* Closer to supporting all i-quants

* Support rest of i-quants

* Cleanup code

* Fix python formatting

* debug

* Bugfix for memset

* Add padding to end of buffers on creation

* Simplify bit-shifting

* Update usage of StringView
2025-09-20 13:42:39 +03:00
rmatif d7b7498e76
ggml: add `conv3d` op (llama/15182)
* add conv3d

* bump GGML_OP_COUNT
2025-09-20 13:42:39 +03:00
Yavor Ivanov 18ca4e8f63
cuda : add Pad Reflect 1D support (llama/14659)
* Add Pad Reflect 1D CUDA support

* Update ggml/src/ggml-cuda/pad_reflect_1d.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:42:39 +03:00
Aaron Teo 380d3db216
ggml-cpu: Support Q5_0 and Q5_1 on s390x (llama/15486)
* ggml-cpu: initial q5_0 impl for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: updated q5_0 code for better performance

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: use optimised hsum for better performance

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: introduce q5_1 simd + refactor q5_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix incorrect return type vec_hsum

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: refactor q5_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: q5_1 update loop unroll to 4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update q5_0 unroll to 4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update build-s390x docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update unused variables q5_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: update the last update date

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-20 13:42:39 +03:00
Chenguang Li be841c3f6e
CANN: Optimize RMS_NORM using cache (llama/15419)
* [CANN] Optimize RMS_NORM using cache

Signed-off-by: noemotiovon <757486878@qq.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

* fix review comment

Signed-off-by: noemotiovon <757486878@qq.com>

* codestyle adjustment

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:42:39 +03:00
Diego Devesa 554f96f385
sched : fix possible use of wrong ids tensor when offloading moe prompt processing (llama/15488) 2025-09-20 13:42:39 +03:00
Acly 9dd5039968
vulkan : support conv_2d_dw with f16 weights (llama/15392) 2025-09-20 13:42:39 +03:00
Dong Won Kim 7eebd498ff
vulkan: add exp operation (llama/15456)
Co-authored-by: aeseulgi <kim2h7903@gmail.com>
2025-09-20 13:42:39 +03:00
Jeff Bolz 04d0f9a066
vulkan: Reuse conversion results in prealloc_y (llama/15410)
* vulkan: Reuse conversion results in prealloc_y

Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.

* don't use shared pointer for prealloc_y_last_pipeline_used
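
The caching rule described in this commit can be sketched in Python as below. The class and method names are hypothetical, pipelines and tensors are represented by plain values, and the `invalidate` method is an assumption about what must happen when something else overwrites the buffer.

```python
class PreallocYCache:
    """Remember which (pipeline, tensor) pair last filled the scratch buffer."""

    def __init__(self):
        self.last_pipeline = None
        self.last_tensor = None

    def fill(self, pipeline, tensor, convert):
        """Run convert(tensor) unless the buffer already holds that result."""
        if self.last_pipeline is not None and \
                (pipeline, tensor) == (self.last_pipeline, self.last_tensor):
            return False            # reuse: skip the conversion entirely
        convert(tensor)
        self.last_pipeline, self.last_tensor = pipeline, tensor
        return True

    def invalidate(self):
        """Forget the cached pair (assumed to be needed after other writes)."""
        self.last_pipeline = None
        self.last_tensor = None
```

The benefit shows up when several consecutive operations consume the same converted input, since only the first one pays for the conversion.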
2025-09-20 13:42:38 +03:00
Xuan-Son Nguyen c5874bcf42
ggml : fix condition of im2col on Metal backend (llama/15460) 2025-09-20 13:42:38 +03:00
R0CKSTAR 7c077845fd
musa: add GGML_UNUSED_VARS (llama/15446)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-20 13:42:38 +03:00
Diego Devesa 622dec5bf6
sched : copy only the used experts when offloading prompt processing (llama/15346) 2025-09-20 13:42:38 +03:00
Johannes Gäßler 8f0579a33d
CUDA: refactor FA support/selection code (llama/15454) 2025-09-20 13:42:38 +03:00
Johannes Gäßler 316ed78d68
CUDA: replace GGML_CUDA_F16 with CUDA arch checks (llama/15433) 2025-09-20 13:42:38 +03:00
Jeff Bolz 5907ab3e4a
vulkan: shorten pipeline name strings (llama/15431)
These detailed strings were causing increased build time on gcc.
2025-09-20 13:42:38 +03:00
R0CKSTAR 0eb2d653bd
musa: fix build warnings (llama/15258)
* musa: fix build warnings

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-20 13:42:38 +03:00
lhez db1d2380a0
opencl: mark `argsort` unsupported if cols exceed workgroup limit (llama/15375) 2025-09-20 13:42:37 +03:00
SHUAI YANG 2572322bac
CANN: optimize rope operator (llama/15335)
* optimize rope ops

* amendment

* delete trailing whitespace

* change the variable name
2025-09-20 13:42:37 +03:00
R0CKSTAR 02b49af98d
musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (llama/15413)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-20 13:42:37 +03:00
Marvin Gießing 2ce5860a62
ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (llama/15385)
* Added VSX intrinsics for Power9+ systems

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Manual unrolling for minor perf improvement

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Update ggml/src/ggml-cpu/arch/powerpc/quants.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: mgiessing <marvin.giessing@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-20 13:42:37 +03:00
Georgi Gerganov 80447f7412
cuda : remove obsolete sources (ggml/1332)
ggml-ci
2025-09-20 13:42:37 +03:00
Carlos Zoido 44fa2f647c
ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (#3426)
While working on the [whisper-cpp](https://conan.io/center/recipes/whisper-cpp) Conan package for ConanCenter, I noticed that enabling the `with_blas` option fails to build due to an issue in the _MKL_ detection logic.  

The problem is that the CMake condition currently expands `BLAS_INCLUDE_DIRS` without quotes:

```cmake
if (${BLAS_INCLUDE_DIRS} MATCHES "mkl" AND (${GGML_BLAS_VENDOR} MATCHES "Generic" OR ${GGML_BLAS_VENDOR} MATCHES "Intel"))
```
When `BLAS_INCLUDE_DIRS` is a list (as Conan provides it), the `if()` command receives multiple arguments and produces a CMake error:

```bash
...
-- BLAS found, Includes: /root/.conan2/p/b/openb034c5a6ca927b/p/include;/root/.conan2/p/b/openb034c5a6ca927b/p/include/openblas
CMake Error at ggml/src/ggml-blas/CMakeLists.txt:77 (if):
  if given arguments:

    "/root/.conan2/p/b/openb034c5a6ca927b/p/include" "/root/.conan2/p/b/openb034c5a6ca927b/p/include/openblas" "MATCHES" "mkl" "AND" "(" "OpenBLAS" "MATCHES" "Generic" "OR" "OpenBLAS" "MATCHES" "Intel" ")"

  Unknown arguments specified
...
```
This PR fixes the issue by quoting the variable:

```cmake
if ("${BLAS_INCLUDE_DIRS}" MATCHES "mkl" AND (${GGML_BLAS_VENDOR} MATCHES "Generic" OR ${GGML_BLAS_VENDOR} MATCHES "Intel"))
```

With this change, the whole list is treated as a single string and the regex still works correctly.
2025-09-19 05:33:53 +02:00
Reese Levine 5ed45b2518 ggml: Add initial WebGPU backend (llama/14521)
ggml-ci
2025-08-18 20:30:45 +03:00
Aaron Teo 03d6607691 ggml : initial zDNN backend (llama/14975) 2025-08-18 20:30:45 +03:00
compilade 0fd4a250df ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (llama/15379)
* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors

* ggml-quants : avoid division by zero in make_q3_quants
2025-08-18 20:30:45 +03:00
Jeff Bolz fcd694ec1a vulkan: disable spirv-opt for bfloat16 shaders (llama/15352) 2025-08-18 20:30:45 +03:00
Jeff Bolz 6835e0cf77 vulkan: Use larger workgroups for mul_mat_vec when M is small (llama/15355)
* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-18 20:30:45 +03:00
Dong Won Kim c225f25907 vulkan: support sqrt (llama/15370) 2025-08-18 20:30:45 +03:00
Jeff Bolz 0a8285186a vulkan: Optimize argsort (llama/15354)
- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
2025-08-18 20:30:45 +03:00
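A CPU-side sketch of the ideas listed in the argsort commit above: pad the row to the next power of two, always sort ascending, and apply the requested order only when writing the result. It assumes a bitonic sorting network, which this log does not confirm for the Vulkan shader; `next_pow2` and `argsort_row` are illustrative names, not functions from the backend.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (the modeled invocation count)."""
    p = 1
    while p < n:
        p *= 2
    return p


def argsort_row(values, descending=False):
    n = len(values)
    size = next_pow2(n)
    idx = list(range(size))  # slots >= n are padding

    def less(a, b):
        # Bounds check: padding slots compare as larger than every real slot.
        if a >= n:
            return False
        if b >= n:
            return True
        return (values[a], a) < (values[b], b)

    # Bitonic network, always producing ascending order.
    k = 2
    while k <= size:
        j = k // 2
        while j > 0:
            for i in range(size):
                partner = i ^ j
                if partner > i:
                    a, b = idx[i], idx[partner]
                    ascending_block = (i & k) == 0
                    out_of_order = less(b, a) if ascending_block else less(a, b)
                    if out_of_order:
                        idx[i], idx[partner] = b, a
            j //= 2
        k *= 2

    # The ascending/descending option is applied only at the final store.
    ascending = idx[:n]
    return ascending[::-1] if descending else ascending
```

When `n` is already a power of two, no padding slot exists and the bounds checks in `less` never fire, which is the case the commit specializes for.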
Jeff Bolz c44d449635 vulkan: fuse adds (llama/15252)
* vulkan: fuse adds

Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.

* check runtimeDescriptorArray feature

* disable multi_add for Intel due to likely driver bug
2025-08-18 20:30:45 +03:00
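One way to read the "up to 6 adds" limit in the fuse-adds commit above, as a sketch. It assumes a chain of n adds binds n + 1 input buffers plus one output buffer; that accounting is an inference from the stated 8-descriptor budget, and the helper names are hypothetical.

```python
MAX_DESCRIPTORS = 8  # per-dispatch budget quoted in the commit message


def descriptors_for_chain(num_adds: int) -> int:
    """A chain of num_adds adds reads num_adds + 1 operands and writes 1 result."""
    return (num_adds + 1) + 1


def max_fused_adds(max_descriptors: int = MAX_DESCRIPTORS) -> int:
    """Largest chain length whose operands and result fit in the budget."""
    return max_descriptors - 2


def fused_add(operands):
    """Sum same-shaped operands elementwise in a single pass."""
    length = len(operands[0])
    if any(len(op) != length for op in operands):
        raise ValueError("all operands must have the same shape")
    return [sum(op[i] for op in operands) for i in range(length)]
```

Under this accounting, raising the descriptor budget raises the fusion limit one for one.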
Jeff Bolz d14e626e6a vulkan: Support mul_mat_id with f32 accumulators (llama/15337)
* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id

* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up

- There's no explicit way to request f32 precision for mul_mat_id, but there
probably should be, and this gets the code in place for that.
- A couple fixes to check_results.
- Remove casts to fp16 in coopmat1 FA shader (found by inspection).
2025-08-18 20:30:45 +03:00
Jeff Bolz 5b62995350 vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id (llama/15334) 2025-08-18 20:30:45 +03:00
rmatif e27f4f205d OpenCL: add initial FA support (llama/14987)
* add F16/F16 fa support

* fix kernel init

* use mad instead of fma

* use inline function

* mark FA with sinks as unsupported for now

* add pragma unroll to loops
2025-08-18 20:30:45 +03:00
lhez 77771b2711 opencl: add initial mxfp4 support via mv (llama/15270)
* opencl: add reference `mul_mv_mxfp4_f32`

* opencl: add reference `mul_mv_id` for mxfp4

* Q4_0 transpose fix for Adreno

---------

Co-authored-by: shawngu-quic <shawngu@qti.qualcomm.com>
2025-08-18 20:30:45 +03:00
Georgi Gerganov 1e8d692365 vulkan : fix out-of-bounds access in argmax kernel (llama/15342)
ggml-ci
2025-08-18 20:30:45 +03:00
Georgi Gerganov 1a92fde1b6 vulkan : fix compile warnings on macos (llama/15340)
ggml-ci
2025-08-18 20:30:45 +03:00
Aaron Teo f797a6f9c8 ggml: initial IBM zDNN backend (llama/14975)
* ggml-zdnn: initial backend impl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: temp change z17 to arch15

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: fix build bugs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: tensor->extra logging check

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: add layout name mapping, ztensor information

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: separate logging into its own line

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: add shape comparison

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: add ggml_tensor shape log

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

ggml-zdnn: fix incorrect shape logging

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add output buffer check

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: run compute and store into tensor->extra

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add set_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add more loggers

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update set_tensor logging to check only for matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: last working matmul version

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add comments to prevent accidentally deleting lines

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: support op out_prod

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update op out_prod to use tensor->extra

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rewrite the backend implementation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: bugfix new impl

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix compiler warnings and bugfixes

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: test ztensor finding in init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: implement at least 1 op to test

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: assign tensor->extra to buffer

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add check for view tensors to prevent init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rework init_tensor to create new buffers

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch to std vector instead of array

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch buffers back and set to arbitrary number

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: impl init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update supports_op matmul matrix

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix incorrect ztensor shape, reduce memory padding

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code clean up

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: impl matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix compiler error missing type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix missing data transform call

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add bias init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: tighten memory usage, change string allocation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add bias ztensor and data free

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add bias data transform

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add more debug info for extra buffer transform

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add logger to check if mat mul ops go through set_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: activate bias transform in matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move weights transform into mulmat

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add more safeguards in matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix sequencing of transforms

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: bugfix transform ztensor vs origtensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: figure out why sigtrap is happening

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix sigsegv

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move everything back to local declaration

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move bias data to local also

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: bring back working matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rewrite into mre

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix missing vector import

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix missing vector import in header

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt to fix sigsegv

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix missing load tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix invalid ztensor buffer release

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add logging to debug free buffer

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: remove free_buffer debug info

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add parmblkformat detections

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add nnpa installed detection

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add zdnn_init call for static libs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt at fixing invalid buffer

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch to using deque to fix pointer deref problem

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add weights logging to check

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt to use unique ptr

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add tensor to pre_tfm_desc logging

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add inputs logging

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: disable op_none initialisation for testing

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix missing return from init_tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: load ztensors in cgraph exec

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: work on moving output ztensor as well

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: disable logging and breakpoints for full test

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt at manually changing the layout

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt at using default nwhc format instead

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: disable global load ztensor for now

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix erroneous output load tensor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: add guards to prevent loading ztensor if transformed

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: bring load ztensor back to init routine

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code clean up

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix ztensor deallocation abort

stabilise ggml <-> zdnn api

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: clean up matmul selection

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: clean up project structure

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update documentation, prepare for upstream

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* chore: add codeowners

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: disable batched matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt at fixing tensor views during matmul

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: deny all view tensors directly

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix pr comments

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: update ops docs for zdnn

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: redo test-backend-ops for ops.md

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: fix typo in build-s390x.md

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* codeowners: remove taronaeo for now

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "codeowners: remove taronaeo for now"

This reverts commit 411ea4ed78d08778967bd0bd33a6538cfcbe082f.

* ggml-zdnn: remove unused ggml_zdnn macro

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-08-18 20:30:45 +03:00
Johannes Gäßler ba32f5df0a CUDA: fix negative KV_max values in FA (llama/15321) 2025-08-18 20:30:45 +03:00
uvos 0e15332255 HIP: Cleanup hipification header (llama/15285)
add explicit conversion operator to support older versions of rocm
Switch over to hip_bf16 from legacy hip_bfloat16
Simplify RDNA3 define
Reduce swap over of new hipblas api to rocm 6.5 as this version is used for rocm 7.0 previews

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-18 20:30:45 +03:00
Jeff Bolz 1d8b21caa0 vulkan: perf_logger improvements (llama/15246)
* vulkan: perf_logger improvements

- Account for batch dimension in flops calculation.
- Fix how "_VEC" is detected for mat_mul_id.
- Fix "n" dimension for mat_mul_id (in case of broadcasting).
- Include a->type in name.

* use <=mul_mat_vec_max_cols rather than ==1
2025-08-18 20:30:45 +03:00
Jason Ni 4a6cf896ad ggml: fix ggml_conv_1d_dw bug (ggml/1323)
* ggml: fix ggml_conv_1d_dw bug

* Fixed conv1d_dw weight tensor dimension.
2025-08-18 20:30:45 +03:00
Sigbjørn Skjæret 367cd11f5d cuda : fix GGML_CUDA_GRAPHS=OFF (llama/15300)
* fix USE_CUDA_GRAPH=OFF

ggml-ci

* check capture status

* completely disable capturing check instead
2025-08-18 20:30:45 +03:00
Jonathan Graehl c76ec72d59 finetune: SGD optimizer, more CLI args (llama/13873)
* examples/finetune -opt SGD (stochastic gradient descent) memory opt

add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.

support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)

(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')

-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the initial -lr you set -lr-halvings 3.

note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence

new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)

cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)

since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)

test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values);  tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)

* Vulkan: Implement GGML_OP_OPT_STEP_SGD

* tests: Fix OPT_STEP_SGD test-backend-ops

* SGD op param store weight-decay and not 1-alpha*wd

* minor + cosmetic changes

* fix vulkan sgd

* try CI fix

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-18 20:30:45 +03:00
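A sketch of the optimizer step and learning-rate decay described in the SGD commit above. The update form `w * (1 - lr * wd) - lr * g` follows the commit's mention of the `1 - alpha*wd` factor. The decay formula is an assumption made for illustration: the log only says that `-lr-half` is a half-life and that `-lr-halvings 3` caps the decay at 1/8 of the initial rate. Both function names are hypothetical.

```python
def sgd_step(weights, grads, lr, wd=0.0):
    """Plain SGD with weight decay; keeps no per-parameter m/v state."""
    keep = 1.0 - lr * wd  # the (1 - alpha*wd) factor mentioned in the commit
    return [w * keep - lr * g for w, g in zip(weights, grads)]


def decayed_lr(lr0, epoch, half_life, max_halvings):
    """Assumed schedule: halve lr every half_life epochs, at most max_halvings times."""
    halvings = min(epoch / half_life, max_halvings)
    return lr0 * 0.5 ** halvings
```

Because the step needs only the weights and gradients, no `m` and `v` tensors are allocated, which matches the memory saving the commit reports relative to AdamW.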
uvos cbaec6c4ac HIP: bump requirement to rocm 6.1 (llama/15296) 2025-08-18 20:30:45 +03:00
Judd 80ef57f0f0 ggml : update `ggml_rope_multi` (llama/12665)
* update `rope_multi`:

1. add `ggml_rope_multi_inplace`;
1. use `GGML_MROPE_SECTIONS` instead of 4.

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-18 20:30:45 +03:00
Georgi Gerganov 0e8b244366 ggml : repack block_iq4_nlx8 (llama/14904)
ggml-ci
2025-08-18 20:30:45 +03:00
Oliver Simons b8b1b50c47 CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (llama/15132)
* Factor out `reduce_rows_f32` from common.cuh

This increases iteration cycle speed by not having to recompile
every kernel all the time

* Hide memory-latency by loop unrolling in reduce_rows_f32

* Further optimizations to `reduce_rows_f32`

1. Increase threadblock size to better hide latency of memory requests.
   As a consequence of bigger threadblocks, do 2-step summation, using
   shared memory to communicate results between invocations
2. Use sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflect bigger threadblock
4. Improve default block_dims, increase support for more block_dims

* Add perf tests for `reduce_rows_f32` kernel

* Add heuristic to toggle 128/512 threads based on sm count

Break even point was the minimum of the following multiples.

| GPU Model                    | Nrow SM Count Multiple |
| ---------------------------- | ---------------------- |
| RTX 4000 SFF ADA             | 2.0x                   |
| RTX 6000 ADA                 | 2.5x                   |
| RTX PRO 6000 Blackwell Max-Q | 3.04x                  |
| RTX PRO 4500 Blackwell       | 3.15x                  |

* Ensure perf gains also for small ncols and large nrows

Alternative to this, one could have also made the number of unrollings
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily

* Modify perf and unit-tests

* Apply auto-formatting by clang

* Fix CI build failure

See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.

* Remove sm_count property from `ggml_backend_cuda_context`

Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect

* Add CUB-based implementation for GGML_OP_MEAN

Currently this branch is only executed for nrows==1

* Add heuristics to execute CUB branch only when it brings perf

Heuristics were determined on the following HW:

* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell

* Add unit-test for CUB-based mean

Tests should run with CUDA Graphs enabled per default on NVGPUs

* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`

Suggested by @JohannesGaessler

* Unindent Preprocessor directives

See
https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506
2025-08-18 20:30:45 +03:00
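A CPU model of the two-step summation described in the `reduce_rows_f32` commit above, as a sketch only. It assumes a warp size of 32 and models threads as list slots; the real kernel's unrolling, `sum_temp` handling, and the 128/512-thread heuristic are not reproduced here. `reduce_row` is an illustrative name.

```python
WARP_SIZE = 32  # assumed warp width


def reduce_row(row, block_size=128):
    """Sum one row the way a single thread block might."""
    # Each "thread" t accumulates the elements t, t + block_size, ...
    partial = [sum(row[t::block_size]) for t in range(block_size)]

    # Step 1: every warp reduces its own partial sums.
    warp_sums = [
        sum(partial[w:w + WARP_SIZE]) for w in range(0, block_size, WARP_SIZE)
    ]

    # Step 2: the per-warp results (held in shared memory on a GPU)
    # are reduced once more to get the row total.
    return sum(warp_sums)
```

A larger `block_size` leaves each thread with fewer elements but adds more per-warp results to the second step, which is the trade-off the commit tunes.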
Tak-RS 4e234ac013 ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others) (llama/15188)
* ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others). Fixes #15055

* ggml-rpc: rename RPC_IO_CHUNK->MAX_CHUNK_SIZE, use std::min() for cap, switch to GGML_LOG_ERROR, handle 0-length send/recv

* rpc: drop n==0 special case in send_data(); retry in loop per review

* rpc: remove trailing whitespace in send_data()

---------

Co-authored-by: Shinnosuke Takagi <nosuke@nosukenoMacBook-Pro.local>
2025-08-18 20:30:45 +03:00
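The chunking approach in the ggml-rpc commit above can be sketched with Python sockets. `MAX_CHUNK_SIZE` is named in the commit, but its value is not given in this log, so the 1 MiB used here is an assumption; `send_data` and `recv_data` mirror the roles of the RPC helpers without claiming to match their signatures.

```python
import socket

MAX_CHUNK_SIZE = 1 << 20  # assumed cap; the real value is not shown in this log


def send_data(sock, data, chunk_size=MAX_CHUNK_SIZE):
    """Send all of data, never passing more than chunk_size bytes to one send()."""
    view = memoryview(data)
    sent = 0
    while sent < len(view):
        n = min(len(view) - sent, chunk_size)
        # send() may accept fewer bytes than offered; the loop retries the rest.
        sent += sock.send(view[sent:sent + n])


def recv_data(sock, size, chunk_size=MAX_CHUNK_SIZE):
    """Receive exactly size bytes, asking for at most chunk_size per recv."""
    buf = bytearray(size)
    view = memoryview(buf)
    got = 0
    while got < size:
        n = min(size - got, chunk_size)
        received = sock.recv_into(view[got:got + n])
        if received == 0:
            raise ConnectionError("peer closed the connection early")
        got += received
    return bytes(buf)
```

Capping the per-call size keeps each system call below platform limits on a single transfer, while the loops make the total length independent of that cap.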
uvos 8df931b608 HIP: disable sync warp shuffle operators from clr amd_warp_sync_functions.h (llama/15273) 2025-08-18 20:30:45 +03:00
Romain Biessy 1334f434f3 sycl: Fix and disable more configurations of mul_mat (llama/15151)
* sycl: Fix and disable more configurations of mul_mat

* Disable more configurations
2025-08-18 20:30:45 +03:00
rmatif 139110701e opencl: allow mixed f16/f32 `add` (llama/15140) 2025-08-18 20:30:45 +03:00
Aman Gupta 082c7ba67c CUDA cmake: add `-lineinfo` for easier debug (llama/15260) 2025-08-18 20:30:45 +03:00
Chenguang Li 0effaad964 CANN: GGML_OP_CPY optimization (llama/15070)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-18 20:30:45 +03:00
R0CKSTAR 8e2ddfec31 musa: fix failures in test-backend-ops for mul_mat_id op (llama/15236)
* musa: fix failures in test-backend-ops for mul_mat_id op

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-18 20:30:45 +03:00
hipudding 3e2c262c08 CANN: Add broadcast for softmax and FA (llama/15208)
* refactor softmax

* fix fa

* fix mask shape

* format

* add comments

* Remove whitespace
2025-08-18 20:30:45 +03:00
Charles Xu 30cc11dc94 kleidiai: fix unsigned overflow bug (llama/15150)
* kleidiai: fix unsigned overflow bug

* address review comments
2025-08-18 20:30:45 +03:00
David Zhao 457eadfe6f cuda: refactored ssm_scan and use CUB (llama/13291)
* cuda: refactored ssm_scan to use CUB

* fixed compilation error when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning
2025-08-18 20:30:45 +03:00
Aman Gupta 93c7a08019 CUDA: add attention sinks for tile and wmma (llama/15178)
* CUDA: add attention sinks for tile and wmma

* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
2025-08-18 20:30:45 +03:00
compilade 62566a5436 gguf-py : add Numpy MXFP4 de/quantization support (llama/15111)
* gguf-py : add MXFP4 de/quantization support

* ggml-quants : handle zero amax for MXFP4
2025-08-18 20:30:45 +03:00
AN Long 573bf9d128 ggml : fix field name when new ggml_backend (llama/14944) 2025-08-18 20:30:45 +03:00
Johannes Gäßler 2baea5e4b3 CUDA: attention sinks for mma FlashAttention (llama/15157) 2025-08-18 20:30:45 +03:00
lhez 8a36cd924a opencl: support sink in `soft_max` (attn sinks) (llama/15152) 2025-08-18 20:30:45 +03:00
Jeff Bolz 1984530710 vulkan: support fattn sinks (llama/15126) 2025-08-18 20:30:45 +03:00
Jeff Bolz 414e9074e0 vulkan: Add env var to disable host visible vidmem (llama/15109) 2025-08-18 20:30:45 +03:00
uvos 813ceb2a74 HIP: add cmake option to enable compiler output of kernel resource usage metrics (llama/15103) 2025-08-18 20:30:45 +03:00
Christian Kastner 6d7ffea292 ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (llama/15094)
Any available libraries are found and loaded dynamically at runtime.
2025-08-18 20:30:45 +03:00
Johannes Gäßler 5caf8a1ea2 CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (llama/15131)
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
2025-08-18 20:30:45 +03:00
rmatif b405fd88b3 fix profiling crash (llama/15072) 2025-08-18 20:30:45 +03:00
lhez d153cfb507 opencl: add `swiglu_oai` and `add_id` (llama/15121)
* opencl: add `swiglu-oai`

* opencl: add `add_id`

* opencl: add missing `add_id.cl`
2025-08-18 20:30:45 +03:00
Diego Devesa 6fb55d8f7c ggml : fix fallback to CPU for unsupported ops (llama/15118) 2025-08-18 20:30:45 +03:00
Chenguang Li e809e81e69 CANN: add support for ACL Graph (llama/15065)
* feat(cann): add optional support for ACL Graph execution

This commit adds support for executing ggml computational graphs using
Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be
enabled at compile time using the CMake option:

    -DUSE_CANN_GRAPH=ON

By default, ACL graph execution is **disabled**, and the fallback path
uses node-by-node execution.

Key additions:
- CMake option  to toggle graph mode
- Graph capture and execution logic using
- Tensor property matching to determine whether graph update is required
- Safe fallback and logging if the environment variable LLAMA_SET_ROWS
  is unset or invalid

This prepares the backend for performance improvements in repetitive graph
execution scenarios on Ascend devices.

Signed-off-by: noemotiovon <757486878@qq.com>

* Fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

* rename USE_CANN_GRAPH to USE_ACL_GRAPH

Signed-off-by: noemotiovon <757486878@qq.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-18 20:30:45 +03:00
Georgi Gerganov d3aab3efde llama : add gpt-oss (llama/15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (llama/7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (llama/1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (llama/11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (llama/6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (llama/13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-18 20:30:45 +03:00
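The MXFP4 items in the gpt-oss commit above (the e8m0 conversion and the e2m1 value table) can be illustrated as follows. This is a sketch based on the OCP Microscaling format definitions, not on ggml's block layout: nibble packing and the scaling of ggml's `kvalues_mxfp4` table are not shown in this log, so the function takes already-unpacked 4-bit codes. Both function names are hypothetical.

```python
import struct

# Magnitudes representable by E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def e8m0_to_float(e: int) -> float:
    """Decode an exponent-only E8M0 scale, 2**(e - 127), without calling pow."""
    if e == 0xFF:
        return float("nan")  # reserved NaN encoding
    # For e > 0 the byte is the float32 exponent field itself.
    # e == 0 means 2**-127, which is a float32 subnormal.
    bits = 0x00400000 if e == 0 else e << 23
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dequantize_mxfp4(scale_e8m0: int, codes):
    """Turn 4-bit E2M1 codes sharing one E8M0 scale into floats."""
    scale = e8m0_to_float(scale_e8m0)
    out = []
    for code in codes:
        sign = -1.0 if code & 0x8 else 1.0
        out.append(sign * E2M1_MAGNITUDES[code & 0x7] * scale)
    return out
```

Building the float from its bit pattern is one way to avoid a `powf` call per block, which is the kind of substitution the commit message refers to.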
Romain Biessy 6558022873 sycl: fix mul_mat selection (llama/15092) 2025-08-18 20:30:45 +03:00
Christian Kastner 349b9a2097 cmake: Add GGML_BACKEND_DIR option (llama/15074)
* cmake: Add GGML_BACKEND_DIR option

This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.

* Fix phrasing
2025-08-18 20:30:45 +03:00
Jeff Bolz 00ff38376a vulkan: fix build when using glslang that does not support coopmat2 (llama/15062) 2025-08-18 20:30:45 +03:00
Jeff Bolz abc971e69a vulkan: Use coopmat2 for conv2d (llama/14982) 2025-08-18 20:30:45 +03:00
lhez 53d8c5179f opencl: fix adreno compiler detection logic (llama/15029) 2025-08-18 20:30:45 +03:00
Johannes Gäßler d6e7315717 CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (llama/15035) 2025-08-18 20:30:45 +03:00
leejet a3123e105b cuda: make im2col a little faster (llama/15025) 2025-08-18 20:30:45 +03:00
Georgi Gerganov d119ecf0c1 cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (llama/15038)
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1

ggml-ci

* cont : fix cont types

ggml-ci

* cont : adopt variable names and comment from the other branch
2025-08-18 20:30:45 +03:00
Jeff Bolz b374fd6172 vulkan: coopmat2 mul_mat optimizations (llama/14934)
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it
  interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used
2025-08-18 20:30:45 +03:00
Jeff Bolz 97341224b2 vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (llama/15015) 2025-08-18 20:30:45 +03:00
Jeff Bolz 46e9e5b9a7 vulkan: optimizations for direct convolution (llama/14933)
* vulkan: optimizations for direct convolution

- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
  the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.

* Three tiles sizes for CONV_2D, and a heuristic to choose

* reallow collectives for pre-Turing

* make SHMEM_PAD a spec constant

* fixes for intel perf - no shmem padding, placeholder shader core count

* shader variants with/without unrolling

* 0cc4m's fixes for AMD perf

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-18 20:30:45 +03:00
Johannes Gäßler 7e7557ac50 CUDA: fix MMQ nwarps for AMD with warp_size==32 (llama/15014) 2025-08-18 20:30:45 +03:00
lhez ba6a81c9c9 opencl: add f16 for `add`, `sub`, `mul`, `div` (llama/14984) 2025-08-18 20:30:45 +03:00
Srihari-mcw 1c6cb7df47 ggml : Q2k interleaving implementation - x86/x64 SIMD (llama/14373)
* Initial Q2_K Block Interleaving Implementation

* Addressed review comments and clean up of the code

* Post rebase fixes

* Initial CI/CD fixes

* Update declarations in arch-fallback.h

* Changes for GEMV Q2_K in arch-fallback.h

* Enable repacking only on AVX-512 machines

* Update comments in repack.cpp

* Address q2k comments

---------

Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>
2025-08-18 20:30:45 +03:00
diannao 78668cb8d1 docker : add cann build pipeline (llama/14591)
* docker: add cann build pipeline

* docker: add cann build pipeline

* docker: fix cann devops

* cann : fix multi card hccl

* Update ggml/src/ggml-cann/ggml-cann.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update ggml-cann.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-08-18 20:30:45 +03:00
Ruben Ortlam 41e161657e Vulkan: Fix minor debug mode issues (llama/14899)
* vulkan: fix debug mode issues

* vulkan: remove broken check_results GGML_OP_SET_ROWS support
2025-08-18 20:30:45 +03:00
hipudding 572152d6af CANN: Improve loading efficiency after converting weights to NZ format. (llama/14985)
* CANN: Improve loading efficiency after converting weights to NZ format.

* CANN: fix typo
2025-08-18 20:30:45 +03:00
lhez 4904bc3bda opencl: add `mul_mat_f32_f32_l4_lm` and `mul_mat_f16_f32_l4_lm` (llama/14809) 2025-08-18 20:30:45 +03:00
uvos 8ed27b407d HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (llama/14949) 2025-08-18 20:30:45 +03:00
Johannes Gäßler 113d88686b CUDA: skip masked KV slices for all FA kernels (llama/14924) 2025-08-18 20:30:45 +03:00
uvos 4e624e42fa HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (llama/14945) 2025-08-18 20:30:45 +03:00
uvos 7f203f41aa HIP: add GGML_HIP_MMQ_MFMA option to allow disabling the MFMA path. (llama/14930)
This is useful for testing for regressions on GCN with CDNA hardware.

With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed with MFMA added and limited use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.
2025-08-18 20:30:45 +03:00
uvos a3899e78af HIP: Ignore unsupported unroll transformation in fattn-vec (llama/14931)
LLVM with the amdgcn target does not support unrolling loops with conditional break statements when those statements cannot be resolved at compile time. As in other places in GGML, simply ignore this warning.
2025-08-18 20:30:45 +03:00
hipudding c42e55e054 CANN: Add ggml_set_rows (llama/14943) 2025-08-18 20:30:45 +03:00
Sigbjørn Skjæret 682d659416 cuda : add softcap fusion (llama/14907) 2025-08-18 20:30:45 +03:00
Aman Gupta 577f47111e CUDA: add roll (llama/14919)
* CUDA: add roll

* Make everything const, use __restrict__
2025-08-18 20:30:45 +03:00
xctan 4dca34a4de ggml-cpu : deduplicate scalar implementations (llama/14897)
* remove redundant code in riscv

* remove redundant code in arm

* remove redundant code in loongarch

* remove redundant code in ppc

* remove redundant code in s390

* remove redundant code in wasm

* remove redundant code in x86

* remove fallback headers

* fix x86 ggml_vec_dot_q8_0_q8_0
2025-08-18 20:30:45 +03:00
Akarshan Biswas 4908e9dd05 SYCL: Add set_rows support for quantized types (llama/14883)
* SYCL: Add set_rows support for quantized types

This commit adds support for GGML_OP_SET_ROWS operation for various
quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16
type in the SYCL backend.

The quantization/dequantization copy kernels were moved from cpy.cpp
to cpy.hpp to make them available for set_rows.cpp.

This addresses part of the TODOs mentioned in the code.

* Use get_global_linear_id() instead

ggml-ci

* Fix formatting

ggml-ci

* Use const for ne11 and size_t variables in set_rows_sycl_q

ggml-ci

* Increase block size for q kernel to 256

ggml-ci

* Cleanup imports

* Add float.h to cpy.hpp
2025-08-18 20:30:45 +03:00
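For readers unfamiliar with the op, here is a minimal sketch of what a set_rows-style scatter does, assuming the semantics `dst[row_ids[i]] = src[i]`. The function name, argument order, and list-of-lists layout are illustrative only and are not the ggml or SYCL API; the quantized variants named in the commit would additionally quantize each row as it is written, which this sketch leaves out.

```python
def set_rows(dst, src, row_ids):
    """Scatter rows of src into dst so that dst[row_ids[i]] = src[i]."""
    if len(src) != len(row_ids):
        raise ValueError("need exactly one destination index per source row")
    for i, r in enumerate(row_ids):
        if not 0 <= r < len(dst):
            raise IndexError(f"row index {r} is out of range")
        dst[r] = list(src[i])  # copy so dst does not alias src
    return dst


# Hypothetical example: write two source rows into rows 3 and 1 of a 4-row table.
table = [[0.0, 0.0] for _ in range(4)]
set_rows(table, [[1.0, 2.0], [3.0, 4.0]], [3, 1])
```

Rows that are not named in `row_ids` are left untouched.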
Johannes Gäßler 24d3524bfd CUDA: fix pointer incrementation in FA (llama/14916) 2025-08-18 20:30:45 +03:00
Alberto Cabrera Pérez 923619ffd5 sycl: refactor quantization to q8_1 (llama/14815)
* sycl: quantization to q8_1 refactor

* Refactored src1 copy logic in op_mul_mat
2025-08-18 20:30:45 +03:00
Kai Pastor 45784c05ae cmake : Fix BLAS link interface (ggml/1316) 2025-08-18 20:30:45 +03:00
Kai Pastor 01bdc522e0 vulkan : fix 32-bit builds (ggml/1313)
The pipeline member can be cast to VkPipeline.
This is a VkPipeline_T* on 64 bit but a uint64_t on 32 bit.
Cf. VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.
2025-08-18 20:30:45 +03:00
Georgi Gerganov 28b39c624e ggml : remove old kompute, cann (skip) (#3349)
ggml-ci
2025-07-30 16:08:57 +03:00
Erik Scholz d96f4d8ea1 vulkan : add fp16 support for the conv_2d kernel (llama/14872)
* add f16 to conv_2d testing
* weaken conv2d test error threshold
2025-07-28 13:02:32 +03:00
Jeff Bolz 5693b857d2 vulkan: skip empty set_rows to avoid invalid API usage (llama/14860) 2025-07-28 13:02:32 +03:00
deepsek b275e52b46 HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (llama/14624)
This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908, CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k are only enabled on CDNA3 for now, as they fail to outperform BLAS in all cases on the other devices.
BLAS is currently only consistently outperformed on CDNA3 due to issues in the AMD-provided BLAS libraries.
This commit also improves the awareness of MMQ towards different warp sizes and, as a side effect, improves the performance of all quant formats on GCN GPUs except q4_0 and q4_1, which regress slightly.
2025-07-28 13:02:32 +03:00
hipudding 4692558a1f CANN: Implement GLU ops (llama/14884)
Implement REGLU, GEGLU, SWIGLU ops according to #14158
2025-07-28 13:02:32 +03:00
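As a rough reference for the three ops named above, the sketch below applies a gated linear unit to one row. It assumes the row is split into two halves, with the first half passed through the activation and the second half used as the multiplier, and it uses the tanh approximation of GELU. Both choices are assumptions for illustration, not statements about the CANN kernels.

```python
import math


def relu(x):
    return max(0.0, x)


def silu(x):
    return x / (1.0 + math.exp(-x))


def gelu_tanh(x):
    # tanh approximation of GELU (assumed here; an erf form also exists)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def glu(row, act):
    """Apply act to the first half of row and multiply by the second half."""
    if len(row) % 2 != 0:
        raise ValueError("row length must be even")
    half = len(row) // 2
    return [act(a) * b for a, b in zip(row[:half], row[half:])]


def reglu(row):
    return glu(row, relu)


def geglu(row):
    return glu(row, gelu_tanh)


def swiglu(row):
    return glu(row, silu)
```

The three variants differ only in the activation applied to the gate half.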
R0CKSTAR 8643960acc musa: fix build warnings (unused variable) (llama/14869)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-28 13:02:32 +03:00
Aaron Teo 6629201471 ggml-cpu : disable GGML_NNPA by default due to instability (llama/14880)
* docs: update s390x document for sentencepiece

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit e086c5e3a7ab3463d8e0906efcfa39352db0a48d)

* docs: update huggingface links + reword

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 8410b085ea8c46e22be38266147a1e94757ef108)

* ggml-cpu: disable ggml-nnpa compile flag by default

fixes #14877

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 412f4c7c88894b8f55846b4719c76892a23cfe09)

* docs: update s390x build docs to reflect nnpa disable

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit c1eeae1d0c2edc74ab9fbeff2707b0d357cf0b4d)

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-07-28 13:02:32 +03:00
Gabe Goodhart 0b0de0bbf2 metal: SSM_SCAN performance (llama/14743)
* feat: Add s_off as a parameter in the args struct

This may not be necessary, but it more closely mirrors the CUDA kernel

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here
are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
  computation

When tested with G4 tiny preview, this shows roughly a 3x speedup on
prefill and 15% speedup on decode.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Update logic to correctly do the multi-layer parallel sum

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Correctly size the shared memory buffer and assert expected size relationships

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Compute block offsets once rather than once per token

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use local variable for state recursion

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Use a secondary simd_sum instead of a for loop

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add assertion and comment about relationship between simd size and num simd groups

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallelize over d_state for mamba-1

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parallel sum in SSM_CONV

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Revert "feat: Parallel sum in SSM_CONV"

After discussion with @compilade, the size of the parallelism here is
not worth the cost in complexity or overhead of the parallel for.

https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357

This reverts commit 16bc059660c1c59e566628201c0ca2c20c9f4bc3.

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify shared memory sizing

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-28 13:02:32 +03:00
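The two-level summation described in this commit (a SIMD-group sum, partial results in shared memory, then a secondary simd_sum) can be modelled on the CPU as follows. This is a sketch under assumptions: a SIMD width of 32, a thread group whose size is a multiple of that width, and no more SIMD groups than lanes so the second pass fits in one group. The helper names are hypothetical and this is not the Metal kernel.

```python
SIMD_WIDTH = 32  # assumed SIMD-group width


def simd_sum(lanes):
    """Stand-in for a hardware reduction across the lanes of one SIMD group."""
    return sum(lanes)


def threadgroup_sum(values):
    """Sum one value per thread using two simd_sum passes."""
    n = len(values)
    if n % SIMD_WIDTH != 0:
        raise ValueError("thread group size must be a multiple of the SIMD width")
    num_groups = n // SIMD_WIDTH
    if num_groups > SIMD_WIDTH:
        raise ValueError("second pass needs num_groups <= SIMD width")

    # Pass 1: each SIMD group reduces its own lanes and writes one partial
    # result into (simulated) shared memory.
    shared = [0] * SIMD_WIDTH
    for g in range(num_groups):
        shared[g] = simd_sum(values[g * SIMD_WIDTH:(g + 1) * SIMD_WIDTH])

    # Pass 2: a single SIMD group reduces the partials; unused slots are zero.
    return simd_sum(shared)
```

The size check mirrors the kind of relationship between SIMD size and number of SIMD groups that the commit says it asserts.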
lhez d414c3f6ac opencl: add fused `rms_norm_mul` (llama/14841)
* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
2025-07-28 13:02:32 +03:00
Oliver Simons bbf2389919 ggml : remove invalid portPos specifiers from dot files (llama/14838)
Neither "g" nor "x" are valid portPos specifiers per the official
[graphviz documents](https://graphviz.org/docs/attr-types/portPos/):

> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".

I verified locally that Graphviz falls back to the default portPos specifier
when an invalid one is given. As a consequence, we can remove the associated
code.
2025-07-28 13:02:32 +03:00
Chris Rohlf 56350ecc12 rpc : check for null buffers in get/set/copy tensor endpoints (llama/14868) 2025-07-28 13:02:32 +03:00
Diego Devesa 270fa9b25c sched : fix multiple evaluations of the same graph with pipeline parallelism (llama/14855)
ggml-ci
2025-07-28 13:02:32 +03:00
R0CKSTAR 89ae789450 musa: upgrade musa sdk to rc4.2.0 (llama/14498)
* musa: apply mublas API changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: update musa version to 4.2.0

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore MUSA graph settings in CMakeLists.txt

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: disable mudnnMemcpyAsync by default

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: switch back to non-mudnn images

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* minor changes

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-28 13:02:32 +03:00
Kai Pastor 5823eabc78 cmake : Indent ggml-config.cmake (ggml/1310) 2025-07-28 13:02:32 +03:00
Alberto Cabrera Pérez 7dc5ae2d6a sycl: fixed semantics of block offset calculation (llama/14814) 2025-07-28 13:02:32 +03:00
Georgi Gerganov faedce5dcb metal : fix fusion across different encoders (llama/14849)
* metal : fix fusion across different encoders

ggml-ci

* cont : add assertion

ggml-ci
2025-07-28 13:02:32 +03:00
Donghyeon Jeong e648f9f079 sycl: fix undefined variable in work group size check (llama/14843) 2025-07-28 13:02:32 +03:00
Johannes Gäßler 95efcf011d CUDA: fix overflow in FA, tune performance (llama/14840) 2025-07-28 13:02:32 +03:00
Johannes Gäßler 8272aa9f14 CUDA: fix compilation with GGML_CUDA_F16 (llama/14837) 2025-07-28 13:02:32 +03:00
Johannes Gäßler a65976fc3c CUDA: fix quantized KV cache + multiple sequences (llama/14822)
* CUDA: fix quantized KV cache + multiple sequences

* Update ggml/src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-28 13:02:32 +03:00
lixing-star 026d8a0c6e ggml: fix loongarch quantize_row_q8_1 error (llama/14827) 2025-07-28 13:02:32 +03:00
chen fan 49d5540206 CANN: weight format to NZ for Ascend310P3 (llama/14407)
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
2025-07-28 13:02:32 +03:00
Aman Gupta f8402d0a95 CUDA: add fused rms norm (llama/14800) 2025-07-28 13:02:32 +03:00
Jeff Bolz c91361379a vulkan: fix rms_norm_mul to handle broadcasting dim0 (llama/14817) 2025-07-28 13:02:32 +03:00
Sigbjørn Skjæret 810018a63a cuda : implement bf16 cpy ops and enable bf16 cont (llama/14763)
* implement bf16 cpy ops and enable bf16 cont

* deduplicate copy functions

* deduplicate checks
2025-07-28 13:02:32 +03:00
lhez de49384ab3 opencl: remove unreachable `return` (llama/14806) 2025-07-28 13:02:32 +03:00
R0CKSTAR 9008410087 cuda: remove linking to cublasLt (llama/14790)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-28 13:02:32 +03:00
Sigbjørn Skjæret e81e17b048 opencl: fix `im2col` when `KW!=KH` (llama/14803) 2025-07-28 13:02:32 +03:00
rmatif a2a5612402 opencl: add conv2d kernel (llama/14403)
* add conv2d kernel

* fix trailing whitespace

* whitespace fixes

* handle f16 input and f16 kernel, more opt

* resolve conflicts

* use enqueue_ndrange_kernel
2025-07-28 13:02:32 +03:00
Romain Biessy 52ad451c8a sycl: Fix im2col (llama/14797) 2025-07-28 13:02:32 +03:00
Charles Xu fc2ff438fd kleidiai: add support for get_rows (llama/14676)
* kleidiai: add support for get_rows

* apply fixes based on code review

* apply more fixes based on code review
2025-07-28 13:02:32 +03:00
Jeff Bolz e3f4162a06 vulkan/cuda: Fix im2col when KW!=KH (llama/14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-28 13:02:32 +03:00
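The index decomposition quoted in this commit can be checked with a small round trip. The sketch assumes that "change ksize to match" means `ksize = OW * KH`; the function names are hypothetical and the code is an illustration, not the shader source.

```python
def compose_tid(ow, ky, kx, OW, KH):
    """Flatten (ow, ky, kx) as tid = ow + ky*OW + kx*OW*KH."""
    return ow + ky * OW + kx * OW * KH


def decompose_tid(tid, OW, KH):
    """Invert compose_tid; ksize must be OW*KH for the round trip to hold."""
    ksize = OW * KH
    kx = tid // ksize
    rem = tid - kx * ksize
    ky = rem // OW
    ow = rem % OW
    return ow, ky, kx
```

With a square kernel, `OW * KW` and `OW * KH` are equal, which would explain why a mismatch only shows up when `KW != KH`.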
Ervin Áron Tasnádi 92a9e85d8b ggml: adds CONV_2D op and direct GEMM Vulkan implementation (llama/14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col)

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-28 13:02:32 +03:00
Peter0x44 50f983a17e vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (llama/14707) 2025-07-28 13:02:32 +03:00
0cc4m b06f314667 Vulkan: Fix fprintf format-security warning (llama/14770) 2025-07-28 13:02:32 +03:00
Kai Pastor 5c3b794c51 cmake : fix usage issues (ggml/1257)
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
2025-07-28 13:02:32 +03:00
Daniel Bevenius e238dc1bdd ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
This commit removes the inclusion of `<cstdlib>`.

The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
2025-07-28 13:02:32 +03:00
Georgi Gerganov 0ed687c6f1 metal : fuse add, mul + add tests (llama/14596)
ggml-ci
2025-07-20 00:23:50 +03:00
Oliver Simons d4a7ea1634 cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (llama/14741)
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs

Gemma3n uses Matrix-Matrix addition as part of their input processing,
wrongly triggering CUDA_GRAPH disablement on NVGPUs even when batch-size
of 1 is used.

* Exclude `project_per_layer_input` by matching node names

This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.

* Revert unnecessary formatting changes
2025-07-20 00:23:50 +03:00
Aman Gupta 9a07cb064a CUDA: set_rows + cpy.cu refactor (llama/14712) 2025-07-20 00:23:50 +03:00
Neo Zhang Jianyu fed20b0682 use max work group size for device to replace the magic number (llama/14732) 2025-07-20 00:23:50 +03:00
Reese Levine 17c5411195 ggml: Add initial WebGPU backend (llama/14521)
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults

* Initialize webgpu device

* Making progress on setting up the backend

* Finish more boilerplate/utility functions

* Organize file and work on alloc buffer

* Add webgpu_context to prepare for actually running some shaders

* Work on memset and add shader loading

* Work on memset polyfill

* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it

* Implement get_tensor and buffer_clear

* Finish rest of setup

* Start work on compute graph

* Basic mat mul working

* Work on emscripten build

* Basic WebGPU backend instructions

* Use EMSCRIPTEN flag

* Work on passing ci, implement 4d tensor multiplication

* Pass thread safety test

* Implement permuting for mul_mat and cpy

* minor cleanups

* Address feedback

* Remove division by type size in cpy op

* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends

* Fix name

* Fix macos dawn prefix path
2025-07-20 00:23:50 +03:00
Georgi Gerganov ae1bb2c8ea llama : add high-throughput mode (llama/14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (llama/14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-20 00:23:50 +03:00
Georgi Gerganov 9cc645fec0 ggml : add asserts (llama/14720)
* ggml : add asserts

ggml-ci

* cont : fix constant type

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-07-20 00:23:50 +03:00
Jeff Bolz 8d1a0485f1 vulkan: fix noncontig check for mat_mul_id splitting (llama/14683)
* vulkan: fix noncontig check for mat_mul_id splitting

Remove supports_op check for > 4096 (splitting fixes this)

* vulkan: fix batched matmul dequant for Q*_K
2025-07-20 00:23:50 +03:00
Jeff Bolz b33841c453 vulkan: add RTE variants for glu/add/sub/mul/div (llama/14653) 2025-07-20 00:23:50 +03:00
R0CKSTAR ab79c6c118 cuda: fix build warnings in set-rows.cu (unused variable) (llama/14687)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-20 00:23:50 +03:00
Anton Mitkov a6b9271c2c sycl: Hotfix for non dnnl codepath (llama/14677) 2025-07-20 00:23:50 +03:00
shalinib-ibm ded2e3cf6d ggml : refactor llamafile_sgemm PPC code (llama/14673)
Remove unnecessary templates from class definition and packing functions
Reduce deeply nested conditionals and if-else switching in the mnpack function
Replace repetitive code with inline functions in packing functions

2 ~ 7% improvement in Q8 Model
15 ~ 50% improvement in Q4 Model

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-07-20 00:23:50 +03:00
Akarshan Biswas ebb0e9d0ed SYCL: use 1D kernel for set_rows (llama/14618)
* SYCL: Use 1D kernel for set_rows

* Remove dangling comment

* Refactor and use ceil_div
2025-07-20 00:23:50 +03:00
Anton Mitkov 24803d62c6 sycl: Batched mulmat rework for oneDNN dispatch (llama/14617) 2025-07-20 00:23:50 +03:00
Sigbjørn Skjæret 0611387d17 cuda : add set rows for bf16 (llama/14664) 2025-07-20 00:23:50 +03:00
Yavor Ivanov fe33572b22 cuda : add ELU support (llama/14657) 2025-07-20 00:23:50 +03:00
Georgi Gerganov 21308b4e6e ggml : add build-time message to remind about ggml_set_rows (llama/14661)
ggml-ci
2025-07-20 00:23:50 +03:00
Yavor Ivanov 3cad26d807 metal : Add missing unary ops Metal support (llama/14660) 2025-07-20 00:23:50 +03:00
Aman Gupta 66b3a39bdc CUDA: add set rows for f32 and f16 (llama/14551)
* CUDA: add set rows for f32 and f16

* Review: change kernel params, use strides from host

* Use 1-d kernel

* Review: use int64_t for blockDim.x, rename nb->s for clarity
2025-07-20 00:23:50 +03:00
Georgi Gerganov 3775c503d5 sync : resolve conflicts (#0)
ggml-ci
2025-07-12 19:23:56 +03:00
Georgi Gerganov 85dcc74b88 sync : resolve conflicts (ggml/0)
ggml-ci
2025-07-12 19:23:56 +03:00
Jeff Bolz 915fc153a5 vulkan: support SET_ROWS (llama/14587)
* vulkan: support SET_ROWS

Add variants of the copy_to_quant shader that do the SET_ROWS operation.
Change these shaders to spread the work across the workgroup.
The memory access pattern is probably not great (one thread per quant block),
but should be fine for now.

* vulkan: optimize set_rows

Larger workgroups for non-quant types.
Set "norepeat" (there is manual repeat logic).
Use fastmod.
2025-07-12 19:23:56 +03:00
Jeff Bolz 8670a3fd5d vulkan: optimizations for deepseek prompt processing (llama/14555)
* vulkan: allow unclamped loads in coopmat2 mul_mat_id shader

* vulkan: increase coopmat2 mul_mat_id tile size

* vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path

* vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
2025-07-12 19:23:56 +03:00
Tarek Dakhran 74f6d47904 model : support LiquidAI LFM2 hybrid family (llama/14620)
**Important**
LFM2 was [merged](https://github.com/huggingface/transformers/pull/39340) into transformers, but has not yet been released.
To convert into gguf, install transformers from source
```shell
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
```
2025-07-12 19:23:56 +03:00
Slobodan Josic a4ff4ec9cb HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (llama/14634) 2025-07-12 19:23:56 +03:00
rmatif b0754136be opencl: add tiled mul_mat_f16_f32 (llama/14535)
* add tiled mul_mat_f16_f32

* fix trailing whitespace

* add insightful comments
2025-07-12 19:23:56 +03:00
lhez 6f113cbcaa opencl: add `set_rows` for `f16` and `f32` (llama/14547)
* opencl: add `set_rows` for `f16` and `f32`

* opencl: better choose workgroup size for `set_rows`
2025-07-12 19:23:56 +03:00
Akarshan Biswas 3c21cde540 SYCL: Initial set_rows kernel implementation (llama/14562)
* SYCL: Initial set_rows kernel implementation

* Revert max_threads to 256

* Refactor set_rows and address review comments

* Deduplicate conversion function

* Remove guard before kernel launch and refactor

* Fix and add back SFINAE
2025-07-12 19:23:56 +03:00
compilade fb885fa48b cuda : support Falcon-H1 state size for SSM_SCAN (llama/14602) 2025-07-12 19:23:56 +03:00
Xuan-Son Nguyen 2021870fb8 ggml : add ggml_scale_bias (llama/14417)
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code look more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
2025-07-12 19:23:56 +03:00
Miaoqian Lin 48b18f9eb8 ggml : prevent integer overflow in gguf tensor size calculation (llama/14595) 2025-07-12 19:23:56 +03:00
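One common way to guard a size computation of this kind is to check each multiplication against the maximum representable value before performing it. The sketch below models that with a 64-bit limit; the function names, the block-size parameter, and the limit are assumptions for illustration and do not describe the actual gguf fix.

```python
SIZE_MAX = 2 ** 64 - 1  # assumed size_t range


def checked_mul(a, b, limit=SIZE_MAX):
    """Return a*b, or raise OverflowError if the product would exceed limit."""
    if a < 0 or b < 0:
        raise ValueError("operands must be non-negative")
    if b != 0 and a > limit // b:
        raise OverflowError("size computation would overflow")
    return a * b


def tensor_nbytes(ne, type_size, block_size=1):
    """Bytes needed for a tensor with dimensions ne (hypothetical helper)."""
    n_elements = 1
    for dim in ne:
        n_elements = checked_mul(n_elements, dim)
    if n_elements % block_size != 0:
        raise ValueError("element count must be a multiple of the block size")
    return checked_mul(n_elements // block_size, type_size)
```

Python integers do not wrap, so the explicit limit stands in for the fixed-width arithmetic a C implementation would have.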
Jeff Bolz fadb3233b6 vulkan: optimize flash attention split_k_reduce (llama/14554)
* vulkan: allow FA split_k with smaller KV values

* vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a
lot for large HSV (like deepseek).
2025-07-12 19:23:56 +03:00
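The reduction step can be illustrated with the standard way of merging split-K softmax partials. This sketch assumes each split produces a running maximum `m`, a sum of exponentials `l`, and an unnormalized output vector `o`; whether the Vulkan shader stores its partials in exactly this form is not stated in the commit, and all names are illustrative.

```python
import math


def attend_partial(scores, values):
    """Return (m, l, o) for one split: max score, exp-sum, unnormalized output."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    o = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, l, o


def split_k_reduce(partials):
    """Merge per-split (m, l, o) triples into the final normalized output."""
    m_max = max(m for m, _, _ in partials)
    l_total = 0.0
    out = [0.0] * len(partials[0][2])
    for m, l, o in partials:
        scale = math.exp(m - m_max)  # rescale this split to the global maximum
        l_total += scale * l
        for d in range(len(out)):    # one independent element per output dim
            out[d] += scale * o[d]
    return [x / l_total for x in out]
```

Each output element is independent of the others, which is what makes it possible to assign one thread per element of the output dimension.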
Jeff Bolz 9750e4c988 vulkan : fix rope with partial rotation and non-cont src (llama/14582) 2025-07-12 19:23:56 +03:00
Georgi Gerganov c3942b3db6 cuda : fix rope with partial rotation and non-cont src (llama/14580)
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
2025-07-12 19:23:56 +03:00
Aman Gupta 98e7beac6c CUDA: add bilinear interpolation for upscale (llama/14563) 2025-07-12 19:23:56 +03:00
R0CKSTAR 7e9c6bbab2 musa: fix build warnings (unused variable) (llama/14561)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-12 19:23:56 +03:00
Aman Gupta 8e545f466c CUDA: add bf16 and i32 to getrows (llama/14529) 2025-07-12 19:23:56 +03:00
Eve e753b9a952 vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (llama/14485)
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>
2025-07-12 19:23:56 +03:00
Jeff Bolz 9d0c408260 vulkan: fix rms_norm+mul fusion (llama/14545)
The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.
2025-07-12 19:23:56 +03:00
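To make the fused operation concrete, the sketch below compares rms_norm followed by mul with a single fused pass. It assumes the usual RMS-norm definition, `x / sqrt(mean(x^2) + eps)`, and an element-wise weight of the same length; the function names are hypothetical. Note that `eps` belongs to the norm step, so a fused kernel has to take it from there rather than from the mul step.

```python
import math


def rms_norm(x, eps):
    """Scale x by the reciprocal of its root mean square (with eps added)."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


def mul(a, b):
    return [p * q for p, q in zip(a, b)]


def rms_norm_mul(x, w, eps):
    """Fused variant: normalize and multiply by w in one pass over x."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale * wi for v, wi in zip(x, w)]
```

A check that compares the fused result against the unfused pair is a simple way to validate such a fusion.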
Jeff Bolz 3aebb8d5d3 vulkan: Handle updated FA dim2/3 definition (llama/14518)
* vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push
constant block under the 128B limit.

* handle null mask for gqa

* allow gqa with dim3>1
2025-07-12 19:23:56 +03:00
Sigbjørn Skjæret df5af1dc75 opencl: add GELU_ERF (llama/14476) 2025-07-12 19:23:56 +03:00
Georgi Gerganov 10d0d28f7c metal : disable fast math in all quantize kernels (llama/14528)
ggml-ci
2025-07-12 19:23:56 +03:00
luyhcsu af304ef080 CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (llama/14002)
Co-authored-by: luyuhong <luyuhong@kylinos.cn>
2025-07-12 19:23:56 +03:00
Sigbjørn Skjæret e8138c51d2 ggml : implement GEGLU_ERF and GEGLU_QUICK ops (llama/14445) 2025-07-12 19:23:56 +03:00
lhez 7cec4cc83a opencl : broadcast for soft_max (llama/14510) 2025-07-12 19:23:56 +03:00
Jeff Bolz a432929d58 vulkan: support mixed/deepseekR1 FA head sizes (llama/14509)
* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes
2025-07-12 19:23:56 +03:00
Johannes Gäßler 4aaf8114e7 ggml: backward pass for split swiglu (llama/14483) 2025-07-12 19:23:56 +03:00
Nicolò Scipione 0ca760433c Fix conditional enabling following arch checks for ggml-sycl (llama/14504)
Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
2025-07-12 19:23:56 +03:00
Georgi Gerganov ed639c7f22 kv-cache : use ggml_set_rows (llama/14285)
* kv-cache : use ggml_set_rows

ggml-ci

* graph : separate k and v indices

ggml-ci

* cont : remove redundant ifs

ggml-ci

* kv-cache : improve find_slot impl

* kv-cache : bounds-check when accessing slot_info indices

* kv-cache : add comments

ggml-ci

* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci
2025-07-12 19:23:56 +03:00
Georgi Gerganov 0abd0660e1 ggml : fix FA mask dim 2 and 3 (llama/14505)
* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1
2025-07-12 19:23:56 +03:00
Aman Gupta 9cde908c0a CUDA: add dynamic shared mem to softmax, refactor general usage (llama/14497) 2025-07-12 19:23:56 +03:00