Commit Graph

2159 Commits

Author SHA1 Message Date
Mario Limonciello 39b5f414a3 Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (llama/19461)
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).

The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
 2549 |             using ARegsT = typename Impl::ARegsT;
```

Add a workaround that explicitly declares the types and casts when
compiling with HIP and ROCWMMA_FATTN [2]. Once this is fixed upstream,
guards can be added to detect the version that has the fix, so that
the workaround applies only when necessary.

Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2026-02-15 21:44:37 +02:00
Max Krasnyansky 304205679c hexagon: further optimization and tuning of matmul and dot kernels (llama/19407)
* ggml-hexagon: implement 2x2 matmul kernel

* hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4

* hexagon: fix editor config failures

* hexagon: refactor matmul ops to use context struct and remove wrappers

Also implement vec_dot_f16 2x2

* hexagon: refactor dyn quantizers to use mmctx

* hexagon: remove mm fastdiv from op_ctx

* hexagon: refactor matmul entry point to reduce code duplication

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
lhez 0326fd37dd opencl: add general Q6_K mm and Q4_K mv (llama/19347)
* opencl: add general q6_k mm

* opencl: refine condition for q6_K mm

* opencl: add general q4_K mv

* opencl: fix whitespace
2026-02-15 21:44:37 +02:00
Georgi Gerganov f3e78985be ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/19511)
* ggml : unary ops support non-cont src0

* metal : support F16 unary ops + fix ELU
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3ffa1fd84e metal : extend l2_norm support for non-cont src0 (llama/19502) 2026-02-15 21:44:37 +02:00
Max Krasnyansky 09587ceb12 hexagon: Add ARGSORT, DIV, SQR, SQRT, SUM_ROWS, GEGLU (llama/19406)
* hexagon: add ARGSORT op

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>

* hexagon: argsort reject tensors with huge rows for now

* Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend

* hexagon : Add GEGLU op

* hexagon: fix editor config check

* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA

---------

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3504358056 ggml : extend bin bcast for permuted src1 (llama/19484)
* tests : extend bin bcast for permuted src1

* cont : extend bin support

* cont : s0 is always 1

* tests : simplify
2026-02-15 21:44:37 +02:00
Georgi Gerganov de949fb1db metal : consolidate unary ops (llama/19490) 2026-02-15 21:44:37 +02:00
Oliver Simons 57c620b4b1 CUDA : Update CCCL-tag for 3.2 to final release from RC (llama/19486)
CCCL 3.2 has been released since it was added to llama.cpp as part of
the backend-sampling PR, and it makes sense to update from RC to final
released version.

https://github.com/NVIDIA/cccl/releases/tag/v3.2.0
2026-02-15 21:44:37 +02:00
Nikhil Jain 562255fd77 Plug memory leaks and free resources on shutdown (llama/19315)
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool

* Free pools

* Cleanup

* More cleanup

* Run clang-format

* Fix arg-parser and tokenizer test errors that free an unallocated buffer

* Fix device lost callback to not print on device teardown

* Fix include and run clang-format

* remove unused unused

* Update binary ops

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-02-15 21:44:37 +02:00
Alberto Cabrera Pérez d77265c818 ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (llama/19360)
* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention
2026-02-15 21:44:37 +02:00
k4ss4n b0fe2e84fa ggml : use noexcept overload for is_regular_file in backend registration (llama/19452)
Using the noexcept overload of std::filesystem::directory_entry::is_regular_file
prevents abnormal termination when an error would otherwise be thrown
(as caused by symlinks to non-existent folders on Linux).

Resolves: #18560
2026-02-15 21:44:37 +02:00
Raul Torres 2de2fc9270 CANN: Remove unnecessary wrapper for `ggml_backend_buft_is_cann` (llama/18968) 2026-02-15 21:44:37 +02:00
hipudding 6a74f56212 CANN: implement quantized MUL_MAT_ID for MoE models (llama/19228)
Implement ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on CANN backend.

Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes

Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
2026-02-15 21:44:37 +02:00
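The expert routing described in 6a74f56212 can be illustrated with a minimal sketch. It is written in plain Python with unquantized floats, and every name in it (`matvec`, `mul_mat_id`, the toy weights) is hypothetical; it only shows the "select the routed experts' weights by index, then multiply" idea, not the CANN operators or the Q4_0/Q8_0 handling.

```python
# Hypothetical sketch of MUL_MAT_ID routing; not the CANN implementation.

def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def mul_mat_id(expert_weights, inputs, ids):
    """Apply, for each token, the experts listed in ids[token].

    expert_weights: one matrix per expert
    inputs:         one input vector per token
    ids:            per token, the list of selected expert indices
    Returns outputs[token][slot] = expert_weights[ids[token][slot]] @ inputs[token].
    """
    outputs = []
    for x, selected in zip(inputs, ids):
        # "Index select": gather only the weights of the routed experts.
        outputs.append([matvec(expert_weights[e], x) for e in selected])
    return outputs


experts = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scale by 2
    [[0.0, 1.0], [1.0, 0.0]],  # expert 2: swap the two components
]
tokens = [[1.0, 2.0], [3.0, 4.0]]
routing = [[0, 2], [1, 1]]  # token 0 -> experts 0 and 2, token 1 -> expert 1 twice

result = mul_mat_id(experts, tokens, routing)
```

Each token and expert slot is processed independently here, which mirrors the "process each batch and expert combination independently" note in the commit message.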
Georgi Gerganov a36210c836 cuda : extend GGML_OP_PAD to work with non-cont src0 (llama/19429)
* cuda : extend GGML_OP_PAD to work with non-cont src0

* tests : add permuted pad
2026-02-15 21:44:37 +02:00
Oliver Simons 808904277e CUDA: Fix non-contig rope (llama/19338)
* Rename variables + fix rope_neox

Seems memory layout is shared with Vulkan so we can port fix from
https://github.com/ggml-org/llama.cpp/pull/19299

* Fix rope_multi

* Fix rope_vision

* Fix rope_norm

* Rename ne* to ne0* for consistent variable naming

* cont : consistent stride names

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 55d7cb2e93 metal : consolidate bin kernels (llama/19390)
* metal : refactor bin kernels

* cont

* cont : fix cv
2026-02-08 09:29:10 +02:00
Georgi Gerganov a9a0a51fba metal : fix event synchronization in cpy_tensor_async (llama/19402) 2026-02-08 09:29:10 +02:00
Abhijit Ramesh 1739af663a ggml-webgpu: JIT compile binary operators and handle binding overlaps (llama/19310)
* ggml webgpu: port binary operators to use pre-wgsl

* Add binary.wgsl: unified shader with conditionals for all 4 ops

* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor

* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)

* Update CMake to generate binary operator shaders at build time

* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling

* port binary operators from AOT to pre-wgsl JIT compilation

* add src1=dst overlap handling for binary ops

* use compile-time workgroup size defines instead of runtime overrides

* ggml-webgpu: complete overlap handling for binary ops

* add support for inplace & overlap case in binding setup

* restructure conditional logic to handle all overlap cases

* ensure all buffer bindings are correctly assigned for edge cases

* ggml-webgpu: remove unused binary overlap cases

Remove src0==src1 binary overlap case that never occurs in practice.

* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT

* remove unused src0==src1 and all-same variant

* refactor wgsl to eliminate duplication
2026-02-08 09:29:10 +02:00
Nechama Krashinski f2f7320817 sycl: add F16 support for GGML_OP_CEIL (llama/19306)
* Fix SYCL CEIL operator

* sycl: implement GGML_OP_CEIL
2026-02-08 09:29:10 +02:00
Jeff Bolz cea22b3075 vulkan: For coopmat2 FA, use fp16 accumulators for the final result (llama/19376)
The cpu and cuda backends use fp16 for the VKQ accumulator type, this change
does the same for vulkan. This helps particularly with large head sizes which
are very register-limited.

I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.

I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
2026-02-08 09:29:10 +02:00
Jeff Bolz c1b63354bb vulkan: make FA mask/softcap enables spec constants (llama/19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-08 09:29:10 +02:00
Georgi Gerganov 776cf61857 metal : skip loading all-zero mask (llama/19337)
* metal : skip loading all-zero mask

* cont : minor
2026-02-08 09:29:10 +02:00
Georgi Gerganov 2a7d5490f1 cuda : cuda graphs now compare all node params (llama/19383) 2026-02-08 09:29:10 +02:00
Georgi Gerganov 34d332aca5 metal : adaptive CPU/GPU interleave based on number of nodes (llama/19369) 2026-02-08 09:29:10 +02:00
Jeff Bolz a567c140a3 vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-08 09:29:10 +02:00
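The mask preprocessing described in a567c140a3 can be sketched roughly as follows. This is a Python illustration only: the numeric code values, the flat 1-D mask, and the function names are assumptions made for the example, and the actual shader works on 2-D tiles and may use a different encoding.

```python
import math

# Hypothetical 2-bit codes; the real encoding used by the shader may differ.
MIXED, ALL_ZERO, ALL_NEG_INF = 0, 1, 2


def classify_block(block):
    """Return the code for one block of mask values."""
    if all(v == 0.0 for v in block):
        return ALL_ZERO
    if all(v == -math.inf for v in block):
        return ALL_NEG_INF
    return MIXED


def preprocess_mask(mask, block_size):
    """Classify consecutive blocks of a flat mask (simplification: 1-D)."""
    return [
        classify_block(mask[i:i + block_size])
        for i in range(0, len(mask), block_size)
    ]


mask = [0.0, 0.0, -math.inf, -math.inf, 0.0, -math.inf]
codes = preprocess_mask(mask, 2)
```

A consumer of `codes` would only need to load the mask values for blocks marked `MIXED`; the other two cases are fully determined by the code.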
Georgi Gerganov 0781df2518 metal : add diag (llama/19330) 2026-02-08 09:29:10 +02:00
Oleksandr Kuvshynov 932def3198 vulkan: fix GPU deduplication logic. (llama/19222)
* vulkan: fix GPU deduplication logic.

As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for Windows + Intel iGPU.

Let's just avoid filtering for MoltenVK, which is Apple-specific, and
keep the logic the same as before 88d23ad5: just dedup based on UUID.

Verified that macOS + 4xVega still reports 4 GPUs with this version.

* vulkan: only skip dedup when both drivers are MoltenVK
2026-02-08 09:29:10 +02:00
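The deduplication rule that 932def3198 ends up with (dedup by UUID, except when both devices are driven by MoltenVK) can be sketched as below. The `Device` record, the driver strings, and the "keep the first device seen" policy are assumptions for illustration; the real code may prefer one driver over another when it drops a duplicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    uuid: str
    driver: str


def dedup_devices(devices):
    """Drop devices whose UUID was already seen, unless both are MoltenVK."""
    kept = []
    for dev in devices:
        duplicate = any(
            dev.uuid == other.uuid
            # MoltenVK may report the same UUID for distinct GPUs, so two
            # MoltenVK devices are never treated as duplicates of each other.
            and not (dev.driver == "MoltenVK" and other.driver == "MoltenVK")
            for other in kept
        )
        if not duplicate:
            kept.append(dev)  # assumption: the first device seen wins
    return kept


devices = [
    Device("Vega A", "uuid-1", "MoltenVK"),
    Device("Vega B", "uuid-1", "MoltenVK"),  # same UUID, both MoltenVK: kept
    Device("Arc", "uuid-2", "driver-x"),
    Device("Arc", "uuid-2", "driver-y"),     # same UUID, other driver: dropped
]
kept = dedup_devices(devices)
```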
Jeff Bolz 5a786f7648 vulkan: Set k_load_shmem to false when K is too large (llama/19301) 2026-02-08 09:29:10 +02:00
Jeff Bolz e0a3f393ad vulkan: fix non-contig rope (llama/19299) 2026-02-08 09:29:10 +02:00
will-lms eecc9bfa69 metal : add missing includes (llama/19348) 2026-02-08 09:29:10 +02:00
Kevin Pouget 2763054f99 ggml-virtgpu: make the code thread safe (llama/19204)
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function

* ggml-virtgpu: deprecate buffer_type is_host remoting

not necessary

* ggml-virtgpu: stop using static vars as cache

The static init isn't thread safe.

* ggml-virtgpu: protect the use of the shared memory to transfer data

* ggml-virtgpu: make the remote calls thread-safe

* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory

* ggml-virtgpu: add a cleanup function for consistency

* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing

* fix style and ordering

* Remove the static variable in apir_device_get_count

* ggml-virtgpu: improve the logging

* fix review minor formatting changes
2026-02-08 09:29:10 +02:00
Aman Gupta 4685ec9555 ggml-cpu: use LUT for converting e8->f32 scales on x86 (llama/19288)
* ggml-cpu: use LUT for converting e8->f32 scales on x86

* add dispatch based on macro
2026-02-08 09:29:10 +02:00
Georgi Gerganov 5dda94dd2e metal : add solve_tri (llama/19302) 2026-02-08 09:29:10 +02:00
Ruben Ortlam aa34558b6f vulkan: disable coopmat1 fa on Nvidia Turing (llama/19290) 2026-02-08 09:29:10 +02:00
Aman Gupta 8eede801e3 CUDA: use mmvq for mul-mat-id for small batch sizes (llama/18958)
* CUDA: use mmvq for mul-mat-id for small batch sizes

* add mmvq too

* Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs

* templatize multi_token_path
2026-02-08 09:29:10 +02:00
Georgi Gerganov ce8a2da620 metal : minor cleanup (llama/19251) 2026-02-08 09:29:10 +02:00
Oliver Simons 698265d754 CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (llama/19053)
By providing stride_* variables as size_t (i.e., 64-bit) the compiler can
correctly unroll the [two for-loops](557515be1e/ggml/src/ggml-cuda/mmq.cuh (L3789-L3816))
on BW. This gives some perf for prefill/pp phase on BW, while not affecting
other SMs:

| GPU                                                     | Model                 | Test   |   t/s master |   t/s osimons/fix_bw_mmq_fixup_kernel |   Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation                          | gpt-oss 20B MXFP4 MoE | pp8096 |      8404.05 |                               8375.79 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 3B Q4_K_M       | pp8096 |     16148.93 |                              16019.60 |      0.99 |
| NVIDIA RTX 6000 Ada Generation                          | llama 8B Q4_0         | pp8096 |      8008.29 |                               7978.80 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B BF16    | pp8096 |      4263.16 |                               4248.53 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B Q4_K_M  | pp8096 |      5165.11 |                               5157.43 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 |     12582.80 |                              12758.37 |      1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M       | pp8096 |     16879.10 |                              17619.47 |      1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0         | pp8096 |     10649.90 |                              10982.65 |      1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16    | pp8096 |      7717.73 |                               7716.22 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M  | pp8096 |      7301.90 |                               7370.38 |      1.01 |
2026-02-08 09:29:10 +02:00
George 57107b2bf8 ggml: added cleanups in ggml_quantize_free (llama/19278)
Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.
2026-02-08 09:29:10 +02:00
Gaurav Garg 6ec362d2e0 cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (llama/19227)
Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
2026-02-08 09:29:10 +02:00
lhez 591072fcc8 opencl: refactor some ops, concat, repeat, tanh and scale (llama/19226)
* opencl: refactor concat

* opencl: refactor repeat

* opencl: refactor tanh

* opencl: enable fp16 for tanh

* opencl: refactor scale

* opencl: fix unused variables
2026-02-08 09:29:10 +02:00
Aman Gupta 871063016d ggml-cpu: FA split across kv for faster TG (llama/19209)
* ggml-cpu: split across kv for faster TG

* simplify sinks application

* add ref impl
2026-02-08 09:29:10 +02:00
Neo Zhang c4003da2b8 Remove support for Nvidia & AMD GPUs because the oneAPI plugin for them is unavailable: its download/installation channels no longer work. (llama/19246)
Users can't build the software for Nvidia & AMD GPUs.
Remove oneMath, since it is only used in the Nvidia and AMD code paths.
2026-02-08 09:29:10 +02:00
Tamar 74353e90a1 sycl: implement GGML_OP_TOP_K (llama/19242) 2026-02-08 09:29:10 +02:00
Georgi Gerganov 73e04555eb metal : support virtual devices (llama/18919)
* metal : support virtual devices

* cont : manage buffer type context memory

* metal : add events

* cont : implement cpy_tensor_async
2026-02-08 09:29:10 +02:00
Johannes Gäßler 625c8d863e ggml-backend: fix async set/get fallback sync (llama/19179) 2026-02-08 09:29:10 +02:00
Christian Kastner 0e219ebf89 docs : Minor cleanups (llama/19252)
* Update old URLs to github.com/ggml-org/

* Bump copyrights
2026-02-08 09:29:10 +02:00
Nikhil Jain a0256b8159 Remove pipeline cache mutexes (llama/19195)
* Remove mutex for pipeline caches, since they are now per-thread.

* Add comment

* Run clang-format

* Cleanup

* Run CI again

* Run CI once more

* Run clang-format
2026-02-08 09:29:10 +02:00
Max Krasnyansky aca5953d8d Bump cmake max version (needed for Windows on Snapdragon builds) (llama/19188)
* Bump max cmake version (needed for Windows on Snapdragon builds)

* cmake: move max version setting into ggml/CMakeLists
2026-02-08 09:29:10 +02:00
nullname 9b927dd849 ggml-hexagon: flash-attention and reduce-sum optimizations (llama/19141)
* wip

* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation

* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations

* wip

* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance

* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability

* optimize vector dot product functions to use unified reduction for improved performance

* hexagon: optimize reduce-sum for v75+

* hexagon: always keep row_sums in sf/fp32

* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT

* fix compiling error after rebase

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-02-08 09:29:10 +02:00
shaofeiqi db9c88744d opencl: add optimized q8_0 mm kernel for adreno (llama/18871)
* Add Q8_0 OpenCL kernel

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>

* opencl: fix build for non-adreno

* opencl: refactor q8_0

* opencl: enforce subgroup size of 64 for adreno for q8_0

* For A750 and older generations, subgroup size can be 64 or 128.
  This kernel assumes subgroup size 64.

* opencl: suppress warning when adreno kernels are disabled

---------

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-02-08 09:29:10 +02:00
Simon Redman efd6344939 Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b (llama/19194) 2026-02-08 09:29:10 +02:00
Georgi Gerganov 06e3750407 ggml : bump version to 0.9.6 (ggml/1423) 2026-02-08 09:29:10 +02:00
Georgi Gerganov fc1a3e579e cmake : remove unused file (ggml/1419) 2026-02-08 09:29:10 +02:00
Georgi Gerganov acbace0571 cuda : fix compile warnings (#0) 2026-01-30 15:56:40 +02:00
bssrdf 5dca0db99c add tensor type checking as part of cuda graph properties (llama/19186) 2026-01-30 15:56:40 +02:00
s8322 2a16e7a67f sycl: implement GGML_UNARY_OP_SOFTPLUS (llama/19114)
* sycl: add softplus unary op implementation

* docs(ops): mark SYCL SOFTPLUS as supported

* docs: update SYCL status for SOFTPLUS
2026-01-30 15:56:40 +02:00
RachelMantel 1b3c27efae sycl: implement GGML_OP_TRI (llama/19089)
* sycl: implement GGML_OP_TRI

* docs: update ops.md for SYCL TRI

* docs: regenerate ops.md

* docs: update SYCL support for GGML_OP_TRI
2026-01-30 15:56:40 +02:00
Zheyuan Chen 829e70044b ggml-webgpu: improve FlashAttention performance by software pipelining (llama/19151)
* webgpu : pipeline flash_attn Q/K loads in WGSL

* ggml-webgpu: unroll Q*K accumulation inner loop

* ggml-webgpu: vectorization

* ggml-webgpu: unrolling

* ggml-webgpu: remove redundant unrolling

* ggml-webgpu: restore the config

* ggml-webgpu: remove redundant comments

* ggml-webgpu: formatting

* ggml-webgpu: formatting and remove vectorization

* ggml-webgpu: remove unnecessary constants

* ggml-webgpu: change QKV buffer to read_write to pass validation

* ggml-webgpu: add explanation for the additional bracket around Q K accumulate

* Indentation and for -> if for tail

* Kick off CI on wgsl only commits

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-01-30 15:56:40 +02:00
Todor Boinovski 2a89a3f35c hexagon: enable offloading to Hexagon on Windows on Snapdragon (llama/19150)
* hexagon: updates to enable offloading to HTP on WoS

* Update windows.md

* Update windows.md

* hexagon: enable -O3 optimizations

* hexagon: move all _WINDOWS conditional compilation to _WIN32

* hexagon: updates to enable offloading to HTP on WoS

* hexagon: use run-time vs load-time dynamic linking for cdsp driver interface

* refactor htp-drv

* hexagon: add run-bench.ps1 script

* hexagon: htdrv refactor

* hexagon: unify Android and Windows build readmes

* hexagon: update README.md

* hexagon: refactor htpdrv

* hexagon: drv refactor

* hexagon: more drv refactor

* hexagon: fixes for android builds

* hexagon: factor out dl into ggml-backend-dl

* hexagon: add run-tool.ps1 script

* hexagon: merge htp-utils in htp-drv and remove unused code

* wos: no need for getopt_custom.h

* wos: add missing CR in htpdrv

* hexagon: ndev enforcement applies only to the Android devices

* hexagon: add support for generating and signing .cat file

* hexagon: add .inf file

* hexagon: working auto-signing and improved windows builds

* hexagon: further improve skel build

* hexagon: add rough WoS guide

* hexagon: updated windows guide

* hexagon: improve cmake handling of certs and logging

* hexagon: improve windows setup/build doc

* hexagon: more windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* Update windows.md

* Update windows.md

* snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon

Also added a PowerShell script to simplify build env setup.

* hexagon: remove trailing whitespace and move cmake requirement to user-presets

* hexagon: fix CMakeUserPresets path in workflow yaml

* hexagon: introduce local version of libdl.h

* hexagon: fix src1 reuse logic

gpt-oss needs a bigger lookahead window.
The check for src[1] itself being quantized was wrong.

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-01-30 15:56:40 +02:00
Georgi Gerganov b997e690ef cuda : fix nkvo, offload and cuda graph node properties matching (llama/19165)
* cuda : fix nkvo

* cont : more robust cuda graph node property matching

* cont : restore pre-leafs implementation

* cont : comments + static_assert
2026-01-30 15:56:40 +02:00
yulo 34a3e28a08 HIP: add mmf for CDNA (llama/18896)
* refactor mmf rows_per_block

* speed up compile

* pass cdna compile

* fix cuda error

* clean up mmf

* f32 mmf

* clean float mma

* fix mmf error

* faster mmf

* extend tile k

* fix compile error

* Revert "extend tile k"

This reverts commit 4d2ef3d483932659801a59a5af0b6b48f6ffd5c7.

* fix smem overflow

* speed up compiling mmf

* speed up compile for hip

* 512 block for cdna

* config pad size

* fix as comment

* update select logic

* move some code to cuh

* fix as comment

* correct cdna3 config

---------

Co-authored-by: zhang hui <you@example.com>
2026-01-30 15:56:40 +02:00
Vishal Singh e0a2182970 ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (llama/19159) 2026-01-30 15:56:40 +02:00
Aman Gupta 62ba8b537f CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (llama/19126) 2026-01-30 15:56:40 +02:00
Neo Zhang f0e85bb142 sycl: fix norm kernels (l2_norm, group_norm, rms_norm) by removing asserts to support more cases (llama/19154)
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-30 15:56:40 +02:00
Ruben Ortlam 33148bb523 Vulkan Flash Attention Coopmat1 Refactor (llama/19075)
* vulkan: use coopmat for flash attention p*v matrix multiplication

* fix P loading issue

* fix barrier position

* remove reduction that is no longer needed

* move max thread reduction into loop

* remove osh padding

* add bounds checks and padding

* remove unused code

* fix shmem sizes, loop duration and accesses

* don't overwrite Qf, add new shared psh buffer instead

* add missing bounds checks

* use subgroup reductions

* optimize

* move bounds check, reduce barriers

* support other Bc values and other subgroup sizes

* remove D_split

* replace Of register array with shared memory Ofsh array

* parallelize HSV across the rowgroups

* go back to Of in registers, not shmem

* vectorize sfsh

* don't store entire K tile in shmem

* fixes

* load large k tiles to shmem on Nvidia

* adapt shared memory host check function to shader changes

* remove Bc 32 case

* remove unused variable

* fix missing mask reduction tmspsh barrier

* fix mask bounds check

* fix rowmax f16 under/overflow to inf

* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
2026-01-30 15:56:40 +02:00
Patryk Kaminski cc0c103b5d ggml-sycl: remove unused syclcompat header (llama/19140)
The syclcompat/math.hpp header is not used anymore. The change that introduced it was successfully reverted (https://github.com/ggml-org/llama.cpp/pull/17826).
This include path will become obsolete and be dropped in oneAPI 2026.0, effectively breaking ggml-sycl builds.
2026-01-30 15:56:40 +02:00
Oleksandr Kuvshynov dda7d9cd1c vulkan: handle device dedup on MacOS + Vega II Duo cards (llama/19058)
Deduplication here relied on the fact that Vulkan would return a unique
UUID for each physical GPU. At the moment, that is not always the case.
On a Mac Pro 2019 running macOS with 2 Vega II Duo cards (so, 4 GPUs total),
MoltenVK would assign the same UUID to pairs of GPUs, unless they
are connected with Infinity Fabric.

See more details here: KhronosGroup/MoltenVK#2683.

The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of 4 GPUs in such a configuration.

The deduplication logic here is changed to only filter GPUs if the UUID is
the same but the driver is different.
2026-01-30 15:56:40 +02:00
Kevin Pouget 531d7b6781 ggml: new backend for Virglrenderer API Remoting acceleration (v2) (llama/18718) 2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez 3701413a71 ggml-cpu: arm64: Q4_K scale unroll and vectorization (llama/19108) 2026-01-30 15:56:40 +02:00
Georgi Gerganov 7fb0f823de cuda : fix "V is K view" check for non-unified KV cache (llama/19145) 2026-01-30 15:56:40 +02:00
Georgi Gerganov f28a733025 CUDA: tune GLM 4.7 Flash FA kernel selection logic (DGX Spark) (llama/19142) 2026-01-30 15:56:40 +02:00
Nikhil Jain dfdd2fee83 ggml webgpu: Split shared state (webgpu_context) into global state and per-thread state (llama/18976)
* Squashed commit of the following:

commit b3c6bf4b0450d8d452b934df27a0fb7cb53cd755
Author: Abhijit Ramesh <abhijitramesh2k@gmail.com>
Date:   Mon Dec 1 18:29:00 2025 -0800

    ggml webgpu: fix xielu parameter passing (llama/11)

    The XIELU operation was incorrectly using static_cast to convert
    float parameters to uint32_t, which converted numeric values instead
    of preserving IEEE 754 bit patterns. This caused incorrect values
    to be interpreted by the GPU shader.

    * Use reinterpret_cast to preserve float bit patterns when passing
      through uint32_t params buffer
    * Update WGSL shader parameter types from u32 to f32
    * Re-enable XIELU support (was disabled due to numerical issues)

    Fixes NMSE test failures for XIELU operation on WebGPU backend.
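The difference between converting a float's value and preserving its bit pattern, which is the bug described in this XIELU fix, can be shown with a small Python analogue. The helper names are hypothetical, and Python's `struct` module stands in for the C++ casts; this is not the WebGPU backend's code.

```python
import struct


def convert_value(x):
    """Numeric conversion, analogous to static_cast<uint32_t>(x) in C++."""
    return int(x)


def reinterpret_bits(x):
    """Keep the IEEE 754 binary32 bit pattern as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(u):
    """Inverse of reinterpret_bits: what a reader of the f32 value recovers."""
    return struct.unpack("<f", struct.pack("<I", u))[0]


alpha = 0.5
as_value = convert_value(alpha)    # the fractional part is lost
as_bits = reinterpret_bits(alpha)  # the bit pattern of 0.5f is preserved
round_trip = bits_to_float(as_bits)
```

With the numeric conversion, a parameter such as 0.5 reaches the consumer as the integer 0, whereas the reinterpreted bits decode back to 0.5.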

commit 5ca9b5e49ea7cddc9ab7c8b43a11a9c76a4dff4a
Author: neha-ha <137219201+neha-ha@users.noreply.github.com>
Date:   Tue Nov 18 12:17:00 2025 -0800

    Refactored pipelines and workgroup calculations (llama/10)

    * refactored pipelines

    * refactored workgroup calculation

    * removed commented out block of prior maps

    * Clean up ceiling division pattern

    ---------

    Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
    Co-authored-by: Reese Levine <reeselevine1@gmail.com>

Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:13:06 2025 -0700

    formatted embed wgsl and ggml-webgpu.cpp

commit e1f6baea31645e5d96ad53664acae856f74b96f4
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 29 23:08:37 2025 -0700

    implemented REPL_Template support and removed bug in unary operators kernel

commit 8c70b8fece445cdc9a8c660dbddbf201e52da2bb
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 15 16:14:20 2025 -0700

    responded and dealt with PR comments

commit f9282c660c10dec4487d434549bdb707a9cd9f37
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:41:41 2025 -0700

    removed unnecessary checking if node->src[1] exists for unary operators

commit 4cf28d7dec41c29186d66152735b244c5699f9dc
Author: James Contini <jamescontini@gmail.com>
Date:   Sun Oct 12 13:32:45 2025 -0700

    All operators (including xielu) working

commit 74c6add1761a59d2c2ff60b60e8ad3c8300f6d3e
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:16:48 2025 -0700

    fixed autoconfig

commit 362749910be4f0120c8ffb21ceddeb7d2c088e51
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 13:10:46 2025 -0700

    removed vestigial files

commit cb0858333785757804c5104e59c4981843207c16
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:59:32 2025 -0700

    abides by editor-config

commit 5360e2852a4b51197d7d67d0a5d42e908b02d7ed
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 12:45:57 2025 -0700

    rms_norm double declaration bug atoned

commit 7b09baa4aa53711be5a126043670cc182c78bfcd
Merge: 8a6ec843 74b8fc17
Author: James Contini <jamescontini@gmail.com>
Date:   Fri Oct 10 11:50:03 2025 -0700

    resolving merge conflicts

commit 8a6ec843a50ab82f8cef59b4558eb63f318ba02d
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 8 18:06:47 2025 -0700

    unary operators pass ggml tests

commit c3ae38278a2db236adc5912c9140e4f0d63f2c19
Author: James Contini <jamescontini@gmail.com>
Date:   Wed Oct 1 16:22:40 2025 -0700

    neg passes backend test

commit aa1c9b2f8877a405470ca56709c42a1fd43713de
Author: James Contini <jamescontini@gmail.com>
Date:   Tue Sep 30 23:55:27 2025 -0700

    neg f16xf32xip builds and runs, haven't actually run a model that uses neg kernel yet though

Co-authored-by: James Contini <jamescontini@gmail.com>
Co-authored-by: Neha Abbas <neabbas@ucsc.edu>
Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>

* Remove extra code and format

* Add ops documentation (finally)

* ggml webgpu: add SOFTPLUS unary operator

Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32
precision for intermediate calculations to prevent f16 overflow.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
* Follow Vulkan backend numerical stability pattern

* ggml webgpu: add EXPM1 unary operator

Implements EXPM1 (exp(x) - 1) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add FLOOR unary operator

Implements FLOOR (rounds down to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add CEIL unary operator

Implements CEIL (rounds up to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add ROUND unary operator

Implements ROUND (rounds to nearest integer) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support

* ggml webgpu: add TRUNC unary operator

Implements TRUNC (truncates towards zero) with f16/f32 support.

* Add shader implementation and 4 variants (f32/f16, inplace/non-inplace)
* Register pipelines and device support
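
The four rounding operators above differ mainly on negative and half-way inputs. The sketch below uses the C++ standard library as a stand-in and is not the shader code; `Rounded` and `round_all` are hypothetical names. Note that `std::round` breaks ties away from zero, and the commit messages do not state how the WebGPU shaders break ties, so that part is an assumption.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the four results (illustration only).
struct Rounded {
    float floor_v;  // largest integer <= x
    float ceil_v;   // smallest integer >= x
    float round_v;  // nearest integer, ties away from zero (std::round)
    float trunc_v;  // fractional part dropped, i.e. rounding towards zero
};

inline Rounded round_all(float x) {
    return { std::floor(x), std::ceil(x), std::round(x), std::trunc(x) };
}
```

For `x = -2.5f` the four results are -3, -2, -3 and -2, which shows that FLOOR and TRUNC only agree for non-negative inputs.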

* docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS)

* Updates to webgpu get_memory

* Move shared state (webgpu_context) and device creation out of registration context, device context, and buffer context, and move into backend context

* Small cleanup

* Move Instance, Device, Adapter, Device creation, and capabilities to global state while moving Queue, pipelines, and buffers to per-thread state.

* Cleanups

* More cleanup

* Move staging_buf mutex to global context

* Resolve merge

* Resolve merge

* Resolve merge

* Clean up merge errors, delete forward declaration, and run clang-format

* Rename device_init to backend_init

* Move webgpu_context to backend_context

* Move buffer context members into global context and refactor function calls

* Run clang-format

* Remove comments

* Move parameter buffers to per-thread, add single memset_tensor param buf

* Fix CI compilation issue

* Fix builds for emscripten not supporting subgroups

* cleanup

* cleanup

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-01-30 15:56:40 +02:00
Vishal Singh 9c75c793a6 ggml-zendnn : update ZenDNN git tag to main branch (llama/19133) 2026-01-30 15:56:40 +02:00
Johannes Gäßler 9d94d0f782 CUDA: tune GLM 4.7 Flash FA kernel selection logic (llama/19097) 2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez 00885e08e2 ggml-cpu: aarm64: q6_K repack gemm and gemv (and generic) implementations (i8mm) #18860 (llama/18888)
* Boilerplate for q6_K repack

* q6_K repack to q6_Kx8 implementation

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q6_K generic gemv and gemm

* wip, gemm_q6_K 8x8

* Still WIP: loading of q8s, q6h and q6l

* first working version of q6_K gemm

* Moved q6 loads outside of sb block, Unrolled inner loop

* Replaced modulo with mask

* First implementation of GEMV

* ggml_vdotq_s32 -> vdotq_s32

* Reduce width of accumulators in q6_K gemv

* Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.

* Reuse scales in GEMM (same GEMV opt)

* Added todos for bsum and different qh repack

* Arch fallback

* VSLIQ for merging qh and ql

* Removed TODO, already tested

* Apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Removed unused import

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-30 15:56:40 +02:00
Gaurav Garg 5fcbbdc0dd Reduce CPU-side stalls due to the CUDA command buffer being full (llama/19042)
* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full

With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer fills up and stalls the CPU. As a result, not enough work gets submitted to the GPU, causing bubbles in the GPU timeline.
Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.

* Set the env variable in the CUDA backend registry allocation

* Add link to PR in code comment

* Remove warning logs and update documentation
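
A minimal sketch of the idea in this commit, assuming a POSIX environment: the variable has to be set before the CUDA runtime initializes. The helper name `ensure_launch_queue_scale` is hypothetical, and leaving a user-provided value untouched is an assumption rather than something the message states.

```cpp
#include <cassert>
#include <stdlib.h>  // setenv / getenv (POSIX)
#include <string>

// Hypothetical helper: call once before any CUDA API call.
// Returns the value that is in effect afterwards.
inline std::string ensure_launch_queue_scale() {
    // overwrite = 0 keeps a value the user already exported (assumption).
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", /*overwrite=*/0);
    const char * value = getenv("CUDA_SCALE_LAUNCH_QUEUES");
    return value != nullptr ? value : "";
}
```

Running a program with `CUDA_SCALE_LAUNCH_QUEUES=2x` exported would then keep `2x`, while an unset environment ends up with `4x`.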
2026-01-30 15:56:40 +02:00
shalinib-ibm b2e2032856 ggml-cpu: Enable FP16 MMA kernels on PPC (llama/19060) 2026-01-30 15:56:40 +02:00
lhez 56f82a9f33 opencl: add flattened q6_K mv (llama/19054) 2026-01-30 15:56:40 +02:00
Johannes Gäßler 41d5d7bb0e CUDA: fix padding of GQA to power of 2 in FA (llama/19115) 2026-01-30 15:56:40 +02:00
Johannes Gäßler f63848eada CUDA: faster FA for GQA > 1 but not power of 2 (llama/19092) 2026-01-30 15:56:40 +02:00
ccbinn 4372b87b8e metal : fix recommendedMaxWorkingSetSize availability on legacy iOS/macOS (llama/19088)
Co-authored-by: chenbin11 <chenbin11@kuaishou.com>
2026-01-30 15:56:40 +02:00
Aman Gupta 1642a4fb60 ggml-cpu: Use tiled FA for prompt-processing (llama/19012)
* ggml-cpu: Use tiled FA for prompt-processing

The FA performance is poor on CPU with long contexts because it essentially uses a vector kernel. This PR adds a tiled FA for PP. Perf tuning for tile sizes was done on an AMD EPYC single-socket 64-core machine.

* fix out of bounds for mask

* skip rows that are fully masked

* skip tile if mask is inf

* store mask in worksize

* check inf tile earlier
2026-01-30 15:56:40 +02:00
Georgi Gerganov d2b51404e4 kv-cache : support V-less cache (llama/19067)
* kv-cache : support V-less cache

* cuda : better check for V_is_K_view

* cuda : improve V_is_K_view check

* graph : add comments

* hparams : refactor
2026-01-30 15:56:40 +02:00
Johannes Gäßler f53eafd745 CUDA: re-use MLA K data for V in MMA FA (llama/19057) 2026-01-30 15:56:40 +02:00
Aman Gupta 13577a6ce4 ggml-cuda: enable cuda-graphs for `n-cpu-moe` (llama/18934)
* ggml-cuda: add split-wise cuda graph

* add n-cpu-moe compare_llama_bench.py

* fix hip/musa builds
2026-01-30 15:56:40 +02:00
nullname 79f1bb3d35 ggml-hexagon: flash-attn opt (llama/19025)
* optimize flash attention kernel by improving score computation and online softmax update

* wip

* Refactor online softmax update in flash attention kernel for improved performance

* Optimize flash attention kernel by replacing float array with HVX_Vector for score computation

* wip
2026-01-30 15:56:40 +02:00
Neo Zhang 0d9dda5a99 use malloc to support both iGPU and dGPU at the same time (llama/18992)
* use malloc to support both iGPU and dGPU at the same time

* support windows

---------

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez e090d91f5e ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) (llama/18860)
* Boilerplate for q5_Kx8 REPACK on ARM and fallback

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q5_K repack gemm and gemv generics

* Gemm and Gemv ARM implementations (i8mm)

* Improved qh manipulation looking at non-repack vec_dot implementation

* Full unroll

* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix wrong fallback definitions of Q5_K

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed comments. Reverted unnecessary formatting

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed typo in generic definitions

* Switching AND + Shift with Shift Insert. Better op interleaving.

* Vectorize + unroll the block scales

* Apply gemm optimizations to gemv

* Improve bias calculation

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
2026-01-30 15:56:40 +02:00
Georgi Gerganov 3f96a1da0e mla : make the V tensor a view of K (llama/18986)
* mla : pass V as a view of K to the FA op

* cuda : adjust mla logic to new layout

* kv-cache : fix rope shift

* tests : remove comment

* cuda : fix reusable_cutoff

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-30 15:56:40 +02:00
Johannes Gäßler f21d0cbb1a CUDA: fix alignment check for FA (llama/19023) 2026-01-30 15:56:40 +02:00
lhez 0e030b852a opencl: enable the general fp mm for non-cont input and as a fallback for specialized kqv kernel for adreno (llama/18970)
* opencl: add `copy_to_contiguous` and utilize mm kernels

* opencl: only copy to cont for f32 and f16 tensors

* opencl: use cont mm for fallback when dst is large

* opencl: use nb local to copy-to-cont

* opencl: use local offset as well
2026-01-30 15:56:40 +02:00
Aman Gupta d4fafcfc6f CUDA: add gqa_ratio 4 for GLM 4.7 flash (llama/18953) 2026-01-30 15:56:40 +02:00
shaofeiqi 167fec69d5 opencl: add TRI op support (llama/18979) 2026-01-30 15:56:40 +02:00
Aleksei Nikiforov 55927d42ef ggml-zdnn : mark zDNN buffers as non-host (llama/18967)
While buffers reside in host memory,
additional transformation is needed to use buffers with zDNN.

Fixes #18848
2026-01-30 15:56:40 +02:00
Jeff Bolz b7e323f40b vulkan: Remove transfer_ctx, do everything in compute_ctx. (llama/18945)
* vulkan: Remove transfer_ctx, do everything in compute_ctx.

We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.

Remove transfer_cmd_pool, which was already unused.

* fix crash with perf logger
2026-01-30 15:56:40 +02:00
Jeff Bolz b2bc4d810b vulkan: support flash attention GQA/split_k with small batches (llama/18938) 2026-01-30 15:56:40 +02:00
Masato Nakasaka 3bbf4ced47 Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (llama/18831)
This reverts commit 980b7cd17e055c8c587f79ffda7eb4fddf405566.
2026-01-30 15:56:40 +02:00
Jeff Bolz 660d943ff8 vulkan: Use mul_mat_vec_id for small values of n (llama/18918)
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
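
The cost argument above can be sketched in plain C++: running a mat-vec product once per batch column makes the work grow linearly in n, at the price of re-reading the weights for every column. This is an illustration only, not the Vulkan shader; `mat_vec` and `mat_vec_batched` are hypothetical names, and the expert-ID indirection of `mul_mat_vec_id` is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = W * x for a row-major (rows x cols) matrix.
inline void mat_vec(const float * W, const float * x, float * y,
                    std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            acc += W[r * cols + c] * x[c];
        }
        y[r] = acc;
    }
}

// Hypothetical batched wrapper: loops over the batch dimension and issues
// one mat-vec per input column, so W is read n times (no weight reuse).
inline std::vector<float> mat_vec_batched(const std::vector<float> & W,
                                          const std::vector<float> & X,
                                          std::size_t rows, std::size_t cols,
                                          std::size_t n) {
    assert(W.size() == rows * cols && X.size() == cols * n);
    std::vector<float> Y(rows * n);
    for (std::size_t b = 0; b < n; ++b) {
        mat_vec(W.data(), X.data() + b * cols, Y.data() + b * rows, rows, cols);
    }
    return Y;
}
```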
2026-01-30 15:56:40 +02:00
Oliver Simons 924a9e292c CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator (llama/18964)
* CUDA: Fix builds for older CCCL versions by ifdefing strided_iterator

Strided iterator was added in [CCCL
3.1](https://github.com/NVIDIA/cccl/releases/tag/v3.1.0), which is packaged into
[CTK
13.1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5)

* Unindent as per code review request
2026-01-30 15:56:40 +02:00
Oliver Simons fdc83ee3c0 CUDA: Replace init_offsets kernel with iterators in cub-based argsort (llama/18930)
* CUDA: Replace `init_offsets` with iterators in argsort

This is a QOL improvement, saving us the cost of materializing the
iterator

* Remove unnecessary include from top-k.cu
2026-01-30 15:56:40 +02:00
Adrien Gallouët bf71ffa6b3 ggml : cleanup path_str() (llama/18928)
- Remove pragmas as `std::codecvt_utf8` is not used.
- Avoid implicit `strlen()`.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-30 15:56:40 +02:00
Georgi Gerganov b0517d6912 metal : enable FA for MLA heads (llama/18950) 2026-01-30 15:56:40 +02:00
Georgi Gerganov 47f3e3b927 ggml : add ggml_build_forward_select (llama/18550)
* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment
2026-01-30 15:56:40 +02:00
lhez 62a09b106d opencl: fix q6_K mv for m=1 (llama/18893) 2026-01-30 15:56:40 +02:00
Reese Levine 389dafc7c2 ggml webgpu: support for backend sampling (llama/18880) 2026-01-30 15:56:40 +02:00
Thore Koritzius 511ca7a1f4 ggml : extend ggml_pool_1d + metal (llama/16429)
* chore: resolve conflicts

* feat: ggml metal impl

* fix: ggml_metal_kargs_pool_1d struct

* fix: require contiguous input

* chore: test pool_1d

* chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts

* chore: add p0 and s0 to testing

* fix: allow padding for cpu and metal

* Update ggml/src/ggml-metal/ggml-metal.metal

* fix: correct single-threaded loop

* ggml : cleanup

* tests : add ne[1] != 1 tests

* fix: ne[1] handling in np

* cont : fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-30 15:56:40 +02:00
Perry Naseck ecb4b80c35 ggml-blas: hide warnings from included BLAS headers (llama/18818)
* fix compile def openblas, blis for compat libs, nvpl compile def, warn if no blas vendor set

* ggml-blas: hide warnings from included BLAS headers
2026-01-30 15:56:40 +02:00
Raul Torres 42960b6073 CANN: Remove unused `ggml_cann_get_device` function (llama/18625) 2026-01-30 15:56:40 +02:00
Chenguang Li 2fceb5a80f CANN: fix an issue where get_env was not fully renamed (llama/18796)
* CANN: fix an issue where get_env was not fully renamed

* ci: add cann with acl group

* ci: define use_acl_graph using GitHub Action

* ci: update cann dockerfile with acl graph
2026-01-30 15:56:40 +02:00
hipudding 854274a297 CANN: support gated linear attn (llama/18653)
* CANN: support gated linear attn

This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased to
merge.

Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: optimize OP gla

Optimize gla for high performance

* Remove unused comments

---------

Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain>
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
2026-01-30 15:56:40 +02:00
shaofeiqi ed6004d051 OpenCL: add SOLVE_TRI op support (llama/18846) 2026-01-30 15:56:40 +02:00
Georgi Gerganov 290ff3d28d cuda : print less debug logs when disabling cuda graphs (llama/18868) 2026-01-30 15:56:40 +02:00
Johannes Gäßler f2f0ba0384 CUDA: fix alignment on register spill for FA (llama/18815) 2026-01-30 15:56:40 +02:00
shalinib-ibm 78a23d4830 ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (llama/18837) 2026-01-30 15:56:40 +02:00
Max Krasnyansky 50b7ab3d46 hexagon: support for OP_CPY, host buffers now optional (llama/18822) 2026-01-30 15:56:40 +02:00
Oliver Simons bc09047405 CUDA: Factor out and re-use `block_reduce` function (llama/18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of `__device__` function to `__global__` function
allows for explicit smem reuse, as either compiler or cuda rt seem to not
free it afterwards (`cudaFuncSetAttribute` fails when not accounting for
it once for each call to two_stage_warp_reduce)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
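
The type-dependent `static_assert` described above, together with the `enum class` from the earlier bullet, can be sketched as follows (C++17). Only the name `block_reduce_method` comes from the commit message; the enumerators, `dependent_false` and `reduce_pair` are hypothetical and stand in for the real CUDA code.

```cpp
#include <cassert>
#include <type_traits>

// Scoped enum: enumerators do not leak into the enclosing scope and do not
// convert implicitly to int. The enumerator names are assumptions.
enum class block_reduce_method { SUM, MAX };

// Always false, but the compiler only learns that once T is substituted,
// so the assertion below is not evaluated while the template is parsed.
template <typename T>
struct dependent_false : std::false_type {};

template <block_reduce_method method, typename T>
T reduce_pair(T a, T b) {
    if constexpr (method == block_reduce_method::SUM) {
        return a + b;
    } else if constexpr (method == block_reduce_method::MAX) {
        return a > b ? a : b;
    } else {
        // A plain static_assert(false, ...) here may be rejected by some
        // compilers even if this branch is never instantiated.
        static_assert(dependent_false<T>::value, "unsupported reduction method");
    }
}
```

A call such as `reduce_pair<block_reduce_method::SUM>(2, 3)` instantiates only the first branch, so the assertion never fires for supported methods.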

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-30 15:56:40 +02:00
Jeff Bolz 4b155e9bfb vulkan: Check maxStorageBufferRange in supports_op (llama/18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-30 15:56:40 +02:00
Daniel Bevenius 25aeb66a4a CUDA : fix typo in clang pragma comment [no ci] (llama/18830) 2026-01-30 15:56:40 +02:00
Ruben Ortlam 49762e8fb3 vulkan: work around Intel fp16 bug in mmq (llama/18814) 2026-01-30 15:56:40 +02:00
Perry Naseck 17656e56dc ggml-metal: do not copy headers for embedded, use current binary dir for embedded (llama/18705) 2026-01-30 15:56:40 +02:00
yulo c6a495ae5d HIP: add fattn-mma-f16 for RDNA4 (llama/18481)
* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update per the review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-30 15:56:40 +02:00
Georgi Gerganov 6ee0eaf531 CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (llama/18800) 2026-01-14 09:11:59 +02:00
Jeff Bolz ab1828dc1c vulkan: change memory_logger to be controlled by an env var (llama/18769) 2026-01-14 09:11:59 +02:00
Jeff Bolz aedf332ec5 vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
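
The arithmetic behind the 32-bit limit can be checked directly. In this sketch the shape comes from the message above and the Q8_0 block size of 32 elements is ggml's block size for that type; the function names are hypothetical. The element count of the A matrix overflows 32 bits, while the block count does not, which is why dividing by QUANT_K early keeps the block index calculation in 32 bits.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t QUANT_K_Q8_0 = 32;  // elements per Q8_0 block

constexpr std::uint64_t element_count(std::uint64_t ne0, std::uint64_t ne1,
                                      std::uint64_t ne2) {
    return ne0 * ne1 * ne2;
}

constexpr bool fits_in_u32(std::uint64_t v) {
    return v <= UINT32_MAX;
}

// 8192 x 5120 x 128 = 5,368,709,120 elements, more than 2^32 = 4,294,967,296.
static_assert(!fits_in_u32(element_count(8192, 5120, 128)),
              "element index needs more than 32 bits");

// Divided by the block size: 167,772,160 blocks, which fits in 32 bits.
static_assert(fits_in_u32(element_count(8192, 5120, 128) / QUANT_K_Q8_0),
              "block index fits in 32 bits");
```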
2026-01-14 09:11:59 +02:00
Ruben Ortlam 716d68aca9 vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (llama/18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-14 09:11:59 +02:00
Ruben Ortlam c0433783c3 Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (llama/18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-14 09:11:59 +02:00
shaofeiqi d4ce2e554f opencl: add SOFTPLUS op support (llama/18726) 2026-01-14 09:11:59 +02:00
Johannes Gäßler 3a1ea96373 HIP: adjust RDNA3.5 MMQ kernel selection logic (llama/18666) 2026-01-14 09:11:59 +02:00
Perry Naseck 484b17053a cmake : update blas logic (llama/18205) 2026-01-14 09:11:59 +02:00
Michael Wand 45be2cd27a Corrected: changed s13 = src1->nb[3] instead of nb[2] (llama/18724) 2026-01-14 09:11:59 +02:00
shaofeiqi 4af27bf2da opencl: add EXPM1 op (llama/18704) 2026-01-14 09:11:59 +02:00
Reese Levine 4ac8c3b478 Updates to webgpu get_memory (llama/18707) 2026-01-14 09:11:59 +02:00
Aaron Teo fff3ebd93d llama: use host memory if device reports 0 memory (llama/18587) 2026-01-14 09:11:59 +02:00
Masashi Yoshimura a71127dfd8 ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (llama/18628)
* Fix GGML_MEM_ALIGN to 8 for emscripten.

* Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten
2026-01-14 09:11:59 +02:00
Reese Levine 1bb903f599 ggml webgpu: initial flashattention implementation (llama/18610)
* FlashAttention (llama/13)

* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* neg f16xf32xip builds and runs, haven't actually run a model that uses neg kernel yet though

* neg passes backend test

* unary operators pass ggml tests

* rms_norm double declaration bug atoned

* abides by editor-config

* removed vestigial files

* fixed autoconfig

* All operators (including xielu) working

* removed unnecessary check for whether node->src[1] exists for unary operators

* responded and dealt with PR comments

* implemented REPL_Template support and removed bug in unary operators kernel

* formatted embed wgsl and ggml-webgpu.cpp

* Faster tensors (llama/8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (llama/9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (llama/4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Refactored pipelines and workgroup calculations (llama/10)

* refactored pipelines

* refactored workgroup calculation

* removed commented out block of prior maps

* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention

* Shader structure set up (many bugs still)

* debugging

* Working first test

* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32

* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling

* Start work on integrating pre-wgsl

* Separate structs/initial shader compilation library into separate files

* Work on compilation choices for flashattention

* Work on subgroup matrix/tile size portability

* subgroup size agnostic online softmax

* Cleanups, quantization types

* more cleanup

* fix wasm build

* Refactor flashattention to increase parallelism, use direct loads for KV in some cases

* Checkpoint

* formatting

* Update to account for default kv cache padding

* formatting shader

* Add workflow for ggml-ci webgpu

* Try passing absolute path to dawn in ggml-ci

* Avoid error on device destruction, add todos for proper cleanup

* Fix unused warning

* Forgot one parameter unused

* Move some flashattn computation to f32 for correctness
2026-01-14 09:11:59 +02:00
Jeff Bolz 0bc0e5616e vulkan: fix push constant size for quantize_q8_1 (llama/18687)
I added an assert to catch further mismatches, and it found several.
Fix those, too.
2026-01-14 09:11:59 +02:00
Jeff Bolz 678c660e62 vulkan: optimize ssm_scan (llama/18630)
* vulkan: optimize ssm_scan

* fix warp vs subgroup naming
2026-01-14 09:11:59 +02:00
도로로도로또 f2d8588229 metal : add MoE kernel specialization for ne20=5 (llama/18667)
Add template specialization for kernel_mul_mm_id_map0 with ne20=5
to support models using 5 active experts (e.g., VAETKI).
2026-01-14 09:11:59 +02:00
Doctor Shotgun b9965c89a1 ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (llama/18535)
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace
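
The "read once and store to dev ctx" approach for `GGML_OP_OFFLOAD_MIN_BATCH` might look roughly like the sketch below. Only the environment variable name and the default of 32 come from the commit message; `parse_min_batch`, `device_context`, `make_device_context`, `should_offload` and the strict parsing rules are assumptions for illustration.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>

// Hypothetical parser: falls back to the previous hardcoded value of 32
// when the variable is unset, empty, or not a non-negative integer.
inline int parse_min_batch(const char * value, int fallback = 32) {
    if (value == nullptr || *value == '\0') {
        return fallback;
    }
    char * end = nullptr;
    const long parsed = std::strtol(value, &end, 10);
    if (*end != '\0' || parsed < 0 || parsed > INT_MAX) {
        return fallback;
    }
    return static_cast<int>(parsed);
}

// Hypothetical stand-in for a backend device context.
struct device_context {
    int op_offload_min_batch_size;
};

// The environment is queried once, at context creation, not per op.
inline device_context make_device_context() {
    return { parse_min_batch(std::getenv("GGML_OP_OFFLOAD_MIN_BATCH")) };
}

inline bool should_offload(const device_context & ctx, int batch_size) {
    return batch_size >= ctx.op_offload_min_batch_size;
}
```

Caching the value in the context keeps `getenv` out of the per-op path, which is the point of the second bullet in this commit.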

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-14 09:11:59 +02:00
shaofeiqi 85a329cb08 opencl: add FILL op support (llama/18682) 2026-01-14 09:11:59 +02:00
Oliver Walsh 4f2ca7c163 cuda : fix build on cuda 12.8 (llama/18672)
compute121 requires 12.9

Signed-off-by: Oliver Walsh <owalsh@redhat.com>
2026-01-14 09:11:59 +02:00
Jeff Bolz a91ab72bd9 vulkan: reject ops when a tensor is too large to allocate (llama/18646) 2026-01-14 09:11:59 +02:00
virajwad 096e7e911a vulkan: Warptile tuning for Intel Xe2/Xe3 (llama/18178)
* modify warptile tuning for xe3

* intel vendor check w/ coopmat support

* fix back formatting

* fix formatting change 2

* move intel check to chip specific tuning part

* Change to support both windows and linux

* modify m_warptile to l_warptile for intel

* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)

* Code style changes

* Code style changes (2)

* Code style changes (3)
2026-01-14 09:11:59 +02:00
Eve a576ed944a vulkan: more mul mat optimizations (llama/18533)
* q4_k

* q5_k

* q2_k

* q4_1

* q5_1

* better buf index
2026-01-14 09:11:59 +02:00
hipudding 5c583f3c02 CANN: Fix rename for get_env (llama/18652)
In #18624, get_env in ggml-cann was renamed to get_env_as_lowercase
to accurately reflect the function’s behavior and reduce the chance
of misuse. However, the update missed renaming call sites in other
files. This commit fixes that oversight.
2026-01-14 09:11:59 +02:00
Raul Torres 47671c81db CANN: Rename `get_env` to `get_env_as_lowercase` (llama/18624) 2026-01-14 09:11:59 +02:00
Max Krasnyansky a5f51ac75b Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (llama/18611)
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention

* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx

* hexagon: add support for SCALE fp32

* hexagon: replace scalar fp32 -> fp16 copy with HVX

* hexagon: optimize flash_atten_ext with aligned VTCM buffers and DMA

- Implements double-buffered DMA prefetching for K, V, and Mask tensors.
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.
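
Two of the points above, row padding to 128 bytes and double buffering, can be sketched in portable C++. This is a host-side illustration under stated assumptions, not the HVX/DMA code: `pad_to` and `process_double_buffered` are hypothetical names, and the "prefetch" is a plain copy standing in for an asynchronous DMA transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next multiple of align (e.g. 128 bytes for HVX vectors).
constexpr std::size_t pad_to(std::size_t n, std::size_t align) {
    return (n + align - 1) / align * align;
}

// Ping-pong buffering: while block i is consumed from one buffer, block i + 1
// is fetched into the other one. The per-block work here is a dummy doubling.
inline std::vector<int> process_double_buffered(const std::vector<int> & blocks) {
    std::vector<int> out;
    if (blocks.empty()) {
        return out;
    }
    int buf[2] = { blocks[0], 0 };  // prime the pipeline with the first block
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        const std::size_t cur = i & 1;
        if (i + 1 < blocks.size()) {
            buf[cur ^ 1] = blocks[i + 1];  // prefetch (a DMA in the real kernel)
        }
        out.push_back(buf[cur] * 2);       // compute on the current buffer
    }
    return out;
}
```

As an example of the padding, a row of 80 f16 values occupies 160 bytes and would be padded to 256 bytes.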

* hexagon: use aligned mad_f16

* hexagon: flash_atten more aligned ops

* hexagon: optimize scale_f32 hvx helpers

* hexagon: unroll fa loops

* hexagon: remove unused set-rows log

* hexagon: flash_attn_ext add support for DMAing Q

- Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.

* hexagon: fix handling of NANs hvx dotproducts

* hexagon: cleanup spad allocation in flash-atten

* hexagon: improve fp16/fp32 matmul

- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
- Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.

* hexagon: fix HVX_ARCH check

* hexagon: matmul cleanup and fp16 fixes

Use aligned vec_dot_f16 for 2d matmuls and unaligned version for 4d.

* hexagon: fix fp16 x fp16 matmuls and some minor refactoring

* hexagon: add support for GET_ROWS f32 -> f32

Also optimize SET_ROWS threading a bit when we have just a few rows to process.

* hexagon: optimize set-rows threading

* hexagon: update adb/run-bench.sh to properly support experimental and verbose options

* hexagon: flash_atten use aligned vectors for dot products
2026-01-14 09:11:59 +02:00
Aadeshveer Singh 436f30d05f ggml : optimize cuda ssm_scan using warp-level reduction (llama/18505)
* ggml : optimize cuda ssm_scan using warp-level reduction

* ggml : apply code review suggestions (style, const, constexpr)

* ggml : add TODO regarding stride consistency
2026-01-14 09:11:59 +02:00
Jeff Bolz dbec71f6cf vulkan: support buffer_from_host_ptr (llama/18467)
* vulkan: support buffer_from_host_ptr

* hacky use of buffer_from_host_ptr for directio

* disable buffer_from_host_ptr cap

* use external memory for ggml_vk_host_malloc, revert model loader changes

* disable external_memory_host for MoltenVK

* take buffer memory types into account

* don't use external_memory_host for ggml_vk_host_malloc
2026-01-14 09:11:59 +02:00
Aman Gupta 575d894603 ggml-cuda: refactor cuda graph usage (llama/18637)
* ggml-cuda: refactor cuda graph usage

* use is_enabled() instead of enabled
2026-01-14 09:11:59 +02:00
Beinsezii ed674cfc10 mmq.cu: tune mmq/rocblas switching for RDNA (llama/18537)
* Patch perf regression for mmq kernels in ROCm

recover performance regression for https://github.com/ggml-org/llama.cpp/issues/17917

* add n_experts branch like the cdna path

* mmq.cu: tune mmq/wmma switching for RDNA

* mmq.cu: move amd wmma mmq/wmma switching behind IS_RDNA3

* Update ggml/src/ggml-cuda/mmq.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Jiacheng (Jason) Chen <76919340+jiachengjason@users.noreply.github.com>
Co-authored-by: jiachengjason <jasonchen.jiacheng@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-14 09:11:59 +02:00
Adrien Gallouët 5520f27363 ggml : fix avx512bf16 build (llama/18623)
- include `immintrin.h` when required
- remove unused m512bh

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-14 09:11:59 +02:00
Raul Torres 9a1a6685ba CANN: Make `valid_values` variable `static const` (llama/18627) 2026-01-14 09:11:59 +02:00
nwyin e563e239a7 ggml webgpu: add CEIL operation support (llama/18605)
* ggml-webgpu: add CEIL operation support

      Add support for the CEIL unary operation in the WebGPU backend:
      - Add CEIL_FUNC shader template in unary_op.wgsl
      - Add 4 shader variants (f32, f16, inplace versions)
      - Initialize CEIL pipelines in ggml-webgpu.cpp
      - Register CEIL in supports_op function

* docs: update WebGPU ops support for CEIL
2026-01-14 09:11:59 +02:00
Johannes Gäßler 9956333361 CUDA: fix FA FP16 accumulator overflow for Granite (llama/18614) 2026-01-14 09:11:59 +02:00
Aman Gupta 804f545454 ggml-cuda: check for srcs outside the cgraph (llama/18583)
* ggml-cuda: check for srcs outside the cgraph

* review: use leafs instead
2026-01-14 09:11:59 +02:00
Jeff Bolz 52ba45e2b8 vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (llama/18582) 2026-01-14 09:11:59 +02:00
Jeff Bolz 0a99b4c377 vulkan: handle quantize_q8_1 overflowing the max workgroup count (llama/18515)
* vulkan: handle quantize_q8_1 overflowing the max workgroup count

* vulkan: Fix small tile size matmul on lavapipe

* fix mul_mat_id failures
2026-01-14 09:11:59 +02:00
Chenguang Li 1d657effe3 CANN: add operator fusion support for ADD + RMS_NORM (llama/17512)
This commit implements operator fusion for ADD + RMS_NORM operations
in the CANN backend to reduce memory access overhead and improve
performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION
environment variable (default: false).

Changes:
- Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm
- Add ggml_cann_can_fuse() to check fusion eligibility
- Integrate fusion logic into computation graph evaluation
- Add test cases for ADD + RMS_NORM fusion
- Update documentation with new environment variable

The fusion combines ADD and RMS_NORM into a single kernel call,
which is more efficient than executing them separately.
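
As a rough CPU-side illustration of why fusing helps (a sketch only, not the
ACLNN AddRmsNorm kernel): the unfused path writes the sum to an intermediate
buffer and reads it back, while the fused path produces the sum and the
squared-sum accumulator in one pass. The function names below are hypothetical,
the optional weight (gamma) multiply is omitted, and both inputs are assumed to
have the same length.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused reference: ADD into a temporary buffer, then RMS_NORM over it.
// Assumes a.size() == b.size().
std::vector<float> add_then_rms_norm(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     float eps) {
    const std::size_t n = a.size();
    if (n == 0) return {};

    std::vector<float> sum(n);
    for (std::size_t i = 0; i < n; ++i) sum[i] = a[i] + b[i];

    float mean_sq = 0.0f;
    for (float v : sum) mean_sq += v * v;
    mean_sq /= static_cast<float>(n);
    const float scale = 1.0f / std::sqrt(mean_sq + eps);

    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = sum[i] * scale;
    return out;
}

// Fused sketch: the sum and the squared-sum accumulator are produced in the
// same loop, so no separate intermediate buffer is needed.
std::vector<float> add_rms_norm_fused(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      float eps) {
    const std::size_t n = a.size();
    if (n == 0) return {};

    std::vector<float> out(n);
    float mean_sq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
        mean_sq += out[i] * out[i];
    }
    mean_sq /= static_cast<float>(n);
    const float scale = 1.0f / std::sqrt(mean_sq + eps);

    for (float& v : out) v *= scale;
    return out;
}
```

Both functions should agree to within floating-point rounding; the fused one
simply touches memory fewer times.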
2026-01-14 09:11:59 +02:00
Daniel Bevenius 4d6a3fb00d sampling : add support for backend sampling (llama/17004)
* sampling : add support for backend sampling

This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.

The motivation for this feature is to enable sampling to be performed
directly on the backend as part of the computation graph being executed,
allowing for some or all of the sampling to be done on the backend.

For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the backend samplers to perform filtering of
the logits, or compute and filter the probability distribution, in
which case only the filtered logits or probabilities need to be
transferred back to system memory for further processing by CPU
samplers.

Currently, backend sampling works similarly to pooling: it is a function
that is called by build_graph, and the sampler operations become part of
the model's computation graph.

* llama-cli : add backend sampler configuration

* server : add backend sampling options/configuration

* webui : add backend sampling options

* ggml : add initial cumsum implementation for CUDA

* sampling : enable all backend sampler tests

This commit enables all existing backend sampler tests in the
test-backend-sampler. Previously, some tests were disabled because
there were missing ggml operation implementations.

* graph : do not include llama-model.h

* sampling : always expose sampled_ids

This commit precomputes and caches the full-vocab token id list in
llama_context's constructor, so llama_get_backend_sampled_token_ids_ith
always returns a valid pointer.

The motivation is that this lets both common/sampling.cpp and
src/llama-sampling.cpp simplify their logic.

Not all backend samplers that process logits need to set the
sampled_tokens_id, as they may not change the order of the logits. For
example, the temperature sampler only scales the logits but does not
change their order. Similarly, the logit bias sampler only adds bias to
specific token ids but does not change the order of the logits. In
these cases there will not be a device to host copy of the sampled
token ids, and this is the use case where having this precomputed
list is useful.
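
A minimal sketch of the idea, assuming the cached list is simply the ids
0..n_vocab-1 in order. All names here are hypothetical and do not mirror the
actual llama_context members or accessor:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the cached list: token ids 0..n_vocab-1 in order.
std::vector<int32_t> make_full_vocab_ids(std::size_t n_vocab) {
    std::vector<int32_t> ids(n_vocab);
    std::iota(ids.begin(), ids.end(), 0);
    return ids;
}

// Temperature scaling changes magnitudes only: logit i still belongs to token i,
// so no reordered id list has to be produced or copied back from the device.
void apply_temperature(std::vector<float>& logits, float temp) {
    for (float& l : logits) l /= temp;
}

// Hypothetical accessor: use the sampler-provided ids if the sampler reordered
// the logits, otherwise fall back to the cached identity list. Either way the
// caller receives a valid pointer.
const int32_t* sampled_ids_or_default(const std::vector<int32_t>& from_sampler,
                                      const std::vector<int32_t>& cached) {
    return from_sampler.empty() ? cached.data() : from_sampler.data();
}
```

With this fallback, callers never need a separate "no ids available" branch.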

* sampling : ensure at most one output token per seq

This commit adds a check in the batch allocator to ensure that when
backend sampling is enabled, at most one output token is specified per
sequence.
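
The check could look roughly like the sketch below. This is a simplified
illustration, not the batch allocator code: `token_entry` is a hypothetical
type, and each token is assumed to belong to exactly one sequence.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified view of one batch token.
struct token_entry {
    int32_t seq_id;    // sequence the token belongs to
    bool    is_output; // whether an output is requested for this token
};

// Returns true if no sequence requests more than one output token.
bool at_most_one_output_per_seq(const std::vector<token_entry>& batch) {
    std::unordered_map<int32_t, int> n_outputs;
    for (const token_entry& t : batch) {
        if (t.is_output && ++n_outputs[t.seq_id] > 1) {
            return false;
        }
    }
    return true;
}
```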

* CUDA: Optimize argsort for gpu-based token sampling

Argsort is currently used for top-k. We optimize argsort in two ways:

1. Use `DeviceRadixSort` for single-row/sequence to parallelize it
   across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the
   correct entrypoint (the function chooses between different execution
   paths, one of which is `DeviceSegmentedRadixSort`, and picks the best
   one according to heuristics):
   https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview
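
For reference, the result a segmented argsort has to produce can be sketched
on the CPU as below. This is not the CUB code path; it only illustrates that
each row is an independent segment, and the function name and the descending
order are assumptions made for the top-k use case.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Argsort every row of a row-major [n_rows x n_cols] matrix independently,
// largest value first. Each row is one "segment".
std::vector<int32_t> argsort_rows_desc(const std::vector<float>& data,
                                       std::size_t n_rows,
                                       std::size_t n_cols) {
    std::vector<int32_t> idx(n_rows * n_cols);
    for (std::size_t r = 0; r < n_rows; ++r) {
        auto first = idx.begin() + static_cast<std::ptrdiff_t>(r * n_cols);
        auto last  = first + static_cast<std::ptrdiff_t>(n_cols);
        std::iota(first, last, 0); // indices 0..n_cols-1 local to this row

        const float* row = data.data() + r * n_cols;
        std::stable_sort(first, last, [row](int32_t a, int32_t b) {
            return row[a] > row[b];
        });
    }
    return idx;
}
```

The first k indices of each row are then that row's top-k candidates.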

Some perf numbers for an RTX PRO 6000:

On the kernel level, tested with
`GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   359.24 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                  8192 runs -   861.34 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -  1020.01 us/run
```

After:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   312.41 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                 16384 runs -    63.48 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -   874.36 us/run
```
2026-01-14 09:11:59 +02:00
Aman Gupta f0bf5b8cc3 CUDA: disable cuda graph when using n-cpu-moe (llama/18593)
* CUDA: disable cuda graph when using n-cpu-moe

* call ggml_cuda_set_device
2026-01-14 09:11:59 +02:00
Aman Gupta 88f5765c82 ggml-cuda: remove unused params in ggml_cuda_graph (llama/18579) 2026-01-14 09:11:59 +02:00
Aman Gupta 1e725546b0 ggml-cuda: fixes for concurrent streams (llama/18496) 2026-01-14 09:11:59 +02:00
Johannes Gäßler 60d178cee9 CUDA: only allocate FA tmp buffer if needed (llama/18564) 2026-01-14 09:11:59 +02:00
pl752 304e780e5f (Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (llama/18559)
* CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta)

* CUDA: Explicitly casted some of the int alloc counts before multiplication in argsort

---------

Co-authored-by: pl752 <maximpl752@gmail.com>
2026-01-14 09:11:59 +02:00
Shouyu c9e9f083c2 ggml-hexagon: optimize activation function (llama/18393)
* refactor: refactor silu

* refactor: optimize swiglu

* refactor: remove unnecessary if in swiglu

* refactor: refactor swiglu_oai

* chore: fix formatting issue
2026-01-14 09:11:59 +02:00
Jeff Bolz 9d83865607 vulkan: Optimize GGML_OP_CUMSUM (llama/18417)
* vulkan: Optimize GGML_OP_CUMSUM

There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.
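
The two-pass scheme can be sketched on the CPU as follows. This is an
illustration of the technique rather than the shader itself; `block_size`
stands in for whatever per-block element count the shader uses, and the
function name is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Inclusive prefix sum of one row, computed in two passes over fixed-size blocks.
std::vector<float> cumsum_two_pass(const std::vector<float>& row,
                                   std::size_t block_size) {
    const std::size_t n = row.size();
    std::vector<float> out(n);
    if (n == 0) return out;

    block_size = std::max<std::size_t>(block_size, 1);
    const std::size_t n_blocks = (n + block_size - 1) / block_size;
    std::vector<float> block_total(n_blocks, 0.0f);

    // Pass 1: scan inside each block. Blocks are independent, which is what
    // lets a GPU process them in parallel.
    for (std::size_t b = 0; b < n_blocks; ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end   = std::min(begin + block_size, n);
        float acc = 0.0f;
        for (std::size_t i = begin; i < end; ++i) {
            acc += row[i];
            out[i] = acc;
        }
        block_total[b] = acc;
    }

    // Pass 2: add the sum of all preceding blocks to every element of a block.
    float offset = 0.0f;
    for (std::size_t b = 0; b < n_blocks; ++b) {
        const std::size_t begin = b * block_size;
        const std::size_t end   = std::min(begin + block_size, n);
        for (std::size_t i = begin; i < end; ++i) out[i] += offset;
        offset += block_total[b];
    }
    return out;
}
```

For the row {1, 2, 3, 4, 5} with a block size of 2, pass 1 yields
{1, 3, 3, 7, 5} and pass 2 turns that into {1, 3, 6, 10, 15}.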

In the whole-row shader, handle multiple elements per invocation.

* use 2 ELEM_PER_THREAD for AMD/Intel

* address feedback
2026-01-14 09:11:59 +02:00
Jeff Bolz b7ff521e71 vulkan: Implement mmvq for iq1_s/iq1_m (llama/18450) 2026-01-14 09:11:59 +02:00
Georgi Gerganov b99c911c49 metal : adjust extra size for FA buffer to avoid reallocations (llama/18545) 2026-01-14 09:11:59 +02:00
Chris Rohlf f328b13d5c rpc : use unordered_map::reserve and emplace (llama/18513) 2026-01-14 09:11:59 +02:00
MeeMin fbde389665 cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (llama/18433)
* ggml-cuda: fixed assertion in ggml_cuda_cpy (llama/18140)

* ggml-cuda: changes in data types to int64_t

* ggml-cuda: added asserts for CUDA block numbers

* ggml-cuda: changed the condition for y and z dimension
2026-01-14 09:11:59 +02:00
Aman Gupta f22c1ccbe4 ggml-cuda: remove unnecessary prints on ggml_cuda_init (llama/18502) 2026-01-14 09:11:59 +02:00
Jeff Bolz b1f65a4a7e vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (llama/18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
2026-01-14 09:11:59 +02:00
Georgi Gerganov ce03f8e759 ggml : bump version to 0.9.5 (ggml/1410) 2025-12-31 18:27:20 +02:00
gatbontonpc 8189f2cb65 metal : add count_equal op (llama/18314)
* add count equal for metal

* remove trailing whitespace

* updated doc ops table

* changed shmem to i32

* added multi tg and templating

* removed BLAS support from Metal docs

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add memset to set dst to 0

* metal : cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-31 17:52:09 +02:00
Johannes Gäßler 2d250f8049 CUDA: fix KQ max calculation (llama/18487) 2025-12-31 17:52:09 +02:00
Georgi Gerganov 5deaf8f2a3 metal : remove BF16 x F16 kernels (llama/18456) 2025-12-31 17:52:09 +02:00
Aman Gupta 467933199a sycl: add newline at the end of CMakeLists.txt (llama/18503) 2025-12-31 17:52:09 +02:00
Rahul Sathe a3635494da Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (llama/18345)
* cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x

* [AI] sycl: auto-detect and skip incompatible IntelSYCL package

Automatically detect compiler versions with incompatible IntelSYCL
CMake configuration files and fall back to manual SYCL flags instead
of requiring users to set options manually.

Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
has SYCL_FEATURE_TEST_EXTRACT invocation errors.

* refactor: improve SYCL provider handling and error messages in CMake configuration

* refactor: enhance SYCL provider validation and error handling in CMake configuration

* ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes
2025-12-31 17:52:09 +02:00
Charles Xu c9955367d4 kleidiai: add and integrate SVE 256-bit vector-length kernel (llama/18458)
* kleidiai: add and integrate SVE 256-bit vector-length kernel

* updated for review comments
2025-12-31 17:52:09 +02:00
Aman Gupta 6d4aa96bfa CUDA: add log line when mxfp4 acceleration is used (llama/18483)
* CUDA: add log line when mxfp4 acceleration is used

* add in backend_get_features
2025-12-31 17:52:09 +02:00
Johannes Gäßler 5765c5b04e CUDA: fix replacement of bad archs in CMake (llama/18457) 2025-12-31 17:52:09 +02:00
Johannes Gäßler d6cb2407b7 CUDA: Blackwell features for non-native builds (llama/18436) 2025-12-31 17:52:09 +02:00
Aman Gupta e49e88b2d8 cuda: fix race condition in cumsum (llama/18448)
* ggml-cuda: fix race condition in cumsum

* remove unnecessary sync_threads
2025-12-31 17:52:09 +02:00
uvos 20f5729921 HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (llama/18202) 2025-12-31 17:52:09 +02:00
Aman Gupta b8d209f55c Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (llama/18426) 2025-12-31 17:52:09 +02:00
o7si 54fe9a645d rpc: fix segfault on invalid endpoint format (llama/18387)
* rpc: fix segfault on invalid endpoint format

* rpc: add error log for failed endpoint connection
2025-12-31 17:52:09 +02:00
Boian Berberov b3788ef729 cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (llama/18186)
* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`

Resolves: #17966
2025-12-31 17:52:09 +02:00
QDelta 31fc2c37c8 ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (llama/18413) 2025-12-31 17:52:09 +02:00
lhez a800a3acd1 opencl: allow resizing transpose buffers (llama/18384)
* opencl: allow resizing transpose buffers instead of using fixed sizes

* opencl: remove commented code
2025-12-31 17:52:09 +02:00
Aman Gupta 29f8155445 ggml-cuda: Use same regex for GGML_NATIVE=OFF (llama/18407) 2025-12-31 17:52:09 +02:00
Jeff Bolz 015b618d96 vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (llama/18352)
Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
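
A CPU-side sketch of the idea, under the assumption that each workgroup is
assigned one (expert, tile) pair and that a tile covers `tile_rows` routed
rows. The names and the tiling scheme are illustrative, not taken from the
shader:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Preprocess: count how many routed rows select each expert.
// `ids` holds one expert id per routed row; out-of-range ids are ignored.
std::vector<uint32_t> count_expert_usage(const std::vector<int32_t>& ids,
                                         std::size_t n_experts) {
    std::vector<uint32_t> counts(n_experts, 0);
    for (int32_t e : ids) {
        if (e >= 0 && static_cast<std::size_t>(e) < n_experts) {
            ++counts[static_cast<std::size_t>(e)];
        }
    }
    return counts;
}

// A workgroup for tile `tile` of `expert` has work only if the tile starts
// before the end of that expert's rows; otherwise it can exit immediately.
bool workgroup_has_work(const std::vector<uint32_t>& counts,
                        std::size_t expert,
                        std::size_t tile,
                        std::size_t tile_rows) {
    return expert < counts.size() && tile * tile_rows < counts[expert];
}
```

With the counts available up front, a workgroup can decide to exit from a
single lookup instead of scanning the id list itself.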
2025-12-31 17:52:09 +02:00
Jeff Bolz e37c8ed94e vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (llama/18349)
* vulkan: Use BK=32 for coopmat2 mul_mat_id

* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.

Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
2025-12-31 17:52:09 +02:00
Jeff Bolz 331c6ccd31 vulkan: Use BK=32 for coopmat2 mul_mat_id (llama/18332) 2025-12-31 17:52:09 +02:00
Eve 35cb4abb67 vulkan: small dequantization improvements (llama/18380)
* iq4_xs

* quants
2025-12-31 17:52:09 +02:00
Jeff Bolz 181e36f194 vulkan: Support UPSCALE w/antialias (llama/18327) 2025-12-31 17:52:09 +02:00
Jeff Bolz 67473fef57 vulkan: handle rope with large number of rows (llama/18306) 2025-12-31 17:52:09 +02:00
0Marble 33f75a88ac CANN: implement the SSM_CONV operator (llama/17737)
* CANN: implement SSM_CONV operator

Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

* CANN: remove custom error limit for SSM_CONV

* CANN: merge SSM_CONV tensor shape/strides into one line

---------

Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
2025-12-31 17:52:09 +02:00
Aman Gupta 51778354ce ggml-cuda: fix regex for arch list (llama/18371)
* ggml-cuda: fix regex for arch list

* make regex exact
2025-12-31 17:52:09 +02:00