Commit Graph

1887 Commits

Author SHA1 Message Date
Georgi Gerganov 6ee0eaf531 CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (llama/18800) 2026-01-14 09:11:59 +02:00
Jeff Bolz ab1828dc1c vulkan: change memory_logger to be controlled by an env var (llama/18769) 2026-01-14 09:11:59 +02:00
Jeff Bolz aedf332ec5 vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2); for mul_mm the limit is roughly
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-14 09:11:59 +02:00
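A minimal sketch of the 32-bit block-index arithmetic described in the commit above, assuming Q8_0's 32 elements per block and the 8192 x 5120 x 128 A matrix it mentions (plain C++, not the actual shader code):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical values roughly matching the model described above:
// an 8192 x 5120 x 128 Q8_0 A matrix with 32 elements per block.
constexpr uint32_t QUANT_K = 32;

int main() {
    const uint64_t ne0 = 8192, ne1 = 5120, batches = 128;
    const uint64_t batch_stride_elems = ne0 * ne1;                        // elements per expert matrix

    // Naive: an index in elements overflows 32 bits for the later batches.
    const uint64_t elem_index_last = batch_stride_elems * (batches - 1);  // > 2^32

    // Dividing the stride by QUANT_K first keeps the *block* index in 32-bit range
    // as long as the total number of blocks stays below 2^32.
    const uint32_t batch_stride_blocks = (uint32_t)(batch_stride_elems / QUANT_K);
    const uint32_t block_index_last    = batch_stride_blocks * (uint32_t)(batches - 1);

    printf("element index (needs 64b): %llu\n", (unsigned long long) elem_index_last);
    printf("block index   (fits 32b):  %u\n",   block_index_last);
    return 0;
}
```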
Ruben Ortlam 716d68aca9 vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (llama/18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-14 09:11:59 +02:00
Ruben Ortlam c0433783c3 Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (llama/18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-14 09:11:59 +02:00
shaofeiqi d4ce2e554f opencl: add SOFTPLUS op support (llama/18726) 2026-01-14 09:11:59 +02:00
Johannes Gäßler 3a1ea96373 HIP: adjust RDNA3.5 MMQ kernel selection logic (llama/18666) 2026-01-14 09:11:59 +02:00
Perry Naseck 484b17053a cmake : update blas logic (llama/18205) 2026-01-14 09:11:59 +02:00
Michael Wand 45be2cd27a Corrected: set s13 = src1->nb[3] instead of nb[2] (llama/18724) 2026-01-14 09:11:59 +02:00
shaofeiqi 4af27bf2da opencl: add EXPM1 op (llama/18704) 2026-01-14 09:11:59 +02:00
Reese Levine 4ac8c3b478 Updates to webgpu get_memory (llama/18707) 2026-01-14 09:11:59 +02:00
Aaron Teo fff3ebd93d llama: use host memory if device reports 0 memory (llama/18587) 2026-01-14 09:11:59 +02:00
Masashi Yoshimura a71127dfd8 ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (llama/18628)
* Fix GGML_MEM_ALIGN to 8 for emscripten.

* Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten
2026-01-14 09:11:59 +02:00
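A small illustration of the kind of rounding helper GGML_MEM_ALIGN feeds into, and why 8 is the minimum safe value on 64-bit wasm, presumably because aligned regions may hold 8-byte values such as pointers and 64-bit offsets. This is a generic sketch, not the ggml source:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for GGML_MEM_ALIGN; the commit above sets it to 8 for emscripten.
constexpr std::size_t MEM_ALIGN = 8;

// Round an offset up to the next multiple of MEM_ALIGN (must be a power of two).
constexpr std::size_t align_up(std::size_t offset) {
    return (offset + MEM_ALIGN - 1) & ~(MEM_ALIGN - 1);
}

static_assert(MEM_ALIGN >= alignof(std::uint64_t),
              "64-bit fields (e.g. pointers on wasm64) need at least 8-byte alignment");
static_assert(align_up(13) == 16 && align_up(16) == 16, "sanity check");
```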
Reese Levine 1bb903f599 ggml webgpu: initial flashattention implementation (llama/18610)
* FlashAttention (llama/13)

* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet, though

* neg passes backend test

* unary operators pass ggml tests

* rms_norm double declaration bug atoned

* abides by editor-config

* removed vestigial files

* fixed autoconfig

* All operators (including xielu) working

* removed unnecessary checking of whether node->src[1] exists for unary operators

* responded and dealt with PR comments

* implemented REPL_Template support and removed bug in unary operators kernel

* formatted embed wgsl and ggml-webgpu.cpp

* Faster tensors (llama/8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (llama/9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still a multiple of the sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (llama/4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Refactored pipelines and workgroup calculations (llama/10)

* refactored pipelines

* refactored workgroup calculation

* removed commented out block of prior maps

* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention

* Shader structure set up (many bugs still)

* debugging

* Working first test

* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32

* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling

* Start work on integrating pre-wgsl

* Separate structs/initial shader compilation library into separate files

* Work on compilation choices for flashattention

* Work on subgroup matrix/tile size portability

* subgroup size agnostic online softmax

* Cleanups, quantization types

* more cleanup

* fix wasm build

* Refactor flashattention to increase parallelism, use direct loads for KV in some cases

* Checkpoint

* formatting

* Update to account for default kv cache padding

* formatting shader

* Add workflow for ggml-ci webgpu

* Try passing absolute path to dawn in ggml-ci

* Avoid error on device destruction, add todos for proper cleanup

* Fix unused warning

* Missed one unused parameter

* Move some flashattn computation to f32 for correctness
2026-01-14 09:11:59 +02:00
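The "subgroup size agnostic online softmax" mentioned above is the streaming-softmax trick at the core of flash attention: scores are folded in block by block while a running max and normalizer are maintained. A scalar C++ sketch of the idea (illustrative only, not the WGSL shader):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Online softmax accumulation: fold in attention scores block by block without
// materializing the full row. 'acc' is the running weighted sum of V rows,
// 'm' the running max score, 'l' the running normalizer.
struct OnlineSoftmax {
    float m = -INFINITY;
    float l = 0.0f;
    std::vector<float> acc;     // one accumulator per head dimension

    explicit OnlineSoftmax(std::size_t head_dim) : acc(head_dim, 0.0f) {}

    void add_block(const std::vector<float> & scores,
                   const std::vector<std::vector<float>> & v_rows) {
        // running max including this block
        float m_new = m;
        for (float s : scores) m_new = std::max(m_new, s);

        // rescale the previous accumulator and normalizer to the new max
        const float scale = std::exp(m - m_new);
        l *= scale;
        for (float & a : acc) a *= scale;

        // accumulate the current block
        for (std::size_t i = 0; i < scores.size(); ++i) {
            const float p = std::exp(scores[i] - m_new);
            l += p;
            for (std::size_t d = 0; d < acc.size(); ++d) acc[d] += p * v_rows[i][d];
        }
        m = m_new;
    }

    // divide by the normalizer once all blocks have been folded in
    void finalize() { for (float & a : acc) a /= l; }
};
```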
Jeff Bolz 0bc0e5616e vulkan: fix push constant size for quantize_q8_1 (llama/18687)
I added an assert to catch further mismatches, and it found several.
Fix those, too.
2026-01-14 09:11:59 +02:00
Jeff Bolz 678c660e62 vulkan: optimize ssm_scan (llama/18630)
* vulkan: optimize ssm_scan

* fix warp vs subgroup naming
2026-01-14 09:11:59 +02:00
도로로도로또 f2d8588229 metal : add MoE kernel specialization for ne20=5 (llama/18667)
Add template specialization for kernel_mul_mm_id_map0 with ne20=5
to support models using 5 active experts (e.g., VAETKI).
2026-01-14 09:11:59 +02:00
Doctor Shotgun b9965c89a1 ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (llama/18535)
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-14 09:11:59 +02:00
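A hedged sketch of the pattern described in the commit above: read GGML_OP_OFFLOAD_MIN_BATCH once, default to the previous hardcoded 32, and cache the result. The function name is illustrative, not the actual ggml code:

```cpp
#include <cstdlib>

// Minimum batch size that triggers op offload: read the env var once,
// cache it, and fall back to the prior hardcoded value of 32.
static int op_offload_min_batch() {
    static const int value = [] {
        const char * s = std::getenv("GGML_OP_OFFLOAD_MIN_BATCH");
        return s ? std::atoi(s) : 32;
    }();
    return value;
}
```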
shaofeiqi 85a329cb08 opencl: add FILL op support (llama/18682) 2026-01-14 09:11:59 +02:00
Oliver Walsh 4f2ca7c163 cuda : fix build on cuda 12.8 (llama/18672)
compute121 requires CUDA 12.9

Signed-off-by: Oliver Walsh <owalsh@redhat.com>
2026-01-14 09:11:59 +02:00
Jeff Bolz a91ab72bd9 vulkan: reject ops when a tensor is too large to allocate (llama/18646) 2026-01-14 09:11:59 +02:00
virajwad 096e7e911a vulkan: Warptile tuning for Intel Xe2/Xe3 (llama/18178)
* modify warptile tuning for xe3

* intel vendor check w/ coopmat support

* fix back formatting

* fix formatting change 2

* move intel check to chip specific tuning part

* Change to support both windows and linux

* modify m_warptile to l_warptile for intel

* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)

* Code style changes

* Code style changes (2)

* Code style changes (3)
2026-01-14 09:11:59 +02:00
Eve a576ed944a vulkan: more mul mat optimizations (llama/18533)
* q4_k

* q5_k

* q2_k

* q4_1

* q5_1

* better buf index
2026-01-14 09:11:59 +02:00
hipudding 5c583f3c02 CANN: Fix rename for get_env (llama/18652)
In #18624, get_env in ggml-cann was renamed to get_env_as_lowercase
to accurately reflect the function’s behavior and reduce the chance
of misuse. However, the update missed renaming call sites in other
files. This commit fixes that oversight.
2026-01-14 09:11:59 +02:00
Raul Torres 47671c81db CANN: Rename `get_env` to `get_env_as_lowercase` (llama/18624) 2026-01-14 09:11:59 +02:00
Max Krasnyansky a5f51ac75b Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (llama/18611)
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention

* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx

* hexagon: add support for SCALE fp32

* hexagon: replace scalar fp32 -> fp16 copy with HVX

* hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA

- Implements double-buffered DMA prefetching for K, V, and Mask tensors.
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.

* hexagon: use aligned mad_f16

* hexagon: flash_attn: more aligned ops

* hexagon: optimize scale_f32 hvx helpers

* hexagon: unroll fa loops

* hexagon: remove unused set-rows log

* hexagon: flash_attn_ext add support for DMAing Q

- Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.

* hexagon: fix handling of NaNs in HVX dot products

* hexagon: clean up spad allocation in flash-attn

* hexagon: improve fp16/fp32 matmul

- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
- Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.

* hexagon: fix HVX_ARCH check

* hexagon: matmul cleanup and fp16 fixes

Use aligned vec_dot_f16 for 2d matmuls and unaligned version for 4d.

* hexagon: fix fp16 x fp16 matmuls and some minor refactoring

* hexagon: add support for GET_ROWS f32 -> f32

Also optimize SET_ROWS threading a bit when we have just a few rows to process.

* hexagon: optimize set-rows threading

* hexagon: update adb/run-bench.sh to properly support experimental and verbose options

* hexagon: flash_attn: use aligned vectors for dot products
2026-01-14 09:11:59 +02:00
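The K/V/Q rows above are padded to 128 bytes so HVX vector operations and DMA stay aligned (HVX vectors are 128 bytes wide). A tiny generic illustration of that rounding, not the Hexagon source:

```cpp
#include <cstddef>

// Round a row size in bytes up to the next multiple of 128 (the HVX vector width),
// so every row starts and ends on an aligned boundary in VTCM.
constexpr std::size_t pad_row_128(std::size_t row_bytes) {
    return (row_bytes + 127) & ~static_cast<std::size_t>(127);
}

static_assert(pad_row_128(1)   == 128, "");
static_assert(pad_row_128(128) == 128, "");
static_assert(pad_row_128(129) == 256, "");
```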
Aadeshveer Singh 436f30d05f ggml : optimize cuda ssm_scan using warp-level reduction (llama/18505)
* ggml : optimize cuda ssm_scan using warp-level reduction

* ggml : apply code review suggestions (style, const, constexpr)

* ggml : add TODO regarding stride consistency
2026-01-14 09:11:59 +02:00
Jeff Bolz dbec71f6cf vulkan: support buffer_from_host_ptr (llama/18467)
* vulkan: support buffer_from_host_ptr

* hacky use of buffer_from_host_ptr for directio

* disable buffer_from_host_ptr cap

* use external memory for ggml_vk_host_malloc, revert model loader changes

* disable external_memory_host for MoltenVK

* take buffer memory types into account

* don't use external_memory_host for ggml_vk_host_malloc
2026-01-14 09:11:59 +02:00
Aman Gupta 575d894603 ggml-cuda: refactor cuda graph usage (llama/18637)
* ggml-cuda: refactor cuda graph usage

* use is_enabled() instead of enabled
2026-01-14 09:11:59 +02:00
Beinsezii ed674cfc10 mmq.cu: tune mmq/rocblas switching for RDNA (llama/18537)
* Patch perf regression for mmq kernels in ROCm

recovers the performance regression reported in https://github.com/ggml-org/llama.cpp/issues/17917

* add n_experts branch like the cdna path

* mmq.cu: tune mmq/wmma switching for RDNA

* mmq.cu: move amd wmma mmq/wmma switching behind IS_RDNA3

* Update ggml/src/ggml-cuda/mmq.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Jiacheng (Jason) Chen <76919340+jiachengjason@users.noreply.github.com>
Co-authored-by: jiachengjason <jasonchen.jiacheng@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-14 09:11:59 +02:00
Adrien Gallouët 5520f27363 ggml : fix avx512bf16 build (llama/18623)
- include `immintrin.h` when required
- remove unused m512bh

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-14 09:11:59 +02:00
Raul Torres 9a1a6685ba CANN: Make `valid_values` variable `static const` (llama/18627) 2026-01-14 09:11:59 +02:00
nwyin e563e239a7 ggml webgpu: add CEIL operation support (llama/18605)
* ggml-webgpu: add CEIL operation support

  Add support for the CEIL unary operation in the WebGPU backend:
  - Add CEIL_FUNC shader template in unary_op.wgsl
  - Add 4 shader variants (f32, f16, inplace versions)
  - Initialize CEIL pipelines in ggml-webgpu.cpp
  - Register CEIL in supports_op function

* docs: update WebGPU ops support for CEIL
2026-01-14 09:11:59 +02:00
Johannes Gäßler 9956333361 CUDA: fix FA FP16 accumulator overflow for Granite (llama/18614) 2026-01-14 09:11:59 +02:00
Aman Gupta 804f545454 ggml-cuda: check for srcs outside the cgraph (llama/18583)
* ggml-cuda: check for srcs outside the cgraph

* review: use leafs instead
2026-01-14 09:11:59 +02:00
Jeff Bolz 52ba45e2b8 vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (llama/18582) 2026-01-14 09:11:59 +02:00
Jeff Bolz 0a99b4c377 vulkan: handle quantize_q8_1 overflowing the max workgroup count (llama/18515)
* vulkan: handle quantize_q8_1 overflowing the max workgroup count

* vulkan: Fix small tile size matmul on lavapipe

* fix mul_mat_id failures
2026-01-14 09:11:59 +02:00
Chenguang Li 1d657effe3 CANN: add operator fusion support for ADD + RMS_NORM (llama/17512)
This commit implements operator fusion for ADD + RMS_NORM operations
in the CANN backend to reduce memory access overhead and improve
performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION
environment variable (default: false).

Changes:
- Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm
- Add ggml_cann_can_fuse() to check fusion eligibility
- Integrate fusion logic into computation graph evaluation
- Add test cases for ADD + RMS_NORM fusion
- Update documentation with new environment variable

The fusion combines ADD and RMS_NORM into a single kernel call,
which is more efficient than executing them separately.
2026-01-14 09:11:59 +02:00
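For reference, the semantics the fused kernel implements, written as a plain C++ loop over one row; this is a sketch of ADD followed by RMS_NORM, not the ACLNN AddRmsNorm call:

```cpp
#include <cmath>
#include <cstddef>

// Reference semantics of the fused ADD + RMS_NORM over a single row:
// y = rms_norm(a + b), with the usual eps for numerical stability.
void add_rms_norm_row(const float * a, const float * b, float * y,
                      std::size_t n, float eps = 1e-6f) {
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a[i] + b[i];                 // ADD
        sum_sq += (double) y[i] * y[i];
    }
    const float inv_rms = 1.0f / std::sqrt((float)(sum_sq / n) + eps);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] *= inv_rms;                    // RMS_NORM (without the learned scale)
    }
}
```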
Daniel Bevenius 4d6a3fb00d sampling : add support for backend sampling (llama/17004)
* sampling : add support for backend sampling

This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.

The motivation for this feature is to enable sampling to be performed
directly on the backend as part of the computation graph being executed,
allowing for some or all of the sampling to be done on the backend.

For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the backend samplers to perform filtering of
the logits, or compute and filter the probability distribution, in
which case only the filtered logits or probabilities need to be
transferred back to system memory for further processing by CPU
samplers.

Currently, backend sampling works in a similar manner to pooling: it is
a function called by build_graph, and the sampler operations become part
of the model's computation graph.

* llama-cli : add backend sampler configuration

* server : add backend sampling options/configuration

* webui : add backend sampling options

* ggml : add initial cumsum implementation for CUDA

* sampling : enable all backend sampler tests

This commit enables all existing backend sampler tests in the
test-backend-sampler. Previously, some tests were disabled because
there were missing ggml operation implementations.

* graph : do not include llama-model.h

* sampling : always expose sampled_ids

This commit precomputes and caches the full-vocab token id list in
llama_context's constructor, so llama_get_backend_sampled_token_ids_ith
always returns a valid pointer.

The motivation for this is that it enables both common/sampling.cpp
and src/llama-sampling.cpp to simplify their logic.

Not all backend samplers that process logits need to set
sampled_tokens_id, as they may not change the order of the logits; for
example, the temperature sampler only scales the logits but does not
change their order. Similarly, the logit bias sampler only adds bias to
specific token ids but does not change the order of the logits. In
these cases there will not be a device-to-host copy of the sampled
token ids, and this is the use case where having this precomputed
list is useful.

* sampling : ensure at most one output token per seq

This commit adds a check in the batch allocator to ensure that when
backend sampling is enabled, at most one output token is specified per
sequence.

* CUDA: Optimize argsort for gpu-based token sampling

Argsort is used for top-k currently. We optimize argsort in two ways:

1. Use `DeviceRadixSort` for single-row/sequence to parallelize it
   across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the
   correct entrypoint (the function chooses different execution paths;
   it contains `DeviceSegmentedRadixSort` as one of the paths and will
   choose the best one according to heuristics).
   https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview

Some perf numbers for an RTX PRO 6000:

On the kernel level, tested with
`GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   359.24 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                  8192 runs -   861.34 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -  1020.01 us/run
```

After:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   312.41 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                 16384 runs -    63.48 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -   874.36 us/run
```
2026-01-14 09:11:59 +02:00
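A toy illustration of the payoff described above: when the backend sampler chain resolves to a single token on the device, only one integer needs to cross back to the host instead of the full logit vector. Plain C++ argmax stands in for the graph ops; the function name is illustrative:

```cpp
#include <cstdint>
#include <vector>

// Stand-in for a greedy backend sampler: reduce n_vocab logits to one token id.
// In the real graph this runs on the device; only 'token' is copied to the host.
int32_t greedy_sample(const std::vector<float> & logits) {
    int32_t token = 0;
    for (int32_t i = 1; i < (int32_t) logits.size(); ++i) {
        if (logits[i] > logits[token]) token = i;
    }
    return token;   // 4 bytes device->host instead of n_vocab * 4 bytes
}
```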
Aman Gupta f0bf5b8cc3 CUDA: disable cuda graph when using n-cpu-moe (llama/18593)
* CUDA: disable cuda graph when using n-cpu-moe

* call ggml_cuda_set_device
2026-01-14 09:11:59 +02:00
Aman Gupta 88f5765c82 ggml-cuda: remove unused params in ggml_cuda_graph (llama/18579) 2026-01-14 09:11:59 +02:00
Aman Gupta 1e725546b0 ggml-cuda: fixes for concurrent streams (llama/18496) 2026-01-14 09:11:59 +02:00
Johannes Gäßler 60d178cee9 CUDA: only allocate FA tmp buffer if needed (llama/18564) 2026-01-14 09:11:59 +02:00
pl752 304e780e5f (Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (llama/18559)
* CUDA: Fixed obj byte size being passed to pool alloc instead of obj count (fattn-common, dst_tmp_meta)

* CUDA: Explicitly cast some of the int alloc counts before multiplication in argsort

---------

Co-authored-by: pl752 <maximpl752@gmail.com>
2026-01-14 09:11:59 +02:00
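The second fix above targets a classic overflow: multiplying 32-bit counts before widening. A minimal generic illustration (not the actual pool-alloc code):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t nrows = 200000, ncols = 16384;   // plausible large-sort shape

    // Bad: the product is computed in 32 bits and silently wraps
    // before it is widened to 64 bits for the allocation size.
    const uint64_t bytes_bad  = (uint64_t)(nrows * ncols * (uint32_t) sizeof(float));

    // Good: widen one operand first so the whole product is computed in 64 bits.
    const uint64_t bytes_good = (uint64_t) nrows * ncols * sizeof(float);

    printf("wrapped: %llu  correct: %llu\n",
           (unsigned long long) bytes_bad, (unsigned long long) bytes_good);
    return 0;
}
```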
Shouyu c9e9f083c2 ggml-hexagon: optimize activation function (llama/18393)
* refactor: refactor silu

* refactor: optimize swiglu

* refactor: remove unnecessary if in swiglu

* refactor: refactor swiglu_oai

* chore: fix formatting issue
2026-01-14 09:11:59 +02:00
Jeff Bolz 9d83865607 vulkan: Optimize GGML_OP_CUMSUM (llama/18417)
* vulkan: Optimize GGML_OP_CUMSUM

There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.

In the whole-row shader, handle multiple elements per invocation.

* use 2 ELEM_PER_THREAD for AMD/Intel

* address feedback
2026-01-14 09:11:59 +02:00
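A scalar C++ sketch of the multipass scheme described above: pass one computes cumulative sums within each block and records the block totals, pass two adds each block's exclusive prefix. This stands in for the Vulkan shaders and ignores workgroup details:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Pass 1: cumulative sum within each block, recording each block's total.
// Pass 2: add the exclusive prefix of the block totals to every block.
void cumsum_two_pass(std::vector<float> & row, std::size_t block_size) {
    const std::size_t n_blocks = (row.size() + block_size - 1) / block_size;
    std::vector<float> block_totals(n_blocks, 0.0f);

    for (std::size_t b = 0; b < n_blocks; ++b) {            // pass 1
        float running = 0.0f;
        const std::size_t end = std::min(row.size(), (b + 1) * block_size);
        for (std::size_t i = b * block_size; i < end; ++i) {
            running += row[i];
            row[i] = running;
        }
        block_totals[b] = running;
    }

    float prefix = 0.0f;
    for (std::size_t b = 0; b < n_blocks; ++b) {            // pass 2
        const std::size_t end = std::min(row.size(), (b + 1) * block_size);
        for (std::size_t i = b * block_size; i < end; ++i) {
            row[i] += prefix;
        }
        prefix += block_totals[b];
    }
}
```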
Jeff Bolz b7ff521e71 vulkan: Implement mmvq for iq1_s/iq1_m (llama/18450) 2026-01-14 09:11:59 +02:00
Georgi Gerganov b99c911c49 metal : adjust extra size for FA buffer to avoid reallocations (llama/18545) 2026-01-14 09:11:59 +02:00
Chris Rohlf f328b13d5c rpc : use unordered_map::reserve and emplace (llama/18513) 2026-01-14 09:11:59 +02:00
MeeMin fbde389665 cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (llama/18433)
* ggml-cuda: fixed assertion in ggml_cuda_cpy (llama/18140)

* ggml-cuda: changes in data types to int64_t

* ggml-cuda: added asserts for CUDA block numbers

* ggml-cuda: changed the condition for y and z dimension
2026-01-14 09:11:59 +02:00
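A hedged sketch of the constraint behind the last two bullets: CUDA caps gridDim.y and gridDim.z at 65535, so block counts derived from very large tensors need 64-bit arithmetic and an explicit check. Host-side illustration only, not the actual kernel launch code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: compute the number of blocks along one dimension with
// 64-bit arithmetic, then check it against the CUDA launch limits.
constexpr int64_t CUDA_MAX_GRID_YZ = 65535;   // documented limit for gridDim.y/z

int64_t num_blocks(int64_t n_elements, int64_t block_size) {
    const int64_t blocks = (n_elements + block_size - 1) / block_size;
    assert(blocks <= CUDA_MAX_GRID_YZ && "split the copy or map this dim onto gridDim.x");
    return blocks;
}
```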