Commit Graph

4210 Commits

Author SHA1 Message Date
Jeff Bolz 52ba45e2b8 vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (llama/18582) 2026-01-14 09:11:59 +02:00
Jeff Bolz 0a99b4c377 vulkan: handle quantize_q8_1 overflowing the max workgroup count (llama/18515)
* vulkan: handle quantize_q8_1 overflowing the max workgroup count

* vulkan: Fix small tile size matmul on lavapipe

* fix mul_mat_id failures
2026-01-14 09:11:59 +02:00
Chenguang Li 1d657effe3 CANN: add operator fusion support for ADD + RMS_NORM (llama/17512)
This commit implements operator fusion for ADD + RMS_NORM operations
in the CANN backend to reduce memory access overhead and improve
performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION
environment variable (default: false).

Changes:
- Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm
- Add ggml_cann_can_fuse() to check fusion eligibility
- Integrate fusion logic into computation graph evaluation
- Add test cases for ADD + RMS_NORM fusion
- Update documentation with new environment variable

The fusion combines ADD and RMS_NORM into a single kernel call,
which is more efficient than executing them separately.
2026-01-14 09:11:59 +02:00
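The fused ADD + RMS_NORM operation above can be sketched numerically. The following is a minimal NumPy reference (an illustration only, not the ACLNN AddRmsNorm kernel; the eps term and function names are assumptions), showing that the fused form computes the same result as running the two ops separately while avoiding a round trip through memory for the intermediate sum:

```python
import numpy as np

def ggml_add(a, b):
    # ADD: elementwise sum (e.g. adding a residual)
    return a + b

def ggml_rms_norm(h, weight, eps=1e-6):
    # RMS_NORM: scale each row by the reciprocal of its root mean square
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return (h / rms) * weight

def add_rms_norm_fused(x, residual, weight, eps=1e-6):
    # Fused variant: the intermediate (x + residual) stays local instead of
    # being written out by one kernel and read back by the next.
    h = x + residual
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return (h / rms) * weight
```

The fused and separate paths are numerically equivalent; only the memory traffic differs.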
Daniel Bevenius 4d6a3fb00d sampling : add support for backend sampling (llama/17004)
* sampling : add support for backend sampling

This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.

The motivation for this feature is to allow some or all of the sampling
to be performed directly on the backend as part of the computation graph
being executed.

For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the backend samplers to perform filtering of
the logits, or to compute and filter the probability distribution, in
which case only the filtered logits or probabilities need to be
transferred back to system memory for further processing by CPU
samplers.

Currently, backend sampling works in a similar manner to pooling: it is
a function called by build_graph, and the sampler operations become
part of the model's computation graph.

* llama-cli : add backend sampler configuration

* server : add backend sampling options/configuration

* webui : add backend sampling options

* ggml : add initial cumsum implementation for CUDA

* sampling : enable all backend sampler tests

This commit enables all existing backend sampler tests in
test-backend-sampler. Previously, some tests were disabled because some
ggml operation implementations were missing.

* graph : do not include llama-model.h

* sampling : always expose sampled_ids

This commit precomputes and caches the full-vocab token id list in
llama_context's constructor, so llama_get_backend_sampled_token_ids_ith
always returns a valid pointer.

The motivation for this is that it enables both common/sampling.cpp
and src/llama-sampling.cpp to simplify their logic.

Not all backend samplers that process logits need to set the
sampled_tokens_id, as they may not change the order of the logits. For
example, the temperature sampler only scales the logits but does not
change their order. Similarly, the logit bias sampler only adds a bias
to specific token ids but does not change the order of the logits. In
these cases there will be no device-to-host copy of the sampled token
ids, and this is the use case where having this precomputed list is
useful.

* sampling : ensure at most one output token per seq

This commit adds a check in the batch allocator to ensure that when
backend sampling is enabled, at most one output token is specified per
sequence.

* CUDA: Optimize argsort for gpu-based token sampling

Argsort is currently used for top-k. We optimize argsort in two ways:

1. Use `DeviceRadixSort` for single-row/sequence to parallelize it
   across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the
   correct entrypoint (the function chooses different execution paths;
   it contains `DeviceSegmentedRadixSort` as one of the paths and will
   choose the best one according to heuristics).
   https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview

Some perf numbers for a RTX PRO 6000:

On the kernel level, tested with
`GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   359.24 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                  8192 runs -   861.34 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -  1020.01 us/run
```

After:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   312.41 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                 16384 runs -    63.48 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -   874.36 us/run
```
2026-01-14 09:11:59 +02:00
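As context for the argsort-based top-k above: a full descending sort of each row of logits yields the top-k token ids as its first k indices, and the multi-row case is simply a per-segment sort. A NumPy sketch (illustrative only; the function name is an assumption and the CUDA path uses the CUB device-wide sorts, not this code):

```python
import numpy as np

def top_k_ids(logits, k):
    # np.argsort is ascending, so flip the last axis for descending order,
    # then keep the first k indices. Shape: (n_rows, n_vocab) -> (n_rows, k)
    order = np.argsort(logits, axis=-1)[..., ::-1]
    return order[..., :k]
```

Sorting each row independently corresponds to the segmented (multi-row/sequence) case handled by `DeviceSegmentedSort`.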
Aman Gupta f0bf5b8cc3 CUDA: disable cuda graph when using n-cpu-moe (llama/18593)
* CUDA: disable cuda graph when using n-cpu-moe

* call ggml_cuda_set_device
2026-01-14 09:11:59 +02:00
Aman Gupta 88f5765c82 ggml-cuda: remove unused params in ggml_cuda_graph (llama/18579) 2026-01-14 09:11:59 +02:00
Aman Gupta 1e725546b0 ggml-cuda: fixes for concurrent streams (llama/18496) 2026-01-14 09:11:59 +02:00
Johannes Gäßler 60d178cee9 CUDA: only allocate FA tmp buffer if needed (llama/18564) 2026-01-14 09:11:59 +02:00
pl752 304e780e5f (Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (llama/18559)
* CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta)

* CUDA: Explicitly cast some of the int alloc counts before multiplication in argsort

---------

Co-authored-by: pl752 <maximpl752@gmail.com>
2026-01-14 09:11:59 +02:00
Shouyu c9e9f083c2 ggml-hexagon: optimize activation function (llama/18393)
* refactor: refactor silu

* refactor: optimize swiglu

* refactor: remove unnecessary if in swiglu

* refactor: refactor swiglu_oai

* chore: fix formatting issue
2026-01-14 09:11:59 +02:00
Jeff Bolz 9d83865607 vulkan: Optimize GGML_OP_CUMSUM (llama/18417)
* vulkan: Optimize GGML_OP_CUMSUM

There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.

In the whole-row shader, handle multiple elements per invocation.

* use 2 ELEM_PER_THREAD for AMD/Intel

* address feedback
2026-01-14 09:11:59 +02:00
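The multipass cumsum described above can be sketched in NumPy (an illustration of the algorithm, not the Vulkan shader; names and the block layout are assumptions): pass one computes a prefix sum within each block plus a per-block total, pass two adds the exclusive prefix sum of the block totals back onto each block.

```python
import numpy as np

def cumsum_multipass(row, block_size):
    n = len(row)
    out = np.empty(n, dtype=row.dtype)
    totals = []
    # Pass 1: prefix sums within each block, and the per-block totals
    for start in range(0, n, block_size):
        block = row[start:start + block_size]
        out[start:start + block_size] = np.cumsum(block)
        totals.append(block.sum())
    # Pass 2: add the exclusive prefix sum of the block totals to each block
    offsets = np.concatenate(([0], np.cumsum(totals)[:-1]))
    for i, start in enumerate(range(0, n, block_size)):
        out[start:start + block_size] += offsets[i]
    return out
```

Splitting a row this way is what lets a small number of large rows use many workgroups instead of one per row.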
Jeff Bolz b7ff521e71 vulkan: Implement mmvq for iq1_s/iq1_m (llama/18450) 2026-01-14 09:11:59 +02:00
Georgi Gerganov b99c911c49 metal : adjust extra size for FA buffer to avoid reallocations (llama/18545) 2026-01-14 09:11:59 +02:00
Chris Rohlf f328b13d5c rpc : use unordered_map::reserve and emplace (llama/18513) 2026-01-14 09:11:59 +02:00
MeeMin fbde389665 cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (llama/18433)
* ggml-cuda: fixed assertion in ggml_cuda_cpy (llama/18140)

* ggml-cuda: changes in data types to int64_t

* ggml-cuda: added asserts for CUDA block numbers

* ggml-cuda: changed the condition for y and z dimension
2026-01-14 09:11:59 +02:00
Aman Gupta f22c1ccbe4 ggml-cuda: remove unnecessary prints on ggml_cuda_init (llama/18502) 2026-01-14 09:11:59 +02:00
Jeff Bolz b1f65a4a7e vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (llama/18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
2026-01-14 09:11:59 +02:00
Peter A. a96310871a examples : fix executable example targets (#3600)
* cmake:
    - added `whisper-` prefix to unprefixed targets: `quantize`, `lsp`,
      `vad-speech-segments`
    - added `install(TARGETS ${TARGET} RUNTIME)` where it was missing

Signed-off-by: Peter A. <ink.splatters@pm.me>

* .github/workflows/build.yml: quantize -> whisper-quantize

Signed-off-by: Peter A. <ink.splatters@pm.me>

---------

Signed-off-by: Peter A. <ink.splatters@pm.me>
2026-01-13 08:08:18 +01:00
KITAITI Makoto 679bdb53db ruby : fix segmentation fault (#3591)
* Mark long live variable

* Fix test for Whisper::Token#deconstruct_keys(nil)

* Don't use long live variable

* Fix indentation
2026-01-05 17:41:22 +09:00
Georgi Gerganov e9898ddfb9 sync : ggml 2025-12-31 18:27:20 +02:00
Georgi Gerganov ce03f8e759 ggml : bump version to 0.9.5 (ggml/1410) 2025-12-31 18:27:20 +02:00
Georgi Gerganov 7359ac94d5 talk-llama : sync llama.cpp 2025-12-31 17:52:09 +02:00
Georgi Gerganov 54fa8216ea sync : ggml 2025-12-31 17:52:09 +02:00
gatbontonpc 8189f2cb65 metal : add count_equal op (llama/18314)
* add count equal for metal

* remove trailing whitespace

* updated doc ops table

* changed shmem to i32

* added multi tg and templating

* removed BLAS support from Metal docs

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add memset to set dst to 0

* metal : cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-31 17:52:09 +02:00
Johannes Gäßler 2d250f8049 CUDA: fix KQ max calculation (llama/18487) 2025-12-31 17:52:09 +02:00
Georgi Gerganov 5deaf8f2a3 metal : remove BF16 x F16 kernels (llama/18456) 2025-12-31 17:52:09 +02:00
Aman Gupta 467933199a sycl: add newline at the end of CMakeLists.txt (llama/18503) 2025-12-31 17:52:09 +02:00
Rahul Sathe a3635494da Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (llama/18345)
* cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x

* [AI] sycl: auto-detect and skip incompatible IntelSYCL package

Automatically detect compiler versions with incompatible IntelSYCL
CMake configuration files and fall back to manual SYCL flags instead
of requiring users to set options manually.

Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
has SYCL_FEATURE_TEST_EXTRACT invocation errors.

* refactor: improve SYCL provider handling and error messages in CMake configuration

* refactor: enhance SYCL provider validation and error handling in CMake configuration

* ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes
2025-12-31 17:52:09 +02:00
Charles Xu c9955367d4 kleidiai: add and integrate SVE 256-bit vector-length kernel (llama/18458)
* kleidiai: add and integrate SVE 256-bit vector-length kernel

* updated for review comments
2025-12-31 17:52:09 +02:00
Aman Gupta 6d4aa96bfa CUDA: add log line when mxfp4 acceleration is used (llama/18483)
* CUDA: add log line when mxfp4 acceleration is used

* add in backend_get_features
2025-12-31 17:52:09 +02:00
Johannes Gäßler 5765c5b04e CUDA: fix replacement of bad archs in CMake (llama/18457) 2025-12-31 17:52:09 +02:00
Johannes Gäßler d6cb2407b7 CUDA: Blackwell features for non-native builds (llama/18436) 2025-12-31 17:52:09 +02:00
Aman Gupta e49e88b2d8 cuda: fix race condition in cumsum (llama/18448)
* ggml-cuda: fix race condition in cumsum

* remove unnecessary sync_threads
2025-12-31 17:52:09 +02:00
uvos 20f5729921 HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (llama/18202) 2025-12-31 17:52:09 +02:00
Aman Gupta b8d209f55c Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (llama/18426) 2025-12-31 17:52:09 +02:00
o7si 54fe9a645d rpc: fix segfault on invalid endpoint format (llama/18387)
* rpc: fix segfault on invalid endpoint format

* rpc: add error log for failed endpoint connection
2025-12-31 17:52:09 +02:00
Boian Berberov b3788ef729 cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (llama/18186)
* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`

Resolves: #17966
2025-12-31 17:52:09 +02:00
QDelta 31fc2c37c8 ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (llama/18413) 2025-12-31 17:52:09 +02:00
lhez a800a3acd1 opencl: allow resizing transpose buffers (llama/18384)
* opencl: allow resizing transpose buffers instead of using fixed sizes

* opencl: remove commented code
2025-12-31 17:52:09 +02:00
Aman Gupta 29f8155445 ggml-cuda: Use same regex for GGML_NATIVE=OFF (llama/18407) 2025-12-31 17:52:09 +02:00
Jeff Bolz 015b618d96 vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (llama/18352)
Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
2025-12-31 17:52:09 +02:00
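The preprocess described above amounts to a histogram over expert ids. A NumPy sketch (hypothetical shapes and names, not the Vulkan implementation) of counting how many rows route to each expert, so that workgroups for unused experts can be discarded:

```python
import numpy as np

def expert_use_counts(expert_ids, n_experts):
    # Histogram of how many rows route to each expert. A workgroup
    # assigned to an expert with count 0 can exit immediately instead
    # of scanning for work.
    return np.bincount(np.asarray(expert_ids).ravel(), minlength=n_experts)
```

With typical MoE routing (top-k experts per token), most experts receive few or no rows per batch, which is why the quick discard pays off.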
Jeff Bolz e37c8ed94e vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (llama/18349)
* vulkan: Use BK=32 for coopmat2 mul_mat_id

* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.

Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
2025-12-31 17:52:09 +02:00
Jeff Bolz 331c6ccd31 vulkan: Use BK=32 for coopmat2 mul_mat_id (llama/18332) 2025-12-31 17:52:09 +02:00
Eve 35cb4abb67 vulkan: small dequantization improvements (llama/18380)
* iq4_xs

* quants
2025-12-31 17:52:09 +02:00
Jeff Bolz 181e36f194 vulkan: Support UPSCALE w/antialias (llama/18327) 2025-12-31 17:52:09 +02:00
Jeff Bolz 67473fef57 vulkan: handle rope with large number of rows (llama/18306) 2025-12-31 17:52:09 +02:00
0Marble 33f75a88ac CANN: implement the SSM_CONV operator (llama/17737)
* CANN: implement SSM_CONV operator

Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

* CANN: remove custom error limit for SSM_CONV

* CANN: merge SSM_CONV tensor shape/strides into one line

---------

Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
2025-12-31 17:52:09 +02:00
Aman Gupta 51778354ce ggml-cuda: fix regex for arch list (llama/18371)
* ggml-cuda: fix regex for arch list

* make regex exact
2025-12-31 17:52:09 +02:00
Aman Gupta 8e02f0919d cuda: optimize cumsum cub path (llama/18362)
* cuda: optimize cumsum cub path

* remove heavy perf test
2025-12-31 17:52:09 +02:00
Aman Gupta ea07c5d3b7 ggml-cuda: fix blackwell native builds (llama/18361)
* ggml-cuda: fix blackwell native builds

Replace 12x in native architectures with 12xa

* replace for GGML_NATIVE=OFF too

* only replace for native

* remove 120f-virtual for default compilation

---------

Co-authored-by: Aman Gupta <aman>
2025-12-31 17:52:09 +02:00