Commit Graph

3689 Commits

Author SHA1 Message Date
Judd 80ef57f0f0 ggml : update `ggml_rope_multi` (llama/12665)
* update `rope_multi`:

1. add `ggml_rope_multi_inplace`;
2. use `GGML_MROPE_SECTIONS` instead of 4.

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-18 20:30:45 +03:00
Georgi Gerganov 0e8b244366 ggml : repack block_iq4_nlx8 (llama/14904)
ggml-ci
2025-08-18 20:30:45 +03:00
Oliver Simons b8b1b50c47 CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (llama/15132)
* Factor out `reduce_rows_f32` from common.cuh

This increases iteration cycle speed by not having to recompile
every kernel all the time

* Hide memory-latency by loop unrolling in reduce_rows_f32

* Further optimizations to `reduce_rows_f32`

1. Increase threadblock size to better hide latency of memory requests.
   As a consequence of bigger threadblocks, do 2-step summation, using
   shared memory to communicate results between invocations
2. Use sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflect bigger threadblock
4. Improve default block_dims, increase support for more block_dims
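The two-step summation and the `sum_temp` idea from the list above can be modelled on the CPU. This is only a sketch under stated assumptions, not the CUDA kernel from the commit: `reduce_row_two_step`, `WARP_SIZE`, and `NUM_UNROLL` are hypothetical names and values, and the "threads" are simulated with plain loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of a two-step row reduction. All names and sizes are
// illustrative assumptions; they are not taken from the ggml CUDA source.
constexpr int WARP_SIZE  = 32;
constexpr int NUM_UNROLL = 8;

float reduce_row_two_step(const std::vector<float> & row, int block_size) {
    assert(block_size > 0 && block_size % WARP_SIZE == 0);
    const std::size_t ncols = row.size();
    const std::size_t step  = static_cast<std::size_t>(block_size);

    // Each simulated thread strides over the row and keeps a partial sum.
    std::vector<float> partial(block_size, 0.0f);
    for (int tid = 0; tid < block_size; ++tid) {
        // Independent accumulators: consecutive adds do not all wait on
        // one running sum (the role of the sum_temp array).
        float sum_temp[NUM_UNROLL] = {0.0f};
        std::size_t i = static_cast<std::size_t>(tid);
        while (i + (NUM_UNROLL - 1) * step < ncols) {
            for (int u = 0; u < NUM_UNROLL; ++u) {
                sum_temp[u] += row[i + u * step];
            }
            i += NUM_UNROLL * step;
        }
        // Tail elements that do not fill a whole unrolled iteration.
        for (; i < ncols; i += step) {
            sum_temp[0] += row[i];
        }
        for (int u = 0; u < NUM_UNROLL; ++u) {
            partial[tid] += sum_temp[u];
        }
    }

    // Step 1: reduce within each warp and store one value per warp in a
    // buffer standing in for shared memory.
    const int num_warps = block_size / WARP_SIZE;
    std::vector<float> shared(num_warps, 0.0f);
    for (int w = 0; w < num_warps; ++w) {
        for (int lane = 0; lane < WARP_SIZE; ++lane) {
            shared[w] += partial[w * WARP_SIZE + lane];
        }
    }

    // Step 2: reduce the per-warp values to the final row sum.
    float total = 0.0f;
    for (float s : shared) {
        total += s;
    }
    return total;
}
```

Calling `reduce_row_two_step(row, 512)` models a 512-thread block summing one row; on a GPU the per-warp step would typically use warp shuffles rather than a loop.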

* Add perf tests for `reduce_rows_f32` kernel

* Add heuristic to toggle 128/512 threads based on sm count

Break even point was the minimum of the following multiples.

| GPU Model                     | Nrow SM Count Multiple |
| -----------                   | -----------            |
| RTX 4000 SFF ADA              | 2.0x                   |
| RTX 6000 ADA                  | 2.5x                   |
| RTX PRO 6000 Blackwell Max-Q  | 3.04x                  |
| RTX PRO 4500 Blackwell        | 3.15x                  |
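A heuristic of this kind might look like the sketch below. It takes the minimum multiple from the table (2.0x) as the threshold. The function name `pick_block_size` is hypothetical, and which side of the threshold gets 512 versus 128 threads is an assumption that the commit message does not state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: choose a thread-block size from the row count and the
// number of streaming multiprocessors (SMs). The 2x threshold is the minimum
// multiple from the table; the direction of the comparison is an assumption.
int pick_block_size(std::int64_t nrows, int sm_count) {
    assert(nrows >= 0 && sm_count > 0);
    // Few rows relative to the SM count: assume larger blocks help hide
    // memory latency. Otherwise fall back to smaller blocks.
    return nrows < 2 * static_cast<std::int64_t>(sm_count) ? 512 : 128;
}
```

With 48 SMs, for example, this sketch would pick 512 threads for up to 95 rows and 128 threads from 96 rows on.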

* Ensure perf gains also for small ncols and large nrows

Alternatively, the number of unrollings could have been made a template
parameter, but that would require compiling the kernel multiple times,
increasing binary size unnecessarily

* Modify perf and unit-tests

* Apply auto-formatting by clang

* Fix CI build failure

See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.

* Remove sm_count property from `ggml_backend_cuda_context`

Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect

* Add CUB-based implementation for GGML_OP_MEAN

Currently this branch is only executed for nrows==1

* Add heuristics to execute CUB branch only when it brings perf

Heuristics were determined on the following HW:

* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell

* Add unit-test for CUB-based mean

Tests should run with CUDA Graphs enabled by default on NVIDIA GPUs

* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`

Suggested by @JohannesGaessler

* Unindent Preprocessor directives

See
https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506
2025-08-18 20:30:45 +03:00
Tak-RS 4e234ac013 ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others) (llama/15188)
* ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others). Fixes #15055

* ggml-rpc: rename RPC_IO_CHUNK->MAX_CHUNK_SIZE, use std::min() for cap, switch to GGML_LOG_ERROR, handle 0-length send/recv

* rpc: drop n==0 special case in send_data(); retry in loop per review

* rpc: remove trailing whitespace in send_data()
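The chunking approach described in these bullets can be sketched as follows. This is not the code from `ggml-rpc`: the `write_fn` callback stands in for `send()` so the example runs without a socket, `send_all_chunked` is a hypothetical name, and the value given to `MAX_CHUNK_SIZE` is a placeholder because the commit message does not state it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Placeholder cap per call; the real value is not given in the commit message.
constexpr std::size_t MAX_CHUNK_SIZE = std::size_t(1) << 20;

// Stand-in for send(): returns the number of bytes written, or a negative
// value on error.
using write_fn = std::function<long long(const std::uint8_t *, std::size_t)>;

// Hypothetical helper: write `size` bytes, never asking for more than
// `max_chunk` bytes in a single call.
bool send_all_chunked(const write_fn & write_some, const void * data,
                      std::size_t size, std::size_t max_chunk = MAX_CHUNK_SIZE) {
    assert(max_chunk > 0);
    const std::uint8_t * p = static_cast<const std::uint8_t *>(data);
    std::size_t sent = 0;
    while (sent < size) {
        // Cap each request with std::min so very large buffers are split.
        const std::size_t want = std::min(size - sent, max_chunk);
        const long long n = write_some(p + sent, want);
        if (n < 0) {
            return false; // report the error to the caller
        }
        // A zero-length write is not special-cased; the loop simply retries.
        sent += static_cast<std::size_t>(n);
    }
    return true;
}
```

A receive loop would follow the same shape, with the callback reading into the buffer instead of writing from it.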

---------

Co-authored-by: Shinnosuke Takagi <nosuke@nosukenoMacBook-Pro.local>
2025-08-18 20:30:45 +03:00
uvos 8df931b608 HIP: disable sync warp shuffle operators from clr amd_warp_sync_functions.h (llama/15273) 2025-08-18 20:30:45 +03:00
Romain Biessy 1334f434f3 sycl: Fix and disable more configurations of mul_mat (llama/15151)
* sycl: Fix and disable more configurations of mul_mat

* Disable more configurations
2025-08-18 20:30:45 +03:00
rmatif 139110701e opencl: allow mixed f16/f32 `add` (llama/15140) 2025-08-18 20:30:45 +03:00
Aman Gupta 082c7ba67c CUDA cmake: add `-lineinfo` for easier debug (llama/15260) 2025-08-18 20:30:45 +03:00
Chenguang Li 0effaad964 CANN: GGML_OP_CPY optimization (llama/15070)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-18 20:30:45 +03:00
R0CKSTAR 8e2ddfec31 musa: fix failures in test-backend-ops for mul_mat_id op (llama/15236)
* musa: fix failures in test-backend-ops for mul_mat_id op

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-18 20:30:45 +03:00
hipudding 3e2c262c08 CANN: Add broadcast for softmax and FA (llama/15208)
* refactor softmax

* fix fa

* fix mask shape

* format

* add comments

* Remove whitespace
2025-08-18 20:30:45 +03:00
Charles Xu 30cc11dc94 kleidiai: fix unsigned overflow bug (llama/15150)
* kleidiai: fix unsigned overflow bug

* address review comments
2025-08-18 20:30:45 +03:00
David Zhao 457eadfe6f cuda: refactored ssm_scan and use CUB (llama/13291)
* cuda: refactored ssm_scan to use CUB

* fixed compilation error when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning
2025-08-18 20:30:45 +03:00
Aman Gupta 93c7a08019 CUDA: add attention sinks for tile and wmma (llama/15178)
* CUDA: add attention sinks for tile and wmma

* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
2025-08-18 20:30:45 +03:00
compilade 62566a5436 gguf-py : add Numpy MXFP4 de/quantization support (llama/15111)
* gguf-py : add MXFP4 de/quantization support

* ggml-quants : handle zero amax for MXFP4
2025-08-18 20:30:45 +03:00
AN Long 573bf9d128 ggml : fix field name when new ggml_backend (llama/14944) 2025-08-18 20:30:45 +03:00
Johannes Gäßler 2baea5e4b3 CUDA: attention sinks for mma FlashAttention (llama/15157) 2025-08-18 20:30:45 +03:00
lhez 8a36cd924a opencl: support sink in `soft_max` (attn sinks) (llama/15152) 2025-08-18 20:30:45 +03:00
Jeff Bolz 1984530710 vulkan: support fattn sinks (llama/15126) 2025-08-18 20:30:45 +03:00
Jeff Bolz 414e9074e0 vulkan: Add env var to disable host visible vidmem (llama/15109) 2025-08-18 20:30:45 +03:00
uvos 813ceb2a74 HIP: add cmake option to enable compiler output of kernel resource usage metrics (llama/15103) 2025-08-18 20:30:45 +03:00
Christian Kastner 6d7ffea292 ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (llama/15094)
Any available libraries are found and loaded dynamically at runtime.
2025-08-18 20:30:45 +03:00
Johannes Gäßler 5caf8a1ea2 CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (llama/15131)
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
2025-08-18 20:30:45 +03:00
rmatif b405fd88b3 fix profiling crash (llama/15072) 2025-08-18 20:30:45 +03:00
lhez d153cfb507 opencl: add `swiglu_oai` and `add_id` (llama/15121)
* opencl: add `swiglu-oai`

* opencl: add `add_id`

* opencl: add missing `add_id.cl`
2025-08-18 20:30:45 +03:00
Diego Devesa 6fb55d8f7c ggml : fix fallback to CPU for unsupported ops (llama/15118) 2025-08-18 20:30:45 +03:00
Chenguang Li e809e81e69 CANN: add support for ACL Graph (llama/15065)
* feat(cann): add optional support for ACL Graph execution

This commit adds support for executing ggml computational graphs using
Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be
enabled at compile time using the CMake option:

    -DUSE_CANN_GRAPH=ON

By default, ACL graph execution is **disabled**, and the fallback path
uses node-by-node execution.

Key additions:
- CMake option  to toggle graph mode
- Graph capture and execution logic using
- Tensor property matching to determine whether graph update is required
- Safe fallback and logging if the environment variable LLAMA_SET_ROWS
  is unset or invalid

This prepares the backend for performance improvements in repetitive graph
execution scenarios on Ascend devices.
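The "tensor property matching" item above could work roughly like the sketch below: the properties recorded when a graph was captured are compared with those of the graph about to run, and any mismatch means the captured graph needs an update. The struct fields and function names are hypothetical and are not taken from the CANN backend.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical snapshot of the tensor properties that matter for reuse.
struct tensor_props {
    const void *                data; // buffer address
    int                         op;   // operation id
    std::array<std::int64_t, 4> ne;   // elements per dimension
    std::array<std::size_t, 4>  nb;   // stride in bytes per dimension
};

bool same_props(const tensor_props & a, const tensor_props & b) {
    return a.data == b.data && a.op == b.op && a.ne == b.ne && a.nb == b.nb;
}

// Returns true when the captured graph no longer matches the current one.
bool graph_update_required(const std::vector<tensor_props> & cached,
                           const std::vector<tensor_props> & current) {
    if (cached.size() != current.size()) {
        return true;
    }
    for (std::size_t i = 0; i < cached.size(); ++i) {
        if (!same_props(cached[i], current[i])) {
            return true;
        }
    }
    return false;
}
```

When this returns false, the previously captured graph can be replayed as-is, which is where the gain in repetitive execution would come from.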

Signed-off-by: noemotiovon <757486878@qq.com>

* Fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

* rename USE_CANN_GRAPH to USE_ACL_GRAPH

Signed-off-by: noemotiovon <757486878@qq.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-18 20:30:45 +03:00
Georgi Gerganov d3aab3efde llama : add gpt-oss (llama/15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (llama/7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (llama/1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (llama/11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf
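For context, an E8M0 scale is an 8-bit unsigned exponent `e` that encodes the value 2^(e - 127), so it can be turned into a float by placing `e` directly in the float32 exponent field rather than calling `powf`. The sketch below illustrates that idea and is not necessarily how the commit implements it. The function names are hypothetical, and the NaN encoding `0xFF` is deliberately left unhandled.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: convert an E8M0 exponent byte to float32 with bit
// operations. Assumes IEEE-754 binary32 floats.
float e8m0_to_fp32(std::uint8_t e) {
    assert(e != 0xFF); // 0xFF is the NaN encoding; not handled in this sketch
    // A float32 with exponent field e (1..254) and zero mantissa is 2^(e-127).
    // For e == 0 the result 2^-127 is subnormal: exponent field 0, top
    // mantissa bit set.
    const std::uint32_t bits =
        e == 0 ? 0x00400000u : static_cast<std::uint32_t>(e) << 23;
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Reference using the math library, for comparison only.
float e8m0_to_fp32_ref(std::uint8_t e) {
    return std::ldexp(1.0f, static_cast<int>(e) - 127);
}
```

Both functions should agree for every `e` from 0 to 254; the bit-level version avoids a math-library call per block scale.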

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (llama/6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (llama/13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-18 20:30:45 +03:00
Romain Biessy 6558022873 sycl: fix mul_mat selection (llama/15092) 2025-08-18 20:30:45 +03:00
Christian Kastner 349b9a2097 cmake: Add GGML_BACKEND_DIR option (llama/15074)
* cmake: Add GGML_BACKEND_DIR option

This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.

* Fix phrasing
2025-08-18 20:30:45 +03:00
Jeff Bolz 00ff38376a vulkan: fix build when using glslang that does not support coopmat2 (llama/15062) 2025-08-18 20:30:45 +03:00
Jeff Bolz abc971e69a vulkan: Use coopmat2 for conv2d (llama/14982) 2025-08-18 20:30:45 +03:00
lhez 53d8c5179f opencl: fix adreno compiler detection logic (llama/15029) 2025-08-18 20:30:45 +03:00
Johannes Gäßler d6e7315717 CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (llama/15035) 2025-08-18 20:30:45 +03:00
leejet a3123e105b cuda: make im2col a little faster (llama/15025) 2025-08-18 20:30:45 +03:00
Georgi Gerganov d119ecf0c1 cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (llama/15038)
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1

ggml-ci

* cont : fix cont types

ggml-ci

* cont : adopt variable names and comment from the other branch
2025-08-18 20:30:45 +03:00
Jeff Bolz b374fd6172 vulkan: coopmat2 mul_mat optimizations (llama/14934)
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it
  interacts with split_k
- Allow larger/non-power of two split_k, and make the splits a multiple of 256
- Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used
2025-08-18 20:30:45 +03:00
Jeff Bolz 97341224b2 vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (llama/15015) 2025-08-18 20:30:45 +03:00
Jeff Bolz 46e9e5b9a7 vulkan: optimizations for direct convolution (llama/14933)
* vulkan: optimizations for direct convolution

- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill
  the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat.
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply fastdiv opt.
- Disable shuffles for NV.

* Three tiles sizes for CONV_2D, and a heuristic to choose

* reallow collectives for pre-Turing

* make SHMEM_PAD a spec constant

* fixes for intel perf - no shmem padding, placeholder shader core count

* shader variants with/without unrolling

* 0cc4m's fixes for AMD perf

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-18 20:30:45 +03:00
Johannes Gäßler 7e7557ac50 CUDA: fix MMQ nwarps for AMD with warp_size==32 (llama/15014) 2025-08-18 20:30:45 +03:00
lhez ba6a81c9c9 opencl: add f16 for `add`, `sub`, `mul`, `div` (llama/14984) 2025-08-18 20:30:45 +03:00
Srihari-mcw 1c6cb7df47 ggml : Q2k interleaving implementation - x86/x64 SIMD (llama/14373)
* Initial Q2_K Block Interleaving Implementation

* Addressed review comments and clean up of the code

* Post rebase fixes

* Initial CI/CD fixes

* Update declarations in arch-fallback.h

* Changes for GEMV Q2_K in arch-fallback.h

* Enable repacking only on AVX-512 machines

* Update comments in repack.cpp

* Address q2k comments

---------

Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>
2025-08-18 20:30:45 +03:00
diannao 78668cb8d1 docker : add cann build pipeline (llama/14591)
* docker: add cann build pipeline

* docker: add cann build pipeline

* docker: fix cann devops

* cann : fix multi card hccl

* Update ggml/src/ggml-cann/ggml-cann.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update ggml-cann.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-08-18 20:30:45 +03:00
Ruben Ortlam 41e161657e Vulkan: Fix minor debug mode issues (llama/14899)
* vulkan: fix debug mode issues

* vulkan: remove broken check_results GGML_OP_SET_ROWS support
2025-08-18 20:30:45 +03:00
hipudding 572152d6af CANN: Improve loading efficiency after converting weights to NZ format. (llama/14985)
* CANN: Improve loading efficiency after converting weights to NZ format.

* CANN: fix typo
2025-08-18 20:30:45 +03:00
lhez 4904bc3bda opencl: add `mul_mat_f32_f32_l4_lm` and `mul_mat_f16_f32_l4_lm` (llama/14809) 2025-08-18 20:30:45 +03:00
uvos 8ed27b407d HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (llama/14949) 2025-08-18 20:30:45 +03:00
Johannes Gäßler 113d88686b CUDA: skip masked KV slices for all FA kernels (llama/14924) 2025-08-18 20:30:45 +03:00
uvos 4e624e42fa HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (llama/14945) 2025-08-18 20:30:45 +03:00
uvos 7f203f41aa HIP: add GGML_HIP_MMQ_MFMA option to allow disabling the MFMA path. (llama/14930)
This is useful for testing for regressions on GCN with CDNA hardware.

With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is essentially GCN renamed, with MFMA and limited-use ACC registers added, this provides a good alternative for regression testing when GCN hardware is not available.
2025-08-18 20:30:45 +03:00