Commit Graph

4210 Commits

Sigbjørn Skjæret 3bb52acb46
metal : remove contiguous assertion for src0 in IM2COL (llama/15577)
* remove contiguous assertion for src0 in IM2COL

* add contiguous check in supports_op
2025-09-20 13:42:42 +03:00
Yoshi_likes_e4 9828caafb5
Add a warning for special devices (llama/15563)
* Add warning

* Print the devices names

* Add newlines

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Fix vector names

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:42:42 +03:00
Jeff Bolz 79e2bd5ea8
vulkan: Remove splitting for mul_mat_id (llama/15568)
row_ids only needs to hold the BN rows for the current tile.
2025-09-20 13:42:42 +03:00
Qeeweew 2468074e91
CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (llama/15451)
* CUDA: optimize get_int_from_table_16

* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs

* revise documentation

---------

Co-authored-by: xix <xiapc@outlook.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:42:41 +03:00
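The `__byte_perm` intrinsic used above selects arbitrary bytes from a pair of 32-bit words, which is what makes it useful for small table lookups. As a hedged illustration (a software model, not the CUDA implementation from the commit), the index mode of the instruction can be sketched like this; the MSB-replication mode of the real hardware instruction is deliberately not modeled:

```python
def byte_perm(x: int, y: int, s: int) -> int:
    """Software model of CUDA __byte_perm(x, y, s), index mode only.

    The 8 source bytes are bytes 0-3 of x followed by bytes 0-3 of y.
    Each of the four 4-bit selector nibbles in s picks one source byte
    (index 0-7) for the corresponding result byte.
    """
    src = [(x >> (8 * i)) & 0xFF for i in range(4)] + \
          [(y >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        sel = (s >> (4 * i)) & 0x7   # MSB of each nibble (replicate mode) ignored
        out |= src[sel] << (8 * i)
    return out
```

For example, the selector `0x3210` reproduces `x` unchanged, while `0x7654` reproduces `y`; mixed selectors gather one table byte per result byte in a single instruction.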
lhez 582ef379ab
opencl: fix support ops condition for `rms_norm` (llama/15560) 2025-09-20 13:42:41 +03:00
Ruben Ortlam 335d2a5405
vulkan: fix min subgroup 16 condition for mmid subgroup optimization (llama/15565) 2025-09-20 13:42:41 +03:00
Ihar Hrachyshka 8851ef5463
metal: fix regression when no metal devices are present (llama/15531) 2025-09-20 13:42:41 +03:00
Johannes Gäßler 1e856b2919
CUDA: MoE helper in device code, better tile sizes (llama/15525)
* CUDA: MoE helper in device code, better tile sizes

* reduce superfluous CUDA blocks
2025-09-20 13:42:41 +03:00
Georgi Gerganov 54be54f4ce
metal : add FA kernels for HS=40 (llama/15559)
ggml-ci
2025-09-20 13:42:41 +03:00
Chenguang Li 86331f74e0
CANN: ROPE cache sin/cos repeat (llama/15501)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:42:41 +03:00
Ruben Ortlam ee11ed42a9
vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (llama/15524)
* vulkan: use subgroup function for mul_mat_id shader even without coopmat

* vulkan: fix compile warnings

* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id

* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
2025-09-20 13:42:41 +03:00
Jeff Bolz 85d4d2c875
vulkan: Support FA with any multiple of 8 head sizes (llama/15537)
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.

Store the FA pipelines in a map, indexed by the pipeline state.
2025-09-20 13:42:40 +03:00
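The padding requirement described above (HSK dimensions rounded up to a multiple of 16 for the coopmat shaders) boils down to a round-up-to-multiple helper. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def pad_to_multiple(n: int, m: int) -> int:
    """Round n up to the next multiple of m (m > 0)."""
    return ((n + m - 1) // m) * m
```

So a head size of 40 would be padded to 48 for a 16-wide coopmat tile, while sizes already on the boundary are left unchanged.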
Ruben Ortlam 8c7872d6ed
vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (llama/15526) 2025-09-20 13:42:40 +03:00
Jeff Bolz 27817867cc
vulkan: workaround MoltenVK compile failure in multi_add (llama/15506)
* vulkan: workaround MoltenVK compile failure in multi_add

* Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp

Co-authored-by: 0cc4m <picard12@live.de>
2025-09-20 13:42:40 +03:00
Johannes Gäßler b0d15e1eb6
CUDA: fix half2 -> half conversion for HIP (llama/15529) 2025-09-20 13:42:40 +03:00
Jeff Bolz 2f6288c33c
vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (llama/15281)
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs

There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.

The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.

* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.

* complete rebase against fused adds - multi_add shader can also compute partial sums

* fix validation errors

* disable add_rms_fusion for Intel due to possible driver bug

* resolve against #15489, sync after clearing partial sums
2025-09-20 13:42:40 +03:00
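The deterministic variant of the add+rms_norm fusion described above can be sketched in scalar form: each workgroup writes its own partial sum of squares to an array, and the rms_norm pass reduces those partials instead of atomically accumulating. This is a simplified model with invented names, not the shader code:

```python
def partial_sums_of_squares(x, workgroup_size):
    """Each 'workgroup' emits one partial sum of squares over its slice,
    written to a distinct slot so the result is deterministic."""
    return [sum(v * v for v in x[i:i + workgroup_size])
            for i in range(0, len(x), workgroup_size)]

def rms_scale(partials, n, eps=1e-6):
    """The rms_norm pass just reduces the partials and derives the
    per-element scale, so it parallelizes trivially."""
    mean_sq = sum(partials) / n
    return 1.0 / (mean_sq + eps) ** 0.5
```

Because each partial lands in its own slot and is reduced in a fixed order, the result no longer depends on the scheduling of atomic adds.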
Jeff Bolz d8eb9f7d67
vulkan: Rewrite synchronization to allow some overlap between nodes (llama/15489)
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.

The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether they were in use and need to be synced. This should be
checked before they are written to, and set to true after they are done being
consumed.
2025-09-20 13:42:40 +03:00
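The synchronization scheme above (track unsynchronized producers, barrier only when a new node touches one of them) can be sketched with a small tracker. This is an illustrative model under assumed names, not the Vulkan backend code:

```python
class SyncTracker:
    """Keep the set of outputs written since the last barrier; insert a
    barrier only when a new node reads from, or overwrites, one of them."""
    def __init__(self):
        self.pending = set()   # tensors written but not yet synced
        self.barriers = 0

    def submit(self, node, reads, writes):
        if self.pending & (set(reads) | set(writes)):
            self.barriers += 1          # sync before this node runs
            self.pending.clear()
        self.pending.update(writes)
```

Independent nodes then overlap freely: two writers to distinct tensors need no barrier, and the first consumer of either triggers exactly one.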
Acly 5094171c37
vulkan : support ggml_mean (llama/15393)
* vulkan : support ggml_mean

* vulkan : support sum, sum_rows and mean with non-contiguous tensors

* vulkan : fix subbuffer size not accounting for misalign offset

* tests : add backend-op tests for non-contiguous sum_rows

* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support

* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
2025-09-20 13:42:40 +03:00
Jeff Bolz 485c5c3b3b
vulkan: optimize mul_mat_id loading row ids into shared memory (llama/15427)
- Spread the work across the whole workgroup. Using more threads seems to
far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
2025-09-20 13:42:40 +03:00
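The power-of-two specialization mentioned above rests on a standard identity: when the divisor is a power of two, a divide/modulo pair can be replaced by a shift and a mask. A minimal sketch (hypothetical function name; the shader's actual indexing is more involved):

```python
def div_mod(idx: int, ne: int):
    """Return (idx // ne, idx % ne), using shift/mask when ne is a
    power of two -- the cheap path a shader can be specialized for."""
    if ne & (ne - 1) == 0:               # ne is a power of two
        shift = ne.bit_length() - 1
        return idx >> shift, idx & (ne - 1)
    return idx // ne, idx % ne
```

On GPUs integer division is comparatively expensive, so compiling a shader variant that knows the divisor is a power of two avoids it entirely.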
Reese Levine bb5d7e2c31
ggml WebGPU: add support for quantization types (llama/15440)
* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Work on templating for different types in shaders

* Work on shader type generation

* Working q4_0 mul_mat and some templating for different types

* Add q4_0_f16 matmul and fix device init

* Add matmul support for basic quantization types

* Add q2_k and q3_k quantization

* Add rest of k-quants

* Get first i-quant working

* Closer to supporting all i-quants

* Support rest of i-quants

* Cleanup code

* Fix python formatting

* debug

* Bugfix for memset

* Add padding to end of buffers on creation

* Simplify bit-shifting

* Update usage of StringView
2025-09-20 13:42:39 +03:00
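To make the quantized-matmul work above concrete, here is a scalar sketch of dequantizing one q4_0 block, which I believe matches ggml's layout (32 values per block: one scale plus 16 bytes of packed 4-bit quants with an implicit offset of 8) — treat the exact layout as an assumption:

```python
def dequant_q4_0(d: float, qs: bytes):
    """Dequantize one assumed q4_0 block of 32 values: low nibbles are
    elements 0..15, high nibbles elements 16..31, offset by 8, scaled by d."""
    assert len(qs) == 16
    lo = [((b & 0x0F) - 8) * d for b in qs]
    hi = [((b >> 4) - 8) * d for b in qs]
    return lo + hi
```

A WebGPU (or any) matmul kernel for this type does the same unpacking per block before multiply-accumulate, which is why each quantization type needs its own shader template.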
rmatif d7b7498e76
ggml: add `conv3d` op (llama/15182)
* add conv3d

* bump GGML_OP_COUNT
2025-09-20 13:42:39 +03:00
Yavor Ivanov 18ca4e8f63
cuda : add Pad Reflect 1D support (llama/14659)
* Add Pad Reflect 1D CUDA support

* Update ggml/src/ggml-cuda/pad_reflect_1d.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:42:39 +03:00
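For readers unfamiliar with the operation, reflect padding mirrors the signal around its endpoints without repeating the edge element. A minimal reference sketch (not the CUDA kernel):

```python
def pad_reflect_1d(x, pad_left, pad_right):
    """Reflect-pad a 1D sequence without repeating the edge element,
    e.g. [1, 2, 3, 4] with pad 2 each side -> [3, 2, 1, 2, 3, 4, 3, 2].
    Requires pad sizes smaller than len(x)."""
    assert pad_left < len(x) and pad_right < len(x)
    left = [x[i] for i in range(pad_left, 0, -1)]
    right = [x[-2 - i] for i in range(pad_right)]
    return left + list(x) + right
```

The CUDA version maps one thread per output element and computes the mirrored source index directly instead of building lists.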
Aaron Teo 380d3db216
ggml-cpu: Support Q5_0 and Q5_1 on s390x (llama/15486)
* ggml-cpu: initial q5_0 impl for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: updated q5_0 code for better performance

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: use optimised hsum for better performance

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: introduce q5_1 simd + refactor q5_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix incorrect return type vec_hsum

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: refactor q5_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: q5_1 update loop unroll to 4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update q5_0 unroll to 4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update build-s390x docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update unused variables q5_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: update the last update date

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-20 13:42:39 +03:00
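The q5_0 format that the s390x SIMD code above vectorizes adds a fifth bit per element on top of the q4_0-style nibbles. A scalar sketch of dequantizing one block, with the layout stated as an assumption (scale `d`, a 32-bit field `qh` of high bits, 16 bytes `qs` of nibbles, implicit offset 16):

```python
def dequant_q5_0(d: float, qh: int, qs: bytes):
    """Dequantize one assumed q5_0 block of 32 values: a 4-bit quant from
    qs plus a fifth bit taken from the qh bitfield, offset by 16."""
    assert len(qs) == 16
    out = []
    for i in range(16):        # elements 0..15: low nibbles
        q = (qs[i] & 0x0F) | (((qh >> i) & 1) << 4)
        out.append((q - 16) * d)
    for i in range(16):        # elements 16..31: high nibbles
        q = (qs[i] >> 4) | (((qh >> (i + 16)) & 1) << 4)
        out.append((q - 16) * d)
    return out
```

The per-lane bit insertion is what the commit's `table_b2b_0` activation and vector `hsum` work make cheap on the s390x vector unit.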
Chenguang Li be841c3f6e
CANN: Optimize RMS_NORM using cache (llama/15419)
* [CANN] Optimize RMS_NORM using cache

Signed-off-by: noemotiovon <757486878@qq.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

* fix review comment

Signed-off-by: noemotiovon <757486878@qq.com>

* codestyle adjustment

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:42:39 +03:00
Diego Devesa 554f96f385
sched : fix possible use of wrong ids tensor when offloading moe prompt processing (llama/15488) 2025-09-20 13:42:39 +03:00
Acly 9dd5039968
vulkan : support conv_2d_dw with f16 weights (llama/15392) 2025-09-20 13:42:39 +03:00
Dong Won Kim 7eebd498ff
vulkan: add exp operation (llama/15456)
Co-authored-by: aeseulgi <kim2h7903@gmail.com>
2025-09-20 13:42:39 +03:00
Jeff Bolz 04d0f9a066
vulkan: Reuse conversion results in prealloc_y (llama/15410)
* vulkan: Reuse conversion results in prealloc_y

Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.

* don't use shared pointer for prealloc_y_last_pipeline_used
2025-09-20 13:42:38 +03:00
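The caching idea above (remember which pipeline/tensor pair last filled `prealloc_y` and skip redundant conversions) can be sketched as a one-entry memo. Names here are invented for illustration:

```python
class ConversionCache:
    """One-entry cache: rerun the conversion only when the requesting
    (pipeline, tensor) pair differs from the one that last filled the
    staging buffer."""
    def __init__(self):
        self.last = None
        self.conversions = 0

    def convert(self, pipeline, tensor):
        key = (pipeline, id(tensor))
        if key != self.last:
            self.conversions += 1     # would dispatch the conversion shader
            self.last = key
```

A single cached entry suffices because consecutive matmuls against the same weights are the common case the commit targets.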
Xuan-Son Nguyen c5874bcf42
ggml : fix condition of im2col on Metal backend (llama/15460) 2025-09-20 13:42:38 +03:00
R0CKSTAR 7c077845fd
musa: add GGML_UNUSED_VARS (llama/15446)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-20 13:42:38 +03:00
Diego Devesa 622dec5bf6
sched : copy only the used experts when offloading prompt processing (llama/15346) 2025-09-20 13:42:38 +03:00
Johannes Gäßler 8f0579a33d
CUDA: refactor FA support/selection code (llama/15454) 2025-09-20 13:42:38 +03:00
Johannes Gäßler 316ed78d68
CUDA: replace GGML_CUDA_F16 with CUDA arch checks (llama/15433) 2025-09-20 13:42:38 +03:00
Jeff Bolz 5907ab3e4a
vulkan: shorten pipeline name strings (llama/15431)
These detailed strings were causing increased build time on gcc.
2025-09-20 13:42:38 +03:00
R0CKSTAR 0eb2d653bd
musa: fix build warnings (llama/15258)
* musa: fix build warnings

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-20 13:42:38 +03:00
lhez db1d2380a0
opencl: mark `argsort` unsupported if cols exceed workgroup limit (llama/15375) 2025-09-20 13:42:37 +03:00
SHUAI YANG 2572322bac
CANN: optimize rope operator (llama/15335)
* optimize rope ops

* amendment

* delete trailing whitespace

* change the variable name
2025-09-20 13:42:37 +03:00
R0CKSTAR 02b49af98d
musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (llama/15413)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-20 13:42:37 +03:00
Marvin Gießing 2ce5860a62
ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (llama/15385)
* Added VSX intrinsics for Power9+ systems

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Manual unrolling for minor perf improvement

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Update ggml/src/ggml-cpu/arch/powerpc/quants.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: mgiessing <marvin.giessing@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-20 13:42:37 +03:00
Georgi Gerganov 80447f7412
cuda : remove obsolete sources (ggml/1332)
ggml-ci
2025-09-20 13:42:37 +03:00
Carlos Zoido 44fa2f647c
ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (#3426)
While working on the [whisper-cpp](https://conan.io/center/recipes/whisper-cpp) Conan package for ConanCenter, I noticed that enabling the `with_blas` option causes the build to fail due to an issue in the _MKL_ detection logic.

The problem is that the CMake condition currently expands `BLAS_INCLUDE_DIRS` without quotes:

```cmake
if (${BLAS_INCLUDE_DIRS} MATCHES "mkl" AND (${GGML_BLAS_VENDOR} MATCHES "Generic" OR ${GGML_BLAS_VENDOR} MATCHES "Intel"))
```
When `BLAS_INCLUDE_DIRS` is a list (as Conan provides it), the `if()` command receives multiple arguments and produces a CMake error:

```bash
...
-- BLAS found, Includes: /root/.conan2/p/b/openb034c5a6ca927b/p/include;/root/.conan2/p/b/openb034c5a6ca927b/p/include/openblas
CMake Error at ggml/src/ggml-blas/CMakeLists.txt:77 (if):
  if given arguments:

    "/root/.conan2/p/b/openb034c5a6ca927b/p/include" "/root/.conan2/p/b/openb034c5a6ca927b/p/include/openblas" "MATCHES" "mkl" "AND" "(" "OpenBLAS" "MATCHES" "Generic" "OR" "OpenBLAS" "MATCHES" "Intel" ")"

  Unknown arguments specified
...
```
This PR fixes the issue by quoting the variable:

```cmake
if ("${BLAS_INCLUDE_DIRS}" MATCHES "mkl" AND (${GGML_BLAS_VENDOR} MATCHES "Generic" OR ${GGML_BLAS_VENDOR} MATCHES "Intel"))
```

With this change, the whole list is treated as a single string and the regex still works correctly.
2025-09-19 05:33:53 +02:00
Siva Mahadevan edea8a9c3c
whisper : prefer curl over wget in download scripts (#3409)
On busybox-based systems like Alpine Linux, wget does not have
certain CLI flags such as '--no-config'. Thus, search for the
existence of 'curl' first in the PATH before wget. wget2 is
still the preferred download tool.
2025-09-08 06:32:19 +02:00
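The selection order described above (wget2 preferred, then curl, then plain wget as a last resort on busybox systems) is a simple first-match scan over the PATH. A sketch of the logic, parameterized over the available tools so it is testable; in a script the availability set would come from something like `shutil.which`:

```python
def pick_downloader(available):
    """Return the first available tool in preference order:
    wget2, then curl, then busybox-limited wget."""
    for tool in ("wget2", "curl", "wget"):
        if tool in available:
            return tool
    return None
```

This mirrors the shell-script pattern of probing `command -v` for each candidate in turn.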
Daniel Bevenius bb0e1fc60f
ci : remove brew installation of cmake for macos-latest (#3408)
This commit removes the brew installation of cmake for macos-latest,
as cmake now seems to be pre-installed on the runner.

The motivation for this is that this job is failing with the following
error:
```console
Error: cmake was installed from the local/pinned tap
but you are trying to install it from the homebrew/core tap.
Formulae with the same name from different taps cannot be installed at the same time.
```
2025-09-05 15:20:32 +02:00
Daniel Bevenius 9bfc535130
tests : use CMake definitions for model/sample paths (#3406)
This commit modifies the test-vad and test-vad-full tests to use CMake
definitions for the model and sample paths.

The motivation for this is that currently the tests use relative paths
which might not always be correct depending on the working directory.
With the changes in this commit the tests can be run using ctest:
```console
$ ctest -R ^test-vad$ --test-dir build
```
Or directly (which is not currently possible without this fix):
```
./build/bin/test-vad
```

Resolves: https://github.com/ggml-org/whisper.cpp/issues/3404
2025-09-04 15:08:30 +02:00
Treboko 7745fcf328
Handle negative value in padding (#3389)
This might happen depending on how `$stderr.winsize` is defined: if the expression `$stderr.winsize[1] - line.size` on line 114 becomes negative, the padding calculation raises a "negative argument" exception.
2025-08-25 01:34:23 +09:00
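The fix amounts to clamping the padding width at zero before constructing the pad string, since repeating a string a negative number of times is what raises the exception. A language-neutral sketch of the corrected calculation:

```python
def padding_for(width: int, line: str) -> str:
    """Right-padding clamped to zero, so a terminal narrower than the
    line yields an empty pad instead of a negative repeat count."""
    return " " * max(width - len(line), 0)
```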
Thea Mukhi c09b0e0c4c
models : update `./models/download-ggml-model.cmd` to allow for tdrz download (#3381)
* added patch to cmd to allow for tdrz download

* remove @signs

* Update models/download-ggml-model.cmd

Add missing closing double quote.

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2025-08-24 11:52:33 +02:00
Georgi Gerganov fc45bb8625 talk-llama : sync llama.cpp
ggml-ci
2025-08-18 20:30:45 +03:00
Georgi Gerganov 33c3c2fe2e sync : ggml 2025-08-18 20:30:45 +03:00
Reese Levine 5ed45b2518 ggml: Add initial WebGPU backend (llama/14521)
ggml-ci
2025-08-18 20:30:45 +03:00
Aaron Teo 03d6607691 ggml : initial zDNN backend (llama/14975) 2025-08-18 20:30:45 +03:00