Yibo Cai
ea643c6ae3
arm64: optimize q4_k_q8_k kernel with i8mm (llama/13886)
...
This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.
Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
| PP | TG | B | S_PP t/s | S_TG t/s |
| | | | original | this pr | original | this pr |
|-------|--------|------|----------|----------|----------|----------|
| 128 | 128 | 1 | 110.12 | 147.83 | 24.36 | 24.28 |
| 128 | 128 | 2 | 121.16 | 172.42 | 46.36 | 47.93 |
| 128 | 128 | 4 | 120.15 | 169.75 | 74.68 | 84.00 |
| 128 | 128 | 8 | 130.97 | 196.81 | 91.04 | 114.74 |
| 128 | 128 | 16 | 131.01 | 196.88 | 101.43 | 135.79 |
| 128 | 128 | 32 | 130.85 | 196.51 | 106.97 | 147.29 |
---------------------------------------------------------------------
```
2025-06-01 15:14:44 +03:00
Christian Kastner
1d7b3c79f4
cmake: Factor out CPU architecture detection (llama/13883)
...
* cmake: Define function for querying architecture
The tests and results match exactly those of src/CMakeLists.txt
* Switch arch detection over to new function
2025-06-01 15:14:44 +03:00
Vineel Abhinav
ccfaac2bb0
ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (llama/13882)
...
* F32-Mamba-Seq_Scan-SVE
* Fix formatting
* ggml : missing space
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-01 15:14:44 +03:00
Vineel Abhinav
1230d37bca
ggml: aarch64: Implement SVE F32 kernels for vector functions (llama/13843)
...
* F32-Mamba-SVE
* F32-Mamba-SVE
* Resolve test errors-1
* Resolve test errors-2
* F32-vec-SVE
* F32-vec-SVE
* F32-vec-SVE
2025-06-01 15:14:44 +03:00
Johannes Gäßler
9a500394ad
CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)
2025-06-01 15:14:44 +03:00
leo-pony
0035b8527c
CANN: Add SOC TYPE printing in cmake configuration (llama/13837)
2025-06-01 15:14:44 +03:00
lhez
3623186312
opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (llama/13787)
...
* opencl: add `argsort`
* opencl: add `div`
* opencl: add `add_rows`
* opencl: add `sub`
* opencl: add `sigmoid`, both `f16` and `f32`
* opencl: add `group_norm`
2025-06-01 15:14:44 +03:00
lhez
67beac47f3
opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (llama/13790)
2025-06-01 15:14:44 +03:00
Jeff Bolz
47a19bae25
vulkan: use timestamp queries for GGML_VULKAN_PERF (llama/13817)
...
Also change it to be controlled by an env var rather than cmake flag
2025-06-01 15:14:44 +03:00
Akarshan Biswas
3d5c7ca4bc
SYCL: add gelu_erf kernel (llama/13749)
...
* SYCL: add gelu_erf kernel
* refactor code
Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
* Use scope_op_debug_print
---------
Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>
2025-06-01 15:14:44 +03:00
Xuan-Son Nguyen
4dfb2c2215
ggml : add ggml_repeat_4d (llama/13824)
2025-06-01 15:14:44 +03:00
Kai Pastor
ad433403ce
vulkan : Remove unexpected ; (ggml/1253)
2025-06-01 15:14:44 +03:00
Kai Pastor
4064dd6484
cmake : Fix broken CMake error messages (ggml/1252)
2025-06-01 15:14:44 +03:00
Radoslav Gerganov
fd75c4995b
ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247)
...
The implementation is already deleted with commit 9d0762e.
closes : #1235
2025-06-01 15:14:44 +03:00
Daniel Tang
4d18e52f55
ggml : Fix backtrace breaking Windows build ( #3203 )
2025-05-29 13:26:58 +03:00
Radoslav Gerganov
48dddbbac1
ggml : install dynamic backends (ggml/1240)
2025-05-29 09:56:26 +03:00
Daniel Tang
5ea2c37a4c
ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)
...
The goal is to have what users call "full logs" contain the backtrace.
This is registered upon ggml_init. Also fixes a minor fd leak on Linux.
2025-05-29 09:56:26 +03:00
Simon Booth
5720426d97
whisper : install shared libs when using GGML_BACKEND_DL ( #3195 )
2025-05-28 10:15:04 +02:00
xctan
15ae9dc2a4
ggml : riscv: add xtheadvector support (llama/13720)
...
* ggml : riscv: add xtheadvector support
* ggml : clean up some macro usage
2025-05-27 18:03:00 +03:00
Christian Kastner
2e7a1e3e43
ggml-cpu: x86 feature detection is specific to x86 (llama/13811)
2025-05-27 18:03:00 +03:00
Diego Devesa
b75babebb2
ggml : allow CUDA graphs when using pipeline parallelism (llama/13814)
2025-05-27 18:03:00 +03:00
Georgi Gerganov
cc7a0105ef
cuda : avoid cuGetErrorString (llama/13791)
...
ggml-ci
2025-05-27 18:03:00 +03:00
Akarshan Biswas
195fde8804
SYCL: Add non contiguous support in RMS_NORM and NORM kernels (llama/13611)
...
* SYCL: Add non contiguous input support to norm kernel
* refactor and add RMS_NORM non contiguous input support
ggml-ci
* restore subgroup reduction for multi-subgroup thread blocks in norm kernels
* Swap grid dims of nsamples and nrows
ggml-ci
* Revert "Swap grid dims of nsamples and nrows"
This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.
* restore not required changes
ggml-ci
* address review comments: change it to more like SYCL
* Use a common function to calculate offset
* remove wrap around logic for handling broadcasts
* remove static from calculate_offset fn and use ceil_div
2025-05-27 18:03:00 +03:00
Romain Biessy
25e27904ca
sycl: Add more debug prints (llama/13640)
2025-05-27 18:03:00 +03:00
Jeff Bolz
474f7be8b6
vulkan: mark IM2COL as supporting non-contig (llama/13783)
2025-05-27 18:03:00 +03:00
Bizhao Shi
e35fecc2a1
CANN: Add the basic supports of Flash Attention kernel (llama/13627)
...
* cann: add the basic FA support
* cann: update the readme
* cann: update the FlashAttention with PSEShift
* cann: update the input parameters in FA
* cann: update the alibi with max_bias
* cann: add the constrints of softcap
* cann: update the docs CANN.md
* cann: update the docs CANN.md
* cann: fix typo of CANN.md
* cann: add some comments and update the CANN.md
* cann: update the CANN.md
* cann: update the inner precise for fusedInferAttention
* cann: update the constraints of flash_attn_ext on ggml-cann.cpp
* cann: clean the whitespace
* cann: clean the whitespace
* cann: add a new endline
2025-05-27 18:03:00 +03:00
Akarshan Biswas
1cd7028428
SYCL: revert "sycl: simplify bin_bcast_kernel (ggml/13383)" (llama/13752)
...
Temporarily reverted due to failing fp16 DIV operation
This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5.
ggml-ci
2025-05-27 18:03:00 +03:00
Diego Devesa
99596d6031
ggml-cpu : set openmp wait time if not set (llama/13758)
2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen
2d6c6862f7
ggml : add ggml_gelu_erf() CUDA kernel (llama/13719)
...
* ggml : add ggml_gelu_erf() CUDA kernel
* missing semicolon
2025-05-27 18:03:00 +03:00
Johannes Gäßler
f1576b2659
CUDA: fix race condition in FA vector kernels (llama/13742)
2025-05-27 18:03:00 +03:00
Chenguang Li
994b4f86ab
CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705)
...
* [CANN]Support MUL_MAT_ID Q8 && Q4
Signed-off-by: noemotiovon <757486878@qq.com>
* codestyle adjustment
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen
3e7eaccf55
ggml : fix the order of ggml_unary_op (llama/13718)
2025-05-27 18:03:00 +03:00
Jeff Bolz
191f040414
vulkan: support CPY from any type to itself (llama/13695)
...
Reuse the f16/f32 copy shaders, and just scale the number of elements
according to the type size.
2025-05-27 18:03:00 +03:00
Jeff Bolz
2d49d4a9b5
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama/13696)
2025-05-27 18:03:00 +03:00
Judd
000d65befb
use LOG_WARN to replace `std::cerr` (llama/13657)
2025-05-27 18:03:00 +03:00
Nicolò Scipione
f0803e6646
sycl : Remove waits from function calls (llama/13702)
...
* removes the waits in async memcpy functions
2025-05-27 18:03:00 +03:00
Ewan Crawford
730a00be8a
SYCL: Avoid using with SYCL-Graph for unsupported nodes (llama/13587)
...
Currently on a CUDA backend to SYCL when running
`GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there
are two operations that throw an exception from the blocking
waits during queue recording.
* `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074
We've noticed that `ggml-cuda.cu` has the
[check_node_graph_compatibility_and_refresh_copy_ops](39e73ae0d6/ggml/src/ggml-cuda/ggml-cuda.cu (L2458-L2458) )
method for checking if a graph can be used, even if enabled. I've taken a
similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking
if a graph can be used for the operations even if a user has asked for it to be
enabled.
2025-05-27 18:03:00 +03:00
Henry Linjamäki
316600e8ee
opencl: Add support for multiple devices (llama/12622)
...
* opencl: Add support for multiple devices
... but limited to one platform. A platform with a GPU will be preferred.
Additionally:
* Filter out devices that lack capabilities needed by the backend
implementation (half support, OpenCL 2.0+, etc).
* Make ggml_backend_opencl_reg() thread-safe.
* fixup: fix an error in sync_with_other_backends
... when there is only one OpenCL device available.
2025-05-27 18:03:00 +03:00
Henry Linjamäki
42f2b3bb65
opencl: fix couple crashes (llama/12795)
...
* opencl: fix couple crashes
* fix kernel launches failed on devices which do not support
non-uniform work-groups. When non-uniform work-groups are not
supported, set `local_work_size` to NULL (= let driver choose the
work-group sizes). This patch does not cover everything - just the
cases tested by test-backend-ops.
* fix sub-buffer creation failed due to `cl_buffer_region::origin` not
being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen
dd6ef64060
ggml : add ggml_gelu_erf() (llama/13667)
...
* ggml : add ggml_gelu_na (not approximated)
* fix naming order
* rename na --> erf
* apply review suggesions
* revert naming order
2025-05-27 18:03:00 +03:00
R0CKSTAR
131ee546ca
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (llama/13647)
...
* musa: fix build warning (unused parameter)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: upgrade MUSA SDK version to rc4.0.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Update ggml/src/ggml-cuda/cpy.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-05-27 18:03:00 +03:00
Eve
4712f7b663
vulkan: fix warnings (llama/13626)
...
* small fixes
* remove ifdef
2025-05-27 18:03:00 +03:00
Johannes Gäßler
926fe234e9
CUDA: skip fully masked-out KV in FA vec kernel (llama/13584)
...
* CUDA: skip fully masked-out KV in FA vec kernel
2025-05-27 18:03:00 +03:00
Svetlozar Georgiev
f44b53480f
sycl: disable reorder for sycl mulmat (llama/13536)
2025-05-27 18:03:00 +03:00
Georgi Gerganov
e04e8f1c79
metal : fix typo in FA kernel comments (llama/13651)
2025-05-27 18:03:00 +03:00
Nicolò Scipione
ee3f177cba
sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482)
...
* Remove mmap workaround on windows
After some testing I found that mmap is supported on windows and for
many GPUs on Linux. Therefore I remove the workaround for windows since
it is not necessary.
* Update llama-bench README
SYCL backend introduced a workaround that allows execution of
llama-bench also without specifying `--mmp 0` flag
2025-05-27 18:03:00 +03:00
0cc4m
0b69f74e15
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607)
2025-05-27 18:03:00 +03:00
Chenguang Li
9da3fc27be
CANN: Support MOE Model MUL_MAT_ID (llama/13042)
...
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:58:39 +03:00
Gilad S.
2c13651e08
cmake: use the current build config for vulkan-shaders-gen (llama/13595)
...
* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config`
2025-05-19 14:58:39 +03:00
Jeff Bolz
13dca86c56
vulkan: move common FA code to flash_attn_base.comp (llama/13556)
...
* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix
2025-05-19 14:58:39 +03:00