Chenguang Li
9da3fc27be
CANN: Support MOE Model MUL_MAT_ID (llama/13042)
...
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-19 14:58:39 +03:00
Gilad S.
2c13651e08
cmake: use the current build config for vulkan-shaders-gen (llama/13595)
...
* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config`
2025-05-19 14:58:39 +03:00
Jeff Bolz
13dca86c56
vulkan: move common FA code to flash_attn_base.comp (llama/13556)
...
* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix
2025-05-19 14:58:39 +03:00
Jeff Bolz
6d61a09bc4
vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)
2025-05-19 14:58:39 +03:00
Georgi Gerganov
4fedad988b
metal : add FA-vec kernel for head size 64 (llama/13583)
...
ggml-ci
2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk
a8e17a244d
sycl : fixed compilation warnings (llama/13582)
2025-05-19 14:58:39 +03:00
Diego Devesa
0c76acd08a
gguf : use ggml log system (llama/13571)
...
* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages
2025-05-19 14:58:39 +03:00
Atharva Dubey
27964db1be
sycl: simplify bin_bcast_kernel (llama/13383)
2025-05-19 14:58:39 +03:00
Svetlozar Georgiev
8081e7a23d
sycl: reordered Q4_K MMVQ (llama/13109)
2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk
d807c497a4
sycl: use oneDNN for matrices multiplication (llama/12972)
2025-05-19 14:58:39 +03:00
Yibo Cai
8e9bf548f4
arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519)
...
This PR improves the q6_k_q8_k gemm kernel using the arm64 i8mm instruction.
Tested on neoverse-n2 with a llama3 8b model quantized to q6_k.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above
Perplexity doesn't change with this PR.
```
// tested on neoverse-n2
$ llama-batched-bench \
-m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
--no-mmap -fa \
-c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
-npl 1,2,4,8,16,32 \
-t 64
---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
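As a sanity check on the quoted ranges, the uplift can be recomputed from the benchmark table above. This is only a sketch: the `results` dictionary and the `uplift` helper are illustrative names, not part of the PR, and the numbers are copied by hand from the table.

```python
# Throughput figures copied from the benchmark table (tokens/s), keyed by batch size B.
# Each tuple is (S_PP original, S_PP this PR, S_TG original, S_TG this PR).
results = {
    1: (78.52, 109.18, 18.63, 18.88),
    2: (84.62, 123.94, 34.54, 36.92),
    4: (84.36, 122.49, 52.65, 61.32),
    8: (90.52, 138.87, 63.46, 84.41),
    16: (90.11, 138.56, 71.04, 101.33),
    32: (89.81, 137.79, 75.14, 110.47),
}


def uplift(before: float, after: float) -> float:
    """Relative throughput gain in percent."""
    return (after / before - 1.0) * 100.0


# Per-batch-size uplift for prompt processing (S_PP) and text generation (S_TG).
pp_uplift = {b: uplift(r[0], r[1]) for b, r in results.items()}
tg_uplift = {b: uplift(r[2], r[3]) for b, r in results.items()}

for b in results:
    print(f"B={b:2d}  S_PP +{pp_uplift[b]:.1f}%  S_TG +{tg_uplift[b]:.1f}%")
```

Computed this way, S_PP gains fall between roughly 39% and 54% across all batch sizes, and S_TG gains between roughly 16% and 47% for batch sizes 4 and above, in line with the summary in the commit message.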
2025-05-19 14:58:39 +03:00
Johannes Gäßler
0dda27bc0b
CUDA: fix crash on large batch size for quant. MoE (llama/13537)
2025-05-19 14:58:39 +03:00
Johannes Gäßler
ffa4720f25
CUDA: faster Deepseek FA, add Turing support (llama/13435)
2025-05-19 14:58:39 +03:00
bandoti
9b8eea28b5
cmake: simplify vulkan shader test logic (llama/13263)
2025-05-19 14:58:39 +03:00
Jeff Bolz
162bbe8220
vulkan: KHR_coopmat flash attention (llama/13506)
...
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
2025-05-19 14:58:39 +03:00
Jeff Bolz
a221288dc6
vulkan: workaround FA compile failures on macos (llama/13517)
2025-05-19 14:58:39 +03:00
Georgi Gerganov
08436716ae
metal : use FA-vec kernel up to batch size 20 (llama/13496)
...
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci
* metal : use FA-vec kernel up to batch size 20
ggml-ci
2025-05-19 14:58:39 +03:00
Georgi Gerganov
e11fc21e6c
metal : optimize multi-sequence FA vec kernel (llama/13493)
...
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci
2025-05-19 14:58:39 +03:00
Dan Johansson
a77a924b20
ggml-cpu: Update KleidiAI to v1.6 and fix include directives (llama/13509)
...
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
2025-05-19 14:58:39 +03:00
Johannes Gäßler
405b9c77ad
mnist: fix segmentation fault (ggml/1227)
2025-05-19 14:58:39 +03:00
Diego Devesa
9c3bfc1499
ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)
2025-05-19 14:58:39 +03:00
Daniel Tang
5b7797f674
ggml : Fix missing backtrace on Linux (ggml/1228)
...
* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
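For context on the first bullet, the Yama setting it mentions can be inspected from user space. The sketch below only reads the sysctl file named in the commit message; the helper name `read_ptrace_scope` and the `MEANINGS` table are illustrative and are not taken from the ggml change itself.

```python
from pathlib import Path
from typing import Optional

# Path named in the commit message; it exists only on Linux kernels with Yama enabled.
YAMA_PTRACE_SCOPE = Path("/proc/sys/kernel/yama/ptrace_scope")

# Summary of the values as described in the Linux kernel Yama documentation.
MEANINGS = {
    0: "classic ptrace permissions",
    1: "restricted: a process may only be traced by an ancestor or a declared tracer",
    2: "admin-only attach (CAP_SYS_PTRACE required)",
    3: "no attach allowed",
}


def read_ptrace_scope(path: Path = YAMA_PTRACE_SCOPE) -> Optional[int]:
    """Return the current ptrace_scope value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


scope = read_ptrace_scope()
if scope is None:
    print("ptrace_scope not available on this system")
else:
    print(f"ptrace_scope = {scope}: {MEANINGS.get(scope, 'unknown value')}")
```

With a value of 1, a debugger started as a child of the crashing process cannot attach to its parent unless the parent explicitly allows it, which is consistent with the backtrace being missing on default Linux configurations.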
2025-05-19 14:58:39 +03:00
Xuan-Son Nguyen
75e9a840c5
ggml : add mrope kernel for metal (llama/13457)
2025-05-13 13:59:21 +03:00
Georgi Gerganov
41ed62bdbc
metal : optimize MoE for large batches (llama/13388)
2025-05-13 13:59:21 +03:00
lhez
029c8837f8
opencl: remove unnecessary assert for `add` (llama/13257)
2025-05-13 13:59:21 +03:00
Johannes Gäßler
5d8b068249
llama/ggml: add LLM training support (llama/10544)
...
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-13 13:59:21 +03:00
Dan Johansson
93ef22657e
ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (llama/13053)
...
* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
* code review fixes
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
* adds a comment that clarifies barrier usage
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
---------
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
Co-authored-by: Charles Xu <charles.xu@arm.com>
2025-05-13 13:59:21 +03:00
Johannes Gäßler
866f685bbc
CUDA: fix misaligned synchronization in FA (llama/13469)
2025-05-13 13:59:21 +03:00
Atharva Dubey
250bcc041a
enable dpcpp nightly builds with libraries (llama/13406)
2025-05-13 13:59:21 +03:00
Johannes Gäßler
90b17a99bf
CUDA: fix crash with partial offloading of MoE (llama/13439)
2025-05-13 13:59:21 +03:00
David Huang
e1b2ace0f8
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (llama/13386)
2025-05-13 13:59:21 +03:00
Johannes Gäßler
6db0e01db6
CUDA: fix race conditions FlashAttention kernels (llama/13438)
2025-05-13 13:59:21 +03:00
Johannes Gäßler
16f3546f38
CUDA: fix FlashAttention on Turing (llama/13415)
2025-05-13 13:59:21 +03:00
Jeff Bolz
a04b329ad1
vulkan: scalar flash attention implementation (llama/13324)
...
* vulkan: scalar flash attention implementation
* vulkan: always use fp32 for scalar flash attention
* vulkan: use vector loads in scalar flash attention shader
* vulkan: remove PV matrix, helps with register usage
* vulkan: reduce register usage in scalar FA, but perf may be slightly worse
* vulkan: load each Q value once. optimize O reduction. more tuning
* vulkan: support q4_0/q8_0 KV in scalar FA
* CI: increase timeout to accommodate newly-supported tests
* vulkan: for scalar FA, select between 1 and 8 rows
* vulkan: avoid using Float16 capability in scalar FA
2025-05-13 13:59:21 +03:00
Alberto Cabrera Pérez
45d8b2352e
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (llama/12858)
...
* sycl : Implemented reorder Q4_0 mmvq
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* sycl : Fixed mmvq being called when reorder is disabled
* sycl : Improved comments in the quants header
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* Use static_assert
* safe_div -> ceil_div
* Clarify qi comment
* change the reorder tensor from init to execute OP
* dbg
* Undo changes to test-backend-ops
* Refactor changes on top of q4_0 reorder fix
* Missing Reverts
* Refactored opt_for_reorder logic to simplify code path
* Explicit inlining and unroll
* Renamed mul_mat_algo enum for consistency
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
2025-05-13 13:59:21 +03:00
Johannes Gäßler
2d436bfbfb
CUDA: FA support for Deepseek (Ampere or newer) (llama/13306)
...
* CUDA: FA support for Deepseek (Ampere or newer)
* do loop unrolling via C++ template
2025-05-13 13:59:21 +03:00
Johannes Gäßler
4b7cbb62ef
CUDA: fix crash on large batch size for MoE models (llama/13384)
2025-05-13 13:59:21 +03:00
Radoslav Gerganov
e27c91f6d6
rpc : add rpc_msg_set_tensor_hash_req (llama/13353)
...
* rpc : add rpc_msg_set_tensor_hash_req
Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which
makes the code cleaner.
* fix
2025-05-13 13:59:21 +03:00
Jeff Bolz
e46df4850f
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (llama/13326)
...
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:
GGML_ASSERT(nei0 * nei1 <= 3072);
The tensor is 8 x 512. Increase this array size to accommodate.
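The arithmetic behind the failing assert is easy to check. In this sketch the assignment of 8 and 512 to `nei0` and `nei1` is an assumption (the message only gives the shape 8 x 512), and the new limit of 4096 is taken from the commit title.

```python
# Shape of the row_ids tensor reported in the commit message (8 x 512).
# Which dimension maps to nei0 and which to nei1 is assumed here; the product is the same.
nei0, nei1 = 8, 512

old_limit = 3072  # bound in the assert that fired
new_limit = 4096  # bound named in the commit title

n_row_ids = nei0 * nei1

print(f"row_ids elements: {n_row_ids}")
print(f"fits old limit ({old_limit}): {n_row_ids <= old_limit}")
print(f"fits new limit ({new_limit}): {n_row_ids <= new_limit}")
```

The tensor holds exactly 4096 elements, so it exceeds the old bound and just fits the new one.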
2025-05-13 13:59:21 +03:00
Alberto Cabrera Pérez
e8a7f1b7bb
sycl: addressing non-contiguous src1 mul_mats (nc and batched) (llama/13343)
...
* sycl: fixed non-contiguous src1 mul_mats (nc and batched)
* Fixed wrong static_cast inside kernel
2025-05-13 13:59:21 +03:00
R0CKSTAR
09e6b66025
cuda : remove nrows_x in mul_mat_q_process_tile (llama/13325)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-05-07 21:00:32 +03:00
Johannes Gäßler
d41cf26a0f
CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (llama/13135)
2025-05-07 21:00:32 +03:00
Akarshan Biswas
3c67195be9
SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (llama/13254)
...
* SYCL: Do not set tensor extras when reorder optimize is disabled
* SYCL: Disable reorder optimize by default
2025-05-07 21:00:32 +03:00
Johannes Gäßler
f9f78a773f
CUDA: fix bad asserts for partial offload (llama/13337)
2025-05-07 21:00:32 +03:00
Johannes Gäßler
be55e25cac
CUDA: fix --split-mode row for MMQ (llama/13323)
2025-05-07 21:00:32 +03:00
Johannes Gäßler
2ffdda99e8
CUDA: fix logic for clearing padding with -ngl 0 (llama/13320)
2025-05-07 21:00:32 +03:00
Akarshan Biswas
9bbedc51cc
SYCL: Disable mul_mat kernels for noncontiguous tensor b (llama/13308)
...
ggml-ci
2025-05-07 21:00:32 +03:00
Diego Devesa
1e1fa27add
rpc : use backend registry, support dl backends (llama/13304)
2025-05-07 21:00:32 +03:00
Aaron Teo
e1bdd148c5
ggml : activate s390x simd for Q3_K (llama/13301)
...
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-05-07 21:00:32 +03:00
Johannes Gäßler
7fa8bb303f
CUDA: fix race condition in MMQ stream-k fixup (llama/13299)
2025-05-07 21:00:32 +03:00