Commit Graph

4079 Commits

Author SHA1 Message Date
Taimur Ahmad 0c10a15447 ggml-cpu: add RVV vec dot kernels for quantization types (llama/18784)
* ggml-cpu: add rvv vec_dot for iq2_s

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add rvv vec_dot for iq3_s

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add rvv vec_dot for tq1_0, tq2_0

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add rvv vec_dot for iq1_s, iq1_m

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: add vlen switch for rvv vec_dot

---------

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2026-02-27 20:57:58 +02:00
Masashi Yoshimura 0158795ebc ggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support. (llama/19700)
* ggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support.

* Fix: cast the src value to f32 before computing sin/cos.
2026-02-27 20:57:58 +02:00
Ruben Ortlam 3f68f30907 vulkan: fix MMQ shader push constants and multi-dispatch (llama/19732) 2026-02-27 20:57:58 +02:00
Johannes Gäßler ade724fced CUDA: fix kernel selection logic for tile FA (llama/19686)
* CUDA: fix kernel selection logic for tile FA

* add comment
2026-02-27 20:57:58 +02:00
shalinib-ibm cc9e5cf89d llamafile: powerpc: add FP16 MMA path for Q4/Q8 matmul (llama/19709)
Avoid xvi8ger4pp signed→unsigned bias correction by dequantizing Q4/Q8
inputs to FP16 and using FP16×FP16→FP32 MMA. This removes
post-processing overhead and improves performance.
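
As a rough illustration of the idea, here is a minimal scalar sketch (not the actual POWER10 MMA code; the layout follows ggml's Q4_0, with the fp16 block scale simplified to a float for brevity):

```c
#include <stdint.h>

#define QK4_0 32

// ggml-style Q4_0 block: one scale plus 32 packed 4-bit quants
// (the scale is fp16 in ggml; a float is used here for brevity)
typedef struct {
    float   d;
    uint8_t qs[QK4_0 / 2];
} block_q4_0_ref;

// Dequantize first, then multiply-accumulate in floating point.
// Working in fp16/fp32 space is what lets the MMA path skip the
// signed->unsigned bias correction that int8 xvi8ger4pp requires.
static float vec_dot_q4_0_ref(const block_q4_0_ref *x, const float *y) {
    float sum = 0.0f;
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int v0 = (x->qs[j] & 0x0F) - 8; // low nibbles: elems 0..15
        const int v1 = (x->qs[j] >>   4) - 8; // high nibbles: elems 16..31
        sum += x->d * v0 * y[j] + x->d * v1 * y[j + QK4_0 / 2];
    }
    return sum;
}
```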

Performance Impact:
1.5 ~ 2x improvement in PP_Speed for Q4 and Q8 Models,
measured with llama-bench and llama-batched-bench.
Q8 Model: granite-4.0-h-micro-Q8_0.gguf (from huggingface)
Q4 Model: Meta-Llama3-8b Q4 model (generated with llama-quantize from
f32 model)

llama-bench Q8 Model Results:
| model | size | params | backend | threads | test | Base t/s | Patch t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | -------- | --------- |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp8 | 64.48 ± 4.72 | 73.99 ± 0.27 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp16 | 80.11 ± 0.32 | 112.53 ± 0.40 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp32 | 89.10 ± 0.27 | 152.95 ± 0.68 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp64 | 93.65 ± 0.25 | 187.83 ± 0.83 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp128 | 99.93 ± 0.02 | 201.32 ± 0.11 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp256 | 102.32 ± 0.40 | 208.32 ± 0.41 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | pp512 | 103.42 ± 0.40 | 209.98 ± 0.14 |
| granitehybrid 3B Q8_0 | 3.16 GiB | 3.19 B | CPU | 10 | tg128 | 20.35 ± 0.01 | 19.57 ± 0.01 |

llama-bench Q4 Model Results:
| model | size | params | backend | threads | test | Base t/s | Patch t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | -------- | --------- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp8 | 34.77 ± 0.10 | 41.23 ± 0.08 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp16 | 40.81 ± 0.04 | 64.55 ± 0.15 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp32 | 44.65 ± 0.05 | 90.84 ± 0.22 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp64 | 47.49 ± 0.03 | 114.39 ± 0.11 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp128 | 49.29 ± 0.24 | 120.13 ± 0.19 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp256 | 49.77 ± 0.23 | 121.51 ± 0.11 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 49.89 ± 0.23 | 117.52 ± 0.10 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.40 ± 0.01 | 13.37 ± 0.00 |

Llama perplexity Results:

| Model | Base Final PPL Estimate | Patch Final PPL Estimate |
| ----- | ----------------------- | ------------------------ |
| granite-4.0-h-micro-Q8_0 | 1.3862 ± 0.04424 | 1.3868 ± 0.04432 |
| Meta-Llama3-8b Q4 | 1.3801 ± 0.04116 | 1.3803 ± 0.04116 |

Signed-off-by: Shalini.Salomi.Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2026-02-27 20:57:58 +02:00
Reese Levine 8b3a52ba87 ggml webgpu: Fix bug in dispatching large matrix-vector multiplication (llama/19535)
* Fix bug in dispatching large matrix-vector multiplication
2026-02-27 20:57:58 +02:00
Reese Levine fc7a78f4d8 ggml webgpu: shader library organization (llama/19530)
* Basic JIT compilation for mul_mat, get_rows, and scale (ggml/17)

* scale jit working

* preliminary working jit for getrows and mulmat, needs refining

* simplified mul_mat preprocessing switch statement

* get_rows fixes, mul_mat refinement

* formatted + last edits

* removed some extraneous prints

* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish

* small fix

* some changes, working

* get_rows and mul_mat jit fixed and working

* Update formatting

* formatting

* Add header

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on all-encompassing shader library

* refactor argmax, set_rows

* Refactor all but flashattention, mat mul

* flashattention and matrix multiplication moved to new format

* clean up preprocessing

* Formatting

* remove duplicate constants

* Split large shaders into multiple static strings

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
2026-02-27 20:57:58 +02:00
Jeff Bolz f1da0a26f5 vulkan: split mul_mat into multiple dispatches to avoid overflow (llama/19509)
* vulkan: split mul_mat into multiple dispatches to avoid overflow

The batch dimensions can be greater than the max workgroup count limit,
in which case we need to split into multiple dispatches and pass the base
index through a push constant.

Fall back for the less common p021 and nc variants (see the sketch after this list).

* address feedback
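
For illustration, a minimal host-side sketch of the split (hypothetical names; dispatch_mul_mat stands in for vkCmdPushConstants followed by vkCmdDispatch):

```c
#include <stdint.h>

typedef struct {
    uint32_t batch_base; // base index the shader adds to its workgroup id
    // ... remaining mul_mat push constants ...
} mm_push_constants;

// hypothetical stand-in for vkCmdPushConstants + vkCmdDispatch
void dispatch_mul_mat(uint32_t wg_x, uint32_t wg_y, uint32_t wg_z,
                      const mm_push_constants *pc);

// Split the batch dimension into chunks that fit the device's
// maxComputeWorkGroupCount limit, passing the base via push constant.
void dispatch_mul_mat_batched(uint32_t wg_x, uint32_t wg_y,
                              uint32_t total_batches,
                              uint32_t max_wg_count) {
    for (uint32_t base = 0; base < total_batches; base += max_wg_count) {
        uint32_t n = total_batches - base;
        if (n > max_wg_count) {
            n = max_wg_count;
        }
        mm_push_constants pc = { .batch_base = base };
        dispatch_mul_mat(wg_x, wg_y, n, &pc);
    }
}
```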
2026-02-27 20:57:58 +02:00
shaofeiqi 51ce7de94c opencl: refactor expm1 and softplus (llama/19404)
* opencl: refactor expm1

* opencl: refactor softplus

* opencl: use h for half literals

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-02-27 20:57:58 +02:00
shaofeiqi 6fadc749a9 opencl: optimize mean and sum_row kernels (llama/19614)
* opencl: optimize mean and sum_row kernels

* opencl: add comment for max subgroups

* opencl: format

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-02-27 20:57:58 +02:00
Talha Can Havadar 58855d08c2 ggml: ggml-cpu: force-no-lto-for-cpu-feats (llama/19609)
When LTO is enabled in the build environment, it forces all builds to
have LTO in place. But the feature detection logic is fragile and can
cause Illegal-instruction errors with LTO. This disables LTO for the
feature detection code to prevent cross-module optimization from
inlining architecture-specific instructions into the score function.
Without this, LTO can cause SIGILL when loading backends on older CPUs
(e.g., loading the power10 backend on power9 crashes before the
feature check runs).
2026-02-27 20:57:58 +02:00
Georgi Gerganov cf4bd07028 cuda : enable CUDA graphs for MMID 1 <= BS <= 4 (llama/19645)
* cuda : enable CUDA graphs for MMID BS <= 4

* cont : add stream capture check

Co-authored-by: Oliver Simons <osimons@nvidia.com>

* cont : add MMVQ_MMID_MAX_BATCH_SIZE

---------

Co-authored-by: Oliver Simons <osimons@nvidia.com>
2026-02-27 20:57:58 +02:00
Judd 5ee5748722 ggml : make `ggml_is_view` an API (llama/19539)
* make `ggml_is_view` an API

* introduce `ggml_aux_is_view` as an inline version for internal use

* change `ggml_aux_is_view` to `ggml_impl_is_view`
2026-02-27 20:57:58 +02:00
Mario Limonciello 5d9d72ec12 Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (llama/19591)
Avoids issues with ROCm 6.4.4.

Closes: https://github.com/ggml-org/llama.cpp/issues/19580
Fixes: 6845f7f87 ("Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (#19461)")

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
2026-02-27 20:57:58 +02:00
abhijain1204fujitsu f8f7c1d891 ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (llama/19132)
* Updated repack.cpp

* Updated repack.cpp

* Updated repack.cpp

* Added if condition to support only vector length 256.

* Changed the format, removed comments and a duplicate variable

* If SVE 256 is not present, the generic function was used for the
computation, which slowed performance.

So code was added to fall back to the NEON path when SVE 256 is not
present (see the sketch after this list).

* Code format change suggestion
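
A minimal sketch of that dispatch, with hypothetical kernel names (svcntb() is the SVE ACLE intrinsic returning the vector length in bytes, so 32 means 256-bit vectors):

```c
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

// hypothetical kernel entry points
void gemm_q4_K_8x8_q8_K_sve256(void);
void gemm_q4_K_8x8_q8_K_neon(void);

void gemm_q4_K_8x8_q8_K(void) {
    if (svcntb() == 32) {
        gemm_q4_K_8x8_q8_K_sve256(); // 256-bit SVE vectors available
    } else {
        gemm_q4_K_8x8_q8_K_neon();   // NEON instead of the slow generic path
    }
}
#endif
```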

---------

Co-authored-by: Vithule, Prashant <Prashant.Vithule@fujitsu.com>
2026-02-27 20:57:58 +02:00
David Friehs 02a9f660b8 cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (llama/19624)
* cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization

- load all 8 int8 for a grid position in one load
- calculate signs via popcnt instead of fetching from ksigns table
- broadcast signs to drop individual shift/mask

* cuda: iq2xxs: simplify sum scaling

express `(sum * scale + sum / 2) / 4` as `(sum * (scale * 2 + 1)) / 8`
express `((aux32 >> 28) * 2 + 1)` as `(aux32 >> 27 | 1)`

saves 3 registers for mul_mat_vec_q (152 -> 149) according to Nsight.
AFAICT no overflow can occur here, as iq2xxs values are far too small;
both rewrites are double-checked in the sketch after this list.

* uint -> uint32_t

error: identifier "uint" is undefined
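
A quick self-contained check of the two rewrites (illustrative; it sweeps only the small non-negative ranges relevant to iq2_xxs):

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    // (aux32 >> 28) * 2 + 1  ==  (aux32 >> 27) | 1
    // only the top bits matter, so sweep them via the top byte
    for (uint32_t b = 0; b < 256; ++b) {
        const uint32_t aux32 = b << 24;
        assert(((aux32 >> 28) * 2 + 1) == ((aux32 >> 27) | 1));
    }
    // (sum * scale + sum / 2) / 4  ==  (sum * (scale * 2 + 1)) / 8
    // holds for non-negative integers under C's truncating division
    for (int scale = 0; scale < 16; ++scale) {
        for (int sum = 0; sum < 4096; ++sum) {
            assert((sum * scale + sum / 2) / 4 ==
                   (sum * (scale * 2 + 1)) / 8);
        }
    }
    return 0;
}
```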
2026-02-27 20:57:58 +02:00
Daniel Bevenius df2f8d3bc4 cmake : check if KleidiAI API has been fetched (llama/19640)
This commit addresses a build issue with the KleidiAI backend when
building multiple CPU backends. Commit
3a00c98584e42a20675b6569d81beadb282b0952 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable;
the latter does handle this case (it is idempotent, but
FetchContent_Populate is not).

I missed this during my review and should not have committed without
verifying the CI failure, sorry about that.
2026-02-27 20:57:58 +02:00
Georgi Gerganov 22f0861efc ggml : avoid UB in gemm ukernel (llama/19642) 2026-02-27 20:57:58 +02:00
Aaron Teo 7b5a1ebaa6 ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (llama/19399) 2026-02-27 20:57:58 +02:00
Aman Gupta 76f769d06f ggml-cpu: FA add GEMM microkernel (llama/19422)
* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR !=0

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations
2026-02-27 20:57:58 +02:00
SamareshSingh 7ee772ab2b cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (llama/19581)
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments
2026-02-27 20:57:58 +02:00
Georgi Gerganov 4bea3cd329 ggml : bump version to 0.9.7 (ggml/1425) 2026-02-27 20:57:58 +02:00
Dmitry Atamanov cec1dd9d12 examples : update miniaudio library to 0.11.24 (#3672) 2026-02-27 11:15:15 +01:00
Maxime Grenu 21411d81ea docs : fix duplicate word typo in VAD section (#3670)
The VAD section contained a spurious 'the' at the end of a sentence,
creating the run-on 'Using this information the / only the speech
segments...'. Replace the orphaned 'the' with a comma so the sentence
reads correctly: 'Using this information, only the speech segments...'.
2026-02-19 16:18:42 +01:00
Georgi Gerganov 364c77f4ca talk-llama : sync llama.cpp 2026-02-15 21:44:37 +02:00
Georgi Gerganov 83f2ed19e1 sync : ggml 2026-02-15 21:44:37 +02:00
Georgi Gerganov 4ac70ce791 models : optimize qwen3next graph (llama/19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
2026-02-15 21:44:37 +02:00
Adrien Gallouët 226e8c041c ggml : fix GGML_DEBUG with OpenMP (llama/19599)
last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-15 21:44:37 +02:00
Georgi Gerganov fbdac5119c metal : fix ACC op (llama/19427) 2026-02-15 21:44:37 +02:00
Jeff Bolz cc448def01 vulkan: support L2_NORM with contiguous rows (llama/19604) 2026-02-15 21:44:37 +02:00
Jeff Bolz 197e9ab6eb vulkan: support GGML_OP_SET (llama/19584) 2026-02-15 21:44:37 +02:00
Sophon fc6bbab817 vulkan: Add vendor id for Qualcomm drivers (llama/19569)
This commit allows the Qualcomm native Vulkan driver to be used on
Windows instead of Mesa Dozen.
2026-02-15 21:44:37 +02:00
Max Krasnyansky e6476d4c12 hexagon: further optimizations and refactoring for flash attention (llama/19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.
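
For reference, a scalar version of the online softmax that the HVX code vectorizes with hvx_vec_expf (a standard single-pass formulation, not the actual Hexagon kernel):

```c
#include <math.h>
#include <stddef.h>

// Single-pass ("online") softmax: keep a running max m and running
// denominator s, rescaling s by exp(m_old - m_new) whenever the max grows.
void softmax_online(const float *x, float *y, size_t n) {
    float m = -INFINITY;
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float m_new = x[i] > m ? x[i] : m;
        s = s * expf(m - m_new) + expf(x[i] - m_new);
        m = m_new;
    }
    for (size_t i = 0; i < n; ++i) {
        y[i] = expf(x[i] - m) / s;
    }
}
```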

---------

Co-authored-by: chraac <chraac@gmail.com>
2026-02-15 21:44:37 +02:00
Jeff Bolz ec57bf407c vulkan: restore -inf check in FA shaders (llama/19582) 2026-02-15 21:44:37 +02:00
Alberto Cabrera Pérez e8a25654b2 Fix wrong memcpy length for block_interleave == 4 (llama/19575) 2026-02-15 21:44:37 +02:00
ymcki 628b545b7e fix vulkan ggml_acc only works in 3d but not 4d (llama/19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Aman Gupta 58e3d5a42d CUDA: loop over ne2*ne3 in case it overflows (llama/19538)
* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
2026-02-15 21:44:37 +02:00
Oliver Simons 3eb4905af1 CUDA: Do not mutate cgraph for fused ADDs (llama/19566)
* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
   step as we store the properties before modifying the graph in-place
   in the cuda-backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 0e94faa19c metal : improve concurrency (llama/19555) 2026-02-15 21:44:37 +02:00
Georgi Gerganov c5325e50fc metal : support GGML_OP_SET (llama/19548) 2026-02-15 21:44:37 +02:00
Shupei Fan 195af60a8b hexagon: fix typo in vtcm_needs_release (llama/19545) 2026-02-15 21:44:37 +02:00
lhez 9f87eeccdf opencl: add basic support for q4_1 (llama/19534)
* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm
2026-02-15 21:44:37 +02:00
Georgi Gerganov d8e3e2ef08 metal : update sum_rows kernel to support float4 (llama/19524) 2026-02-15 21:44:37 +02:00
Mario Limonciello 39b5f414a3 Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (llama/19461)
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).

The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
 2549 |             using ARegsT = typename Impl::ARegsT;
```

Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2]. Once this is actually fixed
upstream, version guards can be added so the workaround is applied
only where necessary.

Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2026-02-15 21:44:37 +02:00
Max Krasnyansky 304205679c hexagon: further optimization and tuning of matmul and dot kernels (llama/19407)
* ggml-hexagon: implement 2x2 matmul kernel

* hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4

* hexagon: fix editor config failures

* hexagon: refactor matmul ops to use context struct and remove wrappers

Also implement vec_dot_f16 2x2

* hexagon: refactor dyn quantizers to use mmctx

* hexagon: remove mm fastdiv from op_ctx

* hexagon: refactor matmul entry point to reduce code duplication

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
lhez 0326fd37dd opencl: add general Q6_K mm and Q4_K mv (llama/19347)
* opencl: add general q6_k mm

* opencl: refine condition for q6_K mm

* opencl: add general q4_K mv

* opencl: fix whitespace
2026-02-15 21:44:37 +02:00
Georgi Gerganov f3e78985be ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/19511)
* ggml : unary ops support non-cont src0

* metal : support F16 unary ops + fix ELU
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3ffa1fd84e metal : extend l2_norm support for non-cont src0 (llama/19502) 2026-02-15 21:44:37 +02:00
Max Krasnyansky 09587ceb12 hexagon: Add ARGSORT, DIV, SQR, SQRT, SUM_ROWS, GEGLU (llama/19406)
* hexagon: add ARGSORT op

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>

* hexagon: argsort reject tensors with huge rows for now

* Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend

* hexagon : Add GEGLU op

* hexagon: fix editor config check

* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA

---------

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3504358056 ggml : extend bin bcast for permuted src1 (llama/19484)
* tests : extend bin bcast for permuted src1

* cont : extend bin support

* cont : s0 is always 1

* tests : simplify
2026-02-15 21:44:37 +02:00