Commit Graph

2032 Commits

Author SHA1 Message Date
Daniel Bevenius df2f8d3bc4 cmake : check if KleidiAI API has been fetched (llama/19640)
This commit addresses a build issue with the KleidiAI backend when
building multiple CPU backends. Commit
3a00c98584e42a20675b6569d81beadb282b0952 ("cmake : fix KleidiAI install
target failure with EXCLUDE_FROM_ALL") introduced a change where
FetchContent_Populate is called instead of FetchContent_MakeAvailable;
the latter handles this case correctly (it is idempotent, but
FetchContent_Populate is not).

I missed this during my review and should not have committed without
verifying the CI failure; sorry about that.
2026-02-27 20:57:58 +02:00
Georgi Gerganov 22f0861efc ggml : avoid UB in gemm ukernel (llama/19642) 2026-02-27 20:57:58 +02:00
Aaron Teo 7b5a1ebaa6 ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (llama/19399) 2026-02-27 20:57:58 +02:00
Aman Gupta 76f769d06f ggml-cpu: FA add GEMM microkernel (llama/19422)
* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR !=0

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations
2026-02-27 20:57:58 +02:00
SamareshSingh 7ee772ab2b cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (llama/19581)
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments
2026-02-27 20:57:58 +02:00
Georgi Gerganov 4bea3cd329 ggml : bump version to 0.9.7 (ggml/1425) 2026-02-27 20:57:58 +02:00
Georgi Gerganov 4ac70ce791 models : optimize qwen3next graph (llama/19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
2026-02-15 21:44:37 +02:00
Adrien Gallouët 226e8c041c ggml : fix GGML_DEBUG with OpenMP (llama/19599)
last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-15 21:44:37 +02:00
Georgi Gerganov fbdac5119c metal : fix ACC op (llama/19427) 2026-02-15 21:44:37 +02:00
Jeff Bolz cc448def01 vulkan: support L2_NORM with contiguous rows (llama/19604) 2026-02-15 21:44:37 +02:00
Jeff Bolz 197e9ab6eb vulkan: support GGML_OP_SET (llama/19584) 2026-02-15 21:44:37 +02:00
Sophon fc6bbab817 vulkan: Add vendor id for Qualcomm drivers (llama/19569)
This commit allows the Qualcomm native Vulkan driver to be used on
Windows instead of Mesa Dozen.
2026-02-15 21:44:37 +02:00
Max Krasnyansky e6476d4c12 hexagon: further optimizations and refactoring for flash attention (llama/19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.

---------

Co-authored-by: chraac <chraac@gmail.com>
2026-02-15 21:44:37 +02:00
Jeff Bolz ec57bf407c vulkan: restore -inf check in FA shaders (llama/19582) 2026-02-15 21:44:37 +02:00
Alberto Cabrera Pérez e8a25654b2 Fix wrong memcpy length for block_interleave == 4 (llama/19575) 2026-02-15 21:44:37 +02:00
ymcki 628b545b7e fix vulkan ggml_acc only works in 3d but not 4d (llama/19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Aman Gupta 58e3d5a42d CUDA: loop over ne2*ne3 in case it overflows (llama/19538)
* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
2026-02-15 21:44:37 +02:00
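The overflow-avoidance pattern above can be sketched in plain C++. This is an illustrative host-side model only, not the actual CUDA kernel: `process_chunk`, `max_chunk`, and `process_all` are hypothetical names standing in for a kernel launch and a 32-bit grid limit.

```cpp
// Hedged sketch: the flattened outer extent ne2*ne3 is computed in 64 bits
// and processed in chunks that each fit a 32-bit launch dimension, so the
// product can never overflow a 32-bit index.
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-chunk worker standing in for a kernel launch.
static void process_chunk(int64_t start, int32_t count, std::vector<int64_t> & seen) {
    for (int32_t i = 0; i < count; ++i) {
        seen.push_back(start + i);
    }
}

std::vector<int64_t> process_all(int64_t ne2, int64_t ne3, int32_t max_chunk) {
    std::vector<int64_t> seen;
    const int64_t total = ne2 * ne3;  // 64-bit product, safe from 32-bit overflow
    for (int64_t start = 0; start < total; start += max_chunk) {
        const int32_t count = (int32_t) std::min<int64_t>(max_chunk, total - start);
        process_chunk(start, count, seen);
    }
    return seen;
}
```

The real kernel additionally uses fastdiv to recover the (i2, i3) coordinates from the flattened index without an expensive integer division.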
Oliver Simons 3eb4905af1 CUDA: Do not mutate cgraph for fused ADDs (llama/19566)
* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph-capture
   step, since the CUDA backend stores the graph properties before the
   in-place modification

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 0e94faa19c metal : improve concurrency (llama/19555) 2026-02-15 21:44:37 +02:00
Georgi Gerganov c5325e50fc metal : support GGML_OP_SET (llama/19548) 2026-02-15 21:44:37 +02:00
Shupei Fan 195af60a8b hexagon: fix typo in vtcm_needs_release (llama/19545) 2026-02-15 21:44:37 +02:00
lhez 9f87eeccdf opencl: add basic support for q4_1 (llama/19534)
* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm
2026-02-15 21:44:37 +02:00
Georgi Gerganov d8e3e2ef08 metal : update sum_rows kernel to support float4 (llama/19524) 2026-02-15 21:44:37 +02:00
Mario Limonciello 39b5f414a3 Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (llama/19461)
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).

The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
 2549 |             using ARegsT = typename Impl::ARegsT;
```

Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2]. Once this is actually fixed
upstream, version guards can be added so the workaround is applied only
when necessary.

Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2026-02-15 21:44:37 +02:00
Max Krasnyansky 304205679c hexagon: further optimization and tuning of matmul and dot kernels (llama/19407)
* ggml-hexagon: implement 2x2 matmul kernel

* hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4

* hexagon: fix editor config failures

* hexagon: refactor matmul ops to use context struct and remove wrappers

Also implement vec_dot_f16 2x2

* hexagon: refactor dyn quantizers to use mmctx

* hexagon: remove mm fastdiv from op_ctx

* hexagon: refactor matmul entry point to reduce code duplication

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
lhez 0326fd37dd opencl: add general Q6_K mm and Q4_K mv (llama/19347)
* opencl: add general q6_k mm

* opencl: refine condition for q6_K mm

* opencl: add general q4_K mv

* opencl: fix whitespace
2026-02-15 21:44:37 +02:00
Georgi Gerganov f3e78985be ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/19511)
* ggml : unary ops support non-cont src0

* metal : support F16 unary ops + fix ELU
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3ffa1fd84e metal : extend l2_norm support for non-cont src0 (llama/19502) 2026-02-15 21:44:37 +02:00
Max Krasnyansky 09587ceb12 hexagon: Add ARGSORT, DIV, SQR, SQRT, SUM_ROWS, GEGLU (llama/19406)
* hexagon: add ARGSORT op

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>

* hexagon: argsort reject tensors with huge rows for now

* Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend

* hexagon : Add GEGLU op

* hexagon: fix editor config check

* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA

---------

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3504358056 ggml : extend bin bcast for permuted src1 (llama/19484)
* tests : extend bin bcast for permuted src1

* cont : extend bin support

* cont : s0 is always 1

* tests : simplify
2026-02-15 21:44:37 +02:00
Georgi Gerganov de949fb1db metal : consolidate unary ops (llama/19490) 2026-02-15 21:44:37 +02:00
Oliver Simons 57c620b4b1 CUDA : Update CCCL-tag for 3.2 to final release from RC (llama/19486)
CCCL 3.2 has been released since it was added to llama.cpp as part of
the backend-sampling PR, so it makes sense to update from the RC to the
final release.

https://github.com/NVIDIA/cccl/releases/tag/v3.2.0
2026-02-15 21:44:37 +02:00
Nikhil Jain 562255fd77 Plug memory leaks and free resources on shutdown (llama/19315)
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool

* Free pools

* Cleanup

* More cleanup

* Run clang-format

* Fix arg-parser and tokenizer test errors that free an unallocated buffer

* Fix device lost callback to not print on device teardown

* Fix include and run clang-format

* remove unused unused

* Update binary ops

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-02-15 21:44:37 +02:00
Alberto Cabrera Pérez d77265c818 ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (llama/19360)
* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention
2026-02-15 21:44:37 +02:00
k4ss4n b0fe2e84fa ggml : use noexcept overload for is_regular_file in backend registration (llama/19452)
Using the noexcept overload of std::filesystem::directory_entry::is_regular_file
prevents abnormal termination from an uncaught exception
(as caused by symlinks to non-existent folders on Linux)

Resolves: #18560
2026-02-15 21:44:37 +02:00
Raul Torres 2de2fc9270 CANN: Remove unnecessary wrapper for `ggml_backend_buft_is_cann` (llama/18968) 2026-02-15 21:44:37 +02:00
hipudding 6a74f56212 CANN: implement quantized MUL_MAT_ID for MoE models (llama/19228)
Implement the ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on the CANN backend.

Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes

Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
2026-02-15 21:44:37 +02:00
Georgi Gerganov a36210c836 cuda : extend GGML_OP_PAD to work with non-cont src0 (llama/19429)
* cuda : extend GGML_OP_PAD to work with non-cont src0

* tests : add permuted pad
2026-02-15 21:44:37 +02:00
Oliver Simons 808904277e CUDA: Fix non-contig rope (llama/19338)
* Rename variables + fix rope_neox

Seems memory layout is shared with Vulkan so we can port fix from
https://github.com/ggml-org/llama.cpp/pull/19299

* Fix rope_multi

* Fix rope_vision

* Fix rope_norm

* Rename ne* to ne0* for consistent variable naming

* cont : consistent stride names

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 55d7cb2e93 metal : consolidate bin kernels (llama/19390)
* metal : refactor bin kernels

* cont

* cont : fix cv
2026-02-08 09:29:10 +02:00
Georgi Gerganov a9a0a51fba metal : fix event synchronization in cpy_tensor_async (llama/19402) 2026-02-08 09:29:10 +02:00
Abhijit Ramesh 1739af663a ggml-webgpu: JIT compile binary operators and handle binding overlaps (llama/19310)
* ggml webgpu: port binary operators to use pre-wgsl

* Add binary.wgsl: unified shader with conditionals for all 4 ops

* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor

* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)

* Update CMake to generate binary operator shaders at build time

* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling

* port binary operators from AOT to pre-wgsl JIT compilation

* add src1=dst overlap handling for binary ops

* use compile-time workgroup size defines instead of runtime overrides

* ggml-webgpu: complete overlap handling for binary ops

* add support for inplace & overlap case in binding setup

* restructure conditional logic to handle all overlap cases

* ensure all buffer bindings are correctly assigned for edge cases

* ggml-webgpu: remove unused binary overlap cases

Remove src0==src1 binary overlap case that never occurs in practice.

* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT

* remove unused src0==src1 and all-same variant

* refactor wgsl to eliminate duplication
2026-02-08 09:29:10 +02:00
Nechama Krashinski f2f7320817 sycl: add F16 support for GGML_OP_CEIL (llama/19306)
* Fix SYCL CEIL operator

* sycl: implement GGML_OP_CEIL
2026-02-08 09:29:10 +02:00
Jeff Bolz cea22b3075 vulkan: For coopmat2 FA, use fp16 accumulators for the final result (llama/19376)
The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change
does the same for Vulkan. This helps particularly with large head sizes, which
are very register-limited.

I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.

I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
2026-02-08 09:29:10 +02:00
Jeff Bolz c1b63354bb vulkan: make FA mask/softcap enables spec constants (llama/19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-08 09:29:10 +02:00
Georgi Gerganov 776cf61857 metal : skip loading all-zero mask (llama/19337)
* metal : skip loading all-zero mask

* cont : minor
2026-02-08 09:29:10 +02:00
Georgi Gerganov 2a7d5490f1 cuda : cuda graphs now compare all node params (llama/19383) 2026-02-08 09:29:10 +02:00
Georgi Gerganov 34d332aca5 metal : adaptive CPU/GPU interleave based on number of nodes (llama/19369) 2026-02-08 09:29:10 +02:00
Jeff Bolz a567c140a3 vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-08 09:29:10 +02:00
Georgi Gerganov 0781df2518 metal : add diag (llama/19330) 2026-02-08 09:29:10 +02:00