Commit Graph

4210 Commits

Author SHA1 Message Date
Aman Gupta 76f769d06f ggml-cpu: FA add GEMM microkernel (llama/19422)
* ggml-cpu: FA add GEMM microkernel

* add guard for sizeless vector types

* fix case where DV % GGML_F32_EPR !=0

* move memset out of the loop

* move another memset out of the loop

* use RM=4 for arm

* simd_gemm: convert everything to int

* convert everything to size_t to avoid warnings

* fixup

* add pragma for ignoring aggressive loop optimizations
2026-02-27 20:57:58 +02:00
SamareshSingh 7ee772ab2b cmake : fix KleidiAI install target failure with EXCLUDE_FROM_ALL (llama/19581)
* cmake: fix KleidiAI install target failure with EXCLUDE_FROM_ALL

Fix for the bug #19501 by adding EXCLUDE_FROM_ALL to FetchContent_Declare. This properly excludes KleidiAI from both build and install targets, preventing install failures when GGML_CPU_KLEIDIAI=ON is used.

The KleidiAI source files are still compiled into libggml-cpu.so, preserving all functionality.

* addressed code review comments
2026-02-27 20:57:58 +02:00
Georgi Gerganov 4bea3cd329 ggml : bump version to 0.9.7 (ggml/1425) 2026-02-27 20:57:58 +02:00
Dmitry Atamanov cec1dd9d12
examples : update miniaudio library to 0.11.24 (#3672) 2026-02-27 11:15:15 +01:00
Maxime Grenu 21411d81ea
docs : fix duplicate word typo in VAD section (#3670)
The VAD section contained a spurious 'the' at the end of a sentence,
creating the run-on 'Using this information the / only the speech
segments...'. Replace the orphaned 'the' with a comma so the sentence
reads correctly: 'Using this information, only the speech segments...'.
2026-02-19 16:18:42 +01:00
Georgi Gerganov 364c77f4ca talk-llama : sync llama.cpp 2026-02-15 21:44:37 +02:00
Georgi Gerganov 83f2ed19e1 sync : ggml 2026-02-15 21:44:37 +02:00
Georgi Gerganov 4ac70ce791 models : optimize qwen3next graph (llama/19375)
* models : optimizing qwen3next graph

* cont

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* cont : remove redundant q, g chunking

* minor

* minor

* avoid passing masks around

* avoid concats during chunking

* naming + shapes

* update names and use prefix to disable CUDA graphs
2026-02-15 21:44:37 +02:00
Adrien Gallouët 226e8c041c ggml : fix GGML_DEBUG with OpenMP (llama/19599)
last_graph is only available without OpenMP, but
ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-02-15 21:44:37 +02:00
Georgi Gerganov fbdac5119c metal : fix ACC op (llama/19427) 2026-02-15 21:44:37 +02:00
Jeff Bolz cc448def01 vulkan: support L2_NORM with contiguous rows (llama/19604) 2026-02-15 21:44:37 +02:00
Jeff Bolz 197e9ab6eb vulkan: support GGML_OP_SET (llama/19584) 2026-02-15 21:44:37 +02:00
Sophon fc6bbab817 vulkan: Add vendor id for Qualcomm drivers (llama/19569)
This commit allows Qualcomm native vulkan driver to be used on Windows
instead of Mesa Dozen.
2026-02-15 21:44:37 +02:00
Max Krasnyansky e6476d4c12 hexagon: further optimizations and refactoring for flash attention (llama/19583)
* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failurs due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.

---------

Co-authored-by: chraac <chraac@gmail.com>
2026-02-15 21:44:37 +02:00
Jeff Bolz ec57bf407c vulkan: restore -inf check in FA shaders (llama/19582) 2026-02-15 21:44:37 +02:00
Alberto Cabrera Pérez e8a25654b2 Fix wrong memcpy length for block_interleave == 4 (llama/19575) 2026-02-15 21:44:37 +02:00
ymcki 628b545b7e fix vulkan ggml_acc only works in 3d but not 4d (llama/19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Aman Gupta 58e3d5a42d CUDA: loop over ne2*ne3 in case it overflows (llama/19538)
* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
2026-02-15 21:44:37 +02:00
Oliver Simons 3eb4905af1 CUDA: Do not mutate cgraph for fused ADDs (llama/19566)
* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
   step as we store the properties before modifying the graph in-place
   in the cuda-backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 0e94faa19c metal : improve concurrency (llama/19555) 2026-02-15 21:44:37 +02:00
Georgi Gerganov c5325e50fc metal : support GGML_OP_SET (llama/19548) 2026-02-15 21:44:37 +02:00
Shupei Fan 195af60a8b hexagon: fix typo in vtcm_needs_release (llama/19545) 2026-02-15 21:44:37 +02:00
lhez 9f87eeccdf opencl: add basic support for q4_1 (llama/19534)
* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm
2026-02-15 21:44:37 +02:00
Georgi Gerganov d8e3e2ef08 metal : update sum_rows kernel to support float4 (llama/19524) 2026-02-15 21:44:37 +02:00
Mario Limonciello 39b5f414a3 Add a workaround for compilation with ROCWMMA_FATTN and gfx9 (llama/19461)
There is an upstream problem [1] with AMD's LLVM 22 fork and
rocWMMA 2.2.0 causing compilation issues on devices without
native fp16 support (CDNA devices).

The specialized types aren't resolved properly:
```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
 2549 |             using ARegsT = typename Impl::ARegsT;
```

Add a workaround to explicitly declare the types and cast when
compiling with HIP and ROCWMMA_FATTN [2].  When this is actually
fixed upstream some guards can be used to detect and wrap the
version that has the fix to only apply when necessary.

Link: https://github.com/ROCm/rocm-libraries/issues/4398 [1]
Link: https://github.com/ggml-org/llama.cpp/issues/19269 [2]

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2026-02-15 21:44:37 +02:00
Max Krasnyansky 304205679c hexagon: further optimization and tuning of matmul and dot kernels (llama/19407)
* ggml-hexagon: implement 2x2 matmul kernel

* hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4

* hexagon: fix editor config failures

* hexagon: refactor matmul ops to use context struct and remove wrappers

Also implement vec_dot_f16 2x2

* hexagon: refactor dyn quantizers to use mmctx

* hexagon: remove mm fastdiv from op_ctx

* hexagon: refactor matmul entry point to reduce code duplication

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
lhez 0326fd37dd opencl: add general Q6_K mm and Q4_K mv (llama/19347)
* opencl: add general q6_k mm

* opencl: refine condition for q6_K mm

* opencl: add general q4_K mv

* opencl: fix whitespace
2026-02-15 21:44:37 +02:00
Georgi Gerganov f3e78985be ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/19511)
* ggml : unary ops support non-cont src0

* metal : support F16 unary ops + fix ELU
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3ffa1fd84e metal : extend l2_norm support for non-cont src0 (llama/19502) 2026-02-15 21:44:37 +02:00
Max Krasnyansky 09587ceb12 hexagon: Add ARGSORT, DIV, SQR, SQRT, SUM_ROWS, GEGLU (llama/19406)
* hexagon: add ARGSORT op

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>

* hexagon: argsort reject tensors with huge rows for now

* Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend

* hexagon : Add GEGLU op

* hexagon: fix editor config check

* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA

---------

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>
2026-02-15 21:44:37 +02:00
Georgi Gerganov 3504358056 ggml : extend bin bcast for permuted src1 (llama/19484)
* tests : extend bin bcast for permuted src1

* cont : extend bin support

* cont : s0 is always 1

* tests : simplify
2026-02-15 21:44:37 +02:00
Georgi Gerganov de949fb1db metal : consolidate unary ops (llama/19490) 2026-02-15 21:44:37 +02:00
Oliver Simons 57c620b4b1 CUDA : Update CCCL-tag for 3.2 to final release from RC (llama/19486)
CCCL 3.2 has been released since it was added to llama.cpp as part of
the backend-sampling PR, and it makes sense to update from RC to final
released version.

https://github.com/NVIDIA/cccl/releases/tag/v3.2.0
2026-02-15 21:44:37 +02:00
Nikhil Jain 562255fd77 Plug memory leaks and free resources on shutdown (llama/19315)
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool

* Free pools

* Cleanup

* More cleanup

* Run clang-format

* Fix arg-parser and tokenizer test errors that free an unallocated buffer

* Fix device lost callback to not print on device teardown

* Fix include and run clang-format

* remove unused unused

* Update binary ops

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-02-15 21:44:37 +02:00
Alberto Cabrera Pérez d77265c818 ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (llama/19360)
* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention
2026-02-15 21:44:37 +02:00
k4ss4n b0fe2e84fa ggml : use noexcept overload for is_regular_file in backend registration (llama/19452)
using noexcept std::filesystem::directory_entry::is_regular_file
overload prevents abnormal termination upon throwing an error
(as caused by symlinks to non-existent folders on linux)

Resolves: #18560
2026-02-15 21:44:37 +02:00
Raul Torres 2de2fc9270 CANN: Remove unnecessary wrapper for `gml_backend_buft_is_cann` (llama/18968) 2026-02-15 21:44:37 +02:00
hipudding 6a74f56212 CANN: implement quantized MUL_MAT_ID for MoE models (llama/19228)
Implement ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on CANN backend.

Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes

Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
2026-02-15 21:44:37 +02:00
Georgi Gerganov a36210c836 cuda : extend GGML_OP_PAD to work with non-cont src0 (llama/19429)
* cuda : extend GGML_OP_PAD to work with non-cont src0

* tests : add permuted pad
2026-02-15 21:44:37 +02:00
Oliver Simons 808904277e CUDA: Fix non-contig rope (llama/19338)
* Rename variables + fix rope_neox

Seems memory layout is shared with Vulkan so we can port fix from
https://github.com/ggml-org/llama.cpp/pull/19299

* Fix rope_multi

* Fix rope_vision

* Fix rope_norm

* Rename ne* to ne0* for consistent variable naming

* cont : consistent stride names

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Nuno 764482c317
ci: add vulkan docker image (#3644)
Signed-off-by: rare-magma <rare-magma@posteo.eu>
2026-02-09 12:33:06 +02:00
Pádraic Slattery 052066c4f7
chore: Update outdated GitHub Actions versions (#3646) 2026-02-09 12:32:46 +02:00
Christian Kastner 525be69a66
cmake: Drop obsolete build-time configuration of backends (#3649)
The backend configuration now happens in ggml.

This updated configuration mirrors that of llama.cpp.
2026-02-09 12:32:18 +02:00
Sid Mohan eb27fa2252
server : fix hardcoded /inference path in default HTML page (#3639)
Closes #3596
2026-02-09 10:10:13 +02:00
Georgi Gerganov 193f7cdaaf
ci : try fix mirrors (#3655) 2026-02-09 09:59:22 +02:00
Georgi Gerganov 4b23ff249e talk-llama : sync llama.cpp 2026-02-08 09:29:10 +02:00
Georgi Gerganov b0e81c1a2e sync : ggml 2026-02-08 09:29:10 +02:00
Georgi Gerganov 55d7cb2e93 metal : consolidate bin kernels (llama/19390)
* metal : refactor bin kernels

* cont

* cont : fix cv
2026-02-08 09:29:10 +02:00
Georgi Gerganov a9a0a51fba metal : fix event synchronization in cpy_tensor_async (llama/19402) 2026-02-08 09:29:10 +02:00
Abhijit Ramesh 1739af663a ggml-webgpu: JIT compile binary operators and handle binding overlaps (llama/19310)
* ggml webgpu: port binary operators to use pre-wgsl

* Add binary.wgsl: unified shader with conditionals for all 4 ops

* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor

* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)

* Update CMake to generate binary operator shaders at build time

* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling

* port binary operators from AOT to pre-wgsl JIT compilation

* add src1=dst overlap handling for binary ops

* use compile-time workgroup size defines instead of runtime overrides

* ggml-webgpu: complete overlap handling for binary ops

* add support for inplace & overlap case in binding setup

* restructure conditional logic to handle all overlap cases

* ensure all buffer bindings are correctly assigned for edge cases

* ggml-webgpu: remove unused binary overlap cases

Remove src0==src1 binary overlap case that never occurs in practice.

* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT

* remove unused src0==src1 and all-same variant

* refactor wgsl to eliminate duplication
2026-02-08 09:29:10 +02:00