Commit Graph

4004 Commits

Author SHA1 Message Date
Jeff Bolz a567c140a3 vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-08 09:29:10 +02:00
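
The description above maps naturally to a tiny classifier. A minimal CPU-side sketch of the idea, assuming a contiguous f32 mask; the code values, block size, and names are illustrative, not the actual Vulkan shader:

```cpp
// Sketch: tag each mask block with a 2-bit code so the FA kernel can avoid
// loading blocks that are all -inf (fully masked, skippable) or all zero
// (no mask contribution, no load needed). Names/block size are hypothetical.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

enum mask_code : uint8_t { MASK_MIXED = 0, MASK_ALL_NEG_INF = 1, MASK_ALL_ZERO = 2 };

static std::vector<uint8_t> classify_mask_blocks(const float * mask, size_t n, size_t block) {
    std::vector<uint8_t> codes;
    for (size_t i = 0; i < n; i += block) {
        bool all_neg_inf = true;
        bool all_zero    = true;
        for (size_t j = i; j < i + block && j < n; ++j) {
            all_neg_inf = all_neg_inf && std::isinf(mask[j]) && mask[j] < 0.0f;
            all_zero    = all_zero    && mask[j] == 0.0f;
        }
        codes.push_back(all_neg_inf ? MASK_ALL_NEG_INF : all_zero ? MASK_ALL_ZERO : MASK_MIXED);
    }
    return codes;
}
```

The kernel would then branch on the code: skip fully masked blocks outright and skip the mask load (and add) for all-zero blocks, only touching mask memory in the mixed case.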
Georgi Gerganov 0781df2518 metal : add diag (llama/19330) 2026-02-08 09:29:10 +02:00
Oleksandr Kuvshynov 932def3198 vulkan: fix GPU deduplication logic. (llama/19222)
* vulkan: fix GPU deduplication logic.

As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for Windows + Intel iGPUs.

Let's just avoid filtering for MoltenVK, which is Apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.

Verified that macOS + 4x Vega still reports 4 GPUs with this version.

* vulkan: only skip dedup when both drivers are MoltenVK
2026-02-08 09:29:10 +02:00
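
A sketch of the resulting dedup policy, under the assumption that each enumerated device carries a (UUID, driver) pair; the types and driver-name check are illustrative, not the real Vulkan API:

```cpp
// Sketch: dedup purely by UUID, except when both candidates report the
// MoltenVK driver, which can hand out the same UUID for distinct GPUs.
#include <string>

struct vk_device_info {
    std::string uuid;
    std::string driver_name;
};

static bool is_duplicate(const vk_device_info & a, const vk_device_info & b) {
    const bool both_moltenvk = a.driver_name == "MoltenVK" && b.driver_name == "MoltenVK";
    if (both_moltenvk) {
        return false; // MoltenVK may reuse UUIDs across physical GPUs; keep both
    }
    return a.uuid == b.uuid;
}
```

Dedup thus stays UUID-based everywhere, while two MoltenVK devices are never merged.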
Jeff Bolz 5a786f7648 vulkan: Set k_load_shmem to false when K is too large (llama/19301) 2026-02-08 09:29:10 +02:00
Jeff Bolz e0a3f393ad vulkan: fix non-contig rope (llama/19299) 2026-02-08 09:29:10 +02:00
will-lms eecc9bfa69 metal : add missing includes (llama/19348) 2026-02-08 09:29:10 +02:00
Kevin Pouget 2763054f99 ggml-virtgpu: make the code thread safe (llama/19204)
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function

* ggml-virtgpu: deprecate buffer_type is_host remoting

not necessary

* ggml-virtgpu: stop using static vars as cache

The static init isn't thread safe.

* ggml-virtgpu: protect the use of the shared memory to transfer data

* ggml-virtgpu: make the remote calls thread-safe

* ggml-virtgpu: backend: don't continue if the tensor memory couldn't be allocated

* ggml-virtgpu: add a cleanup function for consistency

* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing

* fix style and ordering

* Remove the static variable in apir_device_get_count

* ggml-virtgpu: improve the logging

* fix review minor formatting changes
2026-02-08 09:29:10 +02:00
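
For the shared-memory and remote-call bullets above, a minimal sketch of the usual fix, assuming a single shared transfer buffer; all names here are hypothetical:

```cpp
// Sketch: serialize access to the one shared-memory transfer buffer with a
// mutex instead of relying on static-variable initialization as a cache.
#include <cstddef>
#include <cstring>
#include <mutex>

static std::mutex g_shmem_mutex;

void remote_call_with_payload(void * shmem, const void * data, size_t size) {
    std::lock_guard<std::mutex> lock(g_shmem_mutex); // one in-flight transfer at a time
    std::memcpy(shmem, data, size);
    // ... issue the virtgpu remote call and wait for completion ...
}
```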
Aman Gupta 4685ec9555 ggml-cpu: use LUT for converting e8->f32 scales on x86 (llama/19288)
* ggml-cpu: use LUT for converting e8->f32 scales on x86

* add dispatch based on macro
2026-02-08 09:29:10 +02:00
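
The LUT trick works because an e8 scale is just an 8-bit biased exponent, so there are only 256 possible f32 values. A sketch, assuming an E8M0-style format with bias 127 (the exact format/bias is an assumption here):

```cpp
// Sketch: precompute all 256 e8 -> f32 scale values once, so conversion in
// the hot loop is a single table load instead of exponent math.
#include <cmath>
#include <cstdint>

static float e8_to_f32_lut[256];

static void init_e8_lut() {
    for (int e = 0; e < 256; ++e) {
        e8_to_f32_lut[e] = std::ldexp(1.0f, e - 127); // 2^(e - 127)
    }
}

static inline float e8_to_f32(uint8_t e) { return e8_to_f32_lut[e]; }
```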
Georgi Gerganov 5dda94dd2e metal : add solve_tri (llama/19302) 2026-02-08 09:29:10 +02:00
Ruben Ortlam aa34558b6f vulkan: disable coopmat1 fa on Nvidia Turing (llama/19290) 2026-02-08 09:29:10 +02:00
Aman Gupta 8eede801e3 CUDA: use mmvq for mul-mat-id for small batch sizes (llama/18958)
* CUDA: use mmvq for mul-mat-id for small batch sizes

* add mmvq too

* Fix perf issue on Ampere. Use mmvf mm-id only for non-Nvidia GPUs

* templatize multi_token_path
2026-02-08 09:29:10 +02:00
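
A sketch of the dispatch shape this implies; the threshold value and names are invented for illustration:

```cpp
// Sketch: below a small batch-size cutoff the quantized mat-vec kernel (mmvq)
// beats the tiled mat-mat kernel for mul-mat-id, so pick the kernel per call.
constexpr int MMVQ_MAX_BATCH_SIZE = 8; // hypothetical cutoff

enum class mmid_kernel { MMVQ, MMQ };

mmid_kernel choose_mul_mat_id_kernel(int n_tokens) {
    if (n_tokens <= MMVQ_MAX_BATCH_SIZE) {
        return mmid_kernel::MMVQ; // per-row dot products, low launch overhead
    }
    return mmid_kernel::MMQ;      // tiled kernel amortizes over larger batches
}
```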
Georgi Gerganov ce8a2da620 metal : minor cleanup (llama/19251) 2026-02-08 09:29:10 +02:00
Oliver Simons 698265d754 CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (llama/19053)
By providing the stride_* variables as size_t (i.e., 64-bit), the compiler can
correctly unroll the [two for-loops](557515be1e/ggml/src/ggml-cuda/mmq.cuh (L3789-L3816))
on BW. This improves performance for the prefill/pp phase on BW while not affecting
other SMs:

| GPU                                                     | Model                 | Test   |   t/s master |   t/s osimons/fix_bw_mmq_fixup_kernel |   Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation                          | gpt-oss 20B MXFP4 MoE | pp8096 |      8404.05 |                               8375.79 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 3B Q4_K_M       | pp8096 |     16148.93 |                              16019.60 |      0.99 |
| NVIDIA RTX 6000 Ada Generation                          | llama 8B Q4_0         | pp8096 |      8008.29 |                               7978.80 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B BF16    | pp8096 |      4263.16 |                               4248.53 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B Q4_K_M  | pp8096 |      5165.11 |                               5157.43 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 |     12582.80 |                              12758.37 |      1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M       | pp8096 |     16879.10 |                              17619.47 |      1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0         | pp8096 |     10649.90 |                              10982.65 |      1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16    | pp8096 |      7717.73 |                               7716.22 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M  | pp8096 |      7301.90 |                               7370.38 |      1.01 |
2026-02-08 09:29:10 +02:00
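
A rough illustration of why the stride type matters, as a hedged sketch rather than the actual mmq fixup kernel:

```cpp
// Hypothetical fixup loop, not the real mmq.cuh code. With size_t (64-bit)
// strides the compiler needs no wrap-around guards in the address math and
// can fully unroll the fixed-trip-count loop; narrower unsigned strides have
// defined wrap semantics that can block this. '#pragma unroll' is understood
// by nvcc and clang.
#include <cstddef>

void stream_k_fixup(float * dst, const float * partial,
                    size_t stride_dst, size_t stride_src, int rows) {
#pragma unroll
    for (int i = 0; i < 8; ++i) {              // fixed trip count -> unrollable
        if (i < rows) {
            dst[i * stride_dst] += partial[i * stride_src];
        }
    }
}
```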
George 57107b2bf8 ggml: added cleanups in ggml_quantize_free (llama/19278)
Add missing cleanup calls for the IQ2_S and IQ1_M quantization types, and for IQ3XS with 512 blocks, in ggml_quantize_free.
2026-02-08 09:29:10 +02:00
Gaurav Garg 6ec362d2e0 cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (llama/19227)
Hangs were reported on Jetson Orin AGX when CUDA_SCALE_LAUNCH_QUEUES=4x was set. Revert the previous PR (#19042) and update the documentation to suggest setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
2026-02-08 09:29:10 +02:00
lhez 591072fcc8 opencl: refactor some ops, concat, repeat, tanh and scale (llama/19226)
* opencl: refactor concat

* opencl: refactor repeat

* opencl: refactor tanh

* opencl: enable fp16 for tanh

* opencl: refactor scale

* opencl: fix unused variables
2026-02-08 09:29:10 +02:00
Aman Gupta 871063016d ggml-cpu: FA split across kv for faster TG (llama/19209)
* ggml-cpu: split across kv for faster TG

* simplify sinks application

* add ref impl
2026-02-08 09:29:10 +02:00
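
Splitting flash attention across the KV dimension requires merging partial softmax results numerically. A sketch of the standard merge, which matches the technique named in the commit though not necessarily its exact code:

```cpp
// Sketch: each KV chunk carries its running logit max m, softmax denominator
// s, and an un-normalized output accumulator o; merging rescales both sides
// by exp of the gap to the new max, so the result matches a single pass.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct fa_partial {
    float m;               // running max of logits in this KV chunk
    float s;               // softmax denominator for this chunk
    std::vector<float> o;  // un-normalized output accumulator (head_dim)
};

static fa_partial merge(const fa_partial & a, const fa_partial & b) {
    fa_partial r;
    r.m = std::max(a.m, b.m);
    const float ca = std::exp(a.m - r.m);
    const float cb = std::exp(b.m - r.m);
    r.s = a.s * ca + b.s * cb;
    r.o.resize(a.o.size());
    for (size_t i = 0; i < a.o.size(); ++i) {
        r.o[i] = a.o[i] * ca + b.o[i] * cb;
    }
    return r; // final output = r.o / r.s once all chunks are merged
}
```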
Neo Zhang c4003da2b8 Remove support for Nvidia & AMD GPUs, because the oneAPI plugin for them is unavailable: the download/installation channels no longer work. (llama/19246)
Users can't build the software for Nvidia & AMD GPUs.
Remove oneMath since it is only used in the NV and AMD code paths.
2026-02-08 09:29:10 +02:00
Tamar 74353e90a1 sycl: implement GGML_OP_TOP_K (llama/19242) 2026-02-08 09:29:10 +02:00
Georgi Gerganov 73e04555eb metal : support virtual devices (llama/18919)
* metal : support virtual devices

* cont : manage buffer type context memory

* metal : add events

* cont : implement cpy_tensor_async
2026-02-08 09:29:10 +02:00
Johannes Gäßler 625c8d863e ggml-backend: fix async set/get fallback sync (llama/19179) 2026-02-08 09:29:10 +02:00
Christian Kastner 0e219ebf89 docs : Minor cleanups (llama/19252)
* Update old URLs to github.com/ggml-org/

* Bump copyrights
2026-02-08 09:29:10 +02:00
Nikhil Jain a0256b8159 Remove pipeline cache mutexes (llama/19195)
* Remove mutex for pipeline caches, since they are now per-thread.

* Add comment

* Run clang-format

* Cleanup

* Run CI again

* Run CI once more

* Run clang-format
2026-02-08 09:29:10 +02:00
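
A sketch of the per-thread cache shape that makes the mutexes unnecessary; key and value types are placeholders, not the actual Vulkan pipeline structs:

```cpp
// Sketch: each thread owns its own pipeline map, so lookups need no locking.
#include <string>
#include <unordered_map>

struct pipeline { /* compiled pipeline state would live here */ };

static pipeline * create_pipeline(const std::string & /*key*/) {
    return new pipeline(); // stand-in for the expensive compile step
}

static pipeline * get_pipeline(const std::string & key) {
    thread_local std::unordered_map<std::string, pipeline *> cache; // per-thread
    auto it = cache.find(key);
    if (it != cache.end()) {
        return it->second;
    }
    pipeline * p = create_pipeline(key);
    cache.emplace(key, p);
    return p;
}
```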
Max Krasnyansky aca5953d8d Bump cmake max version (needed for Windows on Snapdragon builds) (llama/19188)
* Bump max cmake version (needed for Windows on Snapdragon builds)

* cmake: move max version setting into ggml/CMakeLists
2026-02-08 09:29:10 +02:00
nullname 9b927dd849 ggml-hexagon: flash-attention and reduce-sum optimizations (llama/19141)
* wip

* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation

* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations

* wip

* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance

* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability

* optimize vector dot product functions to use unified reduction for improved performance

* hexagon: optimize reduce-sum for v75+

* hexagon: always keep row_sums in sf/fp32

* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT

* fix compiling error after rebase

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-02-08 09:29:10 +02:00
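
A plain C++ stand-in for the "unified reduction" idea in the dot-product bullets above: keep several independent accumulators through the loop and do one horizontal reduction at the end (the real code uses HVX vector intrinsics):

```cpp
// Sketch: independent accumulators break the serial add dependency; a single
// final reduction replaces per-iteration reductions.
#include <cstddef>

static float dot_unified_reduction(const float * a, const float * b, size_t n) {
    float acc[4] = {0, 0, 0, 0};                        // independent lanes
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc[0] += a[i + 0] * b[i + 0];
        acc[1] += a[i + 1] * b[i + 1];
        acc[2] += a[i + 2] * b[i + 2];
        acc[3] += a[i + 3] * b[i + 3];
    }
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);  // one final reduction
    for (; i < n; ++i) {
        sum += a[i] * b[i];                             // scalar tail
    }
    return sum;
}
```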
shaofeiqi db9c88744d opencl: add optimized q8_0 mm kernel for adreno (llama/18871)
* Add Q8_0 OpenCL kernel

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>

* opencl: fix build for non-adreno

* opencl: refactor q8_0

* opencl: enforce subgroup size of 64 for adreno for q8_0

* For A750 and older generations, subgroup size can be 64 or 128.
  This kernel assumes subgroup size 64.

* opencl: suppress warning when adreno kernels are disabled

---------

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-02-08 09:29:10 +02:00
Simon Redman efd6344939 Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b (llama/19194) 2026-02-08 09:29:10 +02:00
Georgi Gerganov 06e3750407 ggml : bump version to 0.9.6 (ggml/1423) 2026-02-08 09:29:10 +02:00
Georgi Gerganov fc1a3e579e cmake : remove unused file (ggml/1419) 2026-02-08 09:29:10 +02:00
KITAITI Makoto 941bdabbe4
ruby : add `Whisper::Context::Params`, fix token memory management (#3647)
* Don't convert to temporary VALUE

* Define Whisper::Context::Params

* Add test for Whisper::Context::Params

* Implement Whisper::Context::Params

* Add tests for Context::Params

* Fix Whisper::Token memory management

* Add test for token_timestamps

* Make Context accept Context::Params

* Make Context::Params.new accept keyword args

* Add test for Context::Params.new with keyword args

* Add signature of Context::Params

* Add example for Whisper::Token

* Fix typos

* Revert "Don't convert to temporary VALUE"

This reverts commit dee66e7384.

* Hold Token#text as Ruby object

* Don't use pointer for ruby_whisper_context_params.params

* Use RUBY_DEFAULT_FREE instead of custom function

* Update bindings/ruby/README.md

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Add document for Whisper::Context::Params

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2026-02-04 20:33:09 +09:00
KITAITI Makoto aa1bc0d1a6
ruby : add `VAD::Context#segments_from_samples`, allow Pathname, etc. (#3633)
* ruby : Bump version to 1.3.6

* Fix code in example

* Add sample code to transcribe from MemoryView

* Define GetVADContext macro

* Use GetVADContext

* Extract parse_full_args function

* Use parse_full_args in ruby_whisper_full_parallel

* Free samples after use

* Check return value of parse_full_args()

* Define GetVADParams macro

* Add VAD::Context#segments_from_samples

* Add tests for VAD::Context#segments_from_samples

* Add signature for VAD::Context#segments_from_samples

* Add sample code for VAD::Context#segments_from_samples

* Add test for Whisper::Context#transcribe with Pathname

* Make Whisper::Context#transcribe and Whisper::VAD::Context#detect accept Pathname

* Update signature of Whisper::Context#transcribe

* Fix variable name

* Don't free memory view

* Make parse_full_args return struct

* Fallback when failed to get MemoryView

* Add num of samples when too long

* Check members of MemoryView

* Fix a typo

* Remove unnecessary include

* Fix a typo

* Fix a typo

* Handle the case where the MemoryView doesn't fit the spec

* Add TODO comment

* Add optimization option to compiler flags

* Use ALLOC_N instead of malloc

* Add description to sample code

* Rename and change args: parse_full_args -> parse_samples

* Free samples when exception raised

* Assign type check result to a variable

* Define wrapper function of whisper_full

* Change signature of parse_samples for rb_ensure

* Ensure release MemoryView

* Extract fill_samples function

* Free samples memory when filling it failed

* Free samples memory when transcription failed

* Prepare transcription in wrapper function

* Change function name

* Simplify function boundary
2026-01-30 22:59:36 +09:00
Frieder Bluemle bf422cb704
scripts : Fix dSYMs path case for macOS xcframework build (#3630)
The script creates dSYMs/ but references dSYMS/ for macOS, causing
build failures on case-sensitive filesystems.
2026-01-30 15:57:26 +02:00
Georgi Gerganov acbace0571 cuda : fix compile warnings (#0) 2026-01-30 15:56:40 +02:00
Georgi Gerganov 953e503fd9 talk-llama : sync llama.cpp 2026-01-30 15:56:40 +02:00
Georgi Gerganov b529c0610f sync : ggml 2026-01-30 15:56:40 +02:00
bssrdf 5dca0db99c add tensor type checking as part of cuda graph properties (llama/19186) 2026-01-30 15:56:40 +02:00
s8322 2a16e7a67f sycl: implement GGML_UNARY_OP_SOFTPLUS (llama/19114)
* sycl: add softplus unary op implementation

* sycl: add softplus unary op implementation

* docs(ops): mark SYCL SOFTPLUS as supported

* docs: update SYCL status for SOFTPLUS
2026-01-30 15:56:40 +02:00
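
For reference, a numerically stable scalar softplus; the clamping thresholds are a common convention, not taken from the PR:

```cpp
// softplus(x) = log(1 + exp(x)), computed piecewise to avoid overflow of
// exp(x) for large x and to keep full precision for very negative x.
#include <cmath>

static float softplus_f32(float x) {
    if (x > 20.0f)  return x;             // log1p(exp(x)) = x + log1p(exp(-x)) ~ x
    if (x < -20.0f) return std::exp(x);   // log1p(exp(x)) ~ exp(x)
    return std::log1p(std::exp(x));
}
```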
RachelMantel 1b3c27efae sycl: implement GGML_OP_TRI (llama/19089)
* sycl: implement GGML_OP_TRI

* docs: update ops.md for SYCL TRI

* docs: regenerate ops.md

* docs: update SYCL support for GGML_OP_TRI
2026-01-30 15:56:40 +02:00
Zheyuan Chen 829e70044b ggml-webgpu: improve flash attention performance by software pipelining (llama/19151)
* webgpu : pipeline flash_attn Q/K loads in WGSL

* ggml-webgpu: unroll Q*K accumulation inner loop

* ggml-webgpu: vectorization

* ggml-webgpu: unrolling

* ggml-webgpu: remove redundant unrolling

* ggml-webgpu: restore the config

* ggml-webgpu: remove redundant comments

* ggml-webgpu: formatting

* ggml-webgpu: formatting and remove vectorization

* ggml-webgpu: remove unnecessary constants

* ggml-webgpu: change QKV buffer to read_write to pass validation

* ggml-webgpu: add explanation for the additional bracket around Q K accumulate

* Indentation and for -> if for tail

* Kick off CI on wgsl only commits

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-01-30 15:56:40 +02:00
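
The software-pipelining idea from the title, sketched in C++ pseudocode rather than WGSL: issue the next K-tile load before computing on the current one, so memory latency overlaps with the Q*K accumulation. Everything here is schematic:

```cpp
// Sketch: double-buffered tile loop. Types and loads are trivial stand-ins
// for the shader's shared-memory tiles.
#include <cstddef>

struct tile { float data[16]; };                   // placeholder tile type

static tile load_k_tile(size_t i) {                // stand-in for a tile load
    tile t{};
    t.data[0] = (float) i;
    return t;
}

static void accumulate_qk(const tile & /*q*/, const tile & /*k*/) {
    // Q*K dot products would accumulate here
}

static void fa_pipelined(const tile & q, size_t n_tiles) {
    if (n_tiles == 0) return;
    tile cur = load_k_tile(0);
    for (size_t i = 0; i < n_tiles; ++i) {
        tile next{};
        if (i + 1 < n_tiles) {
            next = load_k_tile(i + 1);             // issue the next load early...
        }
        accumulate_qk(q, cur);                     // ...so compute overlaps it
        cur = next;
    }
}
```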
Todor Boinovski 2a89a3f35c hexagon: enable offloading to Hexagon on Windows on Snapdragon (llama/19150)
* hexagon: updates to enable offloading to HTP on WoS

* Update windows.md

* hexagon: enable -O3 optimizations

* hexagon: move all _WINDOWS conditional compilation to _WIN32

* hexagon: updates to enable offloading to HTP on WoS

* hexagon: use run-time vs load-time dynamic linking for cdsp driver interface

* refactor htp-drv

* hexagon: add run-bench.ps1 script

* hexagon: htdrv refactor

* hexagon: unify Android and Windows build readmes

* hexagon: update README.md

* hexagon: refactor htpdrv

* hexagon: drv refactor

* hexagon: more drv refactor

* hexagon: fixes for android builds

* hexagon: factor out dl into ggml-backend-dl

* hexagon: add run-tool.ps1 script

* hexagon: merge htp-utils in htp-drv and remove unused code

* wos: no need for getopt_custom.h

* wos: add missing CR in htpdrv

* hexagon: ndev enforcement applies only to Android devices

* hexagon: add support for generating and signing .cat file

* hexagon: add .inf file

* hexagon: working auto-signing and improved windows builds

* hexagon: further improve skel build

* hexagon: add rough WoS guide

* hexagon: updated windows guide

* hexagon: improve cmake handling of certs and logging

* hexagon: improve windows setup/build doc

* hexagon: more windows readme updates

* hexagon: windows readme updates

* Update windows.md

* snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon

Also added a PowerShell script to simplify build env setup.

* hexagon: remove trailing whitespace and move cmake requirement to user-presets

* hexagon: fix CMakeUserPresets path in workflow yaml

* hexagon: introduce local version of libdl.h

* hexagon: fix src1 reuse logic

gpt-oss needs a bigger lookahead window.
The check for src[1] itself being quantized was wrong.

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-01-30 15:56:40 +02:00
Georgi Gerganov b997e690ef cuda : fix nkvo, offload and cuda graph node properties matching (llama/19165)
* cuda : fix nkvo

* cont : more robust cuda graph node property matching

* cont : restore pre-leafs implementation

* cont : comments + static_assert
2026-01-30 15:56:40 +02:00
yulo 34a3e28a08 HIP: add mmf for CDNA (llama/18896)
* refactor mmf rows_per_block

* speed up compile

* pass cdna compile

* fix cuda error

* clean up mmf

* f32 mmf

* clean float mma

* fix mmf error

* faster mmf

* extend tile k

* fix compile error

* Revert "extend tile k"

This reverts commit 4d2ef3d483932659801a59a5af0b6b48f6ffd5c7.

* fix smem overflow

* speed up compiling mmf

* speed up compile for hip

* 512 block for cdna

* config pad size

* fix as comment

* update select logic

* move some code to cuh

* fix as comment

* correct cdna3 config

---------

Co-authored-by: zhang hui <you@example.com>
2026-01-30 15:56:40 +02:00
Vishal Singh e0a2182970 ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (llama/19159) 2026-01-30 15:56:40 +02:00
Aman Gupta 62ba8b537f CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (llama/19126) 2026-01-30 15:56:40 +02:00
Neo Zhang f0e85bb142 sycl: fix norm kernels (l2_norm, group_norm, rms_norm) by removing an assert to support more cases (llama/19154)
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-30 15:56:40 +02:00
Ruben Ortlam 33148bb523 Vulkan Flash Attention Coopmat1 Refactor (llama/19075)
* vulkan: use coopmat for flash attention p*v matrix multiplication

* fix P loading issue

* fix barrier position

* remove reduction that is no longer needed

* move max thread reduction into loop

* remove osh padding

* add bounds checks and padding

* remove unused code

* fix shmem sizes, loop duration and accesses

* don't overwrite Qf, add new shared psh buffer instead

* add missing bounds checks

* use subgroup reductions

* optimize

* move bounds check, reduce barriers

* support other Bc values and other subgroup sizes

* remove D_split

* replace Of register array with shared memory Ofsh array

* parallelize HSV across the rowgroups

* go back to Of in registers, not shmem

* vectorize sfsh

* don't store entire K tile in shmem

* fixes

* load large k tiles to shmem on Nvidia

* adapt shared memory host check function to shader changes

* remove Bc 32 case

* remove unused variable

* fix missing mask reduction tmspsh barrier

* fix mask bounds check

* fix rowmax f16 under/overflow to inf

* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
2026-01-30 15:56:40 +02:00
Patryk Kaminski cc0c103b5d ggml-sycl: remove unused syclcompat header (llama/19140)
syclcompat/math.hpp is not used anymore. The change that introduced it was successfully reverted (https://github.com/ggml-org/llama.cpp/pull/17826).
This include path becomes obsolete and will be dropped in oneAPI 2026.0, effectively breaking ggml-sycl builds.
2026-01-30 15:56:40 +02:00
Oleksandr Kuvshynov dda7d9cd1c vulkan: handle device dedup on macOS + Vega II Duo cards (llama/19058)
Deduplication here relied on Vulkan returning a unique UUID for each
physical GPU, which at the moment is not always the case.
On a Mac Pro 2019 running macOS with 2 Vega II Duo cards (so, 4 GPUs total),
MoltenVK assigns the same UUID to pairs of GPUs unless they
are connected with Infinity Fabric.

See more details here: KhronosGroup/MoltenVK#2683.

The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of the 4 GPUs in such a configuration.

The deduplication logic here is changed to only filter GPUs if the UUID is
the same but the driver is different.
2026-01-30 15:56:40 +02:00
Kevin Pouget 531d7b6781 ggml: new backend for Virglrenderer API Remoting acceleration (v2) (llama/18718) 2026-01-30 15:56:40 +02:00
Alberto Cabrera Pérez 3701413a71 ggml-cpu: arm64: Q4_K scale unroll and vectorization (llama/19108) 2026-01-30 15:56:40 +02:00