Commit Graph

324 Commits

Author SHA1 Message Date
ymcki 628b545b7e fix vulkan ggml_acc only works in 3d but not 4d (llama/19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Jeff Bolz c1b63354bb vulkan: make FA mask/softcap enables spec constants (llama/19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-08 09:29:10 +02:00
Jeff Bolz a567c140a3 vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-08 09:29:10 +02:00
Oleksandr Kuvshynov 932def3198 vulkan: fix GPU deduplication logic. (llama/19222)
* vulkan: fix GPU deduplication logic.

As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for windows+intel igpu.

Let's just avoid filtering for MoltenVK which is apple-specific, and
keep the logic the  same as before 88d23ad5 - just dedup based on UUID.

Verified that MacOS + 4xVega still reports 4 GPUs with this version.

* vulkan: only skip dedup when both drivers are moltenVk
2026-02-08 09:29:10 +02:00
Jeff Bolz 5a786f7648 vulkan: Set k_load_shmem to false when K is too large (llama/19301) 2026-02-08 09:29:10 +02:00
Jeff Bolz e0a3f393ad vulkan: fix non-contig rope (llama/19299) 2026-02-08 09:29:10 +02:00
Ruben Ortlam aa34558b6f vulkan: disable coopmat1 fa on Nvidia Turing (llama/19290) 2026-02-08 09:29:10 +02:00
Simon Redman efd6344939 Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b (llama/19194) 2026-02-08 09:29:10 +02:00
Ruben Ortlam 33148bb523 Vulkan Flash Attention Coopmat1 Refactor (llama/19075)
* vulkan: use coopmat for flash attention p*v matrix multiplication

* fix P loading issue

* fix barrier position

* remove reduction that is no longer needed

* move max thread reduction into loop

* remove osh padding

* add bounds checks and padding

* remove unused code

* fix shmem sizes, loop duration and accesses

* don't overwrite Qf, add new shared psh buffer instead

* add missing bounds checks

* use subgroup reductions

* optimize

* move bounds check, reduce barriers

* support other Bc values and other subgroup sizes

* remove D_split

* replace Of register array with shared memory Ofsh array

* parallelize HSV across the rowgroups

* go back to Of in registers, not shmem

* vectorize sfsh

* don't store entire K tile in shmem

* fixes

* load large k tiles to shmem on Nvidia

* adapt shared memory host check function to shader changes

* remove Bc 32 case

* remove unused variable

* fix missing mask reduction tmspsh barrier

* fix mask bounds check

* fix rowmax f16 under/overflow to inf

* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
2026-01-30 15:56:40 +02:00
Oleksandr Kuvshynov dda7d9cd1c vulkan: handle device dedup on MacOS + Vega II Duo cards (llama/19058)
Deduplication here relied on the fact that vulkan would return unique
UUID for different physical GPUs. It is at the moment not always the case.
On Mac Pro 2019 running Mac OS, with 2 Vega II Duo cards (so, 4 GPU total),
MotlenVK would assign same UUID to pairs of GPUs, unless they
are connected with Infinity Fabric.

See more details here: KhronosGroup/MoltenVK#2683.

The right way is to fix that in MoltenVK, but until it is fixed,
llama.cpp would only recognize 2 of 4 GPUs in such configuration.

The deduplication logic here is changed to only filter GPUs if UUID is
same but driver is different.
2026-01-30 15:56:40 +02:00
Jeff Bolz b7e323f40b vulkan: Remove transfer_ctx, do everything in compute_ctx. (llama/18945)
* vulkan: Remove transfer_ctx, do everything in compute_ctx.

We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.

Remove transfer_cmd_pool, which was already unused.

* fix crash with perf logger
2026-01-30 15:56:40 +02:00
Jeff Bolz b2bc4d810b vulkan: support flash attention GQA/split_k with small batches (llama/18938) 2026-01-30 15:56:40 +02:00
Masato Nakasaka 3bbf4ced47 Revert "vulkan: force full subgroups for flash attention to fix intel subgroup crash (#17356)" (llama/18831)
This reverts commit 980b7cd17e055c8c587f79ffda7eb4fddf405566.
2026-01-30 15:56:40 +02:00
Jeff Bolz 660d943ff8 vulkan: Use mul_mat_vec_id for small values of n (llama/18918)
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
2026-01-30 15:56:40 +02:00
Georgi Gerganov 47f3e3b927 ggml : add ggml_build_forward_select (llama/18550)
* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment
2026-01-30 15:56:40 +02:00
Jeff Bolz 4b155e9bfb vulkan: Check maxStorageBufferRange in supports_op (llama/18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-30 15:56:40 +02:00
Jeff Bolz ab1828dc1c vulkan: change memory_logger to be controlled by an env var (llama/18769) 2026-01-14 09:11:59 +02:00
Jeff Bolz aedf332ec5 vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-14 09:11:59 +02:00
Ruben Ortlam 716d68aca9 vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (llama/18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-14 09:11:59 +02:00
Ruben Ortlam c0433783c3 Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (llama/18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-14 09:11:59 +02:00
Jeff Bolz 0bc0e5616e vulkan: fix push constant size for quantize_q8_1 (llama/18687)
I added an assert to catch further mismatches, and it found several.
Fix those, too.
2026-01-14 09:11:59 +02:00
Jeff Bolz 678c660e62 vulkan: optimize ssm_scan (llama/18630)
* vulkan: optimize ssm_scan

* fix warp vs subgroup naming
2026-01-14 09:11:59 +02:00
Doctor Shotgun b9965c89a1 ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (llama/18535)
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-14 09:11:59 +02:00
Jeff Bolz a91ab72bd9 vulkan: reject ops when a tensor is too large to allocate (llama/18646) 2026-01-14 09:11:59 +02:00
virajwad 096e7e911a vulkan: Warptile tuning for Intel Xe2/Xe3 (llama/18178)
* modify warptile tuning for xe3

* intel vendor check w/ coopmat support

* fix back formatting

* fix formatting change 2

* move intel check to chip specific tuning part

* Change to support both windows and linux

* modify m_warptile to l_warptile for intel

* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)

* Code style changes

* Code style changes (2)

* Code style changes (3)
2026-01-14 09:11:59 +02:00
Jeff Bolz dbec71f6cf vulkan: support buffer_from_host_ptr (llama/18467)
* vulkan: support buffer_from_host_ptr

* hacky use of buffer_from_host_ptr for directio

* disable buffer_from_host_ptr cap

* use external memory for ggml_vk_host_malloc, revert model loader changes

* disable external_memory_host for MoltenVK

* take buffer memory types into account

* don't use external_memory_host for ggml_vk_host_malloc
2026-01-14 09:11:59 +02:00
Jeff Bolz 0a99b4c377 vulkan: handle quantize_q8_1 overflowing the max workgroup count (llama/18515)
* vulkan: handle quantize_q8_1 overflowing the max workgroup count

* vulkan: Fix small tile size matmul on lavapipe

* fix mul_mat_id failures
2026-01-14 09:11:59 +02:00
Jeff Bolz 9d83865607 vulkan: Optimize GGML_OP_CUMSUM (llama/18417)
* vulkan: Optimize GGML_OP_CUMSUM

There are two paths: The preexisting one that does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block, the second adds
the block partials to compute the final result. The multipass shader is used
when there are a small number of large rows.

In the whole-row shader, handle multiple elements per invocation.

* use 2 ELEM_PER_THREAD for AMD/Intel

* address feedback
2026-01-14 09:11:59 +02:00
Jeff Bolz b7ff521e71 vulkan: Implement mmvq for iq1_s/iq1_m (llama/18450) 2026-01-14 09:11:59 +02:00
Jeff Bolz b1f65a4a7e vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (llama/18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
2026-01-14 09:11:59 +02:00
Jeff Bolz 015b618d96 vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (llama/18352)
Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
2025-12-31 17:52:09 +02:00
Jeff Bolz e37c8ed94e vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (llama/18349)
* vulkan: Use BK=32 for coopmat2 mul_mat_id

* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.

Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
2025-12-31 17:52:09 +02:00
Jeff Bolz 331c6ccd31 vulkan: Use BK=32 for coopmat2 mul_mat_id (llama/18332) 2025-12-31 17:52:09 +02:00
Jeff Bolz 181e36f194 vulkan: Support UPSCALE w/antialias (llama/18327) 2025-12-31 17:52:09 +02:00
Jeff Bolz 67473fef57 vulkan: handle rope with large number of rows (llama/18306) 2025-12-31 17:52:09 +02:00
Jeff Bolz f863735caa vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (llama/18302) 2025-12-31 17:52:09 +02:00
Ruben Ortlam 1356600679 vulkan: use fewer FA rows for small cache runs (llama/18280) 2025-12-31 17:52:09 +02:00
Jeff Bolz dbbe6c11b5 vulkan: Extend rope fusions to allow mrope (llama/18264)
Extend the test-backend-ops tests as well.
2025-12-31 17:52:09 +02:00
Jeff Bolz 98e59a43d1 vulkan: Implement set_tensor_async and the event interfaces (llama/18047)
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host visible vidmem I think the cost of
allocating/mapping vidmem moves and becomes more expensive, and I don't see a
benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
significant improvement in model loading time.
2025-12-31 17:52:09 +02:00
Jeff Bolz b893e0813a vulkan: fix im2col overflowing maxworkgroupcount (llama/18180) 2025-12-31 17:52:09 +02:00
Jeff Bolz f407c5e562 vulkan/cuda: fix topk_moe with exp_probs_b (llama/18071)
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.

CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
2025-12-31 17:52:09 +02:00
Jeff Bolz ad6ee3865d vulkan: support GGML_UNARY_OP_XIELU (llama/18062) 2025-12-31 17:52:09 +02:00
Jeff Bolz 3cd141f1a9 vulkan: in graph_optimize, try to group ADD operations (llama/18060)
I saw the adds not staying together in the new nemotron 3 nano model.
2025-12-31 17:52:09 +02:00
Jeff Bolz 195d8d0c65 vulkan: Add perf logger mode with concurrency (llama/17944)
This implements a variation of the perf logger where rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap. This can be useful to help understand whether
individual operations need to be optimized, or if the group is already running
efficiently.

GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).

GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
2025-12-31 17:52:09 +02:00
Jeff Bolz b901ebe4a3 vulkan: support get_rows for i32 (llama/17941) 2025-12-18 08:20:56 +02:00
Jeff Bolz f33446643e vulkan: support GGML_OP_DIAG (llama/17893) 2025-12-18 08:20:56 +02:00
Jeff Bolz 939d3085e9 vulkan: Multi-pass softmax for large number of cols (llama/17892)
When the number of cols is large, split each row across multiple workgroups.
There are three phases that communicate partial results through temp buffers:
(1) compute max partials
(2) take max of partials, compute sum(exp(x-max)) partials
(3) sum partials, compute scaled result
2025-12-18 08:20:56 +02:00
Jeff Bolz 13bb296dbf vulkan: Allow non-pow2 n_experts in topk_moe (llama/17872) 2025-12-18 08:20:56 +02:00
Jeff Bolz 898f876fe2
vulkan: perf_logger improvements (llama/17672)
* vulkan: perf_logger improvements

- Move perf_logger from device to ctx.
- Add an env var to control the frequency we dump the stats. If you set a very
large value, it just dumps when the ctx is destroyed.
- Add a fusion info string to the tracking, only log one item per fused op.
- Fix MUL_MAT_ID flops calculation.

* fix vector sizes
2025-12-12 17:53:21 +02:00
Phylliida Dev c5e1807071
ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (llama/16985)
* Feat: Added vulkan circular tiling support

* Feat: Added cpu circular

* Feat: Added cuda kernels

* Added tests

* Added tests

* Removed non-pad operations

* Removed unneded changes

* removed backend non pad tests

* Update test-backend-ops.cpp

* Fixed comment on pad test

* removed trailing whitespace

* Removed unneded test in test-backend-ops

* Removed removed test from calls

* Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp

Co-authored-by: Ruben Ortlam <picard12@live.de>

* Fixed alignment

* Formatting

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Format pad

* Format

* Clang format

* format

* format

* don't change so much stuff

* clang format and update to bool

* fix duplicates

* don't need to fix the padding

* make circular bool

* duplicate again

* rename vulkan to wrap around

* Don't need indent

* moved to const expr

* removed unneded extra line break

* More readable method calls

* Minor wording changes

* Added final newline

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Added circular pad ext tests

* Gate non circular pad devices

* Cleaned gating of non-circular pad devices

---------

Co-authored-by: Phylliida <phylliidadev@gmail.com>
Co-authored-by: Ruben Ortlam <picard12@live.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:20 +02:00