Commit Graph

463 Commits

Author SHA1 Message Date
Jeff Bolz 85bbc82209
vulkan: Support F16 OP_FILL (llama/22177) 2026-04-30 11:29:14 +03:00
Ruben Ortlam 7fe6b8e171
vulkan: optimize im2col (llama/21713)
* vulkan: improve im2col memory write layout

* cap workgroups

* minimal device tuning

* use vendor_id instead of subgroup size
2026-04-30 11:29:10 +03:00
Jeff Bolz 45365fa111
vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it (llama/21572)
* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it

* use FetchContent to get SPIRV-Headers

* Fetch spirv-headers unconditionally

* remove fetchcontent, rely on installed headers

* fix ubuntu job

* Update docs/build.md
2026-04-30 11:29:08 +03:00
Jeff Bolz cdeaa34174
vulkan: Support GGML_TYPE_NVFP4 (llama/21455)
This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For
mul_mat, it does not add support for the dp4/q8_1 path, it's all via
fp16/fp32.
2026-04-30 11:29:07 +03:00
Ruben Ortlam 0f99a47177
vulkan: Flash Attention DP4A shader for quantized KV cache (llama/20797)
* use integer dot product for quantized KV flash attention

* small improvements

* fix SHMEM_STAGING indexing

* add missing KV type quants

* fixes

* add supported quants to FA tests

* readd fast paths for <8bit quants

* fix mmq gate and shmem checks
2026-04-30 11:29:07 +03:00
Jeff Bolz 458ad1d93e
vulkan: Support Q1_0 (llama/21539)
* vulkan: Support Q1_0

* use get_dm
2026-04-30 11:29:05 +03:00
Johannes Gäßler bb895c843d
ggml: backend-agnostic tensor parallelism (experimental) (llama/19378)
* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (llama/8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type

* Fix crashes due to KV cache serialization (llama/9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (llama/7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (llama/11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies

* Enable the previous allreduce implementation. It is better in both perf and stability (llama/12)

* delay AllReduce for Moe for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (llama/16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVino, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (llama/17)

* meta : formatting, naming, indentation (llama/18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-30 11:29:05 +03:00
Ruben Ortlam 1d555510de
vulkan: unify type macros to use Vx instead of _VECx (llama/21605) 2026-04-30 11:29:04 +03:00
Tom Overlund 78b4fd85e1
ggml: Vulkan build, Linux -- output error string for errno on fork failure (#20868) (llama/20904) 2026-04-30 11:29:02 +03:00
mkoker 18c98ffaf7
vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (llama/21029)
Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL
in the flash attention base shader. Register them in the shader
generator, pipeline creation, and enable in the scalar/coopmat1 FA
support check.
2026-04-30 11:29:02 +03:00
Ruben Ortlam 759f0084b4 vulkan: add noncontiguous GLU support (llama/21081)
* vulkan: add noncontiguous GLU support

* fix compile issue
2026-03-29 15:04:36 +03:00
Matt Corallo 22710fdb82 Add shader count for Intel Arc Pro B60 (llama/20818) 2026-03-29 15:04:36 +03:00
Jeff Bolz 49b505bcc5 vulkan: change gated_delta_net to shard a column across a subgroup (llama/20662)
* vulkan: change gated_delta_net to shard a column across a subgroup

This is based on https://github.com/ggml-org/llama.cpp/pull/20391, I used an
LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).

This fixes a perf regression from the transposing of the values in memory
(!20443).

* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
2026-03-29 15:04:36 +03:00
Eve 43c7c0f86c vulkan: dequantize iq4_xs 4 at a time (llama/20657) 2026-03-29 15:04:36 +03:00
Ruben Ortlam 16ca5e6fb1 vulkan: disable mmvq on Intel Windows driver (llama/20672)
* vulkan: disable mmvq on Intel Windows driver

* improve comment
2026-03-29 15:04:36 +03:00
Ruben Ortlam 0ad6ceef59 vulkan: async and event fixes (llama/20518)
* vulkan: fix event wait submission, event command buffer reset

* fix event command buffer reset validation error

* also reset command buffers before reuse

* use timeline semaphores instead of fences for event_synchronize

* don't use initializer list for semaphore wait info

* use multiple events to avoid reset issues

* fix event reuse issue with multiple vectors

* add semaphore wait condition also if compute_ctx already exists

* remove event pending stage
2026-03-29 15:04:36 +03:00
Ruben Ortlam 49adc8b470 vulkan: allow graphics queue only through env var (llama/20599)
* vulkan: avoid graphics queue on non-RADV AMD drivers

* avoid graphics queues on small GPUs

* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE

* reenable transfer queue if graphics queue is not used
2026-03-29 15:04:36 +03:00
Ruben Ortlam 724ea71cf9 vulkan: fix flash attention dot product precision (llama/20589) 2026-03-29 15:04:36 +03:00
Ruben Ortlam cd02195b8f vulkan: use graphics queue on AMD (llama/20551)
* vulkan: use graphics queue on AMD for slightly better performance

* disable async transfer queue on AMD
2026-03-16 13:10:15 +02:00
Georgi Gerganov c7abcd577b graph : remove redundant GDN state transposes (llama/20443)
* ggml : transpose fused GDN state access for coalesced memory reads (llama/20436)

The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.

Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA:  curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU:   restructured loops for row-wise transposed access

Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.

All GATED_DELTA_NET backend-ops tests pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags

- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
  dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
  path lacks device support, disable both to prevent state layout mismatch
  between transposed (fused) and non-transposed (unfused) formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* llama : rever fgdn argument changes

* graph : remove GDN state transposes

* vulkan : adapt

* cuda : remove obsolete smem code

---------

Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
2026-03-16 13:10:15 +02:00
ProgenyAlpha 2450919665 vulkan: add GATED_DELTA_NET op support (llama/20334)
* vulkan: add GATED_DELTA_NET op support

Implements the fused gated delta net recurrence as a Vulkan compute
shader with full support for scalar gate, KDA vector gate, GQA
broadcast, multi-token sequences, and permuted (non-contiguous) q/k
inputs. Specialization constants select head size (32/64/128) and
KDA mode at pipeline creation time.

Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: optimize GATED_DELTA_NET shader (Phase 1)

- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for KDA path, eliminating ~32K
  redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops

KDA TG: +5.4% throughput. Non-KDA: no regressions.
13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: address review feedback for GATED_DELTA_NET

Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros,
scale in push constants, supports_op fix, dispatch restructuring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: add explicit FLOAT_TYPE casts for buffer loads

Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts
to ensure correct behavior across all Vulkan configurations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: fix Q/K broadcast for interleaved head layout

Adapt to the interleaved broadcast convention from #20340:
head_id / rq1 → head_id % neq1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 13:10:15 +02:00
ProgenyAlpha 44c12c642e vulkan: fix SSM_CONV PP scaling with large ubatch sizes (llama/20379)
* vulkan: optimize SSM_CONV workgroup dispatch for large ubatch

Tile tokens into 2D workgroups (32x16) to reduce workgroup launch
overhead at large ubatch sizes. Add vec4 fast path for nc=4 (common
d_conv size). Fixes PP performance degradation with ubatch > 512.

Ref: ggml-org/llama.cpp#18725

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: remove unused shared memory declaration in SSM_CONV

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 13:10:15 +02:00
Jeff Bolz 86e312d61d vulkan: fix l2_norm epsilon handling (llama/20350) 2026-03-16 13:10:15 +02:00
Jeff Bolz 6c5e3aac3e vulkan: fix OOB check in flash_attn_mask_opt (llama/20296) 2026-03-16 13:10:15 +02:00
Masato Nakasaka 26ee4f7362 vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (llama/20059)
* Changed to reuse command buffers to fix crashing on Intel GPU

* Removed unused parameter

* Fixed compile error and minor mistake

* Fix logging

* Changing to use usage flag per command buffer

* fixed style

* added buffer reset

* Removed cmd_buffer_idx for reuse consistency

* Fixed style
2026-03-16 13:10:15 +02:00
Bertay Eren 65dbf3c31a ggml-vulkan: add SGN operator, auto-generate Vulkan.csv and ops.md (llama/20219) 2026-03-16 13:10:15 +02:00
Ruben Ortlam 890c047e30 vulkan: skip zero size tensors in backend copies (llama/20233) 2026-03-16 13:10:15 +02:00
GiantPrince 8d97f59639 ggml-vulkan: Add ELU op support (llama/20183)
* ggml-Vulkan: add ELU support

* ggml-Vulkan: remove extra spaces and variables

* ggml-Vulkan: fix format issue

* ggml-Vulkan: fix format issue

* fix whitespace issue

* Update Vulkan.csv and ops.md
2026-03-16 13:10:15 +02:00
Jeff Bolz 4b0653a792 vulkan: Fix data races in coopmat1 mul_mat(_id) (llama/20084)
* vulkan: Fix data races in coopmat1 mul_mat(_id)

Add barriers between coopmat store and regular loads. We sort of got away with
this because it was the same subgroup accessing the values, but it's still a
race and may not work.

* switch to subgroup control barriers
2026-03-16 13:10:15 +02:00
Marcel Petrick 67abc63e9d chore : correct typos [no ci] (llama/20041)
* fix(docs): correct typos found during code review

Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>

* Update docs/backend/CANN.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8"

This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-16 13:10:15 +02:00
Ruben Ortlam 923a292429 vulkan: tune MMVQ for Intel Windows (llama/19988) 2026-03-16 13:10:15 +02:00
Ruben Ortlam 2a9649c420 vulkan: improve partial offloading performance on AMD (llama/19976)
* vulkan: fix and enable cpy_tensor_async function

* use transfer_queue for async transfers on AMD, synchronize with timeline semaphore

* update offload_op logic

* fix missing transfer submission

* disable async transfer queue on AMD GCN

* revert op batch size change

* fix cpy_tensor_async checks
2026-03-16 13:10:15 +02:00
Ruben Ortlam e722ee1bf5 vulkan: fix fp16 Flash Attention on Windows AMD RDNA2 and below (llama/19921) 2026-02-27 20:57:58 +02:00
Jeff Bolz fb55b2654b vulkan: check for memory overlap before doing fusion (llama/19768)
* vulkan: check for memory overlap before doing fusion

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

* address feedback
2026-02-27 20:57:58 +02:00
Ruben Ortlam 90800b5aa5 Vulkan Scalar Flash Attention Refactor (llama/19625)
* vulkan: allow using fp16 in scalar flash attention shader

* split rows inside of subgroups for faster synchronization

* use row_split when Br >= 4, change reductions to use shared memory if row_split == 1

* use f32 scalar FA if f16 is not supported by device

* fix amd workgroup size issue

* optimize masksh use

* add medium rows FA shader Br size

* fixes

* add padding to mask shmem buffer

* cache q values into registers for KQ

* fuse lf accumulation, pf and v accumulation into a loop

* stage K loads through shmem

* stage V loads through shmem

* only stage through shmem on Nvidia

* default to Bc 32

* also stage V through shmem when this is done for K

* dynamic subgroups for intel

* use vectorized stores

* use float_type for dequantize4 functions

* use smaller scalar rows size for smaller rows count

* relax flash attention split_k condition to allow non-gqa use

* use minimal subgroup size on Intel

* fix shmem support function

* fix rebase issues

* fixes

* Bc 4 for scalar FA is not a valid configuration

* Use wave32 on AMD RDNA for scalar FA

* add Intel shader core count lookup-table

* fix regressions

* device tuning

* tmpsh size fix

* fix editorconfig

* refactor fa tuning logic into a single place

* fix gqa opt logic

* fix block_rows with small n_rows

* amd tuning

* fix hsk=72/80 issue

* tuning

* allow condition skipping for column check

* use float16 for Of if available

* address feedback

* fix bad RDNA performance on head size <= 128 by limiting occupancy

* allow printing pipeline stats

* cleanup and fixes

* limit occupancy for GCN for small batch FA with large HSK

* disable f16 FA for GCN AMD GPUs on the proprietary driver
2026-02-27 20:57:58 +02:00
Jeff Bolz dcc877688d vulkan: fix coopmat1 without bf16 support (llama/19793) 2026-02-27 20:57:58 +02:00
Jeff Bolz 344eae3d22 vulkan: fix data race in mul_mat_id shader (llama/19790) 2026-02-27 20:57:58 +02:00
Ruben Ortlam 3f68f30907 vulkan: fix MMQ shader push constants and multi-dispatch (llama/19732) 2026-02-27 20:57:58 +02:00
Jeff Bolz f1da0a26f5 vulkan: split mul_mat into multiple dispatches to avoid overflow (llama/19509)
* vulkan: split mul_mat into multiple dispatches to avoid overflow

The batch dimensions can be greater than the max workgroup count limit,
in which case we need to split into multiple dispatches and pass the base
index through a push constant.

Fall back for the less common p021 and nc variants.

* address feedback
2026-02-27 20:57:58 +02:00
Jeff Bolz cc448def01 vulkan: support L2_NORM with contiguous rows (llama/19604) 2026-02-15 21:44:37 +02:00
Jeff Bolz 197e9ab6eb vulkan: support GGML_OP_SET (llama/19584) 2026-02-15 21:44:37 +02:00
Sophon fc6bbab817 vulkan: Add vendor id for Qualcomm drivers (llama/19569)
This commit allows Qualcomm native vulkan driver to be used on Windows
instead of Mesa Dozen.
2026-02-15 21:44:37 +02:00
Jeff Bolz ec57bf407c vulkan: restore -inf check in FA shaders (llama/19582) 2026-02-15 21:44:37 +02:00
ymcki 628b545b7e fix vulkan ggml_acc only works in 3d but not 4d (llama/19426)
* fix vulkan ggml_acc only works in 3d but not 4d

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Jeff Bolz cea22b3075 vulkan: For coopmat2 FA, use fp16 accumulators for the final result (llama/19376)
The cpu and cuda backends use fp16 for the VKQ accumulator type, this change
does the same for vulkan. This helps particularly with large head sizes which
are very register-limited.

I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.

I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
2026-02-08 09:29:10 +02:00
Jeff Bolz c1b63354bb vulkan: make FA mask/softcap enables spec constants (llama/19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-08 09:29:10 +02:00
Jeff Bolz a567c140a3 vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-08 09:29:10 +02:00
Oleksandr Kuvshynov 932def3198 vulkan: fix GPU deduplication logic. (llama/19222)
* vulkan: fix GPU deduplication logic.

As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for windows+intel igpu.

Let's just avoid filtering for MoltenVK which is apple-specific, and
keep the logic the  same as before 88d23ad5 - just dedup based on UUID.

Verified that MacOS + 4xVega still reports 4 GPUs with this version.

* vulkan: only skip dedup when both drivers are moltenVk
2026-02-08 09:29:10 +02:00
Jeff Bolz 5a786f7648 vulkan: Set k_load_shmem to false when K is too large (llama/19301) 2026-02-08 09:29:10 +02:00
Jeff Bolz e0a3f393ad vulkan: fix non-contig rope (llama/19299) 2026-02-08 09:29:10 +02:00