Commit Graph

380 Commits

Author SHA1 Message Date
Jeff Bolz 67473fef57 vulkan: handle rope with large number of rows (llama/18306) 2025-12-31 17:52:09 +02:00
Jeff Bolz f863735caa vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (llama/18302) 2025-12-31 17:52:09 +02:00
Ruben Ortlam 1356600679 vulkan: use fewer FA rows for small cache runs (llama/18280) 2025-12-31 17:52:09 +02:00
Jeff Bolz dbbe6c11b5 vulkan: Extend rope fusions to allow mrope (llama/18264)
Extend the test-backend-ops tests as well.
2025-12-31 17:52:09 +02:00
Jeff Bolz 98e59a43d1 vulkan: Implement set_tensor_async and the event interfaces (llama/18047)
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host visible vidmem I think the cost of
allocating/mapping vidmem moves and becomes more expensive, and I don't see a
benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
significant improvement in model loading time.
2025-12-31 17:52:09 +02:00
Jeff Bolz b893e0813a vulkan: fix im2col overflowing maxworkgroupcount (llama/18180) 2025-12-31 17:52:09 +02:00
Jeff Bolz f407c5e562 vulkan/cuda: fix topk_moe with exp_probs_b (llama/18071)
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.

CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
2025-12-31 17:52:09 +02:00
Jeff Bolz ad6ee3865d vulkan: support GGML_UNARY_OP_XIELU (llama/18062) 2025-12-31 17:52:09 +02:00
Jeff Bolz 3cd141f1a9 vulkan: in graph_optimize, try to group ADD operations (llama/18060)
I saw the adds not staying together in the new nemotron 3 nano model.
2025-12-31 17:52:09 +02:00
lovedheart 449fc7c024 Vulkan: some improvement on mul_mat_iq2_xs (llama/18031)
* Some improvement on mul_mat_iq2_xs

Refactor calculations for db values and grid data to optimize performance and reduce redundancy.

* Fix trailing whitespace
2025-12-31 17:52:09 +02:00
Jeff Bolz 195d8d0c65 vulkan: Add perf logger mode with concurrency (llama/17944)
This implements a variation of the perf logger where rather than timing each
operation individually with effectively a barrier in between, we put the
timing boundaries where we already synchronize and time the groups of work
that normally overlap. This can be useful to help understand whether
individual operations need to be optimized, or if the group is already running
efficiently.

GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
GGML_VK_PERF_LOGGER is also set).

GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.
2025-12-31 17:52:09 +02:00
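The environment variables named in the commit above could be assembled for a child process roughly as follows. This is only a sketch: the helper name `vk_perf_logger_env` is hypothetical, and the value `"1"` for `GGML_VK_PERF_LOGGER` is an assumption (the message only says the variable must be set).

```python
import os


def vk_perf_logger_env(concurrent=False, sync_logger=False, base=None):
    """Build an environment dict that turns on the Vulkan perf logger.

    Hypothetical helper; only the variable names come from the commit message.
    """
    env = dict(os.environ if base is None else base)
    # Assumption: any value counts as "set"; "1" is used here for readability.
    env["GGML_VK_PERF_LOGGER"] = "1"
    if concurrent:
        # Only meaningful when GGML_VK_PERF_LOGGER is also set.
        env["GGML_VK_PERF_LOGGER_CONCURRENT"] = "1"
    if sync_logger:
        # Runtime replacement for the old ENABLE_SYNC_LOGGING compile-time switch.
        env["GGML_VK_SYNC_LOGGER"] = "1"
    return env
```

The returned dict can be passed as the `env` argument of `subprocess.run` when launching whichever ggml-based binary is being profiled.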
Ruben Ortlam 3bb4e1e0ac vulkan: fix mul_mat_vec_iq1_s formatting (llama/18026) 2025-12-18 08:20:56 +02:00
Jeff Bolz af2c8cba6f vulkan: Fix data race/hang in scalar/cm1 flash attention (llama/17887) 2025-12-18 08:20:56 +02:00
lovedheart 7e5df2975e vulkan: improve mul_mat_vec_iq1_s speed (llama/17874) 2025-12-18 08:20:56 +02:00
Eve cdadfc3b72 vulkan: faster q6_k matmul (llama/17813)
* q6_k faster mul mat

* 8 values

* fix comment

* switch to two at a time

* start ci for .glsl files
2025-12-18 08:20:56 +02:00
Jeff Bolz b901ebe4a3 vulkan: support get_rows for i32 (llama/17941) 2025-12-18 08:20:56 +02:00
Jeff Bolz f33446643e vulkan: support GGML_OP_DIAG (llama/17893) 2025-12-18 08:20:56 +02:00
Jeff Bolz 939d3085e9 vulkan: Multi-pass softmax for large number of cols (llama/17892)
When the number of cols is large, split each row across multiple workgroups.
There are three phases that communicate partial results through temp buffers:
(1) compute max partials
(2) take max of partials, compute sum(exp(x-max)) partials
(3) sum partials, compute scaled result
2025-12-18 08:20:56 +02:00
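The three phases listed in the commit above can be modelled on the CPU as follows. This is a minimal sketch, not the shader: the function name `multipass_softmax` is hypothetical, list slices stand in for workgroups, and plain Python lists stand in for the temp buffers.

```python
import math


def multipass_softmax(row, num_groups):
    """Softmax of one row, split across `num_groups` chunks (stand-ins for workgroups)."""
    n = len(row)
    chunk = -(-n // num_groups)  # ceiling division
    chunks = [row[i:i + chunk] for i in range(0, n, chunk)]

    # Phase 1: each chunk computes its partial max.
    max_partials = [max(c) for c in chunks]

    # Phase 2: reduce the partial maxes, then each chunk computes
    # its partial sum of exp(x - max).
    row_max = max(max_partials)
    sum_partials = [sum(math.exp(x - row_max) for x in c) for c in chunks]

    # Phase 3: reduce the partial sums, then compute the scaled result.
    total = sum(sum_partials)
    return [math.exp(x - row_max) / total for x in row]
```

Subtracting the row maximum before exponentiating keeps every exponent at or below zero, which is why the max has to be fully reduced before any partial sums are formed.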
Jeff Bolz 13bb296dbf vulkan: Allow non-pow2 n_experts in topk_moe (llama/17872) 2025-12-18 08:20:56 +02:00
lovedheart d6d44fac69
Vulkan: improve mul_mat_vec_iq1_m (llama/16907)
* Optimize Vulkan shader for matrix-vector multiplication

* Revert changes on compute_outputs and main

Refactor compute_outputs to handle remaining rows correctly.

* Fix trailing whitespace
2025-12-12 17:53:21 +02:00
Jeff Bolz 898f876fe2
vulkan: perf_logger improvements (llama/17672)
* vulkan: perf_logger improvements

- Move perf_logger from device to ctx.
- Add an env var to control the frequency we dump the stats. If you set a very
large value, it just dumps when the ctx is destroyed.
- Add a fusion info string to the tracking, only log one item per fused op.
- Fix MUL_MAT_ID flops calculation.

* fix vector sizes
2025-12-12 17:53:21 +02:00
Phylliida Dev c5e1807071
ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (llama/16985)
* Feat: Added vulkan circular tiling support

* Feat: Added cpu circular

* Feat: Added cuda kernels

* Added tests

* Added tests

* Removed non-pad operations

* Removed unneeded changes

* removed backend non pad tests

* Update test-backend-ops.cpp

* Fixed comment on pad test

* removed trailing whitespace

* Removed unneeded test in test-backend-ops

* Removed removed test from calls

* Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp

Co-authored-by: Ruben Ortlam <picard12@live.de>

* Fixed alignment

* Formatting

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Format pad

* Format

* Clang format

* format

* format

* don't change so much stuff

* clang format and update to bool

* fix duplicates

* don't need to fix the padding

* make circular bool

* duplicate again

* rename vulkan to wrap around

* Don't need indent

* moved to const expr

* removed unneeded extra line break

* More readable method calls

* Minor wording changes

* Added final newline

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Added circular pad ext tests

* Gate non circular pad devices

* Cleaned gating of non-circular pad devices

---------

Co-authored-by: Phylliida <phylliidadev@gmail.com>
Co-authored-by: Ruben Ortlam <picard12@live.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:20 +02:00
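For readers unfamiliar with circular (wrap-around) padding, the idea behind the commit above can be sketched in a few lines. The helper names and parameters below are hypothetical and do not reproduce the ggml pad API. The sketch only shows that padded positions read from the opposite edge, which is what makes a tiled texture seamless.

```python
def pad_circular_1d(row, left, right):
    """Pad a 1-D sequence by wrapping around instead of filling with zeros."""
    n = len(row)
    # Output position i reads source position (i - left) modulo n.
    return [row[(i - left) % n] for i in range(n + left + right)]


def pad_circular_2d(img, top, bottom, left, right):
    """Apply the same wrap-around rule to rows and columns of a 2-D grid."""
    h = len(img)
    return [
        pad_circular_1d(img[(y - top) % h], left, right)
        for y in range(h + top + bottom)
    ]
```

With zero padding the borders would be filled with constants. Here the border on each side is a copy of the pixels on the opposite side.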
Jeff Bolz c66c71e9f4
vulkan: Use one row per workgroup for f32 mmv (llama/17711)
The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before
the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs
well. I think even for larger m, f32 is so bandwidth-limited that running
multiple rows doesn't help.
2025-12-12 17:53:20 +02:00
Jeff Bolz 875d861473
vulkan: support solve_tri with larger N/K values (llama/17781)
Split N into chunks to fit into shared memory.
If K > 128, use a larger workgroup with enough invocations.
Add perf tests matching qwen3next.
2025-12-12 17:53:20 +02:00
Masato Nakasaka a8d02735f7
vulkan: Replace deprecated VK_EXT_validation_features (llama/17637)
* replaced deprecated VK_EXT_validation_features

* forgot to remove old code
2025-12-12 17:53:19 +02:00
Masato Nakasaka 191e5f46a2
vulkan: Fix mismatch in TOPK_MOE unit test (llama/17541)
* Fix shader to support 2D workgroup mapping to a single subgroup

* Set required_subgroup_size

topk_moe shader requires static WARP_SIZE and actual subgroup size to match
2025-12-12 17:53:19 +02:00
Jeff Bolz 64a3f573e0
vulkan: add more num_blocks instantiations in rms_norm (llama/17701) 2025-12-12 17:53:19 +02:00
Jeff Bolz 0484147ab2
vulkan: fix top_k bug when there are ties in the input (llama/17659)
* vulkan: Reduce temporary memory usage for TOP_K

- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.

* vulkan: fix top_k bug when there are ties in the input

I noticed by inspection a bug in the vulkan top_k shader where if the least
value in the top_k appears multiple times we could end up writing those extra
copies out rather than some larger values (if the larger values are on higher
numbered threads).

I rewrote the test verification to handle this case, where the final index set
is not necessarily the same.

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:19 +02:00
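The tie-tolerant verification described in the commit above might look like the following. This is a sketch under assumptions, not the code in tests/test-backend-ops.cpp: the function name `topk_result_is_valid` is hypothetical, and it compares the selected values rather than the index set, since ties make the index set non-unique.

```python
def topk_result_is_valid(src, indices, k):
    """Check a top-k index list without requiring a unique answer when values tie."""
    if len(indices) != k or len(set(indices)) != k:
        return False  # wrong count, or the same index written twice
    if any(i < 0 or i >= len(src) for i in indices):
        return False  # index out of range
    # Compare the multiset of selected values against the true top-k values.
    got = sorted((src[i] for i in indices), reverse=True)
    expected = sorted(src, reverse=True)[:k]
    return got == expected
```

A result that writes out extra copies of the smallest top-k value in place of larger values fails this check, while any choice among equal-valued candidates passes.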
Acly 0b53759b29
vulkan : support conv-2d with large output size (llama/17685) 2025-12-12 17:53:19 +02:00
Jeff Bolz 7e97d3b069
vulkan: enable mmvq for q2_k on NVIDIA (llama/17675) 2025-12-12 17:53:18 +02:00
Jeff Bolz 32ba1ec8e0
vulkan: set all memory allocations to high priority (llama/17624)
* vulkan: set all memory allocations to high priority

* gate by env var
2025-12-12 17:53:18 +02:00
Jeff Bolz 86cb5ab93f
vulkan: Reduce temporary memory usage for TOP_K (llama/17623)
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
2025-12-12 17:53:15 +02:00
Tarek Dakhran 0defeee679
model: LFM2-VL fixes (llama/17577)
* Adjust to pytorch

* Add antialiasing upscale

* Increase number of patches to 1024

* Handle default marker insertion for LFM2

* Switch to flag

* Reformat

* Cuda implementation of antialias kernel

* Change placement in ops.cpp

* consistent float literals

* Pad only for LFM2

* Address PR feedback

* Rollback default marker placement changes

* Fallback to CPU implementation for antialias implementation of upscale
2025-12-12 17:53:14 +02:00
Acly 2258930c2e
vulkan : fix FA mask load with bounds check (coopmat2) (llama/17606) 2025-12-12 17:53:13 +02:00
Ruben Ortlam 2fcc0a3a9f
Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (llama/16900)
* vulkan: split mul_mmq_funcs for mul_mat_vecq use

* add mxfp4 mmvq

* add q2_k mmvq

* add q3_k mmvq

* add q4_k and q5_k mmvq

* add q6_k mmvq

* handle 4x4 quants per mmvq thread

* enable MUL_MAT_ID mmvq support

* enable subgroup optimizations for mul_mat_vec_id shaders

* device tuning

* request prealloc_y sync after quantization

* fix indentation

* fix llvmpipe test failures

* fix mul_mat_id mmvq condition

* fix unused variable warning
2025-12-12 17:53:12 +02:00
Jeff Bolz dbf8766ffa
vulkan: improve topk perf for large k, fix overflow in unit tests (llama/17582) 2025-12-12 17:53:12 +02:00
Jeff Bolz 7a20963140
vulkan: Implement GGML_OP_TRI (llama/17503)
* vulkan: Implement GGML_OP_TRI

* check types match
2025-12-12 17:53:11 +02:00
Jeff Bolz 3727a36c48
vulkan: Implement SOLVE_TRI (llama/17486)
* vulkan: Implement SOLVE_TRI

* load B matrix through shared memory

* use FLOAT_TYPE
2025-12-12 17:53:10 +02:00
Acly ac92424b59
vulkan : move contiguous checks to device_supports_op (llama/17490)
* vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op

* im2col: remove constraints on src0 (kernel input)
2025-12-12 17:53:10 +02:00
Jeff Bolz 310db24fca
vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (llama/17514) 2025-12-12 17:53:10 +02:00
Jeff Bolz c8050e5fdc
vulkan: allow graph_optimize for prompt processing workloads (llama/17475) 2025-12-12 17:53:09 +02:00
Jeff Bolz d8b61e05f8
vulkan: Implement top-k (llama/17418)
* vulkan: Implement top-k

Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. Repeat until only K are left. And there's a fast
path when K==1 to just find the max value rather than sorting.

* fix pipeline selection

* vulkan: Add N-ary search algorithm for topk

* microoptimizations
2025-12-12 17:53:09 +02:00
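The pass structure described in the commit above can be modelled serially as follows. This is a rough sketch: the function name `topk_multipass` and the `chunk_log2` parameter are hypothetical, each list slice stands in for one workgroup, and the N-ary search variant mentioned later in the same commit is not shown.

```python
def topk_multipass(values, k, chunk_log2=7):
    """Return the k largest values in descending order via repeated chunked passes."""
    if k == 1:
        # Fast path: find the max value rather than sorting.
        return [max(values)]
    chunk = 1 << chunk_log2  # each "workgroup" sorts 2^N elements
    if chunk <= k:
        raise ValueError("chunk size must exceed k so each pass shrinks the data")
    current = list(values)
    while len(current) > k:
        survivors = []
        for start in range(0, len(current), chunk):
            block = sorted(current[start:start + chunk], reverse=True)
            survivors.extend(block[:k])  # discard all but the top k of this chunk
        current = survivors
    return sorted(current, reverse=True)
```

Each pass keeps at most k of every 2^N elements, so the data shrinks until a single chunk produces the final k.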
Jeff Bolz 208450048c
vulkan: Implement GGML_OP_CUMSUM (llama/17479) 2025-12-12 17:53:08 +02:00
Jeff Bolz 273e4fe7ae
vulkan: Use fewer rows for scalar FA when HS is not a multiple of 16 (llama/17455) 2025-12-12 17:53:07 +02:00
Jeff Bolz 553d57a4e7
vulkan: more FA details in vk_perf_logger (llama/17443) 2025-12-12 17:53:07 +02:00
Jeff Bolz deb4958add
vulkan: remove a couple unnecessary switches (llama/17419) 2025-12-12 17:53:06 +02:00
Jeff Bolz cdc1a776be
vulkan: disable async for older Intel devices (llama/17369)
* vulkan: disable async for older Intel devices

* update detection logic

* use name string for detection
2025-12-12 17:53:05 +02:00
Giuseppe Scrivano 24b14cad87
vulkan: implement ADD1, ARANGE, FILL, SOFTPLUS, STEP, ROUND, CEIL, FLOOR, TRUNC (llama/17319)
* vulkan: initialize array

* vulkan: implement ADD1

* vulkan: implement ARANGE

* vulkan: implement FILL

* vulkan: implement SOFTPLUS

* vulkan: implement STEP

* vulkan: implement ROUND

* vulkan: implement CEIL

* vulkan: implement FLOOR

* vulkan: implement TRUNC

* docs: update Vulkan ops

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-12-12 17:53:04 +02:00
Jeff Bolz 95d0b0b0cf
vulkan: support larger argsort (llama/17313)
* vulkan: support larger argsort

This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory and when more than 1024 threads are needed
it runs multiple workgroups and synchronizes through a pipeline barrier.

To improve the memory access pattern, a copy of the float value is kept with
the index value. I've applied this same change to the original shared memory
version of the shader, which is still used when ncols <= 1024.

* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost

* reduce loop overhead

* run multiple cols per invocation, to reduce barrier overhead
2025-12-12 17:53:04 +02:00
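As background for the commit above, a bitonic argsort that keeps a copy of each value next to its index can be sketched serially as shown here. This is an illustrative CPU model, not the shader: the function name `bitonic_argsort` is hypothetical, inputs are padded to a power of two with infinity sentinels (an assumption of this sketch), and each inner `for i` loop corresponds to one parallel step that would be followed by a barrier on the GPU.

```python
def bitonic_argsort(values):
    """Return indices that sort `values` ascending, using a bitonic network."""
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    # Keep the value alongside the index so comparisons never chase the index
    # back into the source array. Sentinel entries sort to the end.
    pairs = [(values[i], i) for i in range(n)]
    pairs += [(float("inf"), n + j) for j in range(size - n)]

    k = 2
    while k <= size:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:         # compare distance within this merge stage
            for i in range(size):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Pairs are unique (distinct indices), so "not greater" means "less".
                    if (pairs[i] > pairs[partner]) == ascending:
                        pairs[i], pairs[partner] = pairs[partner], pairs[i]
            j //= 2
        k *= 2
    return [idx for _, idx in pairs if idx < n]
```

Because the tuple comparison falls back to the index, equal values come out in index order in this sketch. The shader is not claimed to make the same guarantee.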
Jeff Bolz ae8865c6e6
vulkan: Add copy_transpose shader (llama/17371) 2025-12-12 17:53:04 +02:00