* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16
Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
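A rough illustration of why accumulating in F32 rather than F16 matters. This is not the vec.h/vec.cpp code; the helper names are hypothetical, and half precision is emulated with Python's `struct` format `e`:

```python
import struct

def round_f16(x):
    """Round a float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_f32(x):
    """Round a float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def accumulate(values, round_fn):
    """Sum values, rounding the accumulator to the given precision after each add."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc

values = [1.0] * 4096
sum_f16 = accumulate(values, round_f16)  # accumulator kept in emulated F16
sum_f32 = accumulate(values, round_f32)  # accumulator kept in emulated F32
print(sum_f16, sum_f32)
```

In this sketch the F16 accumulator stops growing at 2048, because 2049 is not representable in half precision, while the F32 accumulator reaches the exact sum.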
* OP_GATED_DELTA_NET impl
* add back lanes_per_column declaration
* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce
* removed trailing spaces and fixed indentation. Hard-coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot
* support for K>1 state snapshot
* removed picky indent multiple of 4 fixes
* removed return that won't be executed
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now
* hmx-mm: add support for Q4_1
* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot
* hexagon: fix repack scratch buffer overflow
* hex-mm: fix Q4_1 repack buffer sizing
* hexagon: flip the build order for mm and fa (seems to help LTO)
* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1
* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output
* hexagon: resurrect early-wake and add support for polling for op-batch completions
With Q4_1, ggml-hexagon now claims pretty much the entire graph, which gives the CPU more time to chillax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in the normal runs, and op-batch polling is just for benchmarking.
---------
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32
Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
Note that this breaks some tests until the last commit which fixes
OOB A reads.
* vulkan: Use aligned loads in mul_mat_vec when available
Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec
Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.
Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes
There was a TODO to fix the OOB reads from the A matrix which we do
here.
It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
* feat: extend repeat op for vulkan
* feat: add repeat_f16 vulkan pipeline
* fix: ensure same dst and src types
* fix: use type_size instead of data types
* fix: use int16 and int32 for repeat shader op
* chore: rename repeat_f* to repeat_i*
* chore: rename repeat vulkan pipelines
* ggml-zendnn: fixed naming of matmul function
* ggml-zendnn: fixed naming of mul_mat_id function
* ggml-zendnn: fixed print in mul_mat_id
---------
Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d
* vulkan: skip conv2d bounds checks when shapes align with tile sizes
* vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d
* vulkan: stage cm2 conv2d accumulator through shmem before global store
* vulkan: add coopmat1 conv2d path
* fallback when using too much shared memory. clean up comments
* Require 16x16x16 and subgroup size 32 or 64
* check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values
* hexagon: add support for CONCAT with optimized concat_2d_transposed
qwen3.5 models are quite heavy on the CONCAT with large and transposed src1.
* hex-concat: use fastdiv in generic version
* hex-concat: make checks for transposed a bit more readable
* hex-concat: reorder dma ops for better pipelining
* hex-cont/cpy: optimize CPY and CONT ops
The primary change is to avoid scalar divs in the inner loops.
We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr.
This causes runtime divs by that value which is normally just 4 or 2 (f32/f16).
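For context, one common way to avoid a hardware divide by a runtime value is the multiply-and-shift "fastdiv" technique mentioned elsewhere in this log. The sketch below shows the general idea for 32-bit operands; `init_fastdiv` and `fastdiv` are hypothetical names, and whether the Hexagon helpers use the same constants is an assumption:

```python
def init_fastdiv(d):
    """Precompute (magic, shift) so that n // d can be done with a multiply and a shift."""
    assert 0 < d < 2**32
    shift = 0
    while (1 << shift) < d:  # smallest shift with 2**shift >= d
        shift += 1
    magic = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return magic, shift

def fastdiv(n, magic, shift):
    """Compute n // d for 0 <= n < 2**32 without a division instruction."""
    hi = (n * magic) >> 32  # high 32 bits of the 64-bit product
    return (hi + n) >> shift

magic, shift = init_fastdiv(4)  # e.g. type_size of f32
print(fastdiv(1024, magic, shift))
```

Python integers do not overflow, so `hi + n` is exact here; a fixed-width implementation has to account for that sum exceeding 32 bits.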
* hex-get-rows: optimize GET_ROWS for large rows
We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models
that do lots of GET_ROWS with huge (2MB+) rows.
Also bump the DMA queue depth now that we can take advantage of it.
* hex-concat: unroll the inner loops of concat_2d
* hex-concat: more updates to concat_2d to improve perf a bit further
* hex-cpy: fixed n_rows per thread checks in the copy ops
* hmx-fa: fix alignment issues while computing dma sizes
* hex-set-rows: add early returns for idle threads
* hvx-rope: minor optimization to replace loops with fastdiv logic
* hex-rope: replace scalar tail processing with HVX
* hex-rope: optimize rope cache init with HVX
Add hvx-utils sin/cos helpers that use an approximation method (similar to rsqrt, inverse, etc.)
Use the helpers to optimize ROPE.
* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K
* Fix to make the editorconfig check pass
* Remove mul-mat-legacy pipeline
* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
* Only run webgpu CI on my fork
* Add webgpu only workflow
* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled
* restore build.yml
* TP: fix ggml context size calculation, memory leak
* move split state cache back into the context
* revert to constant ggml context size for cgraphs
* increase headroom for statically allocated tensors
* remove obsolete include
* ggml: implement `gguf_init_from_buffer`
* test: `gguf_init_from_buffer`
* fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer
* fix: use `GGML_UNUSED`
Co-authored-by: Copilot <copilot@github.com>
* fix: remove `total_size` from `gguf_reader`
* fix: file offset calculation, rename `offset` to `data_offset`
Co-authored-by: Copilot <copilot@github.com>
* refactor: extract model loader bug fixes to another PR
* feat: add `gguf_init_from_callback`
* fix: always require a max expected size
* fix: change `gguf_reader_callback_t`'s `output` type to `void *`, change `max_expected_size` and offsets to `uint64_t`
* fix: harden against offset overflow in buffer read
* fix: remove seek behavior from the callback
* feat: `max_chunk_read == 0` means `SIZE_MAX`
* fix: seeking in a gguf file with no tensors
---------
Co-authored-by: Copilot <copilot@github.com>
- Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl.
- Move the OpenMP detection from ggml-cpu to ggml-base.
- Update OpenMP dependencies in ggml-config.cmake.in.
- change `k_copy_src1_to_contiguous` so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a known start and end
- switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity
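The counting-sort idea in the last two items can be sketched as follows. This is an illustration of the general technique, not the kernel code; `group_rows_by_expert` and `row_expert` are hypothetical names, while `n_as` is taken from the complexity expression above:

```python
def group_rows_by_expert(row_expert, n_as):
    """Group row indices by expert in O(n_as + n_routed_rows).

    Returns (starts, order): rows owned by expert e are order[starts[e]:starts[e + 1]].
    """
    counts = [0] * n_as
    for e in row_expert:            # pass 1: histogram of rows per expert
        counts[e] += 1
    starts = [0] * (n_as + 1)
    for e in range(n_as):           # prefix sum gives each expert's slice
        starts[e + 1] = starts[e] + counts[e]
    cursor = starts[:n_as]          # copy: next free slot inside each slice
    order = [0] * len(row_expert)
    for row, e in enumerate(row_expert):  # pass 2: scatter rows into slices
        order[cursor[e]] = row
        cursor[e] += 1
    return starts, order

starts, order = group_rows_by_expert([2, 0, 2, 1, 0], n_as=4)
print(starts, order)
```

Each expert's rows end up in one contiguous slice with a known start and end, and no pass iterates over every (expert, row) pair.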
* SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup
BF16 models had no dedicated token generation kernel — they fell through
to the generic full-GEMM path, resulting in ~14% memory bandwidth
utilization on Intel Arc GPUs. This adds BF16 support to the DMMV
(dequantize mul-mat-vec) path, matching the existing F16 implementation.
Fixes #20478
* SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0
The qk=1 kernel (used for F16 and BF16) iterates with stride
2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When
ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last
warp iteration accesses elements at col >= ncols, producing NaN for the
final row and wrong values for interior rows.
Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] %
(2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16
launcher to match. Quantized types use block-structured kernels with
different access patterns and keep the existing DMMV_X check.
Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70.
Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
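The out-of-bounds condition described in the SYCL BF16 DMMV fix above can be reproduced with a simplified model. This is not the kernel's actual indexing; it only assumes that each iteration consumes a full stride of `2*DMMV_X` columns, and the function names are hypothetical:

```python
DMMV_X = 32  # value quoted in the commit message for Intel targets

def max_col_touched(ncols, stride=2 * DMMV_X):
    """Highest column index read if every iteration consumes a full stride."""
    n_iters = -(-ncols // stride)  # ceiling division
    return n_iters * stride - 1

def old_check(ncols):
    """Previous gating: ncols only had to be a multiple of DMMV_X."""
    return ncols % DMMV_X == 0

def new_check(ncols):
    """Tightened gating for F16/BF16: multiple of 2*DMMV_X."""
    return ncols % (2 * DMMV_X) == 0

k = 1056  # one of the previously failing sizes
print(old_check(k), new_check(k), max_col_touched(k) >= k)
```

For k=1056 the old check passes while the modelled last iteration reads past the end of the row; the tightened check rejects that size.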
* vulkan: fuse snake activation (mul, sin, sqr, mul, add)
Add snake.comp shader with F32 / F16 / BF16 pipelines and
ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op
decomposition emitted by audio decoders (BigVGAN, Vocos) for snake
activation y = x + sin(a*x)^2 * inv_b and rewrites it to a single
elementwise kernel.
test_snake_fuse from the CUDA PR now also compares CPU naive vs
Vulkan fused across F32 / F16 / BF16.
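As a reference for what the fusion computes, here is a small scalar sketch comparing the naive 5 op chain with a single fused pass. It is not the shader; the function names are hypothetical, and the layout (flat `idx = i0 + i1 * ne0`, one `a` / `inv_b` value per `i1`) is assumed from the description in this log:

```python
import math

def snake_naive(x, a, inv_b, ne0, ne1):
    """Five separate elementwise passes: mul, sin, sqr, mul, add."""
    n = ne0 * ne1
    t = [a[i // ne0] * x[i] for i in range(n)]      # mul
    t = [math.sin(v) for v in t]                    # sin
    t = [v * v for v in t]                          # sqr
    t = [t[i] * inv_b[i // ne0] for i in range(n)]  # mul
    return [x[i] + t[i] for i in range(n)]          # add

def snake_fused(x, a, inv_b, ne0, ne1):
    """One pass computing y = x + sin(a*x)^2 * inv_b per element."""
    y = [0.0] * (ne0 * ne1)
    for i1 in range(ne1):
        for i0 in range(ne0):
            idx = i0 + i1 * ne0
            s = math.sin(a[i1] * x[idx])
            y[idx] = x[idx] + s * s * inv_b[i1]
    return y

x = [0.1 * i for i in range(6)]  # ne0 = 3, ne1 = 2
a, inv_b = [1.5, 0.5], [2.0, 4.0]
y_naive = snake_naive(x, a, inv_b, 3, 2)
y_fused = snake_fused(x, a, inv_b, 3, 2)
```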
* vulkan: address jeffbolznv review for fused snake activation
Rename T / C to ne0 / ne1 in the shader and push constants to match
the standard naming convention used across the Vulkan backend.
Tighten ggml_vk_can_fuse_snake: require x and dst to be contiguous
(the shader uses idx = i0 + i1 * ne0) and require a / inv_b to be
tightly packed on the broadcast dim (the shader reads data_a[i1]).
* vulkan: tighten snake fusion type checks for all operands (address jeffbolznv review)
* vulkan: reject snake fusion when ne[2] or ne[3] > 1 (address jeffbolznv review)
* vulkan: address 0cc4m review for fused snake activation
snake.comp is renamed to follow the ggml DATA_A_* / A_TYPE convention.
A_TYPE now applies to the activation tensor data_a instead of the
broadcast multiplier, and the bindings become data_a (A_TYPE), data_b
(float), data_c (float) and data_d (D_TYPE). A header at the top of
the shader maps each buffer to its role in y = x + sin(b * x)^2 * c.
On the C++ side, ggml_vk_can_fuse_snake reuses the existing snake_pattern
constant instead of duplicating the op list, sin_node is extracted as a
named local alongside the other chain nodes, and the broadcast operands
a and inv_b are now required to be GGML_TYPE_F32 to match the hardcoded
float bindings on data_b and data_c (the previous a->type == x->type
would silently reject any future BF16 or F16 chain once the supports_op
gate for SIN / SQR is lifted). ggml_vk_snake_dispatch_fused gets an
explicit GGML_TYPE_F32 case and GGML_ABORT on default in place of the
silent f32 fallback, and a stale comment about data_a[i1] / data_inv_b[i1]
is refreshed to match the new binding names.
* metal : fix GGML_OP_SET kernel threads
* tests : extend test_cpy to support different src/dst shapes
Extend test_cpy to support different source and destination tensor shapes
for CPY operations (reshaping), where the total number of elements must match.
- Renamed ne -> ne_src, added ne_dst parameter (default: use src shape)
- Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions
- Tests exercise 1024 boundary, small shapes, and large dimensionality changes
- Fixed dangling reference bug (storing & to temporary std::array)
- Updated all existing test calls with permute/transpose args for compatibility
Assisted-by: llama.cpp:local pi
* metal : optimize concat kernel with row batching for small widths
When ne0 < 256, batch multiple rows into a single threadgroup to improve
occupancy. This avoids underutilizing the GPU when processing narrow tensors.
- Dispatch nth = min(256, ne0) threads per group
- Calculate nrptg (rows per threadgroup) to fill up to 256 threads
- Update kernel index calculation to handle the row batching
- Add boundary check for i1 >= ne1
Assisted-by: llama.cpp:local pi
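The row-batching arithmetic above can be modelled roughly as follows. This is a host-side sketch, not the Metal kernel; the exact formula for `nrptg` is an assumption, and `plan_dispatch` / `rows_covered` are hypothetical names:

```python
MAX_THREADS = 256

def plan_dispatch(ne0, ne1):
    """Return (nth, nrptg, n_groups) for a tensor with ne1 rows of width ne0 >= 1."""
    nth = min(MAX_THREADS, ne0)         # threads assigned to one row
    nrptg = max(1, MAX_THREADS // nth)  # assumed: rows packed into one threadgroup
    n_groups = -(-ne1 // nrptg)         # ceiling division
    return nth, nrptg, n_groups

def rows_covered(ne0, ne1):
    """Rows actually processed, applying the i1 >= ne1 boundary check."""
    _, nrptg, n_groups = plan_dispatch(ne0, ne1)
    rows = []
    for tg in range(n_groups):
        for r in range(nrptg):
            i1 = tg * nrptg + r
            if i1 >= ne1:               # last group may be partially empty
                continue
            rows.append(i1)
    return rows

print(plan_dispatch(32, 10), rows_covered(32, 10))
```

With ne0 = 32 and ne1 = 10, each threadgroup handles 8 rows, so the second group has 6 idle row slots that the boundary check skips.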
* tests : clean-up
* tests : refactor CPY shape tests to use dimension permutations
Replace 75 hardcoded test cases with a loop over permutations of
{3, 5, 7, 32} (total elements: 3360). Each src permutation is tested
against canonical sorted and reverse dst, skipping identical shapes.
Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32).
Assisted-by: llama.cpp:local pi
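The enumeration described above can be sketched like this. It is an illustration, not the test code; it assumes shape tuples are ordered (ne0, ne1, ne2, ne3), and `cpy_shape_cases` is a hypothetical name:

```python
from itertools import permutations
from math import prod

DIMS = (3, 5, 7, 32)

def cpy_shape_cases(dims=DIMS):
    """Pair every src permutation with the sorted and reversed dst shapes."""
    canonical = tuple(sorted(dims))
    reverse = tuple(reversed(canonical))
    cases = []
    for src in permutations(dims):
        for dst in (canonical, reverse):
            if src == dst:  # skip identical shapes
                continue
            cases.append((src, dst))
    return cases

cases = cpy_shape_cases()
# Q4_0 is only eligible when both src and dst have ne0 == 32.
q4_0_cases = [(s, d) for s, d in cases if s[0] == 32 and d[0] == 32]
print(len(cases), len(q4_0_cases))
```

Every generated pair preserves the element count of 3360, which is the requirement for a reshaping CPY.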
* hexagon: remove gathers and better handling of vtcm in ssm-conv
* hexagon: relax ssm-conv gating requirements
* hexagon: add new prefill ssm-conv backend test
* hexagon: remove trailing white space
* hex-rope: uninline rope_cache_init, otherwise it breaks after rebasing with SSM_CONV changes
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* opencl: refactor initialization
* opencl: refactor GPU identification
* opencl: rename for consistency
* opencl: cache global mem size in dev_ctx
* opencl: adjust log level
* opencl: load argsort and flash_attn kernels in supports_op
* argsort kernel must be built for supports_op for querying the max
workgroups
* flash_attn kernel has many variants, only load them when needed
* hmx-mm: update debug logging in hmx-mm
* hmx-mm: update dequant logic to use HVX_vector_x2/4
* hmx-mm: remove non-pipelined version of the quantize matmul
It seems that we don't really need the non-pipelined version
* hmx-mm: use activation depth mode and update naming
Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
* hex-mm: minor hmx matmul naming updates
* hmx-mm: remove unused vars
* snapdragon: scripts bump default ubatch-size to 1K
* hexagon: combine HMX and power and clock settings into a single set_power call
* hmx-mm: remove leftover of the scale repl helper
* hexagon: fix editorconfig error
---------
Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
* Adds initial PDL setup.
* Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst.
* Further optimization pass of the first half of kernels
* Optimized PDL barriers for the second batch of kernels
* Further refinements after rebase.
* Moves pdl logic to separate function, removes some whitespace
* Strips post-hoc PDL logic
* Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to
overlap execution with previous kernels
* Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL
* Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL
* Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx,
to enable hip/musa compatibility
* Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32
* Enrolls flash_attn_combine_results
* Fix: Drops needless and broken check of CUDA arch for PDL. PDL either
works or is without effect.
* Enrolls flash-attention kernels to pdl
* Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for
kernels args. This fixes PDL.
* Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via
a template alias and template expansion
* Enrolls all remaining kernels for qwen3-coder-next into PDL
* Remove all PDL LC calls to create a baseline
* Added LC according to internal guidance and tested kernel performance.
* Enrolls missing qwen3.5 kernels passively into PDL.
* Kernel optimizations (LC signals) for qwen3.5
* Enrolls ssm-scan kernels into PDL
* Adds GGML_CUDA_PDL command line option to toggle PDL.
* Fix: Ada and lower compilation by guarding PDL calls correctly
* Cleanup: Removes commented out GGML_CUDA_PDL_LC
* Cleanup: Removes experimental comments
* Adds 90-virtual to build script so that Hopper GPUs can leverage PDL.
* Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL.
* Fix: Correct PDL en/disablement based on device-side arch check. Host
side check is UB. Required moving from macros to inlined functions
* Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1
* Enable PDL by default for Hopper+ devices
* Enrolls softcap_f32 and two flash_attn kernels into PDL.
* Improves flash attn PDL barrier placement
* Fix: Perf regression on ada; excludes ada and below from PDL launches
* Improves some sync barrier placements
* Drops superfluous constructor
* Adds #endif guard comments
* Reverts experimental change to top-k-moe.cu, which moved expensive allocations
in front of the PDL barrier. It did not have a meaningful impact.
* Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0
PDL is disabled
* Revert "Drops superfluous constructor". Adds const to remaining
arguments
This reverts commit 12b1d250da0089ae02a9bb71bbb3fd6d70f6f2f1.
* Cleanup: Removes and fixes some comments and whitespace
* Clarifies comment of sync-barrier position
* Relocates and refactors PDL launch functions and accessories
* Adds error checking to the regular kernel launch path
* Drops "auto" in favor of "ggml_cuda_kernel_params"
* Adds "const" to ggml_cuda_kernel_launch_params
* [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy