Commit Graph

4429 Commits

Author SHA1 Message Date
Ruben Ortlam 0c7c3ba570
vulkan: add get/set tensor 2d functions (llama/22514)
* vulkan: add get/set_tensor_2d functions

* fix backend interface comments

* Update ggml/src/ggml-metal/ggml-metal.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-01 13:07:35 +03:00
Johannes Gäßler 582d2562a4
CUDA: fix tile FA kernel on Pascal (llama/22541) 2026-05-01 13:07:35 +03:00
Rithik Sharma d74c56862b
add fast matmul iquants (llama/22504) 2026-05-01 13:07:35 +03:00
Max Krasnyansky 66392cf1a2
hexagon: make vmem and buffer-size configurable (llama/22487)
* hexagon: allow host to set max vmem size

We use a sane default but it's helpful to allow for an override if needed.

* hexagon: add support for measuring vmem space and move pinned mmaping management to host

* hexagon: update vmem checks to use uint64

* hexagon: bump op buffers to 16 (matches max mmaps)

* hexagon: bump default vmem to 3.2GB

* hexagon: add support for autodetecting vmem space and some logging cleanup in that area

* hexagon: fix whitespace warnings

* Update scripts/snapdragon/adb/run-cli.sh

Co-authored-by: Pascal <admin@serveurperso.com>

* hex-adb: fix run-completion script

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2026-05-01 13:07:35 +03:00
Anav Prasad aec8e69c2f
CUDA: fuse SSM_CONV + ADD(bias) + SILU (llama/22478) 2026-05-01 13:07:35 +03:00
shalinib-ibm 9f2cec1840
ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (llama/22293)
* ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault

vec_xst operations in the tiled path crash on AIX when writing
near 4KB page boundaries due to strict memory protection. Fall
back to mnpack implementation on AIX for stable execution.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>

* Update ggml/src/ggml-cpu/llamafile/sgemm.cpp

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Update sgemm.cpp

* Update sgemm.cpp

---------

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-05-01 13:07:34 +03:00
Georgi Gerganov c59a773605
examples : update to Q1_0 2026-05-01 13:07:33 +03:00
Georgi Gerganov 320c048724
sync : ggml 2026-04-30 21:44:28 +03:00
Georgi Gerganov ad670182d9
ggml : bump version to 0.10.1 (ggml/1469) 2026-04-30 11:29:23 +03:00
Aman Gupta 44e7803661
ggml-cuda: refactor fusion code (llama/22468)
* ggml-cuda: refactor fusion code

* apply formatting + make env variable truthy
2026-04-30 11:29:23 +03:00
qiurui144 6119537e9a
ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME (llama/22317)
* ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME

When GGML_CPU_RISCV64_SPACEMIT=ON is set, ime1_kernels.cpp contains
inline asm for the vmadot family which requires the xsmtvdotii custom
extension.(problem can see in some blogs and make sure in K3 platform)
The current CMakeLists does not include xsmtvdotii, so any toolchain
that honours the explicit -march (tested with SpacemiT GCC 15.2) fails
at the assembler stage:

  Error: unrecognized opcode `vmadot v16,v14,v0',
         extension `xsmtvdotii' required

Append _xsmtvdotii to MARCH_STR when GGML_CPU_RISCV64_SPACEMIT is
enabled so the IME path can actually build with a capable toolchain.
No effect on builds that leave GGML_CPU_RISCV64_SPACEMIT off.

toolchain from https://www.spacemit.com/community/resources-download/Tools

* Update ggml/src/ggml-cpu/CMakeLists.txt

Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>

---------

Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>
2026-04-30 11:29:23 +03:00
Reese Levine fa20229eeb
ggml-webgpu: Fix bug in FlashAttention support check (llama/22492)
* Fix flashattention support check for devices that don't support subgroups

* set path to none if kv_tile doesn't fit
2026-04-30 11:29:23 +03:00
hrushitfujitsu 3076725eb0
ggml : add sve tuned code for gemm_q8_0_4x8_q8_0() kernel (llama/21916)
* Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel

* Change arrays to static const in repack.cpp

---------

Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>
2026-04-30 11:29:23 +03:00
Johannes Gäßler 5301139374
TP: fix delayed AllReduce + zero-sized slices (llama/22489) 2026-04-30 11:29:23 +03:00
Michael Wand c200b588f8
ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (llama/22196) 2026-04-30 11:29:22 +03:00
lnigam b553e17071
ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286)
* ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32)

Adds MMA-f16 and tile kernel configs, dispatch logic, template instances,
and tile .cu file for Mistral Small 4 (head sizes 320/256), restricting to
ncols2=32 to support GQA ratio 32 only.

* Adding check to return BEST_FATTN_KERNEL_NONE in case GQA!=32

* Apply suggestions from code review

Address review comments

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments and making kernel config default to DQK=512, DV=512 instead of DQK=256,DV=256

* Fixed bug with sinks=1, with ncols=32, there are two warp-groups created but sinks index is same(0,...,15) for both the groups hence with sinks=1, output is not matching with CPU output. Added sink_base which will be base index for each warp_group (threadIdx.y / np)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-30 11:29:22 +03:00
Matt Corallo e69c109aac
vulkan: Coalesce Q4_K/Q5_K scale loads (llama/21751)
Some SPIR-V compilers (notably mesa) don't handle the current
vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well.
While reading three `u8`s from the 12-byte scale array should (at
least on some hardware) result in loading the full 12 bytes in a
single LOAD followed by whatever extraction is needed, at least
the ANV Intel driver really can't practically perform this
optimization.

`mesa`'s unsigned upper bound logic doesn't handle tracking bounds
through ternary, resulting in the `(is < 4) ? ... : is - 4` having
an infinite upper bound (as it cannot prove `is - 4` doesn't
underflow). While this could still be rectified if mesa looked at
the array bounds, it currently doesn't and `glslc` currently emits
SPIR-V that doesn't allow for this optimization anyway (though
maybe it will at some point, see
https://github.com/KhronosGroup/glslang/issues/4206).

In mul_mat_vecq we took a different approach to loading the same
fields. We read the first two bytes we needed from `scale` then
took a branch before deciding whether we needed to read a third
byte. In mesa this did, indeed, lead to a top-level branch with
conditional loads. As such these loads ended up not being
coalesced either (at least in the ANV driver) resulting in
additional instructions in our hot loop.

Instead, here, we go ahead and force loading the full 12 bytes and
extract the bits we need from the packed-u32s instead. In mul_mat
there's a few less ternaries and only one extra shift, so even on
drivers that did optimize the previous loads properly the only
material change should be pulling a few extra bytes into registers
(which on most hardware won't cost anything anyway, though
ironically on Intel it theoretically could). In mul_mat_vecq this
requires a bit of extra math and may read bytes from the u32 that
weren't needed, but it seems likely avoiding the branch is a win
on most platforms.

On Intel Xe2/mesa 26.0.4 with the optimizations from
https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162,

for shader matmul_id_subgroup_q4_k_f32_f16acc_aligned_l:
 * Instruction Count: 2753 -> 2688
 * SEND Count: 269 -> 261
 * Cycle Count: 273976 -> 266138
 * Max live registers: 248 -> 246
 * Non SSA regs after NIR: 381 -> 382

for shader matmul_id_subgroup_q5_k_f32_f16acc_aligned_l:
 * Instruction Count: 2767 -> 2702
 * SEND Count: 271 -> 263
 * Cycle Count: 274140 -> 268144
 * Max live registers: 248 -> 246
 * Non SSA regs after NIR: 381 -> 382

for shader mul_mat_vec_id_q4_k_q8_1_f32:
 * Instruction Count: 1930 -> 1646
 * SEND Count: 116 -> 71
 * Cycle Count: 1348306 -> 843350
 * Max live registers: 78 -> 84
 * Non SSA regs after NIR: 300 -> 135

for shader mul_mat_vec_id_q5_k_q8_1_f32:
 * Instruction Count: 2207 -> 1922
 * SEND Count: 131 -> 86
 * Cycle Count: 1392012 -> 1037836
 * Max live registers: 90 -> 90
 * Non SSA regs after NIR: 300 -> 135

for shader mul_mat_vec_q4_k_q8_1_f32:
 * Instruction Count: 2029 -> 1749
 * SEND Count: 111 -> 66
 * Cycle Count: 1347278 -> 840118
 * Max live registers: 74 -> 80
 * Non SSA regs after NIR: 299 -> 134

for shader mul_mat_vec_q5_k_q8_1_f32:
 * Instruction Count: 2307 -> 2022
 * SEND Count: 126 -> 81
 * Cycle Count: 1379820 -> 954042
 * Max live registers: 86 -> 86
 * Non SSA regs after NIR: 299 -> 134

On one Arc Pro B60, unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL:
 * pp512: 907.34 ± 9.28 -> 941.94 ± 10.53 (+4%)
 * pp2048: 897.95 ± 1.82 -> 931.55 ± 1.79 (+4%)
 * tg128: 49.49 ± 0.02 -> 49.86 ± 0.05 (+ <1%)

On one Arc Pro B60, unsloth/Qwen3.5-27B-GGUF:Q4_K_S:
 * pp512: 324.13 ± 10.52 -> 354.33 ± 6.81 (+9%)
 * pp2048: 329.80 ± 0.25 -> 357.10 ± 0.06 (+8%)
 * tg128: 17.11 ± 0.01 -> 18.11 ± 0.01 (+6%)

On four Arc Pro B60s, unsloth/Qwen3.5-122B-A10B-GGUF:Q5_K_S with
-sm layer (note that -sm tensor improvements will naturally be
less):
 * pp512: 264.55 ± 2.81 -> 280.45 ± 3.94 (+6%)
 * pp2048: 319.32 ± 2.72 -> 335.70 ± 3.48 (+5%)
 * tg128: 26.39 ± 0.01 -> 26.67 ± 0.01 (+1%)
2026-04-30 11:29:22 +03:00
Reese Levine 4ea5b6febc
ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (llama/22456)
* Refactor buffer aliasing to be part of shader lib decisions

* cleanup

* formatting
2026-04-30 11:29:22 +03:00
Jeff Bolz 35fa508360
vulkan: add barrier after writetimestamp (llama/21865) 2026-04-30 11:29:22 +03:00
Emil Askerov 0fa31f9bb6
ggml: improve SPIR-V headers detection with __has_include (llama/21918)
* ggml: improve SPIR-V headers detection with __has_include while preserving original _WIN32 logic

* Address review comments: fix fallback logic and add FreeBSD support

* Remove spirv_cross fallback as per review

* Remove redundant __has_include check
2026-04-30 11:29:22 +03:00
Adrien Gallouët 6fceff2eb4
ggml : skip already registered backends and devices (llama/22296)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-30 11:29:21 +03:00
Adrien Gallouët ca624d86ab
ggml : revert to -lm linking instead of find_library (llama/22355)
* ggml : revert to -lm linking instead of find_library

`find_library(MATH_LIBRARY m)` was introduced recently, but it breaks
CUDA compilation with GGML_STATIC. I could not find any valid use case
where we would prefer `find_library` over the standard `-lm` approach.

This commit is also meant to start a discussion if there is a valid
reason to keep `find_library(MATH_LIBRARY m)`, we should clarify what
problem it was solving and find an alternative fix that does not break
CUDA with GGML_STATIC.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml : use MATH_LIBRARY only if defined

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml : fix initial broken condition

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml : always respect MATH_LIBRARY when defined

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-30 11:29:21 +03:00
hipudding 70e4c0aec0
CANN: add new ops, optimize existing ops (llama/21204)
New operators:
- GGML_OP_SET: implement via aclnnInplaceCopy on target region
- GGML_OP_CUMSUM: implement via aclnnCumsum
- GGML_OP_FILL: implement via aclnnInplaceFillScalar
- GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides
- GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via
  aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets
- GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve
- GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus

Optimizations:
- GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu /
  aclnnGeGluV3 when applicable; fallback conditions now checked inside
  each function rather than at the call site
- CROSS_ENTROPY_LOSS: replace 5-kernel sequence (LogSoftmax→Mul→
  ReduceSum×2→Muls) with single aclnnSoftmaxCrossEntropyWithLogits call
- L2_NORM: fix in-place ClampMin on norm result (was clamping wrong
  tensor); add eps clamping before division to avoid divide-by-zero
- PAD_REFLECT_1D: eliminate per-ne[3] loop; assert contiguity and call
  ReflectionPad1d once on the full 4-D view; remove redundant nb copies
- GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor
  helper into gather_batched lambda with batch loop inlined
- SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice;
  refactor helper into scatter_batched lambda with batch loop inlined
- OUT_PROD: replace O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with
  per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast
  batch dims where ne02/ne03 may differ from ne2/ne3
- backend memset_tensor: implement via aclrtMemset (was NULL)

Bug fixes:
- COUNT_EQUAL: use non-inplace EqTensor into a same-type temporary
  buffer instead of InplaceEqTensor, avoiding corruption of src0
- ACL graph cache (USE_ACL_GRAPH): restore node_type and src_type[]
  fields in ggml_graph_node_properties; has_matching_properties() was
  missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to
  incorrectly share cached graphs and produce wrong results (ERR≈679)
- graph cache op_params matching: compare full GGML_MAX_OP_PARAMS
  bytes so that ops differing only in parameters are not incorrectly
  replayed from cache
2026-04-30 11:29:21 +03:00
Rithik Sharma 9c233f11f0
ggml-webgpu: add Q1_0 support (llama/22374)
* add fast matmul matvec q1_0 kernel

* ggml-webgpu: drop redundant zero-fills in Q1_0 shmem init
2026-04-30 11:29:21 +03:00
Rithik Sharma f675a8c926
add fast mat-vec kernels for i-quants (llama/22344) 2026-04-30 11:29:21 +03:00
unraido c9ba41397c
fix: rpc-server cache may not work in Windows environments (llama/22394)
* fix: create directory and log cache file name.

* Remove GGML_LOG_INFO conditional compilation.

---------

Co-authored-by: kotaro <kotaro.kusunoki@gmail.com>
2026-04-30 11:29:21 +03:00
Adrien Gallouët f5c3ce17d5
ggml : use 64 bytes aligned tile buffers (llama/21058)
| Model                            | Test   |   t/s OLD |   t/s NEW |   Speedup |
|:---------------------------------|:-------|----------:|----------:|----------:|
| qwen35 0.8B BF16                 | pp512  |    584.59 |    595.41 |      1.02 |
| qwen35 0.8B BF16                 | tg128  |     52.23 |     52.82 |      1.01 |
| qwen35 0.8B IQ2_M - 2.7 bpw      | pp512  |    260.64 |    261.70 |      1.00 |
| qwen35 0.8B IQ2_M - 2.7 bpw      | tg128  |     81.17 |     80.89 |      1.00 |
| qwen35 0.8B IQ2_XXS - 2.0625 bpw | pp512  |    302.36 |    302.56 |      1.00 |
| qwen35 0.8B IQ2_XXS - 2.0625 bpw | tg128  |     84.93 |     85.12 |      1.00 |
| qwen35 0.8B IQ3_XXS - 3.0625 bpw | pp512  |    263.22 |    260.01 |      0.99 |
| qwen35 0.8B IQ3_XXS - 3.0625 bpw | tg128  |     80.29 |     78.94 |      0.98 |
| qwen35 0.8B IQ4_NL - 4.5 bpw     | pp512  |    728.65 |    742.09 |      1.02 |
| qwen35 0.8B IQ4_NL - 4.5 bpw     | tg128  |     82.39 |     84.46 |      1.03 |
| qwen35 0.8B IQ4_XS - 4.25 bpw    | pp512  |    681.33 |    677.06 |      0.99 |
| qwen35 0.8B IQ4_XS - 4.25 bpw    | tg128  |     80.18 |     79.28 |      0.99 |
| qwen35 0.8B Q2_K_M               | pp512  |    413.28 |    415.94 |      1.01 |
| qwen35 0.8B Q2_K_M               | tg128  |     81.90 |     82.78 |      1.01 |
| qwen35 0.8B Q3_K_M               | pp512  |    493.17 |    495.08 |      1.00 |
| qwen35 0.8B Q3_K_M               | tg128  |     82.75 |     83.23 |      1.01 |
| qwen35 0.8B Q3_K_S               | pp512  |    429.35 |    427.64 |      1.00 |
| qwen35 0.8B Q3_K_S               | tg128  |     86.69 |     87.02 |      1.00 |
| qwen35 0.8B Q4_0                 | pp512  |    783.46 |    782.32 |      1.00 |
| qwen35 0.8B Q4_0                 | tg128  |     88.23 |     87.90 |      1.00 |
| qwen35 0.8B Q4_1                 | pp512  |    741.71 |    729.76 |      0.98 |
| qwen35 0.8B Q4_1                 | tg128  |     85.44 |     86.01 |      1.01 |
| qwen35 0.8B Q4_K_M               | pp512  |    676.24 |    681.31 |      1.01 |
| qwen35 0.8B Q4_K_M               | tg128  |     76.59 |     77.06 |      1.01 |
| qwen35 0.8B Q4_K_S               | pp512  |    683.12 |    688.81 |      1.01 |
| qwen35 0.8B Q4_K_S               | tg128  |     80.50 |     81.19 |      1.01 |
| qwen35 0.8B Q5_K_M               | pp512  |    635.33 |    642.11 |      1.01 |
| qwen35 0.8B Q5_K_M               | tg128  |     72.07 |     72.49 |      1.01 |
| qwen35 0.8B Q5_K_S               | pp512  |    660.95 |    658.18 |      1.00 |
| qwen35 0.8B Q5_K_S               | tg128  |     72.19 |     72.95 |      1.01 |
| qwen35 0.8B Q6_K                 | pp512  |    647.97 |    638.84 |      0.99 |
| qwen35 0.8B Q6_K                 | tg128  |     72.83 |     72.49 |      1.00 |
| qwen35 0.8B Q8_0                 | pp512  |    805.01 |    785.49 |      0.98 |
| qwen35 0.8B Q8_0                 | tg128  |     70.10 |     70.13 |      1.00 |

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-30 11:29:20 +03:00
Rithik Sharma 1478450e61
add performance-portable tuning for register-tile and subgroup matmul (llama/22241) 2026-04-30 11:29:20 +03:00
Gaurav Garg 7296b9c7fa
Fix recurrent state serialization for partial reads and writes (llama/22362)
The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server.
2026-04-30 11:29:20 +03:00
Oliver Simons 9bf6c3c860
CUDA: better coalesce data-access for contiguous concat (llama/22330)
Also, distribute all elements across CTAs evenly instead of launching
one CTA per dim
2026-04-30 11:29:20 +03:00
Sigbjørn Skjæret 2f3df42cdd
ggml-cpu : re-enable fast gelu_quick_f16 (llama/22339) 2026-04-30 11:29:20 +03:00
Eve 4e11277a19
ggml-cpu: optimize avx2 q6_k (llama/22345) 2026-04-30 11:29:20 +03:00
lhez 93a3f37642
opencl: add iq4_nl support (llama/22272)
* opencl: add general support for iq4_nl

* opencl: add iq4_nl gemm/gemv for adreno

* opencl: pack 2 lut entries into a uint
2026-04-30 11:29:19 +03:00
Trivikram Reddy 1be2adf7b3
hexagon: guard HMX clock request for v75+ platforms (llama/22377) 2026-04-30 11:29:19 +03:00
Johannes Gäßler da738a74f5
CUDA: reduce MMQ stream-k overhead (llama/22298)
* CUDA: reduce MMQ stream-k overhead

* use 32 bit integers for kbc
2026-04-30 11:29:19 +03:00
Developer-Ecosystem-Engineering 21da84303e
metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (llama/20962)
* Optimize Metal Tensor API usage for matmul2d

Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR.

The legacy simdgroup_matrix kernel is preserved under #else.

Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.

* cont : cleanup

* cont : cleanup

* cont : cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-30 11:29:19 +03:00
Neo Zhang 6296fd5a90
Optimize Q4_0 mul_mat for Arc770, add scripts (llama/22291)
* opt arc770 for Q4_0

* add for Q4_0

* update the script

* add help script for windows

* update guide

* fix format issue

* convert from dos to unix for format issue

* fix missed -sm parameter
2026-04-30 11:29:19 +03:00
Reese Levine c235b05d8a
ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (llama/22327)
* Implement ssm_scan

* Remove blocking in graph_compute and check for set rows

* Fix bindings

* Update op support
2026-04-30 11:29:19 +03:00
Trivikram Reddy c546b0b1bc
Hexagon: Bump HMX Frequency to Max Corner (llama/22334)
* hexagon: bump HMX freq to max corner

* hex-mm: fix error in log msg
2026-04-30 11:29:18 +03:00
Zheyuan Chen 35d679a4f8
ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (llama/22199)
* ggml-webgpu: add tile flash attention fallback

* ggml-webgpu: add new fields and discard usage of mnk for tile version

* ggml-webgpu: modify the vec path to discard the mnk parameter

* ggml-webgpu: enable flash attention vec and tile version for broswer

* ggml-webgpu: stagging KV for flash attention tile version

* formatting

* turn on subgroup uniformity check

* remove Q_TILE as it is always 1 for vec path

* make row_max and exp_sum to local register

* make different bindings with same underlying buffer to have the same usage flags

* move path selection into the shader library and have the host consume a single flash-attn decision object.

* turn off skip_validation and address buffer overlapping when nwg==1

* formatting

* merge binding when kv overlap
2026-04-30 11:29:18 +03:00
Mengsheng Wu 6576c4da90
hexagon: use DIRID 13 in libggml-htp.inf for modern InfVerif (llama/22306) 2026-04-30 11:29:18 +03:00
Georgi Gerganov 07d6db39e5
metal : print GPU description (llama/22318) 2026-04-30 11:29:18 +03:00
Georgi Gerganov dfb8b68799
ggml : minor coding style (llama/22308) 2026-04-30 11:29:18 +03:00
Mengsheng Wu 23921d5a69
hexagon: add SOLVE_TRI op (llama/21974)
* hexagon: add SOLVE_TRI op

* ggml: fix TODO description for solve_tri

* hexagon: rm unused variable/function warnings

* hexagon: chunk vs batch processingfor better thread utilization

* hexagon: vectorize partial f32 loads

* hexagon: move HVX f32 add/sub/mul wrappers to hvx-base.h

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2026-04-30 11:29:18 +03:00
Chen Yuan 641998f558
fix(shader): handle the buffer aliasing for rms fuse (llama/22266) 2026-04-30 11:29:17 +03:00
Max Krasnyansky 71b1ab3784
hexagon: add support for basic and extended Op profiling (llama/22269)
* hexagon: restore HTP_OPMASK_QUEUE

* hexagon: honor OPMASK_SKIP_COMPUTE in hmx-matmul

* hex-prof: restore op profiling

* hex-prof: enable PMU

* hexagon: simplify and improve op-queuing with full profiling support

Add separate profile descriptors.

* hexagon: remove opsync and rename opmask into opstage

opsync is no longer needed since the profiler is fully async now.
opmask name was confusing and opstage is more accurate.

* hexagon: refactor opbatch queue handling

* hexagon: add iface hooks for enabling profiler from the host

Also move all the PMU setup stuff out of the hex-utils since it's not inteded for normal use.

* hexagon: make profiler mode configurable

On older devices getting PMU counters is expensive so it's now optional.

* hexagon: add support for setting profiler pmu events from env

* hexagon: simplify profiler output (no need to print buffs, etc)

* hexagon: simplify pmu counter formating

* hexagon: add a simple profile post-proc tool

* hex-prof: add support for reading logs from stdin

* hexagon: document GGML_HEXAGON_PROFILE

* hex-prof: update default width for dims field

* hex-prof: fix linter warnings and errors

* Update ggml/src/ggml-hexagon/htp/htp-ops.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/snapdragon/ggml-hexagon-profile.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-30 11:29:17 +03:00
Georgi Gerganov 682ee99305
metal : fix event synchronization (llama/22260) 2026-04-30 11:29:17 +03:00
Georgi Gerganov 1aba061737
ggml-base: use MATH_LIBRARY variable instead of hardcoded 'm' (llama/22239)
Fixes #22237 — the find_library(MATH_LIBRARY m) result was being
discarded and the target linked against the literal 'm' string.

This prevents users from overriding the math library (e.g. for AMD AOCL)
via CMake variables. Now the discovered MATH_LIBRARY is used directly.
2026-04-30 11:29:17 +03:00
abotsis b938c5026c
sycl : fused MoE mul_mat_vec_q for TG (llama/21920)
* sycl : fused MoE mul_mat_vec_q for TG

Create an MMVQ kernel so ggml_sycl_mul_mat_id can consolidate
n_experts_used matmuls in a single kernel launch. The kernel
also reads expert IDs directly, removing a per-call host sync.

This is similar to the CUDA backend's ggml_cuda_mul_mat_vec_q*
paths.

All types supported in the current MMVQ are supported here as well:
Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0

It will fall back to the existing per-expert path when src0 has been rewritten
by opt_for_reorder(), and for any shape the fused path doesn't handle.

test-backend-ops passes for supported type/shape combos.

Benchmark: Qwen3-Next-35B-A3B Q4_K_M on Intel Arc B70 (SYCL0),
baseline 707c0b7a6, 16k context, -fa 0.

  build/bin/llama-bench -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
    -p 1024 -n 128 -d 16384 -ngl 99 -fa 0 -ub 2048 -r 2 -dev SYCL0

Before (3 runs on 707c0b7a6):

  | test            |            run 1 |            run 2 |            run 3 |
  | --------------- | ----------------:| ----------------:| ----------------:|
  | pp1024 @ d16384 |   533.26 ±  4.87 |   535.20 ±  2.78 |   524.27 ±  3.10 |
  | tg128  @ d16384 |    33.47 ±  0.02 |    33.31 ±  0.02 |    33.17 ±  0.05 |

After (3 runs on 707c0b7a6 + this patch):

  | test            |            run 1 |            run 2 |            run 3 |
  | --------------- | ----------------:| ----------------:| ----------------:|
  | pp1024 @ d16384 |   534.06 ±  0.97 |   531.95 ±  0.02 |   520.94 ± 20.10 |
  | tg128  @ d16384 |    45.85 ±  0.21 |    45.95 ±  0.45 |    46.22 ±  0.12 |

disclosure: Claude wrote it, but I reviewed and understand the implementation
(albeit my C is a little rusty).

* sycl: also support nvfp4 and mxfp4 expert types

* sycl: terser comments/nested dispatch in response to review

* sycl: more comment cleanup in mmvq.cpp/hpp

---------

Co-authored-by: Debian <aaron@openllmi.net.bots.is>
2026-04-30 11:29:17 +03:00
Chen Yuan df528c4f71
ggml-webgpu: add support for im2col (llama/22259)
* shader(im2col): implement the im2col shader

* shader(im2col): clean the formatting issues

* shader(im2col): clean the editorconfig checker warning

* fix(shader): address the workgroup issues of im2col and conv2d
2026-04-30 11:29:17 +03:00