whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Ruben Ortlam	0c7c3ba570	vulkan: add get/set tensor 2d functions (llama/22514) * vulkan: add get/set_tensor_2d functions * fix backend interface comments * Update ggml/src/ggml-metal/ggml-metal.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-01 13:07:35 +03:00
Johannes Gäßler	582d2562a4	CUDA: fix tile FA kernel on Pascal (llama/22541)	2026-05-01 13:07:35 +03:00
Rithik Sharma	d74c56862b	add fast matmul iquants (llama/22504)	2026-05-01 13:07:35 +03:00
Max Krasnyansky	66392cf1a2	hexagon: make vmem and buffer-size configurable (llama/22487) * hexagon: allow host to set max vmem size We use a sane default but it's helpful to allow for an override if needed. * hexagon: add support for measuring vmem space and move pinned mmaping management to host * hexagon: update vmem checks to use uint64 * hexagon: bump op buffers to 16 (matches max mmaps) * hexagon: bump default vmem to 3.2GB * hexagon: add support for autodetecting vmem space and some logging cleanup in that area * hexagon: fix whitespace warnings * Update scripts/snapdragon/adb/run-cli.sh Co-authored-by: Pascal <admin@serveurperso.com> * hex-adb: fix run-completion script --------- Co-authored-by: Pascal <admin@serveurperso.com>	2026-05-01 13:07:35 +03:00
Anav Prasad	aec8e69c2f	CUDA: fuse SSM_CONV + ADD(bias) + SILU (llama/22478)	2026-05-01 13:07:35 +03:00
shalinib-ibm	9f2cec1840	ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (llama/22293) * ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault vec_xst operations in the tiled path crash on AIX when writing near 4KB page boundaries due to strict memory protection. Fall back to mnpack implementation on AIX for stable execution. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com> * Update ggml/src/ggml-cpu/llamafile/sgemm.cpp Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Update sgemm.cpp * Update sgemm.cpp --------- Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com> Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2026-05-01 13:07:34 +03:00
Georgi Gerganov	c59a773605	examples : update to Q1_0	2026-05-01 13:07:33 +03:00
Georgi Gerganov	320c048724	sync : ggml	2026-04-30 21:44:28 +03:00
Georgi Gerganov	ad670182d9	ggml : bump version to 0.10.1 (ggml/1469)	2026-04-30 11:29:23 +03:00
Aman Gupta	44e7803661	ggml-cuda: refactor fusion code (llama/22468) * ggml-cuda: refactor fusion code * apply formatting + make env variable truthy	2026-04-30 11:29:23 +03:00
qiurui144	6119537e9a	ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME (llama/22317) * ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME When GGML_CPU_RISCV64_SPACEMIT=ON is set, ime1_kernels.cpp contains inline asm for the vmadot family which requires the xsmtvdotii custom extension.(problem can see in some blogs and make sure in K3 platform) The current CMakeLists does not include xsmtvdotii, so any toolchain that honours the explicit -march (tested with SpacemiT GCC 15.2) fails at the assembler stage: Error: unrecognized opcode `vmadot v16,v14,v0', extension `xsmtvdotii' required Append _xsmtvdotii to MARCH_STR when GGML_CPU_RISCV64_SPACEMIT is enabled so the IME path can actually build with a capable toolchain. No effect on builds that leave GGML_CPU_RISCV64_SPACEMIT off. toolchain from https://www.spacemit.com/community/resources-download/Tools * Update ggml/src/ggml-cpu/CMakeLists.txt Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com> --------- Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>	2026-04-30 11:29:23 +03:00
Reese Levine	fa20229eeb	ggml-webgpu: Fix bug in FlashAttention support check (llama/22492) * Fix flashattention support check for devices that don't support subgroups * set path to none if kv_tile doesn't fit	2026-04-30 11:29:23 +03:00
hrushitfujitsu	3076725eb0	ggml : add sve tuned code for gemm_q8_0_4x8_q8_0() kernel (llama/21916) * Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel * Change arrays to static const in repack.cpp --------- Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>	2026-04-30 11:29:23 +03:00
Johannes Gäßler	5301139374	TP: fix delayed AllReduce + zero-sized slices (llama/22489)	2026-04-30 11:29:23 +03:00
Michael Wand	c200b588f8	ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (llama/22196)	2026-04-30 11:29:22 +03:00
lnigam	b553e17071	ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286 ) * ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32) Adds MMA-f16 and tile kernel configs, dispatch logic, template instances, and tile .cu file for Mistral Small 4 (head sizes 320/256), restricting to ncols2=32 to support GQA ratio 32 only. * Adding check to return BEST_FATTN_KERNEL_NONE in case GQA!=32 * Apply suggestions from code review Address review comments Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments and making kernel config default to DQK=512, DV=512 instead of DQK=256,DV=256 * Fixed bug with sinks=1, with ncols=32, there are two warp-groups created but sinks index is same(0,...,15) for both the groups hence with sinks=1, output is not matching with CPU output. Added sink_base which will be base index for each warp_group (threadIdx.y / np) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-04-30 11:29:22 +03:00
Matt Corallo	e69c109aac	vulkan: Coalesce Q4_K/Q5_K scale loads (llama/21751) Some SPIR-V compilers (notably mesa) don't handle the current vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well. While reading three `u8`s from the 12-byte scale array should (at least on some hardware) result in loading the full 12 bytes in a single LOAD followed by whatever extraction is needed, at least the ANV Intel driver really can't practically perform this optimization. `mesa`'s unsigned upper bound logic doesn't handle tracking bounds through ternary, resulting in the `(is < 4) ? ... : is - 4` having an infinite upper bound (as it cannot prove `is - 4` doesn't underflow). While this could still be rectified if mesa looked at the array bounds, it currently doesn't and `glslc` currently emits SPIR-V that doesn't allow for this optimization anyway (though maybe it will at some point, see https://github.com/KhronosGroup/glslang/issues/4206). In mul_mat_vecq we took a different approach to loading the same fields. We read the first two bytes we needed from `scale` then took a branch before deciding whether we needed to read a third byte. In mesa this did, indeed, lead to a top-level branch with conditional loads. As such these loads ended up not being coalesced either (at least in the ANV driver) resulting in additional instructions in our hot loop. Instead, here, we go ahead and force loading the full 12 bytes and extract the bits we need from the packed-u32s instead. In mul_mat there's a few less ternaries and only one extra shift, so even on drivers that did optimize the previous loads properly the only material change should be pulling a few extra bytes into registers (which on most hardware won't cost anything anyway, though ironically on Intel it theoretically could). In mul_mat_vecq this requires a bit of extra math and may read bytes from the u32 that weren't needed, but it seems likely avoiding the branch is a win on most platforms. On Intel Xe2/mesa 26.0.4 with the optimizations from https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162, for shader matmul_id_subgroup_q4_k_f32_f16acc_aligned_l: * Instruction Count: 2753 -> 2688 * SEND Count: 269 -> 261 * Cycle Count: 273976 -> 266138 * Max live registers: 248 -> 246 * Non SSA regs after NIR: 381 -> 382 for shader matmul_id_subgroup_q5_k_f32_f16acc_aligned_l: * Instruction Count: 2767 -> 2702 * SEND Count: 271 -> 263 * Cycle Count: 274140 -> 268144 * Max live registers: 248 -> 246 * Non SSA regs after NIR: 381 -> 382 for shader mul_mat_vec_id_q4_k_q8_1_f32: * Instruction Count: 1930 -> 1646 * SEND Count: 116 -> 71 * Cycle Count: 1348306 -> 843350 * Max live registers: 78 -> 84 * Non SSA regs after NIR: 300 -> 135 for shader mul_mat_vec_id_q5_k_q8_1_f32: * Instruction Count: 2207 -> 1922 * SEND Count: 131 -> 86 * Cycle Count: 1392012 -> 1037836 * Max live registers: 90 -> 90 * Non SSA regs after NIR: 300 -> 135 for shader mul_mat_vec_q4_k_q8_1_f32: * Instruction Count: 2029 -> 1749 * SEND Count: 111 -> 66 * Cycle Count: 1347278 -> 840118 * Max live registers: 74 -> 80 * Non SSA regs after NIR: 299 -> 134 for shader mul_mat_vec_q5_k_q8_1_f32: * Instruction Count: 2307 -> 2022 * SEND Count: 126 -> 81 * Cycle Count: 1379820 -> 954042 * Max live registers: 86 -> 86 * Non SSA regs after NIR: 299 -> 134 On one Arc Pro B60, unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL: * pp512: 907.34 ± 9.28 -> 941.94 ± 10.53 (+4%) * pp2048: 897.95 ± 1.82 -> 931.55 ± 1.79 (+4%) * tg128: 49.49 ± 0.02 -> 49.86 ± 0.05 (+ <1%) On one Arc Pro B60, unsloth/Qwen3.5-27B-GGUF:Q4_K_S: * pp512: 324.13 ± 10.52 -> 354.33 ± 6.81 (+9%) * pp2048: 329.80 ± 0.25 -> 357.10 ± 0.06 (+8%) * tg128: 17.11 ± 0.01 -> 18.11 ± 0.01 (+6%) On four Arc Pro B60s, unsloth/Qwen3.5-122B-A10B-GGUF:Q5_K_S with -sm layer (note that -sm tensor improvements will naturally be less): * pp512: 264.55 ± 2.81 -> 280.45 ± 3.94 (+6%) * pp2048: 319.32 ± 2.72 -> 335.70 ± 3.48 (+5%) * tg128: 26.39 ± 0.01 -> 26.67 ± 0.01 (+1%)	2026-04-30 11:29:22 +03:00
Reese Levine	4ea5b6febc	ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (llama/22456) * Refactor buffer aliasing to be part of shader lib decisions * cleanup * formatting	2026-04-30 11:29:22 +03:00
Jeff Bolz	35fa508360	vulkan: add barrier after writetimestamp (llama/21865)	2026-04-30 11:29:22 +03:00
Emil Askerov	0fa31f9bb6	ggml: improve SPIR-V headers detection with __has_include (llama/21918) * ggml: improve SPIR-V headers detection with __has_include while preserving original _WIN32 logic * Address review comments: fix fallback logic and add FreeBSD support * Remove spirv_cross fallback as per review * Remove redundant __has_include check	2026-04-30 11:29:22 +03:00
Adrien Gallouët	6fceff2eb4	ggml : skip already registered backends and devices (llama/22296) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-30 11:29:21 +03:00
Adrien Gallouët	ca624d86ab	ggml : revert to -lm linking instead of find_library (llama/22355) * ggml : revert to -lm linking instead of find_library `find_library(MATH_LIBRARY m)` was introduced recently, but it breaks CUDA compilation with GGML_STATIC. I could not find any valid use case where we would prefer `find_library` over the standard `-lm` approach. This commit is also meant to start a discussion if there is a valid reason to keep `find_library(MATH_LIBRARY m)`, we should clarify what problem it was solving and find an alternative fix that does not break CUDA with GGML_STATIC. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml : use MATH_LIBRARY only if defined Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml : fix initial broken condition Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml : always respect MATH_LIBRARY when defined Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-30 11:29:21 +03:00
hipudding	70e4c0aec0	CANN: add new ops, optimize existing ops (llama/21204) New operators: - GGML_OP_SET: implement via aclnnInplaceCopy on target region - GGML_OP_CUMSUM: implement via aclnnCumsum - GGML_OP_FILL: implement via aclnnInplaceFillScalar - GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides - GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets - GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve - GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus Optimizations: - GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu / aclnnGeGluV3 when applicable; fallback conditions now checked inside each function rather than at the call site - CROSS_ENTROPY_LOSS: replace 5-kernel sequence (LogSoftmax→Mul→ ReduceSum×2→Muls) with single aclnnSoftmaxCrossEntropyWithLogits call - L2_NORM: fix in-place ClampMin on norm result (was clamping wrong tensor); add eps clamping before division to avoid divide-by-zero - PAD_REFLECT_1D: eliminate per-ne[3] loop; assert contiguity and call ReflectionPad1d once on the full 4-D view; remove redundant nb copies - GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor helper into gather_batched lambda with batch loop inlined - SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice; refactor helper into scatter_batched lambda with batch loop inlined - OUT_PROD: replace O(ne[3]ne[2]ne[1]) Ger+InplaceAdd loop with per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast batch dims where ne02/ne03 may differ from ne2/ne3 - backend memset_tensor: implement via aclrtMemset (was NULL) Bug fixes: - COUNT_EQUAL: use non-inplace EqTensor into a same-type temporary buffer instead of InplaceEqTensor, avoiding corruption of src0 - ACL graph cache (USE_ACL_GRAPH): restore node_type and src_type[] fields in ggml_graph_node_properties; has_matching_properties() was missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to incorrectly share cached graphs and produce wrong results (ERR≈679) - graph cache op_params matching: compare full GGML_MAX_OP_PARAMS bytes so that ops differing only in parameters are not incorrectly replayed from cache	2026-04-30 11:29:21 +03:00
Rithik Sharma	9c233f11f0	ggml-webgpu: add Q1_0 support (llama/22374) * add fast matmul matvec q1_0 kernel * ggml-webgpu: drop redundant zero-fills in Q1_0 shmem init	2026-04-30 11:29:21 +03:00
Rithik Sharma	f675a8c926	add fast mat-vec kernels for i-quants (llama/22344)	2026-04-30 11:29:21 +03:00
unraido	c9ba41397c	fix: rpc-server cache may not work in Windows environments (llama/22394) * fix: create directory and log cache file name. * Remove GGML_LOG_INFO conditional compilation. --------- Co-authored-by: kotaro <kotaro.kusunoki@gmail.com>	2026-04-30 11:29:21 +03:00
Adrien Gallouët	f5c3ce17d5	ggml : use 64 bytes aligned tile buffers (llama/21058) \| Model \| Test \| t/s OLD \| t/s NEW \| Speedup \| \|:---------------------------------\|:-------\|----------:\|----------:\|----------:\| \| qwen35 0.8B BF16 \| pp512 \| 584.59 \| 595.41 \| 1.02 \| \| qwen35 0.8B BF16 \| tg128 \| 52.23 \| 52.82 \| 1.01 \| \| qwen35 0.8B IQ2_M - 2.7 bpw \| pp512 \| 260.64 \| 261.70 \| 1.00 \| \| qwen35 0.8B IQ2_M - 2.7 bpw \| tg128 \| 81.17 \| 80.89 \| 1.00 \| \| qwen35 0.8B IQ2_XXS - 2.0625 bpw \| pp512 \| 302.36 \| 302.56 \| 1.00 \| \| qwen35 0.8B IQ2_XXS - 2.0625 bpw \| tg128 \| 84.93 \| 85.12 \| 1.00 \| \| qwen35 0.8B IQ3_XXS - 3.0625 bpw \| pp512 \| 263.22 \| 260.01 \| 0.99 \| \| qwen35 0.8B IQ3_XXS - 3.0625 bpw \| tg128 \| 80.29 \| 78.94 \| 0.98 \| \| qwen35 0.8B IQ4_NL - 4.5 bpw \| pp512 \| 728.65 \| 742.09 \| 1.02 \| \| qwen35 0.8B IQ4_NL - 4.5 bpw \| tg128 \| 82.39 \| 84.46 \| 1.03 \| \| qwen35 0.8B IQ4_XS - 4.25 bpw \| pp512 \| 681.33 \| 677.06 \| 0.99 \| \| qwen35 0.8B IQ4_XS - 4.25 bpw \| tg128 \| 80.18 \| 79.28 \| 0.99 \| \| qwen35 0.8B Q2_K_M \| pp512 \| 413.28 \| 415.94 \| 1.01 \| \| qwen35 0.8B Q2_K_M \| tg128 \| 81.90 \| 82.78 \| 1.01 \| \| qwen35 0.8B Q3_K_M \| pp512 \| 493.17 \| 495.08 \| 1.00 \| \| qwen35 0.8B Q3_K_M \| tg128 \| 82.75 \| 83.23 \| 1.01 \| \| qwen35 0.8B Q3_K_S \| pp512 \| 429.35 \| 427.64 \| 1.00 \| \| qwen35 0.8B Q3_K_S \| tg128 \| 86.69 \| 87.02 \| 1.00 \| \| qwen35 0.8B Q4_0 \| pp512 \| 783.46 \| 782.32 \| 1.00 \| \| qwen35 0.8B Q4_0 \| tg128 \| 88.23 \| 87.90 \| 1.00 \| \| qwen35 0.8B Q4_1 \| pp512 \| 741.71 \| 729.76 \| 0.98 \| \| qwen35 0.8B Q4_1 \| tg128 \| 85.44 \| 86.01 \| 1.01 \| \| qwen35 0.8B Q4_K_M \| pp512 \| 676.24 \| 681.31 \| 1.01 \| \| qwen35 0.8B Q4_K_M \| tg128 \| 76.59 \| 77.06 \| 1.01 \| \| qwen35 0.8B Q4_K_S \| pp512 \| 683.12 \| 688.81 \| 1.01 \| \| qwen35 0.8B Q4_K_S \| tg128 \| 80.50 \| 81.19 \| 1.01 \| \| qwen35 0.8B Q5_K_M \| pp512 \| 635.33 \| 642.11 \| 1.01 \| \| qwen35 0.8B Q5_K_M \| tg128 \| 72.07 \| 72.49 \| 1.01 \| \| qwen35 0.8B Q5_K_S \| pp512 \| 660.95 \| 658.18 \| 1.00 \| \| qwen35 0.8B Q5_K_S \| tg128 \| 72.19 \| 72.95 \| 1.01 \| \| qwen35 0.8B Q6_K \| pp512 \| 647.97 \| 638.84 \| 0.99 \| \| qwen35 0.8B Q6_K \| tg128 \| 72.83 \| 72.49 \| 1.00 \| \| qwen35 0.8B Q8_0 \| pp512 \| 805.01 \| 785.49 \| 0.98 \| \| qwen35 0.8B Q8_0 \| tg128 \| 70.10 \| 70.13 \| 1.00 \| Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-30 11:29:20 +03:00
Rithik Sharma	1478450e61	add performance-portable tuning for register-tile and subgroup matmul (llama/22241)	2026-04-30 11:29:20 +03:00
Gaurav Garg	7296b9c7fa	Fix recurrent state serialization for partial reads and writes (llama/22362) The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server.	2026-04-30 11:29:20 +03:00
Oliver Simons	9bf6c3c860	CUDA: better coalesce data-access for contiguous concat (llama/22330) Also, distribute all elements across CTAs evenly instead of launching one CTA per dim	2026-04-30 11:29:20 +03:00
Sigbjørn Skjæret	2f3df42cdd	ggml-cpu : re-enable fast gelu_quick_f16 (llama/22339)	2026-04-30 11:29:20 +03:00
Eve	4e11277a19	ggml-cpu: optimize avx2 q6_k (llama/22345)	2026-04-30 11:29:20 +03:00
lhez	93a3f37642	opencl: add iq4_nl support (llama/22272) * opencl: add general support for iq4_nl * opencl: add iq4_nl gemm/gemv for adreno * opencl: pack 2 lut entries into a uint	2026-04-30 11:29:19 +03:00
Trivikram Reddy	1be2adf7b3	hexagon: guard HMX clock request for v75+ platforms (llama/22377)	2026-04-30 11:29:19 +03:00
Johannes Gäßler	da738a74f5	CUDA: reduce MMQ stream-k overhead (llama/22298) * CUDA: reduce MMQ stream-k overhead * use 32 bit integers for kbc	2026-04-30 11:29:19 +03:00
Developer-Ecosystem-Engineering	21da84303e	metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (llama/20962) * Optimize Metal Tensor API usage for matmul2d Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else. Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure. * cont : cleanup * cont : cleanup * cont : cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-30 11:29:19 +03:00
Neo Zhang	6296fd5a90	Optimize Q4_0 mul_mat for Arc770, add scripts (llama/22291) * opt arc770 for Q4_0 * add for Q4_0 * update the script * add help script for windows * update guide * fix format issue * convert from dos to unix for format issue * fix missed -sm parameter	2026-04-30 11:29:19 +03:00
Reese Levine	c235b05d8a	ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (llama/22327) * Implement ssm_scan * Remove blocking in graph_compute and check for set rows * Fix bindings * Update op support	2026-04-30 11:29:19 +03:00
Trivikram Reddy	c546b0b1bc	Hexagon: Bump HMX Frequency to Max Corner (llama/22334) * hexagon: bump HMX freq to max corner * hex-mm: fix error in log msg	2026-04-30 11:29:18 +03:00
Zheyuan Chen	35d679a4f8	ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (llama/22199) * ggml-webgpu: add tile flash attention fallback * ggml-webgpu: add new fields and discard usage of mnk for tile version * ggml-webgpu: modify the vec path to discard the mnk parameter * ggml-webgpu: enable flash attention vec and tile version for broswer * ggml-webgpu: stagging KV for flash attention tile version * formatting * turn on subgroup uniformity check * remove Q_TILE as it is always 1 for vec path * make row_max and exp_sum to local register * make different bindings with same underlying buffer to have the same usage flags * move path selection into the shader library and have the host consume a single flash-attn decision object. * turn off skip_validation and address buffer overlapping when nwg==1 * formatting * merge binding when kv overlap	2026-04-30 11:29:18 +03:00
Mengsheng Wu	6576c4da90	hexagon: use DIRID 13 in libggml-htp.inf for modern InfVerif (llama/22306)	2026-04-30 11:29:18 +03:00
Georgi Gerganov	07d6db39e5	metal : print GPU description (llama/22318)	2026-04-30 11:29:18 +03:00
Georgi Gerganov	dfb8b68799	ggml : minor coding style (llama/22308)	2026-04-30 11:29:18 +03:00
Mengsheng Wu	23921d5a69	hexagon: add SOLVE_TRI op (llama/21974) * hexagon: add SOLVE_TRI op * ggml: fix TODO description for solve_tri * hexagon: rm unused variable/function warnings * hexagon: chunk vs batch processingfor better thread utilization * hexagon: vectorize partial f32 loads * hexagon: move HVX f32 add/sub/mul wrappers to hvx-base.h --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2026-04-30 11:29:18 +03:00
Chen Yuan	641998f558	fix(shader): handle the buffer aliasing for rms fuse (llama/22266)	2026-04-30 11:29:17 +03:00
Max Krasnyansky	71b1ab3784	hexagon: add support for basic and extended Op profiling (llama/22269) * hexagon: restore HTP_OPMASK_QUEUE * hexagon: honor OPMASK_SKIP_COMPUTE in hmx-matmul * hex-prof: restore op profiling * hex-prof: enable PMU * hexagon: simplify and improve op-queuing with full profiling support Add separate profile descriptors. * hexagon: remove opsync and rename opmask into opstage opsync is no longer needed since the profiler is fully async now. opmask name was confusing and opstage is more accurate. * hexagon: refactor opbatch queue handling * hexagon: add iface hooks for enabling profiler from the host Also move all the PMU setup stuff out of the hex-utils since it's not inteded for normal use. * hexagon: make profiler mode configurable On older devices getting PMU counters is expensive so it's now optional. * hexagon: add support for setting profiler pmu events from env * hexagon: simplify profiler output (no need to print buffs, etc) * hexagon: simplify pmu counter formating * hexagon: add a simple profile post-proc tool * hex-prof: add support for reading logs from stdin * hexagon: document GGML_HEXAGON_PROFILE * hex-prof: update default width for dims field * hex-prof: fix linter warnings and errors * Update ggml/src/ggml-hexagon/htp/htp-ops.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/snapdragon/ggml-hexagon-profile.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-30 11:29:17 +03:00
Georgi Gerganov	682ee99305	metal : fix event synchronization (llama/22260)	2026-04-30 11:29:17 +03:00
Georgi Gerganov	1aba061737	ggml-base: use MATH_LIBRARY variable instead of hardcoded 'm' (llama/22239) Fixes #22237 — the find_library(MATH_LIBRARY m) result was being discarded and the target linked against the literal 'm' string. This prevents users from overriding the math library (e.g. for AMD AOCL) via CMake variables. Now the discovered MATH_LIBRARY is used directly.	2026-04-30 11:29:17 +03:00
abotsis	b938c5026c	sycl : fused MoE mul_mat_vec_q for TG (llama/21920) * sycl : fused MoE mul_mat_vec_q for TG Create an MMVQ kernel so ggml_sycl_mul_mat_id can consolidate n_experts_used matmuls in a single kernel launch. The kernel also reads expert IDs directly, removing a per-call host sync. This is similar to the CUDA backend's ggml_cuda_mul_mat_vec_q* paths. All types supported in the current MMVQ are supported here as well: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0 It will fall back to the existing per-expert path when src0 has been rewritten by opt_for_reorder(), and for any shape the fused path doesn't handle. test-backend-ops passes for supported type/shape combos. Benchmark: Qwen3-Next-35B-A3B Q4_K_M on Intel Arc B70 (SYCL0), baseline 707c0b7a6, 16k context, -fa 0. build/bin/llama-bench -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \ -p 1024 -n 128 -d 16384 -ngl 99 -fa 0 -ub 2048 -r 2 -dev SYCL0 Before (3 runs on 707c0b7a6): \| test \| run 1 \| run 2 \| run 3 \| \| --------------- \| ----------------:\| ----------------:\| ----------------:\| \| pp1024 @ d16384 \| 533.26 ± 4.87 \| 535.20 ± 2.78 \| 524.27 ± 3.10 \| \| tg128 @ d16384 \| 33.47 ± 0.02 \| 33.31 ± 0.02 \| 33.17 ± 0.05 \| After (3 runs on 707c0b7a6 + this patch): \| test \| run 1 \| run 2 \| run 3 \| \| --------------- \| ----------------:\| ----------------:\| ----------------:\| \| pp1024 @ d16384 \| 534.06 ± 0.97 \| 531.95 ± 0.02 \| 520.94 ± 20.10 \| \| tg128 @ d16384 \| 45.85 ± 0.21 \| 45.95 ± 0.45 \| 46.22 ± 0.12 \| disclosure: Claude wrote it, but I reviewed and understand the implementation (albeit my C is a little rusty). * sycl: also support nvfp4 and mxfp4 expert types * sycl: terser comments/nested dispatch in response to review * sycl: more comment cleanup in mmvq.cpp/hpp --------- Co-authored-by: Debian <aaron@openllmi.net.bots.is>	2026-04-30 11:29:17 +03:00
Chen Yuan	df528c4f71	ggml-webgpu: add support for im2col (llama/22259) * shader(im2col): implement the im2col shader * shader(im2col): clean the formatting issues * shader(im2col): clean the editorconfig checker warning * fix(shader): address the workgroup issues of im2col and conv2d	2026-04-30 11:29:17 +03:00

1 2 3 4 5 ...

4429 Commits All Branches Search

4429 Commits

All Branches