* Make ggml_gated_delta_net take only the initial recurrent state (D, 1, n_seqs) and pass the snapshot count K as an op parameter instead of inferring it from state->ne[1].
Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy.
* Make GDN changes in all backends. Address review comments.
* Fix CI build errors
* vulkan: add support for valve fp16 dot2 extension
* use macro for dot2 path choice
* properly check for the feature
* add dot_product abstraction to reduce preprocessor branching
* cpu: add GGML_OP_COL2IM_1D
Add the overlap-add (scatter-add) step of a 1D transposed convolution.
A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight
pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input
with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d
scatters those columns back into the [T_out, OC] signal, with
T_out = (T_in - 1)*s0 + K - 2*p0.
Keeping the contraction as a plain mul_mat leaves the heavy work on the
optimized (and quantizable) matmul kernels, so col2im_1d only does the
cheap overlap-add.
CPU uses a gather formulation parallelized over output channels,
supporting F32, F16 and BF16 with an F32 accumulator.
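The gather formulation described above can be sketched in plain C++ as follows. This is a hypothetical reference, not the ggml kernel: the function name `col2im_1d_ref` and both memory layouts (in particular the k-major ordering inside each K*OC column) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference, not the ggml kernel. Assumed layouts:
//   col: T_in columns of K*OC floats, element (k, oc) of column t at col[t*K*OC + k*OC + oc]
//   out: OC rows of T_out floats, element t of channel oc at out[oc*T_out + t]
std::vector<float> col2im_1d_ref(const std::vector<float> & col,
                                 int T_in, int K, int OC, int s0, int p0) {
    const int T_out = (T_in - 1)*s0 + K - 2*p0;
    assert(s0 > 0 && T_out > 0);
    assert((int) col.size() == T_in*K*OC);

    std::vector<float> out((size_t) OC*T_out, 0.0f);
    for (int oc = 0; oc < OC; ++oc) {
        for (int t = 0; t < T_out; ++t) {
            const int tf = t + p0; // position in the uncropped signal
            float acc = 0.0f;      // F32 accumulator
            for (int k = 0; k < K; ++k) {
                const int d = tf - k;
                if (d < 0 || d % s0 != 0) continue; // no input column lands here for this tap
                const int ti = d / s0;
                if (ti >= T_in) continue;           // past the last input column
                acc += col[(size_t) ti*K*OC + (size_t) k*OC + oc];
            }
            out[(size_t) oc*T_out + t] = acc;
        }
    }
    return out;
}
```

Each output sample is written exactly once, so a gather needs no atomics and can be split across threads without coordination.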
* tests: add backend coverage for GGML_OP_COL2IM_1D
Add test_col2im_1d next to the conv_transpose_1d cases, covering F32,
F16 and BF16 across eight geometries: the canonical kernel = 2*stride
DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and
p0 = stride/2), kernel < stride with zeroed gaps, kernel not a
multiple of stride, and a single column unfold.
Perf mode gets three real vocoder stage shapes reporting memory
bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16.
* cpu: harden GGML_OP_COL2IM_1D
ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph
build time, before the oc division, protecting every backend at once.
The kernel asserts the contiguity its flat indexing assumes and its
doc states the full output length including the crop term.
The kernel parallelizes over the time axis: the split stays balanced
down to OC = 1, where the previous channel split was single threaded.
Values are bit identical on the three real vocoder chains, two out of
three improve.
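Why the time-axis split stays balanced can be seen from a minimal work-partitioning sketch. The helper name `thread_slice` and the ceil-divide scheme are assumptions for illustration, not the code in the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Half-open slice [first, second) of n work items owned by thread ith out of nth.
// Hypothetical helper: a ceil-divide split, one common way to partition a loop.
std::pair<int, int> thread_slice(int n, int ith, int nth) {
    const int per = (n + nth - 1) / nth;    // items per thread, rounded up
    const int lo  = std::min(n, per*ith);
    const int hi  = std::min(n, lo + per);
    return {lo, hi};
}
```

With OC = 1 and four threads, splitting over channels gives `thread_slice(1, 0, 4)` the only item and leaves the other three slices empty, whereas splitting a T_out of 1000 over time gives each thread 250 samples.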
* tests: extend the GGML_OP_COL2IM_1D grid
The eval grid grows to eleven geometries: OC = 1 (mono output stage),
K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and
a crop down to T_out = 2 where all the gather bounds act at once.
* tests: add col2im_1d equivalence test
tests/test-col2im-1d.cpp proves mul_mat + col2im_1d matches the
native ggml_conv_transpose_1d on the CPU backend, F32 bit exact, F16
and BF16 through casts of the column matrix. test-backend-ops cannot
cover this for a CPU only op since the CPU backend is its own
reference there.
* rpc: bump protocol patch version for GGML_OP_COL2IM_1D
GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the
static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the
op is appended and no existing op code shifts.
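The guard pattern looks roughly like the sketch below. All `demo_`/`DEMO_` names and values are hypothetical stand-ins; the real enum and macros live in ggml.h and ggml-rpc.h.

```cpp
#include <cassert>

// Hypothetical stand-in for the op enum.
enum demo_op {
    DEMO_OP_ADD,
    DEMO_OP_MUL_MAT,
    DEMO_OP_CONV_TRANSPOSE_1D,
    DEMO_OP_COL2IM_1D, // appended last, so every existing op code keeps its value
    DEMO_OP_COUNT,
};

// Hypothetical version macro, bumped together with the expected count below.
#define DEMO_PROTO_PATCH_VERSION 1

// Fails the build whenever an op is added without revisiting the protocol version.
static_assert(DEMO_OP_COUNT == 4, "op count changed: bump DEMO_PROTO_PATCH_VERSION");
```

Appending keeps older op codes valid on the wire, which is why a patch-level bump is enough here.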
* Only run webgpu CI on my fork
* Add webgpu only workflow
* handle buffer overlap case for concat operator
* restore build-webgpu.yml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Run clang-format
* Update ggml/src/ggml-webgpu/wgsl-shaders/concat.wgsl
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Only run webgpu CI on my fork
* Add webgpu only workflow
* Implement 2d workgroups for more operations
* fix
* Fix type
* Move back to global_invocation_id
This allows vec4 loads of the B elements. Also increase BK to 64 when this is
enabled. Neither of these alone is consistently faster, but together these give
a nice speedup.
In ggml-vulkan.cpp, we need to make sure the B matrix alignment and stride are
multiples of 4.
* cuda: reset device in get_memory function if no backend is active
* also count device and host buffers
* exclude hip and musa from counting and device reset
* use device mutex instead of atomic
* undo backend_free function move
* vulkan: add fwht support for Intel with shmem reduction
* don't use N as workgroup size
* disable subgroup shuffle on MoltenVK AMD
* disable fwht shader on Intel Windows due to driver bug
mmvq:
Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL.
Read weights once per dispatch instead of once per column.
Covers all standard quant types + reorder paths for Q4_0, Q8_0,
Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to
incompatible vec_dot signatures.
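The idea of reading each weight once and reusing it for every destination column can be sketched on the CPU as below. This is an illustrative float version with a hypothetical name and layout, not the SYCL kernel, which works on quantized blocks.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch. W is nrows x k row-major, X holds NCOLS columns of length k,
// Y holds NCOLS columns of length nrows.
template <int NCOLS>
void mat_vec_multi(const std::vector<float> & W, const std::vector<float> & X,
                   std::vector<float> & Y, int nrows, int k) {
    for (int r = 0; r < nrows; ++r) {
        float acc[NCOLS] = {};
        for (int i = 0; i < k; ++i) {
            const float w = W[r*k + i];       // each weight is read once...
            for (int c = 0; c < NCOLS; ++c) {
                acc[c] += w * X[c*k + i];     // ...and reused for every column
            }
        }
        for (int c = 0; c < NCOLS; ++c) {
            Y[c*nrows + r] = acc[c];
        }
    }
}
```

Making the column count a compile-time parameter lets the inner loop unroll fully, at the cost of one instantiation per supported count.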
ggml-sycl:
The weight reorder was only bootstrapped on single-token mat-vec
(ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec,
so it never triggered the reorder and ran on the slower non-reorder
kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.
* ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128
Optimize the inner loop of ggml_vec_dot_q4_1_q8_1_generic using
WASM SIMD128 intrinsics, gated behind #ifdef __wasm_simd128__ so
non-wasm builds are completely unaffected.
Approach:
- single wasm_v128_load covers all 32 packed 4-bit weights
- nibbles unpacked via AND/SHR into two u8x16 registers
- widened to i16 before multiply (WASM SIMD has no i8*i8 instruction)
- 4x wasm_i32x4_dot_i16x8 calls accumulate all 32 element pairs
- horizontal reduce via 4x wasm_i32x4_extract_lane
Benchmark (node v25, emcc -O3 -msimd128, 64 blocks x QK8_1=32,
200k iterations):
| impl | ns/call | speedup |
|--------|---------|---------|
| scalar | 880.7 | 1.00x |
| simd | 257.8 | 3.42x |
Correctness verified against scalar reference across 10 random seeds
with exact output match.
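For readers without a WASM toolchain, the steps listed under "Approach" can be mirrored in portable C++. This is a scalar sketch, not the SIMD128 code: the `demo_` structs store d, m and s as float for simplicity (an assumption; the real blocks use fp16), and the four partial sums only imitate the four i32 lanes.

```cpp
#include <cassert>
#include <cstdint>

constexpr int QK = 32; // elements per block

// Simplified stand-ins for the ggml block types (float scales instead of fp16).
struct demo_q4_1 { float d; float m; uint8_t qs[QK/2]; }; // 32 packed 4-bit weights
struct demo_q8_1 { float d; float s; int8_t  qs[QK];   }; // s = d * sum(qs)

float dot_q4_1_q8_1_block(const demo_q4_1 & x, const demo_q8_1 & y) {
    int32_t lanes[4] = {0, 0, 0, 0};          // imitates four i32 accumulator lanes
    for (int j = 0; j < QK/2; ++j) {
        const int16_t lo = x.qs[j] & 0x0F;    // low nibble pairs with element j
        const int16_t hi = x.qs[j] >> 4;      // high nibble pairs with element j + 16
        // operands are widened to 16 bits before the multiply, products summed in 32 bits
        lanes[j % 4] += lo * (int16_t) y.qs[j] + hi * (int16_t) y.qs[j + QK/2];
    }
    const int32_t sumi = lanes[0] + lanes[1] + lanes[2] + lanes[3]; // horizontal reduce
    return x.d * y.d * (float) sumi + x.m * y.s;
}
```

A full dot product sums this value over all blocks of the row.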
* ggml: move q4_1_q8_1 WASM SIMD implementation to wasm backend
Relocate the SIMD128 implementation of ggml_vec_dot_q4_1_q8_1 to ggml/src/ggml-cpu/arch/wasm/quants.c to follow architecture-specific layout. Restore the generic implementation in ggml/src/ggml-cpu/quants.c.
Move for loop in the else block.
* ggml: use generic q4_1_q8_1 fallback in wasm backend
* Start work on flash_attn refactor
* Refactor
* Split k/v quantization
* Refactor and abstract quantization logic for flash_attn and mul_mat
* Add quantization support to tile path
* formatting
* Move to functions, add a check
* Removes __restrict__ from PDL kernel headers due to incompatibility with
PDL. Adds preprocessor directives based on arch in kernel body to add
__restrict__ to retain performance on older architectures.
* Simplifies new __restrict__ usage via macro
* Add hopper to PDL __restrict__ fix.
Co-authored-by: Oliver Simons <osimons@nvidia.com>
---------
Co-authored-by: Oliver Simons <osimons@nvidia.com>
* cuda: reserve space for quantize kv-cache at startup
* address review comments
* remove forward decl
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* remove assert in ggml-cuda.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* hex-mm: initial support for F32 * F32 -> F32 matmuls
* hex-rms-norm: fix src1 stride use in fused rms_norm_mul
* hex-ops: clear spad pointers in the ops that clobber it
This fixes an odd case where fused rms-norm-mul was failing, but only in qwen3.5-2B and only at certain op-batch sizes.
* hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX
Decided to use Q4_0 * F32 -> F32 matmul for this.
Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16.
Super simple and pretty efficient.
* hmx-mm: route f16 2D matmuls through the same kernel used for all other types
* hmx-mm: re-introduce the pipelined vs non-pipelined modes that we used to have, but in a much more generic way
This update further improves matmul performance and at the same time removes most of the redundant logic
we had in different paths.
* hmx-fa: slightly improved pipeline similar to matmul updates
* hmx-mm: initial version of MUL_MAT_ID support for HMX
* hmx-mm: fixed mxfp4 handling for MUL_MAT_ID
* hex-gdn: optimize GATED_DELTA_NET
DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :)
* hmx-mm: missed one more case where we can use fastmod
* hexagon: update DCVS settings for a slight perf bump
* hmx-fa: use fastdiv in hmx-flash-attn
* hmx-fa: precompute slope values to avoid disrupting the inner loop
* hvx-utils/fa: new HVX helpers for powf and logf and using those to speed up FA alibi
* hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty
* hex-fa: correctly fall back to HVX if we have sinks or the dims are not quite right
* opencl: add general q5_0 support
* opencl: add general q5_1 support
* opencl: support non-uniform workgrp size
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.
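The "store in the native type, compute in float" pattern can be sketched as a host-side C++ template. Everything here is illustrative: the toy `bf16` type stands in for the device half type (a deliberate substitution, chosen because its conversion is a plain bit shift), and `to_float`, `from_float` and `swiglu` are hypothetical names, not the kernel's.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy 16-bit storage type (bfloat16 by truncation), standing in for the device half type.
struct bf16 { uint16_t bits; };

inline float to_float(float v) { return v; }
inline float to_float(bf16 v) {
    const uint32_t u = (uint32_t) v.bits << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

template <typename T> T from_float(float f);
template <> inline float from_float<float>(float f) { return f; }
template <> inline bf16 from_float<bf16>(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return bf16{ (uint16_t) (u >> 16) }; // keep the top 16 bits
}

// One template for every storage type: load as T, compute in float, store as T.
template <typename T>
void swiglu(const std::vector<T> & x, const std::vector<T> & g, std::vector<T> & dst) {
    for (size_t i = 0; i < x.size(); ++i) {
        const float xf   = to_float(x[i]);
        const float gf   = to_float(g[i]);
        const float silu = xf / (1.0f + std::exp(-xf)); // computed in float
        dst[i] = from_float<T>(silu * gf);
    }
}
```

Only the two conversion helpers depend on the storage type, so supporting another type means adding one overload pair rather than another kernel.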
* vulkan: don't hold the device mutex while compiling pipelines
We need to hold a lock while we traverse all pipelines and lazily initialize
them, but we don't need to hold it while the pipeline is being compiled. And
it doesn't need to be the same lock as the device mutex. We call load_shaders
each time a pipeline is needed, so we only need to compile that one pipeline
(and, for example, don't want to end up compiling a pipeline that another
thread should be compiling).
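A minimal sketch of this locking scheme, using only the C++ standard library, is shown below. The `pipeline_cache` type, its members and the placeholder `slow_compile` are hypothetical, and having other threads wait on a condition variable is an assumption about how a second requester is handled, not a description of ggml-vulkan.cpp.

```cpp
#include <cassert>
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

struct pipeline_cache {
    struct entry { bool compiling = false; bool ready = false; int handle = 0; };

    std::mutex                   compile_mutex; // separate from any device-wide mutex
    std::condition_variable      compile_cv;
    std::map<std::string, entry> entries;       // std::map keeps references stable
    int                          compile_count = 0;

    // Placeholder for the expensive driver call.
    int slow_compile(const std::string & name) { return (int) name.size(); }

    int get(const std::string & name) {
        std::unique_lock<std::mutex> lock(compile_mutex);
        entry & e = entries[name];
        if (e.ready) {
            return e.handle;
        }
        if (e.compiling) {
            // another thread owns this pipeline: wait instead of compiling it twice
            compile_cv.wait(lock, [&e] { return e.ready; });
            return e.handle;
        }
        e.compiling = true;
        lock.unlock();                           // do not hold the lock while compiling
        const int handle = slow_compile(name);
        lock.lock();
        e.handle    = handle;
        e.ready     = true;
        e.compiling = false;
        ++compile_count;
        compile_cv.notify_all();
        return handle;
    }
};
```

The lock covers only the bookkeeping, so threads that need different pipelines can compile them concurrently.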
* remove 'needed'
Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even
though they're only 2-byte aligned, and Q3_K still wins on
NVIDIA as well.
mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, we can do
subtraction directly on a full int32_t rather than an i8vec4
with bit twiddling because the high bit is always free to start.
On Intel BMG on mesa, the switch to MMVQ provides an immediate
~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and
~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
The further switch to block loads leads to a ~24% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA
override for K quants on Xe2 as well.
* add support for Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in the GET_ROWS op
* correct the link