* cuda: reserve space for quantize kv-cache at startup
* address review comments
* remove forward decl
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* remove assert in ggml-cuda.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* hex-mm: initial support for F32 * F32 -> F32 matmuls
* hex-rms-norm: fix src1 stride use in fused rms_norm_mul
* hex-ops: clear spad pointers in the ops that clober it
This fixes an odd case where fused rms-norm-mul was failing but only in qwen3.5-2B and only at searth op-bath sizes.
* hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX
Decided to use Q4_0 * F32 -> F32 matmul for this.
Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16.
Super simple and pretty efficient.
* hmx-mm: route f16 2D matmuls through the same kernel used for all other types
* hmx-mm: re-introduce pipelined vs non-pipelined mode that we used to have but is much more generic way
This update futher improves matmul performance and at the same time removes most of the redudant logic
we had in different paths.
* hmx-fa: slighlty improved pipeline simimar to matmul updates
* hmx-mm: initial version of MAT_MUL_ID support for HMX
* hmx-mm: fixed mxfp4 handling for MUL_MAT_ID
* hex-gdn: optimize GATED_DELTA_NET
DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :)
* hmx-mm: missed one more case where we can use fastmod
* hexagon: update DCVS settings for a slight perf bump
* hmx-fa: use fastdiv in hmx-flash-attn
* hmx-fa: precompute slope values to avoid disrupting the inner loop
* hvx-utils/fa: new HVX helpers for powf and logf and using those to speed up FA alibi
* hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty
* hex-fa: correctly fallback to HVX if we have sinks or the dims are not quite right
* opencl: add general q5_0 support
* opencl: add general q5_1 support
* opencl: support non-uniform workgrp size
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.
* vulkan: don't hold the device mutex while compiling pipelines
We need to hold a lock while we traverse all pipelines and lazily initialize
them, but we don't need to hold it while the pipeline is being compiled. And
it doesn't need to be the same lock as the device mutex. We call load_shaders
each time a pipeline is needed, so we only need to compile that one pipeline
(and, for example, don't want to end up compiling a pipeline that another
thread should be compiling).
* remove 'needed'
Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even
though they're only 2-byte aligned, and Q3_K still wins on
NVIDIA as well.
mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, we can do
subtraction directly on a full int32_t rather than an i8vec4
with bit twiddling because the high bit is always free to start.
On Intel BMG on mesa, the switch to MMVQ provides an immediate
~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and
~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
The futher switch to block loads leads to a ~24% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA
override for K quants on Xe2 as well.
* add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP
* correct the link
* vulkan: add flash attention bf16 kv support
* vulkan: bf16 FA coopmat1 support
* vulkan: bf16 FA coopmat2 support
* fix FA bf16 f32 fallback
* fix FA bf16 coopmat1 shader
* fix FA bf16 coopmat2 shader
* code cleanup
* cleanup comment change
* address feedback
* add O_TYPE for cm2 FA
* use O_TYPE for gqaStore function
* reduce BFLOAT16 ifdefs
* CUDA: Check PTX version on host side to guard PDL dispatch
Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).
Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.
This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code
* Implement MurmurHash3 mixer for better hash distribution
Magic constants were taken from boost:
2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments, make seed non-zero
* Apply code-formatting
* Replace std::size_t -> size_t for consistency
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING
* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* hex-fa: clean up qf32/fp32 handling and stride handling
* hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79
* hex-fa: vectorize leftover handling
* hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity
* hmx-mm: remove dead code
* hmx-mm: use fastdiv in x4x2 dequant
* hmx-mm: sandwich dequant and scatter to improve perf
* hmx-mm: fixed rebase conflicts
* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv
* hmx-mm: an even earlier dispatch for per-type dequant
* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs
This is a bit faster than LUT.
* hex-cmake: one more tweak for lto
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16
Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
* OP_GATED_DELTA_NET impl
* add back lanes_per_column declaration
* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce
* removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot
* support for K>1 state snapshot
* removed picky indent multiple of 4 fixes
* removed return that won\'t be executed
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now
* hmx-mm: add support for Q4_1
* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot
* hexagon: fix repack scratch buffer overflow
* hex-mm: fix Q4_1 repack buffer sizing
* hexagon: flip the build order for mm and fa (seems to help LTO)
* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1
* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output
* hexagon: resurrect early-wake and add support for polling for op-batch completions
With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking.
---------
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32
Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
Note that this breaks some tests until the last commit which fixes
OOB A reads.
* vulkan: Use aligned loads in mul_mat_vec when available
Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec
Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.
Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes
There was a TODO to fix the OOB reads from the A matrix which we do
here.
It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
* feat: extend repeat op for vulkan
* feat: add repeat_f16 vulkan pipeline
* fix: ensure same dst and src types
* fix: use type_size instead of data types
* fix: use int16 and int32 for repeat shader op
* chore: rename repeat_f* to repeat_i*
* chore: rename repeat vulkan pipelines