Commit Graph

2563 Commits

Author SHA1 Message Date
Charles Xu 750fa4ca35 ggml-cpu: use runtime SVE width in FWHT (llama/24059) 2026-06-08 14:36:36 +03:00
Aman Gupta d5a49ebec8 cuda: reserve space for quantize kv-cache at startup (llama/23907)
* cuda: reserve space for quantize kv-cache at startup

* address review comments

* remove forward decl

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* remove assert in ggml-cuda.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-06-08 14:36:36 +03:00
lhez f110ff540c opencl: use flat variants of q4_K and q6_K gemv for very large M (llama/24006) 2026-06-08 14:36:36 +03:00
Max Krasnyansky d31cb20b25 hexagon: profiler output fix and script updates (llama/24042)
* hex-ops: fix profiler output (ie remove the redundant NONEs)

* hex-prof: update profiling script to support tot.usec column
2026-06-08 14:36:36 +03:00
Max Krasnyansky 8d61a9edf0 hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (llama/23989)
* hex-mm: initial support for F32 * F32 -> F32 matmuls

* hex-rms-norm: fix src1 stride use in fused rms_norm_mul

* hex-ops: clear spad pointers in the ops that clober it

This fixes an odd case where fused rms-norm-mul was failing but only in qwen3.5-2B and only at searth op-bath sizes.

* hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX

Decided to use Q4_0 * F32 -> F32 matmul for this.
Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16.
Super simple and pretty efficient.

* hmx-mm: route f16 2D matmuls through the same kernel used for all other types

* hmx-mm: re-introduce pipelined vs non-pipelined mode that we used to have but is much more generic way

This update futher improves matmul performance and at the same time removes most of the redudant logic
we had in different paths.

* hmx-fa: slighlty improved pipeline simimar to matmul updates

* hmx-mm: initial version of MAT_MUL_ID support for HMX

* hmx-mm: fixed mxfp4 handling for MUL_MAT_ID

* hex-gdn: optimize GATED_DELTA_NET

DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :)

* hmx-mm: missed one more case where we can use fastmod

* hexagon: update DCVS settings for a slight perf bump

* hmx-fa: use fastdiv in hmx-flash-attn

* hmx-fa: precompute slope values to avoid disrupting the inner loop

* hvx-utils/fa: new HVX helpers for powf and logf and using those to speed up FA alibi

* hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty

* hex-fa: correctly fallback to HVX if we have sinks or the dims are not quite right
2026-06-08 14:36:36 +03:00
Todor Boinovski 754247f28b hexagon: add gelu_quick (llama/24007) 2026-06-08 14:36:36 +03:00
Anav Prasad 79223704a1 clean up unused variables warnings (llama/23975) 2026-06-08 14:36:36 +03:00
lhez 9a0265d13b opencl: fix compiler warnings for non-adreno path (llama/23922)
* opencl: fix compiler warnings for non-adreno path

* opencl: fix const cast warning
2026-06-08 14:36:36 +03:00
Masashi Yoshimura db2a39507c revert to using global_invocation_id for cpy shader (llama/23955) 2026-06-08 14:36:36 +03:00
shaofeiqi e728bae159 opencl: add basic support for q5_0 and q5_1 (llama/23548)
* opencl: add general q5_0 support

* opencl: add general q5_1 support

* opencl: support non-uniform workgrp size

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-06-08 14:36:36 +03:00
Shrivas Shankar 050b8567a0 metal: template GLU kernels to support f16/f32 (llama/23882)
Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.
2026-06-08 14:36:36 +03:00
Jeff Bolz 71d80aa49e vulkan: don't hold the device mutex while compiling pipelines (llama/23641)
* vulkan: don't hold the device mutex while compiling pipelines

We need to hold a lock while we traverse all pipelines and lazily initialize
them, but we don't need to hold it while the pipeline is being compiled. And
it doesn't need to be the same lock as the device mutex. We call load_shaders
each time a pipeline is needed, so we only need to compile that one pipeline
(and, for example, don't want to end up compiling a pipeline that another
thread should be compiling).

* remove 'needed'
2026-06-08 14:36:36 +03:00
Winston Ma c471bcce1b vulkan: reduce host memory lock contention (llama/23376)
* vulkan: reduces lock contention

* replace unique_lock with lock_guard
2026-06-08 14:36:36 +03:00
Johannes Gäßler e815b264eb TP: quantized KV cache support (llama/23792)
* TP: quantized KV cache support

* fix partial view

* remove overly strict assert
2026-06-08 14:36:36 +03:00
Matt Corallo 982533fc0c vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (llama/23056)
Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even
though they're only 2-byte aligned, and Q3_K still wins on
NVIDIA as well.

mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, we can do
subtraction directly on a full int32_t rather than an i8vec4
with bit twiddling because the high bit is always free to start.

On Intel BMG on mesa, the switch to MMVQ provides an immediate
~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and
~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

The futher switch to block loads leads to a ~24% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA
override for K quants on Xe2 as well.
2026-06-08 14:36:36 +03:00
Winston Ma aea93ada61 vulkan: Removed unused functions (llama/23175) 2026-06-08 14:36:36 +03:00
Neo Zhang ec0c661950 Support Q4_1, Q5_0, Q5_1 in Flash-attention (llama/23812)
* support Q4_1, Q5_0, Q5_1

* update ut case
2026-06-08 14:36:36 +03:00
Neo Zhang 20323e48c4 Add more types in GET_ROWS OP (llama/23710)
* add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP

* correct the link
2026-06-08 14:36:36 +03:00
Neo Zhang 687fbcb149 sycl : Optimize Q3_K mul_mat by reorder (llama/23725) 2026-06-08 14:36:36 +03:00
lhez 1c0d1f0f7c opencl: support bf16 by converting to f16 (llama/23839) 2026-06-08 14:36:36 +03:00
Georgi Gerganov bf74b557d2 metal : restore im2col implementation for large kernels (llama/23901) 2026-06-08 14:36:36 +03:00
Jinyang He 64b0d6b7fc ggml : add some lsx support (llama/23798)
* loongarch : optimize LSX fp16 load/store with native intrinsics

Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in
__lsx_f16x4_load and __lsx_f16x4_store.

* loongarch : add LSX implementation for q8_0 dot product

* loongarch : add LSX implementation for q6_K dot product

* loongarch : add LSX implementation for iq4_xs dot product

* Improve reduce ops when sun int16 pairs to int32
2026-06-08 14:36:36 +03:00
Ruben Ortlam 4317ddbe2b vulkan: add Flash Attention support for BFloat16 KV cache (llama/23420)
* vulkan: add flash attention bf16 kv support

* vulkan: bf16 FA coopmat1 support

* vulkan: bf16 FA coopmat2 support

* fix FA bf16 f32 fallback

* fix FA bf16 coopmat1 shader

* fix FA bf16 coopmat2 shader

* code cleanup

* cleanup comment change

* address feedback

* add O_TYPE for cm2 FA

* use O_TYPE for gqaStore function

* reduce BFLOAT16 ifdefs
2026-06-08 14:36:36 +03:00
Reese Levine 9147a9676b ggml-webgpu: Check earlier for WebGPU required features (llama/23879) 2026-06-08 14:36:36 +03:00
Reese Levine acd91d2c38 ggml-webgpu: add q4_0/q8_0 SET_ROWS (llama/23760)
* Add q8_0 and q4_0 set_rows

* Add fast(er) quantization set_rows path

* formatting/naming

* a little more naming

* Remove unused constant

* Don't override other override

* Avoid bitcast

* Narrow relaxation
2026-06-08 14:36:36 +03:00
Oliver Simons f7aad4ed7e CUDA: Check PTX version on host side to guard PDL dispatch (llama/23530)
* CUDA: Check PTX version on host side to guard PDL dispatch

Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).

Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
 https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

Magic constants were taken from boost:
2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero

* Apply code-formatting

* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-06-08 14:36:36 +03:00
fairydreaming c50e951afd model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (llama/23346)
* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)

* convert : handle DeepseekV32ForCausalLM architecture

* ggml : support for f16 GGML_OP_FILL

* memory : separate hparams argument in llama_kv_cache constructor

* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)

* llama : support for LLM_ARCH_DEEPSEEK32

* model : llama_model_deepseek32 implementation

* model : merge two scale operations into one in DSA lightning indexer implementation

* chore : remove unused code

* model : support NVFP4 in DeepSeek V3.2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-06-08 14:36:36 +03:00
Georgi Gerganov 92fc3f2a58 ggml : bump version to 0.13.1 (ggml/1523) 2026-05-29 09:47:30 +03:00
Andreas Kieslinger e90501e179 cuda : disables launch_fattn PDL enrollment due to compiler bug (llama/23825) 2026-05-29 09:47:30 +03:00
Matt Corallo f1b687da28 meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (llama/23480)
Without this at least the vulkan backend will skip the `* 0` for
!COMPUTE tensors, causing corrupt output.
2026-05-29 09:47:30 +03:00
Max Krasnyansky 442be1789d hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (llama/23835)
Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
2026-05-29 09:47:30 +03:00
lhez 94922ce12c opencl: move backend info printing into its own function (llama/23702)
* opencl: move backend info print into its own function

* opencl: move new log line

* opencl: fix for non adreno path
2026-05-29 09:47:30 +03:00
fl0rianr e1faa7cb4d ggml: auto apply iGPU flag CUDA/HIP if integrated device (llama/23007) 2026-05-29 09:47:30 +03:00
redfox 4e8af441e5 mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729)
* mmvq Optim:  add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING

* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 09:47:30 +03:00
Jaden_Mach 04795e6272 CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (llama/23227)
* CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware

The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8)
to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled
GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs
substantially by quant family because the per-row GEMV cost is dominated
by dequantisation, not the dot-product itself: K-quants pay a heavier
super-block decode and so MMQ wins sooner; legacy and IQ quants have
lean decode and stay ahead until the batch fully populates an MFMA tile.

This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool,
mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant
thresholds on amd_mfma_available(cc):

  Q3_K, Q4_K, Q5_K  : MMVQ <= 3   (MMQ wins from batch=4: +5% .. +76%)
  Q2_K, Q6_K        : MMVQ <= 5   (MMQ wins from batch=6: +8% .. +35%)
  others            : MMVQ <= 8   (legacy & IQ regress under MMQ; unchanged)

Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical
to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold
for A/B testing.

Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct,
llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps.
Full table in PR description.

  Selected pp512 throughput (tok/s, ub=8):
    Q4_K_S:  559 -> 940  (+68%)
    Q5_K_S:  503 -> 884  (+76%)
    Q3_K_S:  629 -> 879  (+40%)
    Q2_K  :  615 -> 809  (+32%)
    Q6_K  :  582 -> 776  (+33%)

  Selected pp512 throughput (tok/s, ub=4):
    Q4_K_S:  444 -> 480  (+ 8%)
    Q4_0  :  682 -> 685  (+ 0%)   (no regression - retains MMVQ)
    IQ4_XS:  706 -> 698  (- 1%)   (no regression - retains MMVQ)

* CUDA: address review — inline MMVQ batch table, drop env hatch & doc block

* tune kernel selection logic for CDNA1

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 09:47:30 +03:00
Max Krasnyansky 1b241b879c hexagon: minor refresh for HMX FA and MM (llama/23796)
* hex-fa: clean up qf32/fp32 handling and stride handling

* hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79

* hex-fa: vectorize leftover handling

* hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity

* hmx-mm: remove dead code

* hmx-mm: use fastdiv in x4x2 dequant

* hmx-mm: sandwich dequant and scatter to improve perf

* hmx-mm: fixed rebase conflicts

* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv

* hmx-mm: an even earlier dispatch for per-type dequant

* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

This is a bit faster than LUT.

* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-05-29 09:47:30 +03:00
Jeff Bolz b896e91f18 vulkan: fast path for walsh-hadamard transform (llama/23687)
* vulkan: fast path for walsh-hadamard transform

* disable for intel due to segfault
2026-05-29 09:47:30 +03:00
Winston Ma 816c3029bc vulkan: fix wrong index variable in inner loop (llama/23665) 2026-05-29 09:47:30 +03:00
Winston Ma 5db94bac04 vulkan: Fix memory logger unsafe iterator access (llama/23667) 2026-05-29 09:47:30 +03:00
fairydreaming 60e420ff6a cuda : fix KQ mask offset integer overflow in fattn MMA kernel (llama/23610)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-29 09:47:30 +03:00
Martin Klacer 8e40325876 ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (llama/22841)
* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16

Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
2026-05-29 09:47:30 +03:00
ymcki d284e1c3aa Hexagon: OP_GATED_DELTA_NET K>1 support (llama/23531)
* K>1 state snapshot support

* removed picky indent multiple of 4 fixes
2026-05-29 09:47:30 +03:00
ymcki 7e843a80e1 opencl: OP_GATED_DELTA_NET (llama/23312)
* OP_GATED_DELTA_NET impl

* add back lanes_per_column declaration

* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce

* removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot

* support for K>1 state snapshot

* removed picky indent multiple of 4 fixes

* removed return that won\'t be executed
2026-05-29 09:47:30 +03:00
Reese Levine 8c8f213dac ggml-webgpu: remove legacy constants (llama/23672) 2026-05-29 09:47:30 +03:00
Max Krasnyansky 3bbe93378c hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (llama/23647)
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now

* hmx-mm: add support for Q4_1

* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot

* hexagon: fix repack scratch buffer overflow

* hex-mm: fix Q4_1 repack buffer sizing

* hexagon: flip the build order for mm and fa (seems to help LTO)

* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1

* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output

* hexagon: resurrect early-wake and add support for polling for op-batch completions

With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking.

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2026-05-29 09:47:30 +03:00
Masashi Yoshimura a52bd385d6 ggml-webgpu: Fix how to dispatch WG to some ops (llama/23750) 2026-05-29 09:47:30 +03:00
Matt Corallo 8bce478ee8 vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (llama/22887)
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32

Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

Note that this breaks some tests until the last commit which fixes
OOB A reads.

* vulkan: Use aligned loads in mul_mat_vec when available

Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec

Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.

Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes

There was a TODO to fix the OOB reads from the A matrix which we do
here.

It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
2026-05-29 09:47:30 +03:00
Jeff Bolz 1b590bbb9a vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (llama/23541) 2026-05-29 09:47:30 +03:00
l8bloom c5cde8c717 vulkan: add REPEAT op support for f16 to f16. (llama/23298)
* feat: extend repeat op for vulkan

* feat: add repeat_f16 vulkan pipeline

* fix: ensure same dst and src types

* fix: use type_size instead of data types

* fix: use int16 and int32 for repeat shader op

* chore: rename repeat_f* to repeat_i*

* chore: rename repeat vulkan pipelines
2026-05-29 09:47:30 +03:00
Oliver Simons 98c6722fec CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (llama/23742) 2026-05-29 09:47:30 +03:00