Commit Graph

2542 Commits

Author SHA1 Message Date
Jinyang He 64b0d6b7fc ggml : add some lsx support (llama/23798)
* loongarch : optimize LSX fp16 load/store with native intrinsics

Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in
__lsx_f16x4_load and __lsx_f16x4_store.

* loongarch : add LSX implementation for q8_0 dot product

* loongarch : add LSX implementation for q6_K dot product

* loongarch : add LSX implementation for iq4_xs dot product

* Improve reduce ops when sun int16 pairs to int32
2026-06-08 14:36:36 +03:00
Ruben Ortlam 4317ddbe2b vulkan: add Flash Attention support for BFloat16 KV cache (llama/23420)
* vulkan: add flash attention bf16 kv support

* vulkan: bf16 FA coopmat1 support

* vulkan: bf16 FA coopmat2 support

* fix FA bf16 f32 fallback

* fix FA bf16 coopmat1 shader

* fix FA bf16 coopmat2 shader

* code cleanup

* cleanup comment change

* address feedback

* add O_TYPE for cm2 FA

* use O_TYPE for gqaStore function

* reduce BFLOAT16 ifdefs
2026-06-08 14:36:36 +03:00
Reese Levine 9147a9676b ggml-webgpu: Check earlier for WebGPU required features (llama/23879) 2026-06-08 14:36:36 +03:00
Reese Levine acd91d2c38 ggml-webgpu: add q4_0/q8_0 SET_ROWS (llama/23760)
* Add q8_0 and q4_0 set_rows

* Add fast(er) quantization set_rows path

* formatting/naming

* a little more naming

* Remove unused constant

* Don't override other override

* Avoid bitcast

* Narrow relaxation
2026-06-08 14:36:36 +03:00
Oliver Simons f7aad4ed7e CUDA: Check PTX version on host side to guard PDL dispatch (llama/23530)
* CUDA: Check PTX version on host side to guard PDL dispatch

Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).

Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
 https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

Magic constants were taken from boost:
2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero

* Apply code-formatting

* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-06-08 14:36:36 +03:00
fairydreaming c50e951afd model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (llama/23346)
* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)

* convert : handle DeepseekV32ForCausalLM architecture

* ggml : support for f16 GGML_OP_FILL

* memory : separate hparams argument in llama_kv_cache constructor

* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)

* llama : support for LLM_ARCH_DEEPSEEK32

* model : llama_model_deepseek32 implementation

* model : merge two scale operations into one in DSA lightning indexer implementation

* chore : remove unused code

* model : support NVFP4 in DeepSeek V3.2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-06-08 14:36:36 +03:00
Georgi Gerganov 92fc3f2a58 ggml : bump version to 0.13.1 (ggml/1523) 2026-05-29 09:47:30 +03:00
Andreas Kieslinger e90501e179 cuda : disables launch_fattn PDL enrollment due to compiler bug (llama/23825) 2026-05-29 09:47:30 +03:00
Matt Corallo f1b687da28 meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (llama/23480)
Without this at least the vulkan backend will skip the `* 0` for
!COMPUTE tensors, causing corrupt output.
2026-05-29 09:47:30 +03:00
Max Krasnyansky 442be1789d hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (llama/23835)
Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
2026-05-29 09:47:30 +03:00
lhez 94922ce12c opencl: move backend info printing into its own function (llama/23702)
* opencl: move backend info print into its own function

* opencl: move new log line

* opencl: fix for non adreno path
2026-05-29 09:47:30 +03:00
fl0rianr e1faa7cb4d ggml: auto apply iGPU flag CUDA/HIP if integrated device (llama/23007) 2026-05-29 09:47:30 +03:00
redfox 4e8af441e5 mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729)
* mmvq Optim:  add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING

* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 09:47:30 +03:00
Jaden_Mach 04795e6272 CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (llama/23227)
* CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware

The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8)
to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled
GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs
substantially by quant family because the per-row GEMV cost is dominated
by dequantisation, not the dot-product itself: K-quants pay a heavier
super-block decode and so MMQ wins sooner; legacy and IQ quants have
lean decode and stay ahead until the batch fully populates an MFMA tile.

This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool,
mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant
thresholds on amd_mfma_available(cc):

  Q3_K, Q4_K, Q5_K  : MMVQ <= 3   (MMQ wins from batch=4: +5% .. +76%)
  Q2_K, Q6_K        : MMVQ <= 5   (MMQ wins from batch=6: +8% .. +35%)
  others            : MMVQ <= 8   (legacy & IQ regress under MMQ; unchanged)

Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical
to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold
for A/B testing.

Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct,
llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps.
Full table in PR description.

  Selected pp512 throughput (tok/s, ub=8):
    Q4_K_S:  559 -> 940  (+68%)
    Q5_K_S:  503 -> 884  (+76%)
    Q3_K_S:  629 -> 879  (+40%)
    Q2_K  :  615 -> 809  (+32%)
    Q6_K  :  582 -> 776  (+33%)

  Selected pp512 throughput (tok/s, ub=4):
    Q4_K_S:  444 -> 480  (+ 8%)
    Q4_0  :  682 -> 685  (+ 0%)   (no regression - retains MMVQ)
    IQ4_XS:  706 -> 698  (- 1%)   (no regression - retains MMVQ)

* CUDA: address review — inline MMVQ batch table, drop env hatch & doc block

* tune kernel selection logic for CDNA1

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 09:47:30 +03:00
Max Krasnyansky 1b241b879c hexagon: minor refresh for HMX FA and MM (llama/23796)
* hex-fa: clean up qf32/fp32 handling and stride handling

* hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79

* hex-fa: vectorize leftover handling

* hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity

* hmx-mm: remove dead code

* hmx-mm: use fastdiv in x4x2 dequant

* hmx-mm: sandwich dequant and scatter to improve perf

* hmx-mm: fixed rebase conflicts

* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv

* hmx-mm: an even earlier dispatch for per-type dequant

* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

This is a bit faster than LUT.

* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-05-29 09:47:30 +03:00
Jeff Bolz b896e91f18 vulkan: fast path for walsh-hadamard transform (llama/23687)
* vulkan: fast path for walsh-hadamard transform

* disable for intel due to segfault
2026-05-29 09:47:30 +03:00
Winston Ma 816c3029bc vulkan: fix wrong index variable in inner loop (llama/23665) 2026-05-29 09:47:30 +03:00
Winston Ma 5db94bac04 vulkan: Fix memory logger unsafe iterator access (llama/23667) 2026-05-29 09:47:30 +03:00
fairydreaming 60e420ff6a cuda : fix KQ mask offset integer overflow in fattn MMA kernel (llama/23610)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-29 09:47:30 +03:00
Martin Klacer 8e40325876 ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (llama/22841)
* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16

Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
2026-05-29 09:47:30 +03:00
ymcki d284e1c3aa Hexagon: OP_GATED_DELTA_NET K>1 support (llama/23531)
* K>1 state snapshot support

* removed picky indent multiple of 4 fixes
2026-05-29 09:47:30 +03:00
ymcki 7e843a80e1 opencl: OP_GATED_DELTA_NET (llama/23312)
* OP_GATED_DELTA_NET impl

* add back lanes_per_column declaration

* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce

* removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot

* support for K>1 state snapshot

* removed picky indent multiple of 4 fixes

* removed return that won\'t be executed
2026-05-29 09:47:30 +03:00
Reese Levine 8c8f213dac ggml-webgpu: remove legacy constants (llama/23672) 2026-05-29 09:47:30 +03:00
Max Krasnyansky 3bbe93378c hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (llama/23647)
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now

* hmx-mm: add support for Q4_1

* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot

* hexagon: fix repack scratch buffer overflow

* hex-mm: fix Q4_1 repack buffer sizing

* hexagon: flip the build order for mm and fa (seems to help LTO)

* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1

* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output

* hexagon: resurrect early-wake and add support for polling for op-batch completions

With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking.

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2026-05-29 09:47:30 +03:00
Masashi Yoshimura a52bd385d6 ggml-webgpu: Fix how to dispatch WG to some ops (llama/23750) 2026-05-29 09:47:30 +03:00
Matt Corallo 8bce478ee8 vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (llama/22887)
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32

Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

Note that this breaks some tests until the last commit which fixes
OOB A reads.

* vulkan: Use aligned loads in mul_mat_vec when available

Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec

Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.

Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes

There was a TODO to fix the OOB reads from the A matrix which we do
here.

It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
2026-05-29 09:47:30 +03:00
Jeff Bolz 1b590bbb9a vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (llama/23541) 2026-05-29 09:47:30 +03:00
l8bloom c5cde8c717 vulkan: add REPEAT op support for f16 to f16. (llama/23298)
* feat: extend repeat op for vulkan

* feat: add repeat_f16 vulkan pipeline

* fix: ensure same dst and src types

* fix: use type_size instead of data types

* fix: use int16 and int32 for repeat shader op

* chore: rename repeat_f* to repeat_i*

* chore: rename repeat vulkan pipelines
2026-05-29 09:47:30 +03:00
Oliver Simons 98c6722fec CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (llama/23742) 2026-05-29 09:47:30 +03:00
Winston Ma 80e87ec453 vulkan: avoid preferring transfer queue on AMD UMA devices (llama/22455) 2026-05-29 09:47:30 +03:00
Vladislav 6a249cd640 ggml-zendnn : fixed naming of matmul function (llama/20964)
* ggml-zendnn: fixed naming of matmul function

* ggml-zendnn: fixed naming of mul_mat_id function

* ggml-zendnn: fixed print in  mul_mat_id

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
2026-05-29 09:47:30 +03:00
Jeff Bolz a0efd13f0f vulkan: optimize conv2d and implement coopmat1 support (llama/22620)
* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d

* vulkan: skip conv2d bounds checks when shapes align with tile sizes

* vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d

* vulkan: stage cm2 conv2d accumulator through shmem before global store

* vulkan: add coopmat1 conv2d path

* fallback when using too much shared memory. clean up comments

* Require 16x16x16 and subgroup size 32 or 64

* check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values
2026-05-29 09:47:30 +03:00
Max Krasnyansky f8df28d331 hexagon: add support for CONCAT op (llama/23648)
* hexagon: add support for CONCAT with optimized concat_2d_transposed

qwen3.5 models are quite heavy on the CONCAT with large and transposed src1.

* hex-concat: use fastdiv in generic version

* hex-concat: make checks for transposed a bit more readable

* hex-concat: reoder dma ops for better pipelining

* hex-cont/cpy: optimize CPY and CONT ops

The primary change is to avoid scalar divs in the inner loops.
We were calling hvx_copy_uu(... type_size) where type_size is non a constexpr.
This causes runtime divs by that value which is normally just 4 or 2 (f32/f16).

* hex-get-rows: optimize GET_ROWS for large rows

We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models
that do lots of GET_ROWS with huge (2MB+ rows).

Also bump the DMA queue depth now that we can take advantage of it.

* hex-concat: unroll the inner loops of concat_2d

* hex-concat: more updates to concat_2d to improve perf a bit further

* hex-cpy: fixed n_rows per thread checks in the copy ops

* hmx-fa: fix alignment issues while computing dma sizes

* hex-set-rows: add early returns for idle threads

* hvx-rope: minor optimization to replace loops with fastdiv logic

* hex-rope: replace scalar tail processing with HVX

* hex-rope: optimize rope cache init with HVX

Add hvx-utils sin/cos helpers that use an aprox method (similar to rsqrt, inverse, etc)
Use the helpers to optimize ROPE.
2026-05-29 09:47:30 +03:00
Alexey Kopytko 049f0af339 SYCL: implement ggml_sycl_pool_vmm (llama/22862)
* SYCL: implement ggml_sycl_pool_vmm

* Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM

* Clean up debugging logging

* document GGML_SYCL_DISABLE_VMM

* Multi-stream MoE optimization

* Revert "Multi-stream MoE optimization"

This reverts commit 938929c3f13a562ec67c59e87cc5d38595444cce.

* Update common.hpp

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM

* add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro)

* Apply suggestions from code review

Co-authored-by: Alexey Kopytko <alexey@kopytko.com>

* Apply suggestion from @sanmai

* Apply suggestion from @sanmai

---------

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2026-05-29 09:47:30 +03:00
Masashi Yoshimura 00a5110b19 ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (llama/23594)
* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K

* Fix to editorconfig checking pass

* Remove mul-mat-legacy pipeline

* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
2026-05-29 09:47:30 +03:00
Nikhil Jain bc77933c2d Check batch_compute_passes before sending passes when not doing GPU profiling (llama/23457)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled

* restore build.yml
2026-05-29 09:47:30 +03:00
Johannes Gäßler 2307712d32 CUDA: missing PDL sync for FWHT, better fallback (llama/23690) 2026-05-29 09:47:30 +03:00
forforever73 1c477d4056 metal : add apple device id (llama/23566)
Co-authored-by: lvyichen <lvyichen@stepfun.com>
2026-05-29 09:47:30 +03:00
Aman Gupta 205ee5a189 CUDA: add fast walsh-hadamard transform (llama/23615)
* CUDA: add fast walsh-hadamard transform

* review: add unrolls + change size_t -> int

* warp size 64

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 09:47:30 +03:00
Georgi Gerganov 1cf8e3a903
ggml : bump version to 0.13.0 (ggml/1510) 2026-05-25 12:44:04 +03:00
Johannes Gäßler bcff515150
TP: fix ggml context size calculation (llama/22616)
* TP: fix ggml context size calculation, memory leak

* move split state cache back into the context

* revert to constant ggml context size for cgraphs

* increase headroom for statically allocated tensors

* remove obsolete include
2026-05-25 12:44:04 +03:00
Gilad S 2979e5f95f
ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (llama/22341)
* ggml: implement `gguf_init_from_buffer`

* test: `gguf_init_from_buffer`

* fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer

* fix: use `GGML_UNUSED`

Co-authored-by: Copilot <copilot@github.com>

* fix: remove `total_size` from `gguf_reader`

* fix: file offset calculation, rename `offset` to `data_offset`

Co-authored-by: Copilot <copilot@github.com>

* refactor: extract model loader bug fixes to another PR

* feat: add `gguf_init_from_callback`

* fix: always require a max expected size

* fix: change `gguf_reader_callback_t`'s `output` type to `void *`, change `max_expected_size` and offsets to `uint64_t`

* fix: harden against offset overflow in buffer read

* fix: remove seek behavior from the callback

* feat: `max_chunk_read == 0` means `SIZE_MAX`

* fix: seeking in a gguf file with no tensors

---------

Co-authored-by: Copilot <copilot@github.com>
2026-05-25 12:44:04 +03:00
Georgi Gerganov 946d6813b9 ggml : bump version to 0.12.1 (ggml/1508) 2026-05-25 12:26:07 +03:00
Jeff Bolz a369b3949c ggml : Parallelize quant LUT init (llama/23595)
- Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl.
- Move the OpenMP detection from ggml-cpu to ggml-base.
- Update OpenMP dependencies in ggml-config.cmake.in.
2026-05-25 12:26:07 +03:00
Johannes Gäßler 3306af62b1 TP: fix entirely zero-sized slices per device (llama/23525) 2026-05-25 12:26:07 +03:00
shaofeiqi 1435988ab3 opencl: batch profiling to improve speed and prevent memory leaks (llama/23495) 2026-05-25 12:26:07 +03:00
Yiwei Shao b84d03487c hexagon: apply repl optimization in flash attn softmax as #22993 (llama/23455) 2026-05-25 12:26:07 +03:00
dskwe 511f8602b1 ggml : Check the right iface method before using the fallback 2d get (llama/23514) 2026-05-25 12:26:07 +03:00
Jeff Bolz 6b85d73b33 vulkan: fix windows find_package of SPIRV-Headers (llama/23215)
* vulkan: fix windows find_package of SPIRV-Headers

* not windows-only
2026-05-25 12:26:07 +03:00
Shawn Gu aefffa1fa5 opencl: generalize Adreno MoE kernels on M (llama/23449) 2026-05-25 12:26:07 +03:00