Commit Graph

4210 Commits

Author SHA1 Message Date
shalinib-ibm 78a23d4830 ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (llama/18837) 2026-01-30 15:56:40 +02:00
Max Krasnyansky 50b7ab3d46 hexagon: support for OP_CPY, host buffers now optional (llama/18822) 2026-01-30 15:56:40 +02:00
Oliver Simons bc09047405 CUDA: Factor out and re-use `block_reduce` function (llama/18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of the `__device__` function into the `__global__` function
allows for explicit smem reuse, as neither the compiler nor the CUDA runtime
seems to free it afterwards (`cudaFuncSetAttribute` fails when not accounting
for it once per call to `two_stage_warp_reduce`)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than a plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-30 15:56:40 +02:00
Jeff Bolz 4b155e9bfb vulkan: Check maxStorageBufferRange in supports_op (llama/18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-30 15:56:40 +02:00
Daniel Bevenius 25aeb66a4a CUDA : fix typo in clang pragma comment [no ci] (llama/18830) 2026-01-30 15:56:40 +02:00
Ruben Ortlam 49762e8fb3 vulkan: work around Intel fp16 bug in mmq (llama/18814) 2026-01-30 15:56:40 +02:00
Perry Naseck 17656e56dc ggml-metal: do not copy headers for embedded, use current binary dir for embedded (llama/18705) 2026-01-30 15:56:40 +02:00
yulo c6a495ae5d HIP: add fattn-mma-f16 for RDNA4 (llama/18481)
* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update as the review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-30 15:56:40 +02:00
Bráulio Oliveira 7aa8818647
examples : use -dev/--device and WHISPER_ARG_DEVICE (#3557)
Align device selection naming with llama.cpp.
2026-01-21 08:40:30 +01:00
Yshtola f53dc74843
whisper : Fix UTF-8 character boundary issue in segment wrapping (max_len) (#3592)
The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break multi-byte UTF-8 characters, resulting in invalid sequences displayed as `�` (U+FFFD replacement character).
2026-01-16 14:16:05 +02:00
Georgi Gerganov 2eeeba56e9
release : v1.8.3 2026-01-15 11:54:31 +02:00
Georgi Gerganov 21c1765fcb
benches : update 2026-01-15 11:53:09 +02:00
Georgi Gerganov 47af2fb70f sync : ggml 2026-01-14 09:11:59 +02:00
Georgi Gerganov 6ee0eaf531 CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (llama/18800) 2026-01-14 09:11:59 +02:00
Jeff Bolz ab1828dc1c vulkan: change memory_logger to be controlled by an env var (llama/18769) 2026-01-14 09:11:59 +02:00
Jeff Bolz aedf332ec5 vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-14 09:11:59 +02:00
Ruben Ortlam 716d68aca9 vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (llama/18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-14 09:11:59 +02:00
Ruben Ortlam c0433783c3 Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (llama/18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-14 09:11:59 +02:00
Georgi Gerganov ecfcc65fbf talk-llama : sync llama.cpp 2026-01-14 09:11:59 +02:00
Georgi Gerganov 13dc9a912b sync : ggml 2026-01-14 09:11:59 +02:00
shaofeiqi d4ce2e554f opencl: add SOFTPLUS op support (llama/18726) 2026-01-14 09:11:59 +02:00
Johannes Gäßler 3a1ea96373 HIP: adjust RDNA3.5 MMQ kernel selection logic (llama/18666) 2026-01-14 09:11:59 +02:00
Perry Naseck 484b17053a cmake : update blas logic (llama/18205) 2026-01-14 09:11:59 +02:00
Michael Wand 45be2cd27a Corrected: changed s13 = src1->nb[3] instead of nb[2] (llama/18724) 2026-01-14 09:11:59 +02:00
shaofeiqi 4af27bf2da opencl: add EXPM1 op (llama/18704) 2026-01-14 09:11:59 +02:00
Reese Levine 4ac8c3b478 Updates to webgpu get_memory (llama/18707) 2026-01-14 09:11:59 +02:00
Aaron Teo fff3ebd93d llama: use host memory if device reports 0 memory (llama/18587) 2026-01-14 09:11:59 +02:00
Masashi Yoshimura a71127dfd8 ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (llama/18628)
* Fix GGML_MEM_ALIGN to 8 for emscripten.

* Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten
2026-01-14 09:11:59 +02:00
Reese Levine 1bb903f599 ggml webgpu: initial flashattention implementation (llama/18610)
* FlashAttention (llama/13)

* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet, though

* neg passes backend test

* unary operators pass ggml tests

* rms_norm double declaration bug atoned

* abides by editor-config

* removed vestigial files

* fixed autoconfig

* All operators (including xielu) working

* removed unnecessary checking of whether node->src[1] exists for unary operators

* responded and dealt with PR comments

* implemented REPL_Template support and fixed a bug in the unary operators kernel

* formatted embed wgsl and ggml-webgpu.cpp

* Faster tensors (llama/8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (llama/9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still a multiple of the sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (llama/4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Refactored pipelines and workgroup calculations (llama/10)

* refactored pipelines

* refactored workgroup calculation

* removed commented out block of prior maps

* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention

* Shader structure set up (many bugs still)

* debugging

* Working first test

* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32

* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling

* Start work on integrating pre-wgsl

* Separate structs/initial shader compilation library into separate files

* Work on compilation choices for flashattention

* Work on subgroup matrix/tile size portability

* subgroup size agnostic online softmax

* Cleanups, quantization types

* more cleanup

* fix wasm build

* Refactor flashattention to increase parallelism, use direct loads for KV in some cases

* Checkpoint

* formatting

* Update to account for default kv cache padding

* formatting shader

* Add workflow for ggml-ci webgpu

* Try passing absolute path to dawn in ggml-ci

* Avoid error on device destruction, add todos for proper cleanup

* Fix unused warning

* Forgot one parameter unused

* Move some flashattn computation to f32 for correctness
2026-01-14 09:11:59 +02:00
Jeff Bolz 0bc0e5616e vulkan: fix push constant size for quantize_q8_1 (llama/18687)
I added an assert to catch further mismatches, and it found several.
Fix those, too.
2026-01-14 09:11:59 +02:00
Jeff Bolz 678c660e62 vulkan: optimize ssm_scan (llama/18630)
* vulkan: optimize ssm_scan

* fix warp vs subgroup naming
2026-01-14 09:11:59 +02:00
도로로도로또 f2d8588229 metal : add MoE kernel specialization for ne20=5 (llama/18667)
Add template specialization for kernel_mul_mm_id_map0 with ne20=5
to support models using 5 active experts (e.g., VAETKI).
2026-01-14 09:11:59 +02:00
Doctor Shotgun b9965c89a1 ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (llama/18535)
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-14 09:11:59 +02:00
shaofeiqi 85a329cb08 opencl: add FILL op support (llama/18682) 2026-01-14 09:11:59 +02:00
Oliver Walsh 4f2ca7c163 cuda : fix build on cuda 12.8 (llama/18672)
compute_121 requires CUDA 12.9

Signed-off-by: Oliver Walsh <owalsh@redhat.com>
2026-01-14 09:11:59 +02:00
Jeff Bolz a91ab72bd9 vulkan: reject ops when a tensor is too large to allocate (llama/18646) 2026-01-14 09:11:59 +02:00
virajwad 096e7e911a vulkan: Warptile tuning for Intel Xe2/Xe3 (llama/18178)
* modify warptile tuning for xe3

* intel vendor check w/ coopmat support

* fix back formatting

* fix formatting change 2

* move intel check to chip specific tuning part

* Change to support both windows and linux

* modify m_warptile to l_warptile for intel

* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)

* Code style changes

* Code style changes (2)

* Code style changes (3)
2026-01-14 09:11:59 +02:00
Eve a576ed944a vulkan: more mul mat optimizations (llama/18533)
* q4_k

* q5_k

* q2_k

* q4_1

* q5_1

* better buf index
2026-01-14 09:11:59 +02:00
hipudding 5c583f3c02 CANN: Fix rename for get_env (llama/18652)
In #18624, get_env in ggml-cann was renamed to get_env_as_lowercase
to accurately reflect the function’s behavior and reduce the chance
of misuse. However, the update missed renaming call sites in other
files. This commit fixes that oversight.
2026-01-14 09:11:59 +02:00
Raul Torres 47671c81db CANN: Rename `get_env` to `get_env_as_lowercase` (llama/18624) 2026-01-14 09:11:59 +02:00
Max Krasnyansky a5f51ac75b Hexagon: add support for f16/f32 flash attention, scale, set-rows and improve f16/f32 matmul (llama/18611)
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention

* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx

* hexagon: add support for SCALE fp32

* hexagon: replace scalar fp32 -> fp16 copy with HVX

* hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA

- Implements double-buffered DMA prefetching for K, V, and Mask tensors.
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.

* hexagon: use aligned mad_f16

* hexagon: flash_attn more aligned ops

* hexagon: optimize scale_f32 hvx helpers

* hexagon: unroll fa loops

* hexagon: remove unused set-rows log

* hexagon: flash_attn_ext add support for DMAing Q

- Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.

* hexagon: fix handling of NANs hvx dotproducts

* hexagon: cleanup spad allocation in flash-attn

* hexagon: improve fp16/fp32 matmul

- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
- Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.

* hexagon: fix HVX_ARCH check

* hexagon: matmul cleanup and fp16 fixes

Use aligned vec_dot_f16 for 2d matmuls and unaligned version for 4d.

* hexagon: fix fp16 x fp16 matmuls and some minor refactoring

* hexagon: add support for GET_ROWS f32 -> f32

Also optimize SET_ROWS threading a bit when we have just a few rows to process.

* hexagon: optimize set-rows threading

* hexagon: update adb/run-bench.sh to properly support experimental and verbose options

* hexagon: flash_attn use aligned vectors for dot products
2026-01-14 09:11:59 +02:00
Aadeshveer Singh 436f30d05f ggml : optimize cuda ssm_scan using warp-level reduction (llama/18505)
* ggml : optimize cuda ssm_scan using warp-level reduction

* ggml : apply code review suggestions (style, const, constexpr)

* ggml : add TODO regarding stride consistency
2026-01-14 09:11:59 +02:00
Jeff Bolz dbec71f6cf vulkan: support buffer_from_host_ptr (llama/18467)
* vulkan: support buffer_from_host_ptr

* hacky use of buffer_from_host_ptr for directio

* disable buffer_from_host_ptr cap

* use external memory for ggml_vk_host_malloc, revert model loader changes

* disable external_memory_host for MoltenVK

* take buffer memory types into account

* don't use external_memory_host for ggml_vk_host_malloc
2026-01-14 09:11:59 +02:00
Aman Gupta 575d894603 ggml-cuda: refactor cuda graph usage (llama/18637)
* ggml-cuda: refactor cuda graph usage

* use is_enabled() instead of enabled
2026-01-14 09:11:59 +02:00
Beinsezii ed674cfc10 mmq.cu: tune mmq/rocblas switching for RDNA (llama/18537)
* Patch perf regression for mmq kernels in ROCm

recover performance regression for https://github.com/ggml-org/llama.cpp/issues/17917

* add n_experts branch like the cdna path

* mmq.cu: tune mmq/wmma switching for RDNA

* mmq.cu: move amd wmma mmq/wmma switching behind IS_RDNA3

* Update ggml/src/ggml-cuda/mmq.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Jiacheng (Jason) Chen <76919340+jiachengjason@users.noreply.github.com>
Co-authored-by: jiachengjason <jasonchen.jiacheng@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-14 09:11:59 +02:00
Adrien Gallouët 5520f27363 ggml : fix avx512bf16 build (llama/18623)
- include `immintrin.h` when required
- remove unused m512bh

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-14 09:11:59 +02:00
Raul Torres 9a1a6685ba CANN: Make `valid_values` variable `static const` (llama/18627) 2026-01-14 09:11:59 +02:00
nwyin e563e239a7 ggml webgpu: add CEIL operation support (llama/18605)
* ggml-webgpu: add CEIL operation support

Add support for the CEIL unary operation in the WebGPU backend:
- Add CEIL_FUNC shader template in unary_op.wgsl
- Add 4 shader variants (f32, f16, inplace versions)
- Initialize CEIL pipelines in ggml-webgpu.cpp
- Register CEIL in supports_op function

* docs: update WebGPU ops support for CEIL
2026-01-14 09:11:59 +02:00
Johannes Gäßler 9956333361 CUDA: fix FA FP16 accumulator overflow for Granite (llama/18614) 2026-01-14 09:11:59 +02:00
Aman Gupta 804f545454 ggml-cuda: check for srcs outside the cgraph (llama/18583)
* ggml-cuda: check for srcs outside the cgraph

* review: use leafs instead
2026-01-14 09:11:59 +02:00