Commit Graph

3922 Commits

Author SHA1 Message Date
lhez 46615d74d3 opencl: add fastdiv and use it in set_rows, ported from cuda (llama/17090)
* opencl: add fastdiv for mm q8_0

* opencl: use uint4 for fastdiv vals

* opencl: use fastdiv for set_rows

* opencl: do not use fastdiv for q8_0 mm
2025-11-17 21:05:46 +02:00
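The fastdiv trick referenced above replaces integer division by a runtime-constant divisor with a multiply and a shift, which is much cheaper on GPUs. A minimal Python sketch of one common magic-number variant (illustrative only, not the exact llama.cpp implementation; the commit's uint4 packing presumably carries such host-computed constants into the kernel):

```python
def fastdiv_params(d: int) -> tuple[int, int]:
    """Precompute a magic multiplier and shift for dividing
    32-bit unsigned integers by the constant d (d >= 1)."""
    l = (d - 1).bit_length()          # ceil(log2(d))
    shift = 32 + l
    m = ((1 << shift) + d - 1) // d   # ceil(2^shift / d)
    return m, shift

def fastdiv(n: int, m: int, shift: int) -> int:
    # one widening multiply and one shift instead of a hardware divide
    return (n * m) >> shift
```

The pair `(m, shift)` is computed once on the host and passed into the kernel, so the per-element cost is just the multiply-shift.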
Max Krasnyansky ccf525baf0 cpu: skip NOPs to avoid barriers (llama/17133)
* cpu: skip NOPs to avoid barriers

* cpu: use ggml_op_is_empty
2025-11-17 21:05:46 +02:00
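Skipping no-op graph nodes lets the CPU scheduler avoid issuing thread barriers for work that computes nothing. A hypothetical sketch of the idea (the op names and the emptiness predicate are illustrative, not ggml's actual definitions):

```python
# ops that do no computation of their own (e.g. views over existing data)
EMPTY_OPS = {"NONE", "RESHAPE", "VIEW", "PERMUTE", "TRANSPOSE"}

def op_is_empty(op: str) -> bool:
    return op in EMPTY_OPS

def plan_barriers(graph: list[str]) -> list[str]:
    # emit a barrier only after nodes that actually do work;
    # empty nodes are skipped entirely
    plan = []
    for op in graph:
        if op_is_empty(op):
            continue
        plan.append(op)
        plan.append("BARRIER")
    return plan
```

A graph consisting only of views then schedules no barriers at all.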
Georgi Gerganov 40aebfe8bf metal : cap threadgroups size of set_rows (llama/17146) 2025-11-17 21:05:46 +02:00
Adrien Gallouët 86be60093e ggml-cpu : inspect -march and -mcpu to find the CPU (llama/16333)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-17 21:05:46 +02:00
Ruben Ortlam ef71d83b76 vulkan: check glslc executable string (llama/17144) 2025-11-17 21:05:46 +02:00
Ruben Ortlam 43f2c1ff54 vulkan: fix validation issue introduced by #16868 (llama/17145) 2025-11-17 21:05:46 +02:00
Georgi Gerganov bb92c79f56 metal : enable tensor API for A19 (llama/17087) 2025-11-17 21:05:46 +02:00
fj-y-saito 4fea91f06e arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… (#15277)
* add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_q8_K

* Surround SVE function with compiler directive

* fix compile switch

* fix coding style

* ggml : fix indent

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-17 21:05:46 +02:00
Acly 58a97d988f cuda/vulkan : bicubic interpolation (llama/17022)
* vulkan : implement upscale with bicubic interpolation

* cuda : implement upscale with bicubic interpolation

* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests

* adapt the OpenCL backend to report the op as unsupported in that case so tests don't fail

* print scale mode & flags in test-backend-ops
2025-11-17 21:05:46 +02:00
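Bicubic upscaling weights a 4×4 neighborhood with a cubic convolution kernel. A sketch of the Keys kernel with a = -0.5 (the parameter most image libraries default to; the backends above may make a different choice):

```python
def cubic_weight(x: float, a: float = -0.5) -> float:
    """Keys cubic convolution kernel; nonzero only for |x| < 2."""
    x = abs(x)
    if x < 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0

def interp_1d(samples: list[float], t: float) -> float:
    """Interpolate between samples[1] and samples[2] at fraction t in [0, 1),
    using the 4-tap window samples[0..3] at positions -1, 0, 1, 2."""
    return sum(samples[i] * cubic_weight(t - (i - 1)) for i in range(4))
```

Full 2D bicubic applies `interp_1d` along rows, then once more along the column of intermediate results; the weights sum to 1 for any fractional offset.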
Ruben Ortlam 2e04e7a906 vulkan: fix memory allocations (llama/17122) 2025-11-17 21:05:46 +02:00
KITAITI Makoto 27f485a14c vad : Silero VAD v6.2.0 (#3524)
* Add ggml-silero-v6.2.0 to download candidates

* Make default VAD model ggml-silero-v6.2.0

* Make VAD model in documentations ggml-silero-v6.2.0
2025-11-17 22:26:17 +09:00
KITAITI Makoto d9b7613b34 ruby : VAD separately from ASR (#3518)
* Add Whisper::VAD::Context

* Add test for Whisper::VAD::Context

* Add Whisper::VAD::Segment

* Add Whisper::VAD::Segments

* Add Whisper::VAD::Context#detect

* Define Whisper::VAD::Segments#each

* Define Whisper::VAD::Segment#start_time and #end_time

* Define Whisper::VAD::Segment#deconstruct_keys

* Add tests for Whisper::VAD family

* Add signatures for VAD family

* Add document on VAD in README

* Define Whisper::VAD::Segments#length

* Add test for Whisper::VAD::Segments#length

* Add signature of Segments#length

* Make vad_segments responsible for initializing VAD::Segments

* Remove meaningless argument check

* Check NULL of segments member

* Add tests for Whisper::VAD::Segments

* Initialize Whisper::VAD::Segment on .allocate

* Add tests for Whisper::VAD::Segment

* Check NULL of context member

* Add test for Whisper::VAD::Context.allocate
2025-11-13 10:15:26 +09:00
Georgi Gerganov a1867e0dad sync : llama.cpp 2025-11-09 23:38:03 +02:00
Georgi Gerganov e67dfbc51b sync : ggml 2025-11-09 23:38:03 +02:00
Ruben Ortlam 1993e397bb vulkan: iGPU memory reporting fix (llama/17110)
* vulkan: use all device-local heaps for memory availability reporting

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>

* use all available heaps for iGPU memory reporting

* Allow multiple memory types per buffer request for devices with split heaps

---------

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-09 23:38:03 +02:00
Ruben Ortlam ee8349cf10 vulkan: fix mmq out of bounds reads (llama/17108)
* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code

* fix mul_mat_id quantization call

* Fix compiler warnings
2025-11-09 23:38:03 +02:00
Jeff Bolz db98e8c5b4 vulkan: fuse mul_mat_id + mul (llama/17095)
* vulkan: fuse mul_mat_id + mul

This comes up in Qwen3 MoE.

* split mul_mat_id fusion tests into a separate class
2025-11-09 23:38:03 +02:00
Georgi Gerganov a4339e2ea7 metal : retain src and dst buffers during async ops (llama/17101) 2025-11-09 23:38:03 +02:00
Jeff Bolz 6de3404773 vulkan: Use spec constants for conv2d s/d/p and kernel W/H (llama/16978)
* vulkan: Use spec constants for conv2d s/d/p and kernel W/H

Also add some additional unroll hints, which seems to help.

* lock around map lookup
2025-11-09 23:38:03 +02:00
Aman Gupta 8967c9ad9b Revert "CUDA: add expert reduce kernel (ggml/16857)" (llama/17100) 2025-11-09 23:38:03 +02:00
Aman Gupta 522b9bce33 CUDA: skip fusion for repeating adds in bias (llama/17080) 2025-11-09 23:38:03 +02:00
SavicStefan 0caa32c772 vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (llama/16636)
Signed-off-by: Stefan Savic <stefan.savic@huawei.com>
Co-authored-by: Stefan Savic <stefan.savic@huawei.com>
2025-11-09 23:38:03 +02:00
Aleksei Nikiforov 3c975ad523 ggml: disable vxe for cross-compilation by default (llama/16966)
Otherwise compilation will fail due to enabling -mvx -mzvector
and not setting corresponding -march options.
2025-11-09 23:38:03 +02:00
Jeff Bolz 257ce2f5c0 vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (llama/16977)
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
2025-11-09 23:38:03 +02:00
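The first stage of that fused sequence, rms_norm followed by an elementwise mul with the norm weights, can be written as one pass over the data. A minimal reference sketch (the eps value is illustrative):

```python
import math

def rms_norm_mul(x: list[float], w: list[float], eps: float = 1e-6) -> list[float]:
    # rms_norm: scale x by the reciprocal root-mean-square of its
    # elements, then multiply elementwise by the weights (the fused mul)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * wi for v, wi in zip(x, w)]
```

Fusing the two ops avoids writing the normalized intermediate tensor back to memory before the multiply reads it again.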
Jeff Bolz 4eef518167 vulkan: Fix test-thread-safety crashes (llama/17024)
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the
same time, which needs to hold the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
2025-11-09 23:38:03 +02:00
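The crash pattern described here, a concurrent find and insert on the same map, has the same fix in any language: serialize lookup-or-create behind a single lock. A Python sketch of the idiom (names are illustrative, not the Vulkan backend's actual code):

```python
import threading

_pipelines: dict[str, object] = {}
_lock = threading.Lock()

def get_or_create_pipeline(key: str):
    # the search and the insertion happen under the same lock, so a
    # reader can never observe the map while another thread mutates it
    with _lock:
        pipe = _pipelines.get(key)
        if pipe is None:
            pipe = object()   # stand-in for expensive pipeline creation
            _pipelines[key] = pipe
        return pipe
```

Holding the lock around the whole lookup-or-create (as the commit does for all of `ggml_vk_load_shaders`) also guarantees each pipeline is created exactly once.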
Johannes Gäßler 358f77aca7 CUDA: fix MMQ stream-k fixup ne1 indices (llama/17089) 2025-11-09 23:38:03 +02:00
Reese Levine 78ea6c5b67 ggml webgpu: faster matrix multiplication/matrix-vector multiplication (llama/17031)
* Faster tensors (llama/8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings
2025-11-09 23:38:03 +02:00
bssrdf 547724b0a5 CUDA: properly handle nb00=nb02 case for cpy (llama/17081) 2025-11-09 23:38:03 +02:00
Acly 11543bf446 vulkan : refactor buffer handling in vk_op_f32 (llama/16840)
* vulkan : refactor/simplify buffer handling in vk_op_* functions

* Combine UMA handling into ggml_vk_tensor_subbuffer
2025-11-09 23:38:03 +02:00
Johannes Gäßler af8a88792f CUDA: fix should_use_mmvf for ne11 == 1 (llama/17085)
* CUDA: fix should_use_mmvf for ne11 == 1

* Apply suggestion from @am17an

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2025-11-09 23:38:03 +02:00
Adrien Gallouët a1746097bc Revert "ggml-cpu: detect correct cpu flags for arm64 (llama/16229) (#16239)" (llama/17084)
This reverts commit 7c23f3f0d4b9f5d6ea140756eb694b562d5acebb.
2025-11-09 23:38:03 +02:00
iron 512592513c ggml-cpu: detect correct cpu flags for arm64 (ggml/16229) (llama/16239)
When using GCC 9 and GCC 12 on the arm64 platform of Ubuntu 20.04,
the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags,
which results in compilation failures for certain extended instructions,
but the correct CPU flags can be obtained by using gcc -march.

Signed-off-by: lizhenneng <lizhenneng@kylinos.cn>
Co-authored-by: lizhenneng <lizhenneng@kylinos.cn>
2025-11-09 23:38:03 +02:00
xctan 5bce732795 ggml-cpu : optimize RVV q2_k and q3_k kernels (llama/16887) 2025-11-09 23:38:03 +02:00
Johannes Gäßler b5d6fa438f CUDA: fix crash on uneven context without FA (llama/16988) 2025-11-09 23:38:03 +02:00
Georgi Gerganov 32ed574370 metal : initial Metal4 tensor API support (llama/16634)
* metal : rework mat-mat multiplication

* metal : initial Metal4 support

* cont

* metal : detect tensor support

* cont : better ifdefs

* metal : support tensors in mul_mm_id

* metal : add env for disabling tensor API

* tests : restore

* metal : remove unused constants

* metal : fix check for bfloat tensor support

* cont : handle API incompatibilities

* cont : handle even more incompatibilities

* metal : use tensor API only on M5 and later
2025-11-09 23:38:03 +02:00
YehuditE 45588b272e sycl: add CONCAT operator support (llama/16047)
* sycl: add CONCAT operator support

* cleanup: remove stray lines added by mistake

* fix: code format issues in concat.cpp and tests/test-backend-ops.cpp

* chore: fix editorconfig violations

* cleanup: drop unnecessary i16 type support

* docs: update sycl-csv and regenerate ops.md

* update docs/ops.md

* fix: adapt to upstream master changes after rebase

* fix: remove empty files

* fix: drop whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-09 23:38:03 +02:00
l3utterfly b3324ae7d1 ggml-hexagon: graceful fallback for older socs where rpcmem_alloc2 and FASTRPC_GET_URI are unsupported (llama/16987)
* support older socs where FASTRPC_GET_URI is unsupported

* added graceful fallback when FASTRPC_GET_URI call fails

* use weak symbols instead of loading libcdsprpc.so dynamically

* Add weak pragma for rpcmem_alloc2

* Remove weak declaration for rpcmem_alloc2 in ggml-hexagon.cpp

Removed weak declaration for rpcmem_alloc2.

* Enforce ndev to 1 for archs below v75

Force ndev to 1 for SoC architectures lower than v75.
2025-11-09 23:38:03 +02:00
bssrdf 13cd906501 improve CUDA cpy memory bandwidth when copying transposed tensor (llama/16841)
* WIP

* added a cpy kernel specific to transposed tensors which uses smem to avoid uncoalesced access; test cases also added showing improved memory bandwidth

* added BF16 support

* more strict check to make sure src0 is a transpose

* reformulated to handle more complicated transpose cases

* bring back 2D transpose for higher performance

* allow build on windows

* transpose copy more shapes

* minor tweak

* final clean up

* restore some test cases

* keep only the kernel for the true transposed case; updated with review suggestions

* make CI happy

* remove headers not needed

* reduced bank conflicts for fp16 and bf16

* add missing const*

* now bank conflicts free

* use padding instead of swizzling

---------

Co-authored-by: bssrdf <bssrdf@gmail.com>
2025-11-09 23:38:03 +02:00
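The coalescing trick behind this change is the classic tiled transpose: stage a square tile so that both the read and the write touch contiguous addresses. A CPU-side Python sketch of the tiling (on the GPU the staging buffer lives in CUDA shared memory, padded by one column to avoid bank conflicts, as the commit notes):

```python
def transpose_copy_tiled(src: list[float], rows: int, cols: int,
                         tile: int = 32) -> list[float]:
    """Copy a rows x cols row-major matrix into its cols x rows transpose,
    processing one tile at a time so accesses stay spatially local."""
    dst = [0.0] * (rows * cols)
    for ty in range(0, rows, tile):
        for tx in range(0, cols, tile):
            # one tile: reads sweep src rows, writes sweep dst rows,
            # both within a small, cache/shared-memory-sized window
            for y in range(ty, min(ty + tile, rows)):
                for x in range(tx, min(tx + tile, cols)):
                    dst[x * rows + y] = src[y * cols + x]
    return dst
```

Without tiling, either the reads or the writes stride by a full row, which on a GPU becomes uncoalesced global-memory traffic.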
Jeff Bolz 558a04c9c7 vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (llama/16919) 2025-11-09 23:38:03 +02:00
Reese Levine e734b5d6ef ggml webgpu: minor set rows optimization (llama/16810)
* Add buffer label and enable dawn-specific toggles to turn off some checks

* Minor set_rows optimization (ggml/4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Remove some comments

* Implement overlap binary operators

* Revert "Implement overlap binary operators"

This reverts commit ed710b36f51ab3f53fa13db15c1685dc8678a32a.

* Disable support for non-contiguous binary_op tensors and leave note for future support

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
2025-11-09 23:38:03 +02:00
nullname 44e77ccee6 refactor: replace sprintf with snprintf for safer string handling in dump functions (llama/16913) 2025-11-09 23:38:03 +02:00
Jeff Bolz 1672d41ab0 vulkan: remove the need for the dryrun (llama/16826)
* vulkan: remove the need for the dryrun

Allocate pipelines and descriptor sets when requested.

Reallocate the prealloc buffers when needed, and flush any pending work
before reallocating.

For rms_partials and total_mul_mat_bytes, use the sizes computed the last time
the graph was executed.

* remove dryrun parameters
2025-11-09 23:38:03 +02:00
Acly 997fdde0c4 ggml-cpu : bicubic interpolation (llama/16891) 2025-11-09 23:38:03 +02:00
Noah 52e43a2fa5 Fix garbled output with REPACK at high thread counts (llama/16956)
* Fix garbled output with REPACK at high thread counts

Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (model-dependent threshold). The issue occurred because with high thread counts, the code forced chunk count to equal thread count, creating many small chunks. After aligning these chunks to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces minimum chunk sizes based on NB_COLS and caps maximum chunk count to prevent creating too many tiny chunks, ensuring proper alignment without overlaps.

* Update ggml/src/ggml-cpu/repack.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/repack.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-09 23:38:03 +02:00
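The fix described above amounts to two rules: never create more chunks than the row count supports, and align every chunk boundary to NB_COLS so adjacent chunks cannot overlap after alignment. A hypothetical sketch of such a partitioner (the constants and names are illustrative, not the repack.cpp code):

```python
def partition_rows(n_rows: int, n_threads: int,
                   nb_cols: int = 8, max_chunks: int = 64) -> list[tuple[int, int]]:
    # cap the chunk count and enforce a minimum chunk size of nb_cols,
    # so aligning boundaries up to nb_cols can never make chunks overlap
    n_chunks = max(1, min(n_threads, max_chunks, n_rows // nb_cols))
    per_chunk = -(-n_rows // n_chunks)               # ceil division
    per_chunk = -(-per_chunk // nb_cols) * nb_cols   # align up to nb_cols
    chunks, start = [], 0
    while start < n_rows:
        end = min(start + per_chunk, n_rows)
        chunks.append((start, end))
        start = end
    return chunks
```

Every chunk start lands on an nb_cols boundary and the chunks tile the row range exactly, which is precisely the property the original thread-count-equals-chunk-count scheme violated.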
Aman Gupta e51a2f90fe CUDA: avoid mul + bias fusion when doing fusion (llama/16935) 2025-11-09 23:38:03 +02:00
lhez f856023f46 opencl: support imrope (llama/16914)
* opencl: support imrope

* opencl: fix whitespace
2025-11-09 23:38:03 +02:00
theo77186 82ede64cd0 ggml: CUDA: add head size 72 for flash-attn (llama/16962) 2025-11-09 23:38:03 +02:00
Jinyang He 79801188f7 ggml : LoongArch fixes (llama/16958)
* Fix test-quantize-fns f16 and q4_0 failed when use LSX

* Fix LoongArch set float intrinsic when use LSX/LASX
2025-11-09 23:38:03 +02:00
shani-f f1da026bb8 SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster) (#16869)
* SYCL repeat_back v1 — add core op + switch case

* Implement repeat_back SYCL operation and minor fixes

* SYCL: optimize repeat_back kernel

* Remove Hebrew comment from repeat_back.cpp

* Remove comments for code clarity

Removed comments to clean up the code.

* Fix formatting in ggml-sycl.cpp

* Formatted lambda according to legacy style. No logic changes

* Remove blank line in repeat_back.cpp

Remove unnecessary blank line before assigning acc to dst_dd.
2025-11-09 23:38:03 +02:00
Georgi Gerganov 39834fde1b clip : use FA (llama/16837)
* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-11-09 23:38:03 +02:00