Commit Graph

3856 Commits

Author SHA1 Message Date
Ruben Ortlam f571655e8e vulkan: fix MMQ quantize_y condition (llama/17301) 2025-11-17 21:05:46 +02:00
Georgi Gerganov 9549cc1051 metal : remove obsolete asserts (llama/17295) 2025-11-17 21:05:46 +02:00
lhez a75525cad0 opencl: fix rms_norm_mul (llama/17250)
* opencl: use subgroup reduce for reduction in rms_norm_mul

* opencl: add comment about workgroup size
2025-11-17 21:05:46 +02:00
shaofeiqi c78845bfa9 opencl: add kernel to handle mat mul in attention to improve encoding speed (llama/17181)
* Add mul_mm_f16_f32_kq_kqv kernel

* Add ggml_cl_mul_mat_kq_kqv_adreno func

* fix whitespace

* remove unused variable

* remove redundant

* refactor and clean up

* remove trailing whitespace
2025-11-17 21:05:46 +02:00
shani-f 1fd63da9f2 sycl : unify unary kernels with a generic implementation and enable wide operator support (llama/17213)
* SYCL: add generic unary op implementation for multiple ops (ABS/SGN/…); unify non-contiguous access

* SYCL: update documentation and sycl.csv to reflect new unary op support

* update ops.md after syncing SYCL.csv changes

* Fix SYCL.csv merge conflict

* Update ops.md after fixing SYCL.csv conflicts

* Fix SYCL.csv tail after merge conflict and regenerate ops.md

* Fix line endings and final newline in SYCL.csv

* Remove TOPK_MOE entries from SYCL.csv as requested

* Update ops.md after removing TOPK_MOE from SYCL.csv

* Regenerated SYCL.csv and synced ops.md with upstream

* Update ops.md using create_ops_docs.py
2025-11-17 21:05:46 +02:00
Jeff Bolz ea3ebd8b0d vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (llama/17287)
These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.
2025-11-17 21:05:46 +02:00
Ruben Ortlam 7caea54450 vulkan: Replace 16-bit unpack8 calls to work around legacy Windows AMD driver bug (llama/17285) 2025-11-17 21:05:46 +02:00
Giuseppe Scrivano 4c4e663da0 vulkan: implement ABS and NEG (llama/17245)
* docs: update Vulkan ops

* vulkan: add NEG op

* vulkan: add ABS op

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-17 21:05:46 +02:00
Jeff Bolz e1846fc599 vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (llama/17244)
* vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths

* set allow_misalign
2025-11-17 21:05:46 +02:00
Jeff Bolz 9614a56314 vulkan: skip all-negative-inf blocks in FA (llama/17186) 2025-11-17 21:05:46 +02:00
Jeff Bolz 37d4bba152 vulkan: change graph_compute to be async and enable get_tensor_async (llama/17158)
* vulkan: change graph_compute to be async and enable get_tensor_async

This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.

Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.

The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.

The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.

* fix thread safety errors

* teardown context cleanly

* Handle async read to non-pinned dst
2025-11-17 21:05:46 +02:00
Georgi Gerganov 523a6c27ea metal : support argsort for ne00 > 1024 (llama/17247)
* metal : refactor argsort

* cont : sort chunks

* cont : merge sorted buckets

* cont : cleanup
2025-11-17 21:05:46 +02:00
Georgi Gerganov b4d7df3ba2 metal : make the FA extra sizes consistent (llama/17143) 2025-11-17 21:05:46 +02:00
Alberto Cabrera Pérez a81fbfc78e ggml-cpu: handle 3d tensors in repack mat_mul (llama/17241)
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries

* Address performance regression in Qwen and llama.cpp due to chunking
2025-11-17 21:05:46 +02:00
Piotr Wilkin (ilintar) 3e684f26c1 ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (llama/17063)
* Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Code review

* Whitespace

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* This is actually sigmoid, duh.

* Add CONST, remove TRI_KEEP, other changes from review

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Remove extra script

* Update ggml/src/ggml.c

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* moving changes from laptop [no ci]

* pre-rebase

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Refactor tests

* ggml : cleanup

* cont : fix ggml_fill srcs

* tests : add note

* ggml : add ggml_fill_inplace

* ggml : add asserts

* ggml : fix ggml_fill constant cast

* cont : ggml_tri minor

* Use TENSOR_LOCALS

* Fix regression from #14596, regenerate

* Don't make commits at night...

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-17 21:05:46 +02:00
Ruben Ortlam e8e0004fe5 vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (llama/17219)
* vulkan: remove shell call from vulkan-shaders-gen tool

* use string vector for command execution

* Fix condition

* use string, remove const_cast

* Fix dependency file quotation on Windows

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-11-17 21:05:46 +02:00
Diego Devesa 210f0f860b sched : fix reserve ignoring user tensor assignments (llama/17232) 2025-11-17 21:05:46 +02:00
ixgbe 91fa5b5cac ggml-cpu : add RISC-V vector intrinsic support for silu and cvar operations (llama/17227)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-17 21:05:46 +02:00
bagheera 265d326fa8 metal: accelerated conv2d (llama/17175)
* metal: accelerated conv2d

* cont : cleanup

---------

Co-authored-by: bghira <bghira@users.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-17 21:05:46 +02:00
Georgi Gerganov 6a1d830dfd Revert "ggml-cpu: handle 3d tensors in repack mat_mul (llama/17030)" (llama/17233)
This reverts commit 1c398dc9eca9c366ce98deb0e6f3538e444ebc8a.
2025-11-17 21:05:46 +02:00
Diego Devesa 6a91780c3b ggml-cpu : use template for argsort (llama/17222) 2025-11-17 21:05:46 +02:00
TecJesh 726912d1cb CANN: Add cross_entropy_loss op support (llama/16886)
* update L2_NORM op support

* update L2_NORM op support

* remove extra whitespace

* cann: update cross_entropy_loss op support

* remove trailing whitespaces

* rebase the latest code in the main repository and remove the l2_norm operator that already exists in another pull request.

* undo the l2_norm operator deletion
2025-11-17 21:05:46 +02:00
Aman Gupta 84275fc493 CUDA: fuse rope + set_rows (llama/16884)
* CUDA: add fused rope

* move k forward_expand up

* create helper function instead of re-using params

* make assert statement more in line with comment

* rope_norm: coalesced writes to global mem
2025-11-17 21:05:46 +02:00
Johannes Gäßler 566c4c4469 CUDA: static assert to prevent misuse of memcpy_1 (llama/17198) 2025-11-17 21:05:46 +02:00
Georgi Gerganov 3810a6180b ggml : use std::sort in ggml_argsort CPU implementation (llama/17211)
* ggml : use std::sort in ggml_argsort CPU implementation

* cont : add missing header
2025-11-17 21:05:46 +02:00
Alberto Cabrera Pérez 7df8515824 ggml-cpu: handle 3d tensors in repack mat_mul (llama/17030)
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries
2025-11-17 21:05:46 +02:00
TecJesh e8b66d9f94 CANN: Add L2_NORM op support (llama/16856)
* update L2_NORM op support

* update L2_NORM op support

* remove extra whitespace
2025-11-17 21:05:46 +02:00
Neo Zhang Jianyu 8388350c66 fix ci crash about SSM_CONV (llama/17169)
* fix ci crash

* Update ggml-sycl.cpp

* Update ggml/src/ggml-sycl/ggml-sycl.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-11-17 21:05:46 +02:00
Max Krasnyansky 6748d27f55 hexagon: various Op fixes (llama/17135)
* hexagon: explicitly check for ops with zero nrows

llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows.
Somehow other backends seem to handle this without obvious explicit checks.
In the hexagon case we need to check explicitly and skip them.

* hexagon: introduce fastdiv, fix test-backend-ops for ADD/SUB/MUL

Co-authored-by: chraac <chraac@gmail.com>

* hexagon: use fastdiv in ADD_ID

* hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs

---------

Co-authored-by: chraac <chraac@gmail.com>
2025-11-17 21:05:46 +02:00
Eve 559091005a disable rms norm mul rope for chips with no fp16 rte (llama/17134) 2025-11-17 21:05:46 +02:00
ixgbe cd8f64d1b5 ggml-cpu : add RISC-V RVV (Zvfh) optimization for FP16 to FP32 conversion (llama/17161)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-17 21:05:46 +02:00
duduta 1cefb03571 ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (llama/16805)
* extract rotate_pairs logic from ggml_compute_forward_rope_f32

* templateify ggml_compute_forward_rope_f32 and _f16

* abort when rope type not supported, remove GLM from test-rope

* add imrope branch to switch

* add rope tests for perf

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-17 21:05:46 +02:00
Charles Xu 3920ecce3a kleidiai: add optimized per-channel kernels for Q8_0 (llama/16993) 2025-11-17 21:05:46 +02:00
Mike Abbott c01bf73dd1 cmake : add version to all shared object files (llama/17091)
When llama.cpp is compiled in Yocto, the build fails QA checks because the generated .so files aren't versioned. This applies a version to all generated .so files, allowing the package to build without errors.
2025-11-17 21:05:46 +02:00
lhez 46615d74d3 opencl: add fastdiv and use it in set_rows, ported from cuda (llama/17090)
* opencl: add fastdiv for mm q8_0

* opencl: use uint4 for fastdiv vals

* opencl: use fastdiv for set_rows

* opencl: do not use fastdiv for q8_0 mm
2025-11-17 21:05:46 +02:00
Max Krasnyansky ccf525baf0 cpu: skip NOPs to avoid barriers (llama/17133)
* cpu: skip NOPs to avoid barriers

* cpu: use ggml_op_is_empty
2025-11-17 21:05:46 +02:00
Georgi Gerganov 40aebfe8bf metal : cap threadgroups size of set_rows (llama/17146) 2025-11-17 21:05:46 +02:00
Adrien Gallouët 86be60093e ggml-cpu : inspect -march and -mcpu to find the CPU (llama/16333)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-17 21:05:46 +02:00
Ruben Ortlam ef71d83b76 vulkan: check glslc executable string (llama/17144) 2025-11-17 21:05:46 +02:00
Ruben Ortlam 43f2c1ff54 vulkan: fix validation issue introduced by #16868 (llama/17145) 2025-11-17 21:05:46 +02:00
Georgi Gerganov bb92c79f56 metal : enable tensor API for A19 (llama/17087) 2025-11-17 21:05:46 +02:00
fj-y-saito 4fea91f06e arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… (#15277)
* add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_q8_K

* Surround SVE function with compiler directive

* fix compile switch

* fix coding style

* ggml : fix indent

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-17 21:05:46 +02:00
Acly 58a97d988f cuda/vulkan : bicubic interpolation (llama/17022)
* vulkan : implement upscale with bicubic interpolation

* cuda : implement upscale with bicubic interpolation

* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests

* adapt OpenCL backend to not support the OP in that case so tests don't fail

* print scale mode & flags in test-backend-ops
2025-11-17 21:05:46 +02:00
Ruben Ortlam 2e04e7a906 vulkan: fix memory allocations (llama/17122) 2025-11-17 21:05:46 +02:00
KITAITI Makoto 27f485a14c vad : Silero VAD v6.2.0 (#3524)
* Add ggml-silero-v6.2.0 to download candidates

* Make default VAD model ggml-silero-v6.2.0

* Make VAD model in documentations ggml-silero-v6.2.0
2025-11-17 22:26:17 +09:00
KITAITI Makoto d9b7613b34 ruby : VAD separately from ASR (#3518)
* Add Whisper::VAD::Context

* Add test for Whisper::VAD::Context

* Add Whisper::VAD::Segment

* Add Whisper::VAD::Segments

* Add Whisper::VAD::Context#detect

* Define Whisper::VAD::Segments#each

* Define Whisper::VAD::Segment#start_time and #end_time

* Define Whisper::VAD::Segment#deconstruct_keys

* Add tests for Whisper::VAD family

* Add signatures for VAD family

* Add document on VAD in README

* Define Whisper::VAD::Segments#length

* Add test for Whisper::VAD::Segments#length

* Add signature of Segments#length

* Make vad_segments responsible to initialize VAD::Segments

* Remove meaningless argument check

* Check NULL of segments member

* Add tests for Whisper::VAD::Segments

* Initialize Whisper::VAD::Segment on .allocate

* Add tests for Whisper::VAD::Segment

* Check NULL of context member

* Add test for Whisper::VAD::Context.allocate
2025-11-13 10:15:26 +09:00
Georgi Gerganov a1867e0dad sync : llama.cpp 2025-11-09 23:38:03 +02:00
Georgi Gerganov e67dfbc51b sync : ggml 2025-11-09 23:38:03 +02:00
Ruben Ortlam 1993e397bb vulkan: iGPU memory reporting fix (llama/17110)
* vulkan: use all device-local heaps for memory availability reporting

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>

* use all available heaps for iGPU memory reporting

* Allow multiple memory types per buffer request for devices with split heaps

---------

Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-11-09 23:38:03 +02:00
Ruben Ortlam ee8349cf10 vulkan: fix mmq out of bounds reads (llama/17108)
* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code

* fix mul_mat_id quantization call

* Fix compiler warnings
2025-11-09 23:38:03 +02:00