Commit Graph

4210 Commits

Author SHA1 Message Date
Jiacheng (Jason) Chen e3f3c6ead1
HIP: enable WMMA-MMQ INT kernels for RDNA 3 (llama/17576)
* enabled wmma instructions for most quantizations other than q2k

* fixed the last q2_k test case failure

* address comments: fix out-of-bounds write for RDNA4, add comments after #endif

* clean up rebase: fix ne error in half2

* fix the EditorConfig CI
2025-12-12 17:53:17 +02:00
Piotr Wilkin (ilintar) 8d44d6181a
Add support for CUMSUM and TRI for CUDA. (llama/17584)
* Add support for CUMSUM and TRI for CUDA.

* Minor optimizations.

* Correct warp_prefix_inclusive_sum in float2 variant to return float2 (see the sketch below)

* Optimize TRI

* Whitespace

* Fix strides.

* Implement double loop

* Whitespace

* Fix HIP compilation bugs

* Optimizations + big case performance tests

* Implement using CUB with fallback to custom kernel

* Remove error message.

* Fixes from code review

* Comment out CPU-unsupported F16/BF16 cases to fix CI

* Fine, you win :P

* Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS

* Vary warp-size based on physical warp size

* Add GGML_UNUSED_VARS in tri as well

* Use constexpr and call prefix_inclusive with warp_size template param

* Update ggml/src/ggml-cuda/cumsum.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Change to tid % warp_size

* Fix strides; hardcode mask; add ggml_lane_mask_t

* Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()

* Too hasty...
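
For reference, a warp-level inclusive scan of the kind `warp_prefix_inclusive_sum` implements can be sketched as a generic Kogge-Stone shuffle scan. This is illustrative CUDA only, not the actual ggml-cuda code, which also covers a float2 variant, CUB dispatch, and HIP warp sizes:

```cuda
template <int warp_size>
static __device__ float warp_prefix_inclusive_sum(float x) {
    const int lane = threadIdx.x % warp_size;
#pragma unroll
    for (int offset = 1; offset < warp_size; offset <<= 1) {
        // pull the running sum from the lane `offset` steps below
        const float y = __shfl_up_sync(0xFFFFFFFF, x, offset, warp_size);
        if (lane >= offset) {
            x += y;
        }
    }
    return x; // lane i now holds the sum of lanes 0..i
}
```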

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-12 17:53:17 +02:00
Gabe Goodhart 8902c9d976
metal: TRI, FILL, EXPM1, SOFTPLUS (llama/16623)
* feat(wip): Port initial TRI impl from previous work

The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove argument for constant val override

This was added in the original draft, but later removed. With this, the
kernel now passes tests.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Move the ttype conditional to templating to avoid conditional in kernel

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Type fixes

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* feat: Add softplus for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add EXPM1 for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add FILL for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Branchless version of tri using _ggml_vec_tri_cmp as a mask

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unused arguments

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use select instead of branch for softplus non-vec

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:17 +02:00
Alberto Cabrera Pérez f96ebc92d2
ggml-cpu : remove asserts always evaluating to false (llama/17728) 2025-12-12 17:53:17 +02:00
Georgi Gerganov 194d016456
metal : use params per pipeline instance (llama/17739) 2025-12-12 17:53:16 +02:00
Adrien Gallouët 92e50155c9
build : move _WIN32_WINNT definition to headers (llama/17736)
Previously, cmake was forcing `_WIN32_WINNT=0x0A00` for MinGW builds.
This caused "macro redefined" warnings with toolchains that already define the version.

This also removes the `GGML_WIN_VER` variable as it is no longer needed.
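
The usual header-side pattern for this kind of fallback looks roughly like the following; an illustrative guess, not the exact contents of the change:

```cpp
#if defined(_WIN32) && !defined(_WIN32_WINNT)
#    define _WIN32_WINNT 0x0A00 // Windows 10; defined only when the toolchain has not set it
#endif
```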

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 17:53:16 +02:00
Herman Semenoff 3794a0d3b6
ggml-cpu: remove duplicate conditional check 'iid' (llama/17650) 2025-12-12 17:53:16 +02:00
Johannes Gäßler 7adbcafb6c
CUDA: generalized (mma) FA, add Volta support (llama/17505)
* CUDA: generalized (mma) FA, add Volta support

* use struct for MMA FA kernel config

---------

Co-authored-by: Aman Gupta <aman>
2025-12-12 17:53:16 +02:00
Georgi Gerganov 4a00f2e3a4
metal : fix data race in pipeline library (llama/17731) 2025-12-12 17:53:16 +02:00
Reese Levine d263bdbfb6
ggml webgpu: add support for emscripten builds (llama/17184)
* Faster tensors (llama/8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (llama/9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still a multiple of the sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (llama/4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Fix .gitignore

* Add memory64 option and remove unneeded macros for setting threads to 1

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-12-12 17:53:16 +02:00
Jeff Bolz 86cb5ab93f
vulkan: Reduce temporary memory usage for TOP_K (llama/17623)
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output"; what used to be "ncols_output" is now "k"

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
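
As a sanity check on those figures: 3.2 MB over 200000 source elements works out to about 16 bytes of temporary storage per input element before the change, and 3.2 MB / 500 KB is roughly a 6.4x reduction from sizing the buffer by the first-pass output instead of the full row.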
2025-12-12 17:53:15 +02:00
xiaobing318 fffdf679d4
cmake : add utf8 compilation options for msvc (llama/17682) 2025-12-12 17:53:15 +02:00
Adrien Gallouët 16688c6d2c
ggml : use svcntb() for SVE vector length detection (llama/17474)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 17:53:15 +02:00
TianHao324 a64d46a529
CANN: Disable Ger operator of OUT_PROD on 310p device (llama/17563) 2025-12-12 17:53:15 +02:00
Daniel Bevenius 201b910743
ggml : remove redundant n_copies check when setting input/output (llama/17612)
This commit removes a redundant check for sched->n_copies > 1 when
setting input and output flags on tensor copies in
ggml_backend_sched_split_graph.

The motivation for this change is to clarify the code as the outer if
statement already performs this check.
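
Sketched concretely (hypothetical structure only; the real logic lives in ggml_backend_sched_split_graph):

```cpp
struct sched_t { int n_copies; };

void set_copy_flags(const sched_t * sched) {
    if (sched->n_copies > 1) {
        for (int c = 0; c < sched->n_copies; c++) {
            // inner check removed by this commit: the enclosing if already
            // guarantees that sched->n_copies > 1 holds here
            if (sched->n_copies > 1) {
                // ... set input/output flags on the c-th tensor copy ...
            }
        }
    }
}
```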
2025-12-12 17:53:15 +02:00
Adrien Gallouët e2537b4af3
ggml : add fallback definition for HWCAP2_SVE2 (llama/17683)
This aligns with other HWCAP2 feature flags.

See #17528

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 17:53:15 +02:00
Aman Gupta 4c89232b5c
ggml-cuda: reorder only relevant nodes (llama/17639) 2025-12-12 17:53:14 +02:00
Neo Zhang Jianyu 26732d28c4
enhance argsort for UT (llama/17573)
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2025-12-12 17:53:14 +02:00
Georgi Gerganov 32090930f7
metal : add FA head size 48 (llama/17619) 2025-12-12 17:53:14 +02:00
Georgi Gerganov 7cd3de89bf
ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler (llama/17617) 2025-12-12 17:53:14 +02:00
Aman Gupta 6cc2d0534f
llama-graph: avoid expand_forward for fusion (llama/17633) 2025-12-12 17:53:14 +02:00
Tarek Dakhran 0defeee679
model: LFM2-VL fixes (llama/17577)
* Adjust to pytorch

* Add antialiasing upscale

* Increase number of patches to 1024

* Handle default marker insertion for LFM2

* Switch to flag

* Reformat

* Cuda implementation of antialias kernel

* Change placement in ops.cpp

* consistent float literals

* Pad only for LFM2

* Address PR feedback

* Rollback default marker placement changes

* Fall back to the CPU implementation of the antialias upscale
2025-12-12 17:53:14 +02:00
Gilad S. 706647202e
ggml: fix: macOS build with `-DGGML_BACKEND_DL=ON` (llama/17581) 2025-12-12 17:53:13 +02:00
Aman Gupta e68ee6e281
CUDA: add stream-based concurrency (llama/16991)
* CUDA: add stream-based concurrency

* HIP: fix hipStreamWaitEvent define and nodiscard warnings

* ggml-cuda: fix fusion inside stream

* ggml-cuda: fix bug w.r.t first stream launch

* ggml-cuda: format

* ggml-cuda: improve assert message

* ggml-cuda: use lambda instead of duplicating code

* ggml-cuda: add some more comments

* ggml-cuda: add more detailed comments about concurrency

* ggml-cuda: rename + remove unused var

* ggml-cuda: fix condition for stream launch

* ggml-cuda: address review comments, add destructor

* common.cuh: add is_valid for concurrent events

* common.cuh: make comment better

* update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* common.cuh: fix lower_bound condition + remove join_node data from write_ranges

* ggml-cuda: fix overlap condition + shadowing parameter

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-12 17:53:13 +02:00
Mahekk Shaikh 2e4a7a21fa
cuda : add error checking for cudaMemcpyAsync in argsort (llama/17599)
* cuda : add error checking for cudaMemcpyAsync in argsort (llama/12836)

* fix indentation
2025-12-12 17:53:13 +02:00
Acly 2258930c2e
vulkan : fix FA mask load with bounds check (coopmat2) (llama/17606) 2025-12-12 17:53:13 +02:00
Neo Zhang a3459484bf
sycl : support allocating more than 4GB of memory on device, update the doc and script (llama/17566)
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2025-12-12 17:53:13 +02:00
ixgbe 28dff06555
ggml: replace hwcap with riscv_hwprobe for RVV detection (llama/17567)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-12-12 17:53:12 +02:00
Ruben Ortlam 2fcc0a3a9f
Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (llama/16900)
* vulkan: split mul_mmq_funcs for mul_mat_vecq use

* add mxfp4 mmvq

* add q2_k mmvq

* add q3_k mmvq

* add q4_k and q5_k mmvq

* add q6_k mmvq

* handle 4x4 quants per mmvq thread

* enable MUL_MAT_ID mmvq support

* enable subgroup optimizations for mul_mat_vec_id shaders

* device tuning

* request prealloc_y sync after quantization

* fix indentation

* fix llvmpipe test failures

* fix mul_mat_id mmvq condition

* fix unused variable warning
2025-12-12 17:53:12 +02:00
Jeff Bolz dbf8766ffa
vulkan: improve topk perf for large k, fix overflow in unit tests (llama/17582) 2025-12-12 17:53:12 +02:00
Diego Devesa 463003e76c
ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (llama/17276)
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.

* llama : update worst-case graph for unified cache

* ci : disable op offload in some tests

* fix spelling

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:12 +02:00
R0CKSTAR c372bdbb3c
enable fp16/fast_fp16/bf16_mma on PH1 (llama/17551)
* [MUSA] enable fp16/fast_fp16/bf16_mma on PH1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/fattn-vec.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/fattn-vec.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/fattn-tile.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-12 17:53:12 +02:00
Aman Gupta 90ca4e0a07
ggml-cuda: add stricter checking for fusion (llama/17568)
* ggml-cuda: make conditions for fusion more explicit

* ggml-cuda: remove size check as std::equal already does it
2025-12-12 17:53:12 +02:00
Piotr Wilkin (ilintar) 43441ff58a
model : Qwen3 Next (llama/16095)
* Qwen3 Next - cleaned up version

* Whitespaces and stuff

* Correct minor errors

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Misc. fixes.

* Clean up code, add missing hybrid qualifier

* Did someone transpose the SOLVE_TRI result matrix? Perhaps...

* Whitespace

* Proper tensors for cb calls

* Use llama-graph.h vertical alignment

* BROKEN: chunking

* Set new tensors as inputs.

* Proper chunk logic

* It's the circle of life...

* More shenanigans for n_seq > 1

* Nail in the coffin?

* Fix Windows build

* Eh, one fails on Windows, the other fails on Mac... just use general capture.

* quant : cleanup

* model : cleanup

* qwen3 : cleanup

* cont : cleanup

* cont : cleanup

* ggml : revert change

* qwen3 : cleanup

* cont : cleanup

* Readd cmath

* qwen3 : fix typo

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Usual suspects

* fix my bad suggestion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:11 +02:00
Johannes Gäßler 37e4c2ed3a
CUDA: no FP16 arithmetic for vector FA kernel (llama/17558) 2025-12-12 17:53:11 +02:00
Jeff Bolz 7a20963140
vulkan: Implement GGML_OP_TRI (llama/17503)
* vulkan: Implement GGML_OP_TRI

* check types match
2025-12-12 17:53:11 +02:00
Radoslav Gerganov d26d1c8b85
rpc : cache and reuse compute graphs (llama/15405)
Store the last computed graph and reuse it when possible.
Also do not return a response from GRAPH_COMPUTE and assume it always
completes successfully. If this is not the case, the server closes
the connection. This saves us a network round trip to the server.
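
A minimal sketch of the caching idea (hypothetical helper names; not the actual rpc-server code):

```cpp
#include <cstdint>
#include <vector>

struct ggml_cgraph;                                            // from ggml.h
ggml_cgraph * deserialize_graph(const std::vector<uint8_t> &); // hypothetical
void          compute_graph(ggml_cgraph *);                    // hypothetical

static std::vector<uint8_t> last_bytes;           // last serialized graph
static ggml_cgraph *        last_graph = nullptr; // reusable deserialized graph

// Rebuild the graph only when the client sends different bytes.
static void handle_graph_compute(const std::vector<uint8_t> & bytes) {
    if (last_graph == nullptr || bytes != last_bytes) {
        last_graph = deserialize_graph(bytes);
        last_bytes = bytes;
    }
    compute_graph(last_graph); // no response is sent; if computation fails,
                               // the server simply closes the connection
}
```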
2025-12-12 17:53:11 +02:00
yulo f92d542d4d
HIP: enable mul_mat_f for RDNA4 (llama/17437)
* enable mmf for rdna4

* move some mmvf to mmf

* revert lds128 for wmma loading

* Revert "revert lds128 for wmma loading"

This reverts commit db9ae8b6b4738a5def5b393caa1611d52133e9b5.

* Revert "enable mmf for rdna4"

This reverts commit 698c9f24187b990e35c3b73a8067e5387e6ddbd4.

* Revert "move some mmvf to mmf"

This reverts commit 99b92bd6653cc8593607f641e44606391691792f.

* enable mul_mat for rdna4

---------

Co-authored-by: zhang hui <you@example.com>
2025-12-12 17:53:11 +02:00
Piotr Wilkin (ilintar) 51e842d106
SOLVE_TRI CUDA kernel for small matrices (llama/17457) 2025-12-12 17:53:11 +02:00
Neo Zhang Jianyu 93bc8dc5a8
refactor pad_reflect_1d to make the UT case pass (llama/17204)
Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
2025-12-12 17:53:10 +02:00
Jeff Bolz 3727a36c48
vulkan: Implement SOLVE_TRI (llama/17486)
* vulkan: Implement SOLVE_TRI

* load B matrix through shared memory

* use FLOAT_TYPE
2025-12-12 17:53:10 +02:00
matt23654 e682af7886
cuda : fix UMA detection on discrete GPUs. (llama/17537) 2025-12-12 17:53:10 +02:00
Alberto Cabrera Pérez 93f6cdb9c0
ggml-cpu: aarm64: q4_K repack gemm and gemv implementations (dotprod only) (llama/17494)
* Enabled q4_K_4x8 path

* Fixed generic Q4_K 8x4 implementation

* wip: dotprod gemm

* Working arm q4_K dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Undo acc rename

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Q4_K arm dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix: q4_qs reinterpret from uint to int

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Removed comments

* Fixed macro guards

* Fixed unused vars in generic implementation

* Fixed unused vars in 8x4 repack

* Fixed unused vars in generic implementation, unneeded comment

* Missing arch fallback for x86

* minor : style

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:10 +02:00
Acly ac92424b59
vulkan : move contiguous checks to device_supports_op (llama/17490)
* vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op

* im2col: remove constraints on src0 (kernel input)
2025-12-12 17:53:10 +02:00
Jeff Bolz 310db24fca
vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (llama/17514) 2025-12-12 17:53:10 +02:00
lhez 74ef5dd1a9
opencl: add sqr, sqrt, mean and ssm_conv (llama/17476)
* opencl: add sqr

* opencl: add sqrt

* opencl: add mean

* opencl: add ssm_conv

* opencl: add missing cl_khr_fp16

* opencl: do sqrt in f32 then convert to f16 for better precision
2025-12-12 17:53:09 +02:00
Alberto Cabrera Pérez 3de4372465
Fix chunks being too small with small matrix sizes (llama/17526) 2025-12-12 17:53:09 +02:00
Jeff Bolz c8050e5fdc
vulkan: allow graph_optimize for prompt processing workloads (llama/17475) 2025-12-12 17:53:09 +02:00
Jeff Bolz d8b61e05f8
vulkan: Implement top-k (llama/17418)
* vulkan: Implement top-k

Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discard all but the top K. Repeat until only K are left. There's also a fast
path when K==1 to just find the max value rather than sorting (see the sketch at
the end of this message).

* fix pipeline selection

* vulkan: Add N-ary search algorithm for topk

* microoptimizations
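
A host-side C++ sketch of that multi-pass scheme, assuming each "workgroup" sorts a fixed-size chunk and keeps its top K (illustrative only, not the Vulkan shader code):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

std::vector<float> topk_multipass(std::vector<float> v, size_t K, size_t chunk = 1024) {
    while (v.size() > chunk) {
        std::vector<float> kept;
        for (size_t i = 0; i < v.size(); i += chunk) {
            const size_t end = std::min(i + chunk, v.size());
            // one "workgroup": sort its 2^N slice in descending order ...
            std::sort(v.begin() + i, v.begin() + end, std::greater<float>());
            // ... then discard everything but that slice's top K
            kept.insert(kept.end(), v.begin() + i, v.begin() + i + std::min(K, end - i));
        }
        v = std::move(kept); // repeat on the survivors
    }
    std::sort(v.begin(), v.end(), std::greater<float>());
    v.resize(std::min(K, v.size()));
    return v; // the real kernel also special-cases K == 1 as a plain max
}
```

Each pass is safe because a globally top-K element is always within the top K of whichever chunk it lands in, so discarding the rest of the chunk cannot lose it.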
2025-12-12 17:53:09 +02:00
xctan fb31a19797
ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16 (llama/17448)
* ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16

* ggml-cpu : dedup scalar impl

* Update ggml/src/ggml-cpu/vec.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:09 +02:00