Commit Graph

3856 Commits

Author SHA1 Message Date
Georgi Gerganov 194d016456
metal : use params per pipeline instance (llama/17739) 2025-12-12 17:53:16 +02:00
Adrien Gallouët 92e50155c9
build : move _WIN32_WINNT definition to headers (llama/17736)
Previously, cmake was forcing `_WIN32_WINNT=0x0A00` for MinGW builds.
This caused "macro redefined" warnings with toolchains that define the version.

This also removes the `GGML_WIN_VER` variable as it is no longer needed.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 17:53:16 +02:00
Herman Semenoff 3794a0d3b6
ggml-cpu: remove duplicate conditional check 'iid' (llama/17650) 2025-12-12 17:53:16 +02:00
Johannes Gäßler 7adbcafb6c
CUDA: generalized (mma) FA, add Volta support (llama/17505)
* CUDA: generalized (mma) FA, add Volta support

* use struct for MMA FA kernel config

---------

Co-authored-by: Aman Gupta <aman>
2025-12-12 17:53:16 +02:00
Georgi Gerganov 4a00f2e3a4
metal : fix data race in pipeline library (llama/17731) 2025-12-12 17:53:16 +02:00
Reese Levine d263bdbfb6
ggml webgpu: add support for emscripten builds (llama/17184)
* Faster tensors (llama/8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (llama/9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (llama/4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Fix .gitignore

* Add memory64 option and remove unneeded macros for setting threads to 1

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-12-12 17:53:16 +02:00
Jeff Bolz 86cb5ab93f
vulkan: Reduce temporary memory usage for TOP_K (llama/17623)
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size.
- Pass the output row size as "ncols_output"; what used to be "ncols_output" is now "k".

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
2025-12-12 17:53:15 +02:00
xiaobing318 fffdf679d4
cmake : add utf8 compilation options for msvc (llama/17682) 2025-12-12 17:53:15 +02:00
Adrien Gallouët 16688c6d2c
ggml : use svcntb() for SVE vector length detection (llama/17474)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 17:53:15 +02:00
TianHao324 a64d46a529
CANN: Disable Ger operator of OUT_PROD on 310p device (llama/17563) 2025-12-12 17:53:15 +02:00
Daniel Bevenius 201b910743
ggml : remove redundant n_copies check when setting input/output (llama/17612)
This commit removes a redundant check for sched->n_copies > 1 when
setting input and output flags on tensor copies in
ggml_backend_sched_split_graph.

The motivation for this change is to clarify the code as the outer if
statement already performs this check.
2025-12-12 17:53:15 +02:00
Adrien Gallouët e2537b4af3
ggml : add fallback definition for HWCAP2_SVE2 (llama/17683)
This aligns with other HWCAP2 feature flags

See #17528

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 17:53:15 +02:00
Aman Gupta 4c89232b5c
ggml-cuda: reorder only relevant nodes (llama/17639) 2025-12-12 17:53:14 +02:00
Neo Zhang Jianyu 26732d28c4
enhance argsort for UT (llama/17573)
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2025-12-12 17:53:14 +02:00
Georgi Gerganov 32090930f7
metal : add FA head size 48 (llama/17619) 2025-12-12 17:53:14 +02:00
Georgi Gerganov 7cd3de89bf
ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler (llama/17617) 2025-12-12 17:53:14 +02:00
Aman Gupta 6cc2d0534f
llama-graph: avoid expand_forward for fusion (llama/17633) 2025-12-12 17:53:14 +02:00
Tarek Dakhran 0defeee679
model: LFM2-VL fixes (llama/17577)
* Adjust to pytorch

* Add antialiasing upscale

* Increase number of patches to 1024

* Handle default marker insertion for LFM2

* Switch to flag

* Reformat

* Cuda implementation of antialias kernel

* Change placement in ops.cpp

* consistent float literals

* Pad only for LFM2

* Address PR feedback

* Rollback default marker placement changes

* Fallback to CPU implementation for antialias implementation of upscale
2025-12-12 17:53:14 +02:00
Gilad S. 706647202e
ggml: fix: macOS build with `-DGGML_BACKEND_DL=ON` (llama/17581) 2025-12-12 17:53:13 +02:00
Aman Gupta e68ee6e281
CUDA: add stream-based concurrency (llama/16991)
* CUDA: add stream-based concurrency

* HIP: fix hipStreamWaitEvent define and nodiscard warnings

* ggml-cuda: fix fusion inside stream

* ggml-cuda: fix bug w.r.t first stream launch

* ggml-cuda: format

* ggml-cuda: improve assert message

* ggml-cuda: use lambda instead of duplicating code

* ggml-cuda: add some more comments

* ggml-cuda: add more detailed comments about concurrency

* ggml-cuda: rename + remove unused var

* ggml-cuda: fix condition for stream launch

* ggml-cuda: address review comments, add destructor

* common.cuh: add is_valid for concurrent events

* common.cuh: make comment better

* update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* common.cuh: fix lower_bound condition + remove join_node data from write_ranges

* ggml-cuda: fix overlap condition + shadowing parameter

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-12 17:53:13 +02:00
Mahekk Shaikh 2e4a7a21fa
cuda : add error checking for cudaMemcpyAsync in argsort (llama/17599)
* cuda : add error checking for cudaMemcpyAsync in argsort (llama/12836)

* fix indentation
2025-12-12 17:53:13 +02:00
Acly 2258930c2e
vulkan : fix FA mask load with bounds check (coopmat2) (llama/17606) 2025-12-12 17:53:13 +02:00
Neo Zhang a3459484bf
sycl : support to malloc memory on device more than 4GB, update the doc and script (llama/17566)
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2025-12-12 17:53:13 +02:00
ixgbe 28dff06555
ggml: replace hwcap with riscv_hwprobe for RVV detection (llama/17567)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-12-12 17:53:12 +02:00
Ruben Ortlam 2fcc0a3a9f
Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (llama/16900)
* vulkan: split mul_mmq_funcs for mul_mat_vecq use

* add mxfp4 mmvq

* add q2_k mmvq

* add q3_k mmvq

* add q4_k and q5_k mmvq

* add q6_k mmvq

* handle 4x4 quants per mmvq thread

* enable MUL_MAT_ID mmvq support

* enable subgroup optimizations for mul_mat_vec_id shaders

* device tuning

* request prealloc_y sync after quantization

* fix indentation

* fix llvmpipe test failures

* fix mul_mat_id mmvq condition

* fix unused variable warning
2025-12-12 17:53:12 +02:00
Jeff Bolz dbf8766ffa
vulkan: improve topk perf for large k, fix overflow in unit tests (llama/17582) 2025-12-12 17:53:12 +02:00
Diego Devesa 463003e76c
ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (llama/17276)
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.

* llama : update worst-case graph for unified cache

* ci : disable op offload in some tests

* fix spelling

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:12 +02:00
R0CKSTAR c372bdbb3c
enable fp16/fast_fp16/bf16_mma on PH1 (llama/17551)
* [MUSA] enable fp16/fast_fp16/bf16_mma on PH1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/fattn-vec.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/fattn-vec.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/fattn-tile.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-12 17:53:12 +02:00
Aman Gupta 90ca4e0a07
ggml-cuda: add stricter checking for fusion (llama/17568)
* ggml-cuda: make conditions for fusion more explicit

* ggml-cuda: remove size check as std::equal already does it
2025-12-12 17:53:12 +02:00
Piotr Wilkin (ilintar) 43441ff58a
model : Qwen3 Next (llama/16095)
* Qwen3 Next - cleaned up version

* Whitespaces and stuff

* Correct minor errors

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Misc. fixes.

* Clean up code, add missing hybrid qualifier

* Did someone transpose the SOLVE_TRI result matrix? Perhaps...

* Whitespace

* Proper tensors for cb calls

* Use llama-graph.h vertical alignment

* BROKEN: chunking

* Set new tensors as inputs.

* Proper chunk logic

* It's the circle of life...

* More shenanigans for n_seq > 1

* Nail in the coffin?

* Fix Windows build

* Eh, one fails on Windows, the other fails on Mac... just use general capture.

* quant : cleanup

* model : cleanup

* qwen3 : cleanup

* cont : cleanup

* cont : cleanup

* ggml : revert change

* qwen3 : cleanup

* cont : cleanup

* Readd cmath

* qwen3 : fix typo

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Usual suspects

* fix my bad suggestion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:11 +02:00
Johannes Gäßler 37e4c2ed3a
CUDA: no FP16 arithmetic for vector FA kernel (llama/17558) 2025-12-12 17:53:11 +02:00
Jeff Bolz 7a20963140
vulkan: Implement GGML_OP_TRI (llama/17503)
* vulkan: Implement GGML_OP_TRI

* check types match
2025-12-12 17:53:11 +02:00
Radoslav Gerganov d26d1c8b85
rpc : cache and reuse compute graphs (llama/15405)
Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this is not the case, the server closes
the connection. This saves us a network round trip to the server.
2025-12-12 17:53:11 +02:00
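
The caching idea in the rpc commit above can be sketched roughly as follows. This is a minimal illustration, not the actual ggml RPC code: the class names, the `GRAPH_RECOMPUTE` command name, and the use of raw bytes for a serialized graph are all assumptions made for the example. Only the `GRAPH_COMPUTE` name and the "no response, close the connection on failure" behaviour come from the commit message.

```python
class GraphClient:
    """Client side of a hypothetical RPC link that remembers the last graph sent."""

    def __init__(self, send):
        # `send` is any callable taking (command, payload); no reply is read back.
        self._send = send
        self._last_graph = None

    def compute(self, graph: bytes) -> str:
        if graph == self._last_graph:
            # Same graph as last time: ask the server to reuse its stored copy.
            command, payload = "GRAPH_RECOMPUTE", b""  # hypothetical command name
        else:
            command, payload = "GRAPH_COMPUTE", graph
            self._last_graph = graph
        self._send(command, payload)  # fire and forget, no round trip for a status
        return command


class GraphServer:
    """Server side: stores the last graph and re-runs it on request."""

    def __init__(self):
        self._cached = None
        self.runs = 0

    def handle(self, command: str, payload: bytes) -> None:
        if command == "GRAPH_COMPUTE":
            self._cached = payload
        elif command != "GRAPH_RECOMPUTE" or self._cached is None:
            # Stand-in for "the server closes the connection" on failure.
            raise ConnectionError("closing connection")
        self.runs += 1  # placeholder for actually executing the cached graph


server = GraphServer()
client = GraphClient(server.handle)
client.compute(b"graph-A")  # full graph is sent
client.compute(b"graph-A")  # only a short reuse request is sent
```

Comparing whole serialized graphs is the simplest possible reuse test; a real implementation may well decide "same graph" differently.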
yulo f92d542d4d
HIP: enable mul_mat_f for RDNA4 (llama/17437)
* enable mmf for rdna4

* move some mmvf to mmf

* revert lds128 for wmma loading

* Revert "revert lds128 for wmma loading"

This reverts commit db9ae8b6b4738a5def5b393caa1611d52133e9b5.

* Revert "enable mmf for rdna4"

This reverts commit 698c9f24187b990e35c3b73a8067e5387e6ddbd4.

* Revert "move some mmvf to mmf"

This reverts commit 99b92bd6653cc8593607f641e44606391691792f.

* enable mul_mat for rdna4

---------

Co-authored-by: zhang hui <you@example.com>
2025-12-12 17:53:11 +02:00
Piotr Wilkin (ilintar) 51e842d106
SOLVE_TRI CUDA kernel for small matrices (llama/17457) 2025-12-12 17:53:11 +02:00
Neo Zhang Jianyu 93bc8dc5a8
refactor pad_reflect_1d to make the UT case pass (llama/17204)
Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
2025-12-12 17:53:10 +02:00
Jeff Bolz 3727a36c48
vulkan: Implement SOLVE_TRI (llama/17486)
* vulkan: Implement SOLVE_TRI

* load B matrix through shared memory

* use FLOAT_TYPE
2025-12-12 17:53:10 +02:00
matt23654 e682af7886
cuda : fix UMA detection on discrete GPUs. (llama/17537) 2025-12-12 17:53:10 +02:00
Alberto Cabrera Pérez 93f6cdb9c0
ggml-cpu: aarm64: q4_K repack gemm and gemv implementations (dotprod only) (llama/17494)
* Enabled q4_K_4x8 path

* Fixed generic Q4_K 8x4 implementation

* wip: dotprod gemm

* Working arm q4_K dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Undo acc rename

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Q4_K arm dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix: q4_qs reinterpret from uint to int

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Removed comments

* Fixed macro guards

* Fixed unused vars in generic implementation

* Fixed unused vars in 8x4 repack

* Fixed unused vars in generic implementation, unneeded comment

* Missing arch fallback for x86

* minor : style

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:10 +02:00
Acly ac92424b59
vulkan : move contiguous checks to device_supports_op (llama/17490)
* vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op

* im2col: remove constraints on src0 (kernel input)
2025-12-12 17:53:10 +02:00
Jeff Bolz 310db24fca
vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (llama/17514) 2025-12-12 17:53:10 +02:00
lhez 74ef5dd1a9
opencl: add sqr, sqrt, mean and ssm_conv (llama/17476)
* opencl: add sqr

* opencl: add sqrt

* opencl: add mean

* opencl: add ssm_conv

* opencl: add missing cl_khr_fp16

* opencl: do sqrt in f32 then convert to f16 for better precision
2025-12-12 17:53:09 +02:00
Alberto Cabrera Pérez 3de4372465
Fix chunks being too small with small matrix sizes (llama/17526) 2025-12-12 17:53:09 +02:00
Jeff Bolz c8050e5fdc
vulkan: allow graph_optimize for prompt processing workloads (llama/17475) 2025-12-12 17:53:09 +02:00
Jeff Bolz d8b61e05f8
vulkan: Implement top-k (llama/17418)
* vulkan: Implement top-k

Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. Repeat until only K are left. And there's a fast
path when K==1 to just find the max value rather than sorting.

* fix pipeline selection

* vulkan: Add N-ary search algorithm for topk

* microoptimizations
2025-12-12 17:53:09 +02:00
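
The multi-pass scheme described in the top-k commit above can be modelled on the CPU as a rough sketch. This is not the Vulkan shader: the function name and the `chunk_log2` parameter are made up for illustration, each "workgroup" is just a Python slice, and the sketch assumes k is smaller than the chunk size so that every pass shrinks the data.

```python
def topk_multipass(values, k, chunk_log2=7):
    """Return the k largest values in descending order using repeated chunked passes."""
    if k <= 0 or not values:
        return []
    if k == 1:
        # Fast path: a single maximum needs no sorting.
        return [max(values)]

    chunk = 1 << chunk_log2  # each simulated workgroup sorts 2^N elements
    if k >= chunk:
        raise ValueError("this sketch needs k < 2**chunk_log2 so each pass makes progress")

    current = list(values)
    while len(current) > k:
        survivors = []
        for start in range(0, len(current), chunk):
            # Sort one chunk and discard all but its top k elements.
            block = sorted(current[start:start + chunk], reverse=True)
            survivors.extend(block[:k])
        current = survivors
    return sorted(current, reverse=True)


data = [5, 1, 9, 3, 7, 2, 8, 6, 4, 0, 11, 10]
print(topk_multipass(data, 3, chunk_log2=2))  # [11, 10, 9]
```

Every element of the true top k survives each pass, because it is always among the k largest of whichever chunk it lands in; that is what makes discarding the rest safe.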
xctan fb31a19797
ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16 (llama/17448)
* ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16

* ggml-cpu : dedup scalar impl

* Update ggml/src/ggml-cpu/vec.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:09 +02:00
Adrien Gallouët 8e3560c7ce
ggml : fix ARM feature verification (llama/17519)
On arm64 with `cmake` version 3.31.6, the final feature verification fails:

    -- ARM detected flags: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs
    -- Performing Test GGML_MACHINE_SUPPORTS_dotprod
    -- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
    -- Performing Test GGML_MACHINE_SUPPORTS_i8mm
    -- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
    -- Performing Test GGML_MACHINE_SUPPORTS_sve
    -- Performing Test GGML_MACHINE_SUPPORTS_sve - Success
    -- Performing Test GGML_MACHINE_SUPPORTS_sme
    -- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
    -- Performing Test GGML_MACHINE_SUPPORTS_nosme
    -- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success
    -- Checking for ARM features using flags:
    --   -U__ARM_FEATURE_SME
    --   -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
    -- Performing Test HAVE_DOTPROD
    -- Performing Test HAVE_DOTPROD - Failed
    -- Performing Test HAVE_SVE
    -- Performing Test HAVE_SVE - Failed
    -- Performing Test HAVE_MATMUL_INT8
    -- Performing Test HAVE_MATMUL_INT8 - Failed
    -- Performing Test HAVE_FMA
    -- Performing Test HAVE_FMA - Success
    -- Performing Test HAVE_FP16_VECTOR_ARITHMETIC
    -- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Failed
    -- Performing Test HAVE_SME
    -- Performing Test HAVE_SME - Failed
    -- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme

We need to explicitly replace `;` with spaces in the list so that
`CMAKE_REQUIRED_FLAGS` works correctly...

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 17:53:08 +02:00
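
The string transformation behind the ARM feature verification fix above can be illustrated outside CMake. CMake stores lists as `;`-separated strings, while `CMAKE_REQUIRED_FLAGS` expects a single space-separated string of flags. The sketch below only models that conversion; the helper name is hypothetical and this is not the code from the commit.

```python
def cmake_list_to_flags(cmake_list: str) -> str:
    """Turn a ';'-separated CMake list into a space-separated flag string."""
    # Empty items (e.g. from a trailing ';') are dropped.
    return " ".join(item for item in cmake_list.split(";") if item)


# Flag list taken from the log output quoted in the commit message.
arch_flags = (
    "-U__ARM_FEATURE_SME;"
    "-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme"
)
print(cmake_list_to_flags(arch_flags))
```

Within CMake itself the same effect is usually obtained with a string replace or list join before the flags are handed to the feature checks.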
Jiacheng (Jason) Chen bb7223da8a
HIP: Patch failed testcase in WMMA-MMQ kernels for RDNA 4 (llama/17502)
* patch failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) for enabling WMMA on RDNA4

* Quick clean up on mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162
2025-12-12 17:53:08 +02:00
hipudding f0c54d47e1
CANN: Add MROPE and IMROPE support (llama/17401)
* CANN: ROPE supports both MROPE and IMROPE.

1. Optimize the caching logic of rope_cache_init.
2. Add support for mRoPE and i-mRoPE.

Note that on Ascend 910B devices, it is necessary to disable FA
in CLIP and disable NZ-format conversion. These two issues are
still under investigation.

* Resolve review comments
2025-12-12 17:53:08 +02:00
Jeff Bolz 208450048c
vulkan: Implement GGML_OP_CUMSUM (llama/17479) 2025-12-12 17:53:08 +02:00