Commit Graph

3874 Commits

Aaron Teo 89a7b4d22c
ggml-cpu: implement MXFP4 SIMD for s390x (llama/16193)
* ggml-cpu: impl mxfp4 s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: missing s = sumf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix incorrect kval_mxfp4 type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rework mxfp4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: missing delta calc

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix typo

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix typo for vec_splats

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: expand to 2 blocks per loop

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add unroll to boost perf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: back to 1 block per loop to test perf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: back to 1 block per loop to test perf"

This reverts commit 1fe55724e2dc295701101bf838bdd4a512237492.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rm unroll from single block

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-29 15:18:11 +03:00
R0CKSTAR 98ac209ae1
musa: fix build warnings (llama/15611)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-29 15:18:10 +03:00
Aman Gupta d9bf63cfb8
CUDA: add a fused top-K MoE kernel (llama/16130)
* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
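The three steps listed above can be sketched in plain Python (a reference for the semantics only, not the CUDA kernel; the function name and the exact tie-breaking rule are illustrative assumptions):

```python
import math

def topk_moe(logits, k, renormalize=False):
    """Reference semantics of the fused softmax -> top-k -> gather
    pipeline for one token. logits holds one score per expert.
    Returns (weights, ids) for the k selected experts."""
    # 1. numerically stable softmax over all expert logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    probs = [e / s for e in exps]

    # 2. top-k selection; ties broken by lower expert id,
    #    mimicking a stable argmax reduce
    ids = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    weights = [probs[i] for i in ids]

    # 3. optional renormalization so the k kept weights sum to 1
    if renormalize:
        t = sum(weights)
        weights = [w / t for w in weights]
    return weights, ids
```

Fusing these steps into one kernel avoids materializing the full softmax output and the intermediate top-k tensors in global memory.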
2025-09-29 15:18:10 +03:00
junchao-zhao 24ea5476de
ggml : fix loongarch lsx compilation error (llama/15864) 2025-09-29 15:18:10 +03:00
Daniel Bevenius 611ff19f20
ggml : remove -dev suffix from release version (ggml/1355)
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
formatted as `MAJOR.MINOR.PATCH`.
2025-09-29 15:18:10 +03:00
Daniel Bevenius 06d7b3d124
ggml : bump version to 0.9.3 (ggml/1353) 2025-09-29 15:18:10 +03:00
Georgi Gerganov ac678efb35
metal : fuse NORM + MUL + ADD, support non-multiples of 4 (llama/16220)
* metal : fuse NORM + MUL + ADD

* metal : support norms of non-multiple of 4

* cont : fix comment [no ci]
2025-09-29 15:18:10 +03:00
Georgi Gerganov 268f1c961b
metal : relax reorder conditions (llama/16216) 2025-09-29 15:18:10 +03:00
Georgi Gerganov 0a5b811f2e
metal : restore im2col perf (llama/16219) 2025-09-29 15:18:10 +03:00
Radoslav Gerganov 0946619662
rpc : use ggml logging facilities
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
2025-09-29 15:18:10 +03:00
Johannes Gäßler cd431223e0
llama: print memory breakdown on exit (llama/15860)
* llama: print memory breakdown on exit
2025-09-29 15:18:10 +03:00
Acly 5069c08034
ggml : split graph allocations according to backend max buffer size (llama/15815)
* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
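The chunk-splitting behavior described above can be sketched as a toy planner (a simplification under stated assumptions: real gallocr keeps a per-chunk free-block list and reuses freed ranges, while this sketch only bump-allocates):

```python
def plan_chunks(sizes, max_chunk, align=32):
    """Toy sketch: place each allocation into backend buffer
    "chunks" no larger than max_chunk, preferring leftover space
    in earlier chunks before opening a new one.
    Returns ((chunk, offset) per tensor, used bytes per chunk)."""
    chunks = []       # used bytes per chunk
    placements = []   # (chunk index, offset) per allocation
    for size in sizes:
        size = (size + align - 1) // align * align
        # continue to use leftover unallocated space of previous chunks
        for ci, used in enumerate(chunks):
            if used + size <= max_chunk:
                placements.append((ci, used))
                chunks[ci] = used + size
                break
        else:
            # start a new chunk; a single allocation larger than
            # max_chunk still gets its own (oversized) chunk
            chunks.append(size)
            placements.append((len(chunks) - 1, 0))
    return placements, chunks
```

Tracking (chunk, offset) pairs instead of one global offset is what lets a graph that exceeds the backend's max buffer size span multiple backend buffers.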
2025-09-29 15:18:09 +03:00
Xiangyan Sun 41245891c1
ggml-cpu: Respect cpumask settings (llama/16164) 2025-09-29 15:18:09 +03:00
Sigbjørn Skjæret 73e8f3acb8
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (llama/15928)
* fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl

* change initialization to true
2025-09-29 15:18:09 +03:00
Aaron Teo c706a50746
zdnn: refactor codebase + add docs (llama/16178)
* zdnn: initial matmul refactor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm static from funcs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update ggml-zdnn.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: change header files to hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch to common.hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move mulmat forward around

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm inline from utils

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: add zDNN docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-29 15:18:09 +03:00
Daniel Bevenius d8d31e3638
ggml-cpu : fix typo in gemm comments [no ci] (llama/16189) 2025-09-29 15:18:09 +03:00
Sigbjørn Skjæret 4e32ee733b
ggml : implement set_rows with i32 index (llama/16159)
* implement set_rows with i32 index

* template fix

* test quantized path

warnings--

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* forgotten name change

* deduplicate cuda/sycl and test-fix

* indent++

* vulkan: support set_rows with i32 index type (llama/16162)

* disable i32 index for webgpu for now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
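The operation being extended here has simple semantics, sketched below in plain Python (the change above widens the supported index type from i64 to i32; the row-scatter behavior itself is unchanged):

```python
def set_rows(dst, src, idx):
    """Sketch of set_rows with an integer index vector:
    row i of src is written to row idx[i] of dst."""
    assert len(src) == len(idx)
    for i, row in enumerate(src):
        dst[idx[i]] = list(row)
    return dst
```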
2025-09-29 15:18:09 +03:00
Georgi Gerganov df672c6372
ggml : extend ggml_can_fuse to work with non-sequential nodes (llama/16123)
* ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph

* cont : fix wrong bounds check condition

* cont : remove unnecessary overload
2025-09-29 15:18:09 +03:00
Georgi Gerganov 973054a8cd
ggml : add ggml_op_is_empty (llama/16122)
* ggml : add ggml_op_is_empty

* ggml : move to ggml-impl.h
2025-09-29 15:18:09 +03:00
Shin-myoung-serp 9f673df08d
Vulkan: add conv_transpose_2d operation (llama/16022)
* Vulkan: add conv_transpose_2d operation

* Vulkan: fix typo in conv_transpose_2d shader (s0mp, s0L, s1mp, s1L)

* Vulkan: fix incorrect indentation in conv_transpose_2d shader

* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation

* Vulkan: revert the order of the index calculation and bound check in conv_2d shader

* Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation.

* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
2025-09-29 15:18:09 +03:00
Jeff Bolz 14723f25a1
vulkan: add RTE variants of exp shader (llama/16165)
This fixes some failures on Turing where "round to zero" rounds to the max f16
value but the CPU reference value is infinite.
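The failure mode can be illustrated with a simplified model of half-precision overflow handling (only the overflow case is modelled here, not full f16 rounding; `f16_convert` is an illustrative helper, not shader code):

```python
F16_MAX = 65504.0  # largest finite half-precision value

def f16_convert(x, mode="rte"):
    """With round-to-nearest-even (RTE), a value beyond the f16
    range converts to infinity, matching the f32 CPU reference.
    With round-toward-zero (RTZ), it clamps to the largest finite
    f16 value instead, so exp() results that should be inf
    mismatch the reference."""
    if x > F16_MAX:
        return float("inf") if mode == "rte" else F16_MAX
    return x
```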
2025-09-29 15:18:08 +03:00
Ruben Ortlam 95b29fab78
vulkan: vec dot matrix multiplication fix (llama/16151)
* vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching

* add odd m/n + odd k test with batching
2025-09-29 15:18:08 +03:00
lhez 4b7f09ac0b
opencl: fix concat crash on win arm64 with Adreno (llama/15944) 2025-09-29 15:18:08 +03:00
lhez 0a7096f4f3
opencl: initial `q8_0` mv support (llama/15732) 2025-09-29 15:18:08 +03:00
Giuseppe Scrivano eae2be0ca2
vulkan: optimize UMA buffer operations and fix driver hangs (llama/16059)
* vulkan: optimize UMA buffer operations and fix driver hangs

The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.

[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang

* vulkan: implement deferred_memset on UMA

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-09-29 15:18:08 +03:00
Jeff Bolz 9a6c2036a9
vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (llama/16086) 2025-09-29 15:18:08 +03:00
Georgi Gerganov 8d10ded025
ggml : prepare for development of 0.9.2-dev 2025-09-29 15:18:08 +03:00
Georgi Gerganov d89164a08d
ggml : bump version to 0.9.1 2025-09-29 15:18:05 +03:00
Georgi Gerganov 36778bd8b8
talk-llama : sync llama.cpp 2025-09-20 13:58:28 +03:00
Georgi Gerganov 66ad624d5b
sync : ggml 2025-09-20 13:46:41 +03:00
Ruben Ortlam 76d0934287
vulkan: use vec dot for matrix matrix multiplications (llama/16056)
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions

* use fma instead of dot to fix Nvidia and Apple performance issues
2025-09-20 13:46:39 +03:00
Xuan-Son Nguyen 2ad00d5586
ggml : refactor forward_dup for cpu backend (llama/16062)
* ggml : refactor forward_dup for cpu backend

* clean up a bit

* add quant/dequant perf test
2025-09-20 13:46:39 +03:00
Adrien Gallouët 4d8cd07825
ggml-amx : fix ggml_amx_init() on generic Linux (llama/16049)
Generalize Linux check to `__linux__` to support non-glibc systems (like musl).
Also, return `false` on unknown/untested OS.

Without this commit, the code compiles (with warnings) but fails:

    register_backend: registered backend CPU (1 devices)
    register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
    build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
    system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
    ....
    print_info: n_ctx_orig_yarn  = 262144
    print_info: rope_finetuned   = unknown
    print_info: model type       = 4B
    Illegal instruction (core dumped)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-20 13:46:39 +03:00
Adrien Gallouët 4575f96873
cmake : fix static linking for OpenMP on Unix-like systems (llama/16031)
When compiling with GGML_STATIC=ON, the build process would produce a
binary that was still dynamically linked to OpenMP. This defeats the
purpose of a static build:

    $ cmake -B build \
            -DBUILD_SHARED_LIBS=OFF \
            -DLLAMA_CURL=OFF \
            -DGGML_CCACHE=OFF \
            -DGGML_NATIVE=OFF \
            -DGGML_STATIC=ON

    $ ldd llama-server
            linux-vdso.so.1 (0x0000e1a434e3b000)
            libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
            libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
            libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
            libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
            libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
            /lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)

This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES`
to prioritize `.a` files, forcing CMake to link the static version of
the library.
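The mechanism can be sketched as a CMake fragment (illustrative only; the actual change lives in ggml's build scripts, and restricting the suffix list affects every subsequent `find_library` call):

```cmake
# Prefer static archives when resolving libraries, so that
# find_package(OpenMP) links libgomp.a instead of libgomp.so.
if (GGML_STATIC)
    set(CMAKE_FIND_LIBRARY_SUFFIXES ".a")
endif()
find_package(OpenMP REQUIRED)
```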

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-20 13:46:39 +03:00
Shawn Gu f4a225cea6
opencl: optimize mxfp4 kernels (llama/16037)
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2025-09-20 13:46:39 +03:00
Jeff Bolz 7fcb7e83ec
rename optimize_graph to graph_optimize (llama/16082) 2025-09-20 13:46:39 +03:00
Bowen Han fce6354e0f
CUDA: Optimize PAD_REFLECT_1D (llama/15957)
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D

* use fast_div to improve performance

* Apply suggestion from JohannesGaessler

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Apply suggestion from JohannesGaessler

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* optimize

* use a concise expression to further speedup the cuda kernel

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:46:38 +03:00
Johannes Gäßler 05bdfd4380
CUDA: fix compilation on CC 6.0 (llama/16091) 2025-09-20 13:46:38 +03:00
Georgi Gerganov 960aaa9904
metal : use function constants for mul_mv_ext kernels (llama/16074)
* metal : use function constants for mul_mv_ext kernels

ggml-ci

* metal : remove NW template argument

ggml-ci

* metal : adjust constants

ggml-ci
2025-09-20 13:46:38 +03:00
Sigbjørn Skjæret 225d7c1d5a
cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (llama/16060) 2025-09-20 13:46:38 +03:00
Georgi Gerganov d37f590a77
metal : improve F32, F16 and BF16 mat-vec multiplication (llama/16057)
* metal : improve F32, F16 and BF16 mat-vec multiplication

ggml-ci

* metal : make the NSG a function constant in mul_mv kernels

ggml-ci
2025-09-20 13:46:38 +03:00
Jhen-Jie Hong 32b6d9c134
metal : avoid call free for non-owned buffer (llama/16067) 2025-09-20 13:46:38 +03:00
Georgi Gerganov 1f24b1df4d
metal : handle nil cv during pipeline creation (llama/16065)
ggml-ci
2025-09-20 13:46:38 +03:00
Chenguang Li c46adc0817
CANN: Remove print (llama/16044)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:46:38 +03:00
Reese Levine 1361f679cc
GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (llama/16018)
* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* some f32 tests passing

* Disable set_rows until it's implemented

* f32 add all tests passing

* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Add templated addition, clean up code

* Get addition and multiplication working

* Implement rms_norm

* Add get_rows implementation

* Add new get_rows files

* Refactor use of wg size entry

* Fix compilation

* Try manually unrolled q4_0 quant

* Revert "Try manually unrolled q4_0 quant"

This reverts commit 77f8b96515f7e640ae4b0e44f066321fbc4a6166.

* Move to constant max wg size

* Check for tensor size in supports_op

* Vectorize f32 and change default workgroup size

* Move f32 get_rows from < 4 to % 4 != 0

* fix linter errors

* Add in-place tests

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
2025-09-20 13:46:37 +03:00
Georgi Gerganov eb2c01f92e
metal : refactor + optimize v2 (llama/15995) 2025-09-20 13:46:10 +03:00
Georgi Gerganov 6458bac4c1
sync : ggml 2025-09-20 13:45:32 +03:00
Johannes Gäßler d452f0cf8c
CUDA: fix FA occupancy, optimize tile kernel (llama/15982) 2025-09-20 13:45:30 +03:00
Eve e96b285011
vulkan: automatically remove unsupported devices (llama/15976)
* remove unsupported vulkan devices

* make this happen during selection instead

* pass by reference
2025-09-20 13:45:30 +03:00
Chenguang Li e32c3b0fd3
CANN: Optimize ggml_cann_set_device (llama/15935)
* CANN: Fix ggml_cann_set_device to avoid redundant device switches

- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.

* CANN: add device default id
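The redundant-switch check described above follows a common pattern, sketched here in Python (`backend_set_device` stands in for the runtime call, aclrtSetDevice in CANN's case; the cached value would be thread-local in a real implementation):

```python
_current_device = None  # thread-local in a real implementation

def set_device(device_id, backend_set_device):
    """Only call into the runtime when the device actually
    changes; returns whether a switch was performed."""
    global _current_device
    if _current_device == device_id:
        return False  # already active, skip the context switch
    backend_set_device(device_id)
    _current_device = device_id
    return True
```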
2025-09-20 13:45:30 +03:00