Commit Graph

4394 Commits

Author SHA1 Message Date
Jeff Bolz b1f65a4a7e vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (llama/18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
2026-01-14 09:11:59 +02:00
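As a loose illustration of the routing scheme the commit above targets, here is a plain-Python sketch of sigmoid-based top-k expert selection with a per-expert bias (the `exp_probs_b` tensor) and a final scale. This is a hypothetical host-side reference, assuming DeepSeek-style routing where the bias influences expert *selection* but not the final gating weights; it is not the actual Vulkan shader.

```python
import math

def topk_moe_sigmoid(logits, bias, k, scale=1.0):
    """Illustrative sketch: sigmoid gating + per-expert selection bias.

    logits -- router logits, one per expert
    bias   -- exp_probs_b-style selection bias, one per expert
    k      -- number of experts to keep
    scale  -- final scaling factor (the trailing GGML_OP_SCALE)
    """
    # gating scores: sigmoid of the router logits
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # the bias shifts which experts are *selected* ...
    biased = [p + b for p, b in zip(probs, bias)]
    topk = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:k]
    # ... but the gating weights are normalized from the unbiased probs
    denom = sum(probs[i] for i in topk)
    weights = {i: scale * probs[i] / denom for i in topk}
    return topk, weights
```

With a zero bias this reduces to plain sigmoid top-k; a large bias on one expert can pull it into the top-k even when its raw logit is low, which is why fusing the bias add into the top-k kernel matters.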
Peter A. a96310871a
examples : fix executable example targets (#3600)
* cmake:
    - added `whisper-` prefix to unprefixed targets: `quantize`, `lsp`,
      `vad-speech-segments`
    - added `install(TARGETS ${TARGET} RUNTIME)` where it was missing

Signed-off-by: Peter A. <ink.splatters@pm.me>

* .github/workflows/build.yml: quantize -> whisper-quantize

Signed-off-by: Peter A. <ink.splatters@pm.me>

---------

Signed-off-by: Peter A. <ink.splatters@pm.me>
2026-01-13 08:08:18 +01:00
KITAITI Makoto 679bdb53db
ruby : fix segmentation fault (#3591)
* Mark long-lived variable

* Fix test for Whisper::Token#deconstruct_keys(nil)

* Don't use long-lived variable

* Fix indentation
2026-01-05 17:41:22 +09:00
Georgi Gerganov e9898ddfb9 sync : ggml 2025-12-31 18:27:20 +02:00
Georgi Gerganov ce03f8e759 ggml : bump version to 0.9.5 (ggml/1410) 2025-12-31 18:27:20 +02:00
Georgi Gerganov 7359ac94d5 talk-llama : sync llama.cpp 2025-12-31 17:52:09 +02:00
Georgi Gerganov 54fa8216ea sync : ggml 2025-12-31 17:52:09 +02:00
gatbontonpc 8189f2cb65 metal : add count_equal op (llama/18314)
* add count equal for metal

* remove trailing whitespace

* updated doc ops table

* changed shmem to i32

* added multi tg and templating

* removed BLAS support from Metal docs

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add memset to set dst to 0

* metal : cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-31 17:52:09 +02:00
Johannes Gäßler 2d250f8049 CUDA: fix KQ max calculation (llama/18487) 2025-12-31 17:52:09 +02:00
Georgi Gerganov 5deaf8f2a3 metal : remove BF16 x F16 kernels (llama/18456) 2025-12-31 17:52:09 +02:00
Aman Gupta 467933199a sycl: add newline at the end of CMakeLists.txt (llama/18503) 2025-12-31 17:52:09 +02:00
Rahul Sathe a3635494da Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (llama/18345)
* cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x

* [AI] sycl: auto-detect and skip incompatible IntelSYCL package

Automatically detect compiler versions with incompatible IntelSYCL
CMake configuration files and fall back to manual SYCL flags instead
of requiring users to set options manually.

Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
has SYCL_FEATURE_TEST_EXTRACT invocation errors.

* refactor: improve SYCL provider handling and error messages in CMake configuration

* refactor: enhance SYCL provider validation and error handling in CMake configuration

* ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes
2025-12-31 17:52:09 +02:00
Charles Xu c9955367d4 kleidiai: add and integrate SVE 256-bit vector-length kernel (llama/18458)
* kleidiai: add and integrate SVE 256-bit vector-length kernel

* updated for review comments
2025-12-31 17:52:09 +02:00
Aman Gupta 6d4aa96bfa CUDA: add log line when mxfp4 acceleration is used (llama/18483)
* CUDA: add log line when mxfp4 acceleration is used

* add in backend_get_features
2025-12-31 17:52:09 +02:00
Johannes Gäßler 5765c5b04e CUDA: fix replacement of bad archs in CMake (llama/18457) 2025-12-31 17:52:09 +02:00
Johannes Gäßler d6cb2407b7 CUDA: Blackwell features for non-native builds (llama/18436) 2025-12-31 17:52:09 +02:00
Aman Gupta e49e88b2d8 cuda: fix race condition in cumsum (llama/18448)
* ggml-cuda: fix race condition in cumsum

* remove unnecessary sync_threads
2025-12-31 17:52:09 +02:00
uvos 20f5729921 HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (llama/18202) 2025-12-31 17:52:09 +02:00
Aman Gupta b8d209f55c Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (llama/18426) 2025-12-31 17:52:09 +02:00
o7si 54fe9a645d rpc: fix segfault on invalid endpoint format (llama/18387)
* rpc: fix segfault on invalid endpoint format

* rpc: add error log for failed endpoint connection
2025-12-31 17:52:09 +02:00
Boian Berberov b3788ef729 cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (llama/18186)
* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`

Resolves: #17966
2025-12-31 17:52:09 +02:00
QDelta 31fc2c37c8 ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (llama/18413) 2025-12-31 17:52:09 +02:00
lhez a800a3acd1 opencl: allow resizing transpose buffers (llama/18384)
* opencl: allow resizing transpose buffers instead of using fixed sizes

* opencl: remove commented code
2025-12-31 17:52:09 +02:00
Aman Gupta 29f8155445 ggml-cuda: Use same regex for GGML_NATIVE=OFF (llama/18407) 2025-12-31 17:52:09 +02:00
Jeff Bolz 015b618d96 vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (llama/18352)
Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
2025-12-31 17:52:09 +02:00
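The preprocess described above can be sketched on the host side: count how many rows route to each expert, so dispatched per-expert workgroups with zero rows can exit immediately. The function names here are hypothetical; the real pass runs in a Vulkan compute shader, not Python.

```python
def count_expert_usage(expert_ids, n_experts):
    """Count how many rows are routed to each expert.

    expert_ids -- per-token lists of selected expert indices
    n_experts  -- total number of experts in the layer
    """
    counts = [0] * n_experts
    for ids in expert_ids:      # ids = experts chosen for one token
        for e in ids:
            counts[e] += 1
    return counts

def experts_with_work(counts):
    # only experts with at least one routed row need a real workgroup;
    # the rest can be discarded right after reading their count
    return [e for e, c in enumerate(counts) if c > 0]
```

The point of the optimization is that this cheap counting pass lets the expensive mul_mat_id workgroups bail out early instead of scanning the row-id tensor themselves.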
Jeff Bolz e37c8ed94e vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (llama/18349)
* vulkan: Use BK=32 for coopmat2 mul_mat_id

* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.

Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
2025-12-31 17:52:09 +02:00
Jeff Bolz 331c6ccd31 vulkan: Use BK=32 for coopmat2 mul_mat_id (llama/18332) 2025-12-31 17:52:09 +02:00
Eve 35cb4abb67 vulkan: small dequantization improvements (llama/18380)
* iq4_xs

* quants
2025-12-31 17:52:09 +02:00
Jeff Bolz 181e36f194 vulkan: Support UPSCALE w/antialias (llama/18327) 2025-12-31 17:52:09 +02:00
Jeff Bolz 67473fef57 vulkan: handle rope with large number of rows (llama/18306) 2025-12-31 17:52:09 +02:00
0Marble 33f75a88ac CANN: implement the SSM_CONV operator (llama/17737)
* CANN: implement SSM_CONV operator

Co-authored-by: Aleksei Lobanov <zeromarblectm@gmail.com>
Co-authored-by: Sujin Kang <waterjin326@gmail.com>

* CANN: remove custom error limit for SSM_CONV

* CANN: merge SSM_CONV tensor shape/strides into one line

---------

Co-authored-by: Sujin Kang <waterjin326@gmail.com>
2025-12-31 17:52:09 +02:00
Aman Gupta 51778354ce ggml-cuda: fix regex for arch list (llama/18371)
* ggml-cuda: fix regex for arch list

* make regex exact
2025-12-31 17:52:09 +02:00
Aman Gupta 8e02f0919d cuda: optimize cumsum cub path (llama/18362)
* cuda: optimize cumsum cub path

* remove heavy perf test
2025-12-31 17:52:09 +02:00
Aman Gupta ea07c5d3b7 ggml-cuda: fix blackwell native builds (llama/18361)
* ggml-cuda: fix blackwell native builds

Replace 12x in native architectures by 12xa

* replace for GGML_NATIVE=OFF too

* only replace for native

* remove 120f-virtual for default compilation

---------

Co-authored-by: Aman Gupta <aman>
2025-12-31 17:52:09 +02:00
Penglin Cai 5f0488f012 CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (llama/17934)
* CONV_TRANSPOSE_1D kernel_size>255

* remove condition check

* fix the bug of type conversion

* removing trailing whitespaces

* fix: return true in the switch case
2025-12-31 17:52:09 +02:00
Aadeshveer Singh db75fff539 ggml : optimize cuda cumsum fallback kernel (llama/18343) 2025-12-31 17:52:09 +02:00
Aman Gupta 41e578ec8a CUDA: experimental native mxfp4 support for blackwell (llama/17906)
* CUDA: experimental native mxfp4 support for blackwell

* optimize load_tiles

* optimize quantize_mxfp4

* cleanup

* first pass review: formatting

* use interleaved layout for mma

* mmq: add assert for size

* use __nv_fp4x4_e2m1

* use iter_k as 512, cleanup

* Use 1200 as blackwell instead of 1000

* address review comments

* mmq: fix stride

* quantize.cu: use reference impl of e8m0 scale

* address review comments

* add 120f-virtual + minor fixes

---------

Co-authored-by: Aman Gupta <aman>
2025-12-31 17:52:09 +02:00
Jeff Bolz f863735caa vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (llama/18302) 2025-12-31 17:52:09 +02:00
Wang Weixuan bab2c02da5 CANN : refactor ACL graph cache (llama/17752)
Move the graph property checking code into methods of LRU cache.

Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>
2025-12-31 17:52:09 +02:00
Ruben Ortlam 1356600679 vulkan: use fewer FA rows for small cache runs (llama/18280) 2025-12-31 17:52:09 +02:00
TianHao324 ec9239d3b7 CANN: Uses yarn_ramp cache in ROPE (llama/17725) 2025-12-31 17:52:09 +02:00
Chris Rohlf 9bdd4658f4 rpc : add check for rpc buffer type (llama/18242) 2025-12-31 17:52:09 +02:00
nullname e4c89612cd ggml-hexagon: create generalized functions for cpu side op (llama/17500)
* refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

* refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

* refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

* add comment

* refactor: remove redundant buffer checks in hexagon supported operations

* wip

* add missing include to fix weak symbol warning

* add ggml_hexagon_op_generic

* refactor: simplify tensor operation initialization and buffer management in hexagon implementation

* refactor: streamline hexagon operation initialization and buffer management

* refactor: update function signatures and streamline request handling in hexagon operations

* wip

* ggml-hexagon: clean up code formatting and improve unary operation handling

* wip

* rename

* fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

* hexagon: fix merge conflicts

* hexagon: minor cleanup for buffer support checks

* hexagon: factor out op_desc and the overall op logging

* hexagon: further simplify and cleanup op dispatch logic

* snapdragon: update adb scripts to use llama-cli and llama-completion

* fix pipeline failure

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2025-12-31 17:52:09 +02:00
Shouyu 2f33395197 ggml-hexagon: gelu optimization (llama/18151)
* feat: working gelu with src0 put on vtcm

* feat: gelu ping-pong for both in and out

* fix: fix compile error

* break: distinguish dma ddr->vtcm and vtcm->ddr operation

* fix: fix dma queue size

* break: update dma api to either pop src or dst ptr

* fix: fix activation vtcm allocation issue for src1 when swapped

* refactor: ping-pong gelu logic to avoid unnecessary if else

* dma: improved queue interface and prefetch handling

* gelu: fix N+2 block prefetch

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2025-12-31 17:52:09 +02:00
Taimur Ahmad 5b0c1c1580 llamafile: add rvv support for sgemm kernels (llama/18199)
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
2025-12-31 17:52:09 +02:00
lhez f2fe1e5baf opencl: unpack q4_0 for adreno in get_tensor (llama/18278) 2025-12-31 17:52:09 +02:00
Jeff Bolz dbbe6c11b5 vulkan: Extend rope fusions to allow mrope (llama/18264)
Extend the test-backend-ops tests as well.
2025-12-31 17:52:09 +02:00
Jeff Bolz 98e59a43d1 vulkan: Implement set_tensor_async and the event interfaces (llama/18047)
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host visible vidmem I think the cost of
allocating/mapping vidmem moves and becomes more expensive, and I don't see a
benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
significant improvement in model loading time.
2025-12-31 17:52:09 +02:00
Johannes Gäßler b68b12f2d5 llama: fix RPC for -fit on (llama/18233) 2025-12-31 17:52:09 +02:00
Jeff Bolz b893e0813a vulkan: fix im2col overflowing maxworkgroupcount (llama/18180) 2025-12-31 17:52:09 +02:00