whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Aman Gupta	575d894603	ggml-cuda: refactor cuda graph usage (llama/18637) * ggml-cuda: refactor cuda graph usage * use is_enabled() instead of enabled	2026-01-14 09:11:59 +02:00
Beinsezii	ed674cfc10	mmq.cu: tune mmq/rocblas switching for RDNA (llama/18537) * Patch perf regression for mmq kernels in ROCm recover performance regression for https://github.com/ggml-org/llama.cpp/issues/17917 * add n_experts branch like the cdna path * mmq.cu: tune mmq/wmma switching for RDNA * mmq.cu: move amd wmma mmq/wmma switching behind IS_RDNA3 * Update ggml/src/ggml-cuda/mmq.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Jiacheng (Jason) Chen <76919340+jiachengjason@users.noreply.github.com> Co-authored-by: jiachengjason <jasonchen.jiacheng@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-01-14 09:11:59 +02:00
Adrien Gallouët	5520f27363	ggml : fix avx512bf16 build (llama/18623) - include `immintrin.h` when required - remove unused m512bh Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-14 09:11:59 +02:00
Raul Torres	9a1a6685ba	CANN: Make `valid_values` variable `static const` (llama/18627)	2026-01-14 09:11:59 +02:00
nwyin	e563e239a7	ggml webgpu: add CEIL operation support (llama/18605) * ggml-webgpu: add CEIL operation support Add support for the CEIL unary operation in the WebGPU backend: - Add CEIL_FUNC shader template in unary_op.wgsl - Add 4 shader variants (f32, f16, inplace versions) - Initialize CEIL pipelines in ggml-webgpu.cpp - Register CEIL in supports_op function * docs: update WebGPU ops support for CEIL	2026-01-14 09:11:59 +02:00
Johannes Gäßler	9956333361	CUDA: fix FA FP16 accumulator overflow for Granite (llama/18614)	2026-01-14 09:11:59 +02:00
Aman Gupta	804f545454	ggml-cuda: check for srcs outside the cgraph (llama/18583) * ggml-cuda: check for srcs outside the cgraph * review: use leafs instead	2026-01-14 09:11:59 +02:00
Jeff Bolz	52ba45e2b8	vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (llama/18582)	2026-01-14 09:11:59 +02:00
Jeff Bolz	0a99b4c377	vulkan: handle quantize_q8_1 overflowing the max workgroup count (llama/18515) * vulkan: handle quantize_q8_1 overflowing the max workgroup count * vulkan: Fix small tile size matmul on lavapipe * fix mul_mat_id failures	2026-01-14 09:11:59 +02:00
Chenguang Li	1d657effe3	CANN: add operator fusion support for ADD + RMS_NORM (llama/17512) This commit implements operator fusion for ADD + RMS_NORM operations in the CANN backend to reduce memory access overhead and improve performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION environment variable (default: false). Changes: - Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm - Add ggml_cann_can_fuse() to check fusion eligibility - Integrate fusion logic into computation graph evaluation - Add test cases for ADD + RMS_NORM fusion - Update documentation with new environment variable The fusion combines ADD and RMS_NORM into a single kernel call, which is more efficient than executing them separately.	2026-01-14 09:11:59 +02:00
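To illustrate what the ADD + RMS_NORM fusion in the commit above computes, here is a minimal Python sketch. It is not the CANN implementation: the function names and the `eps` default are assumptions made for illustration, and the real backend calls the ACLNN AddRmsNorm kernel. The point is only that the fused form gives the same result without a separate pass that re-reads the intermediate sum.

```python
import math

def add_then_rms_norm(x, residual, eps=1e-6):
    """Unfused reference: materialize the sum first, then normalize it."""
    summed = [a + b for a, b in zip(x, residual)]
    mean_sq = sum(v * v for v in summed) / len(summed)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in summed]

def add_rms_norm_fused(x, residual, eps=1e-6):
    """Fused sketch: accumulate the sum of squares while adding,
    so the summed values are not re-read in a separate reduction pass."""
    n = len(x)
    out = [0.0] * n
    sum_sq = 0.0
    for i in range(n):
        s = x[i] + residual[i]
        out[i] = s
        sum_sq += s * s
    scale = 1.0 / math.sqrt(sum_sq / n + eps)
    for i in range(n):
        out[i] *= scale
    return out
```

Both functions should agree up to floating-point rounding; on a real device the benefit comes from launching one kernel and avoiding the extra memory traffic, which this host-side sketch cannot show.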
Daniel Bevenius	4d6a3fb00d	sampling : add support for backend sampling (llama/17004) * sampling : add support for backend sampling This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend. For example, the backend sampler chain might select/sample a token directly in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers. Currently the backend sampling works in a similar manner to how pooling works: it is a function that is called by build_graph and the sampler operations become part of the model's computation graph. * llama-cli : add backend sampler configuration * server : add backend sampling options/configuration * webui : add backend sampling options * ggml : add initial cumsum implementation for CUDA * sampling : enable all backend sampler tests This commit enables all existing backend sampler tests in the test-backend-sampler. Previously, some tests were disabled because there were missing ggml operation implementations. * graph : do not include llama-model.h * sampling : always expose sampled_ids This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that it allows both common/sampling.cpp and src/llama-sampling.cpp to simplify their logic. 
Not all backend samplers that process logits need to set the sampled_tokens_id as they may not change the order of the logits; for example, the temperature sampler only scales the logits but does not change their order. Similarly, the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful. * sampling : ensure at most one output token per seq This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence. * CUDA: Optimize argsort for gpu-based token sampling Argsort is used for top-k currently. We optimize argsort in two ways: 1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs 2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths; it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics). https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview Some perf numbers for a RTX PRO 6000: On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf` Before: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run ``` After: ``` ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run ```	2026-01-14 09:11:59 +02:00
Aman Gupta	f0bf5b8cc3	CUDA: disable cuda graph when using n-cpu-moe (llama/18593) * CUDA: disable cuda graph when using n-cpu-moe * call ggml_cuda_set_device	2026-01-14 09:11:59 +02:00
Aman Gupta	88f5765c82	ggml-cuda: remove unused params in ggml_cuda_graph (llama/18579)	2026-01-14 09:11:59 +02:00
Aman Gupta	1e725546b0	ggml-cuda: fixes for concurrent streams (llama/18496)	2026-01-14 09:11:59 +02:00
Johannes Gäßler	60d178cee9	CUDA: only allocate FA tmp buffer if needed (llama/18564)	2026-01-14 09:11:59 +02:00
pl752	304e780e5f	(Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (llama/18559) * CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta) * CUDA: Explicitly casted some of the int alloc counts before multiplication in argsort --------- Co-authored-by: pl752 <maximpl752@gmail.com>	2026-01-14 09:11:59 +02:00
Shouyu	c9e9f083c2	ggml-hexagon: optimize activation function (llama/18393) * refactor: refactor silu * refactor: optimize swiglu * refactor: remove unnecessary if in swiglu * refactor: refactor swiglu_oai * chore: fix formatting issue	2026-01-14 09:11:59 +02:00
Jeff Bolz	9d83865607	vulkan: Optimize GGML_OP_CUMSUM (llama/18417) * vulkan: Optimize GGML_OP_CUMSUM There are two paths: The preexisting one that does a whole row per workgroup in a single shader, and one that splits each row into multiple blocks and does two passes. The first pass computes partials within a block, the second adds the block partials to compute the final result. The multipass shader is used when there are a small number of large rows. In the whole-row shader, handle multiple elements per invocation. * use 2 ELEM_PER_THREAD for AMD/Intel * address feedback	2026-01-14 09:11:59 +02:00
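The two-pass path described in the commit above (per-block partials first, then adding the preceding block totals) can be sketched on the host as follows. This is a minimal Python illustration of the general technique, not the Vulkan shader; the function name and the `block_size` parameter are hypothetical.

```python
def cumsum_two_pass(row, block_size):
    """Inclusive prefix sum of one row, computed block by block."""
    n = len(row)
    out = [0] * n
    block_totals = []

    # Pass 1: every block computes its own inclusive scan independently.
    # On a GPU each block could be handled by a different workgroup.
    for start in range(0, n, block_size):
        running = 0
        for i in range(start, min(start + block_size, n)):
            running += row[i]
            out[i] = running
        block_totals.append(running)

    # Pass 2: add the total of all preceding blocks to each element.
    offset = 0
    for b, start in enumerate(range(0, n, block_size)):
        for i in range(start, min(start + block_size, n)):
            out[i] += offset
        offset += block_totals[b]
    return out
```

Splitting a row this way only pays off when there are few rows and each row is long, which matches the commit's note that the multipass shader is used for a small number of large rows.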
Jeff Bolz	b7ff521e71	vulkan: Implement mmvq for iq1_s/iq1_m (llama/18450)	2026-01-14 09:11:59 +02:00
Georgi Gerganov	b99c911c49	metal : adjust extra size for FA buffer to avoid reallocations (llama/18545)	2026-01-14 09:11:59 +02:00
Chris Rohlf	f328b13d5c	rpc : use unordered_map::reserve and emplace (llama/18513)	2026-01-14 09:11:59 +02:00
MeeMin	fbde389665	cuda : fix copy of large tensors (ggml_nbytes <= INT_MAX assertion) (llama/18433) * ggml-cuda: fixed assertion in ggml_cuda_cpy (llama/18140) * ggml-cuda: changes in data types to int64_t * ggml-cuda: added asserts for CUDA block numbers * ggml-cuda: changed the condition for y and z dimension	2026-01-14 09:11:59 +02:00
Aman Gupta	f22c1ccbe4	ggml-cuda: remove unnecessary prints on ggml_cuda_init (llama/18502)	2026-01-14 09:11:59 +02:00
Jeff Bolz	b1f65a4a7e	vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (llama/18295) * vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron Also handle GGML_OP_SCALE at the end (nemotron, deepseek2). Fewer pipeline variants and spec constants, just use push constants. In test_topk_moe, change exp_probs_b to be 1D, matching real networks. Update test-backend-ops and ggml-backend to allow verifying multiple outputs in a fusion test (topk_moe has two outputs). Previously only the final node was verified. * change test_topk_moe to allow results in arbitrary order * disable sigmoid fusion for moltenvk	2026-01-14 09:11:59 +02:00
Georgi Gerganov	ce03f8e759	ggml : bump version to 0.9.5 (ggml/1410)	2025-12-31 18:27:20 +02:00
gatbontonpc	8189f2cb65	metal : add count_equal op (llama/18314) * add count equal for metal * remove trailing whitespace * updated doc ops table * changed shmem to i32 * added multi tg and templating * removed BLAS support from Metal docs * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add memset to set dst to 0 * metal : cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-31 17:52:09 +02:00
Johannes Gäßler	2d250f8049	CUDA: fix KQ max calculation (llama/18487)	2025-12-31 17:52:09 +02:00
Georgi Gerganov	5deaf8f2a3	metal : remove BF16 x F16 kernels (llama/18456)	2025-12-31 17:52:09 +02:00
Aman Gupta	467933199a	sycl: add newline at the end of CMakeLists.txt (llama/18503)	2025-12-31 17:52:09 +02:00
Rahul Sathe	a3635494da	Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (llama/18345) * cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x * [AI] sycl: auto-detect and skip incompatible IntelSYCL package Automatically detect compiler versions with incompatible IntelSYCL CMake configuration files and fall back to manual SYCL flags instead of requiring users to set options manually. Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake has SYCL_FEATURE_TEST_EXTRACT invocation errors. * refactor: improve SYCL provider handling and error messages in CMake configuration * refactor: enhance SYCL provider validation and error handling in CMake configuration * ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes	2025-12-31 17:52:09 +02:00
Charles Xu	c9955367d4	kleidiai: add and integrate SVE 256-bit vector-length kernel (llama/18458) * kleidiai: add and integrate SVE 256-bit vector-length kernel * updated for review comments	2025-12-31 17:52:09 +02:00
Aman Gupta	6d4aa96bfa	CUDA: add log line when mxfp4 acceleration is used (llama/18483) * CUDA: add log line when mxfp4 acceleration is used * add in backend_get_features	2025-12-31 17:52:09 +02:00
Johannes Gäßler	5765c5b04e	CUDA: fix replacement of bad archs in CMake (llama/18457)	2025-12-31 17:52:09 +02:00
Johannes Gäßler	d6cb2407b7	CUDA: Blackwell features for non-native builds (llama/18436)	2025-12-31 17:52:09 +02:00
Aman Gupta	e49e88b2d8	cuda: fix race condition in cumsum (llama/18448) * ggml-cuda: fix race condition in cumsum * remove unnecessary sync_threads	2025-12-31 17:52:09 +02:00
uvos	20f5729921	HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (llama/18202)	2025-12-31 17:52:09 +02:00
Aman Gupta	b8d209f55c	Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (llama/18426)	2025-12-31 17:52:09 +02:00
o7si	54fe9a645d	rpc: fix segfault on invalid endpoint format (llama/18387) * rpc: fix segfault on invalid endpoint format * rpc: add error log for failed endpoint connection	2025-12-31 17:52:09 +02:00
Boian Berberov	b3788ef729	cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (llama/18186) * minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h` * cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` - `ivybridge` - `piledriver` - `cannonlake` - `cascadelake` - `cooperlake` - `zen4` Resolves: #17966	2025-12-31 17:52:09 +02:00
QDelta	31fc2c37c8	ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (llama/18413)	2025-12-31 17:52:09 +02:00
lhez	a800a3acd1	opencl: allow resizing transpose buffers (llama/18384) * opencl: allow resizing transpose buffers instead of using fixed sizes * opencl: remove commented code	2025-12-31 17:52:09 +02:00
Aman Gupta	29f8155445	ggml-cuda: Use same regex for GGML_NATIVE=OFF (llama/18407)	2025-12-31 17:52:09 +02:00
Jeff Bolz	015b618d96	vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (llama/18352) Run a preprocess to count how many times each expert is used, and use this to quickly discard workgroups that aren't needed.	2025-12-31 17:52:09 +02:00
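The preprocess step in the commit above counts how often each expert is used so that workgroups with nothing to do can exit early. A minimal Python sketch of that idea follows; it assumes one list of selected expert ids per token and a fixed tile height, and all names (`count_expert_usage`, `needed_tiles`, `tile_rows`) are hypothetical rather than taken from the shader.

```python
def count_expert_usage(row_ids, n_experts):
    """Count how many rows are routed to each expert."""
    counts = [0] * n_experts
    for ids in row_ids:          # one list of selected expert ids per token
        for expert in ids:
            counts[expert] += 1
    return counts

def needed_tiles(counts, tile_rows):
    """Number of row tiles that contain real work, per expert (ceil division)."""
    return [(c + tile_rows - 1) // tile_rows for c in counts]

def workgroup_is_needed(expert, tile_index, tiles):
    """A workgroup for (expert, tile_index) can be discarded if no rows map to it."""
    return tile_index < tiles[expert]
```

With the counts available up front, a workgroup can decide from its own indices alone whether to return immediately, instead of scanning the routing table first.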
Jeff Bolz	e37c8ed94e	vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (llama/18349) * vulkan: Use BK=32 for coopmat2 mul_mat_id * vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader Disable robustness, remove the OOB check in decodeFuncB, and initialize the row_ids to zero to avoid OOB access. Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of zero and remove the '& (BN - 1)'. This allows the compiler to common some of the shared memory loads.	2025-12-31 17:52:09 +02:00
Jeff Bolz	331c6ccd31	vulkan: Use BK=32 for coopmat2 mul_mat_id (llama/18332)	2025-12-31 17:52:09 +02:00
Eve	35cb4abb67	vulkan: small dequantization improvements (llama/18380) * iq4_xs * quants	2025-12-31 17:52:09 +02:00
Jeff Bolz	181e36f194	vulkan: Support UPSCALE w/antialias (llama/18327)	2025-12-31 17:52:09 +02:00
Jeff Bolz	67473fef57	vulkan: handle rope with large number of rows (llama/18306)	2025-12-31 17:52:09 +02:00
0Marble	33f75a88ac	CANN: implement the SSM_CONV operator (llama/17737) * CANN: implement SSM_CONV operator Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com> Co-authored-by: Sujin Kang, <waterjin326@gmail.com> * CANN: remove custom error limit for SSM_CONV * CANN: merge SSM_CONV tensor shape/strides into one line --------- Co-authored-by: Sujin Kang, <waterjin326@gmail.com>	2025-12-31 17:52:09 +02:00
Aman Gupta	51778354ce	ggml-cuda: fix regex for arch list (llama/18371) * ggml-cuda: fix regex for arch list * make regex exact	2025-12-31 17:52:09 +02:00
Aman Gupta	8e02f0919d	cuda: optimize cumsum cub path (llama/18362) * cuda: optimize cumsum cub path * remove heavy perf test	2025-12-31 17:52:09 +02:00
Aman Gupta	ea07c5d3b7	ggml-cuda: fix blackwell native builds (llama/18361) * ggml-cuda: fix blackwell native builds Replace 12x in native architectures by 12xa * replace for GGML_NATIVE=OFF too * only replace for native * remove 120f-virtual for default compilation --------- Co-authored-by: Aman Gupta <aman>	2025-12-31 17:52:09 +02:00
Penglin Cai	5f0488f012	CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (llama/17934) * CONV_TRANSPOSE_1D kernel_size>255 * remove condition check * fix the bug of type conversion * removing trailing whitespaces * fix: return true in the switch case	2025-12-31 17:52:09 +02:00
Aadeshveer Singh	db75fff539	ggml : optimize cuda cumsum fallback kernel (llama/18343)	2025-12-31 17:52:09 +02:00
Aman Gupta	41e578ec8a	CUDA: experimental native mxfp4 support for blackwell (llama/17906) * CUDA: experimental native mxfp4 support for blackwell * optimize load_tiles * optimize quantize_mxfp4 * cleanup * first pass review: formatting * use interleaved layout for mma * mmq: add assert for size * use __nv_fp4x4_e2m1 * use iter_k as 512, cleanup * Use 1200 as blackwell instead of 1000 * address review comments * mmq: fix stride * quantize.cu: use reference impl of e8m0 scale * address review comments * add 120f-virtual + minor fixes --------- Co-authored-by: Aman Gupta <aman>	2025-12-31 17:52:09 +02:00
Jeff Bolz	f863735caa	vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (llama/18302)	2025-12-31 17:52:09 +02:00
Wang Weixuan	bab2c02da5	CANN : refactor ACL graph cache (llama/17752) Move the graph property checking code into methods of LRU cache. Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>	2025-12-31 17:52:09 +02:00
Ruben Ortlam	1356600679	vulkan: use fewer FA rows for small cache runs (llama/18280)	2025-12-31 17:52:09 +02:00
TianHao324	ec9239d3b7	CANN: Uses yarn_ramp cache in ROPE (llama/17725)	2025-12-31 17:52:09 +02:00
Chris Rohlf	9bdd4658f4	rpc : add check for rpc buffer type (llama/18242)	2025-12-31 17:52:09 +02:00
nullname	e4c89612cd	ggml-hexagon: create generalized functions for cpu side op (llama/17500) * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility * refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility * refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity * add comment * refactor: remove redundant buffer checks in hexagon supported operations * wip * add missing include to fix weak symbol warning * add ggml_hexagon_op_generic * refactor: simplify tensor operation initialization and buffer management in hexagon implementation * refactor: streamline hexagon operation initialization and buffer management * refactor: update function signatures and streamline request handling in hexagon operations * wip * ggml-hexagon: clean up code formatting and improve unary operation handling * wip * rename * fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity refactor: remove redundant buffer checks in hexagon supported operations add missing include to fix weak symbol warning add ggml_hexagon_op_generic refactor: simplify tensor operation initialization and buffer management in hexagon implementation refactor: streamline hexagon operation initialization and buffer management refactor: update function signatures and streamline request handling in hexagon operations ggml-hexagon: clean up code formatting and improve unary operation handling fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations # Conflicts: # ggml/src/ggml-hexagon/ggml-hexagon.cpp * hexagon: fix merge 
conflicts * hexagon: minor cleanup for buffer support checks * hexagon: factor out op_desc and the overall op logging * hexagon: further simplify and cleanup op dispatch logic * snapdragon: update adb scripts to use llama-cli and llama-completion * fix pipeline failure --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2025-12-31 17:52:09 +02:00
Shouyu	2f33395197	ggml-hexagon: gelu optimization (llama/18151) * feat: working gelu with src0 put on vtcm * feat: gelu ping-pong for both in and out * fix: fix compile error * break: distinguish dma ddr->vtcm and vtcm->ddr operation * fix: fix dma queue size * break: update dma api to either pop src or dst ptr * fix: fix activation vtcm allocation issue for src1 when swapped * refactor: ping-pong gelu logic to avoid unnecessary if else * dma: improved queue interface and prefetch handling * gelu: fix N+2 block prefetch --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2025-12-31 17:52:09 +02:00
Taimur Ahmad	5b0c1c1580	llamafile: add rvv support for sgemm kernels (llama/18199) Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>	2025-12-31 17:52:09 +02:00
lhez	f2fe1e5baf	opencl: unpack q4_0 for adreno in get_tensor (llama/18278)	2025-12-31 17:52:09 +02:00
Jeff Bolz	dbbe6c11b5	vulkan: Extend rope fusions to allow mrope (llama/18264) Extend the test-backend-ops tests as well.	2025-12-31 17:52:09 +02:00
Jeff Bolz	98e59a43d1	vulkan: Implement set_tensor_async and the event interfaces (llama/18047) The goal is to enable the async loading code paths in llama_model_loader::load_all_data, originally from #7896. This works and the loads themselves are faster, but with host visible vidmem I think the cost of allocating/mapping vidmem moves and becomes more expensive, and I don't see a benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a significant improvement in model loading time.	2025-12-31 17:52:09 +02:00
Johannes Gäßler	b68b12f2d5	llama: fix RPC for -fit on (llama/18233)	2025-12-31 17:52:09 +02:00
Jeff Bolz	b893e0813a	vulkan: fix im2col overflowing maxworkgroupcount (llama/18180)	2025-12-31 17:52:09 +02:00
Jeff Bolz	f407c5e562	vulkan/cuda: fix topk_moe with exp_probs_b (llama/18071) I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn and added coverage for exp_probs_b and some other missing combinations. This exposed a bug in both CUDA and Vulkan backends where they were assuming the input to argsort and the input to get_rows are the same. I'd like to optimize this graph in another change, but for now just get it functional. CUDA also had a bug where it got n_experts from the wrong place, leading to GGML_ASSERT failures in some of the new tests.	2025-12-31 17:52:09 +02:00
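The bug described in the commit above came from assuming that argsort and get_rows read the same tensor. A minimal Python sketch of why they can differ is shown below. It assumes that expert selection ranks the bias-adjusted scores (`probs + exp_probs_b`) while the returned weights are gathered from the unbiased `probs`; that split is an assumption made for illustration, and the function name `topk_moe` and its signature are hypothetical, not the ggml API.

```python
def topk_moe(probs, exp_probs_b, k):
    """Pick k experts by biased score, but return their unbiased weights."""
    # Input to the argsort: probabilities with the per-expert bias added.
    selection = [p + b for p, b in zip(probs, exp_probs_b)]
    order = sorted(range(len(selection)), key=lambda i: selection[i], reverse=True)
    selected = order[:k]
    # Input to the gather (get_rows): the original, unbiased probabilities.
    weights = [probs[i] for i in selected]
    return selected, weights
```

For example, with `probs = [0.5, 0.3, 0.2]` and `exp_probs_b = [0.0, 0.0, 0.5]`, the bias promotes expert 2 to the top of the ranking, yet its returned weight is still 0.2. A fused kernel that gathers from the argsort input would return 0.7 instead.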
Jeff Bolz	ad6ee3865d	vulkan: support GGML_UNARY_OP_XIELU (llama/18062)	2025-12-31 17:52:09 +02:00
Jeff Bolz	3cd141f1a9	vulkan: in graph_optimize, try to group ADD operations (llama/18060) I saw the adds not staying together in the new nemotron 3 nano model.	2025-12-31 17:52:09 +02:00
lovedheart	449fc7c024	Vulkan: some improvement on mul_mat_iq2_xs (llama/18031) * Some improvement on mul_mat_iq2_xs Refactor calculations for db values and grid data to optimize performance and reduce redundancy. * Fix trailing whitespace	2025-12-31 17:52:09 +02:00
Aadeshveer Singh	0983985f06	Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context (llama/18212)	2025-12-31 17:52:09 +02:00
Alfred	17a4cb15b8	ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (llama/17977) * feat: implement real Q8_0 * feat: adding cmake option for configuring FP32 quantize group size * typo: set() shall be used --------- Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>	2025-12-31 17:52:09 +02:00
Jeff Bolz	195d8d0c65	vulkan: Add perf logger mode with concurrency (llama/17944) This implements a variation of the perf logger where rather than timing each operation individually with effectively a barrier in between, we put the timing boundaries where we already synchronize and time the groups of work that normally overlap. This can be useful to help understand whether individual operations need to be optimized, or if the group is already running efficiently. GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when GGML_VK_PERF_LOGGER is also set). GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.	2025-12-31 17:52:09 +02:00
Xuan-Son Nguyen	fea481f412	model : add ASR support for LFM2-Audio-1.5B (conformer) (llama/18106) * ASR with LFM2-Audio-1.5B * Set rope_theta * Fix comment * Remove rope_theta setting * Address PR feedback * rename functions to conformer * remove some redundant ggml_cont * fix missing tensor * add prefix "a." for conv tensors * remove redundant reshape * clean up * add test model --------- Co-authored-by: Tarek Dakhran <tarek@liquid.ai>	2025-12-31 17:52:09 +02:00
Taimur Ahmad	956fac433b	ggml-cpu: extend support for RVV floating-point kernels (llama/17318) * cmake: add BF16 RVV flag for ggml-cpu * ggml-cpu: add floating-point conversion kernels * ggml: add floating-point kernels Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> * ggml-cpu: fix lmul in vec_dot_bf16 * ggml-cpu: change redsum to lmul 4, fix leftover --------- Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>	2025-12-31 17:52:09 +02:00
yulo	325a9b739c	remove i_major_dual (llama/18157) Co-authored-by: zhang hui <you@example.com>	2025-12-31 17:52:09 +02:00
Shouyu	c3a16089e3	ggml-hexagon: swiglu_oai operation (llama/18114) * snapshot: debug ggml-hexagon swiglu-oai * fix: fix hvx_min_scalar_f32 * feat: working swiglu-oai * chore: fix formatting issue	2025-12-31 17:52:09 +02:00
Shouyu	c7ccedb5ba	ggml-hexagon: gelu operation (llama/17921) * feat: initial support for gelu using sigmoid approximation * snapshot: faster gelu using polynomial approximation * test: disable l2-block prefetch in polynomial approximation * Revert "test: disable l2-block prefetch in polynomial approximation" This reverts commit 72339994d45b2bed887e79994403c378d90b62b5. * Revert "snapshot: faster gelu using polynomial approximation" This reverts commit 2a787a61d11f9e63e5943a2e6d134b2f0c402ace. * debug: temporarily disable unnecessary log message for debug purpose * Feat: optimized unaligned sigmoid_f32 * Feat: larger l2prefetch block * feat: apply unaligned-load optimization on mul and mul_scalar * Revert "debug: temporarily disable unnecessary log message for debug purpose" This reverts commit 84f2f23aa9f17e2fa826db969cd825d0ab192995. * refactor: cleanup commented unused code * chore: reformat code with clang-formatter to pass cli test * Revert "chore: reformat code with clang-formatter to pass cli test" This reverts commit 952877ec24732b12010c7fa7ed3fc8de4b74e718. * fix: fix loop overflow * chore: fix formatting ci error	2025-12-31 17:52:09 +02:00
Alberto Cabrera Pérez	1f72f00542	ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) (llama/18096) * wip: skeleton for q8_0 repack * q8_0 repack GEMV implementations * GEMM implementations * Formatting * Fixed format consistency of repack gemm and gemv declarations * gemv and gemm generic location consistent with declarations * Removed non-correct unused variables statements * Cleanup, consistent style * Missing generic fallbacks for x86 and powerpc	2025-12-31 17:52:09 +02:00
yulo	9118c05dc4	HIP: Refactor mma for RDNA and CDNA (llama/17990) * mma.cuh for rdna4 * mma for rdna3 * mmq for rdna4 * mmq for rdna3 * align i-major and j-major * cdna * fix cuda error * add missing tile of mfma * fix j-major wrong ne on CDNA * fix grammar and empty spaces --------- Co-authored-by: zhang hui <you@example.com>	2025-12-31 17:52:09 +02:00
Naco Siren	00108bb713	llama.android : Rewrite Android binding (w/o cpu_features dep) (llama/17413) * UI: implement basic UI components * util: implement performance monitor; wrap it with a viewmodel * util: implement user preferences utility * UI: implement core flow's screens * UI: add a new MainActivity; update manifest * [WIP] DI: implement simple local vm factory provider * UI: disable triggering drawer via gesture; enable alert dialog on back navigation inside conversation and benchmark * UI: allow drawer's gesture control only on Home and Settings screens; enable alert dialog on back navigation inside conversation and benchmark * UI: split a nested parent settings screen into separate child settings screens * UI: polish system prompt setup UI * Deps: bump Kotlin plugin; introduce KSP; apply in :app subproject * DB: setup Room database * data: introduce repo for System Prompt; flow data from Room to VM * bugfix: properly handle user's quitting conversation screen while tokens in generation * UI: rename `ModeSelection` to `ModelLoading` for better clarity * UI: update app name to be more Arm * UI: polish conversation screen * data: code polish * UI: code polish * bugfix: handle user quitting on model loading * UI: locks user in alert dialog when model is unloading * vm: replace token metrics stubs with actual implementation * UI: refactor top app bars * nit: combine temperatureMetrics and useFahrenheit * DI: introduce Hilt plugin + processor + lib dependencies * DI: make app Hilt injectable * DI: make viewmodels Hilt injectable * DI: replace manual DI with Hilt DI * UI: optimize AppContent's composing * bugfix: wait for model to load before navigating to benchmark screen; use NavigationActions instead of raw navController * UI: navigation with more natural animated transitions * DI: Optimize AppModule * Feature: Introduce ModelRepository and ModelsManagementViewModel; update AppModule * UI: polish UI for ModelsManagementScreen; inject ModelsManagementVieModel * 
DI: abstract the protocol of SystemPromptRepository; update AppModule * data: [WIP] prepare for ModelRepository refactor & impl * data: introduce Model entity and DAO; update DI module * UI: replace Models Management screen's stubbing with instrumentation * UI: polish sort order menu * data: import local model with file picker * bugfix: use List instead of Collection for ModelDao's deletion * data: add a util file for extracting file name & size and model metadata * UI: enrich ModelManagementState; extract filename to show correct importing UI * UI: implement multiple models deletion; update Models Management screen * UI: handle back navigation when user is in multi-selection mode * util: extract file size formatting into ModelUtils * UI: add a confirmation step when user picks a file; refactor model import overlay into AlertDialog * UI: extract a shared ModelCard component * UI: replace model selection screen's data stubbing; add empty view * nit: tidy SystemPromptViewModel * Util: split FileUtils from ModelUtils; extract copy methods into FileUtils * data: pass through getModelById from ModelDao into ModelRepository * core: extract conversation and benchmark logics into InferenceManager; add logs and missing state updates in stub InferenceEngine * vm: split mono MainViewModel into separate individual ViewModels * vm: merge SystemPromptViewModel into ModelLoadingViewModel * core: break down InferenceManager due to Interface Segregation Principle * UI: show model card in Model Loading screen * UI: show model card in Conversation screen * UI: unify Model Card components * core: swap in LLamaAndroid and mark stub engine for testing only * data: allow canceling the ongoing model import * UI: update UI ongoing model import's cancellation * LLama: update engine state after handling the cancellation of sendUserPrompt * VM: handle the cancellation of ongoing token generation * LLama: refactor loadModel by splitting the system prompt setting into a separate method * 
feature: check for available space before copying local model * UI: centralize the AppScaffold and modularize its configs * UI: refactor BottomBarConfig.ModelsManagement APIs * UI: combine TopBarConfig and BottomBarConfig into each route's ScaffoldConfig * UI: replace ugly optional as casts in AppScaffold with extension functions * UI: fix the typo `totalGb` in `StorageMetrics` * UI: remove code duplication in sort menu * LLama: add ModelUnloadingState to engine State; add missing state checks in stub engine; fix instrumentation engine's error messages * UI: refactor back handling by removing centralized BackHandlerSetup and UnloadModelConfirmationDialog from AppContent * UI: implement BenchmarkScreen's individual back handling * LLama: add a new Initializing state; ; add two extension properties; rename LibraryLoaded state to Initialized * UI: Introduce an abstract ViewModel to handle additional model unloading logics * UI: expose a single facade ModelUnloadDialogHandler; move UnloadModelState into ModelUnloadingViewModel.kt * UI: migrate ModelLoadingScreen onto ModelLoadingViewModel; update & refine ModelLoadingScreen * UI: migrate ConversationViewModel onto ModelLoadingViewModel; update & refine ConversationScreen * nit: extract app name into a constant value; remove unused onBackPressed callbacks * UI: update AppContent to pass in correct navigation callbacks * nit: polish ModelLoadingScreen UI * core: throw Exception instead of returning null if model fails to load * navigation: sink model loading state management from AppContent down into ModelLoadingScreen; pass ModelLoadingMetrics to Benchmark and Conversation screens * gguf: add GGUF metadata data holder and its corresponding extractor implementation * DB: introduce Kotlin serialization extension's library and plugin; add Room runtime library * GGUF: make GgufMetadata serializable in order to be compatible with Room * nit: refactor data.local package structure * nit: rename lastUsed field to dateLastUsed; 
add dateAdded field * UI: refactor ModelCard UI to show GGUF metadata * UI: update ModelSelectionScreen with a preselect mechanism * UI: polish model card * nit: allow deselect model on Model Selection screen * nit: revert accidental committing of debug code * UI: polish ModelLoading screen * util: extract formatting helper functions from FileUtils into a new FormatUtils * UI: polish model cards on Benchmark and Conversation screens to show model loading metrics * UI: show a Snack bar to warn user that system prompt is not always supported * UI: handle back press on Model Selection screen * UI: finally support theme modes; remove hardcoded color schemes, default to dynamic color scheme implementation * feature: support searching on Model Selection screen * nit: move scaffold related UI components into a separate package * UI: extract InfoView out into a separate file for reusability * data: move Model related actions (query, filter, sort) into ModelInfo file * UI: animate FAB on model preselection states * feature: support filtering in Model Management screen * ui: show empty models info in Model Management screen * ui: add filter off icon to "Clear filters" menu item * [WIP] ui: polish Benchmark screen; implement its bottom app bar * ui: polish Benchmark screen; implement its bottom app bar's rerun and share * nit: disable mode selection's radio buttons when loading model * feature: implement Conversation screen's bottom app bar * pkg: restructure BottomAppBars into separate files in a child package * pkg: restructure TopBarApps into separate files in a child package * pkg: restructure system metrics into a separate file * UI: polish Conversation screen * data: update system prompt presets * UI: allow hide or show model card on Conversation & Benchmark screens; fix message arrangement * data: update & enhance system prompt presets * deps: introduce Retrofit2 * data: implement HuggingFace data model, data source with Retrofit API * data: update Model data 
repository to support fetching HuggingFace models * [WIP] UI: replace the HuggingFace stub in Model Management screen with actual API call * UI: map language codes into country Emojis * ui: add "clear results" action to Benchmark screen * nit: print current pp & tg in llama-bench * UI: disable landscape mode; prevent duplicated benchmark running * llama: migrate C/CXX flags into CMakeList * [WIP] llama: ABI split builds five .so artifacts. However, all .so are performing on SVE level * [WIP] llama: ABI split where five tiers are built sequentially. * [WIP] llama: disable OpenMP in ABI split since most SoCs are big.LITTLE * [WIP] llama: enable KleidiAI and disable tier 4 due to `+sve+sve2` bug caused by `ggml_add_cpu_backend_variant_impl` as explained below ```CMake if (NOT SME_ENABLED MATCHES -1) ... set(PRIVATE_ARCH_FLAGS "-fno-tree-vectorize;${PRIVATE_ARCH_FLAGS}+sve+sve2") ... ``` * core: add Google's cpu_features as a submodule * core: implement cpu_detector native lib * core: swap out hardcoded LlamaAndroid library loading * core: add back OpenMP due to huge perf loss on TG128 * misc: reorg the pkg structure * misc: rename LlamaAndroid related class to InferenceEngine prefixes * [WIP] lib: move GgufMetadata into the lib submodule * lib: expose GgufMetadataReader as interface only * lib: replace the naive & plain SharedPreferences with DataStore implementation * lib: hide the internal implementations, only expose a facade and interfaces * lib: expose Arm features * di: add a stub TierDetection; provide both actual impl and stub in AppModule * UI: add visualizer UI for Arm features * misc: UI polish * lib: refactored InferenceEngineLoader; added a `NONE` Llama Tier * UI: support `NONE` Llama Tier in general settings * lib: optimize engine loader; always perform a fresh detection when cache is null * remote: add HuggingFaceModelDetails data class * remote: refine HuggingFaceModel data class * nit: remove `trendingScore` field from HuggingFace model entities, 
weird... * remote: refactor HuggingFaceApiService; implement download feature in HuggingFaceRemoteDataSource * remote: fix the incorrect parse of HuggingFace's inconsistent & weird JSON response * UI: scaffold Models Management screen and view model * UI: implement a dialog UI to show fetched HuggingFace models. * UI: use a broadcast receiver to listen for download complete events and show local import dialog. * data: handle network exceptions elegantly * pkg: restructure `data`'s packages * data: extract local file info, copy and cleanup logics into LocalFileDataSource * nit: minor UI patch; add missing comments * bugfix: tapping "Home" in navigation drawer should simply close it without any navigation action. * UI: improve autoscroll during token generation * lib: tested on JFrog Artifactory for Maven publishing * UI: show RAM warning if model too large * UI: polish model management screen's error dialog * util: add more items into the mapping table of ISO 639-1 language code to ISO 3166-1 country code * llm: properly propagate error to UI upon failing to load selected model * UI: avoid duplicated calculation of token metrics * lib: read & validate the magic number from the picked source file before executing the import * UI: add "Learn More" hyperlinks to Error dialog upon model import failures * lib: refactor the GgufMetadataReader to take InputStream instead of absolute path as argument * lib: fix the `SIMD` typo in Tier description * core: verify model file path is readable * lib: add UnsupportedArchitectureException for triaged error message * util: split FormatUtils into multiple utils for better readability * UI: change benchmark screen from raw markdown to table view * bugfix: reset preselection upon running the preselected model * misc: linter issue * bugfix: fix the malfunctioning monitoring switch * UI: update Arm features indicator; fix the broken hyperlinks * UI: add quick action buttons to benchmark screen's result card * UI: hide share fab after 
clearing all benchmark results * UI: fix the model unload dialog message; elevate the model card and hide it by default on Conversation screen; * UI: hide the stubbing actions in Conversation screen * UI: add show/hide stats control to conversation screen's assistant message bubble; fix placeholder * UI: add a info button to explain token metrics * misc: remove the redundant `Companion` added due to refactoring * UI: show corresponding system metrics detailed info upon tapping RAM / storage / temperature indicator * UI: add info button to System Prompt switch; expand the model card by default * UI: disable tag & language chips; add section headers to explain what they are * misc: replace top bar indicator's spacer with padding * UI: merge the Model Selection and Model Management into a unified Models screen * UI: split the ModelsManagementViewModel from a unified ModelsViewModel due to huge complexity * UI: add model loading in progress view; polish the empty model info view * UI: polish the bottom bars and info view when no models found; show loading in progress while fetching models * build: [BREAKING] bump the versions of libraries and plugins * UI: fix the breaking build * UI: add Tooltip on Import FAB for user onboarding * UI: adds AppPreferences to track user onboarding status * UI: tracks user's first success on importing a model * data: add hand crafted rules to filter the models fetched from HuggingFace API * UI: update app name & about; polish top bars' indicators & buttons * UI: polish Hugging Face download dialog UI * UX: implement onboarding tooltips for model import and onboarding * misc: use sentence case for CTA button labels * [WIP] UI: add Arm color palette from Philip.Watson3 * UI: address Rojin's UX feedbacks * UI: address Rojin's UX feedbacks - part 2 * UI: update Arm color palette from Philip.Watson3 * data: make sure fetch preselected models in the same order of their IDs * UI: fix UI issues in the generic settings screen and navigation 
drawer * nit: address Rojin's feedback on model import message again * nit: append `®` to all `Arm` labels * UI: extract a reusable InfoAlertDialog * core: support GGML_CPU_ALL_VARIANTS on Android! * core: restructure Kleidi-Llama library * core: organizing cmake arguments * data: sort preselected models according to device's available RAM * app: update adaptive + themed + legacy icons and app name * UI: fix the font size auto scaling for ArmFeaturesVisualizer * core: further improve the performance on native methods * UI: minor color palette changes; emphasize the bottom bar FABs; fix Settings Screen menu item label * UI: make more room for assistant message bubble's width * UI: better usage of tertiary colors to highlight model cards but not for warnings * UI: fix the layout issue on large font sizes * lib: support x86-64 by dynamically setting Arm related definitions * lib: replace the factory pattern for deprecated tiered lib loading with single instance pattern * llama: update the library name in JNI and CMake project * llama: update the library's package name and namespace * llama: update the app's package name and namespace * app: bump ksp version * app: remove deprecated SystemUIController from accompanist by migrating to EdgeToEdge * app: extract AppContent from MainActivity to a separate file in ui package * lib: add File version for GGUF Magic number verification * lib: perform engine state check inclusively instead of exclusively * lib: change `LlamaTier` to `ArmCpuTier` * lib: remove kleidi-llama related namings * cleanup: remove Arm AI Chat/Playground app source code; replace with the basic sample app from https://github.com/hanyin-arm/Arm-AI-Chat-Sample Note: the full Google Play version of AI Chat app will be open sourced in another repo soon, therefore didn't go through the trouble of pruning the history using `git filter-repo` here. 
* [WIP] doc: update main and Android README docs; add self to code owners * lib: revert System.load back to System.loadLibrary * jni: introduce a logging util to filter different logging levels on different build types * lib: enable app optimization * doc: replace stub Google Play app URL with the actual link add screenshots; add my GitHub ID to maintainer list * Remove cpu_features * Fix linters issues in editorconfig-checker job https://github.com/ggml-org/llama.cpp/actions/runs/19548770247/job/55974800633?pr=17413 * Remove unnecessary Android CMake flag * purge include/cpu_features directory --------- Co-authored-by: Han Yin <han.yin@arm.com>	2025-12-18 08:20:56 +02:00
Aadeshveer Singh	41a95b8ba7	ggml : use WARP_SIZE/2 for argmax reduction offset (llama/18092)	2025-12-18 08:20:56 +02:00
Shouyu	8dd70bdc85	ggml-hexagon: mm for mtmd (llama/17894) * feat: add run_mtmd script for hexagon * fix: fix issue in fp16xfp32 mm * fix: remove opt_experiment for fp16xfp32 mm * fix: ggml-hexagon: matmul fp16xfp32 support non-contiguous src0 * fix: fix syntax check for run-mtmd.sh for cli	2025-12-18 08:20:56 +02:00
Jeremy Demeule	b90ec07aba	metal: use shared buffers on eGPU (llama/17866) * metal: use shared buffers on eGPU With #15906, I noticed an important regression when using the metal backend on eGPU. This commit restores the previous behavior and adds an option to force its activation. * metal: use shared buffers on eGPU * metal: use shared buffers on eGPU	2025-12-18 08:20:56 +02:00
Johannes Gäßler	aaf3f39b4a	llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (llama/16653) * llama: automatically fit args to free memory llama-fit-params tool * fix CI * hints for bug reports, ensure no reallocation * fix segfault with Vulkan * add llama-fit-params to CI * fix CI * fix CI * fix CI * minor adjustments * fix assignment of 1 dense layer * fix logger not being reset on model load failure * remove --n-gpu-layer hint on model load failure * fix llama-fit-params verbosity * fix edge case * fix typo [no ci]	2025-12-18 08:20:56 +02:00
Neo Zhang Jianyu	b5e352a52f	Support gpt-oss by OPs add-id, mul_mat for mxfp4, swiglu_oai (llama/17826) * support gpt-oss GPU by OP add-id, mul_mat for mxfp4, swiglu_oai, fix warning * fix fault ut case, update ops.md * rebase, fix format issue	2025-12-18 08:20:56 +02:00
Ruben Ortlam	3bb4e1e0ac	vulkan: fix mul_mat_vec_iq1_s formatting (llama/18026)	2025-12-18 08:20:56 +02:00
Jeff Bolz	af2c8cba6f	vulkan: Fix data race/hang in scalar/cm1 flash attention (llama/17887)	2025-12-18 08:20:56 +02:00
lovedheart	7e5df2975e	vulkan: improve mul_mat_vec_iq1_s speed (llama/17874)	2025-12-18 08:20:56 +02:00
Eve	cdadfc3b72	vulkan: faster q6_k matmul (llama/17813) * q6_k faster mul mat * 8 values * fix comment * switch to two at a time * start ci for .glsl files	2025-12-18 08:20:56 +02:00
Georgi Gerganov	b62ef9af7a	ggml : arm repack fix build (llama/0)	2025-12-18 08:20:56 +02:00
Jeff Bolz	b901ebe4a3	vulkan: support get_rows for i32 (llama/17941)	2025-12-18 08:20:56 +02:00
Jeff Bolz	f33446643e	vulkan: support GGML_OP_DIAG (llama/17893)	2025-12-18 08:20:56 +02:00
Jeff Bolz	939d3085e9	vulkan: Multi-pass softmax for large number of cols (llama/17892) When the number of cols is large, split each row across multiple workgroups. There are three phases that communicate partial results through temp buffers: (1) compute max partials (2) take max of partials, compute sum(exp(x-max)) partials (3) sum partials, compute scaled result	2025-12-18 08:20:56 +02:00
Jeff Bolz	13bb296dbf	vulkan: Allow non-pow2 n_experts in topk_moe (llama/17872)	2025-12-18 08:20:56 +02:00
Johannes Gäßler	feb856d4a1	CUDA: fix overflow in MMA kernel without stream-k (llama/17939)	2025-12-18 08:20:56 +02:00
Sigbjørn Skjæret	db1fcd958f	cann : fix ops broken by circular padding guard (llama/17825)	2025-12-18 08:20:56 +02:00
ixgbe	2c782ec325	ggml-cpu : fix RISC-V Q4_0 repack select and RVV feature reporting (llama/17951) * ggml-cpu:fix RISC-V Q4_0 repack select and RVV feature reporting Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> * using the name VLEN instead of CNT * Update ggml/include/ggml-cpu.h --------- Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-18 08:20:56 +02:00
yulo	25d99e9135	HIP: enable mmf for RDNA3 (llama/17879) * enable mmf for RDNA3 * disable mmf for some shape * move some mmvf to mmf * more mmvf to mmf * 3 is good in mmvf --------- Co-authored-by: zhang hui <you@example.com>	2025-12-18 08:20:56 +02:00
Piotr Wilkin (ilintar)	e0af519a61	SOLVE_TRI extension to more dimensions (llama/17793) * Extended TRI * Fix whitespace * chore: update webui build output * Just use cuBLAS for everything... * Merge both versions * Remove incorrect imports causing failures for CI * Still failing... remove all direct cublas imports and rely on common imports from "common.cuh" * Defines for hipBlas * Aaaand MUSA defines... * I hate this job... * Stupid typo... * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-18 08:20:56 +02:00
Georgi Gerganov	f0c9017a2f	ggml : arm repack fix build (#0)	2025-12-13 08:04:09 +02:00
Congcong Cai	324dd21d3c	cmake : set `CMAKE_RUNTIME_OUTPUT_DIRECTORY` for non standalone build (ggml/1394) Some backends, such as the Metal backend, depend on CMAKE_RUNTIME_OUTPUT_DIRECTORY to create temporary files. A missing CMAKE_RUNTIME_OUTPUT_DIRECTORY causes CMake errors such as "permission denied" (an attempt to copy a file to the root directory). This PR sets a default path for CMAKE_RUNTIME_OUTPUT_DIRECTORY when it is not set.	2025-12-12 17:53:24 +02:00
Georgi Gerganov	1da1a6865c	ggml-alloc : fix reuse-parent logic for misaligned sizes (llama/17884)	2025-12-12 17:53:24 +02:00
nullname	0c88de5c69	ggml-hexagon: fix `rope` failure at `test-backend-ops` (llama/17565) * fix test failure * fix: correct scaling calculations in rope_cache_init * fix: optimize element copying in rope_hex_f32 using memcpy * fix: optimize loop boundaries in rope_hex_f32 for better performance * feat: add profiling macros for performance measurement in operations	2025-12-12 17:53:24 +02:00
Max Krasnyansky	a2886fba48	Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes (llama/17748) * tests: update barrier test to check for race condition in active threads * cpu: combine n_graph and n_threads into a single atomic update * tests: add multi-graph test for test_barrier	2025-12-12 17:53:24 +02:00
Georgi Gerganov	cd9b8c6d18	ggml : remove GGML_KQ_MASK_PAD constant (llama/17910) * ggml : remove GGML_KQ_MASK_PAD constant * cont : remove comment	2025-12-12 17:53:24 +02:00
Sigbjørn Skjæret	ca8ea18d06	cuda : add missing support check for xielu (llama/17895)	2025-12-12 17:53:23 +02:00
Johannes Gäßler	ea1829134f	CUDA: fix unpadded strides in MMA FA kernel (llama/17891)	2025-12-12 17:53:23 +02:00
Neo Zhang Jianyu	c10b4f9a01	fix softmax for iGPU (llama/17838)	2025-12-12 17:53:23 +02:00
Gabe Goodhart	307dc525bb	metal: SSM kernel improvements (llama/17876) * feat: Add a batched version of ssm_conv This was done using Claude Code. It found a number of optimizations around how the threads were organized, resulting in a huge performance boost! Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Optimized SSM_SCAN kernel for metal This used Claude Code and resulted in a modest performance improvement while maintaining correctness. Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Add test-backend-ops perf tests for SSM_CONV Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Real representative tests for SSM_CONV Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use function constant for ssm_conv batch size Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: backend op tests for ssm_scan from granite4 1b-h Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: remove commented out templates Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: float4 version of ssm_conv_batched Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing ggml_metal_cv_free Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:23 +02:00
Piotr Wilkin (ilintar)	2817582be2	Add DIAG for CUDA (llama/17873) * Add DIAG for CUDA * Refactor parameters	2025-12-12 17:53:23 +02:00
Gabe Goodhart	41bbc034f0	ggml : Provide macos-specific backtrace printing to avoid terminal death (llama/17869) * fix: Provide macos-specific backtrace printing to avoid terminal death Branch: MacOSSafeBacktrace Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add GGML_BACKTRACE_LLDB env var to enable using lldb for backtrace Branch: MacOSSafeBacktrace Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-12-12 17:53:22 +02:00
Georgi Gerganov	b6ae0b29d1	metal : print node names for debugging (llama/17882)	2025-12-12 17:53:22 +02:00
Sigbjørn Skjæret	ba463fb577	ggml : allow fill node alloc inplace (llama/17870)	2025-12-12 17:53:22 +02:00
Chenguang Li	79d86a5c2c	CANN: add support for partial RoPE and Vision mode (llama/17543) * cann: add support for partial RoPE and Vision mode Add support for two important RoPE variants: partial rotation (rope_dims < ne0) and Vision mode rotation. 1. Support for partial RoPE (rope_dims < ne0): - Split tensor into head (first rope_dims dimensions) and tail portions - Apply rotation only to head portion using RotaryPositionEmbedding operator - Copy unrotated tail portion directly from source to destination - Handle both contiguous and non-contiguous tensor layouts 2. Support for Vision mode (GGML_ROPE_TYPE_VISION): - Set rope_dims = ne0 for Vision mode to rotate entire tensor - Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2) - No tail handling needed since entire tensor is rotated Implementation details: - Use has_tail flag to determine execution path: head/tail splitting when rope_dims < ne0, or full tensor rotation when rope_dims == ne0 - Support both F32 and F16 data types with intermediate F32 conversion - Copy non-contiguous tensors to contiguous buffers before calling RotaryPositionEmbedding operator for compatibility - Improve cache invalidation logic to include rope_dims and indep_sects parameters These enhancements enable CANN backend to handle various RoPE configurations used in modern vision-language models and models with partial rotation. * cann: fix review comment	2025-12-12 17:53:22 +02:00
Johannes Gäßler	bef1f5a57e	CUDA: fix FP16 overflow in tile FA kernel (llama/17875)	2025-12-12 17:53:22 +02:00
Jay Zenith	821c2071ab	cuda : add FILL op support (llama/17851) * cuda : add FILL op support * cuda : add missing FILL op files	2025-12-12 17:53:22 +02:00
wsbagnsv1	e1562e85fc	cuda: optimize SOLVE_TRI using registers and FMAF (llama/17703) * ggml-cuda: optimize solve_tri_f32_fast and fix stride handling - Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts. - Implement explicit `fmaf` instructions for the reduction loop. - Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition). - Remove unused `MAX_K_FAST` definition. Small cleanup * Remove comments in solve_tri.cu * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Use const for variables in solve_tri.cu * Replace fmaf with more readable code * remove last fmaf --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-12 17:53:21 +02:00
ixgbe	c8d0ee2f9f	ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support (llama/17784) * ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> * cmake: enable RISC-V zihintpause extension for Spacemit builds * readme : add ZIHINTPAUSE support for RISC-V --------- Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-12-12 17:53:21 +02:00
lovedheart	d6d44fac69	Vulkan: improve mul_mat_vec_iq1_m (llama/16907) * Optimize Vulkan shader for matrix-vector multiplication * Revert changes on compute_outputs and main Refactor compute_outputs to handle remaining rows correctly. * Fix trailing whitespace	2025-12-12 17:53:21 +02:00
Law Po Ying	447ef8633b	sycl: add missing BF16 conversion support for Intel oneAPI (llama/17780) * sycl: add missing BF16 conversion support for Intel oneAPI * Fix Line 645: Trailing whitespace	2025-12-12 17:53:21 +02:00
Jeff Bolz	898f876fe2	vulkan: perf_logger improvements (llama/17672) * vulkan: perf_logger improvements - Move perf_logger from device to ctx. - Add an env var to control the frequency we dump the stats. If you set a very large value, it just dumps when the ctx is destroyed. - Add a fusion info string to the tracking, only log one item per fused op. - Fix MUL_MAT_ID flops calculation. * fix vector sizes	2025-12-12 17:53:21 +02:00
Vishal Singh	ebff8f9db9	ggml-zendnn : add ZenDNN backend for AMD CPUs (llama/17690) * ggml-zendnn: add ZenDNN backend support * ggml-zendnn : address ZenDNN backend review fixes and suggestions * docs : apply blockquote syntax to ZenDNN docs --------- Co-authored-by: Manoj Kumar <mkumar@zettabolt.com>	2025-12-12 17:53:21 +02:00
Phylliida Dev	c5e1807071	ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (llama/16985) * Feat: Added vulkan circular tiling support * Feat: Added cpu circular * Feat: Added cuda kernels * Added tests * Added tests * Removed non-pad operations * Removed unneeded changes * removed backend non pad tests * Update test-backend-ops.cpp * Fixed comment on pad test * removed trailing whitespace * Removed unneeded test in test-backend-ops * Removed removed test from calls * Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp Co-authored-by: Ruben Ortlam <picard12@live.de> * Fixed alignment * Formatting Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Format pad * Format * Clang format * format * format * don't change so much stuff * clang format and update to bool * fix duplicates * don't need to fix the padding * make circular bool * duplicate again * rename vulkan to wrap around * Don't need indent * moved to const expr * removed unneeded extra line break * More readable method calls * Minor wording changes * Added final newline * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Added circular pad ext tests * Gate non circular pad devices * Cleaned gating of non-circular pad devices --------- Co-authored-by: Phylliida <phylliidadev@gmail.com> Co-authored-by: Ruben Ortlam <picard12@live.de> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:20 +02:00
Johannes Gäßler	94be71911f	HIP: fix RDNA3 FP16/BF16 matrix multiplication (llama/17817)	2025-12-12 17:53:20 +02:00
Sky	b67e3abdb2	ggml : improve error handling for search path existence checks (llama/17653) * Improve error handling for search path existence checks Refactor existence checks for search paths using std::error_code to handle potential errors. * Improve cache file existence check with error code Update fs::exists to use std::error_code for error handling. * Simplify existence check for search paths Simplify existence check for search paths * Fix logging path in error message for posix_stat * Update ggml/src/ggml-backend-reg.cpp Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Adapt to the coding standard --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2025-12-12 17:53:20 +02:00
Jeff Bolz	c66c71e9f4	vulkan: Use one row per workgroup for f32 mmv (llama/17711) The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs well. I think even for larger m, f32 is so bandwidth-limited that running multiple rows doesn't help.	2025-12-12 17:53:20 +02:00
Jeff Bolz	875d861473	vulkan: support solve_tri with larger N/K values (llama/17781) Split N into chunks to fit into shared memory. If K > 128, use a larger workgroup with enough invocations. Add perf tests matching qwen3next.	2025-12-12 17:53:20 +02:00
Georgi Gerganov	41cf229d72	metal : fix build (#17799) * metal : fix build * tests : fix context destruction	2025-12-12 17:53:20 +02:00
Masato Nakasaka	a8d02735f7	vulkan: Replace deprecated VK_EXT_validation_features (llama/17637) * replaced deprecated VK_EXT_validation_features * forgot to remove old code	2025-12-12 17:53:19 +02:00
Masato Nakasaka	191e5f46a2	vulkan: Fix mismatch in TOPK_MOE unit test (llama/17541) * Fix shader to support 2D workgroup mapping to a single subgroup * Set required_subgroup_size: the topk_moe shader requires the static WARP_SIZE and the actual subgroup size to match	2025-12-12 17:53:19 +02:00
Jeff Bolz	64a3f573e0	vulkan: add more num_blocks instantiations in rms_norm (llama/17701)	2025-12-12 17:53:19 +02:00
Jeff Bolz	0484147ab2	vulkan: fix top_k bug when there are ties in the input (llama/17659) * vulkan: Reduce temporary memory usage for TOP_K - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size. - Pass the output row size as "ncols_output"; what used to be "ncols_output" is now "k". For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB. * vulkan: fix top_k bug when there are ties in the input I noticed by inspection a bug in the vulkan top_k shader where, if the least value in the top_k appears multiple times, we could end up writing those extra copies out rather than some larger values (if the larger values are on higher numbered threads). I rewrote the test verification to handle this case, where the final index set is not necessarily the same. * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:19 +02:00
Acly	0b53759b29	vulkan : support conv-2d with large output size (llama/17685)	2025-12-12 17:53:19 +02:00
Reese Levine	23984be4da	ggml webgpu: unary op support, code refactoring, ops support (llama/17764) * Squashed commit of the following: commit b3c6bf4b0450d8d452b934df27a0fb7cb53cd755 Author: Abhijit Ramesh <abhijitramesh2k@gmail.com> Date: Mon Dec 1 18:29:00 2025 -0800 ggml webgpu: fix xielu parameter passing (llama/11) The XIELU operation was incorrectly using static_cast to convert float parameters to uint32_t, which converted numeric values instead of preserving IEEE 754 bit patterns. This caused incorrect values to be interpreted by the GPU shader. * Use reinterpret_cast to preserve float bit patterns when passing through uint32_t params buffer * Update WGSL shader parameter types from u32 to f32 * Re-enable XIELU support (was disabled due to numerical issues) Fixes NMSE test failures for XIELU operation on WebGPU backend. commit 5ca9b5e49ea7cddc9ab7c8b43a11a9c76a4dff4a Author: neha-ha <137219201+neha-ha@users.noreply.github.com> Date: Tue Nov 18 12:17:00 2025 -0800 Refactored pipelines and workgroup calculations (llama/10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 29 23:13:06 2025 -0700 formatted embed wgsl and ggml-webgpu.cpp commit e1f6baea31645e5d96ad53664acae856f74b96f4 Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 29 23:08:37 2025 -0700 implemented REPL_Template support and removed bug in unary operators kernel commit 8c70b8fece445cdc9a8c660dbddbf201e52da2bb Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 15 16:14:20 2025 -0700 responded and dealt with PR comments commit f9282c660c10dec4487d434549bdb707a9cd9f37 Author: James Contini <jamescontini@gmail.com> Date: Sun Oct 12 13:41:41 2025 -0700 removed unnecessary checking if node->src[1] 
exists for unary operators commit 4cf28d7dec41c29186d66152735b244c5699f9dc Author: James Contini <jamescontini@gmail.com> Date: Sun Oct 12 13:32:45 2025 -0700 All operators (including xielu) working commit 74c6add1761a59d2c2ff60b60e8ad3c8300f6d3e Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 13:16:48 2025 -0700 fixed autoconfig commit 362749910be4f0120c8ffb21ceddeb7d2c088e51 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 13:10:46 2025 -0700 removed vestigial files commit cb0858333785757804c5104e59c4981843207c16 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 12:59:32 2025 -0700 abides by editor-config commit 5360e2852a4b51197d7d67d0a5d42e908b02d7ed Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 12:45:57 2025 -0700 rms_norm double declaration bug atoned commit 7b09baa4aa53711be5a126043670cc182c78bfcd Merge: 8a6ec843 74b8fc17 Author: James Contini <jamescontini@gmail.com> Date: Fri Oct 10 11:50:03 2025 -0700 resolving merge conflicts commit 8a6ec843a50ab82f8cef59b4558eb63f318ba02d Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 8 18:06:47 2025 -0700 unary operators pass ggml tests commit c3ae38278a2db236adc5912c9140e4f0d63f2c19 Author: James Contini <jamescontini@gmail.com> Date: Wed Oct 1 16:22:40 2025 -0700 neg passes backend test commit aa1c9b2f8877a405470ca56709c42a1fd43713de Author: James Contini <jamescontini@gmail.com> Date: Tue Sep 30 23:55:27 2025 -0700 neg f16xf32xip builds and runs, haven't actually run a model that uses neg kernel yet though Co-authored-by: James Contini <jamescontini@gmail.com> Co-authored-by: Neha Abbas <neabbas@ucsc.edu> Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com> * Remove extra code and format * Add ops documentation (finally) * Update ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: James Contini <jamescontini@gmail.com> Co-authored-by: Neha Abbas 
<neabbas@ucsc.edu> Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-12 17:53:18 +02:00
Jeff Bolz	7e97d3b069	vulkan: enable mmvq for q2_k on NVIDIA (llama/17675)	2025-12-12 17:53:18 +02:00
Jeff Bolz	32ba1ec8e0	vulkan: set all memory allocations to high priority (llama/17624) * vulkan: set all memory allocations to high priority * gate by env var	2025-12-12 17:53:18 +02:00
Georgi Gerganov	aefcd75f4f	rpc : fix alloc size logic (llama/17116) * rpc : fix alloc size logic * rpc : bump version	2025-12-12 17:53:18 +02:00
Georgi Gerganov	322903fa67	metal : add residency sets keep-alive heartbeat (llama/17766) * examples : add idle * metal : attach residency sets to queue * idle : add link * idle : adjust intervals * metal : add residency sets keep-alive heartbeat * cont : adjust default keep-alive time	2025-12-12 17:53:18 +02:00
Johannes Gäßler	4170159dcd	HIP : fix RDNA4 build (llama/17792)	2025-12-12 17:53:18 +02:00
shalinib-ibm	d30b744047	Q4/Q8 Tiled Gemm Optimization. (llama/16999)	2025-12-12 17:53:17 +02:00
Johannes Gäßler	14502d6561	CUDA: fix FA VKQ accumulator overflow (llama/17746)	2025-12-12 17:53:17 +02:00
Jiacheng (Jason) Chen	e3f3c6ead1	HIP: enable WMMA-MMQ INT kernels for RDNA 3 (llama/17576) * enabled wmma instructions for most quantizations other than q2k * fixed the last q2_k test case failure * address comments: fix out of bound write for RDNA4, add comments after #endif * clean up rebase: fix ne error in half2 * fix the EditorConfig CI	2025-12-12 17:53:17 +02:00
Piotr Wilkin (ilintar)	8d44d6181a	Add support for CUMSUM and TRI for CUDA. (llama/17584) * Add support for CUMSUM and TRI for CUDA. * Minor optimizations. * Correct warp_prefix_inclusive_sum in float2 variant to return float2 * Optimize TRI * Whitespace * Fix strides. * Implement double loop * Whitespace * Fix HIP compilation bugs * Optimizations + big case performance tests * Implement using CUB with fallback to custom kernel * Remove error message. * Fixes from code review * Comment out CPU-unsupported F16/BF16 cases to fix CI * Fine, you win :P * Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS * Vary warp-size based on physical warp size * Add GGML_UNUSED_VARS in tri as well * Use constexpr and call prefix_inclusive with warp_size template param * Update ggml/src/ggml-cuda/cumsum.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Change to tid % warp_size * Fix strides; hardcode mask; add ggml_lane_mask_t * Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info() * Too hasty... --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-12 17:53:17 +02:00
Gabe Goodhart	8902c9d976	metal: TRI, FILL, EXPM1, SOFTPLUS (llama/16623) * feat(wip): Port initial TRI impl from previous work The kernel does not work and is not optimized, but the code compiles and runs, so this will be the starting point now that the core op has been merged. Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove argument for constant val override This was added in the original draft, but later removed. With this, the kernel now passes tests. Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Move the ttype conditional to templating to avoid conditional in kernel Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Type fixes Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * feat: Add softplus for metal Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add EXPM1 for metal Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add FILL for metal Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Branchless version of tri using _ggml_vec_tri_cmp as a mask Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unused arguments Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use select instead of branch for softplus non-vec Branch: ggml-cumsum-tri Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:17 +02:00
Alberto Cabrera Pérez	f96ebc92d2	ggml-cpu : remove asserts always evaluating to false (llama/17728)	2025-12-12 17:53:17 +02:00
Georgi Gerganov	194d016456	metal : use params per pipeline instance (llama/17739)	2025-12-12 17:53:16 +02:00
Adrien Gallouët	92e50155c9	build : move _WIN32_WINNT definition to headers (llama/17736) Previously, cmake was forcing `_WIN32_WINNT=0x0A00` for MinGW builds. This caused "macro redefined" warnings with toolchains that define the version. This also removes the `GGML_WIN_VER` variable as it is no longer needed. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:16 +02:00
Herman Semenoff	3794a0d3b6	ggml-cpu: remove duplicate conditional check 'iid' (llama/17650)	2025-12-12 17:53:16 +02:00
Johannes Gäßler	7adbcafb6c	CUDA: generalized (mma) FA, add Volta support (llama/17505) * CUDA: generalized (mma) FA, add Volta support * use struct for MMA FA kernel config --------- Co-authored-by: Aman Gupta <aman>	2025-12-12 17:53:16 +02:00
Georgi Gerganov	4a00f2e3a4	metal : fix data race in pipeline library (llama/17731)	2025-12-12 17:53:16 +02:00
Reese Levine	d263bdbfb6	ggml webgpu: add support for emscripten builds (llama/17184) * Faster tensors (llama/8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings * Wasm (llama/9) * webgpu : fix build on emscripten * more debugging stuff * test-backend-ops: force single thread on wasm * fix single-thread case for init_tensor_uniform * use jspi * add pthread * test: remember to set n_thread for cpu backend * Add buffer label and enable dawn-specific toggles to turn off some checks * Intermediate state * Fast working f16/f32 vec4 * Working float fast mul mat * Clean up naming of mul_mat to match logical model, start work on q mul_mat * Setup for subgroup matrix mat mul * Basic working subgroup matrix * Working subgroup matrix tiling * Handle weirder sg matrix sizes (but still % sg matrix size) * Working start to gemv * working f16 accumulation with shared memory staging * Print out available subgroup matrix configurations * Vectorize dst stores for sg matrix shader * Gemv working scalar * Minor set_rows optimization (llama/4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Working subgroup matrix code for (semi)generic sizes * Remove some comments * Cleanup code * Update dawn version and move to portable subgroup size * Try to fix new dawn release * Update subgroup size comment * Only check for subgroup matrix configs if they are supported * Add toggles for subgroup matrix/f16 support on nvidia+vulkan * Make row/col naming consistent * Refactor shared memory loading * Move sg matrix stores to correct file * Working q4_0 * Formatting * Work with emscripten builds * Fix test-backend-ops emscripten 
for f16/quantized types * Use emscripten memory64 to support get_memory * Add build flags and try ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * Remove extra whitespace * Move wasm single-thread logic out of test-backend-ops for cpu backend * Disable multiple threads for emscripten single-thread builds in ggml_graph_plan * Fix .gitignore * Add memory64 option and remove unneeded macros for setting threads to 1 --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-12-12 17:53:16 +02:00
Jeff Bolz	86cb5ab93f	vulkan: Reduce temporary memory usage for TOP_K (llama/17623) - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size - Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k" For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB.	2025-12-12 17:53:15 +02:00
xiaobing318	fffdf679d4	cmake : add utf8 compilation options for msvc (llama/17682)	2025-12-12 17:53:15 +02:00
Adrien Gallouët	16688c6d2c	ggml : use svcntb() for SVE vector length detection (llama/17474) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:15 +02:00
TianHao324	a64d46a529	CANN: Disable Ger operator of OUT_PROD on 310p device (llama/17563)	2025-12-12 17:53:15 +02:00
Daniel Bevenius	201b910743	ggml : remove redundant n_copies check when setting input/output (llama/17612) This commit removes a redundant check for sched->n_copies > 1 when setting input and output flags on tensor copies in ggml_backend_sched_split_graph. The motivation for this change is to clarify the code as the outer if statement already performs this check.	2025-12-12 17:53:15 +02:00
Adrien Gallouët	e2537b4af3	ggml : add fallback definition for HWCAP2_SVE2 (llama/17683) This aligns with other HWCAP2 feature flags See #17528 Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:15 +02:00
Aman Gupta	4c89232b5c	ggml-cuda: reorder only relevant nodes (llama/17639)	2025-12-12 17:53:14 +02:00
Neo Zhang Jianyu	26732d28c4	enhance argsort for UT (llama/17573) Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2025-12-12 17:53:14 +02:00
Georgi Gerganov	32090930f7	metal : add FA head size 48 (llama/17619)	2025-12-12 17:53:14 +02:00
Georgi Gerganov	7cd3de89bf	ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler (llama/17617)	2025-12-12 17:53:14 +02:00
Aman Gupta	6cc2d0534f	llama-graph: avoid expand_forward for fusion (llama/17633)	2025-12-12 17:53:14 +02:00
Tarek Dakhran	0defeee679	model: LFM2-VL fixes (llama/17577) * Adjust to pytorch * Add antialiasing upscale * Increase number of patches to 1024 * Handle default marker insertion for LFM2 * Switch to flag * Reformat * Cuda implementation of antialias kernel * Change placement in ops.cpp * consistent float literals * Pad only for LFM2 * Address PR feedback * Rollback default marker placement changes * Fallback to CPU implementation for antialias implementation of upscale	2025-12-12 17:53:14 +02:00
Gilad S.	706647202e	ggml: fix: macOS build with `-DGGML_BACKEND_DL=ON` (llama/17581)	2025-12-12 17:53:13 +02:00
Aman Gupta	e68ee6e281	CUDA: add stream-based concurrency (llama/16991) * CUDA: add stream-based concurrency * HIP: fix hipStreamWaitEvent define and nodiscard warnings * ggml-cuda: fix fusion inside stream * ggml-cuda: fix bug w.r.t first stream launch * ggml-cuda: format * ggml-cuda: improve assert message * ggml-cuda: use lambda instead of duplicating code * ggml-cuda: add some more comments * ggml-cuda: add more detailed comments about concurrency * ggml-cuda: rename + remove unused var * ggml-cuda: fix condition for stream launch * ggml-cuda: address review comments, add destructor * common.cuh: add is_valid for concurrent events * common.cuh: make comment better * update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * common.cuh: fix lower_bound condition + remove join_node data from write_ranges * ggml-cuda: fix overlap condition + shadowing parameter --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-12 17:53:13 +02:00
Mahekk Shaikh	2e4a7a21fa	cuda : add error checking for cudaMemcpyAsync in argsort (llama/17599) * cuda : add error checking for cudaMemcpyAsync in argsort (llama/12836) * fix indentation	2025-12-12 17:53:13 +02:00
Acly	2258930c2e	vulkan : fix FA mask load with bounds check (coopmat2) (llama/17606)	2025-12-12 17:53:13 +02:00
Neo Zhang	a3459484bf	sycl : support to malloc memory on device more than 4GB, update the doc and script (llama/17566) Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2025-12-12 17:53:13 +02:00
ixgbe	28dff06555	ggml: replace hwcap with riscv_hwprobe for RVV detection (llama/17567) Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-12-12 17:53:12 +02:00
Ruben Ortlam	2fcc0a3a9f	Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (llama/16900) * vulkan: split mul_mmq_funcs for mul_mat_vecq use * add mxfp4 mmvq * add q2_k mmvq * add q3_k mmvq * add q4_k and q5_k mmvq * add q6_k mmvq * handle 4x4 quants per mmvq thread * enable MUL_MAT_ID mmvq support * enable subgroup optimizations for mul_mat_vec_id shaders * device tuning * request prealloc_y sync after quantization * fix indentation * fix llvmpipe test failures * fix mul_mat_id mmvq condition * fix unused variable warning	2025-12-12 17:53:12 +02:00
Jeff Bolz	dbf8766ffa	vulkan: improve topk perf for large k, fix overflow in unit tests (llama/17582)	2025-12-12 17:53:12 +02:00
Diego Devesa	463003e76c	ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (llama/17276) * ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched Enabled in ggml-ci for testing. * llama : update worst-case graph for unified cache * ci : disable op offload in some tests * fix spelling --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:12 +02:00
R0CKSTAR	c372bdbb3c	enable fp16/fast_fp16/bf16_mma on PH1 (llama/17551) * [MUSA] enable fp16/fast_fp16/bf16_mma on PH1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Update ggml/src/ggml-cuda/fattn-vec.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/fattn-vec.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/fattn-tile.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-12 17:53:12 +02:00
Aman Gupta	90ca4e0a07	ggml-cuda: add stricter checking for fusion (llama/17568) * ggml-cuda: make conditions for fusion more explicit * ggml-cuda: remove size check as std::equal already does it	2025-12-12 17:53:12 +02:00
Piotr Wilkin (ilintar)	43441ff58a	model : Qwen3 Next (llama/16095) * Qwen3 Next - cleaned up version * Whitespaces and stuff * Correct minor errors * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Misc. fixes. * Clean up code, add missing hybrid qualifier * Did someone transpose the SOLVE_TRI result matrix? Perhaps... * Whitespace * Proper tensors for cb calls * Use llama-graph.h vertical alignment * BROKEN: chunking * Set new tensors as inputs. * Proper chunk logic * It's the circle of life... * More shenanigans for n_seq > 1 * Nail in the coffin? * Fix Windows build * Eh, one fails on Windows, the other fails on Mac... just use general capture. * quant : cleanup * model : cleanup * qwen3 : cleanup * cont : cleanup * cont : cleanup * ggml : revert change * qwen3 : cleanup * cont : cleanup * Readd cmath * qwen3 : fix typo * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Usual suspects * fix my bad suggestion --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:11 +02:00
Johannes Gäßler	37e4c2ed3a	CUDA: no FP16 arithmetic for vector FA kernel (llama/17558)	2025-12-12 17:53:11 +02:00
Jeff Bolz	7a20963140	vulkan: Implement GGML_OP_TRI (llama/17503) * vulkan: Implement GGML_OP_TRI * check types match	2025-12-12 17:53:11 +02:00
Radoslav Gerganov	d26d1c8b85	rpc : cache and reuse compute graphs (llama/15405) Store the last computed graph and reuse it when possible. Also do not return response from GRAPH_COMPUTE and assume it always completes successfully. If this is not the case, the server closes the connection. This saves us a network round trip to the server.	2025-12-12 17:53:11 +02:00
yulo	f92d542d4d	HIP: enable mul_mat_f for RDNA4 (llama/17437) * enable mmf for rdna4 * move some mmvf to mmf * revert lds128 for wmma loading * Revert "revert lds128 for wmma loading" This reverts commit db9ae8b6b4738a5def5b393caa1611d52133e9b5. * Revert "enable mmf for rdna4" This reverts commit 698c9f24187b990e35c3b73a8067e5387e6ddbd4. * Revert "move some mmvf to mmf" This reverts commit 99b92bd6653cc8593607f641e44606391691792f. * enable mul_mat for rdna4 --------- Co-authored-by: zhang hui <you@example.com>	2025-12-12 17:53:11 +02:00
Piotr Wilkin (ilintar)	51e842d106	SOLVE_TRI CUDA kernel for small matrices (llama/17457)	2025-12-12 17:53:11 +02:00
Neo Zhang Jianyu	93bc8dc5a8	refactor pad_reflect_1d to make the UT case pass (llama/17204) Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-12-12 17:53:10 +02:00
Jeff Bolz	3727a36c48	vulkan: Implement SOLVE_TRI (llama/17486) * vulkan: Implement SOLVE_TRI * load B matrix through shared memory * use FLOAT_TYPE	2025-12-12 17:53:10 +02:00
matt23654	e682af7886	cuda : fix UMA detection on discrete GPUs. (llama/17537)	2025-12-12 17:53:10 +02:00
Alberto Cabrera Pérez	93f6cdb9c0	ggml-cpu: aarm64: q4_K repack gemm and gemv implementations (dotprod only) (llama/17494) * Enabled q4_K_4x8 path * Fixed generic Q4_K 8x4 implementation * wip: dotprod gemm * Working arm q4_K dotprod gemm Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Undo acc rename Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Q4_K arm dotprod gemm Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fix: q4_qs reinterpret from uint to int Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Removed comments * Fixed macro guards * Fixed unused vars in generic implementation * Fixed unused vars in 8x4 repack * Fixed unused vars in generic implementation, unneeded comment * Missing arch fallback for x86 * minor : style --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:10 +02:00
Acly	ac92424b59	vulkan : move contiguous checks to device_supports_op (llama/17490) * vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op * im2col: remove constraints on src0 (kernel input)	2025-12-12 17:53:10 +02:00
Jeff Bolz	310db24fca	vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (llama/17514)	2025-12-12 17:53:10 +02:00
lhez	74ef5dd1a9	opencl: add sqr, sqrt, mean and ssm_conv (llama/17476) * opencl: add sqr * opencl: add sqrt * opencl: add mean * opencl: add ssm_conv * opencl: add missing cl_khr_fp16 * opencl: do sqrt in f32 then convert to f16 for better precision	2025-12-12 17:53:09 +02:00
Alberto Cabrera Pérez	3de4372465	Fix chunks being too small with small matrix sizes (llama/17526)	2025-12-12 17:53:09 +02:00
Jeff Bolz	c8050e5fdc	vulkan: allow graph_optimize for prompt processing workloads (llama/17475)	2025-12-12 17:53:09 +02:00
Jeff Bolz	d8b61e05f8	vulkan: Implement top-k (llama/17418) * vulkan: Implement top-k Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10) and discards all but the top K. Repeat until only K are left. And there's a fast path when K==1 to just find the max value rather than sorting. * fix pipeline selection * vulkan: Add N-ary search algorithm for topk * microoptimizations	2025-12-12 17:53:09 +02:00
xctan	fb31a19797	ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16 (llama/17448) * ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16 * ggml-cpu : dedup scalar impl * Update ggml/src/ggml-cpu/vec.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:09 +02:00
Adrien Gallouët	8e3560c7ce	ggml : fix ARM feature verification (llama/17519) On arm64 with `cmake` version 3.31.6, the final feature verification fails: -- ARM detected flags: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs -- Performing Test GGML_MACHINE_SUPPORTS_dotprod -- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success -- Performing Test GGML_MACHINE_SUPPORTS_i8mm -- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success -- Performing Test GGML_MACHINE_SUPPORTS_sve -- Performing Test GGML_MACHINE_SUPPORTS_sve - Success -- Performing Test GGML_MACHINE_SUPPORTS_sme -- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed -- Performing Test GGML_MACHINE_SUPPORTS_nosme -- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success -- Checking for ARM features using flags: -- -U__ARM_FEATURE_SME -- -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme -- Performing Test HAVE_DOTPROD -- Performing Test HAVE_DOTPROD - Failed -- Performing Test HAVE_SVE -- Performing Test HAVE_SVE - Failed -- Performing Test HAVE_MATMUL_INT8 -- Performing Test HAVE_MATMUL_INT8 - Failed -- Performing Test HAVE_FMA -- Performing Test HAVE_FMA - Success -- Performing Test HAVE_FP16_VECTOR_ARITHMETIC -- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Failed -- Performing Test HAVE_SME -- Performing Test HAVE_SME - Failed -- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme We need to explicitly replace `;` with spaces from the list to make `CMAKE_REQUIRED_FLAGS` work correctly... Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:08 +02:00
Jiacheng (Jason) Chen	bb7223da8a	HIP: Patch failed testcase in WMMA-MMQ kernels for RDNA 4 (llama/17502) * patch failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) for enabling WMMA on RDNA4 * Quick clean up on mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162	2025-12-12 17:53:08 +02:00
hipudding	f0c54d47e1	CANN: Add MROPE and IMROPE support (llama/17401) * CANN: ROPE supports both MROPE and IMROPE. 1. Optimize the caching logic of rope_cache_init. 2. Add support for mRoPE and i-mRoPE. Note that on Ascend 910B devices, it is necessary to disable FA in CLIP and disable NZ-format conversion. These two issues are still under investigation. * Resolve review comments	2025-12-12 17:53:08 +02:00
Jeff Bolz	208450048c	vulkan: Implement GGML_OP_CUMSUM (llama/17479)	2025-12-12 17:53:08 +02:00
Georgi Gerganov	968db8bcfa	ggml : add ggml_top_k (llama/17365) * ggml : add ggml_top_k * cont : add ggml_argsort_top_k * metal : add top_k support * ggml : cleanup * tests : add virtual err() function for test_case * ggml : add comments	2025-12-12 17:53:08 +02:00
TianHao324	e00bb753d6	CANN: supports out_prod operator for F32 and F16 (llama/17406) Co-authored-by: tianhao <tianhao42@huawei.com>	2025-12-12 17:53:08 +02:00
Jeff Bolz	273e4fe7ae	vulkan: Use fewer rows for scalar FA when HS is not a multiple of 16 (llama/17455)	2025-12-12 17:53:07 +02:00
Jeff Bolz	553d57a4e7	vulkan: more FA details in vk_perf_logger (llama/17443)	2025-12-12 17:53:07 +02:00
Jiacheng (Jason) Chen	371a21865a	HIP: WMMA-MMQ kernels for RDNA 4 (llama/17156) * first commit naive test to enable mmq for RDNA4 * adding appropriate WMMA instructions * git rebase on top of master: fixing the correctness of the mat mul operations, updating layout mappings for RDNA4 * clean up merge conflicts * add comments and code clean up * PR clean up, addressed comments * enable MMQ fallback on RDNA4 * addressed comments: add guards in load generic, separate wmma branch for use_mmq function * Revert build-xcframework.sh * Formatting: remove trailing whitespace * revert CMake files * clean up after rebase: remove duplicated change, revert cmake files * clean up after rebase: revert changes from build-xcframework.sh * clean up: remove extra space line in mma.cuh * Revert "clean up: remove extra space line in mma.cuh" This reverts commit b39ed57c4529906466bd0bc7c2a86e08fc2f8bee.	2025-12-12 17:53:07 +02:00
Alberto Cabrera Pérez	f4ede89d24	ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) (llama/16739) * Enabled q4_K_8x8_q8_K path on ARM * wip: I8mm qs multiplication, pending bias * cpu : arm : REPACK gemm q4_K8x8 implementation Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Guard gemm with proper features, improved superblock scale and min calc Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * cpu: arm: Implemented REPACK gemv for Q4_K Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Removed completed TODO * Fixed missing guards when selecting optimal repack type for Q4_K Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed macro guard for gemv * Fixed wrong comment in GEMV * Fixed warning for unused variable * vdotq_s32 -> ggml_vdotq_s32 Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Clang-format issues * Apply suggestions from code review Co-authored-by: Diego Devesa <slarengh@gmail.com> * Removed unnecessary GGML_UNUSED * Fixed guards in q4_k gemm and gemv (repack) --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-12-12 17:53:07 +02:00
ixgbe	faf37ffe76	ggml: add RISC-V cpu-feats (llama/17461) * ggml: add RISC-V cpu-feats Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> * fix comment[1] --------- Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-12-12 17:53:07 +02:00
Max Krasnyansky	77d874b1c3	hexagon: add support for ROPE_NEOX (llama/17458)	2025-12-12 17:53:07 +02:00
Raul Torres	5ed0ddc458	CANN: Define `cann_graph_update_required` before macro (llama/17434) Description of the problem `cann_graph_update_required` is redundantly defined and initialized as `false` inside two mutually exclusive macro branches. Proposed solution Define it right before the macro so that it could serve both branches.	2025-12-12 17:53:06 +02:00
M. Mediouni	75cea7f8be	ggml-hexagon: Initial Hexagon v68/v69 support (llama/17394) * ggml-hexagon: fix build error with GCC Add stdexcept include to fix GCC build errors Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr> * ggml-hexagon: check VTCM acquire failures Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr> * ggml-hexagon: disable destination bypass on older than v73 v68 errors out if having bypass enabled when the VTCM is the destination. At least on v68 this made things actually work... not a proper fix though, so to look at later... Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr> * ggml-hexagon: add initial v68/v69 support v68 is the Hexagon revision notably used on the Snapdragon 8cx Gen 3 and the QCM6490. Also add support for v69. 8MB isn't a supported page size, so relax asked for page size constraint for HAP_compute_res_attr_set_vtcm_param_v2 to optimal. Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr> --------- Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>	2025-12-12 17:53:06 +02:00
nullname	621cb871b3	ggml-hexagon: add `hex_supported_buffer` for better buffer supported check (llama/17212) * hexagon: add buffer support checks for hexagon sessions * refactor: simplify buffer support checks in hexagon operations * hexagon: update buffer support checks to use tensor structure * refactor: streamline buffer initialization for DSP queue in hexagon operations * refactor: simplify buffer initialization in DSP queue for hexagon operations * refactor: optimize hex_supported_buffer function by fold expression * wip * refactor: simplify dspqueue_buffers_init function and its usage in hexagon operations * fix: improve nan handling at hvx_vec_fast_sigmoid_fp32_guard * refactor: optimize hvx_vec_inverse_fp32_guard for better nan handling * refactor: update hvx_vec_fast_sigmoid_fp32_guard to use adjusted exponent limits * refactor: modify hvx_vec_fast_sigmoid_fp32_guard to accept parameters for improved flexibility * refactor: update hvx_vec_exp_fp32_guard to accept max_exp and inf parameters to save some instructions * refactor: move hvx_vec_inverse_fp32_guard implementation to hvx-inverse.c for better perf	2025-12-12 17:53:06 +02:00
Sigbjørn Skjæret	61e0b7ed48	cuda : support non-contiguous i32 to i32 copy (llama/17326) * support non-contiguous i32 to i32 copy * add tests * rename cpy_flt to cpy_scalar and reindent params	2025-12-12 17:53:06 +02:00
Jeff Bolz	deb4958add	vulkan: remove a couple unnecessary switches (llama/17419)	2025-12-12 17:53:06 +02:00
yulo	fc6eae781d	HIP: RDNA4 tensor core support for MMF (llama/17077) * mmf for rdna4 * align the padding for rdna4 * forbid mul_mat_f for rdna4 * fix as comment * remove device kernels * add constexpr for early return * update based on review comment * change based on the review comment * pass compile error * keep code consistency --------- Co-authored-by: zhang hui <you@example.com>	2025-12-12 17:53:06 +02:00
lhez	5c0e4a9cc5	opencl: refine condition for kqv mm (llama/17392)	2025-12-12 17:53:05 +02:00
Jeff Bolz	cdc1a776be	vulkan: disable async for older Intel devices (llama/17369) * vulkan: disable async for older Intel devices * update detection logic * use name string for detection	2025-12-12 17:53:05 +02:00
Raul Torres	a009dc172c	CANN: Refactor `evaluate_and_capture_cann_graph` (llama/17333) * CANN: Refactor `evaluate_and_capture_cann_graph` Description of the problem * `matched_graph` is obtained even if graph mode is disabled. * End of graph capture and graph replay are unnecessarily placed in different `if` blocks. Proposed solution * Obtain `matched_graph` only if graph mode is enabled. * Place end of graph capture and graph replay inside the same `if` block. * Unify graph related comments. * Remove trailing whitespace	2025-12-12 17:53:05 +02:00
nullname	cb3ee1b098	ggml-hexagon: fix swiglu failure at `test-backend-ops` (llama/17344) * refactor: use hvx_vec_exp_fp32_guard_inf for overflow handling in hvx_exp_f32 * feat: add fast sigmoid function with overflow guard for fp32 * refactor: replace hvx_vec_inverse_fp32 with hvx_vec_inverse_fp32_guard_inf for improved overflow handling * feat: enhance hvx_add_scalar_f32 with overflow handling using infinity guard * wip * add HVX_Vector_Alias wip * wip * fix: improve handling of src1 tensor in glu_swiglu_fp32_per_thread function * fix nc * wip * wip * handle nan at inverse * wip * fix neg * wip * rename * fix hvx_vec_inverse_fp32_guard_inf to handle infinity and NaN cases correctly * wip * fix hvx_vec_inverse_fp32_guard_inf to handle NaN cases correctly * wip * wip * wip * fix output sign	2025-12-12 17:53:05 +02:00
Piotr Wilkin (ilintar)	46f893c2fa	ggml : Fix transposed SOLVE_TRI result (llama/17323) * Did someone transpose the SOLVE_TRI result matrix? Perhaps... * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-12 17:53:05 +02:00
Scott Fudally	510805e6c1	DGX Spark: UMA support (llama/17368) * DGX Spark: UMA support * Updates from PR feedback * More PR feedback cleanup * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Remove trailing whitespace * Update ggml/src/ggml-cuda/ggml-cuda.cu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:05 +02:00
Adrien Gallouët	2f20938b58	ggml : remove useless and error-prone variadic macros (llama/17399) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:04 +02:00
sudhiarm	51f5438089	kleidiai: fix zero-size array declaration (llama/17240)	2025-12-12 17:53:04 +02:00
ixgbe	1d3a525001	ggml-cpu:add RISC-V RVV (Zvfh) optimization for FP16 vector scaling (llama/17314) * ggml-cpu:add RISC-V RVV (Zvfh) optimization for FP16 vector scaling Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> * fix comment * fix comment 2 --------- Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-12-12 17:53:04 +02:00
Giuseppe Scrivano	24b14cad87	vulkan: implement ADD1, ARANGE, FILL, SOFTPLUS, STEP, ROUND, CEIL, FLOOR, TRUNC (llama/17319) * vulkan: initialize array * vulkan: implement ADD1 * vulkan: implement ARANGE * vulkan: implement FILL * vulkan: implement SOFTPLUS * vulkan: implement STEP * vulkan: implement ROUND * vulkan: implement CEIL * vulkan: implement FLOOR * vulkan: implement TRUNC * docs: update Vulkan ops Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-12-12 17:53:04 +02:00
Jeff Bolz	95d0b0b0cf	vulkan: support larger argsort (llama/17313) * vulkan: support larger argsort This is an extension of the original bitonic sorting shader that puts the temporary values in global memory and when more than 1024 threads are needed it runs multiple workgroups and synchronizes through a pipelinebarrier. To improve the memory access pattern, a copy of the float value is kept with the index value. I've applied this same change to the original shared memory version of the shader, which is still used when ncols <= 1024. * Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost * reduce loop overhead * run multiple cols per invocation, to reduce barrier overhead	2025-12-12 17:53:04 +02:00
Jeff Bolz	ae8865c6e6	vulkan: Add copy_transpose shader (llama/17371)	2025-12-12 17:53:04 +02:00
Aman Gupta	73d396826b	cuda: fix rope fusion for gemma3 (llama/17378)	2025-12-12 17:53:03 +02:00
Piotr Wilkin (ilintar)	746cbed20a	Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition (llama/17332) * Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition * Argh. * Making CISC happy ;) * Integrate CONT tests * Use loopy loop * Skip new tests for (B)F16 for now.	2025-12-12 17:53:03 +02:00
Ruben Ortlam	2097a9c1bd	vulkan: force full subgroups for flash attention to fix intel subgroup crash (llama/17356)	2025-12-12 17:53:03 +02:00
Jeremy Rand	27c69271c5	ggml-cpu: Don't pass -mpowerpc64 when -mcpu already implies it (llama/17308)	2025-12-12 17:53:03 +02:00
Chenguang Li	c137d11b81	CANN: fix acl_tensor_ptr usage in ASCEND_310P ROPE (llama/17347) * cann: fix acl_tensor_ptr usage in ASCEND_310P ROPE implementation Fix compilation errors in the ASCEND_310P-specific ROPE operation code by adding .get() calls when passing acl_tensor_ptr smart pointers to functions expecting raw aclTensor* pointers. This fixes the code that was missed in the previous refactoring commit (8981848) which changed ggml_cann_create_tensor() return type from aclTensor* to acl_tensor_ptr. * cann: format code	2025-12-12 17:53:03 +02:00
Jeff Bolz	24b981eff7	vulkan: support noncontig i32 copy (llama/17328)	2025-12-12 17:53:03 +02:00
Ruben Ortlam	b7dfced37f	vulkan: add log RTE support to fix Nvidia CI (llama/17320) * vulkan: add log RTE support to fix Nvidia CI * actually use the rte shader	2025-12-12 17:53:02 +02:00
Adrien Gallouët	9e429c47e1	cmake : fix ARM feature verification (llama/17170) * cmake : fix ARM feature verification Use check_cxx_source_compiles to prevent conflicts with the existing GGML_NATIVE detection code. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * cmake : unset __ARM_FEATURE when feature is disabled Signed-off-by: Adrien Gallouët <angt@huggingface.co> * cmake : fix scope, this is really a macro Signed-off-by: Adrien Gallouët <angt@huggingface.co> * arm_neon.h is useless Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:02 +02:00
Adrien Gallouët	bb88c2545f	ggml : add missing AVX512 feature checks (llama/17270) _mm512_cvtepu8_epi16 requires __AVX512BW__ _mm512_srli_epi16 requires __AVX512BW__ __builtin_ia32_inserti32x8 requires __AVX512DQ__ Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:02 +02:00
Daniel Bevenius	418314941e	ggml : remove dirty flag from version string (ggml/1391) This commit removes the "-dirty" suffix from the GGML version string. The motivation for this change is to ensure that the version string works with different ways of checking out ggml and using it in projects. By removing the dirty flag from the version string, we avoid potential artifacts like shared libraries getting a -dirty suffix in their names. Instead, if the project is built from a dirty git state, the dirty flag will be appended to the commit hash in the GGML_BUILD_COMMIT variable. This will enable users to still identify that the build was made from a modified/dirty state even though the version might match a "real" version. For example, the commit can be produced as follows: ```c++ printf("commit: %s\n", ggml_commit()); ``` Which would print the following for a dirty build: ```console commit: 781baf2a-dirty ``` Refs: https://github.com/ggml-org/ggml/pull/1363#issuecomment-3569691546	2025-12-12 17:53:00 +02:00
YangLe	961aec7384	metal : fix compile on macos 11 (#3533)	2025-11-20 13:54:54 +02:00
Georgi Gerganov	661567357c	metal : support I32 -> I32 copy (llama/17317)	2025-11-17 21:05:46 +02:00
Georgi Gerganov	74bb8a8b23	metal : faster argsort (llama/17315) * metal : faster argsort * cont : keep data in registers	2025-11-17 21:05:46 +02:00
Georgi Gerganov	57c0e6f8b6	metal : add cumsum (llama/17305)	2025-11-17 21:05:46 +02:00
hipudding	d3f5487464	CANN: Use smart pointers to manage ACL objects (llama/17238) * CANN: Use smart pointers to manage ACL objects Previously, ACL objects were managed via manual destruction, which led to multiple memory-leak issues during runtime. This patch replaces manual memory management with smart pointers so that ACL objects are properly released and ownership is clearly defined. Note that the ownership of an ACL object belongs to the function that creates it. Other internal functions should operate on these ACL objects using raw pointers to avoid unintended ownership transfers. Additionally, since aclTensorList automatically frees its contained aclTensor objects, any aclTensor added to a tensor list must release ownership to avoid double free operations. This PR also removes the asynchronous task submission mechanism. Due to changes in recent CANN versions, tiling time has significantly decreased. Even with a dual-thread submission model, the dispatch overhead still falls on the critical path, making async submission less beneficial. Moreover, aclGraph support provides a much better path to reducing operator dispatch latency. * CANN: resolve review comments	2025-11-17 21:05:46 +02:00
Pavels Zaicenkovs	9d95d9a1ee	vulkan: add LOG operation support for F32 and F16 (llama/17183) * vulkan: add LOG operation support for F32 and F16 Part of #14909. * vulkan: Fix LOG operation types * docs: Update operation support documentation for Vulkan LOG operation * vulkan: fix log_f16 shader * docs: restore missing LOG test cases and regenerate ops.md	2025-11-17 21:05:46 +02:00
Ruben Ortlam	f571655e8e	vulkan: fix MMQ quantize_y condition (llama/17301)	2025-11-17 21:05:46 +02:00
Georgi Gerganov	9549cc1051	metal : remove obsolete asserts (llama/17295)	2025-11-17 21:05:46 +02:00
lhez	a75525cad0	opencl: fix rms_norm_mul (llama/17250) * opencl: use subgroup reduce for reduction in rms_norm_mul * opencl: add comment about workgroup size	2025-11-17 21:05:46 +02:00
shaofeiqi	c78845bfa9	opencl: add kernel to handle mat mul in attention to improve encoding speed (llama/17181) * Add mul_mm_f16_f32_kq_kqv kernel * Add ggml_cl_mul_mat_kq_kqv_adreno func * fix whitespace * remove unused variable * remove redundant * refactor and clean up * remove trailing whitespace	2025-11-17 21:05:46 +02:00
shani-f	1fd63da9f2	sycl : unify unary kernels with a generic implementation and enable wide operator support (llama/17213) * SYCL: add generic unary op implementation for multiple ops (ABS/SGN/…); unify non-contiguous access * SYCL: update documentation and sycl.csv to reflect new unary op support * update ops.md after syncing SYCL.csv changes * Fix SYCL.csv merge conflict * Update ops.md after fixing SYCL.csv conflicts * Fix SYCL.csv tail after merge conflict and regenerate ops.md * Fix line endings and final newline in SYCL.csv * Remove TOPK_MOE entries from SYCL.csv as requested * Update ops.md after removing TOPK_MOE from SYCL.csv * Regenerated SYCL.csv and synced ops.md with upstream * Update ops.md using create_ops_docs.py	2025-11-17 21:05:46 +02:00
Jeff Bolz	ea3ebd8b0d	vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (llama/17287) These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.	2025-11-17 21:05:46 +02:00
Ruben Ortlam	7caea54450	vulkan: Replace 16-bit unpack8 calls to work around legacy Windows AMD driver bug (llama/17285)	2025-11-17 21:05:46 +02:00
Giuseppe Scrivano	4c4e663da0	vulkan: implement ABS and NEG (llama/17245) * docs: update Vulkan ops * vulkan: add NEG op * vulkan: add ABS op --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-17 21:05:46 +02:00
Jeff Bolz	e1846fc599	vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (llama/17244) * vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths * set allow_misalign	2025-11-17 21:05:46 +02:00
Jeff Bolz	9614a56314	vulkan: skip all-negative-inf blocks in FA (llama/17186)	2025-11-17 21:05:46 +02:00
Jeff Bolz	37d4bba152	vulkan: change graph_compute to be async and enable get_tensor_async (llama/17158) * vulkan: change graph_compute to be async and enable get_tensor_async This allows some additional CPU/GPU overlap for large pp workloads. Also seems to help a bit for token gen, maybe getting rid of a small bubble between graph_compute and get_tensor. Async set and copy functions seem to be very rarely used, so I didn't enable them because I didn't have a good way to test them. The async commands need to be ordered against each other, so put them all on the compute queue. The non-async commands still use the transfer queue. The fence for graph_compute/get_tensor_async is submitted and waited on in ggml_vk_synchronize. * fix thread safety errors * teardown context cleanly * Handle async read to non-pinned dst	2025-11-17 21:05:46 +02:00
Georgi Gerganov	523a6c27ea	metal : support argsort for ne00 > 1024 (llama/17247) * metal : refactor argsort * cont : sort chunks * cont : merge sorted buckets * cont : cleanup	2025-11-17 21:05:46 +02:00
Georgi Gerganov	b4d7df3ba2	metal : make the FA extra sizes consistent (llama/17143)	2025-11-17 21:05:46 +02:00
Alberto Cabrera Pérez	a81fbfc78e	ggml-cpu: handle 3d tensors in repack mat_mul (llama/17241) * ggml-cpu: handle 3d tensors in repack mul_mat * Removed unnecessary branch, removed need for <algorithm> * Fixed dst_ptr pointer in chunk + clang_format * GGML_ASSERT to check wdata within bounds * Accidental ggml.h inclusion * Improved GGML_ASSERT on wdata boundaries * Address performance regression in Qwen and llama.cpp due to chunking	2025-11-17 21:05:46 +02:00
Piotr Wilkin (ilintar)	3e684f26c1	ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (llama/17063) * Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Code review * Whitespace * Update tests/test-backend-ops.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> * This is actually sigmoid, duh. * Add CONST, remove TRI_KEEP, other changes from review * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Remove extra script * Update ggml/src/ggml.c Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> * moving changes from laptop [no ci] * pre-rebase * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Refactor tests * ggml : cleanup * cont : fix ggml_fill srcs * tests : add note * ggml : add ggml_fill_inplace * ggml : add asserts * ggml : fix ggml_fill constant cast * cont : ggml_tri minor * Use TENSOR_LOCALS * Fix regression from #14596, regenerate * Don't make commits at night... --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-17 21:05:46 +02:00
Ruben Ortlam	e8e0004fe5	vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (llama/17219) * vulkan: remove shell call from vulkan-shaders-gen tool * use string vector for command execution * Fix condition * use string, remove const_cast * Fix dependency file quotation on Windows --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-11-17 21:05:46 +02:00
Diego Devesa	210f0f860b	sched : fix reserve ignoring user tensor assignments (llama/17232)	2025-11-17 21:05:46 +02:00
ixgbe	91fa5b5cac	ggml-cpu : add RISC-V vector intrinsic support for silu and cvar operations (llama/17227) Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-11-17 21:05:46 +02:00
bagheera	265d326fa8	metal: accelerated conv2d (llama/17175) * metal: accelerated conv2d * cont : cleanup --------- Co-authored-by: bghira <bghira@users.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-17 21:05:46 +02:00
Georgi Gerganov	6a1d830dfd	Revert "ggml-cpu: handle 3d tensors in repack mat_mul (llama/17030)" (llama/17233) This reverts commit 1c398dc9eca9c366ce98deb0e6f3538e444ebc8a.	2025-11-17 21:05:46 +02:00
Diego Devesa	6a91780c3b	ggml-cpu : use template for argsort (llama/17222)	2025-11-17 21:05:46 +02:00
TecJesh	726912d1cb	CANN: Add cross_entropy_loss op support (llama/16886) * update L2_NORM op support * update L2_NORM op support * remove extra whitespace * cann: update cross_entropy_loss op support * remove trailing whitespaces * rebase the latest code in the main repository and remove the l2_norm operator that already exists in another pull request. * undo the l2_norm operator deletion	2025-11-17 21:05:46 +02:00
Aman Gupta	84275fc493	CUDA: fuse rope + set_rows (llama/16884) * CUDA: add fused rope * move k forward_expand up * create helper function instead of re-using params * make assert statement more in line with comment * rope_norm: coalesced writes to global mem	2025-11-17 21:05:46 +02:00
Johannes Gäßler	566c4c4469	CUDA: static assert to prevent misuse of memcpy_1 (llama/17198)	2025-11-17 21:05:46 +02:00
Georgi Gerganov	3810a6180b	ggml : use std::sort in ggml_argsort CPU implementation (llama/17211) * ggml : use std::sort in ggml_argsort CPU implementation * cont : add missing header	2025-11-17 21:05:46 +02:00
Alberto Cabrera Pérez	7df8515824	ggml-cpu: handle 3d tensors in repack mat_mul (llama/17030) * ggml-cpu: handle 3d tensors in repack mul_mat * Removed unnecessary branch, removed need for <algorithm> * Fixed dst_ptr pointer in chunk + clang_format * GGML_ASSERT to check wdata within bounds * Accidental ggml.h inclusion * Improved GGML_ASSERT on wdata boundaries	2025-11-17 21:05:46 +02:00
TecJesh	e8b66d9f94	CANN: Add L2_NORM op support (llama/16856) * update L2_NORM op support * update L2_NORM op support * remove extra whitespace	2025-11-17 21:05:46 +02:00
Neo Zhang Jianyu	8388350c66	fix ci crash about SSM_CONV (llama/17169) * fix ci crash * Update ggml-sycl.cpp * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-17 21:05:46 +02:00
Max Krasnyansky	6748d27f55	hexagon: various Op fixes (llama/17135) * hexagon: explicitly check for ops with zero nrows llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows. Somehow other backends seem to handle this without obvious explicit checks. In the hexagon case we need to check explicitly and skip them. * hexagon: introduce fastdiv, fix test-backend-ops for ADD/SUB/MUL Co-authored-by: chraac <chraac@gmail.com> * hexagon: use fastdiv in ADD_ID * hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs --------- Co-authored-by: chraac <chraac@gmail.com>	2025-11-17 21:05:46 +02:00
Eve	559091005a	disable rms norm mul rope for chips with no fp16 rte (llama/17134)	2025-11-17 21:05:46 +02:00
ixgbe	cd8f64d1b5	ggml-cpu : add RISC-V RVV (Zvfh) optimization for FP16 to FP32 conversion (llama/17161) Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-11-17 21:05:46 +02:00
duduta	1cefb03571	ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (llama/16805) * extract rotate_pairs logic from ggml_compute_forward_rope_f32 * templateify ggml_compute_forward_rope_f32 and _f16 * abort when rope type not supported, remove GLM from test-rope * add imrope branch to switch * add rope tests for perf * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-17 21:05:46 +02:00
Charles Xu	3920ecce3a	kleidiai: add optimized per-channel kernels for Q8_0 (llama/16993)	2025-11-17 21:05:46 +02:00
Mike Abbott	c01bf73dd1	cmake : add version to all shared object files (llama/17091) When compiling llama.cpp in Yocto, it fails QA checks because the generated so files aren't versioned. This applies a version to all generated so files, allowing the package to build without errors.	2025-11-17 21:05:46 +02:00
lhez	46615d74d3	opencl: add fastdiv and use it in set_rows, ported from cuda (llama/17090) * opencl: add fastdiv for mm q8_0 * opencl: use uint4 for fastdiv vals * opencl: use fastdiv for set_rows * opencl: do not use fastdiv for q8_0 mm	2025-11-17 21:05:46 +02:00
Max Krasnyansky	ccf525baf0	cpu: skip NOPs to avoid barriers (llama/17133) * cpu: skip NOPs to avoid barriers * cpu: use ggml_op_is_empty	2025-11-17 21:05:46 +02:00
Georgi Gerganov	40aebfe8bf	metal : cap threadgroups size of set_rows (llama/17146)	2025-11-17 21:05:46 +02:00
Adrien Gallouët	86be60093e	ggml-cpu : inspect -march and -mcpu to find the CPU (llama/16333) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-11-17 21:05:46 +02:00
Ruben Ortlam	ef71d83b76	vulkan: check glslc executable string (llama/17144)	2025-11-17 21:05:46 +02:00
Ruben Ortlam	43f2c1ff54	vulkan: fix validation issue introduced by #16868 (llama/17145)	2025-11-17 21:05:46 +02:00
Georgi Gerganov	bb92c79f56	metal : enable tensor API for A19 (llama/17087)	2025-11-17 21:05:46 +02:00
fj-y-saito	4fea91f06e	arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… (#15277) * add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_q8_K * Surround SVE function with compiler directive * fix compile switch * fix coding style * ggml : fix indent --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-17 21:05:46 +02:00
Acly	58a97d988f	cuda/vulkan : bicubic interpolation (llama/17022) * vulkan : implement upscale with bicubic interpolation * cuda : implement upscale with bicubic interpolation * tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests * adapt OpenCL backend to not support the OP in that case so tests don't fail * print scale mode & flags in test-backend-ops	2025-11-17 21:05:46 +02:00
Ruben Ortlam	2e04e7a906	vulkan: fix memory allocations (llama/17122)	2025-11-17 21:05:46 +02:00
Ruben Ortlam	1993e397bb	vulkan: iGPU memory reporting fix (llama/17110) * vulkan: use all device-local heaps for memory availability reporting Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com> * use all available heaps for iGPU memory reporting * Allow multiple memory types per buffer request for devices with split heaps --------- Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-09 23:38:03 +02:00
Ruben Ortlam	ee8349cf10	vulkan: fix mmq out of bounds reads (llama/17108) * vulkan: fix mmq out of bounds reads, streamline outdated matmul host code * fix mul_mat_id quantization call * Fix compiler warnings	2025-11-09 23:38:03 +02:00
Jeff Bolz	db98e8c5b4	vulkan: fuse mul_mat_id + mul (llama/17095) * vulkan: fuse mul_mat_id + mul This comes up in qwen3 moe. * split mul_mat_id fusion tests into a separate class	2025-11-09 23:38:03 +02:00
Georgi Gerganov	a4339e2ea7	metal : retain src and dst buffers during async ops (llama/17101)	2025-11-09 23:38:03 +02:00
Jeff Bolz	6de3404773	vulkan: Use spec constants for conv2d s/d/p and kernel W/H (llama/16978) * vulkan: Use spec constants for conv2d s/d/p and kernel W/H Also add some additional unroll hints, which seems to help. * lock around map lookup	2025-11-09 23:38:03 +02:00
Aman Gupta	8967c9ad9b	Revert "CUDA: add expert reduce kernel (ggml/16857)" (llama/17100)	2025-11-09 23:38:03 +02:00
Aman Gupta	522b9bce33	CUDA: skip fusion for repeating adds in bias (llama/17080)	2025-11-09 23:38:03 +02:00
SavicStefan	0caa32c772	vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (llama/16636) Signed-off-by: Stefan Savic <stefan.savic@huawei.com> Co-authored-by: Stefan Savic <stefan.savic@huawei.com>	2025-11-09 23:38:03 +02:00
Aleksei Nikiforov	3c975ad523	ggml: disable vxe for cross-compilation by default (llama/16966) Otherwise compilation will fail due to enabling -mvx -mzvector and not setting corresponding -march options.	2025-11-09 23:38:03 +02:00
Jeff Bolz	257ce2f5c0	vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (llama/16977) This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.	2025-11-09 23:38:03 +02:00
Jeff Bolz	4eef518167	vulkan: Fix test-thread-safety crashes (llama/17024) The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the same time, which needs to hold the lock. To be safe, hold the lock for all of ggml_vk_load_shaders.	2025-11-09 23:38:03 +02:00
Johannes Gäßler	358f77aca7	CUDA: fix MMQ stream-k fixup ne1 indices (llama/17089)	2025-11-09 23:38:03 +02:00
Reese Levine	78ea6c5b67	ggml webgpu: faster matrix multiplication/matrix-vector multiplication (llama/17031) * Faster tensors (llama/8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings	2025-11-09 23:38:03 +02:00
bssrdf	547724b0a5	CUDA: properly handle nb00=nb02 case for cpy (llama/17081)	2025-11-09 23:38:03 +02:00
Acly	11543bf446	vulkan : refactor buffer handling in vk_op_f32 (llama/16840) * vulkan : refactor/simplify buffer handling in vk_op_* functions * Combine UMA handling into ggml_vk_tensor_subbuffer	2025-11-09 23:38:03 +02:00
Johannes Gäßler	af8a88792f	CUDA: fix should_use_mmvf for ne11 == 1 (llama/17085) * CUDA: fix should_use_mmvf for ne11 == 1 * Apply suggestion from @am17an Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2025-11-09 23:38:03 +02:00
Adrien Gallouët	a1746097bc	Revert "ggml-cpu: detect correct cpu flags for arm64 (llama/16229) (#16239)" (llama/17084) This reverts commit 7c23f3f0d4b9f5d6ea140756eb694b562d5acebb.	2025-11-09 23:38:03 +02:00
iron	512592513c	ggml-cpu: detect correct cpu flags for arm64 (ggml/16229) (llama/16239) When using GCC 9 and GCC 12 on the arm64 platform of ubuntu 2004, the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags, which results in compilation failures for certain extended instructions, but the correct CPU flags can be obtained by using gcc -march. Signed-off-by: lizhenneng <lizhenneng@kylinos.cn> Co-authored-by: lizhenneng <lizhenneng@kylinos.cn>	2025-11-09 23:38:03 +02:00
xctan	5bce732795	ggml-cpu : optimize RVV q2_k and q3_k kernels (llama/16887)	2025-11-09 23:38:03 +02:00
Johannes Gäßler	b5d6fa438f	CUDA: fix crash on uneven context without FA (llama/16988)	2025-11-09 23:38:03 +02:00
Georgi Gerganov	32ed574370	metal : initial Metal4 tensor API support (llama/16634) * metal : rework mat-mat multiplication * metal : initial Metal4 support * cont * metal : detect tensor support * cont : better ifdefs * metal : support tensors in mul_mm_id * metal : add env for disabling tensor API * tests : restore * metal : remove unused constants * metal : fix check for bfloat tensor support * cont : handle API incompatibilities * cont : handle even more incompatibilities * metal : use tensor API only on M5 and later	2025-11-09 23:38:03 +02:00
YehuditE	45588b272e	sycl: add CONCAT operator support (llama/16047) * sycl: add CONCAT operator support * cleanup: remove stray lines added by mistake * fix: code format issues in concat.cpp and tests/test-backend-ops.cpp * chore: fix editorconfig violations * cleanup: drop unnecessary i16 type support * docs: update sycl-csv and regenerate ops.md * update docs/ops.md * fix: adapt to upstream master changes after rebase * fix: remove empty files * fix: drop whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
l3utterfly	b3324ae7d1	ggml-hexagon: graceful fallback for older socs where rpcmem_alloc2 and FASTRPC_GET_URI is unsupported (llama/16987) * support older socs where FASTRPC_GET_URI is unsupported * added graceful fallback when FASTRPC_GET_URI call fails * use weak symbols instead of loading libcdsprpc.so dynamically * Add weak pragma for rpcmem_alloc2 * Remove weak declaration for rpcmem_alloc2 in ggml-hexagon.cpp Removed weak declaration for rpcmem_alloc2. * Enforce ndev to 1 for archs below v75 Force ndev to 1 for SoCs architectures lower than v75.	2025-11-09 23:38:03 +02:00
bssrdf	13cd906501	improve CUDA cpy memory bandwidth when copying transposed tensor (llama/16841) * WIP * added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added showing improved memory bandwidth * added BF16 support * more strict check to make sure src0 is a transpose * reformulated to handle more complicated transpose cases * bring back 2D transpose for higher performance * allow build on windows * transpose copy more shapes * minor tweak * final clean up * restore some test cases * keep only the kernel for true transposed case; updated with review suggestions * make CI happy * remove headers not needed * reduced bank conflicts for fp16 and bf16 * add missing const* * now bank conflicts free * use padding instead of swizzling --------- Co-authored-by: bssrdf <bssrdf@gmail.com>	2025-11-09 23:38:03 +02:00
Jeff Bolz	558a04c9c7	vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (llama/16919)	2025-11-09 23:38:03 +02:00
Reese Levine	e734b5d6ef	ggml webgpu: minor set rows optimization (llama/16810) * Add buffer label and enable dawn-specific toggles to turn off some checks * Minor set_rows optimization (ggml/4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Remove some comments * Implement overlap binary operators * Revert "Implement overlap binary operators" This reverts commit ed710b36f51ab3f53fa13db15c1685dc8678a32a. * Disable support for non-contiguous binary_op tensors and leave note for future support --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com> Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>	2025-11-09 23:38:03 +02:00
nullname	44e77ccee6	refactor: replace sprintf with snprintf for safer string handling in dump functions (llama/16913)	2025-11-09 23:38:03 +02:00
Jeff Bolz	1672d41ab0	vulkan: remove the need for the dryrun (llama/16826) * vulkan: remove the need for the dryrun Allocate pipelines and descriptor sets when requested. Reallocate the prealloc buffers when needed, and flush any pending work before reallocating. For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed. * remove dryrun parameters	2025-11-09 23:38:03 +02:00
Acly	997fdde0c4	ggml-cpu : bicubic interpolation (llama/16891)	2025-11-09 23:38:03 +02:00
Noah	52e43a2fa5	Fix garbled output with REPACK at high thread counts (llama/16956) * Fix garbled output with REPACK at high thread counts Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (model-dependent threshold). The issue occurred because with high thread counts, the code forced chunk count to equal thread count, creating many small chunks. After aligning these chunks to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces minimum chunk sizes based on NB_COLS and caps maximum chunk count to prevent creating too many tiny chunks, ensuring proper alignment without overlaps. * Update ggml/src/ggml-cpu/repack.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cpu/repack.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-09 23:38:03 +02:00
Aman Gupta	e51a2f90fe	CUDA: avoid mul + bias fusion when doing fusion (llama/16935)	2025-11-09 23:38:03 +02:00
lhez	f856023f46	opencl: support imrope (llama/16914) * opencl: support imrope * opencl: fix whitespace	2025-11-09 23:38:03 +02:00
theo77186	82ede64cd0	ggml: CUDA: add head size 72 for flash-attn (llama/16962)	2025-11-09 23:38:03 +02:00
Jinyang He	79801188f7	ggml : LoongArch fixes (llama/16958) * Fix test-quantize-fns f16 and q4_0 failed when use LSX * Fix LoongArch set float intrinsic when use LSX/LASX	2025-11-09 23:38:03 +02:00
shani-f	f1da026bb8	SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster) Feature/sycl repeat back opt (#16869) * SYCL repeat_back v1 — add core op + switch case * Implement repeat_back SYCL operation and minor fixes * SYCL: optimize repeat_back kernel * Remove Hebrew comment from repeat_back.cpp * Remove comments for code clarity Removed comments to clean up the code. * Fix formatting in ggml-sycl.cpp * Formatted lambda according to legacy style. No logic changes * Remove blank line in repeat_back.cpp Remove unnecessary blank line before assigning acc to dst_dd.	2025-11-09 23:38:03 +02:00
Georgi Gerganov	39834fde1b	clip : use FA (llama/16837) * clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-11-09 23:38:03 +02:00
mnehete32	5ed97df483	CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (llama/16917)	2025-11-09 23:38:03 +02:00
Aaron Teo	84854d246a	ggml: add s390x cpu-feats (llama/16774)	2025-11-09 23:38:03 +02:00
Jeff Bolz	2001457367	vulkan: Fix multi_add invalid descriptor usage (llama/16899)	2025-11-09 23:38:03 +02:00
Jeff Bolz	90be9c9de1	vulkan: fuse mul_mat+add and mul_mat_id+add_id (llama/16868) * vulkan: fuse mul_mat+add and mul_mat_id+add_id The fusion is only applied for the mat-vec mul paths. * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix 32b build --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
Oliver Simons	7d55fba06f	CUDA: Remove unneeded bias/gate dims in fused mmvq (llama/16858) * CUDA: Remove unneeded bias/gate dims in fused mmvq Pointed out [here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989) that only a single value is needed per target col per thread * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Fix "Error 991-D: extra braces are nonstandard" during compilation --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-11-09 23:38:03 +02:00
Johannes Gäßler	52e1bbb554	CUDA: Volta tensor core support for MMF (llama/16843) * CUDA: Volta tensor core support for MMF * more generic checks for hardware support * Update ggml/src/ggml-cuda/mmf.cuh Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2025-11-09 23:38:03 +02:00
Georgi Gerganov	addda802dd	ggml : fix conv2d_dw SVE path (ggml/1380) * Fix test-conv2d-dw failure on ARM SVE by using runtime vector length The ggml_compute_forward_conv_2d_dw_cwhn function was using a hardcoded GGML_F32_EPR (8) for SIMD vectorization, but on ARM SVE the actual vector length varies by hardware. This caused incorrect computation when processing CWHN layout tensors on ARM machines. Fix by using svcntw() to get the runtime SVE vector length instead of the compile-time constant. Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com> * ci : reduce sam score threshold * ci : update bbox checks for sam test --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>	2025-11-09 23:38:03 +02:00
Aman Gupta	7d60b431a5	CUDA: add expert reduce kernel (llama/16857) * CUDA: add expert reduce kernel * contiguous checks, better formatting, use std::vector instead of array * use vector empty instead of size Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-11-09 23:38:03 +02:00
Jeff Bolz	a9ba988e56	vulkan: disable spirv-opt for rope shaders (llama/16872)	2025-11-09 23:38:03 +02:00
Masato Nakasaka	e2b3eca0dc	vulkan: Fix crash when FP16 mul_mat accumulation is not supported (llama/16796) * Experimenting crash fix * added assert for aborting and fixed comment * changed to check if a pipeline is empty or not * Moved function in class definition * replaced with is_empty * Modified is_empty to check only unaligned pipelines	2025-11-09 23:38:03 +02:00
Ruben Ortlam	7ed570ee94	vulkan: fix shmem overrun in mmq id shader (llama/16873) * vulkan: fix shmem overrun in mmq id shader * metal : fix mul_mm_id --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-09 23:38:03 +02:00
l3utterfly	486d39c2cb	ggml-hexagon: respect input size when getting/setting tensor data (llama/16836) * respect input size when getting/setting tensor data allows partial repacking/copying when get tensor size is smaller than the actual tensor * Removed duplicate repack_mxfp4_mxfp4x4x2 function	2025-11-09 23:38:03 +02:00
lhez	7fdd53ac0d	opencl: fix boundary handling for mul_mm (llama/16875)	2025-11-09 23:38:03 +02:00
Max Krasnyansky	ffe1c832bd	cpu: introduce chunking for repack matmuls and enable matmul-id chunking on ARM64 (llama/16833) Very similar implementation to the flash-attention chunking, with similar benefits.	2025-11-09 23:38:03 +02:00
JJJYmmm	e1780b209d	model: add support for qwen3vl series (llama/16780) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit f321b9fdf13e59527408152e73b1071e19a87e71. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
Max Krasnyansky	f1fdb91e95	cpu: introduce chunking for flash attention (llama/16829) Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop on top that handles the chunks.	2025-11-09 23:38:03 +02:00
Sigbjørn Skjæret	f7dfa39104	cuda : fix argsort with 64k+ rows (llama/16849)	2025-11-09 23:38:03 +02:00
Jeff Bolz	887d984558	vulkan: Handle argsort with a large number of rows (llama/16851)	2025-11-09 23:38:03 +02:00
Oliver Simons	41f4daca57	Hide latency of bias and gate-loading (llama/16847) This is realised by loading them into registers before computation of the dot-product, effectively batching them together with said dot-product. As a lot of threads are alive here, the warp scheduler has enough threads available to effectively hide the cost of additionally loading those two floats.	2025-11-09 23:38:03 +02:00
Jeff Bolz	efe8099268	vulkan: Fuse rope+set_rows (llama/16769) This pattern appears in a lot of models, the rope operation is applied right before storing into the KV cache (usually on the K tensor). Add a path to some of the rope shaders that computes the destination address based on the set_rows tensor. Compile variants of the shader with D_TYPE of f16 (the usual KV cache type). Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs the fourth for the row indices. Add fused_ops_write_mask to indicate which intermediate tensors need to write their results to memory. Skipping writing the roped K value helps to allow more nodes to run concurrently. Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It rarely starts out that way in the graph. Add new backend tests.	2025-11-09 23:38:03 +02:00
Jeff Bolz	35a3fda240	vulkan: Update topk_moe fusion to handle gpt's late softmax (llama/16656) * vulkan: Update topk_moe fusion to handle gpt's late softmax Based on #16649. * Add ggml_check_edges * Add sync logging to show fusion effects * handle clamp added in #16655 * Update ggml/src/ggml-impl.h Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-11-09 23:38:03 +02:00
Ruben Ortlam	bc944bddc8	Vulkan MMQ Integer Dot Refactor and K-Quant support (llama/16536) * vulkan: add mmq q2_k integer dot support * Refactor mmq caching * Reduce mmq register use * Load 4 quant blocks into shared memory in one step * Pack q2_k blocks into caches of 32 * Use 32-bit accumulators for integer dot matmul * Add q4_k mmq * Add q3_k mmq * Add q5_k mmq * Add q6_k mmq * Add mxfp4 mmq, enable MMQ MUL_MAT_ID * Fix mmv dm loads	2025-11-09 23:38:03 +02:00
Max Krasnyansky	4d74160c9a	Hexagon Op queue & dispatch optimizations (llama/16820) * hexagon: remove dspqueue callbacks and do all read processing inplace * hexagon: there is no need to ref/deref the buffers at this point We're not going to release the buffers without flushing the session queue. So there is no need to inc/dec the refcounts for every request. We also don't need to include those bufs in the response. * hexagon: bump the thread count in the adb wrapper scripts We can use more CPU cores now that the dedicated dspqueue polling threads are not used (ie no contention). Also enable more aggressive polling for now since we still map Flash Attention (and a few other kernels) to the CPU and those dspqueue threads were keeping the CPU cores at higher clock freqs. * hexagon: add lhez as the second code owner	2025-11-09 23:38:03 +02:00
Aman Gupta	6051c704a0	CUDA: use fastdiv in set-rows (llama/16834) * CUDA: use fastdiv in set-rows * add assert about value fitting in u32	2025-11-09 23:38:03 +02:00
Jeff Bolz	82a23ca9c4	vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (llama/16793) This lets the copy to the destination device use the host-visible vidmem optimization.	2025-11-09 23:38:03 +02:00
Aman Gupta	5c316c48f7	CUDA: Fix bug in topk-moe for gpt-oss (llama/16821) * CUDA: Fix bug in topk-moe for gpt-oss When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse nodes (because the nodes are not used elsewhere in the graph), but it actually doesn't fuse in the actual gpt-oss * fix for qwen3 too * change ifndef to ifdef	2025-11-09 23:38:03 +02:00
YaelLogic	5850c952e5	sycl: add RMS_NORM_BACK operation support (llama/16808) * sycl: add RMS_NORM_BACK operation support * sycl: rms_norm_back: add dual reduction paths (FP64 and FP32) and savepoint before further changes * sycl: add RMS_NORM_BACK support Implement RMS_NORM_BACK for the SYCL backend using FP32 compensated parallel reduction. Minimal docs updates (ops.md / SYCL.csv). * revert: restore .gitignore and tools/run/CMakeLists.txt to upstream * revert: restore tests/CMakeLists.txt to upstream * sycl: optimize rms_norm_back * fix: restore SYCL.csv to correct state with RMS_NORM_BACK support * Update ggml/src/ggml-sycl/norm.cpp Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * fix: remove trailing whitespace and add missing newline (EditorConfig) --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2025-11-09 23:38:03 +02:00
YaelGitAccount	a983c9219d	cuda: add SET operation support (llama/16804) * feat(cuda): add GGML_OP_SET support Implement CUDA kernel for SET operation with f32 support. All tests passing (14598/14598). * cuda(set): add I32 support; keep F32 * refactor(cuda): use ggml_cuda_cpy to unify SET operator logic and remove code duplication * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-cuda/set.cu Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
l3utterfly	f863a42d97	initialise buffer.device in ggml_hexagon_session (llama/16816)	2025-11-09 23:38:03 +02:00
Chenguang Li	cb39359e7f	CANN: Improve device ID handling and aclnnArange checks (llama/16752) * cann: improve device ID handling and aclnnArange checks - Stop relying on CANN's internal device ID retrieval; use a global variable instead. - Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions. * cann: use thread local var	2025-11-09 23:38:03 +02:00

2159 Commits