whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Herman Semenoff	3794a0d3b6	ggml-cpu: remove duplicate conditional check 'iid' (llama/17650)	2025-12-12 17:53:16 +02:00
Johannes Gäßler	7adbcafb6c	CUDA: generalized (mma) FA, add Volta support (llama/17505) * CUDA: generalized (mma) FA, add Volta support * use struct for MMA FA kernel config --------- Co-authored-by: Aman Gupta <aman>	2025-12-12 17:53:16 +02:00
Georgi Gerganov	4a00f2e3a4	metal : fix data race in pipeline library (llama/17731)	2025-12-12 17:53:16 +02:00
Reese Levine	d263bdbfb6	ggml webgpu: add support for emscripten builds (llama/17184) * Faster tensors (llama/8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings * Wasm (llama/9) * webgpu : fix build on emscripten * more debugging stuff * test-backend-ops: force single thread on wasm * fix single-thread case for init_tensor_uniform * use jspi * add pthread * test: remember to set n_thread for cpu backend * Add buffer label and enable dawn-specific toggles to turn off some checks * Intermediate state * Fast working f16/f32 vec4 * Working float fast mul mat * Clean up naming of mul_mat to match logical model, start work on q mul_mat * Setup for subgroup matrix mat mul * Basic working subgroup matrix * Working subgroup matrix tiling * Handle weirder sg matrix sizes (but still % sg matrix size) * Working start to gemv * working f16 accumulation with shared memory staging * Print out available subgroup matrix configurations * Vectorize dst stores for sg matrix shader * Gemv working scalar * Minor set_rows optimization (llama/4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Working subgroup matrix code for (semi)generic sizes * Remove some comments * Cleanup code * Update dawn version and move to portable subgroup size * Try to fix new dawn release * Update subgroup size comment * Only check for subgroup matrix configs if they are supported * Add toggles for subgroup matrix/f16 support on nvidia+vulkan * Make row/col naming consistent * Refactor shared memory loading * Move sg matrix stores to correct file * Working q4_0 * Formatting * Work with emscripten builds * Fix test-backend-ops emscripten for f16/quantized types * Use emscripten memory64 to support get_memory * Add build flags and try ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * Remove extra whitespace * Move wasm single-thread logic out of test-backend-ops for cpu backend * Disable multiple threads for emscripten single-thread builds in ggml_graph_plan * Fix .gitignore * Add memory64 option and remove unneeded macros for setting threads to 1 --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-12-12 17:53:16 +02:00
Jeff Bolz	86cb5ab93f	vulkan: Reduce temporary memory usage for TOP_K (llama/17623) - Compute row size for the temp buffer based on the output of the first pass. - Update shader addressing math to use the output row size - Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k" For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer from about 3.2MB to 500KB.	2025-12-12 17:53:15 +02:00
xiaobing318	fffdf679d4	cmake : add utf8 compilation options for msvc (llama/17682)	2025-12-12 17:53:15 +02:00
Adrien Gallouët	16688c6d2c	ggml : use svcntb() for SVE vector length detection (llama/17474) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:15 +02:00
TianHao324	a64d46a529	CANN: Disable Ger operator of OUT_PROD on 310p device (llama/17563)	2025-12-12 17:53:15 +02:00
Daniel Bevenius	201b910743	ggml : remove redundant n_copies check when setting input/output (llama/17612) This commit removes a redundant check for sched->n_copies > 1 when setting input and output flags on tensor copies in ggml_backend_sched_split_graph. The motivation for this change is to clarify the code as the outer if statement already performs this check.	2025-12-12 17:53:15 +02:00
Adrien Gallouët	e2537b4af3	ggml : add fallback definition for HWCAP2_SVE2 (llama/17683) This aligns with other HWCAP2 feature flags See #17528 Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:15 +02:00
Aman Gupta	4c89232b5c	ggml-cuda: reorder only relevant nodes (llama/17639)	2025-12-12 17:53:14 +02:00
Neo Zhang Jianyu	26732d28c4	enhance argsort for UT (llama/17573) Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2025-12-12 17:53:14 +02:00
Georgi Gerganov	32090930f7	metal : add FA head size 48 (llama/17619)	2025-12-12 17:53:14 +02:00
Georgi Gerganov	7cd3de89bf	ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler (llama/17617)	2025-12-12 17:53:14 +02:00
Aman Gupta	6cc2d0534f	llama-graph: avoid expand_forward for fusion (llama/17633)	2025-12-12 17:53:14 +02:00
Tarek Dakhran	0defeee679	model: LFM2-VL fixes (llama/17577) * Adjust to pytorch * Add antialiasing upscale * Increase number of patches to 1024 * Handle default marker insertion for LFM2 * Switch to flag * Reformat * Cuda implementation of antialias kernel * Change placement in ops.cpp * consistent float literals * Pad only for LFM2 * Address PR feedback * Rollback default marker placement changes * Fallback to CPU implementation for antialias implementation of upscale	2025-12-12 17:53:14 +02:00
Gilad S.	706647202e	ggml: fix: macOS build with `-DGGML_BACKEND_DL=ON` (llama/17581)	2025-12-12 17:53:13 +02:00
Aman Gupta	e68ee6e281	CUDA: add stream-based concurrency (llama/16991) * CUDA: add stream-based concurrency * HIP: fix hipStreamWaitEvent define and nodiscard warnings * ggml-cuda: fix fusion inside stream * ggml-cuda: fix bug w.r.t first stream launch * ggml-cuda: format * ggml-cuda: improve assert message * ggml-cuda: use lambda instead of duplicating code * ggml-cuda: add some more comments * ggml-cuda: add more detailed comments about concurrency * ggml-cuda: rename + remove unused var * ggml-cuda: fix condition for stream launch * ggml-cuda: address review comments, add destructor * common.cuh: add is_valid for concurrent events * common.cuh: make comment better * update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * common.cuh: fix lower_bound condition + remove join_node data from write_ranges * ggml-cuda: fix overlap condition + shadowing parameter --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-12 17:53:13 +02:00
Mahekk Shaikh	2e4a7a21fa	cuda : add error checking for cudaMemcpyAsync in argsort (llama/17599) * cuda : add error checking for cudaMemcpyAsync in argsort (llama/12836) * fix indentation	2025-12-12 17:53:13 +02:00
Acly	2258930c2e	vulkan : fix FA mask load with bounds check (coopmat2) (llama/17606)	2025-12-12 17:53:13 +02:00
Neo Zhang	a3459484bf	sycl : support to malloc memory on device more than 4GB, update the doc and script (llama/17566) Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2025-12-12 17:53:13 +02:00
ixgbe	28dff06555	ggml: replace hwcap with riscv_hwprobe for RVV detection (llama/17567) Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-12-12 17:53:12 +02:00
Ruben Ortlam	2fcc0a3a9f	Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (llama/16900) * vulkan: split mul_mmq_funcs for mul_mat_vecq use * add mxfp4 mmvq * add q2_k mmvq * add q3_k mmvq * add q4_k and q5_k mmvq * add q6_k mmvq * handle 4x4 quants per mmvq thread * enable MUL_MAT_ID mmvq support * enable subgroup optimizations for mul_mat_vec_id shaders * device tuning * request prealloc_y sync after quantization * fix indentation * fix llvmpipe test failures * fix mul_mat_id mmvq condition * fix unused variable warning	2025-12-12 17:53:12 +02:00
Jeff Bolz	dbf8766ffa	vulkan: improve topk perf for large k, fix overflow in unit tests (llama/17582)	2025-12-12 17:53:12 +02:00
Diego Devesa	463003e76c	ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (llama/17276) * ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched Enabled in ggml-ci for testing. * llama : update worst-case graph for unified cache * ci : disable op offload in some tests * fix spelling --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:12 +02:00
R0CKSTAR	c372bdbb3c	enable fp16/fast_fp16/bf16_mma on PH1 (llama/17551) * [MUSA] enable fp16/fast_fp16/bf16_mma on PH1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Update ggml/src/ggml-cuda/fattn-vec.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/fattn-vec.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/fattn-tile.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-12 17:53:12 +02:00
Aman Gupta	90ca4e0a07	ggml-cuda: add stricter checking for fusion (llama/17568) * ggml-cuda: make conditions for fusion more explicit * ggml-cuda: remove size check as std::equal already does it	2025-12-12 17:53:12 +02:00
Piotr Wilkin (ilintar)	43441ff58a	model : Qwen3 Next (llama/16095) * Qwen3 Next - cleaned up version * Whitespaces and stuff * Correct minor errors * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Misc. fixes. * Clean up code, add missing hybrid qualifier * Did someone transpose the SOLVE_TRI result matrix? Perhaps... * Whitespace * Proper tensors for cb calls * Use llama-graph.h vertical alignment * BROKEN: chunking * Set new tensors as inputs. * Proper chunk logic * It's the circle of life... * More shenanigans for n_seq > 1 * Nail in the coffin? * Fix Windows build * Eh, one fails on Windows, the other fails on Mac... just use general capture. * quant : cleanup * model : cleanup * qwen3 : cleanup * cont : cleanup * cont : cleanup * ggml : revert change * qwen3 : cleanup * cont : cleanup * Readd cmath * qwen3 : fix typo * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Usual suspects * fix my bad suggestion --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:11 +02:00
Johannes Gäßler	37e4c2ed3a	CUDA: no FP16 arithmetic for vector FA kernel (llama/17558)	2025-12-12 17:53:11 +02:00
Jeff Bolz	7a20963140	vulkan: Implement GGML_OP_TRI (llama/17503) * vulkan: Implement GGML_OP_TRI * check types match	2025-12-12 17:53:11 +02:00
Radoslav Gerganov	d26d1c8b85	rpc : cache and reuse compute graphs (llama/15405) Store the last computed graph and reuse it when possible. Also do not return response from GRAPH_COMPUTE and assume it always completes successfully. If this is not the case, the server closes the connection. This saves us a network round trip to the server.	2025-12-12 17:53:11 +02:00
yulo	f92d542d4d	HIP: enable mul_mat_f for RDNA4 (llama/17437) * enable mmf for rdna4 * move some mmvf to mmf * revert lds128 for wmma loading * Revert "revert lds128 for wmma loading" This reverts commit db9ae8b6b4738a5def5b393caa1611d52133e9b5. * Revert "enable mmf for rdna4" This reverts commit 698c9f24187b990e35c3b73a8067e5387e6ddbd4. * Revert "move some mmvf to mmf" This reverts commit 99b92bd6653cc8593607f641e44606391691792f. * enable mul_mat for rdna4 --------- Co-authored-by: zhang hui <you@example.com>	2025-12-12 17:53:11 +02:00
Piotr Wilkin (ilintar)	51e842d106	SOLVE_TRI CUDA kernel for small matrices (llama/17457)	2025-12-12 17:53:11 +02:00
Neo Zhang Jianyu	93bc8dc5a8	refactor pad_reflect_1d to make the UT case pass (llama/17204) Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-12-12 17:53:10 +02:00
Jeff Bolz	3727a36c48	vulkan: Implement SOLVE_TRI (llama/17486) * vulkan: Implement SOLVE_TRI * load B matrix through shared memory * use FLOAT_TYPE	2025-12-12 17:53:10 +02:00
matt23654	e682af7886	cuda : fix UMA detection on discrete GPUs. (llama/17537)	2025-12-12 17:53:10 +02:00
Alberto Cabrera Pérez	93f6cdb9c0	ggml-cpu: aarm64: q4_K repack gemm and gemv implementations (dotprod only) (llama/17494) * Enabled q4_K_4x8 path * Fixed generic Q4_K 8x4 implementation * wip: dotprod gemm * Working arm q4_K dotprod gemm Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Undo acc rename Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Q4_K arm dotprod gemm Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fix: q4_qs reinterpret from uint to int Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Removed comments * Fixed macro guards * Fixed unused vars in generic implementation * Fixed unused vars in 8x4 repack * Fixed unused vars in generic implementation, unneeded comment * Missing arch fallback for x86 * minor : style --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:10 +02:00
Acly	ac92424b59	vulkan : move contiguous checks to device_supports_op (llama/17490) * vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op * im2col: remove constraints on src0 (kernel input)	2025-12-12 17:53:10 +02:00
Jeff Bolz	310db24fca	vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (llama/17514)	2025-12-12 17:53:10 +02:00
lhez	74ef5dd1a9	opencl: add sqr, sqrt, mean and ssm_conv (llama/17476) * opencl: add sqr * opencl: add sqrt * opencl: add mean * opencl: add ssm_conv * opencl: add missing cl_khr_fp16 * opencl: do sqrt in f32 then convert to f16 for better precision	2025-12-12 17:53:09 +02:00
Alberto Cabrera Pérez	3de4372465	Fix chunks being too small with small matrix sizes (llama/17526)	2025-12-12 17:53:09 +02:00
Jeff Bolz	c8050e5fdc	vulkan: allow graph_optimize for prompt processing workloads (llama/17475)	2025-12-12 17:53:09 +02:00
Jeff Bolz	d8b61e05f8	vulkan: Implement top-k (llama/17418) * vulkan: Implement top-k Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10) and discards all but the top K. Repeat until only K are left. And there's a fast path when K==1 to just find the max value rather than sorting. * fix pipeline selection * vulkan: Add N-ary search algorithm for topk * microoptimizations	2025-12-12 17:53:09 +02:00
xctan	fb31a19797	ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16 (llama/17448) * ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16 * ggml-cpu : dedup scalar impl * Update ggml/src/ggml-cpu/vec.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:09 +02:00
Adrien Gallouët	8e3560c7ce	ggml : fix ARM feature verification (llama/17519) On arm64 with `cmake` version 3.31.6, the final feature verification fails: -- ARM detected flags: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs -- Performing Test GGML_MACHINE_SUPPORTS_dotprod -- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success -- Performing Test GGML_MACHINE_SUPPORTS_i8mm -- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success -- Performing Test GGML_MACHINE_SUPPORTS_sve -- Performing Test GGML_MACHINE_SUPPORTS_sve - Success -- Performing Test GGML_MACHINE_SUPPORTS_sme -- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed -- Performing Test GGML_MACHINE_SUPPORTS_nosme -- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success -- Checking for ARM features using flags: -- -U__ARM_FEATURE_SME -- -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme -- Performing Test HAVE_DOTPROD -- Performing Test HAVE_DOTPROD - Failed -- Performing Test HAVE_SVE -- Performing Test HAVE_SVE - Failed -- Performing Test HAVE_MATMUL_INT8 -- Performing Test HAVE_MATMUL_INT8 - Failed -- Performing Test HAVE_FMA -- Performing Test HAVE_FMA - Success -- Performing Test HAVE_FP16_VECTOR_ARITHMETIC -- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Failed -- Performing Test HAVE_SME -- Performing Test HAVE_SME - Failed -- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme We need to explicitly replace `;` with spaces from the list to make `CMAKE_REQUIRED_FLAGS` work correctly... Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:08 +02:00
Jiacheng (Jason) Chen	bb7223da8a	HIP: Patch failed testcase in WMMA-MMQ kernels for RDNA 4 (llama/17502) * patch failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) for enabling WMMA on RDNA4 * Quick clean up on mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162	2025-12-12 17:53:08 +02:00
hipudding	f0c54d47e1	CANN: Add MROPE and IMROPE support (llama/17401) * CANN: ROPE supports both MROPE and IMROPE. 1. Optimize the caching logic of rope_cache_init. 2. Add support for mRoPE and i-mRoPE. Note that on Ascend 910B devices, it is necessary to disable FA in CLIP and disable NZ-format conversion. These two issues are still under investigation. * Resolve review comments	2025-12-12 17:53:08 +02:00
Jeff Bolz	208450048c	vulkan: Implement GGML_OP_CUMSUM (llama/17479)	2025-12-12 17:53:08 +02:00
Georgi Gerganov	968db8bcfa	ggml : add ggml_top_k (llama/17365) * ggml : add ggml_top_k * cont : add ggml_argsort_top_k * metal : add top_k support * ggml : cleanup * tests : add virtual err() function for test_case * ggml : add comments	2025-12-12 17:53:08 +02:00
TianHao324	e00bb753d6	CANN: supports out_prod operator for F32 and F16 (llama/17406) Co-authored-by: tianhao <tianhao42@huawei.com>	2025-12-12 17:53:08 +02:00
Jeff Bolz	273e4fe7ae	vulkan: Use fewer rows for scalar FA when HS is not a multiple of 16 (llama/17455)	2025-12-12 17:53:07 +02:00
Jeff Bolz	553d57a4e7	vulkan: more FA details in vk_perf_logger (llama/17443)	2025-12-12 17:53:07 +02:00
Jiacheng (Jason) Chen	371a21865a	HIP: WMMA-MMQ kernels for RDNA 4 (llama/17156) * first commit naive test to enable mmq for RDNA4 * adding appropriate WMMA instructions * git rebase on top of master: fixing the correctness of the mat mul operations, updating layout mappings for RDNA4 * clean up merge conflicts * add comments and code clean up * PR clean up, addressed comments * enable MMQ fallback on RDNA4 * addressed comments: add guards in load generic, separate wmma branch for use_mmq function * Revert build-xcframework.sh * Formating: remove trailing whitespace * revert CMake files * clean up after rebase: remove duplicated change, revert cmake files * clean up after rebase: revert changes from build-xcframework.sh * clean up: remove extra space line in mma.cuh * Revert "clean up: remove extra space line in mma.cuh" This reverts commit b39ed57c4529906466bd0bc7c2a86e08fc2f8bee.	2025-12-12 17:53:07 +02:00
Alberto Cabrera Pérez	f4ede89d24	ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) (llama/16739) * Enabled q4_K_8x8_q8_K path on ARM * wip: I8mm qs multiplication, pending bias * cpu : arm : REPACK gemm q4_K8x8 implementation Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Guard gemm with proper features, improved superblock scale and min calc Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * cpu: arm: Implemented REPACK gemv for Q4_K Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Removed completed TODO * Fixed missing guards when selecting optimal repack type for Q4_K Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed macro guard for gemv * Fixed wrong comment in GEMV * Fixed warning for unused variable * vdotq_s32 -> ggml_vdotq_s32 Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Clang-format issues * Apply suggestions from code review Co-authored-by: Diego Devesa <slarengh@gmail.com> * Removed unnecessary GGML_UNUSED * Fixed guards in q4_k gemm and gemv (repack) --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-12-12 17:53:07 +02:00
ixgbe	faf37ffe76	ggml: add RISC-V cpu-feats (llama/17461) * ggml: add RISC-V cpu-feats Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> * fix comment[1] --------- Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-12-12 17:53:07 +02:00
Max Krasnyansky	77d874b1c3	hexagon: add support for ROPE_NEOX (llama/17458)	2025-12-12 17:53:07 +02:00
Raul Torres	5ed0ddc458	CANN: Define `cann_graph_update_required` before macro (llama/17434) Description of the problem `cann_graph_update_required` is redundantly defined and initialized as `false` inside two mutually exclusive macro branches. Proposed solution Define it right before the macro so that it could serve both branches.	2025-12-12 17:53:06 +02:00
M. Mediouni	75cea7f8be	ggml-hexagon: Initial Hexagon v68/v69 support (llama/17394) * ggml-hexagon: fix build error with GCC Add stdexcept include to fix GCC build errors Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr> * ggml-hexagon: check VTCM acquire failures Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr> * ggml-hexagon: disable destination bypass on older than v73 v68 errors out if having bypass enabled when the VTCM is the destination. At least on v68 this made things actually work... not a proper fix though, so to look at later... Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr> * ggml-hexagon: add initial v68/v69 support v68 is the Hexagon revision notably used on the Snapdragon 8cx Gen 3 and the QCM6490. Also add support for v69. 8MB isn't a supported page size, so relax asked for page size constraint for HAP_compute_res_attr_set_vtcm_param_v2 to optimal. Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr> --------- Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>	2025-12-12 17:53:06 +02:00
nullname	621cb871b3	ggml-hexagon: add `hex_supported_buffer` for better buffer supported check (llama/17212) * hexagon: add buffer support checks for hexagon sessions * refactor: simplify buffer support checks in hexagon operations * hexagon: update buffer support checks to use tensor structure * refactor: streamline buffer initialization for DSP queue in hexagon operations * refactor: simplify buffer initialization in DSP queue for hexagon operations * refactor: optimize hex_supported_buffer function by fold expression * wip * refactor: simplify dspqueue_buffers_init function and its usage in hexagon operations * fix: improve nan handling at hvx_vec_fast_sigmoid_fp32_guard * refactor: optimize hvx_vec_inverse_fp32_guard for better nan handling * refactor: update hvx_vec_fast_sigmoid_fp32_guard to use adjusted exponent limits * refactor: modify hvx_vec_fast_sigmoid_fp32_guard to accept parameters for improved flexibility * refactor: update hvx_vec_exp_fp32_guard to accept max_exp and inf parameters to save some instructions * refactor: move hvx_vec_inverse_fp32_guard implementation to hvx-inverse.c for better perf	2025-12-12 17:53:06 +02:00
Sigbjørn Skjæret	61e0b7ed48	cuda : support non-contiguous i32 to i32 copy (llama/17326) * support non-contiguous i32 to i32 copy * add tests * rename cpy_flt to cpy_scalar and reindent params	2025-12-12 17:53:06 +02:00
Jeff Bolz	deb4958add	vulkan: remove a couple unnecessary switches (llama/17419)	2025-12-12 17:53:06 +02:00
yulo	fc6eae781d	HIP: RDNA4 tensor core support for MMF (llama/17077) * mmf for rdna4 * align the padding for rdna4 * forbit mul_mat_f for rdna4 * fix as comment * remove device kernels * add constexpr for early return * update based on review comment * change based on the review comment * pass compile error * keep code consistency --------- Co-authored-by: zhang hui <you@example.com>	2025-12-12 17:53:06 +02:00
lhez	5c0e4a9cc5	opencl: refine condition for kqv mm (llama/17392)	2025-12-12 17:53:05 +02:00
Jeff Bolz	cdc1a776be	vulkan: disable async for older Intel devices (llama/17369) * vulkan: disable async for older Intel devices * update detection logic * use name string for detection	2025-12-12 17:53:05 +02:00
Raul Torres	a009dc172c	CANN: Refactor `evaluate_and_capture_cann_graph` (llama/17333) * CANN: Refactor `evaluate_and_capture_cann_graph` Description of the problem * `matched_graph` is obtained even if graph mode is disabled. * End of graph capture and graph replay are unnecessarily placed in different `if` blocks. Proposed solution * Obtain `matched_graph` only if graph mode is enabled. * Place end of graph capture and graph replay inside the same `if` block. * Unify graph related comments. * Remove trailing whitespace	2025-12-12 17:53:05 +02:00
nullname	cb3ee1b098	ggml-hexagon: fix swiglu failure at `test-backend-ops` (llama/17344) * refactor: use hvx_vec_exp_fp32_guard_inf for overflow handling in hvx_exp_f32 * feat: add fast sigmoid function with overflow guard for fp32 * refactor: replace hvx_vec_inverse_fp32 with hvx_vec_inverse_fp32_guard_inf for improved overflow handling * feat: enhance hvx_add_scalar_f32 with overflow handling using infinity guard * wip * add HVX_Vector_Alias wip * wip * fix: improve handling of src1 tensor in glu_swiglu_fp32_per_thread function * fix nc * wip * wip * handle nan at inverse * wip * fix neg * wip * rename * fix hvx_vec_inverse_fp32_guard_inf to handle infinity and NaN cases correctly * wip * fix hvx_vec_inverse_fp32_guard_inf to handle NaN cases correctly * wip * wip * wip * fix output sign	2025-12-12 17:53:05 +02:00
Piotr Wilkin (ilintar)	46f893c2fa	ggml : Fix transposed SOLVE_TRI result (llama/17323) * Did someone transpose the SOLVE_TRI result matrix? Perhaps... * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-12 17:53:05 +02:00
Scott Fudally	510805e6c1	DGX Spark: UMA support (llama/17368) * DGX Spark: UMA support * Updates from PR feedback * More PR feedback cleanup * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Remove trailing whitespace * Update ggml/src/ggml-cuda/ggml-cuda.cu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-12 17:53:05 +02:00
Adrien Gallouët	2f20938b58	ggml : remove useless and error-prone variadic macros (llama/17399) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:04 +02:00
sudhiarm	51f5438089	kleidiai: fix zero-size array declaration (llama/17240)	2025-12-12 17:53:04 +02:00
ixgbe	1d3a525001	ggml-cpu:add RISC-V RVV (Zvfh) optimization for FP16 vector scaling (llama/17314) * ggml-cpu:add RISC-V RVV (Zvfh) optimization for FP16 vector scaling Signed-off-by: Wang Yang <yangwang@iscas.ac.cn> * fix comment * fix comment 2 --------- Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-12-12 17:53:04 +02:00
Giuseppe Scrivano	24b14cad87	vulkan: implement ADD1, ARANGE, FILL, SOFTPLUS, STEP, ROUND, CEIL, FLOOR, TRUNC (llama/17319) * vulkan: initialize array * vulkan: implement ADD1 * vulkan: implement ARANGE * vulkan: implement FILL * vulkan: implement SOFTPLUS * vulkan: implement STEP * vulkan: implement ROUND * vulkan: implement CEIL * vulkan: implement FLOOR * vulkan: implement TRUNC * docs: update Vulkan ops Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-12-12 17:53:04 +02:00
Jeff Bolz	95d0b0b0cf	vulkan: support larger argsort (llama/17313) * vulkan: support larger argsort This is an extension of the original bitonic sorting shader that puts the temporary values in global memory and when more than 1024 threads are needed it runs multiple workgroups and synchronizes through a pipelinebarrier. To improve the memory access pattern, a copy of the float value is kept with the index value. I've applied this same change to the original shared memory version of the shader, which is still used when ncols <= 1024. * Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost * reduce loop overhead * run multiple cols per invocation, to reduce barrier overhead	2025-12-12 17:53:04 +02:00
Jeff Bolz	ae8865c6e6	vulkan: Add copy_transpose shader (llama/17371)	2025-12-12 17:53:04 +02:00
Aman Gupta	73d396826b	cuda: fix rope fusion for gemma3 (llama/17378)	2025-12-12 17:53:03 +02:00
Piotr Wilkin (ilintar)	746cbed20a	Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition (llama/17332) * Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition * Argh. * Making CISC happy ;) * Integrate CONT tests * Use loopy loop * Skip new tests for (B)F16 for now.	2025-12-12 17:53:03 +02:00
Ruben Ortlam	2097a9c1bd	vulkan: force full subgroups for flash attention to fix intel subgroup crash (llama/17356)	2025-12-12 17:53:03 +02:00
Jeremy Rand	27c69271c5	ggml-cpu: Don't pass -mpowerpc64 when -mcpu already implies it (llama/17308)	2025-12-12 17:53:03 +02:00
Chenguang Li	c137d11b81	CANN: fix acl_tensor_ptr usage in ASCEND_310P ROPE (llama/17347) * cann: fix acl_tensor_ptr usage in ASCEND_310P ROPE implementation Fix compilation errors in the ASCEND_310P-specific ROPE operation code by adding .get() calls when passing acl_tensor_ptr smart pointers to functions expecting raw aclTensor* pointers. This fixes the code that was missed in the previous refactoring commit (8981848) which changed ggml_cann_create_tensor() return type from aclTensor* to acl_tensor_ptr. * cann: format code	2025-12-12 17:53:03 +02:00
Jeff Bolz	24b981eff7	vulkan: support noncontig i32 copy (llama/17328)	2025-12-12 17:53:03 +02:00
Ruben Ortlam	b7dfced37f	vulkan: add log RTE support to fix Nvidia CI (llama/17320) * vulkan: add log RTE support to fix Nvidia CI * actually use the rte shader	2025-12-12 17:53:02 +02:00
Adrien Gallouët	9e429c47e1	cmake : fix ARM feature verification (llama/17170) * cmake : fix ARM feature verification Use check_cxx_source_compiles to prevent conflicts with the existing GGML_NATIVE detection code. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * cmake : unset __ARM_FEATURE when feature is disabled Signed-off-by: Adrien Gallouët <angt@huggingface.co> * cmake : fix scope, this is really a macro Signed-off-by: Adrien Gallouët <angt@huggingface.co> * arm_neon.h is useless Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:02 +02:00
Adrien Gallouët	bb88c2545f	ggml : add missing AVX512 feature checks (llama/17270) _mm512_cvtepu8_epi16 requires __AVX512BW__ _mm512_srli_epi16 requires __AVX512BW__ __builtin_ia32_inserti32x8 requires __AVX512DQ__ Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-12 17:53:02 +02:00
Daniel Bevenius	418314941e	ggml : remove dirty flag from version string (ggml/1391) This commit removes the "-dirty" suffix from the GGML version string. The motivation for this change is to ensure that the version string works with different ways of checking out ggml and using it in projects. By removing the dirty flag from the version string, we avoid potential artifacts like shared libraries getting a -dirty suffix in their names. Instead, if the project is built from a dirty git state, the dirty flag will be appended to the commit hash in the GGML_BUILD_COMMIT variable. This will enable users to still identify that the build was made from a modified/dirty state even though the version might match a "real" version. For example, the commit can be produced as follows: ```c++ printf("commit: %s\n", ggml_commit()); ``` Which would print the following for a dirty build: ```console commit: 781baf2a-dirty ``` Refs: https://github.com/ggml-org/ggml/pull/1363#issuecomment-3569691546	2025-12-12 17:53:00 +02:00
YangLe	961aec7384	metal : fix compile on macos 11 (#3533)	2025-11-20 13:54:54 +02:00
Georgi Gerganov	661567357c	metal : support I32 -> I32 copy (llama/17317)	2025-11-17 21:05:46 +02:00
Georgi Gerganov	74bb8a8b23	metal : faster argsort (llama/17315) * metal : faster argsort * cont : keep data in registers	2025-11-17 21:05:46 +02:00
Georgi Gerganov	57c0e6f8b6	metal : add cumsum (llama/17305)	2025-11-17 21:05:46 +02:00
hipudding	d3f5487464	CANN: Use smart pointers to manage ACL objects (llama/17238) * CANN: Use smart pointers to manage ACL objects Previously, ACL objects were managed via manual destruction, which led to multiple memory-leak issues during runtime. This patch replaces manual memory management with smart pointers so that ACL objects are properly released and ownership is clearly defined. Note that the ownership of an ACL object belongs to the function that creates it. Other internal functions should operate on these ACL objects using raw pointers to avoid unintended ownership transfers. Additionally, since aclTensorList automatically frees its contained aclTensor objects, any aclTensor added to a tensor list must release ownership to avoid double free operations. This PR also removes the asynchronous task submission mechanism. Due to changes in recent CANN versions, tiling time has significantly decreased. Even with a dual-thread submission model, the dispatch overhead still falls on the critical path, making async submission less beneficial. Moreover, aclGraph support provides a much better path to reducing operator dispatch latency. * CANN: resolve review comments	2025-11-17 21:05:46 +02:00
Pavels Zaicenkovs	9d95d9a1ee	vulkan: add LOG operation support for F32 and F16 (llama/17183) * vulkan: add LOG operation support for F32 and F16 Part of #14909. * vulkan: Fix LOG operation types * docs: Update operation support documentation for Vulkan LOG operation * vulkan: fix log_f16 shader * docs: restore missing LOG test cases and regenerate ops.md	2025-11-17 21:05:46 +02:00
Ruben Ortlam	f571655e8e	vulkan: fix MMQ quantize_y condition (llama/17301)	2025-11-17 21:05:46 +02:00
Georgi Gerganov	9549cc1051	metal : remove obsolete asserts (llama/17295)	2025-11-17 21:05:46 +02:00
lhez	a75525cad0	opencl: fix rms_norm_mul (llama/17250) * opencl: use subgroup reduce for reduction in rms_norm_mul * opencl: add comment about workgroup size	2025-11-17 21:05:46 +02:00
shaofeiqi	c78845bfa9	opencl: add kernel to handle mat mul in attention to improve encoding speed (llama/17181) * Add mul_mm_f16_f32_kq_kqv kernel * Add ggml_cl_mul_mat_kq_kqv_adreno func * fix whitespace * remove unused variable * remove redundant * refactor and clean up * remove trailing whitespace	2025-11-17 21:05:46 +02:00
shani-f	1fd63da9f2	sycl : unify unary kernels with a generic implementation and enable wide operator support (llama/17213) * SYCL: add generic unary op implementation for multiple ops (ABS/SGN/…); unify non-contiguous access * SYCL: update documentation and sycl.csv to reflect new unary op support * update ops.md after syncing SYCL.csv changes * Fix SYCL.csv merge conflict * Update ops.md after fixing SYCL.csv conflicts * Fix SYCL.csv tail after merge conflict and regenerate ops.md * Fix line endings and final newline in SYCL.csv * Remove TOPK_MOE entries from SYCL.csv as requested * Update ops.md after removing TOPK_MOE from SYCL.csv * Regenerated SYCL.csv and synced ops.md with upstream * Update ops.md using create_ops_docs.py	2025-11-17 21:05:46 +02:00
Jeff Bolz	ea3ebd8b0d	vulkan: Fuse mul_mat_id+add_id+mul and mul_mat+add+add. (llama/17287) These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.	2025-11-17 21:05:46 +02:00
Ruben Ortlam	7caea54450	vulkan: Replace 16-bit unpack8 calls to work around legacy Windows AMD driver bug (llama/17285)	2025-11-17 21:05:46 +02:00
Giuseppe Scrivano	4c4e663da0	vulkan: implement ABS and NEG (llama/17245) * docs: update Vulkan ops * vulkan: add NEG op * vulkan: add ABS op --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-17 21:05:46 +02:00
Jeff Bolz	e1846fc599	vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths (llama/17244) * vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths * set allow_misalign	2025-11-17 21:05:46 +02:00
Jeff Bolz	9614a56314	vulkan: skip all-negative-inf blocks in FA (llama/17186)	2025-11-17 21:05:46 +02:00
Jeff Bolz	37d4bba152	vulkan: change graph_compute to be async and enable get_tensor_async (llama/17158) * vulkan: change graph_compute to be async and enable get_tensor_async This allows some additional CPU/GPU overlap for large pp workloads. Also seems to help a bit for token gen, maybe getting rid of a small bubble between graph_compute and get_tensor. Async set and copy functions seem to be very rarely used, so I didn't enable them because I didn't have a good way to test them. The async commands need to be ordered against each other, so put them all on the compute queue. The non-async commands still use the transfer queue. The fence for graph_compute/get_tensor_async is submitted and waited on in ggml_vk_synchronize. * fix thread safety errors * teardown context cleanly * Handle async read to non-pinned dst	2025-11-17 21:05:46 +02:00
Georgi Gerganov	523a6c27ea	metal : support argsort for ne00 > 1024 (llama/17247) * metal : refactor argsort * cont : sort chunks * cont : merge sorted buckets * cont : cleanup	2025-11-17 21:05:46 +02:00
Georgi Gerganov	b4d7df3ba2	metal : make the FA extra sizes consistent (llama/17143)	2025-11-17 21:05:46 +02:00
Alberto Cabrera Pérez	a81fbfc78e	ggml-cpu: handle 3d tensors in repack mat_mul (llama/17241) * ggml-cpu: handle 3d tensors in repack mul_mat * Removed unnecessary branch, removed need for <algorithm> * Fixed dst_ptr pointer in chunk + clang_format * GGML_ASSERT to check wdata within bounds * Accidental ggml.h inclusion * Improved GGML_ASSERT on wdata boundaries * Address performance regression in Qwen and llama.cpp due to chunking	2025-11-17 21:05:46 +02:00
Piotr Wilkin (ilintar)	3e684f26c1	ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (llama/17063) * Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM * Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Code review * Whitespace * Update tests/test-backend-ops.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> * This is actually sigmoid, duh. * Add CONST, remove TRI_KEEP, other changes from review * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Remove extra script * Update ggml/src/ggml.c Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> * moving changes from laptop [no ci] * pre-rebase * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Refactor tests * ggml : cleanup * cont : fix ggml_fill srcs * tests : add note * ggml : add ggml_fill_inplace * ggml : add asserts * ggml : fix ggml_fill constant cast * cont : ggml_tri minor * Use TENSOR_LOCALS * Fix regression from #14596, regenerate * Don't make commits at night... --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-17 21:05:46 +02:00
Ruben Ortlam	e8e0004fe5	vulkan: remove shell call from vulkan-shaders-gen tool, revert file check (llama/17219) * vulkan: remove shell call from vulkan-shaders-gen tool * use string vector for command execution * Fix condition * use string, remove const_cast * Fix dependency file quotation on Windows --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-11-17 21:05:46 +02:00
Diego Devesa	210f0f860b	sched : fix reserve ignoring user tensor assignments (llama/17232)	2025-11-17 21:05:46 +02:00
ixgbe	91fa5b5cac	ggml-cpu : add RISC-V vector intrinsic support for silu and cvar operations (llama/17227) Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-11-17 21:05:46 +02:00
bagheera	265d326fa8	metal: accelerated conv2d (llama/17175) * metal: accelerated conv2d * cont : cleanup --------- Co-authored-by: bghira <bghira@users.github.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-17 21:05:46 +02:00
Georgi Gerganov	6a1d830dfd	Revert "ggml-cpu: handle 3d tensors in repack mat_mul (llama/17030)" (llama/17233) This reverts commit 1c398dc9eca9c366ce98deb0e6f3538e444ebc8a.	2025-11-17 21:05:46 +02:00
Diego Devesa	6a91780c3b	ggml-cpu : use template for argsort (llama/17222)	2025-11-17 21:05:46 +02:00
TecJesh	726912d1cb	CANN: Add cross_entropy_loss op support (llama/16886) * update L2_NORM op support * update L2_NORM op support * remove extra whitespace * cann: update cross_entropy_loss op support * remove trailing whitespaces * rebase the latest code in the main repository and remove the l2_norm operator that already exists in another pull request. * undo the l2_norm operator deletion	2025-11-17 21:05:46 +02:00
Aman Gupta	84275fc493	CUDA: fuse rope + set_rows (llama/16884) * CUDA: add fused rope * move k forward_expand up * create helper function instead of re-using params * make assert statement more in line with comment * rope_norm: coalesced writes to global mem	2025-11-17 21:05:46 +02:00
Johannes Gäßler	566c4c4469	CUDA: static assert to prevent misuse of memcpy_1 (llama/17198)	2025-11-17 21:05:46 +02:00
Georgi Gerganov	3810a6180b	ggml : use std::sort in ggml_argsort CPU implementation (llama/17211) * ggml : use std::sort in ggml_argsort CPU implementation * cont : add missing header	2025-11-17 21:05:46 +02:00
Alberto Cabrera Pérez	7df8515824	ggml-cpu: handle 3d tensors in repack mat_mul (llama/17030) * ggml-cpu: handle 3d tensors in repack mul_mat * Removed unnecessary branch, removed need for <algorithm> * Fixed dst_ptr pointer in chunk + clang_format * GGML_ASSERT to check wdata within bounds * Accidental ggml.h inclusion * Improved GGML_ASSERT on wdata boundaries	2025-11-17 21:05:46 +02:00
TecJesh	e8b66d9f94	CANN: Add L2_NORM op support (llama/16856) * update L2_NORM op support * update L2_NORM op support * remove extra whitespace	2025-11-17 21:05:46 +02:00
Neo Zhang Jianyu	8388350c66	fix ci crash about SSM_CONV (llama/17169) * fix ci crash * Update ggml-sycl.cpp * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-17 21:05:46 +02:00
Max Krasnyansky	6748d27f55	hexagon: various Op fixes (llama/17135) * hexagon: explicitly check for ops with zero nrows llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows. Somehow other backends seem to handle this without obvious explicit checks. In the hexagon case we need to check explicitly and skip them. * hexagon: introduce fastdiv, fix test-backend-ops for ADD/SUB/MUL Co-authored-by: chraac <chraac@gmail.com> * hexagon: use fastdiv in ADD_ID * hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs --------- Co-authored-by: chraac <chraac@gmail.com>	2025-11-17 21:05:46 +02:00
Eve	559091005a	disable rms norm mul rope for chips with no fp16 rte (llama/17134)	2025-11-17 21:05:46 +02:00
ixgbe	cd8f64d1b5	ggml-cpu : add RISC-V RVV (Zvfh) optimization for FP16 to FP32 conversion (llama/17161) Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>	2025-11-17 21:05:46 +02:00
duduta	1cefb03571	ggml-cpu: templateify ggml_compute_forward_rope_f32 and _f16 (llama/16805) * extract rotate_pairs logic from ggml_compute_forward_rope_f32 * templateify ggml_compute_forward_rope_f32 and _f16 * abort when rope type not supported, remove GLM from test-rope * add imrope branch to switch * add rope tests for perf * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-17 21:05:46 +02:00
Charles Xu	3920ecce3a	kleidiai: add optimized per-channel kernels for Q8_0 (llama/16993)	2025-11-17 21:05:46 +02:00
Mike Abbott	c01bf73dd1	cmake : add version to all shared object files (llama/17091) When compiling llama.cpp in Yocto, it fails QA checks because the generated so files aren't versioned. This applies a version to all generated so files, allowing the package to build without errors.	2025-11-17 21:05:46 +02:00
lhez	46615d74d3	opencl: add fastdiv and use it in set_rows, ported from cuda (llama/17090) * opencl: add fastdiv for mm q8_0 * opencl: use uint4 for fastdiv vals * opencl: use fastdiv for set_rows * opencl: do not use fastdiv for q8_0 mm	2025-11-17 21:05:46 +02:00
Max Krasnyansky	ccf525baf0	cpu: skip NOPs to avoid barriers (llama/17133) * cpu: skip NOPs to avoid barriers * cpu: use ggml_op_is_empty	2025-11-17 21:05:46 +02:00
Georgi Gerganov	40aebfe8bf	metal : cap threadgroups size of set_rows (llama/17146)	2025-11-17 21:05:46 +02:00
Adrien Gallouët	86be60093e	ggml-cpu : inspect -march and -mcpu to find the CPU (llama/16333) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-11-17 21:05:46 +02:00
Ruben Ortlam	ef71d83b76	vulkan: check glslc executable string (llama/17144)	2025-11-17 21:05:46 +02:00
Ruben Ortlam	43f2c1ff54	vulkan: fix validation issue introduced by #16868 (llama/17145)	2025-11-17 21:05:46 +02:00
Georgi Gerganov	bb92c79f56	metal : enable tensor API for A19 (llama/17087)	2025-11-17 21:05:46 +02:00
fj-y-saito	4fea91f06e	arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_… (#15277) * add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_q8_K * Surround SVE function with compiler directive * fix compile switch * fix coding style * ggml : fix indent --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-17 21:05:46 +02:00
Acly	58a97d988f	cuda/vulkan : bicubic interpolation (llama/17022) * vulkan : implement upscale with bicubic interpolation * cuda : implement upscale with bicubic interpolation * tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests * adapt OpenCL backend to not support the OP in that case so tests don't fail * print scale mode & flags in test-backend-ops	2025-11-17 21:05:46 +02:00
Ruben Ortlam	2e04e7a906	vulkan: fix memory allocations (llama/17122)	2025-11-17 21:05:46 +02:00
Ruben Ortlam	1993e397bb	vulkan: iGPU memory reporting fix (llama/17110) * vulkan: use all device-local heaps for memory availability reporting Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com> * use all available heaps for iGPU memory reporting * Allow multiple memory types per buffer request for devices with split heaps --------- Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-09 23:38:03 +02:00
Ruben Ortlam	ee8349cf10	vulkan: fix mmq out of bounds reads (llama/17108) * vulkan: fix mmq out of bounds reads, streamline outdated matmul host code * fix mul_mat_id quantization call * Fix compiler warnings	2025-11-09 23:38:03 +02:00
Jeff Bolz	db98e8c5b4	vulkan: fuse mul_mat_id + mul (llama/17095) * vulkan: fuse mul_mat_id + mul This comes up in qwen3 moe. * split mul_mat_id fusion tests into a separate class	2025-11-09 23:38:03 +02:00
Georgi Gerganov	a4339e2ea7	metal : retain src and dst buffers during async ops (llama/17101)	2025-11-09 23:38:03 +02:00
Jeff Bolz	6de3404773	vulkan: Use spec constants for conv2d s/d/p and kernel W/H (llama/16978) * vulkan: Use spec constants for conv2d s/d/p and kernel W/H Also add some additional unroll hints, which seems to help. * lock around map lookup	2025-11-09 23:38:03 +02:00
Aman Gupta	8967c9ad9b	Revert "CUDA: add expert reduce kernel (ggml/16857)" (llama/17100)	2025-11-09 23:38:03 +02:00
Aman Gupta	522b9bce33	CUDA: skip fusion for repeating adds in bias (llama/17080)	2025-11-09 23:38:03 +02:00
SavicStefan	0caa32c772	vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (llama/16636) Signed-off-by: Stefan Savic <stefan.savic@huawei.com> Co-authored-by: Stefan Savic <stefan.savic@huawei.com>	2025-11-09 23:38:03 +02:00
Aleksei Nikiforov	3c975ad523	ggml: disable vxe for cross-compilation by default (llama/16966) Otherwise compilation will fail due to enabling -mvx -mzvector and not setting corresponding -march options.	2025-11-09 23:38:03 +02:00
Jeff Bolz	257ce2f5c0	vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (llama/16977) This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.	2025-11-09 23:38:03 +02:00
Jeff Bolz	4eef518167	vulkan: Fix test-thread-safety crashes (llama/17024) The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the same time, which needs to hold the lock. To be safe, hold the lock for all of ggml_vk_load_shaders.	2025-11-09 23:38:03 +02:00
Johannes Gäßler	358f77aca7	CUDA: fix MMQ stream-k fixup ne1 indices (llama/17089)	2025-11-09 23:38:03 +02:00
Reese Levine	78ea6c5b67	ggml webgpu: faster matrix multiplication/matrix-vector multiplication (llama/17031) * Faster tensors (llama/8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings	2025-11-09 23:38:03 +02:00
bssrdf	547724b0a5	CUDA: properly handle nb00=nb02 case for cpy (llama/17081)	2025-11-09 23:38:03 +02:00
Acly	11543bf446	vulkan : refactor buffer handling in vk_op_f32 (llama/16840) * vulkan : refactor/simplify buffer handling in vk_op_* functions * Combine UMA handling into ggml_vk_tensor_subbuffer	2025-11-09 23:38:03 +02:00
Johannes Gäßler	af8a88792f	CUDA: fix should_use_mmvf for ne11 == 1 (llama/17085) * CUDA: fix should_use_mmvf for ne11 == 1 * Apply suggestion from @am17an Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2025-11-09 23:38:03 +02:00
Adrien Gallouët	a1746097bc	Revert "ggml-cpu: detect correct cpu flags for arm64 (llama/16229) (#16239 )" (llama/17084) This reverts commit 7c23f3f0d4b9f5d6ea140756eb694b562d5acebb.	2025-11-09 23:38:03 +02:00
iron	512592513c	ggml-cpu: detect correct cpu flags for arm64 (ggml/16229) (llama/16239) When using GCC 9 and GCC 12 on the arm64 platform of ubuntu 2004, the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags, which results in compilation failures for certain extended instructions, but the correct CPU flags can be obtained by using gcc -march. Signed-off-by: lizhenneng <lizhenneng@kylinos.cn> Co-authored-by: lizhenneng <lizhenneng@kylinos.cn>	2025-11-09 23:38:03 +02:00
xctan	5bce732795	ggml-cpu : optimize RVV q2_k and q3_k kernels (llama/16887)	2025-11-09 23:38:03 +02:00
Johannes Gäßler	b5d6fa438f	CUDA: fix crash on uneven context without FA (llama/16988)	2025-11-09 23:38:03 +02:00
Georgi Gerganov	32ed574370	metal : initial Metal4 tensor API support (llama/16634) * metal : rework mat-mat multiplication * metal : initial Metal4 support * cont * metal : detect tensor support * cont : better ifdefs * metal : support tensors in mul_mm_id * metal : add env for disabling tensor API * tests : restore * metal : remove unused constants * metal : fix check for bfloat tensor support * cont : handle API incompatibilities * cont : handle even more incompatibilities * metal : use tensor API only on M5 and later	2025-11-09 23:38:03 +02:00
YehuditE	45588b272e	sycl: add CONCAT operator support (llama/16047) * sycl: add CONCAT operator support * cleanup: remove stray lines added by mistake * fix: code format issues in concat.cpp and tests/test-backend-ops.cpp * chore: fix editorconfig violations * cleanup: drop unnecessary i16 type support * docs: update sycl-csv and regenerate ops.md * update docs/ops.md * fix: adapt to upstream master changes after rebase * fix: remove empty files * fix: drop whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
l3utterfly	b3324ae7d1	ggml-hexagon: graceful fallback for older socs where rpcmem_alloc2 and FASTRPC_GET_URI is unsupported (llama/16987) * support older socs where FASTRPC_GET_URI is unsupported * added graceful fallback when FASTRPC_GET_URI call fails * use weak symbols instead of loading libcdsprpc.so dynamically * Add weak pragma for rpcmem_alloc2 * Remove weak declaration for rpcmem_alloc2 in ggml-hexagon.cpp Removed weak declaration for rpcmem_alloc2. * Enforce ndev to 1 for archs below v75 Force ndev to 1 for SoCs architectures lower than v75.	2025-11-09 23:38:03 +02:00
bssrdf	13cd906501	improve CUDA cpy memory bandwidth when copying transposed tensor (llama/16841) * WIP * added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added showing improved memory bandwidth * added BF16 support * more strict check to make sure src0 is a transpose * reformulated to handle more complicated transpose cases * bring back 2D transpose for higher performance * allow build on windows * transpose copy more shapes * minor tweak * final clean up * restore some test cases * keep only the kernel for true transposed case; updated with review suggestions * make CI happy * remove headers not needed * reduced bank conflicts for fp16 and bf16 * add missing const* * now bank conflicts free * use padding instead of swizzling --------- Co-authored-by: bssrdf <bssrdf@gmail.com>	2025-11-09 23:38:03 +02:00
Jeff Bolz	558a04c9c7	vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (llama/16919)	2025-11-09 23:38:03 +02:00
Reese Levine	e734b5d6ef	ggml webgpu: minor set rows optimization (llama/16810) * Add buffer label and enable dawn-specific toggles to turn off some checks * Minor set_rows optimization (ggml/4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Remove some comments * Implement overlap binary operators * Revert "Implement overlap binary operators" This reverts commit ed710b36f51ab3f53fa13db15c1685dc8678a32a. * Disable support for non-contiguous binary_op tensors and leave note for future support --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com> Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>	2025-11-09 23:38:03 +02:00
nullname	44e77ccee6	refactor: replace sprintf with snprintf for safer string handling in dump functions (llama/16913)	2025-11-09 23:38:03 +02:00
Jeff Bolz	1672d41ab0	vulkan: remove the need for the dryrun (llama/16826) * vulkan: remove the need for the dryrun Allocate pipelines and descriptor sets when requested. Reallocate the prealloc buffers when needed, and flush any pending work before reallocating. For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed. * remove dryrun parameters	2025-11-09 23:38:03 +02:00
Acly	997fdde0c4	ggml-cpu : bicubic interpolation (llama/16891)	2025-11-09 23:38:03 +02:00
Noah	52e43a2fa5	Fix garbled output with REPACK at high thread counts (llama/16956) * Fix garbled output with REPACK at high thread counts Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (model-dependent threshold). The issue occurred because with high thread counts, the code forced chunk count to equal thread count, creating many small chunks. After aligning these chunks to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces minimum chunk sizes based on NB_COLS and caps maximum chunk count to prevent creating too many tiny chunks, ensuring proper alignment without overlaps. * Update ggml/src/ggml-cpu/repack.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cpu/repack.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-09 23:38:03 +02:00
Aman Gupta	e51a2f90fe	CUDA: avoid mul + bias fusion when doing fusion (llama/16935)	2025-11-09 23:38:03 +02:00
lhez	f856023f46	opencl: support imrope (llama/16914) * opencl: support imrope * opencl: fix whitespace	2025-11-09 23:38:03 +02:00
theo77186	82ede64cd0	ggml: CUDA: add head size 72 for flash-attn (llama/16962)	2025-11-09 23:38:03 +02:00
Jinyang He	79801188f7	ggml : LoongArch fixes (llama/16958) * Fix test-quantize-fns f16 and q4_0 failed when use LSX * Fix LoongArch set float intrinsic when use LSX/LASX	2025-11-09 23:38:03 +02:00
shani-f	f1da026bb8	SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster) Feature/sycl repeat back opt (#16869) * SYCL repeat_back v1 — add core op + switch case * Implement repeat_back SYCL operation and minor fixes * SYCL: optimize repeat_back kernel * Remove Hebrew comment from repeat_back.cpp * Remove comments for code clarity Removed comments to clean up the code. * Fix formatting in ggml-sycl.cpp * Formatted lambda according to legacy style. No logic changes * Remove blank line in repeat_back.cpp Remove unnecessary blank line before assigning acc to dst_dd.	2025-11-09 23:38:03 +02:00
Georgi Gerganov	39834fde1b	clip : use FA (llama/16837) * clip : use FA * cont : add warning about unsupported ops * implement "auto" mode for clip flash attn * clip : print more detailed op support info during warmup * cont : remove obsolete comment [no ci] * improve debugging message * trailing space * metal : remove stray return --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-11-09 23:38:03 +02:00
mnehete32	5ed97df483	CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (llama/16917)	2025-11-09 23:38:03 +02:00
Aaron Teo	84854d246a	ggml: add s390x cpu-feats (llama/16774)	2025-11-09 23:38:03 +02:00
Jeff Bolz	2001457367	vulkan: Fix multi_add invalid descriptor usage (llama/16899)	2025-11-09 23:38:03 +02:00
Jeff Bolz	90be9c9de1	vulkan: fuse mul_mat+add and mul_mat_id+add_id (llama/16868) * vulkan: fuse mul_mat+add and mul_mat_id+add_id The fusion is only applied for the mat-vec mul paths. * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix 32b build --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
Oliver Simons	7d55fba06f	CUDA: Remove unneeded bias/gate dims in fused mmvq (llama/16858) * CUDA: Remove unneeded bias/gate dims in fused mmvq Pointed out [here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989) that only a single value is needed per target col per thread * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Fix "Error 991-D: extra braces are nonstandard" during compilation --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-11-09 23:38:03 +02:00
Johannes Gäßler	52e1bbb554	CUDA: Volta tensor core support for MMF (llama/16843) * CUDA: Volta tensor core support for MMF * more generic checks for hardware support * Update ggml/src/ggml-cuda/mmf.cuh Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2025-11-09 23:38:03 +02:00
Georgi Gerganov	addda802dd	ggml : fix conv2d_dw SVE path (ggml/1380) * Fix test-conv2d-dw failure on ARM SVE by using runtime vector length The ggml_compute_forward_conv_2d_dw_cwhn function was using a hardcoded GGML_F32_EPR (8) for SIMD vectorization, but on ARM SVE the actual vector length varies by hardware. This caused incorrect computation when processing CWHN layout tensors on ARM machines. Fix by using svcntw() to get the runtime SVE vector length instead of the compile-time constant. Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com> * ci : reduce sam score threshold * ci : update bbox checks for sam test --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>	2025-11-09 23:38:03 +02:00
Aman Gupta	7d60b431a5	CUDA: add expert reduce kernel (llama/16857) * CUDA: add expert reduce kernel * contiguous checks, better formatting, use std::vector instead of array * use vector empty instead of size Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-11-09 23:38:03 +02:00
Jeff Bolz	a9ba988e56	vulkan: disable spirv-opt for rope shaders (llama/16872)	2025-11-09 23:38:03 +02:00
Masato Nakasaka	e2b3eca0dc	vulkan: Fix crash when FP16 mul_mat accumulation is not supported (llama/16796) * Experimenting crash fix * added assert for aborting and fixed comment * changed to check if a pipeline is empty or not * Moved function in class definition * replaced with is_empty * Modified is_empty to check only unaligned pipelines	2025-11-09 23:38:03 +02:00
Ruben Ortlam	7ed570ee94	vulkan: fix shmem overrun in mmq id shader (llama/16873) * vulkan: fix shmem overrun in mmq id shader * metal : fix mul_mm_id --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-09 23:38:03 +02:00
l3utterfly	486d39c2cb	ggml-hexagon: respect input size when getting/setting tensor data (llama/16836) * respect input size when getting/setting tensor data allows partial repacking/copying when get tensor size is smaller than the actual tensor * Removed duplicate repack_mxfp4_mxfp4x4x2 function	2025-11-09 23:38:03 +02:00
lhez	7fdd53ac0d	opencl: fix boundary handling for mul_mm (llama/16875)	2025-11-09 23:38:03 +02:00
Max Krasnyansky	ffe1c832bd	cpu: introduce chunking for repack matmuls and enable matmul-id chunking on ARM64 (llama/16833) Very similar implementation to the flash-attention chunking, with similar benefits.	2025-11-09 23:38:03 +02:00
JJJYmmm	e1780b209d	model: add support for qwen3vl series (llama/16780) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit f321b9fdf13e59527408152e73b1071e19a87e71. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
Max Krasnyansky	f1fdb91e95	cpu: introduce chunking for flash attention (llama/16829) Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop on top that handles the chunks.	2025-11-09 23:38:03 +02:00
Sigbjørn Skjæret	f7dfa39104	cuda : fix argsort with 64k+ rows (llama/16849)	2025-11-09 23:38:03 +02:00
Jeff Bolz	887d984558	vulkan: Handle argsort with a large number of rows (llama/16851)	2025-11-09 23:38:03 +02:00
Oliver Simons	41f4daca57	Hide latency of bias and gate-loading (llama/16847) This is realised by loading them into registers before computation of the dot-product, effectively batching them together with said dot-product. As a lot of threads are alive here, the warp scheduler has enough threads available to effectively hide the cost of additionally loading those two floats.	2025-11-09 23:38:03 +02:00
Jeff Bolz	efe8099268	vulkan: Fuse rope+set_rows (llama/16769) This pattern appears in a lot of models, the rope operation is applied right before storing into the KV cache (usually on the K tensor). Add a path to some of the rope shaders that computes the destination address based on the set_rows tensor. Compile variants of the shader with D_TYPE of f16 (the usual KV cache type). Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs the fourth for the row indices. Add fused_ops_write_mask to indicate which intermediate tensors need to write their results to memory. Skipping writing the roped K value helps to allow more nodes to run concurrently. Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It rarely starts out that way in the graph. Add new backend tests.	2025-11-09 23:38:03 +02:00
Jeff Bolz	35a3fda240	vulkan: Update topk_moe fusion to handle gpt's late softmax (llama/16656) * vulkan: Update topk_moe fusion to handle gpt's late softmax Based on #16649. * Add ggml_check_edges * Add sync logging to show fusion effects * handle clamp added in #16655 * Update ggml/src/ggml-impl.h Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-11-09 23:38:03 +02:00
Ruben Ortlam	bc944bddc8	Vulkan MMQ Integer Dot Refactor and K-Quant support (llama/16536) * vulkan: add mmq q2_k integer dot support * Refactor mmq caching * Reduce mmq register use * Load 4 quant blocks into shared memory in one step * Pack q2_k blocks into caches of 32 * Use 32-bit accumulators for integer dot matmul * Add q4_k mmq * Add q3_k mmq * Add q5_k mmq * Add q6_k mmq * Add mxfp4 mmq, enable MMQ MUL_MAT_ID * Fix mmv dm loads	2025-11-09 23:38:03 +02:00
Max Krasnyansky	4d74160c9a	Hexagon Op queue & dispatch optimizations (llama/16820) * hexagon: remove dspqueue callbacks and do all read processing inplace * hexagon: there is no need to ref/deref the buffers at this point We're not going to release the buffers without flushing the session queue. So there is no need to inc/dec the refcounts for every request. We also don't need to include those bufs in the response. * hexagon: bump the thread count in the adb wrapper scripts We can use more CPU cores now that the dedicated dspqueue polling threads are not used (i.e. no contention). Also enable more aggressive polling for now since we still map Flash Attention (and a few other kernels) to the CPU and those dspqueue threads were keeping the CPU cores at higher clock freqs. * hexagon: add lhez as the second code owner	2025-11-09 23:38:03 +02:00
Aman Gupta	6051c704a0	CUDA: use fastdiv in set-rows (llama/16834) * CUDA: use fastdiv in set-rows * add assert about value fitting in u32	2025-11-09 23:38:03 +02:00
Jeff Bolz	82a23ca9c4	vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (llama/16793) This lets the copy to the destination device use the host-visible vidmem optimization.	2025-11-09 23:38:03 +02:00
Aman Gupta	5c316c48f7	CUDA: Fix bug in topk-moe for gpt-oss (llama/16821) * CUDA: Fix bug in topk-moe for gpt-oss When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse nodes (because the nodes are not used elsewhere in the graph), but it actually doesn't fuse in the actual gpt-oss * fix for qwen3 too * change ifndef to ifdef	2025-11-09 23:38:03 +02:00
YaelLogic	5850c952e5	sycl: add RMS_NORM_BACK operation support (llama/16808) * sycl: add RMS_NORM_BACK operation support * sycl: rms_norm_back: add dual reduction paths (FP64 and FP32) and savepoint before further changes * sycl: add RMS_NORM_BACK support Implement RMS_NORM_BACK for the SYCL backend using FP32 compensated parallel reduction. Minimal docs updates (ops.md / SYCL.csv). * revert: restore .gitignore and tools/run/CMakeLists.txt to upstream * revert: restore tests/CMakeLists.txt to upstream * sycl: optimize rms_norm_back * fix: restore SYCL.csv to correct state with RMS_NORM_BACK support * Update ggml/src/ggml-sycl/norm.cpp Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * fix: remove trailing whitespace and add missing newline (EditorConfig) --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2025-11-09 23:38:03 +02:00
YaelGitAccount	a983c9219d	cuda: add SET operation support (llama/16804) * feat(cuda): add GGML_OP_SET support Implement CUDA kernel for SET operation with f32 support. All tests passing (14598/14598). * cuda(set): add I32 support; keep F32 * refactor(cuda): use ggml_cuda_cpy to unify SET operator logic and remove code duplication * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-cuda/set.cu Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
l3utterfly	f863a42d97	initialise buffer.device in ggml_hexagon_session (llama/16816)	2025-11-09 23:38:03 +02:00
Chenguang Li	cb39359e7f	CANN: Improve device ID handling and aclnnArange checks (llama/16752) * cann: improve device ID handling and aclnnArange checks - Stop relying on CANN's internal device ID retrieval; use a global variable instead. - Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions. * cann: use thread local var	2025-11-09 23:38:03 +02:00
Aman Gupta	0c8ff48103	CUDA: add unused vars to mmvf and mmvq (llama/16807)	2025-11-09 23:38:03 +02:00
tamarPal	9664420a54	sycl: add SSM_CONV operation support (llama/16800) * feat: Add SYCL backend support for SSM_CONV operator * Implement State Space Model Convolution 1D for SYCL backend * Add optimized GPU kernel with parallel work distribution * Support various tensor dimensions and batch sizes * Full integration with existing SYCL infrastructure * All tests pass with CPU backend equivalence verification * feat: Implement SYCL backend support for SSM_CONV operation - Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp - Implement SYCL kernel for state space model convolution - Ensure numerical correctness matches CPU implementation exactly - Add proper type checking for F32 tensors in backend support - All test-backend-ops SSM_CONV tests pass (14490/14490) * Perfect SSM_CONV SYCL implementation - 100% CPU parity ✅ Flawless numerical accuracy - matches CPU bit-for-bit ✅ Optimal SYCL kernel design - efficient parallel execution ✅ Complete tensor layout compatibility - handles all strides correctly ✅ Robust error handling - comprehensive assertions and validation ✅ All official tests pass - 14,490/14,490 backend operations verified ✅ Production-ready code - clean, documented, maintainable Implements state-space model 1D convolution with sliding window algorithm. Eliminates blocking queue.wait() for better async performance. * Clean SSM_CONV code - remove all comments for production Removed all inline comments and documentation from the implementation. Clean, minimal code ready for production merge. * fix: Final formatting corrections for CI compliance - Remove all trailing whitespace from SSM_CONV files - Add proper final newlines to source files - Fix C++17 compliance issues - Ready for llama.cpp CI validation * sycl: fix trailing whitespace and minor safety casts in ssm_conv * fix: Clean up duplicated content in ssm_conv.hpp header file --------- Co-authored-by: tamarPal <tamarPal@example.com>	2025-11-09 23:38:03 +02:00
Acly	bcda7c3e58	ggml : fix interpolate with align-corners and ne=1 (llama/16700) * ggml : fix interpolate with align-corners and ne=1 * avoid division by zero if one of the spatial dimensions is 1 * cpu, cuda, opencl returned correct result anyway due to clamp * vulkan didn't clamp for align-corners so results were broken * fix clang warning	2025-11-09 23:38:03 +02:00
Johannes Gäßler	1471b1fda7	HIP: fix AMDGPU_TARGETS, update documentation (llama/16803)	2025-11-09 23:38:03 +02:00
tamarPal	0e1b6c5fc4	sycl: add ROLL operation support (llama/16665) * sycl: add ROLL operation support - Implement ggml_sycl_roll function for F32 tensors - Add multi-axis roll operation with SYCL kernel - Support all 4 tensor dimensions with proper shift normalization - Add roll.cpp and roll.hpp to SYCL backend - Update backend dispatch and supports_op for GGML_OP_ROLL - Tests: 17662/17662 pass with identical CPU reference results * fix: remove trailing whitespace from roll.cpp - Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp - Remove trailing spaces from lines 6, 11, 28, 47, 58, 60 * ci: retrigger * sycl: remove wait() calls from ROLL operation * fix: editorconfig — LF endings + final newline for roll.hpp --------- Co-authored-by: tamarPal <tamarPal@example.com>	2025-11-09 23:38:03 +02:00
shani-f	543221d824	sycl: add REPEAT_BACK operation support (llama/16734) * SYCL repeat_back v1 — add core op + switch case * Implement repeat_back SYCL operation and minor fixes * Update ggml/src/ggml-sycl/repeat_back.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-sycl/repeat_back.hpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-09 23:38:03 +02:00
Aman Gupta	97c3285cc4	CUDA: support for weight clamp in top-k norm (llama/16702)	2025-11-09 23:38:03 +02:00
Acly	bd8734c050	ggml-alloc : make gallocr prefer chunks that allow memory reuse (llama/16788)	2025-11-09 23:38:03 +02:00
Sigbjørn Skjæret	e6ff2bceed	cuda : use fast copy when src and dst are of different type and contiguous (llama/16789) * use fast copy when src and dst are contiguous and same shape * use int64_t ne and ignore shape	2025-11-09 23:38:03 +02:00
leejet	4f4246dcb4	ggml: fix cuda kernel launch configuration for k_compute_batched_ptrs to support large batch (llama/16744) * fix k_compute_batched_ptrs * add backend ops test * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * reduce the batch size --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-11-09 23:38:03 +02:00
Aman Gupta	9f75cc7eef	CUDA: General GEMV fusion (llama/16715)	2025-11-09 23:38:03 +02:00
Gilad S	c00ab7e5e6	vulkan: deduplicate Microsoft Direct3D12 devices (llama/16689) * fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver * style: indent * fix: decrease priority * fix: switch to `\|\|`	2025-11-09 23:38:03 +02:00
Giuseppe Scrivano	d0b544da70	vulkan: delete dead code (llama/16732) ggml_vk_create_buffer_temp is not used anywhere, and it is the only caller for ggml_vk_pool_malloc. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-11-09 23:38:03 +02:00
Jeff Bolz	070b24f65c	vulkan: Optimize SSM_SCAN (llama/16645)	2025-11-09 23:38:03 +02:00
leejet	5166efa7f0	ggml: fix CUDA grid launch condition for large block_nums.y in binbcast (llama/16742) * Fix CUDA grid launch condition for large block_nums.y * add backend ops test * reduce test repetitions	2025-11-09 23:38:03 +02:00
Aman Gupta	524046d4d1	CUDA: use CUB for arbitrary size argsort (llama/16754)	2025-11-09 23:38:03 +02:00
Aman Gupta	47efc4f115	ggml-cuda: use passed ops instead of hardcoded ops (llama/16712)	2025-11-09 23:38:03 +02:00
Matthew Michel	0a5b4c2e9b	sycl: use async memory allocation to fix crashes during graph recording (llama/16644) * sycl: use async memory allocation to fix graph recording failures GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because: - Host waits are currently unsupported in graph recording mode. - SYCL malloc / free calls are unsupported in graph recording mode. The following changes are made to fix SYCL graph functionality: - When graphs are enabled, use the SYCL async memory extension for temp buffers which is supported with SYCL graphs. - For compiler versions that do not support this extension, skip graphs with the affected op. - Switch from USM shared to device memory as the async extension currently just supports device allocations. * Address reviewer feedback * Use global async variable to decide path in sycl_ext_[malloc_device\|free]	2025-11-09 23:38:03 +02:00
Max Krasnyansky	8bb12395fe	Add experimental ggml-hexagon backend for the Hexagon NPU (llama/16547) * model: add support for extra bufs for all devices * hexagon: add experimental ggml-hexagon backend for the Hexagon NPU This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU. Highlights: - Supports Hexagon versions: v73, v75, v79, and v81 - Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5 - Supports Q4_0, Q8_0, MXFP4, and FP32 data types - Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX Note: This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from llama.cpp/ggml developer and user community. Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com> * hexagon: fix format checker errors * hexagon: update readme and cmake presets * ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions * hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input * hexagon: move ADB helper scripts into scripts/snapdragon/adb * hexagon: replace all f/printfs with GGML_LOG_... * readme: add hexagon to the list of supported backends * hexagon: stack matmuls with quantized inputs only * hexagon: add TODO for fixing issues in hexagon_graph_optimize * hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC * scripts: fix lint errors * scripts: update qdc pytest script to make linter happy * hexagon: add reduce sum in fp32 * hexagon: reduce number of vector stores in matmul output * hexagon: remove the need for vdelta in reduce-multiply-x8 * hexagon: consistent use of reduce_sum_fp32 for row_sums * hexagon: some more matmul optimizations and comments Optimize cases where tensor dims are not multiple of 1024 (e.g. in Qwen models). We've handled those cases already but at a higher overhead. * hexagon: update cmake presets * hexagon: add OPMASK support for run-bench.sh wrapper * hexagon: update to use GGML_BACKEND_API * hexagon: remove unused logic for setting tensor flags for the views * hexagon: add asserts to set/get_tensor to make sure we handle complete tensors Same asserts as the CPU backend. * hexagon: use cpy_tensor slow path for non-host buffers * hexagon: error checks in the buffer allocator * cmake: move include(extProj) under ggml-hexagon * hexagon: don't forget to delete the backend on free * hexagon: set/get_tensor size assert apply only to quantized tensors * hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way. Ideally we need a bit more finer log levels. * docs: typos in hexagon developer docs (libggm-...) * hexagon: overhaul error handling in the session/device allocation this should handle all failure paths in the session allocation. * hexagon: update cmake presets to enable fp16 vectors * hexagon: remove unused time_usec function * hexagon: don't forget to release buffer contexts * hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure) * hexagon: remove custom can_repeat function and use ggml_can_repeat --------- Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2025-11-09 23:38:03 +02:00
Diego Devesa	a2130ac501	Revert "ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_v…" (#16723) This reverts commit 19a5a3edfd306516cc419679d69d6435943b6816.	2025-11-09 23:38:03 +02:00
sirus20x6	773041e336	ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_vec_set_f32 for faster fills (llama/16522) * Leverage the existing GGML_F32_VEC helpers to broadcast the fill value across SIMD registers and store in vector-sized chunks, while retaining the scalar tail for leftover elements and non-SIMD builds. * Vectorize additional f32 helper loops * Normalize f32 helper tails for ggml vec ops --------- Co-authored-by: Aaron <shelhamer.aaron@gmail.com>	2025-11-09 23:38:03 +02:00
Aman Gupta	431aaf56f0	CUDA: fix bug in topk-moe softmax (llama/16711)	2025-11-09 23:38:03 +02:00
Aman Gupta	ba41a6ca6a	CUDA: topk-moe: add optional parameter for gpt-oss (llama/16649)	2025-11-09 23:38:03 +02:00
Johannes Gäßler	99cea274e5	CUDA: better error for FA kernel with 0 occupancy (llama/16643)	2025-11-09 23:38:03 +02:00
Aman Gupta	9a8cfb040c	ggml: add ggml_can_fuse_subgraph (llama/16662) * ggml: add ggml_can_fuse_subgraph * ggml-cuda: use ggml_can_fuse_subgraph for topk-moe * format * 1. remove inputs from signature as they are transient nodes 2. add check for views: view_src should be part of the subgraph * - combine check into one loop - check all view_src parents - other minor review comments * remove redundant if test * - rename and other minor review comments * add assert about count < 32	2025-10-22 12:58:11 +03:00
lhez	5c4c477d00	opencl: fix warnings and clean up profiling (llama/16688) * opencl: remove unused headers, fix warnings * opencl: clean up profiling, only keep kernel time	2025-10-22 12:58:11 +03:00
Jeff Bolz	7f16c71068	vulkan: Handle FA with all -inf mask values (llama/16447)	2025-10-22 12:58:11 +03:00
YehuditE	55cf00c20a	sycl : add PAD_REFLECT_1D operator support (llama/16145) * sycl: add PAD_REFLECT_1D operator support * docs(ops): regenerate docs/ops.md * remove trailing whitespaces * style: fix editorconfig issues — trim trailing spaces and normalize EOLs * fix: move PAD_REFLECT_1D case outside of fall-through block	2025-10-22 12:58:11 +03:00
Diego Devesa	70b4d22f01	ggml-alloc : fix leak when reusing a tensor with a larger size (llama/16679)	2025-10-22 12:58:11 +03:00
safranowith	bb76672081	SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators (llama/16613) * SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators Clean up unrelated changes from previous commit * Chore: remove empty lines and fix indentation * Clean up: remove leftover blank lines and fix spacing * chore: fix trailing whitespace and ensure final newline * Cleanup: remove redundant declarations already defined in header * Sync docs/ops.md with updated backend operation support * docs: update ops.md after rebase * docs: update ops.md - Vulkan supports SSM_CONV and SSM_SCAN	2025-10-22 12:58:11 +03:00
Aaron Teo	82bdf31267	ci : fix binaries release failure for s390x (binaries may not work yet) (llama/16664) * devops: initial patch Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: forgot the z15 suffix Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: attempt at impl GGML_CPU_ALL_VARIANTS for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: rm baseline version Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-10-22 12:58:11 +03:00
Johannes Gäßler	72d98011db	HIP: fix GPU_TARGETS (llama/16642)	2025-10-22 12:58:11 +03:00
Jeff Bolz	414901a42c	vulkan: Implement topk_moe fused shader, ported from CUDA (llama/16641) This is similar to the CUDA shader from #16130, but doesn't use shared memory and handles different subgroup sizes.	2025-10-22 12:58:11 +03:00
Aman Gupta	08345f15ec	CUDA: use registers instead of smem in topk-moe (llama/16647) Uses the technique used in the vulkan PR #16641. Neat trick!	2025-10-22 12:58:11 +03:00
Shawn Gu	8ffdf4bd96	opencl: transposed gemm/gemv moe kernel with mxfp4,f32 (llama/16602) * opencl: transposed gemm/gemv moe kernel with mxfp4,f32 * add restore kernel for moe transpose * fix trailing whitespaces * resolve compilation warnings	2025-10-22 12:58:11 +03:00
Radoslav Gerganov	6aa18cccd8	rpc : report actual free memory (llama/16616) * rpc : report actual free memory Start reporting the free memory on every device instead of using fixed values. Now llama-cli users can get a nice memory breakdown when using RPC devices. * drop --mem in rpc-server	2025-10-22 12:58:11 +03:00
Giuseppe Scrivano	d22008b631	vulkan: Add State Space Model (SSM) Operations Support (llama/16463) * vulkan: implement SSM scan operation Add State Space Model scan operation to the Vulkan backend. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> * vulkan: implement SSM conv operation Add State Space Model conv operation to the Vulkan backend. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-22 12:58:11 +03:00
muggle-stack	328263f8fd	ggml : fix SpaceMit IME array out-of-bounds in task assignment (llama/16629) Fix incorrect task-to-batch index calculation in the quantization phase. The bug caused out-of-bounds access to qnbitgemm_args array when compute_idx exceeded per_gemm_block_count_m, leading to invalid pointer dereferences and SIGBUS errors. Correctly map tasks to batches by dividing compute_idx by per_gemm_block_count_m instead of block_size_m. Example: batch_feature=1, gemm_m=30, block_size_m=4 per_gemm_block_count_m = 8, task_count = 8 Old: gemm_idx = 4/4 = 1 (out of bounds) New: gemm_idx = 4/8 = 0 (correct) Tested on SpaceMit K1 RISC-V64 with qwen2.5:0.5b model. Co-authored-by: muggle <mingjun.rong@spacemit.com>	2025-10-22 12:58:11 +03:00
Jeff Bolz	4a384826a8	vulkan: fix debug build (add_rms_len/data not found) (llama/16624)	2025-10-22 12:58:11 +03:00
Ilia Ilmer	0ae492641c	metal : add `CONV_TRANSPOSE_2D` (llama/16542) * initial: headers and metal-device.cpp updates * adding conv_transpose_2d * fix type * fix type: int32->int64 * Update ggml/src/ggml-metal/ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-metal/ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-metal/ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add checks for src[0] and src[1]; add type checks * Update ggml-metal.metal Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add more tests, add optimization to threading * add dynamic memory allocation in metal --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-22 12:58:11 +03:00
GittyBurstein	82332cea27	SYCL SET operator optimized for F32 tensors (llama/16350) * SYCL/SET: implement operator + wire-up; docs/ops updates; element_wise & ggml-sycl changes * sycl(SET): re-apply post-rebase; revert manual docs/ops.md; style cleanups * move SET op to standalone file, GPU-only implementation * Update SYCL SET operator for F32 * ci: fix editorconfig issues (LF endings, trailing spaces, final newline) * fixed ggml-sycl.cpp --------- Co-authored-by: Gitty Burstein <gitty@example.com>	2025-10-22 12:58:11 +03:00
GittyBurstein	7bb53032b3	sycl : add ARANGE operator (llama/16362) * SYCL: update element-wise ops and presets * clean arange * Re-trigger CI --------- Co-authored-by: Gitty Burstein <gitty@example.com>	2025-10-22 12:58:11 +03:00
Chenguang Li	fe965613c0	CANN: format code using .clang-format (llama/15863) This commit applies .clang-format rules to all source files under the ggml-cann directory to ensure consistent coding style and readability. The .clang-format option `SortIncludes: false` has been set to disable automatic reordering of include directives. No functional changes are introduced. Co-authored-by: hipudding <huafengchun@gmail.com>	2025-10-22 12:58:11 +03:00
takuya kodama	3c136d699a	ggml-cpu: replace putenv with setenv for const-correctness (llama/16573) ## Why it failed When compiling with strict compiler flags (-Wwrite-strings -Werror=discarded-qualifiers), the build fails with the following error: ``` cmake \ -S . \ -B ../llama.cpp.build \ --preset=x64-linux-gcc-debug \ -DCMAKE_INSTALL_PREFIX=/tmp/local \ -DCMAKE_C_FLAGS="-Wwrite-strings -Werror=discarded-qualifiers" && \ cmake --build ../llama.cpp.build/ ... /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c: In function ‘ggml_cpu_init’: /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3572:24: error: passing argument 1 of ‘putenv’ discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers] 3572 \| putenv("KMP_BLOCKTIME=200"); // 200ms \| ^~~~~~~~~~~~~~~~~~~ In file included from /home/otegami/work/cpp/llama.cpp/ggml/src/./ggml-impl.h:10, from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:6, from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/traits.h:3, from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:6: /usr/include/stdlib.h:786:26: note: expected ‘char *’ but argument is of type ‘const char *’ 786 \| extern int putenv (char *__string) __THROW __nonnull ((1)); \| ~~~~~~^~~~~~~~ cc1: some warnings being treated as errors ninja: build stopped: subcommand failed. ``` The issue is that putenv() expects a non-const char * but receives a string literal (const char *). ## How to fix This PR replaces putenv("KMP_BLOCKTIME=200") with setenv("KMP_BLOCKTIME", "200", 0). Benefits of setenv(): - Accepts const char * parameters (no qualifier warnings) - Makes copies of the strings (safer memory handling) - The third parameter (0) ensures we don't overwrite if already set	2025-10-22 12:58:11 +03:00
yael-works	f7b5ecf195	SYCL: Add GGML_OP_MEAN operator support (llama/16009) * SYCL: Add GGML_OP_MEAN operator support * SYCL: Fix formatting for GGML_OP_MEAN case * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-22 12:58:11 +03:00
safranowith	757d51d21d	cpu : add FLOOR, CEIL, ROUND and TRUNC unary operators (llama/16083) * CPU: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators - Added the operators to unary op enum - Implemented API functions - Implemented forward and unary-op logic in CPU backend - Updated ggml_get_n_tasks - Updated operators names array and static_assert - Updated docs and enabled automatic tests * docs: add documentation for ggml_trunc and ggml_trunc_inplace in ggml.h * chore: remove trailing whitespace from ggml.h * Remove unresolved merge markers * Apply review suggestions: cleanup formatting, enum order and leftover artifacts * Regenerate ops.md using create_ops_docs.py	2025-10-22 12:58:11 +03:00
lhez	bef9f74553	opencl: add q8_0 mm support (llama/16469) * opencl: add mm_q8_0_f32 * opencl: fix data loading for incomplete tile * opencl: use q8_0 mm for larger matrix * opencl: add some tests to cover the path	2025-10-22 12:58:11 +03:00
lhez	16dab3d122	opencl: fix FA for f32 (llama/16584)	2025-10-22 12:58:11 +03:00
Sam/Samuel	d8a146b0f9	metal: optimise `GGML_OP_SUM` (llama/16559) * optimise GGML_OP_SUM * add non-contiguous tests by permuting the input * change tests to require full contiguity of OP_SUM * cuda : add check GGML_OP_SUM --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-22 12:58:11 +03:00
Julius Tischbein	0c9d49927c	CUDA: Changing the CUDA scheduling strategy to spin (llama/16585) * CUDA set scheduling strategy to spinning for cc121 * Using prop.major and prop.minor, include HIP and MUSA * Exclude HIP and MUSA * Remove trailing whitespace Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Remove empty line Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-10-22 12:58:11 +03:00
Georgi Gerganov	8ed913da0e	metal : avoid using Metal's gpuAddress property (llama/16576) * metal : avoid using Metal's gpuAddress property * metal : fix rope kernels buffer check	2025-10-22 12:58:11 +03:00
SavicStefan	499f183e75	vulkan: Add ACC_TYPE_VEC2 implementation (llama/16203) Signed-off-by: Stefan Savic <stefan.savic@huawei.com> Co-authored-by: Stefan Savic <stefan.savic@huawei.com>	2025-10-15 09:29:17 +03:00
Aman Gupta	2eb9119754	CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (llama/16577)	2025-10-15 09:29:17 +03:00
Jeff Bolz	393fbbc80b	vulkan: Support FA with K/V in F32 (llama/16543)	2025-10-15 09:29:17 +03:00
Jeff Bolz	73e200ee85	vulkan: Improve build time for MSVC (llama/16545) Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel. Enable /MP so source files are compiled in parallel.	2025-10-15 09:29:17 +03:00
Johannes Gäßler	1bdd746bc8	CUDA: enable FA for FP32 KV cache (llama/16546)	2025-10-15 09:29:17 +03:00
Aman Gupta	f2075667fa	CUDA: use fastdiv + ggml_cuda_mad for mmvf (llama/16557) * CUDA: use fastdiv + ggml_cuda_mad for mmvf * use bf16 directly + fix formatting * Add exception for HIP code	2025-10-15 09:29:17 +03:00
Aman Gupta	b4c5c6f71f	CUDA: add fp kernel for larger batch size MoE (llama/16512) * CUDA: kernel for larger batch sizes for MoE * WIP * WIP * WIP * WIP * WIP * WIP * fixup * tests * Move mmq_ids_helper to mmid * cleanup * Remove redundant checks	2025-10-15 09:29:17 +03:00
Anav Prasad	a12848e8e9	cuda : remove legacy copy-op pointer indirection code (llama/16485) * remove legacy copy-op pointer indirection code * further removal of copy-op indirection code * renamed check_node_graph_compatibility_and_refresh_copy_ops function	2025-10-15 09:29:17 +03:00
Georgi Gerganov	25ac94a6cb	metal : FA support F32 K and V and head size = 32 (llama/16531) * metal : FA support F32 K and V and head size = 32 * graph : remove obsolete comment [no ci]	2025-10-15 09:29:17 +03:00
lhez	66b0fc2fb7	opencl: fix build targeting CL 2 (llama/16554)	2025-10-15 09:29:17 +03:00
Johannes Gäßler	77272fe0df	CUDA: fix numerical issues in tile FA kernel (llama/16540)	2025-10-15 09:29:17 +03:00
Jie Fu (傅杰)	8a9c2ba6a1	ggml : fix build broken with -march=armv9-a on MacOS (llama/16520) * ggml : fix build broken with -march=armv9-a on MacOS Signed-off-by: Jie Fu <jiefu@tencent.com> * Add #pragma message Signed-off-by: Jie Fu <jiefu@tencent.com> * Address review comment. Signed-off-by: Jie Fu <jiefu@tencent.com> * Update ggml/src/ggml-cpu/ggml-cpu.c --------- Signed-off-by: Jie Fu <jiefu@tencent.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-10-15 09:29:17 +03:00
Chenguang Li	417ecdddc5	CANN: fix CPU memory leak in CANN backend (llama/16549) This commit fixes a CPU-side memory leak issue in the CANN backend, which occurred when intermediate aclTensorList objects were not properly released after operator execution. The leak happened during repeated invocations of CANN ops (e.g., FlashAttention), leading to increasing host memory usage over time. Proper resource cleanup (aclDestroyTensorList and related release logic) has been added to ensure that all temporary tensors are correctly freed.	2025-10-15 09:29:17 +03:00
Sam/Samuel	bfd88b8b6e	metal: add support for opt_step_sgd (llama/16539) * metal: add support for opt_step_sgd * add newline to pass EditorConfig check	2025-10-15 09:29:17 +03:00
Georgi Gerganov	ccac1b4772	ggml : fix scalar path for computing norm (llama/16558)	2025-10-15 09:29:17 +03:00
hipudding	53e21364a6	CANN: Update several operators to support FP16 data format (llama/16251) Many Ascend operators internally use FP16 precision for computation. If input data is in FP32, it must first be cast to FP16 before computation, and then cast back to FP32 after computation, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less workload compared to FP32, leading to noticeable efficiency improvements. In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5b model shows correct accuracy and about 10% performance gain in concurrent scenarios. Co-authored-by: noemotiovon <757486878@qq.com>	2025-10-15 09:29:17 +03:00
Sam/Samuel	7f22fe5d8f	metal : add opt_step_adamw and op_sum (llama/16529) * scaffold to support opt step adamw on metal (not written so far) * add opt-step-adamw kernel for metal * pass op->src[4] as a separate buffer to the pipeline * add bounds check to opt-step-adamw kernel * complete scaffold for GGML_OP_SUM * naive GGML_OP_SUM kernel * remove unwanted comment * change OP_SUM capability gate * Add has_simdgroup_reduction to both ops to pass CI	2025-10-15 09:29:17 +03:00
Neo Zhang Jianyu	be778c992f	fix UT fault cases: count-equal, argsort, pad OPs (llama/16521) * fix/refactor OP argsort, pad * fix count-equal op * update SYCL OP list * fix format issue --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-10-15 09:29:17 +03:00
sirus20x6	70eb30f28e	ggml : Fix FP16 ELU positive branch (llama/16519) Co-authored-by: Aaron <shelhamer.aaron@gmail.com>	2025-10-15 09:29:17 +03:00
sirus20x6	53721d6309	ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (llama/16518) The previous SVE implementation for `ggml_vec_dot_f16_unroll` contained a bug due to a copy-paste error. The wrong variable was used in an FMA instruction, leading to incorrect results. This commit corrects the variable usage and improves the clarity of the code by renaming variables to avoid confusion. Co-authored-by: Aaron <shelhamer.aaron@gmail.com>	2025-10-15 09:29:17 +03:00
Johannes Gäßler	b5fb9b9f58	CUDA: faster tile FA, add oob checks, more HSs (llama/16492)	2025-10-15 09:29:17 +03:00
Georgi Gerganov	d201705e71	metal : fix mul-mm condition + fix mul-mv permuted kernels (llama/16494)	2025-10-12 11:16:23 +03:00
Diego Devesa	1cc342427b	cuda : avoid initializing unused devices (llama/16510)	2025-10-12 11:16:23 +03:00
Prajwal B Mehendarkar	d8f1aa4e1d	cmake : Dont define XOPENSOURCE on AIX (llama/16481)	2025-10-12 11:16:23 +03:00
duduta	d83fef35df	cpu : optimize the ggml NORM operation (llama/15953) * ggml-cpu: optimize norm operation to use intrinsics or Accelerate rename function add endif macro comment Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Aaron Teo <taronaeo@gmail.com> * implement s390x SIMD suggested by @taronaeo * add TODO comment * tidy up spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2025-10-12 11:16:23 +03:00
Chenguang Li	b9eac9419c	CANN: Improve ACL graph matching (llama/16166) * CANN: improve ACL graph matching Record `ne` and `nb` information for src tensors and include them in the graph matching check. This enhances the robustness of ACL graph matching by preventing incorrect matches when src tensors share the same data address but differ in shape or stride. * CANN: add op_params match	2025-10-12 11:16:23 +03:00
Charles Xu	c8b2c56fd2	kleidiai: kernel interface refactoring (llama/16460)	2025-10-12 11:16:23 +03:00
Neo Zhang Jianyu	7df6766b63	refactor soft_max, add soft_max_back (llama/16472) * refactor to support soft_max_ext * fix error and support soft_max_back * rm unused functions * fix format issue --------- Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>	2025-10-12 11:16:23 +03:00
ai-fonsi	21e6e72a2f	Disable CUDA host buffers on integrated GPUs (llama/16308)	2025-10-12 11:16:23 +03:00
Georgi Gerganov	7ef78a72e1	metal : mark FA blocks (llama/16372) * metal : better unroll in the FA kernels * metal : index FA blocks * tests : restore [no ci] * metal : prevent division by zero in FA kernels * metal : fix -INF detection logic	2025-10-12 11:16:23 +03:00
Reese Levine	4eea3efc49	ggml webgpu: profiling, CI updates, reworking of command submission (llama/16452) * Add profiling * More detailed profiling * Rework command submission to avoid global locks * Update wait handling * try new method of waiting on futures * Add serializing of command submission in some cases * Add new pool for timestamp queries and clean up logging * Serialize command submission in CI and leave a TODO note * Update webgpu CI * Add myself as WebGPU codeowner * Deadlock avoidance * Leave WebGPU/Vulkan CI serialized * Fix divide by 0 * Fix logic in division by inflight_threads * Update CODEOWNERS and remove serialize submit option	2025-10-12 11:16:23 +03:00
Georgi Gerganov	4bce4fa5e9	metal : add support for non-padded FA KV (llama/16148) * metal : pad K, V and Mask when needed * cont : simplify * cuda : add TODO about KV padding requirement * metal : add comments * metal : remove mask padding requirement	2025-10-12 11:16:23 +03:00
Georgi Gerganov	6cf0c21b09	tests : add -INF blocks to the KQ mask in the FA tests (llama/16380) * tests : add -INF blocks to the KQ mask in the FA tests * cont : bump -INF block size to 64 Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * ggml : prevent division by zero in FA CPU op --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-12 11:16:23 +03:00
Georgi Gerganov	1a4116f942	metal : various optimizations + refactoring (llama/16446) * metal : ssm_scan minor opts * metal : get_rows optimize * metal : cpy optimize * metal : ssm_conv opt * metal : ssm_scan simplify * metal : ssm_Scan opt	2025-10-12 11:16:23 +03:00
Georgi Gerganov	0e431b3cea	ggml : fix unaligned access in AMX code (llama/16315)	2025-10-12 11:16:23 +03:00
Daniel Bevenius	0f29d7c3fa	ggml-cpu : fix leftover handling in ggml_vec_scale_f32 for SVE (llama/16443) This commit updates the leftover handling in ggml_vec_scale_f32. The motivation for this is that the code currently incorrectly assumes there would be fewer than ggml_f32_epr leftover elements. However, since the main loop processes 2*ggml_f32_epr elements per iteration, there can be up to (2*ggml_f32_epr - 1) leftover elements. The original single-pass leftover code could only process ggml_f32_epr elements, leaving some elements unscaled. Example scenario with 256-bit SVE: ``` ggml_f32_epr = 8 (elements per register) ggml_f32_step = 16 (two registers per iteration) n = 25 np = 16 leftovers = 9 elements (16-24) Original : processes only elements 16-23, misses element 24 This commit : loop processes elements 16-23, then element 24 ``` Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630	2025-10-12 11:16:23 +03:00
Reese Levine	b8bdf06182	ggml webgpu: actually add softmax, fix rms_norm offset (llama/16400) * implement soft_max * Fix soft_max data race * Temporary fix, wait on each submit	2025-10-12 11:16:23 +03:00
Eve	2ca8fa37fa	vulkan: use a more appropriate amount of threads when generating shaders (llama/16418) * use a more flexible amount of threads * fix windows compile and 0 thread case * nominmax	2025-10-12 11:16:23 +03:00
Radoslav Gerganov	93882335a8	rpc : check src buffer when copying tensor (llama/16421) Only dst buffer is guaranteed to be an RPC buffer. Add check for the src one.	2025-10-12 11:16:23 +03:00
Radoslav Gerganov	af51bbab88	rpc : add support for multiple devices (llama/16276) * rpc : add support for multiple devices Allow rpc-server to expose multiple devices from a single endpoint. Change RPC protocol to include device identifier where needed. closes: #15210 * fixes * use ggml_backend_reg_t * address review comments * fix llama-bench backend report * address review comments, change device naming * fix cmd order	2025-10-12 11:16:23 +03:00
Acly	49e0a426f3	vulkan : incremental shader builds (llama/16341) * vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times * support dep-files so shaders are recompiled if their included files change * rename shader files which are used as "headers" to use .glsl extension * move glslc extension detection shaders to separate folders * the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled * vulkan : only write embedded shader .hpp/.cpp when they change * avoid recompiling ggml-vulkan.cpp when editing shaders * pass single --source argument instead of --input-dir & --filter to shader gen * check for source file match earlier * fix hang in vulkan-shaders-gen when there are compilation errors * early out did not decrement compile_count * clean up * fix glslc integer dot product test * unconditionally write the embedded shader cpp output * replace output filepath in generated dep-files to match output in CMakeLists --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-10-12 11:16:23 +03:00
Georgi Gerganov	93c1305565	metal : fix loop bound in ggml_mem_ranges (llama/16412)	2025-10-12 11:16:23 +03:00
Acly	a70144a873	ggml : fix graph reallocation with multiple chunks (llama/16396) reallocation is needed if a single chunk grows in size, even if total allocation size stays the same or is lower	2025-10-12 11:16:23 +03:00
Jeff Bolz	2e6888089f	vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (llama/16354) * vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers. The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed beyond that limit. This allows > 4GB buffers to be allocated on some implementations (e.g. NVIDIA) and tensors this large can be used for im2col and mul_mat. For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange. I'm not sure this check is ideal, but we always use these buffers as a single full size binding and the limit may be smaller than maxMemoryAllocationSize or maxBufferSize, so I think this is reasonable. Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range. The maxStorageBufferRange may be smaller than the maxBufferSize or maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and it's invalid usage if VK_WHOLE_SIZE computes a range larger than maxStorageBufferRange. With this change, it should be possible to generate videos using wan networks in stable-diffusion.cpp. * vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull	2025-10-12 11:16:23 +03:00
Jeff Bolz	90bdcf2ef6	vulkan: Fix FA coopmat1 invalid array indexing (llama/16365) When computing sinks, the cm1 shader was looping r from 0 to Br rather than to rows_per_thread. I must have copied this from the scalar path (where it is correct), and somehow it wasn't causing failures on current drivers.	2025-10-12 11:16:23 +03:00
Jeff Bolz	fd11cd97ab	vulkan: in flash attention, bounds check against nem1 (don't rely on GGML_KQ_MASK_PAD) (llama/16316)	2025-10-12 11:16:23 +03:00
Reese Levine	27ebde6afd	ggml webgpu: add support for soft_max, optimize rms_norm (llama/16357) * Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-12 11:16:23 +03:00
Piotr Wilkin (ilintar)	33ca8355c4	model : Apertus model implementation (llama/15852) * First attempt * No permute during convert (fixes qk tensors), proper norm application. * RoPE = NeoX * Coherence! * Migrate xielu params from tensors to hyperparameters * Simple CUDA kernel * Revert stupid LLM refactorings * Chat template support * configchecker / flake8 errors * Reorder unary.cu * I do conclude that LLMs are, in fact, stupid. * Fix after merge * Final newline * Make xIELU an UNARY_OP * Final newline * Correctly account for parameter shift * Argh. * Update ggml/src/ggml-cpu/unary-ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Refactor: remove unused methods, inline and factorize softplus, add const modifiers * Revert CUDA changes, implement xIELU as a separate OP * Pesky newline * Add float2half / half2float for F16 inputs/outputs * CUDA variants, attempt 2 * Actually, attempt 3 * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Missing convert header * Proper formula and reference for xIELU in the comments. * Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add tensor mappings for Apertus to global list instead * Fix lazy on scalars * Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Add comment about the constraints on positive/negative alpha * Change `softplus` to `ggml_softplus` --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-12 11:16:23 +03:00
R0CKSTAR	e29508be8b	musa: update compile flags (llama/16265) Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>	2025-10-12 11:16:23 +03:00
uvos	b73f67d3f6	HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.0 (llama/16221) * HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.0 rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA * CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn	2025-10-12 11:16:23 +03:00
Eve	b0560310aa	vulkan: make ggml_vk_default_dispatcher support older vulkan headers (llama/16345) * make ggml_vk_default_dispatcher support older vulkan headers * simplify with using	2025-10-12 11:16:23 +03:00
lhez	31bb869929	opencl: support pad_ext (llama/15888)	2025-10-12 11:16:23 +03:00
Reese Levine	8208cea829	ggml webgpu: support for rope,div,sub,glu,scale,cont operators (llama/16187) * Work on rope * Simplify inplace operation generation and combine mul/add generation * Work on rope variants * implement neox rope * rope complete * Add sub,div,glu operators * implement scale op * Update cpy shader to handle cont/more types * formatting * Update test vars printing for rope,rms_norm * Avoid ROPE hardcoded constants * Add TODO to change ROPE constants to enum Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix TODO comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-12 11:16:23 +03:00
lhez	199626d79e	opencl: support ne3 in get_rows (llama/15866)	2025-10-12 11:16:23 +03:00
Georgi Gerganov	527ff158d0	ggml : bump version to 0.9.4 (ggml/1363)	2025-09-30 13:54:08 +03:00
anavp-nvidia	62b3b86e3f	cuda : Enable CUDA Graph usage for Nemotron Nano v2 (NemotronH) (llama/16328) * Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs * fix to ensure test-backend-ops check passes	2025-09-30 12:31:04 +03:00
Georgi Gerganov	78f85f2b92	metal : dynamic simdgroups for MV kernels (llama/16340) * metal : dynamic simdgroups for MV kernels * cont : minor	2025-09-30 12:31:04 +03:00
Charles Xu	01e86b69ab	kleidiai : fix work size and threads sync for fp16 (llama/16246)	2025-09-30 12:31:04 +03:00
alex-spacemit	35ebdf7304	ggml: riscv: add riscv spacemit backend (llama/15288) * ggml: add spacemit backend Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23 * add new line at end of file Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2 * add riscv zba extension limit Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce * fixed for review comments, file renamed and format Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce * fixed for code format, after clang-format Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2 * use _Float16 instead of __fp16 Change-Id: I039fb02bb95270e641bc4442204e658735859d43 * add ci for riscv64-spacemit-ime-native Change-Id: I711c1033061df1a289ea77891b2997599dfe8279 * update debian-13-riscv64-spacemit-ime-native ci label Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a * remove license comment for spacemit ime Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3 * upgrade binutils for gcc ime Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45 * add spacemit ime cross jobs Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6 * remove native compile for riscv64-spacemit-ime Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e * ci : add caching for spacemit ime cross toolchain Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de * ci: bug fixed for cache path and env Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a * Update .github/workflows/build-linux-cross.yml for cache path Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * bugfixed for build-linux-cross.yml, syntax error Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: cailinxi <linxi.cai@spacemit.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-30 12:31:03 +03:00
Rafal Lewczuk	94fe9bbe2b	ggml-backend : add root cause in error message if loading backend library fails (llama/16172) This PR adds additional information to an error message when loading backend library via ld_load_library() fails. This helps spotting why backend library did not load (missing library, missing dependency or unresolved symbol etc.).	2025-09-30 12:31:00 +03:00
Georgi Gerganov	22c12ee86d	ggml : remove obsolete files (#0)	2025-09-29 16:47:30 +03:00
Georgi Gerganov	3201382792	cmake : remove metal flag (llama/0)	2025-09-29 15:18:13 +03:00
Sigbjørn Skjæret	112e10f2e4	ggml : check cuda and metal argsort limits and add test (llama/16323) * check cuda argsort limits and add test * add metal check	2025-09-29 15:18:12 +03:00
Georgi Gerganov	7ce0a7bcd0	ggml : fix dependencies for ggml_set_rows (llama/16318)	2025-09-29 15:18:12 +03:00
Jeff Bolz	a375e4c4d2	vulkan: Fix validation failure in quantized flash attention (llama/16292)	2025-09-29 15:18:12 +03:00
Sigbjørn Skjæret	5c6e795607	ggml : fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 (llama/16307) * fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 * add test that fails on simd	2025-09-29 15:18:12 +03:00
Jeff Bolz	55d45edf6d	vulkan: 64-bit im2col (llama/16135) * vulkan: 64-bit im2col Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp. * fix validation error for large im2col	2025-09-29 15:18:12 +03:00
Georgi Gerganov	0102733cca	metal : extend mat-mat multiplication support (llama/16225) * metal : support mul_mm with src1->type == GGML_TYPE_F16 * metal : support mul_mm_id with src1->type == GGML_TYPE_F16 [no ci] * metal : mul_mm support ne00 % 32 != 0 * metal : support mul_mm_id with ne00 % 32 != 0 * cont : remove unnecessary unrolls * cont : simplify data loading * metal : optimize mul_mm when output bounds checks are not needed	2025-09-29 15:18:12 +03:00
Georgi Gerganov	45976f2857	metal : fuse non-sequential nodes (llama/16102) * metal : fuse non-sequential nodes * cont : add comment * cont : simplify bounds checks	2025-09-29 15:18:12 +03:00
Jeff Bolz	91ab93b756	vulkan: handle mat_mul with A matrix > 4GB (llama/16176) * vulkan: handle mat_mul with A matrix > 4GB This change splits mat_mul operations with huge A matrix into chunks in the M dimension. This works well for stable-diffusion use cases where the im2col matrix has very large M. Fix the order of setting the stride in mul_mm_cm2 - setting the dimension clobbers the stride, so stride should be set after. * build fixes	2025-09-29 15:18:12 +03:00
Jeff Bolz	eb982dd786	vulkan: support arbitrary KV dimension in flash attention (llama/16160) The "Clamp" spec constant is already based on whether KV is a multiple of Bc, so use that to control whether bounds checking is performed. Add bounds checking to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V tensors are already optionally clamped, nothing else needed to be changed).	2025-09-29 15:18:12 +03:00
Acly	bc1ac13c2f	vulkan : make the vulkan.hpp dynamic dispatcher instance private (llama/16224) * don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same	2025-09-29 15:18:12 +03:00
Aman Gupta	85e4455cd3	CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 (llama/16277) * CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 This commit adds mul_mat_id support for ncols_dst >= 16. It does this by packing ncols_dst tiles into the blockDim.y. My tests on a RTX 3090 show that this is faster than the cuBLAS fallback for f16 till bs=64, and for f32 till bs=32 * Review: refactor if statement	2025-09-29 15:18:11 +03:00
Johannes Gäßler	e856483cd6	CUDA: refactor and deduplicate vector FA kernels (llama/16208) * CUDA: refactor and deduplicate vector FA kernels	2025-09-29 15:18:11 +03:00
Dmytro Minochkin	88dd9e0d45	vulkan: throw system error instead of SIGABRT during init on older devices (llama/16156) * Throw system error on old Vulkan driver rather than SIGABRT * Optionally handle any potential error in vulkan init	2025-09-29 15:18:11 +03:00
Jeff Bolz	97bd65f90f	vulkan: support GET_ROWS for k-quants (llama/16235) The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few changes - add a_offset and divide iqs by 2. It's probably possible to call these functions from mul_mm_funcs and avoid the duplication, but I didn't go that far in this change.	2025-09-29 15:18:11 +03:00
Aaron Teo	23b3598952	devops: add s390x & ppc64le CI (llama/15925) * devops: move s390x and ppc64le ci build we have access to ubuntu-24.04-s390x and ppc64le images now Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable ppc64le for now since they have compiler errors Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: stop warnings as errors Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: switch to non-macro flag Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: going the llama macro route Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add big-endian gguf test models Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable ppc64le to test s390x, check test build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: dup .gguf.inp files for big-endian tests Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: dup .gguf.out files for big-endian too Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add python setup and endian byteswap Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: pooring thing does not have s390x python3 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add missing rust compiler for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: try rust actions runner Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "devops: try rust actions runner" This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: try a different path for rust Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: dump home directory and user info Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: install gguf-py only Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: missed relative path Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove big-endian files since local swapping is working Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: revert test-tokenizer-0 cmakelists Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix unicode flags conversion from and to uint16_t Bitfields are allocated in different order on s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Simplify byteswap command Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix endianness detection in vocab loader Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Disable test-thread-safety on s390x In this test a model is downloaded, then immediately loaded to check if more downloads are needed, and then used for test. There is no clean way to separate all those steps to add byteswapping between them, so just skip this test. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix q8_0 test in test-quantize-fns vec_signed uses unexpected rounding mode. Explicitly use different rounding function. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add big-endian stories260K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add s390x test-eval-callback Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix test does not exist Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix model not found llama-eval-callback Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix q3_K dot product error in test-quantize-fns on s390x Array q8bytes had only 4 elements allocated, but 8 elements accessed. 
This lead to write out of bounds and later read of overwritten values out of bounds and incorrect result. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: re-enable ppc64le for testing Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: activate test-thread-safety for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable ppc64le tests for some reason it keeps failing test-thread-safety tests and I do not have a machine that is able to replicate the tests. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: LLAMA_FATAL_WARNINGS=ON Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Correct repository URL for s390x for test-thread-safety model Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix fs_get_cache_directory Ensure it works even if both XDG_CACHE_HOME and HOME are unset. This might happen in containers. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Re-enable CI for ppc64le Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fortify ggml_rope_impl Only memcpy data from sections argument if it's non-NULL. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way * Update URL for big-endian model * Update .github/workflows/build.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update remaining mentions of BE models to ggml-org/models repo --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com> Co-authored-by: Aleksei Nikiforov <103434461+AlekseiNikiforovIBM@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-29 15:18:11 +03:00
Georgi Gerganov	670d54ef5d	metal : report OOM errors (llama/16274)	2025-09-29 15:18:11 +03:00
Adrien Gallouët	9823c5cc51	common : use cpp-httplib as a cURL alternative for downloads (llama/16185) * vendor : update httplib Signed-off-by: Adrien Gallouët <angt@huggingface.co> * common : use cpp-httplib as a cURL alternative for downloads The existing cURL implementation is intentionally left untouched to prevent any regressions and to allow for safe, side-by-side testing by toggling the `LLAMA_CURL` CMake option. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml : Bump to Windows 10 Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-09-29 15:18:11 +03:00
Aaron Teo	89a7b4d22c	ggml-cpu: implement MXFP4 SIMD for s390x (llama/16193) * ggml-cpu: impl mxfp4 s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: missing s = sumf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix incorrect kval_mxfp4 type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rework mxfp4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: missing delta calc Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix typo Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix typo for vec_splats Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: expand to 2 blocks per loop Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add unroll to boost perf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: back to 1 block per loop to test perf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: back to 1 block per loop to test perf" This reverts commit 1fe55724e2dc295701101bf838bdd4a512237492. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rm unroll from single block Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-29 15:18:11 +03:00
R0CKSTAR	98ac209ae1	musa: fix build warnings (llama/15611) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-09-29 15:18:10 +03:00
Aman Gupta	d9bf63cfb8	CUDA: add a fused top-K MoE kernel (llama/16130) * CUDA: add a fused top-K MoE kernel This kernel does the following: 1. softmax over the logits per token [n_experts, n_tokens] 2. argmax reduce over the top-k (n_experts_used) logits 3. write weights + ids to global memory It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models * Refactor into ggml_cuda_should_use_topk_moe * Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before * Review: format + micro-optimizations * Fix bug: fix tie breakers * Add optional norm + clean-up code * Use smem for final write * Add bounds check * Use better memory pattern for writeback	2025-09-29 15:18:10 +03:00
junchao-zhao	24ea5476de	ggml : fix loongarch lsx compilation error (llama/15864)	2025-09-29 15:18:10 +03:00
Daniel Bevenius	611ff19f20	ggml : remove -dev suffix from release version (ggml/1355) This commit removes the `-dev` suffix from the version string in CMakeLists.txt and the release script. The version will now be just be formatted as `MAJOR.MINOR.PATCH`.	2025-09-29 15:18:10 +03:00
Daniel Bevenius	06d7b3d124	ggml : bump version to 0.9.3 (ggml/1353)	2025-09-29 15:18:10 +03:00
Georgi Gerganov	ac678efb35	metal : fuse NORM + MUL + ADD, support non-multiples of 4 (llama/16220) * metal : fuse NORM + MUL + ADD * metal : support norms of non-multiple of 4 * cont : fix comment [no ci]	2025-09-29 15:18:10 +03:00
Georgi Gerganov	268f1c961b	metal : relax reorder conditions (llama/16216)	2025-09-29 15:18:10 +03:00
Georgi Gerganov	0a5b811f2e	metal : restore im2col perf (llama/16219)	2025-09-29 15:18:10 +03:00
Radoslav Gerganov	0946619662	rpc : use ggml logging facilities Use RPC_DEBUG environment variable to enable debug messages. Add helper macro LOG_DBG() which does an early check of the env var before calling GGML_LOG_DEBUG(). Make sure we log a debug message for every server function.	2025-09-29 15:18:10 +03:00
Johannes Gäßler	cd431223e0	llama: print memory breakdown on exit (llama/15860) * llama: print memory breakdown on exit	2025-09-29 15:18:10 +03:00
Acly	5069c08034	ggml : split graph allocations according to backend max buffer size (llama/15815) * ggml : make gallocr respect the backend's max buffer size * if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers * vulkan: report the actual max allocation size in buffer type interface * fix missing newline, apple-clang warning * track size of individual chunks in ggml_dyn_tallocr and raise max chunks. revert to use suballocation_block_size as max chunk size for vulkan. * track (chunk, offset) pairs instead of "global" offsets through gallocr. * simpler, don't need loops to map between local/global offsets * touches more code * fix dyn_tallocr_max_size and initialization * fix memory leak when buffers are reused due to same buffer type appearing multiple times * make vbuffer allocation follow the same logic as backend_buffer did before * continue to use leftover unallocated space of previous chunks after a new one has been created * treat free blocks of each chunk as separate list * they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges * exhaust freed blocks of all chunks before considering their last blocks with unallocated space * start with 0 chunks/blocks and create chunks as needed * allow the last chunk to grow beyond max size * refactor: move adding new free block and new chunk into separate functions * allocate chunks individually with a separate free-blocks list for each one * needs a bit more memory/allocations/indirections, but code is simpler * fix warnings (missing static) & debug checks	2025-09-29 15:18:09 +03:00
Xiangyan Sun	41245891c1	ggml-cpu: Respect cpumask settings (llama/16164)	2025-09-29 15:18:09 +03:00
Sigbjørn Skjæret	73e8f3acb8	ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (llama/15928) * fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl * change initialization to true	2025-09-29 15:18:09 +03:00
Aaron Teo	c706a50746	zdnn: refactor codebase + add docs (llama/16178) * zdnn: initial matmul refactor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm static from funcs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: update ggml-zdnn.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: change header files to hpp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: switch to common.hpp Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: move mulmat forward around Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm inline from utils Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: code cleanup Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: add zDNN docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-29 15:18:09 +03:00
Daniel Bevenius	d8d31e3638	ggml-cpu : fix typo in gemm comments [no ci] (llama/16189)	2025-09-29 15:18:09 +03:00
Sigbjørn Skjæret	4e32ee733b	ggml : implement set_rows with i32 index (llama/16159) * implement set_rows with i32 index * template fix * test quantized path warnings-- * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * forgotten name change * deduplicate cuda/sycl and test-fix * indent++ * vulkan: support set_rows with i32 index type (llama/16162) * disable i32 index for webgpu for now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-29 15:18:09 +03:00
Georgi Gerganov	df672c6372	ggml : extend ggml_can_fuse to work with non-sequential nodes (llama/16123) * ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph * cont : fix wrong bounds check condition * cont : remove unnecessary overload	2025-09-29 15:18:09 +03:00
Georgi Gerganov	973054a8cd	ggml : add ggml_op_is_empty (llama/16122) * ggml : add ggml_op_is_empty * ggml : move to ggml-impl.h	2025-09-29 15:18:09 +03:00
Shin-myoung-serp	9f673df08d	Vulkan: add conv_transpose_2d operation (llama/16022) * Vulkan: add conv_transpose_2d operation * Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L) * Vulkan: fix incorrect indentation in conv_transpose_2d shader * Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation * Vulkan: revert the order of the index calculation and bound check in conv_2d shader * Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation. * Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.	2025-09-29 15:18:09 +03:00
Jeff Bolz	14723f25a1	vulkan: add RTE variants of exp shader (llama/16165) This fixes some failures on Turing where "round to zero" rounds to the max f16 value but the CPU reference value is infinite.	2025-09-29 15:18:08 +03:00
Ruben Ortlam	95b29fab78	vulkan: vec dot matrix multiplication fix (llama/16151) * vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching * add odd m/n + odd k test with batching	2025-09-29 15:18:08 +03:00
lhez	4b7f09ac0b	opencl: fix concat crash on win arm64 with Adreno (llama/15944)	2025-09-29 15:18:08 +03:00
lhez	0a7096f4f3	opencl: initial `q8_0` mv support (llama/15732)	2025-09-29 15:18:08 +03:00
Giuseppe Scrivano	eae2be0ca2	vulkan: optimize UMA buffer operations and fix driver hangs (llama/16059) * vulkan: optimize UMA buffer operations and fix driver hangs The previous implementation was blocking the GPU for extended periods, causing the i915 driver to reset the context due to the hangcheck protection. [32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114] [32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang * vulkan: implement deferred_memset on UMA --------- Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-09-29 15:18:08 +03:00
Jeff Bolz	9a6c2036a9	vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (llama/16086)	2025-09-29 15:18:08 +03:00
Georgi Gerganov	8d10ded025	ggml : prepare for development of 0.9.2-dev	2025-09-29 15:18:08 +03:00
Georgi Gerganov	d89164a08d	ggml : bump version to 0.9.1	2025-09-29 15:18:05 +03:00
Ruben Ortlam	76d0934287	vulkan: use vec dot for matrix matrix multiplications (llama/16056) * vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions * use fma instead of dot to fix Nvidia and Apple performance issues	2025-09-20 13:46:39 +03:00
Xuan-Son Nguyen	2ad00d5586	ggml : refactor forward_dup for cpu backend (llama/16062) * ggml : refactor forward_dup for cpu backend * clean up a bit * add quant/dequant perf test	2025-09-20 13:46:39 +03:00
Adrien Gallouët	4d8cd07825	ggml-amx : fix ggml_amx_init() on generic Linux (llama/16049) Generalize Linux check to `__linux__` to support non-glibc systems (like musl). Also, return `false` on unknown/untested OS. Without this commit, the code compiles (with warnings) but fails: register_backend: registered backend CPU (1 devices) register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C) build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug) system info: n_threads = 8, n_threads_batch = 8, total_threads = 16 .... print_info: n_ctx_orig_yarn = 262144 print_info: rope_finetuned = unknown print_info: model type = 4B Illegal instruction (core dumped) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-09-20 13:46:39 +03:00
Adrien Gallouët	4575f96873	cmake : fix static linking for OpenMP on Unix-like systems (llama/16031) When compiling with GGML_STATIC=ON, the build process would produce a binary that was still dynamically linked to OpenMP. This defeats the purpose of a static build: $ cmake -B build \ -DBUILD_SHARED_LIBS=OFF \ -DLLAMA_CURL=OFF \ -DGGML_CCACHE=OFF \ -DGGML_NATIVE=OFF \ -DGGML_STATIC=ON $ ldd llama-server linux-vdso.so.1 (0x0000e1a434e3b000) libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000) libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000) libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000) libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000) libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000) /lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000) This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES` to prioritize `.a` files, forcing CMake to link the static version of the library. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-09-20 13:46:39 +03:00
Shawn Gu	f4a225cea6	opencl: optimize mxfp4 kernels (llama/16037) - flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut) - MoE kernel optimizations --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2025-09-20 13:46:39 +03:00
Jeff Bolz	7fcb7e83ec	rename optimize_graph to graph_optimize (llama/16082)	2025-09-20 13:46:39 +03:00
Bowen Han	fce6354e0f	CUDA: Optimize PAD_REFLECT_1D (llama/15957) * CUDA: Optimize PAD_REFLECT_1D feat: add more test cases for PAD_REFLECT_1D * use fast_div to improve performance * Apply suggestion from JohannesGaessler Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Apply suggestion from JohannesGaessler Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * optimize * use a concise expression to further speedup the cuda kernel --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-20 13:46:38 +03:00
Johannes Gäßler	05bdfd4380	CUDA: fix compilation on CC 6.0 (llama/16091)	2025-09-20 13:46:38 +03:00
Georgi Gerganov	960aaa9904	metal : use function constants for mul_mv_ext kernels (llama/16074) * metal : use function constants for mul_mv_ext kernels ggml-ci * metal : remove NW template argument ggml-ci * metal : adjust constants ggml-ci	2025-09-20 13:46:38 +03:00
Sigbjørn Skjæret	225d7c1d5a	cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (llama/16060)	2025-09-20 13:46:38 +03:00
Georgi Gerganov	d37f590a77	metal : improve F32, F16 and BF16 mat-vec multiplication (llama/16057) * metal : improve F32, F16 and BF16 mat-vec multiplication ggml-ci * metal : make the NSG a function constant in mul_mv kernels ggml-ci	2025-09-20 13:46:38 +03:00
Jhen-Jie Hong	32b6d9c134	metal : avoid call free for non-owned buffer (llama/16067)	2025-09-20 13:46:38 +03:00
Georgi Gerganov	1f24b1df4d	metal : handle nil cv during pipeline creation (llama/16065) ggml-ci	2025-09-20 13:46:38 +03:00
Chenguang Li	c46adc0817	CANN: Remove print (llama/16044) Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:46:38 +03:00
Reese Levine	1361f679cc	GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (llama/16018) * Add parameter buffer pool, batching of submissions, refactor command building/submission * Add header for linux builds * Free staged parameter buffers at once * Format with clang-format * Fix thread-safe implementation * Use device implicit synchronization * Update workflow to use custom release * Remove testing branch workflow * some f32 tests passing * Disable set_rows until it's implemented * f32 add all tests passing * Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments * Add templated addition, clean up code * Get addition and multiplication working * Implement rms_norm * Add get_rows implementation * Add new get_rows files * Refactor use of wg size entry * Fix compilation * Try manually unrolled q4_0 quant * Revert "Try manually unrolled q4_0 quant" This reverts commit 77f8b96515f7e640ae4b0e44f066321fbc4a6166. * Move to constant max wg size * Check for tensor size in supports_op * Vectorize f32 and change default workgroup size * Move f32 get_rows from < 4 to % 4 != 0 * fix linter errors * Add in-place tests --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>	2025-09-20 13:46:37 +03:00
Georgi Gerganov	eb2c01f92e	metal : refactor + optimize v2 (llama/15995)	2025-09-20 13:46:10 +03:00
Johannes Gäßler	d452f0cf8c	CUDA: fix FA occupancy, optimize tile kernel (llama/15982)	2025-09-20 13:45:30 +03:00
Eve	e96b285011	vulkan: automatically remove unsupported devices (llama/15976) * remove unsupported vulkan devices * make this happen during selection instead * pass by reference	2025-09-20 13:45:30 +03:00
Chenguang Li	e32c3b0fd3	CANN: Optimize ggml_cann_set_device (llama/15935) * CANN: Fix ggml_cann_set_device to avoid redundant device switches - Added a check to skip aclrtSetDevice if the current device is already set. - Prevents unnecessary context switches while keeping thread/device consistency. * CANN: add device default id	2025-09-20 13:45:30 +03:00
Daniel Bevenius	5c524bb879	ggml : fix padding in timestep embedding kernels (llama/15932) * ggml : remove adding extra dim timestep embedding This commit updates the ggml_timestep_embedding function to no longer add an extra dimension when the specified dimension is odd. The motivation for this change is that this introduces an unnecessary dimension when the dimension is odd, which caused an issue in the kernels which were not expecting this extra dimension and it resulted in uninitialized memory for the second to last dimension. * ggml-cuda : fix padding in timestep embedding kernel This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension. * ggml-metal : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel * ggml-opencl : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-sycl : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-vulkan : fix padding in timestep embedding kernel This commit fixes the zero padding for odd dimensions in the timestep embedding kernel. * ggml-cpu : fix padding in timestep embedding function This commit removes the zeroing out of the last dimension now that we are not adding the extra padding dimension.	2025-09-20 13:45:30 +03:00
Jake Karnes	f72ec185fb	CUDA: fix im2col_3d to respect non-contiguous inputs (views) (llama/15956) * fix im2col_3d to respect non-contiguous inputs (views) The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides. This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged. * use ggml_element_size() for src strides Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-20 13:45:30 +03:00
yael-works	a642b533a4	SYCL: Add COUNT_EQUAL operator support (llama/15991) * SYCL: Add COUNT_EQUAL operator support (rebased on master) * SYCL: remove duplicate op_count_equal definition * tests: remove test_count_equal_typed and use test_count_equal for all cases * tests: keep only I32 case for COUNT_EQUAL as suggested * tests: keep only I32 case for COUNT_EQUAL as requested	2025-09-20 13:45:30 +03:00
Aman Gupta	10bd5d3626	CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (llama/15926)	2025-09-20 13:45:30 +03:00
Georgi Gerganov	82a8c141ea	metal : remove memory pools (llama/15966) * metal : remove mem pool usage ggml-ci * metal : remove mem pool implementation ggml-ci * metal : take into account the actual allocated memory of the tensor ggml-ci * cont : use ggml_backend_buft_get_alloc_size ggml-ci * cont : improve, comments ggml-ci * cont : add functions for the extra tensor sizes * metal : add comments ggml-ci * metal : implement .get_alloc_size for the rest of the buffer types ggml-ci * metal : remove ggml_metal_heap ggml-ci	2025-09-20 13:45:29 +03:00
Ruben Ortlam	c36358cb3c	Vulkan: Clean up mul_mm shader (llama/15987) * vulkan: move mul_mm dequantization steps into a separate file and functions * improve mul_mm vector load code * fix debug mode issues and warnings	2025-09-20 13:45:29 +03:00
Georgi Gerganov	2d3f15607f	metal : fix kernel requirements (llama/15983) * metal : fix kernel requirements ggml-ci * cont : fix supports_op * cont : fix supports_op for ARGMAX	2025-09-20 13:45:29 +03:00
Aaron Teo	7dca05ca77	ggml-zdnn: rm user mapped buffers (llama/15965) * ggml-zdnn: rm user mapped buffers Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rm dead code Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: attempt to fix missing extra data buffer free Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-20 13:45:29 +03:00
Jeff Bolz	1789ed3f2c	vulkan: fix failing dequant shaders (llama/15862) * vulkan: fix failing dequant shaders * add missing const	2025-09-20 13:45:29 +03:00
Jeff Bolz	a3defb0a3b	vulkan: initialize vulkan-hpp to allow using extension function pointers (llama/15705) Use this to query register count for shader compiles on NVIDIA. Currently this is only for performance debug, but it could eventually be used in some heuristics like split_k.	2025-09-20 13:45:29 +03:00
Georgi Gerganov	2caf15d68a	metal : refactor kernel loading (llama/15964) * metal : refactor bin kernels loading ggml-ci * metal : refactor rms kernel loading ggml-ci * ci : try to add memory leaks check ggml-ci * ci : try to enable memory leak detection for Mac * cont : seems to be working	2025-09-20 13:45:29 +03:00
Georgi Gerganov	0d36ba9e1a	metal : allow ops to run concurrently (llama/15929) * metal : run graphs ops concurrently ggml-ci * cont : add flags for debugging and disabling concurrency ggml-ci * cont : refactor and handle fusing ggml-ci * cont : simplify - no need to use GPU address ggml-ci * cont : prepare mem ranges for reuse + add ggml-metal-common.cpp ggml-ci * cont : avoid redundant keywords in cpp [no ci] * metal : reorder graph for better concurrency ggml-ci * metal : fix race on mem pool buffers ggml-ci * cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE ggml-ci * cont : refactor, optimize, add comments ggml-ci * cont : refactor ggml-metal.m ggml-ci * minor : update logs [no ci]	2025-09-20 13:45:29 +03:00
Georgi Gerganov	20a930ec94	metal : fix memory leaks (llama/15962) ggml-ci	2025-09-20 13:45:28 +03:00
Aaron Teo	e902731ccc	ggml-zdnn: fix #15414 , activate FP16 and BF16 acceleration and incorrect zTensor free (llama/15839)	2025-09-20 13:45:28 +03:00
Ruben Ortlam	424c85f22a	Vulkan iGPU device selection overhaul and PCI ID API support (llama/15947) * vulkan: implement ggml igpu device type, implement pci id support * fix compiler warning * prevent printf overflow warning	2025-09-20 13:45:28 +03:00
Mathieu Baudier	5a752bab84	vulkan: Make device memory check more portable (llama/15939)	2025-09-20 13:45:28 +03:00
Neo Zhang Jianyu	cd764eaf2b	Revert "sycl: add usage of enqueue_functions extension (llama/14244)" (llama/15910) * Revert "sycl: add usage of enqueue_functions extension (#14244)" This reverts commit 8308f98c7fb778e54bf75538f5234d8bd20915e9. * fix missed revert code, format the code	2025-09-20 13:45:28 +03:00
Diego Devesa	555dcb3e01	ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (llama/15797) * ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type ggml-backend : add device id to device props llama : only use iGPU devices if there are no GPU devices llama : do not use multiple devices from different backends with the same device id	2025-09-20 13:45:28 +03:00
Johannes Gäßler	f0768eb575	CUDA: larger SRAM reads for tile FA, AMD FP16 dot (llama/15927) * CUDA: larger SRAM reads for tile FA, AMD FP16 dot * fix logic for availability of v_dot2_f32_f16	2025-09-20 13:45:28 +03:00
Daniel Bevenius	020eb19eb3	ggml-cpu : add check for ARM MATMUL_INT8/i8mm support (llama/15922) This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when the target hardware actually supports them. The motivation for this is to fix ggml CI build failures where the feature detection correctly identifies that i8mm is not supported, adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are still enabled, causing the compiler to attempt to use vmmlaq_s32 intrinsics without i8mm support. Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499	2025-09-20 13:45:28 +03:00
Charles Xu	b079d9c8b0	kleidiai: fix GGML_ASSERT(cur_backend_id != -1) failed (llama/15614) kleidiai: fix GGML_ASSERT(cur_backend_id != -1) failed removes the Whisper-specific check for GET_ROWS support	2025-09-20 13:45:27 +03:00
hipudding	dadf73665a	CANN: Disable acl_graph for prefill stage (llama/15933) Since the prefill length is not fixed, graphs constructed for the prefill stage cannot be reused. For this reason, ACL graph execution is disabled by default during prefill.	2025-09-20 13:45:27 +03:00
Oliver Simons	f5ef0e25e2	CUDA: Add `fastdiv` to `k_bin_bcast`, giving 1-3% E2E performance (llama/15872) Add fastdiv and fastmodulo to k_bin_bcast kernel * Address review comments * `prod_` instead of `prod` suffix * Add test case for `k_bin_bcast_unravel` in CUDA backend	2025-09-20 13:45:27 +03:00
Daniel Bevenius	3617008c37	ggml-cpu : fix padding in ggml_timestep_embedding (llama/15917) This commit fixes the zero padding for odd dimensions in ggml_compute_forward_timestep_embedding_f32. The motivation for this is that currently if an odd dimension is used, the padding check incorrectly uses the dimension value for indexing. For example, with dim=15: Elements 0-6 are set to cosine values Elements 7-13 are set to sine values Element 14 is left uninitialized (contains garbage) Element 15 is correctly set to zero This fix changes embed_data[dim] to embed_data[2 * half] so that element 14 (the first unused element) is properly set to zero as well as the last element. Resolves: https://github.com/ggml-org/ggml/issues/1324	2025-09-20 13:45:27 +03:00
Georgi Gerganov	7eae055e61	metal : make the backend async (llama/15906)	2025-09-20 13:44:27 +03:00
Chenguang Li	4d453b14a9	CANN: Add ROPE sin/cos cache for reuse (llama/15912) * CANN: Add ROPE sin/cos cache for reuse Introduce sin/cos caching mechanism in ROPE to avoid redundant computation across layers. The cache is built on the first layer per device and reused by subsequent layers if parameters match. - Added sin_cache / cos_cache pointers and position_length tracking - Introduced cache validity flags and properties: (ext_factor, theta_scale, freq_scale, attn_factor, is_neox) - Accelerates ROPE by eliminating repeated sin/cos generation This change reduces overhead in multi-layer scenarios while preserving correctness by verifying parameter consistency. Co-authored-by: hipudding <huafengchun@gmail.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-09-20 13:42:53 +03:00
Chenguang Li	9b773acac0	CANN: implement LRU cache for ACL graphs (llama/15814) * CANN: implement LRU cache for ACL graphs in CANN backend - Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects. - Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded. - Updated push, move_to_front, and clear methods to manage cached graphs efficiently. - Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend. * fix typo * The LRU cache capacity can be configured via an env variable Signed-off-by: noemotiovon <757486878@qq.com> * refactor acl graph * refactor && fix review comments Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:42:53 +03:00
Ruben Ortlam	7abe187860	vulkan: throw the oom error instead of no memory type found (llama/15905)	2025-09-20 13:42:53 +03:00
Jeff Bolz	d0e98656c3	vulkan: Fix OOB accesses in soft_max_back (llama/15861)	2025-09-20 13:42:52 +03:00
Johannes Gäßler	e35d1375ee	HIP: use v_dot2_f32_f16 instruction for FA (llama/15884)	2025-09-20 13:42:52 +03:00
lksj92hs	7fbbb67b47	Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (llama/15886)	2025-09-20 13:42:52 +03:00
Aman Gupta	621764b1a5	CUDA: Add mul_mat_id support for the mmf kernel (llama/15767) * CUDA: Add mul_mat_id support for the mmf Add support for mul_mat_id for bs < 16 * Review: use warp_size, fix should_use_mmf condition * Launch one block per expert, stride along n_expert_used * templatize mul_mat_id * Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids * Reduce compile times by dividing mmf into f16, bf16 and f32 variants * Divide mmf by ncols_dst * Add missing files * Fix MUSA/HIP builds	2025-09-20 13:42:52 +03:00
Johannes Gäßler	260982232c	CUDA: fix GET_ROWS for large tensors (llama/15882)	2025-09-20 13:42:52 +03:00
Jeff Bolz	c29cd54818	vulkan: sort graph to allow more parallel execution (llama/15850) * vulkan: sort graph to allow more parallel execution Add a backend proc to allow the backend to modify the graph. The vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes, doesn't change the contents of any of them. With #15489, this reduces the number of synchronizations needed. * call optimize_graph per-split	2025-09-20 13:42:52 +03:00
Aman Gupta	70ee808f3d	CUDA: generate_cu_files.py - add missing mxfp4 (llama/15880)	2025-09-20 13:42:52 +03:00
Georgi Gerganov	ae6cc6a386	cuda : fix supports_op condition for get_rows when number of blocks is too large (llama/15868) * cuda : fix supports_op condition for get_rows when src1->ne2 > 1 ggml-ci * ggml : add comment about ggml_get_rows ggml-ci * cuda : add FIXME [no ci] * cuda : update support condition ggml-ci	2025-09-20 13:42:52 +03:00
Georgi Gerganov	e9cb59e970	metal : refactor + optimize (llama/15857)	2025-09-20 13:42:51 +03:00
Xuan-Son Nguyen	40bcd1a469	ggml: allow casting between f32 and i32 (llama/15783) * ggml: allow casting between f32 and i32 * fix cuda * add vulkan * fix CPU non-cont * add non-cont test case * add note * extend test number range * correct note * add cont version for vulkan	2025-09-20 13:42:51 +03:00
Sigbjørn Skjæret	0175a1df8d	CUDA: non-contiguous src0 not supported for PAD (llama/15869)	2025-09-20 13:42:51 +03:00
Chenguang Li	d9c0ead2ab	CANN: Stream sync between devices for acl_graph (llama/15809) * CANN: Switch to stream synchronization Switch to stream synchronization because events are not effective. Co-authored-by: hipudding <huafengchun@gmail.com> * CANN: add Comments --------- Co-authored-by: hipudding <huafengchun@gmail.com>	2025-09-20 13:42:51 +03:00
Jeff Bolz	dfa7722e2e	vulkan: support im2col_3d (llama/15795)	2025-09-20 13:42:51 +03:00
Aaron Teo	db4f504b69	ggml-cpu: clean up s390x SIMD (llama/15855) * ggml-cpu: clean up s390x simd Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 0da4b6aa07d96b758812d17b2c82267632fa4ba5) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix hsum data types Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-20 13:42:51 +03:00
Jeff Bolz	9523fd8de6	vulkan: Support pad_ext (llama/15794)	2025-09-20 13:42:51 +03:00
Jeff Bolz	647e2d7de5	vulkan: Use larger loads in scalar/coopmat1 matmul (llama/15729) I think glslang will translate an access like x[i][1].z to OpAccessChain ... x, i, 1, 2 OpLoad float16_t ... rather than loading all of x[i] in a single OpLoad. Change the code to explicitly load the vector/matrix.	2025-09-20 13:42:51 +03:00
Daniel Bevenius	cda7d4e5ac	ggml WebGPU: remove userdata from request adapter callback (llama/15527) * ggml WebGPU: remove userdata from request adapter callback This commit removes the `userdata` parameter from the WebGPU request adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function captures the `webgpu_context` directly. The motivation for this change is to simplify the code and improve readability. * inline the callback lambda into the RequestAdapter call This commit removes the callback lambda variable and inlines it directly into the RequestAdapter call.	2025-09-20 13:42:50 +03:00
Johannes Gäßler	cd70d89628	CUDA: faster tile FA (Pascal/AMD), headsize 256 (llama/15769)	2025-09-20 13:42:50 +03:00
Charles Xu	be2676bb1c	kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16 (llama/15817)	2025-09-20 13:42:50 +03:00
Johannes Gäßler	69400f16f1	ggml-cpu: document use of "free" memory [no ci] (llama/15834)	2025-09-20 13:42:50 +03:00
Aaron Teo	f499271c4e	ggml-cpu: drop support for nnpa intrinsics (llama/15821)	2025-09-20 13:42:50 +03:00
Johannes Gäßler	6ff468cfaa	CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (llama/15802) * CUDA: fastdiv, launch bounds for mmvq + q8_1 quant	2025-09-20 13:42:50 +03:00
Daniel Bevenius	4d6e1144b1	ggml : introduce semantic versioning (ggml/1336) * ggml : introduce semantic versioning This commit introduces semantic versioning for the GGML library. The motivation for this is that the current versioning, using build numbers, makes it difficult to track changes and releases for projects that use ggml. The release steps are the following: 1. Sync the changes from llama.cpp using sync-llama-am.sh and after the PR has been approved and merged move to step 2. 2. Run scripts/release.sh and specify the type of release, major, minor, or patch. This script will handle incrementing the version (major|minor|patch), create a new commit with the version change, create a tag for the version, and prepare for the next development iteration. 3. Inspect the commits/tag and push to master. This will trigger the github release workflow which is triggered for new tags which will then publish a new release on github. Example usage: ```console $ ./scripts/release.sh major --dry-run [dry-run] - No changes will be made Step 1: Reading current version... Current version: 0.9.0-dev New release version: 1.0.0 Step 2: Updating version in CMakeLists.txt... [dry-run] Would update GGML_VERSION_MAJOR to 1 [dry-run] Would update GGML_VERSION_MINOR to 0 [dry-run] Would update GGML_VERSION_PATCH to 0 [dry-run] Would remove -dev suffix Step 3: Committing version bump... [dry-run] Would commit: 'ggml : bump version to 1.0.0' Step 4: Creating git tag... [dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0' Step 5: Preparing for next development cycle... [dry-run] Would update GGML_VERSION_MINOR to 1 [dry-run] Would add -dev suffix back Step 6: Committing development version... [dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev' [dry-run] Summary (no changes were made): • Would have released version: 1.0.0 • Would have created tag: v1.0.0 • Would have set next development version: 1.1.0-dev ``` Refs: https://github.com/ggml-org/ggml/issues/1333 * ggml: create branch for release candidate and check master * ggml : sign the git tag	2025-09-20 13:42:50 +03:00
Gregor Jasny	c80f78cc7b	CUDA : conditionally add cuda architectures (ggml/1341)	2025-09-20 13:42:50 +03:00
Gabe Goodhart	ffe560cbb1	metal : Add template specialization for mul_mm_id w/ ne20 == 10 (llama/15799) Branch: GGMLMetalNE20 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-09-20 13:42:49 +03:00
Chenguang Li	3780a3c917	CANN: Refactor ND to NZ workspace to be per-device (llama/15763) * CANN:Refactor ND to NZ workspace to be per-device in Ascend backend - Replaced the previous single global ND→NZ workspace with a per-device cache using unordered_map keyed by device ID. - Functions `release_nz_workspace`, `relloc_nz_workspace`, and `get_nz_workspace` now manage workspace independently for each device, preventing memory conflicts in multi-device / pipeline parallel scenarios. - This change fixes potential precision issues caused by workspace overwrites when multiple devices perform ND→NZ conversions concurrently. Co-authored-by: hipudding <huafengchun@gmail.com> * refactor Signed-off-by: noemotiovon <757486878@qq.com> * rename Signed-off-by: noemotiovon <757486878@qq.com> * fix review comments Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-09-20 13:42:49 +03:00
leejet	2228462b19	ggml: add ops for WAN video model (cuda && cpu) (llama/15669) * add conv3d support * add ggml_pad_ext for cpu & cuda backend * cuda/cpu: add im2col_3d support * cuda: make im2col a little faster * fix cuda pad/scale/im2col3d * make im2col_3d faster * gguf: support loading tensors which n_dims > GGML_MAX_DIMS * fix cuda get_rows * avoid ggml_conv_3d conflict * correct GGML_OP_COUNT assertion * avoid build failure * avoid build failure on MacOS * cuda: remove unnecessary MIN define * fix cpu im2col_3d * adjust the code style * cuda: use simpler loop in get_rows * add test_im2col_3d to test-backend-ops * test-backend-ops.cpp: remove trailing whitespace * cpu: im2col_3d support non continuous src Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * fix test_im2col_3d * remove unused variables * cuda: get_rows: dfloat2 -> float2 * add test_pad_ext to test-backend-ops.cpp * add gguf_init_from_file_ext impl * Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS" This reverts commit d8377a0a37f314bd3713fe043b4333ad661610c1. * Revert "add gguf_init_from_file_ext impl" This reverts commit d9f1d13208c68ef83b3538201ac7f31614fb1994. * update ggml_backend_vk_device_supports_op * fix ggml_backend_vk_device_supports_op * update other backend supports op for ggml_pad_ext * metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-20 13:42:49 +03:00
hipudding	96efb472b4	CANN: Fix precision issue on 310I DUO multi-devices (llama/15784)	2025-09-20 13:42:49 +03:00
rmatif	1569daf524	opencl: add hs=40 to FA (llama/15758)	2025-09-20 13:42:49 +03:00
Chenguang Li	5c860e94c6	CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (llama/15760) Fixes #15330 Adjust the allocation size of acl_rstd. The parameter `dims` is set to 3 according to the CANN documentation. Co-authored-by: Yuchuan <yuchuan-cao@users.noreply.github.com>	2025-09-20 13:42:49 +03:00
Ruben Ortlam	719a05c665	vulkan: fix mmv subgroup16 selection (llama/15775)	2025-09-20 13:42:49 +03:00
Jeff Bolz	4a702a867c	vulkan: don't use std::string in load_shaders, to improve compile time (llama/15724) * vulkan: don't use std::string in load_shaders, to improve compile time * keep the string version for those calls that use it	2025-09-20 13:42:49 +03:00
Daniel Bevenius	4144ae10e9	vulkan : update ggml_vk_instance_validation_ext_available (llama/15666) * vulkan : update ggml_vk_instance_validation_ext_available This commit updates ggml_vk_instance_validation_ext_available() to check for VK_EXT_validation_features instead of VK_KHR_portability_enumeration. Based on how the returned boolean is used later in the code (to enable both the validation layer and the VK_EXT_validation_features extension), it appears the function may have been intended to check for the validation layer features extension. * remove try/catch This was a leftover from a previous iteration where I was explicitly querying for a specific validation layer first, which would throw. * update warning message about validation layers	2025-09-20 13:42:48 +03:00
Shin-myoung-serp	85c7aa3750	ggml vulkan: add hardsigmoid and hardswish operations (llama/15762)	2025-09-20 13:42:48 +03:00
Oliver Simons	9eef377330	CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E (llama/15715) * Add fastdiv, use it in modulo and use modulo in rms_norm_f32 Fastdiv is a much faster way to do integer division, which was identified as a bottleneck in rms_norm_f32 * Support more `block_size` values in `rms_norm_f32` This makes us more flexible in selecting the optimal threads w.r.t. parallelizing across a col vs. launch-overheads of threads and mio throttles * Update ggml/src/ggml-cuda/common.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Replace modulo with fastmodulo in `rms_norm_f32` * Use `BinPackArguments=true` for formatting function calls Will file a separate PR to adjust .clang-format file * Update ggml/src/ggml-cuda/common.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Use uint3 for both `fastdiv` and `fastmodulo` The compiler seems to reliably optimize away the unused .z component in the fastdiv use-case, see https://godbolt.org/z/rx8KPrKr3 * More constrained type declarations Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Rename fastdiv and fastmodulo variables to shared variable name As suggested by JohannesGaessler, this increases clarity of the intended use * Pack fastdiv/fastmodulo constants into uint2/uint3 objects By packing constants to be used together into a struct, we are less likely to make errors. * Rename function parameter of fastmodulo `modulo_consts` is more fitting/descriptive --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-20 13:42:48 +03:00
hipudding	51bc843f3a	CANN: Add RoPE contiguous check for 310I DUO device (llama/15735)	2025-09-20 13:42:48 +03:00
xctan	75f739c7c8	ggml-cpu : optimize RVV kernels (llama/15720) * ggml-cpu : optimize rvv ggml_vec_dot_f32 * ggml-cpu : optimize 128-bit rvv ggml_vec_dot_q4_K_q8_K * ggml-cpu : fix riscv arch flags * ggml-cpu : add more rvv ops * ggml-cpu : optimize rvv ggml_vec_dot_q4_K_q8_K * ggml-cpu : optimize rvv ggml_vec_dot_q6_K_q8_K * ggml-cpu : minor rvv adjustments * ggml-cpu : fix riscv include	2025-09-20 13:42:48 +03:00
hipudding	91e9e72ecd	CANN: Mask unsupported TRANSPOSE_1D operator (llama/15733) CANN currently does not support kernels larger than 255. This change disables such cases.	2025-09-20 13:42:48 +03:00
Chenguang Li	d84b96d9d0	CANN: Fix type float_t to float (llama/15736) Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:42:48 +03:00
Ruben Ortlam	e584edb5ba	vulkan: fix shaders gen when no integer dot is available (llama/15740)	2025-09-20 13:42:48 +03:00
hipudding	5aee53c40f	CANN: Resolve soft_max precision issue (llama/15730) Previously, the slope tensor was set to fp16 to improve efficiency. While this worked correctly in FA, it caused precision issues in soft_max. This change applies different data types for different operators to balance both accuracy and performance.	2025-09-20 13:42:47 +03:00
Jeff Bolz	1e03aa66f7	vulkan: Fix macro parameter order for f32 matmul shaders (llama/15716)	2025-09-20 13:42:47 +03:00
rmatif	fb37f91163	opencl: add attn sinks support for FA kernels (llama/15706)	2025-09-20 13:42:47 +03:00
Chenguang Li	3db49c1c26	CANN: Support eager execution mode under ACL graph compilation (llama/15712) * [CANN] Support eager execution mode under ACL graph compilation Add support for running operators in eager mode while ACL graph compilation is enabled. This allows bypassing graph execution and directly submitting ops, which is useful for debugging and reducing graph build overhead in certain scenarios. Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> * rename to acl_graph_mode Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:42:47 +03:00
hipudding	13d3963f71	CANN: Support ext_factor in rope (llama/15710)	2025-09-20 13:42:47 +03:00
Johannes Gäßler	f20a7b0e99	ggml-backend: raise GGML_MAX_SPLIT_INPUTS (llama/15722)	2025-09-20 13:42:47 +03:00
Gilad S	9e3600e569	vulkan: use memory budget extension to read memory usage (llama/15545) * vulkan: use memory budget extension to read memory usage * fix: formatting and names * formatting * fix: detect and cache memory budget extension availability on init * fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available * style: lints	2025-09-20 13:42:47 +03:00
Jeff Bolz	7a5e7368a3	vulkan: add missing clamps in new mul_mat_id paths (llama/15702) This is a missing interaction between #15546 and #15652	2025-09-20 13:42:46 +03:00
Ruben Ortlam	d5f80a2982	vulkan: disable large mmv subgroups on older Nvidia GPUs (llama/15717)	2025-09-20 13:42:46 +03:00
s-goto-11	8218dc609c	ggml: SVE support for exponential functions (llama/15145) * SVE support for exponential functions Add const notation to variable pg * Update ggml/src/ggml-cpu/vec.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Add const --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-09-20 13:42:46 +03:00
Prashant Vithule	31840a3a56	ggml: aarch64: Implement SVE F16 kernels for vector functions (llama/15115) * Added sve implementation for vec_dot_fp16 Kernel * removed white spaces * Added comment * removed white spaces * changed GGML_F16x_VEC_FMA for code consistency * Update vec.h --------- Co-authored-by: vithulep <p.m.vithule1517@gmail.com>	2025-09-20 13:42:46 +03:00
Ruben Ortlam	5e70d901b0	Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants (llama/14903) * vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants * vulkan: use subgroup operations for quantize_q8_1 shader * vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader * vulkan: use q8_1_x4 blocks in mul_mmq shader * vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec * vulkan: tune mul_mat_vecq performance for Intel * vulkan: fix quantizing issue when tensor is not divisible by 128 * vulkan: adapt integer dot mmv to mmv small m optimization (llama/15355) * vulkan: allow all subgroup modes for mmv and mmvq * vulkan: use prealloc intermediate reuse for mmvq path * vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090 * vulkan: adapt mmv quantize_y path to conditional sync logic * vulkan: disable q8_0 mmvq on Nvidia * vulkan: enable q8_0 on Nvidia pre-turing * fix prealloc sync condition * fix llvmpipe subgroup 8 issue	2025-09-20 13:42:46 +03:00
Daniel Bevenius	c5f511e697	ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops (llama/15695) * ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops This commit adds support for the TRANSPOSE and RESHAPE operations in the ggml webgpu backend. Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-20 13:42:46 +03:00
Akarshan Biswas	2ba5e0cb47	CUDA: fix build error from ambiguous __half conversions in conv2d (llama/15690) * CUDA: fix build error from ambiguous __half conversions in conv2d Building conv2d with half precision failed because `__half` defines multiple implicit conversion operators (to float, int, short, etc.), causing ambiguous overload resolution when multiplying with float. Introduce a templated `to_float` helper that explicitly converts `__half` via `__half2float`, while passing through float unchanged. Use this helper in conv2d accumulation to ensure unambiguous and correct promotion to float. Fixes some build errors with half-precision kernels on CUDA. ggml-ci * CUDA: Replace custom to_float helper with unified ggml_cuda_cast and add half->float conversion * CUDA: Add missing convert.cuh header * CUDA: remove unnecessary extension in ggml_cuda_cast * CUDA: Address review comment, remove second type template argument	2025-09-20 13:42:46 +03:00
hipudding	bb5f844ec7	CANN: Optimize MUL_MAT_ID (llama/15658)	2025-09-20 13:42:46 +03:00
hipudding	ed7ebdc757	CANN: fix RoPE cache issue on multi-device (llama/15629) * CANN: fix RoPE cache issue on multi-device RoPE cache only needs to be computed once per token. However, in multi-device scenarios, not every device starts computation from layer 0, which may lead to unallocated memory issues and precision errors. This commit records the first layer of each device to avoid the above issues. * CANN: Optimize first-layer detection method * CANN: Remove trailing whitespace * CANN: Only cache the data that can be determined as unchanged through the parameters. * CANN: Update function comment	2025-09-20 13:42:45 +03:00
Georgi Gerganov	3d470687de	metal : fix checks for available FA kernels (llama/15700) * metal : fix checks for available FA kernels ggml-ci * cont : fix comment [no ci]	2025-09-20 13:42:45 +03:00
Diego Devesa	b11c972b88	llama : separate compute buffer reserve from fattn check (llama/15696) Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.	2025-09-20 13:42:45 +03:00
Jeff Bolz	db7ecfb61d	vulkan: handle large sizes for get_rows (llama/15686)	2025-09-20 13:42:45 +03:00
Jeff Bolz	191def71ce	vulkan: mul_mat_id coopmat2 optimizations (llama/15546) * vulkan: mul_mat_id coopmat2 optimizations Add a path for when the tile fits in BN/2, similar to what we have for mul_mat. Only call fetch_scales/store_scales once per QUANT_K block, and once at the beginning in case start_k is not aligned. * Also add a path for BN/4 - worth a couple more percent	2025-09-20 13:42:45 +03:00
Daniel Bevenius	b092e95aaa	vulkan : remove unused portability_enumeration_ext variable (llama/15679) This commit removes the portability_enumeration_ext variable from the ggml_vk_instance_portability_enumeration_ext_available function as it is initialized to false but never modified, making it redundant.	2025-09-20 13:42:45 +03:00
Jeff Bolz	20ce6fcf6a	vulkan: Allow fallback to sysmem memory when vidmem is full (llama/15649) * vulkan: Allow fallback to sysmem memory when vidmem is full * vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK	2025-09-20 13:42:45 +03:00
Jeff Bolz	71f0ee70bf	vulkan: clamp matmul and FA results to the max finite value (llama/15652) * vulkan: clamp matmul and FA results to the max finite value * only clamp for fp16	2025-09-20 13:42:45 +03:00
Charles Xu	74583845b6	ggml: update kleidiai to v1.13.0 (llama/15663)	2025-09-20 13:42:44 +03:00
Johannes Gäßler	f6ba3949b6	llama: use FA + max. GPU layers by default (llama/15434) * llama: use max. GPU layers by default, auto -fa * ggml-backend: abort instead of segfault	2025-09-20 13:42:44 +03:00
Johannes Gäßler	b7809c401b	CUDA: use FP32 arithmetic for conv2d (llama/15683)	2025-09-20 13:42:44 +03:00
Jeff Bolz	a6dec4f49d	vulkan: Skip syncing for prealloc_y when it is reused (llama/15544)	2025-09-20 13:42:44 +03:00
Chenguang Li	d629af157e	CANN: Fix compiler warnings (llama/15661) Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:42:44 +03:00
Aman Gupta	82ce91e7d2	CUDA: fix bug in rms_norm fusion (llama/15660) * CUDA: fix bug in rms_norm fusion * Fix bug for OP_REPEAT * Fix index for add	2025-09-20 13:42:44 +03:00
Aman Gupta	6d7ddaf793	CUDA: fuse adds, fuse add with rms norm (llama/15631) * CUDA: fused add with rms_norm_mul * Non-broadcast fuse works * Add fused adds * format * Remove n_fuse from template params * Address review comments * Move template inside binbcast	2025-09-20 13:42:44 +03:00
mnehete32	dc9f55bbb0	CUDA: add conv2d (llama/15635) * CUDA: add conv2d * CUDA: conv2d - correct formatting and added const	2025-09-20 13:42:44 +03:00
Aaron Teo	6287027a2c	ggml-cpu: fix invalid hsum build in debug s390x (llama/15634) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-20 13:42:43 +03:00
compilade	6dffbaa0cb	ggml : fix SSM_SCAN for n_groups > 1 (llama/15625)	2025-09-20 13:42:43 +03:00
Georgi Gerganov	cac6253744	kv-cache : remove LLAMA_SET_ROWS checks (llama/15505) ggml-ci	2025-09-20 13:42:43 +03:00
matiaslin	88c0582b61	cuda: Add cublasLt_static linking when GGML_STATIC is enabled (llama/15622) Prior to this change, we faced undefined cublasLt references when attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux. We add linking with CUDA::cublasLt_static when CUDA version is greater than 10.1.	2025-09-20 13:42:43 +03:00
uvos	65fa2c0c1a	HIP: Enable support for ggml_backend_cuda_register_host_buffer (llama/15615)	2025-09-20 13:42:43 +03:00
Chenguang Li	02e8b23137	CANN: refactor mask handling and improve performance in FA (llama/15561) * CANN(flash-attn): refactor mask handling and improve performance 1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode. 2. Optimized performance in non-alibi scenarios by reducing one repeat operation. 3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16. Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: fix review Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: Optimization FA BNSD to BSND Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:42:43 +03:00
xctan	ece1bdfe7e	ggml-cpu : add basic RVV support for vector f32 ops (llama/15057) * ggml-cpu : add basic RVV support for vector f32 ops * ggml-cpu : add RVV support for f32 softmax	2025-09-20 13:42:43 +03:00
rmatif	a6ec224efa	OpenCL: add fused group_norm/norm, mul, add (llama/15314) * add fused group_norm/norm, mul, add * fix spacing * revert rms_norm logic * fix trailing whitespace	2025-09-20 13:42:43 +03:00
Akarshan Biswas	94fa9f63b3	SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (llama/15592) The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp. This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.	2025-09-20 13:42:42 +03:00
shalinib-ibm	31c7784e09	llamafile: PowerPC Sgemm Optimization (llama/15558) This patch improves GEMM for the FP32 data type on PowerPC. Implements GEMM on large blocks with configurable block sizes mc, nc, kc (default: 256, 256, 256). Packing function optimized to access blocks as per memory layout. GEMM optimized to work on larger blocks. Isolated packing from GEMM operations for better MMA utilization. Verified functionality and correctness using llama-cli and a standalone test case (performs matmul and compares the final matrix C result with base). Minor code refactoring changes: replace macro with inline function; code indent made consistent with 4 spaces. Performance testing: observed 50% ~ 70% improvement in prompt processing speed measured using llama-bench with the Meta-Llama3-8B FP32 model. Similar gains observed with the Mistral-7b-Instruct-v0.3 model. llama-bench results (model: llama 8B all F32; size: 29.92 GiB; params: 8.03 B; backend: CPU; threads: 20), listed as test: patch vs. base: pp512: 98.58 vs. 60.3; pp1024: 95.88 vs. 57.36; pp2048: 85.46 vs. 53.26; pp4096: 68.66 vs. 45.78; pp6144: 57.35 vs. 40.44. 25 ~ 30% improvement in llama-batched-bench with Meta-Llama3-8B in prompt processing speed for large prompts (256, 512, 1024, 2048, 4096 tokens) with various batch sizes (1, 2, 4, 8, 16). Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-09-20 13:42:42 +03:00
Johannes Gäßler	53010199a1	CUDA: return -1 for nonexistent compiled arch (llama/15587)	2025-09-20 13:42:42 +03:00
Georgi Gerganov	1c21a850be	metal : optimize FA vec for large sequences and BS <= 8 (llama/15566) * metal : optimize FA vec for large heads and sequences * metal : adjust small-batch mul mv kernels ggml-ci * batched-bench : fix total speed computation ggml-ci * cont : add comments ggml-ci	2025-09-20 13:42:42 +03:00
Georgi Gerganov	dc693ca8c9	metal : improve `MUL_MAT_ID` (llama/15541) * metal : mul_mm_id remove hdst * metal : remove mul_mm_id hsrc1 * metal : mul_mm_id simplify + add test * metal : opt mul_mm_id map0 * metal : optimize mul_mm_id id gathering * metal : mul/div opt * metal : optimize mul_mm_id_map0 ggml-ci	2025-09-20 13:42:42 +03:00
Sigbjørn Skjæret	3bb52acb46	metal : remove contiguous assertion for src0 in IM2COL (llama/15577) * remove contiguous assertion for src0 in IM2COL * add contiguous check in supports_op	2025-09-20 13:42:42 +03:00
Yoshi_likes_e4	9828caafb5	Add a warning for special devices (llama/15563) * Add warning * Print the devices names * Add newlines * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Fix vector names --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-20 13:42:42 +03:00
Jeff Bolz	79e2bd5ea8	vulkan: Remove splitting for mul_mat_id (llama/15568) row_ids only needs to hold the BN rows for the current tile.	2025-09-20 13:42:42 +03:00
Qeeweew	2468074e91	CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (llama/15451) * CUDA: optimize get_int_from_table_16 * CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs * revise documentation --------- Co-authored-by: xix <xiapc@outlook.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-20 13:42:41 +03:00
lhez	582ef379ab	opencl: fix support ops condition for `rms_norm` (llama/15560)	2025-09-20 13:42:41 +03:00
Ruben Ortlam	335d2a5405	vulkan: fix min subgroup 16 condition for mmid subgroup optimization (llama/15565)	2025-09-20 13:42:41 +03:00
Ihar Hrachyshka	8851ef5463	metal: fix regression when no metal devices are present (llama/15531)	2025-09-20 13:42:41 +03:00
Johannes Gäßler	1e856b2919	CUDA: MoE helper in device code, better tile sizes (llama/15525) * CUDA: MoE helper in device code, better tile sizes * reduce superfluous CUDA blocks	2025-09-20 13:42:41 +03:00
Georgi Gerganov	54be54f4ce	metal : add FA kernels for HS=40 (llama/15559) ggml-ci	2025-09-20 13:42:41 +03:00
Chenguang Li	86331f74e0	CANN: ROPE cache sin/cos repeat (llama/15501) Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:42:41 +03:00
Ruben Ortlam	ee11ed42a9	vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (llama/15524) * vulkan: use subgroup function for mul_mat_id shader even without coopmat * vulkan: fix compile warnings * vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id * vulkan: disable subgroup mul_mat_id on devices with subgroups < 16	2025-09-20 13:42:41 +03:00
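
Note on commit 9eef377330 above: its message describes replacing integer division and modulo in `rms_norm_f32` with `fastdiv`/`fastmodulo`, using constants precomputed per divisor and packed into a struct. The sketch below shows the general multiply-and-shift technique for division by a runtime-invariant divisor on the host in plain C++. It is not the code from that commit: the names `FastDivConsts` and `make_fastdiv_consts`, the field layout, and the 64-bit intermediates are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical struct: constants precomputed once per divisor d.
struct FastDivConsts {
    uint32_t mp;  // "magic" multiplier
    uint32_t L;   // shift amount, ceil(log2(d))
    uint32_t d;   // original divisor, kept for the modulo
};

// Precompute the constants for a nonzero 32-bit divisor.
FastDivConsts make_fastdiv_consts(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        ++L;  // smallest L with 2^L >= d
    }
    // mp = floor(2^32 * (2^L - d) / d) + 1, which fits in 32 bits.
    const uint64_t mp = ((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1;
    return {static_cast<uint32_t>(mp), L, d};
}

// n / d using one multiply, one add and one shift.
uint32_t fastdiv(uint32_t n, const FastDivConsts& c) {
    // High 32 bits of the 64-bit product n * mp.
    const uint64_t hi = (static_cast<uint64_t>(n) * c.mp) >> 32;
    // The sum is kept in 64 bits so it cannot overflow for large n.
    return static_cast<uint32_t>((hi + n) >> c.L);
}

// n % d derived from the fast quotient.
uint32_t fastmodulo(uint32_t n, const FastDivConsts& c) {
    return n - fastdiv(n, c) * c.d;
}
```

For example, `fastdiv(100, make_fastdiv_consts(7))` evaluates to 14 and `fastmodulo(100, make_fastdiv_consts(7))` to 2. The constants only pay off when the same divisor is reused across many divisions, which is the situation the commit message describes for index arithmetic inside a kernel.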

2159 Commits