whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Johannes Gäßler	f20a7b0e99	ggml-backend: raise GGML_MAX_SPLIT_INPUTS (llama/15722)	2025-09-20 13:42:47 +03:00
Gilad S	9e3600e569	vulkan: use memory budget extension to read memory usage (llama/15545) * vulkan: use memory budget extension to read memory usage * fix: formatting and names * formatting * fix: detect and cache memory budget extension availability on init * fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available * style: lints	2025-09-20 13:42:47 +03:00
Jeff Bolz	7a5e7368a3	vulkan: add missing clamps in new mul_mat_id paths (llama/15702) This is a missing interaction between #15546 and #15652	2025-09-20 13:42:46 +03:00
Ruben Ortlam	d5f80a2982	vulkan: disable large mmv subgroups on older Nvidia GPUs (llama/15717)	2025-09-20 13:42:46 +03:00
s-goto-11	8218dc609c	ggml: SVE support for exponential functions (llama/15145) * SVE support for exponential functions Add const notation to variable pg * Update ggml/src/ggml-cpu/vec.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Add const --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-09-20 13:42:46 +03:00
Prashant Vithule	31840a3a56	ggml: aarch64: Implement SVE F16 kernels for vector functions (llama/15115) * Added sve implementation for vec_dot_fp16 Kernel * removed white spaces * Added comment * removed white spaces * changed GGML_F16x_VEC_FMA for code consistency * Update vec.h --------- Co-authored-by: vithulep <p.m.vithule1517@gmail.com>	2025-09-20 13:42:46 +03:00
Ruben Ortlam	5e70d901b0	Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants (llama/14903) * vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants * vulkan: use subgroup operations for quantize_q8_1 shader * vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader * vulkan: use q8_1_x4 blocks in mul_mmq shader * vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec * vulkan: tune mul_mat_vecq performance for Intel * vulkan: fix quantizing issue when tensor is not divisible by 128 * vulkan: adapt integer dot mmv to mmv small m optimization (llama/15355) * vulkan: allow all subgroup modes for mmv and mmvq * vulkan: use prealloc intermediate reuse for mmvq path * vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090 * vulkan: adapt mmv quantize_y path to conditional sync logic * vulkan: disable q8_0 mmvq on Nvidia * vulkan: enable q8_0 on Nvidia pre-turing * fix prealloc sync condition * fix llvmpipe subgroup 8 issue	2025-09-20 13:42:46 +03:00
Daniel Bevenius	c5f511e697	ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops (llama/15695) * ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops This commit adds support for the TRANSPOSE and RESHAPE operations in the ggml webgpu backend. Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-20 13:42:46 +03:00
Akarshan Biswas	2ba5e0cb47	CUDA: fix build error from ambiguous __half conversions in conv2d (llama/15690) * CUDA: fix build error from ambiguous __half conversions in conv2d Building conv2d with half precision failed because `__half` defines multiple implicit conversion operators (to float, int, short, etc.), causing ambiguous overload resolution when multiplying with float. Introduce a templated `to_float` helper that explicitly converts `__half` via `__half2float`, while passing through float unchanged. Use this helper in conv2d accumulation to ensure unambiguous and correct promotion to float. Fixes some build errors with half-precision kernels on CUDA. ggml-ci * CUDA: Replace custom to_float helper with unified ggml_cuda_cast and add half‑>float conversion * CUDA: Add missing convert.cuh header * CUDA: remove unnecessary extension in ggml_cuda_cast * CUDA: Address review comment, remove second type template argument	2025-09-20 13:42:46 +03:00
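The `to_float` helper described in the first revision of the commit above can be sketched roughly as follows. This is an illustration only: `half_like` and `half_like_to_float` are hypothetical stand-ins for CUDA's `__half` and `__half2float` so that the sketch compiles without the CUDA toolkit, and `dot` is an invented caller. Per the later bullets, the merged code uses `ggml_cuda_cast` instead of a custom helper.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for __half: several implicit conversion operators,
// which can make an expression such as `h * 1.0f` ambiguous.
struct half_like {
    float v;
    operator float() const { return v; }
    operator int()   const { return static_cast<int>(v); }
    operator short() const { return static_cast<short>(v); }
};

// Hypothetical stand-in for __half2float.
inline float half_like_to_float(half_like h) { return h.v; }

// Explicit promotion to float: half goes through the dedicated conversion,
// float passes through unchanged.
template <typename T>
inline float to_float(T x) {
    if constexpr (std::is_same_v<T, half_like>) {
        return half_like_to_float(x);
    } else {
        return x;
    }
}

// Accumulate in float without relying on implicit conversions of T.
template <typename T>
float dot(const T * a, const float * b, int n) {
    assert(n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += to_float(a[i]) * b[i];
    }
    return acc;
}
```

The sketch needs C++17 for `if constexpr`; the explicit call removes the overload ambiguity because the compiler never has to pick among the implicit conversions.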
hipudding	bb5f844ec7	CANN: Optimize MUL_MAT_ID (llama/15658)	2025-09-20 13:42:46 +03:00
hipudding	ed7ebdc757	CANN: fix RoPE cache issue on multi-device (llama/15629) * CANN: fix RoPE cache issue on multi-device RoPE cache only needs to be computed once per token. However, in multi-device scenarios, not every device starts computation from layer 0, which may lead to unallocated memory issues and precision errors. This commit records the first layer of each device to avoid the above issues. * CANN: Optimize first-layer detection method * CANN: Remove trailing whitespace * CANN: Only cache the data that can be determined as unchanged through the parameters. * CANN: Update function comment	2025-09-20 13:42:45 +03:00
Georgi Gerganov	3d470687de	metal : fix checks for available FA kernels (llama/15700) * metal : fix checks for available FA kernels ggml-ci * cont : fix comment [no ci]	2025-09-20 13:42:45 +03:00
Diego Devesa	b11c972b88	llama : separate compute buffer reserve from fattn check (llama/15696) Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.	2025-09-20 13:42:45 +03:00
Jeff Bolz	db7ecfb61d	vulkan: handle large sizes for get_rows (llama/15686)	2025-09-20 13:42:45 +03:00
Jeff Bolz	191def71ce	vulkan: mul_mat_id coopmat2 optimizations (llama/15546) * vulkan: mul_mat_id coopmat2 optimizations Add a path for when the tile fits in BN/2, similar to what we have for mul_mat. Only call fetch_scales/store_scales once per QUANT_K block, and once at the beginning in case start_k is not aligned. * Also add a path for BN/4 - worth a couple more percent	2025-09-20 13:42:45 +03:00
Daniel Bevenius	b092e95aaa	vulkan : remove unused portability_enumeration_ext variable (llama/15679) This commit removes the portability_enumeration_ext variable from the ggml_vk_instance_portability_enumeration_ext_available function as it is initialized to false but never modified, making it redundant.	2025-09-20 13:42:45 +03:00
Jeff Bolz	20ce6fcf6a	vulkan: Allow fallback to sysmem memory when vidmem is full (llama/15649) * vulkan: Allow fallback to sysmem memory when vidmem is full * vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK	2025-09-20 13:42:45 +03:00
Jeff Bolz	71f0ee70bf	vulkan: clamp matmul and FA results to the max finite value (llama/15652) * vulkan: clamp matmul and FA results to the max finite value * only clamp for fp16	2025-09-20 13:42:45 +03:00
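A minimal sketch of the clamping idea in the commit above, written as host-side C++ rather than shader code. The function names are hypothetical, and clamping both the positive and negative side is an assumption of this sketch; the commit only states that results are clamped to the max finite value, and only for fp16.

```cpp
#include <algorithm>
#include <cassert>

// Largest finite IEEE 754 binary16 (fp16) value.
constexpr float FP16_MAX_FINITE = 65504.0f;

// Clamp one accumulated result into the finite fp16 range before storing it.
inline float clamp_for_fp16_store(float acc) {
    return std::clamp(acc, -FP16_MAX_FINITE, FP16_MAX_FINITE);
}

// Apply the clamp to a whole result buffer (hypothetical helper).
inline void clamp_results(float * dst, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        dst[i] = clamp_for_fp16_store(dst[i]);
    }
}
```

With this in place an overflowing accumulation is stored as 65504 rather than infinity, so later operations do not propagate `inf` or `NaN` from that element.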
Charles Xu	74583845b6	ggml: update kleidiai to v1.13.0 (llama/15663)	2025-09-20 13:42:44 +03:00
Johannes Gäßler	f6ba3949b6	llama: use FA + max. GPU layers by default (llama/15434) * llama: use max. GPU layers by default, auto -fa * ggml-backend: abort instead of segfault	2025-09-20 13:42:44 +03:00
Johannes Gäßler	b7809c401b	CUDA: use FP32 arithmetic for conv2d (llama/15683)	2025-09-20 13:42:44 +03:00
Jeff Bolz	a6dec4f49d	vulkan: Skip syncing for prealloc_y when it is reused (llama/15544)	2025-09-20 13:42:44 +03:00
Chenguang Li	d629af157e	CANN: Fix compiler warnings (llama/15661) Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:42:44 +03:00
Aman Gupta	82ce91e7d2	CUDA: fix bug in rms_norm fusion (llama/15660) * CUDA: fix bug in rms_norm fusion * Fix bug for OP_REPEAT * Fix index for add	2025-09-20 13:42:44 +03:00
Aman Gupta	6d7ddaf793	CUDA: fuse adds, fuse add with rms norm (llama/15631) * CUDA: fused add with rms_norm_mul * Non-broadcast fuse works * Add fused adds * format * Remove n_fuse from template params * Address review comments * Move template inside binbcast	2025-09-20 13:42:44 +03:00
mnehete32	dc9f55bbb0	CUDA: add conv2d (llama/15635) * CUDA: add conv2d * CUDA: conv2d - correct formatting and added const	2025-09-20 13:42:44 +03:00
Aaron Teo	6287027a2c	ggml-cpu: fix invalid hsum build in debug s390x (llama/15634) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-20 13:42:43 +03:00
compilade	6dffbaa0cb	ggml : fix SSM_SCAN for n_groups > 1 (llama/15625)	2025-09-20 13:42:43 +03:00
Georgi Gerganov	cac6253744	kv-cache : remove LLAMA_SET_ROWS checks (llama/15505) ggml-ci	2025-09-20 13:42:43 +03:00
matiaslin	88c0582b61	cuda: Add cublasLt_static linking when GGML_STATIC is enabled (llama/15622) Prior to this change, we faced undefined cublasLt references when attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux. We add linking with CUDA::cublasLt_static when CUDA version is greater than 10.1.	2025-09-20 13:42:43 +03:00
uvos	65fa2c0c1a	HIP: Enable support for ggml_backend_cuda_register_host_buffer (llama/15615)	2025-09-20 13:42:43 +03:00
Chenguang Li	02e8b23137	CANN: refactor mask handling and improve performance in FA (llama/15561) * CANN(flash-attn): refactor mask handling and improve performance 1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode. 2. Optimized performance in non-alibi scenarios by reducing one repeat operation. 3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16. Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: fix review Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: Optimization FA BNSD to BSND Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:42:43 +03:00
xctan	ece1bdfe7e	ggml-cpu : add basic RVV support for vector f32 ops (llama/15057) * ggml-cpu : add basic RVV support for vector f32 ops * ggml-cpu : add RVV support for f32 softmax	2025-09-20 13:42:43 +03:00
rmatif	a6ec224efa	OpenCL: add fused group_norm/norm, mul, add (llama/15314) * add fused group_norm/norm, mul, add * fix spacing * revert rms_norm logic * fix trailing whitespace	2025-09-20 13:42:43 +03:00
Akarshan Biswas	94fa9f63b3	SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (llama/15592) The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp. This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.	2025-09-20 13:42:42 +03:00
shalinib-ibm	31c7784e09	llamafile: PowerPC Sgemm Optimization (llama/15558) This patch improves GEMM for the FP32 data type on PowerPC. Implements GEMM on large blocks with configurable block size mc, nc, kc (default: 256, 256, 256). Packing function optimized to access blocks as per memory layout. GEMM optimized to work on larger blocks. Isolated packing from GEMM operations for better MMA utilization. Verified functionality and correctness using llama-cli and a standalone test case (performs matmul and compares the final matrix C result with base). Minor code refactoring changes: replace macro with inline function; code indent made consistent with 4 spaces. Performance testing: observed 50% ~ 70% improvement in prompt processing speed measured using llama-bench with the Meta-Llama3-8B FP32 model. Similar gains observed with the Mistral-7b-Instruct-v0.3 model. llama-bench results (model llama 8B all F32, size 29.92 GiB, params 8.03 B, backend CPU, threads 20; test: patch vs base): pp512: 98.58 vs 60.3; pp1024: 95.88 vs 57.36; pp2048: 85.46 vs 53.26; pp4096: 68.66 vs 45.78; pp6144: 57.35 vs 40.44. 25 ~ 30% improvement in llama-batched-bench with Meta-Llama3-8B in prompt processing speed for large prompts (256, 512, 1024, 2048, 4096 tokens) with various batch sizes (1, 2, 4, 8, 16). Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-09-20 13:42:42 +03:00
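The block sizes mc, nc, kc mentioned in the PowerPC Sgemm commit above refer to cache blocking of the GEMM loops. A minimal sketch of that general technique, assuming row-major matrices, is shown below. `gemm_blocked` is a hypothetical name, and the sketch deliberately omits the packing step and the MMA kernels that the actual patch relies on.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C (m x n) += A (m x k) * B (k x n), all row-major.
// The iteration space is cut into blocks of mc rows, nc columns and kc
// inner-dimension elements so that each block stays cache-resident.
void gemm_blocked(const std::vector<float> & A, const std::vector<float> & B,
                  std::vector<float> & C, int m, int n, int k,
                  int mc = 256, int nc = 256, int kc = 256) {
    assert(mc > 0 && nc > 0 && kc > 0);
    assert(A.size() == static_cast<std::size_t>(m) * k);
    assert(B.size() == static_cast<std::size_t>(k) * n);
    assert(C.size() == static_cast<std::size_t>(m) * n);

    for (int i0 = 0; i0 < m; i0 += mc) {
        const int i1 = std::min(i0 + mc, m);
        for (int p0 = 0; p0 < k; p0 += kc) {
            const int p1 = std::min(p0 + kc, k);
            for (int j0 = 0; j0 < n; j0 += nc) {
                const int j1 = std::min(j0 + nc, n);
                // Micro-kernel on one (mc x kc) * (kc x nc) block pair.
                for (int i = i0; i < i1; ++i) {
                    for (int p = p0; p < p1; ++p) {
                        const float a = A[static_cast<std::size_t>(i) * k + p];
                        for (int j = j0; j < j1; ++j) {
                            C[static_cast<std::size_t>(i) * n + j] +=
                                a * B[static_cast<std::size_t>(p) * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Comparing the blocked result against a plain triple loop on small matrices mirrors the standalone correctness check the commit message describes.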
Johannes Gäßler	53010199a1	CUDA: return -1 for nonexistent compiled arch (llama/15587)	2025-09-20 13:42:42 +03:00
Georgi Gerganov	1c21a850be	metal : optimize FA vec for large sequences and BS <= 8 (llama/15566) * metal : optimize FA vec for large heads and sequences * metal : adjust small-batch mul mv kernels ggml-ci * batched-bench : fix total speed computation ggml-ci * cont : add comments ggml-ci	2025-09-20 13:42:42 +03:00
Georgi Gerganov	dc693ca8c9	metal : improve `MUL_MAT_ID` (llama/15541) * metal : mul_mm_id remove hdst * metal : remove mul_mm_id hsrc1 * metal : mul_mm_id simplify + add test * metal : opt mul_mm_id map0 * metal : optimize mul_mm_id id gathering * metal : mul/div opt * metal : optimize mul_mm_id_map0 ggml-ci	2025-09-20 13:42:42 +03:00
Sigbjørn Skjæret	3bb52acb46	metal : remove contiguous assertion for src0 in IM2COL (llama/15577) * remove contiguous assertion for src0 in IM2COL * add contiguous check in supports_op	2025-09-20 13:42:42 +03:00
Yoshi_likes_e4	9828caafb5	Add a warning for special devices (llama/15563) * Add warning * Print the devices names * Add newlines * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Fix vector names --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-20 13:42:42 +03:00
Jeff Bolz	79e2bd5ea8	vulkan: Remove splitting for mul_mat_id (llama/15568) row_ids only needs to hold the BN rows for the current tile.	2025-09-20 13:42:42 +03:00
Qeeweew	2468074e91	CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (llama/15451) * CUDA: optimize get_int_from_table_16 * CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs * revise documentation --------- Co-authored-by: xix <xiapc@outlook.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-20 13:42:41 +03:00
lhez	582ef379ab	opencl: fix support ops condition for `rms_norm` (llama/15560)	2025-09-20 13:42:41 +03:00
Ruben Ortlam	335d2a5405	vulkan: fix min subgroup 16 condition for mmid subgroup optimization (llama/15565)	2025-09-20 13:42:41 +03:00
Ihar Hrachyshka	8851ef5463	metal: fix regression when no metal devices are present (llama/15531)	2025-09-20 13:42:41 +03:00
Johannes Gäßler	1e856b2919	CUDA: MoE helper in device code, better tile sizes (llama/15525) * CUDA: MoE helper in device code, better tile sizes * reduce superfluous CUDA blocks	2025-09-20 13:42:41 +03:00
Georgi Gerganov	54be54f4ce	metal : add FA kernels for HS=40 (llama/15559) ggml-ci	2025-09-20 13:42:41 +03:00
Chenguang Li	86331f74e0	CANN: ROPE cache sin/cos repeat (llama/15501) Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:42:41 +03:00
Ruben Ortlam	ee11ed42a9	vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices (llama/15524) * vulkan: use subgroup function for mul_mat_id shader even without coopmat * vulkan: fix compile warnings * vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id * vulkan: disable subgroup mul_mat_id on devices with subgroups < 16	2025-09-20 13:42:41 +03:00
Jeff Bolz	85d4d2c875	vulkan: Support FA with any multiple of 8 head sizes (llama/15537) The scalar FA shader already handled multiples of 8. The coopmat1 FA shader assumed 16x16x16 and the shared memory allocations need the HSK dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation requires multiples of 16 for N and K, and needs the matrix dimensions padded and loads clamped. Store the FA pipelines in a map, indexed by the pipeline state.	2025-09-20 13:42:40 +03:00
Ruben Ortlam	8c7872d6ed	vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (llama/15526)	2025-09-20 13:42:40 +03:00
Jeff Bolz	27817867cc	vulkan: workaround MoltenVK compile failure in multi_add (llama/15506) * vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <picard12@live.de>	2025-09-20 13:42:40 +03:00
Johannes Gäßler	b0d15e1eb6	CUDA: fix half2 -> half conversion for HIP (llama/15529)	2025-09-20 13:42:40 +03:00
Jeff Bolz	2f6288c33c	vulkan: optimize rms_norm, and allow the work to spread across multiple SMs (llama/15281) * vulkan: optimize rms_norm, and allow the work to spread across multiple SMs There are really two parts to this change: (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations. (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply. The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums. * Change add+rms_norm optimization to write out an array of partial sums rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up. * complete rebase against fused adds - multi_add shader can also compute partial sums * fix validation errors * disable add_rms_fusion for Intel due to possible driver bug * resolve against #15489, sync after clearing partial sums	2025-09-20 13:42:40 +03:00
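The add + rms_norm fusion in the commit above (the deterministic variant that writes an array of partial sums) can be sketched on the CPU as follows. The function names, the `chunk` parameter standing in for a workgroup's share of the vector, and the `eps` handling are assumptions of this sketch, not the shader's actual interface.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fused add: dst = a + b, and one partial sum of squares per chunk.
// Each chunk plays the role of one workgroup writing its own slot.
std::vector<float> add_with_partial_sums(const std::vector<float> & a,
                                         const std::vector<float> & b,
                                         std::size_t chunk,
                                         std::vector<float> & partials) {
    assert(a.size() == b.size() && chunk > 0);
    std::vector<float> dst(a.size());
    partials.assign((a.size() + chunk - 1) / chunk, 0.0f);
    for (std::size_t i = 0; i < a.size(); ++i) {
        dst[i] = a[i] + b[i];
        partials[i / chunk] += dst[i] * dst[i];
    }
    return dst;
}

// rms_norm that reuses the partial sums instead of re-reading the vector.
// Summing the slots in a fixed order keeps the result deterministic.
void rms_norm_from_partials(std::vector<float> & x,
                            const std::vector<float> & partials, float eps) {
    assert(!x.empty());
    float sum = 0.0f;
    for (float p : partials) {
        sum += p;
    }
    const float scale = 1.0f / std::sqrt(sum / static_cast<float>(x.size()) + eps);
    for (float & v : x) {
        v *= scale;  // the normalization is now a per-element multiply
    }
}
```

Because the sum of squares is already available, the normalization step no longer needs a reduction of its own, which is what lets it spread across workgroups.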
Jeff Bolz	d8eb9f7d67	vulkan: Rewrite synchronization to allow some overlap between nodes (llama/15489) Track a list of nodes that need synchronization, and only sync if the new node depends on them (or overwrites them). This allows some overlap which can improve performance, and centralizes a big chunk of the synchronization logic. The remaining synchronization logic involves writes to memory other than the nodes, e.g. for dequantization or split_k. Each of these allocations has a bool indicating whether they were in use and need to be synced. This should be checked before they are written to, and set to true after they are done being consumed.	2025-09-20 13:42:40 +03:00
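The dependency-tracked synchronization described in the commit above can be illustrated with a small sketch. `Node` and `SyncTracker` are hypothetical types that identify tensors by integer id; the real backend tracks graph nodes and additionally handles scratch allocations such as dequantization or split_k buffers, which this sketch leaves out.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A graph node writes `dst` and reads `srcs` (tensor ids, hypothetical).
struct Node {
    int dst;
    std::vector<int> srcs;
};

class SyncTracker {
public:
    // Returns true if a barrier had to be issued before running `node`.
    bool submit(const Node & node) {
        bool need_sync = false;
        for (int pending : unsynced_) {
            const bool reads  = std::find(node.srcs.begin(), node.srcs.end(), pending) != node.srcs.end();
            const bool writes = pending == node.dst;
            if (reads || writes) {
                need_sync = true;
                break;
            }
        }
        if (need_sync) {
            unsynced_.clear();  // a barrier covers every earlier write
        }
        unsynced_.push_back(node.dst);
        return need_sync;
    }

private:
    std::vector<int> unsynced_;  // destinations written since the last barrier
};
```

Independent nodes submitted back to back never trigger a barrier in this model, which is the overlap the commit message refers to.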
Acly	5094171c37	vulkan : support ggml_mean (llama/15393) * vulkan : support ggml_mean * vulkan : support sum, sum_rows and mean with non-contiguous tensors * vulkan : fix subbuffer size not accounting for misalign offset * tests : add backend-op tests for non-contiguous sum_rows * cuda : require contiguous src for SUM_ROWS, MEAN support * sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support * require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader	2025-09-20 13:42:40 +03:00
Jeff Bolz	485c5c3b3b	vulkan: optimize mul_mat_id loading row ids into shared memory (llama/15427) - Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead. - Specialize the code for when the division is by a power of two.	2025-09-20 13:42:40 +03:00
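The second bullet of the commit above, specializing for division by a power of two, relies on a standard identity: for unsigned `x` and `d = 2^s`, `x / d == x >> s` and `x % d == x & (d - 1)`. A minimal sketch under that assumption follows; `FastDiv` and its helpers are hypothetical, and the choice is made at run time here, whereas a shader would more likely be specialized when the pipeline is built.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed divisor description (hypothetical helper type).
struct FastDiv {
    uint32_t d;      // divisor
    uint32_t shift;  // log2(d), valid only when pow2 is true
    bool     pow2;   // true when d is a power of two
};

inline FastDiv make_fast_div(uint32_t d) {
    assert(d != 0);
    FastDiv f{d, 0, (d & (d - 1)) == 0};
    while (f.pow2 && (1u << f.shift) < d) {
        ++f.shift;
    }
    return f;
}

// Shift and mask replace the integer divide on the power-of-two path.
inline uint32_t fast_div(uint32_t x, const FastDiv & f) {
    return f.pow2 ? (x >> f.shift) : (x / f.d);
}

inline uint32_t fast_mod(uint32_t x, const FastDiv & f) {
    return f.pow2 ? (x & (f.d - 1)) : (x % f.d);
}
```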
Reese Levine	bb5d7e2c31	ggml WebGPU: add support for quantization types (llama/15440) * Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments * Work on templating for different types in shaders * Work on shader type generation * Working q4_0 mul_mat and some templating for different types * Add q4_0_f16 matmul and fix device init * Add matmul support for basic quantization types * Add q2_k and q3_k quantization * Add rest of k-quants * Get first i-quant working * Closer to supporting all i-quants * Support rest of i-quants * Cleanup code * Fix python formatting * debug * Bugfix for memset * Add padding to end of buffers on creation * Simplify bit-shifting * Update usage of StringView	2025-09-20 13:42:39 +03:00
rmatif	d7b7498e76	ggml: add `conv3d` op (llama/15182) * add conv3d * bump GGML_OP_COUNT	2025-09-20 13:42:39 +03:00
Yavor Ivanov	18ca4e8f63	cuda : add Pad Reflect 1D support (llama/14659) * Add Pad Reflect 1D CUDA support * Update ggml/src/ggml-cuda/pad_reflect_1d.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-20 13:42:39 +03:00
Aaron Teo	380d3db216	ggml-cpu: Support Q5_0 and Q5_1 on s390x (llama/15486) * ggml-cpu: initial q5_0 impl for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: updated q5_0 code for better performance Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: use optimised hsum for better performance Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: introduce q5_1 simd + refactor q5_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix incorrect return type vec_hsum Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: refactor q5_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: q5_1 update loop unroll to 4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update q5_0 unroll to 4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update build-s390x docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update unused variables q5_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: update the last update date Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-20 13:42:39 +03:00
Chenguang Li	be841c3f6e	CANN: Optimize RMS_NORM using cache (llama/15419) * [CANN] Optimize RMS_NORM using cache Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> * fix review comment Signed-off-by: noemotiovon <757486878@qq.com> * codestyle adjustment Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-20 13:42:39 +03:00
Diego Devesa	554f96f385	sched : fix possible use of wrong ids tensor when offloading moe prompt processing (llama/15488)	2025-09-20 13:42:39 +03:00
Acly	9dd5039968	vulkan : support conv_2d_dw with f16 weights (llama/15392)	2025-09-20 13:42:39 +03:00
Dong Won Kim	7eebd498ff	vulkan: add exp operation (llama/15456) Co-authored-by: aeseulgi <kim2h7903@gmail.com>	2025-09-20 13:42:39 +03:00
Jeff Bolz	04d0f9a066	vulkan: Reuse conversion results in prealloc_y (llama/15410) * vulkan: Reuse conversion results in prealloc_y Cache the pipeline and tensor that were most recently used to fill prealloc_y, and skip the conversion if the current pipeline/tensor match. * don't use shared pointer for prealloc_y_last_pipeline_used	2025-09-20 13:42:38 +03:00
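The caching scheme in the commit above amounts to remembering which (pipeline, tensor) pair last filled the scratch buffer and skipping the conversion when the same pair comes up again. A minimal sketch follows, with the hypothetical type `PreallocYCache` and opaque pointers standing in for the backend's pipeline and tensor handles.

```cpp
#include <cassert>

// Remembers what last filled the shared scratch buffer (hypothetical type).
struct PreallocYCache {
    const void * last_pipeline = nullptr;
    const void * last_tensor   = nullptr;

    // Returns true if the conversion has to run, and records the new pair.
    bool needs_conversion(const void * pipeline, const void * tensor) {
        assert(pipeline != nullptr && tensor != nullptr);
        if (pipeline == last_pipeline && tensor == last_tensor) {
            return false;  // buffer already holds this conversion result
        }
        last_pipeline = pipeline;
        last_tensor   = tensor;
        return true;
    }

    // Must be called whenever something else overwrites the scratch buffer.
    void invalidate() {
        last_pipeline = nullptr;
        last_tensor   = nullptr;
    }
};
```

The `invalidate` hook is an assumption of this sketch: any other writer to the scratch buffer has to reset the cache, otherwise a stale result would be reused.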
Xuan-Son Nguyen	c5874bcf42	ggml : fix condition of im2col on Metal backend (llama/15460)	2025-09-20 13:42:38 +03:00
R0CKSTAR	7c077845fd	musa: add GGML_UNUSED_VARS (llama/15446) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-09-20 13:42:38 +03:00
Diego Devesa	622dec5bf6	sched : copy only the used experts when offloading prompt processing (llama/15346)	2025-09-20 13:42:38 +03:00
Johannes Gäßler	8f0579a33d	CUDA: refactor FA support/selection code (llama/15454)	2025-09-20 13:42:38 +03:00
Johannes Gäßler	316ed78d68	CUDA: replace GGML_CUDA_F16 with CUDA arch checks (llama/15433)	2025-09-20 13:42:38 +03:00
Jeff Bolz	5907ab3e4a	vulkan: shorten pipeline name strings (llama/15431) These detailed strings were causing increased build time on gcc.	2025-09-20 13:42:38 +03:00
R0CKSTAR	0eb2d653bd	musa: fix build warnings (llama/15258) * musa: fix build warnings Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare] Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-09-20 13:42:38 +03:00
lhez	db1d2380a0	opencl: mark `argsort` unsupported if cols exceed workgroup limit (llama/15375)	2025-09-20 13:42:37 +03:00
SHUAI YANG	2572322bac	CANN: optimize rope operator (llama/15335) * optimize rope ops * amendment * delete trailing whitespace * change the variable name	2025-09-20 13:42:37 +03:00
R0CKSTAR	02b49af98d	musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (llama/15413) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-09-20 13:42:37 +03:00
Marvin Gießing	2ce5860a62	ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (llama/15385) * Added VSX intrinsics for Power9+ systems Signed-off-by: mgiessing <marvin.giessing@gmail.com> * Manual unrolling for minor perf improvement Signed-off-by: mgiessing <marvin.giessing@gmail.com> * Update ggml/src/ggml-cpu/arch/powerpc/quants.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: mgiessing <marvin.giessing@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-09-20 13:42:37 +03:00
Georgi Gerganov	80447f7412	cuda : remove obsolete sources (ggml/1332) ggml-ci	2025-09-20 13:42:37 +03:00
Carlos Zoido	44fa2f647c	ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (#3426 ) While working on the [whisper-cpp](https://conan.io/center/recipes/whisper-cpp) Conan package for ConanCenter, I noticed that enabling the `with_blas` option fails to build due to an issue in the _MKL_ detection logic. The problem is that the CMake condition currently expands `BLAS_INCLUDE_DIRS` without quotes: ```cmake if (${BLAS_INCLUDE_DIRS} MATCHES "mkl" AND (${GGML_BLAS_VENDOR} MATCHES "Generic" OR ${GGML_BLAS_VENDOR} MATCHES "Intel")) ``` When `BLAS_INCLUDE_DIRS` is a list (as Conan provides it), the `if()` command receives multiple arguments and produces a CMake error: ```bash ... -- BLAS found, Includes: /root/.conan2/p/b/openb034c5a6ca927b/p/include;/root/.conan2/p/b/openb034c5a6ca927b/p/include/openblas CMake Error at ggml/src/ggml-blas/CMakeLists.txt:77 (if): if given arguments: "/root/.conan2/p/b/openb034c5a6ca927b/p/include" "/root/.conan2/p/b/openb034c5a6ca927b/p/include/openblas" "MATCHES" "mkl" "AND" "(" "OpenBLAS" "MATCHES" "Generic" "OR" "OpenBLAS" "MATCHES" "Intel" ")" Unknown arguments specified ... ``` This PR fixes the issue by quoting the variable: ```cmake if ("${BLAS_INCLUDE_DIRS}" MATCHES "mkl" AND (${GGML_BLAS_VENDOR} MATCHES "Generic" OR ${GGML_BLAS_VENDOR} MATCHES "Intel")) ``` With this change, the whole list is treated as a single string and the regex still works correctly.	2025-09-19 05:33:53 +02:00
Reese Levine	5ed45b2518	ggml: Add initial WebGPU backend (llama/14521) ggml-ci	2025-08-18 20:30:45 +03:00
Aaron Teo	03d6607691	ggml : initial zDNN backend (llama/14975)	2025-08-18 20:30:45 +03:00
compilade	0fd4a250df	ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (llama/15379) * ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors * ggml-quants : avoid division by zero in make_q3_quants	2025-08-18 20:30:45 +03:00
Jeff Bolz	fcd694ec1a	vulkan: disable spirv-opt for bfloat16 shaders (llama/15352)	2025-08-18 20:30:45 +03:00
Jeff Bolz	6835e0cf77	vulkan: Use larger workgroups for mul_mat_vec when M is small (llama/15355) * vulkan: Use larger workgroups for mul_mat_vec when M is small Also use subgroup instructions for (part of) the reduction when supported. Without this, the more expensive reductions would eat into the benefits of the larger workgroups. * update heuristic for amd/intel Co-authored-by: 0cc4m <picard12@live.de> --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-08-18 20:30:45 +03:00
Dong Won Kim	c225f25907	vulkan: support sqrt (llama/15370)	2025-08-18 20:30:45 +03:00
Jeff Bolz	0a8285186a	vulkan: Optimize argsort (llama/15354) - Launch an appropriate number of invocations (next larger power of two). 32 invocations is common and the barrier is much cheaper there. - Specialize for "needs bounds checking" vs not. - Make the code less branchy and [[unroll]] the loops. In the final code, I see no branches inside the main loop (only predicated stores) when needs_bounds_check is false. - Always sort ascending, then apply the ascending vs descending option when doing the final stores to memory. - Copy the values into shared memory, makes them slightly cheaper to access.	2025-08-18 20:30:45 +03:00
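Two of the ideas in the argsort commit above, rounding the element count up to the next power of two with bounds checking and always sorting ascending before applying the requested order at the final store, can be sketched on the CPU as follows. Using a bitonic sorting network is an assumption of this sketch (the commit message does not name the algorithm), and `argsort_pow2` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Returns the indices that sort `v`, ascending or descending.
std::vector<std::size_t> argsort_pow2(const std::vector<float> & v, bool descending) {
    const std::size_t n = v.size();
    std::size_t padded = 1;
    while (padded < n) {
        padded <<= 1;  // next power of two >= n
    }

    std::vector<std::size_t> idx(padded);
    std::iota(idx.begin(), idx.end(), std::size_t{0});

    // Bounds check: padding indices (>= n) compare as larger than any real one.
    auto before = [&](std::size_t a, std::size_t b) {
        if (a >= n) return false;
        if (b >= n) return true;
        return v[a] < v[b];
    };

    // Bitonic network; the overall result is always ascending.
    for (std::size_t k = 2; k <= padded; k <<= 1) {
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            for (std::size_t i = 0; i < padded; ++i) {
                const std::size_t l = i ^ j;
                if (l <= i) continue;
                const bool up = (i & k) == 0;
                if (up ? before(idx[l], idx[i]) : before(idx[i], idx[l])) {
                    std::swap(idx[i], idx[l]);
                }
            }
        }
    }

    // Apply ascending vs descending only when writing the final result.
    std::vector<std::size_t> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = descending ? idx[n - 1 - i] : idx[i];
    }
    return out;
}
```

A bitonic network is not stable, so equal values may come back in either order in this sketch.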
Jeff Bolz	c44d449635	vulkan: fuse adds (llama/15252) * vulkan: fuse adds Fuse adds that have the same shape, which are common in MoE models. It will currently fuse up to 6 adds, because we assume no more than 8 descriptors per dispatch. But this could be changed. * check runtimeDescriptorArray feature * disable multi_add for Intel due to likely driver bug	2025-08-18 20:30:45 +03:00
Jeff Bolz	d14e626e6a	vulkan: Support mul_mat_id with f32 accumulators (llama/15337) * vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id * vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up - There's no explicit way to request f32 precision for mul_mat_id, but there probably should be, and this gets the code in place for that. - A couple fixes to check_results. - Remove casts to fp16 in coopmat1 FA shader (found by inspection).	2025-08-18 20:30:45 +03:00
Jeff Bolz	5b62995350	vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id (llama/15334)	2025-08-18 20:30:45 +03:00
rmatif	e27f4f205d	OpenCL: add initial FA support (llama/14987) * add F16/F16 fa support * fix kernel init * use mad instead of fma * use inline function * mark FA with sinks as unsupported for now * add pragma unroll to loops	2025-08-18 20:30:45 +03:00
lhez	77771b2711	opencl: add initial mxfp4 support via mv (llama/15270) * opencl: add reference `mul_mv_mxfp4_f32` * opencl: add reference `mul_mv_id` for mxfp4 * Q4_0 transpose fix for Adreno --------- Co-authored-by: shawngu-quic <shawngu@qti.qualcomm.com>	2025-08-18 20:30:45 +03:00
Georgi Gerganov	1e8d692365	vulkan : fix out-of-bounds access in argmax kernel (llama/15342) ggml-ci	2025-08-18 20:30:45 +03:00
Georgi Gerganov	1a92fde1b6	vulkan : fix compile warnings on macos (llama/15340) ggml-ci	2025-08-18 20:30:45 +03:00
Aaron Teo	f797a6f9c8	ggml: initial IBM zDNN backend (llama/14975) * ggml-zdnn: inital backend impl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> ggml-zdnn: temp change z17 to arch15 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> ggml-zdnn: fix build bugs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: tensor->extra logging check Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> ggml-zdnn: add layout name mapping, ztensor information Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> ggml-zdnn: separate logging into its own line Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> ggml-zdnn: add shape comparison Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> ggml-zdnn: add ggml_tensor shape log Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> ggml-zdnn: fix incorrect shape logging Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add output buffer check Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: run compute and store into tensor->extra Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add set_tensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add more loggers Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: update set_tensor logging to check only for matmul Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: last working matmul version Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add comments to prevent accidentally deleting lines Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: support op out_prod Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: update op out_prod to use tensor->extra Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rewrite the backend implementation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: bugfix new impl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix compiler warnings and bugfixes Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: test ztensor finding in init_tensor Signed-off-by: Aaron Teo 
<aaron.teo1@ibm.com> * ggml-zdnn: implement at least 1 op to test Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: assign tensor->extra to buffer Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add check for view tensors to prevent init_tensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rework init_tensor to create new buffers Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: switch to std vector instead of array Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: switch buffers back and set to arbitrary number Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: impl init_tensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: update supports_op matmul matrix Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix incorrect ztensor shape, reduce memory padding Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: code clean up Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: impl matmul Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix compiler error missing type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix missing data transform call Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add bias init_tensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: tighten memory usage, change string allocation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add bias ztensor and data free Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add bias data transform Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add more debug info for extra buffer transform Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add logger to check if mat mul ops go through set_tensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: activate bias transform in matmul Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: move weights transform into mulmat Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: 
add more safeguards in matmul Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix sequencing of transforms Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: bugfix transform ztensor vs origtensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: figure out why sigtrap is happening Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix sigsegv Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: move everything back to local declaration Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: move bias data to local also Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: bring back working matmul Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: rewrite into mre Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix missing vector import Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix missing vector import in header Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: attempt to fix sigsegv Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix missing load tensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix invalid ztensor buffer release Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add logging to debug free buffer Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: remove free_buffer debug info Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add parmblkformat detections Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add nnpa installed detection Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add zdnn_init call for static libs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add init_tensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: attempt at fixing invalid buffer Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: switch to using deque to fix pointer deref problem Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add weights logging to check 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: attempt to use unique ptr Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add tensor to pre_tfm_desc logging Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add inputs logging Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: disable op_none initialisation for testing Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix missing return from init_tensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: load ztensors in cgraph exec Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: work on moving output ztensor as well Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: disable logging and breakpoints for full test Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: attempt at manually changing the layout Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: attempt at using default nwhc format instead Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: disable global load ztensor for now Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix errorenous output load tensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: add guards to prevent loading ztensor if transformed Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: code cleanup Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: bring load ztensor back to init routine Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: code clean up Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix ztensor deallocation abort stabilise ggml <-> zdnn api Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: clean up matmul selection Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: clean up project structure Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: update documentation, prepare for upstream Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * chore: add codeowners Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * 
ggml-zdnn: disable batched matmul Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: attempt at fixing tensor views during matmul Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: deny all view tensors directly Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix pr comments Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: update ops docs for zdnn Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: redo test-backend-ops for ops.md Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-zdnn: fix typo in build-s390x.md Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * codeowners: remove taronaeo for now Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "codeowners: remove taronaeo for now" This reverts commit 411ea4ed78d08778967bd0bd33a6538cfcbe082f. * ggml-zdnn: remove unused ggml_zdnn macro Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-08-18 20:30:45 +03:00
Johannes Gäßler	ba32f5df0a	CUDA: fix negative KV_max values in FA (llama/15321)	2025-08-18 20:30:45 +03:00
uvos	0e15332255	HIP: Cleanup hipification header (llama/15285) add explicit conversion operator to support older versions of rocm Switch over to hip_bf16 from legacy hip_bfloat16 Simplify RDNA3 define Reduce swap over of new hipblas api to rocm 6.5 as this version is used for rocm 7.0 previews --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-18 20:30:45 +03:00
Jeff Bolz	1d8b21caa0	vulkan: perf_logger improvements (llama/15246) * vulkan: perf_logger improvements - Account for batch dimension in flops calculation. - Fix how "_VEC" is detected for mat_mul_id. - Fix "n" dimension for mat_mul_id (in case of broadcasting). - Include a->type in name. * use <=mul_mat_vec_max_cols rather than ==1	2025-08-18 20:30:45 +03:00
Jason Ni	4a6cf896ad	ggml: fix ggml_conv_1d_dw bug (ggml/1323) * ggml: fix ggml_conv_1d_dw bug * Fixed conv1d_dw weight tensor dimension.	2025-08-18 20:30:45 +03:00
Sigbjørn Skjæret	367cd11f5d	cuda : fix GGML_CUDA_GRAPHS=OFF (llama/15300) * fix USE_CUDA_GRAPH=OFF ggml-ci * check capture status * completely disable capturing check instead	2025-08-18 20:30:45 +03:00
Jonathan Graehl	c76ec72d59	finetune: SGD optimizer, more CLI args (llama/13873) * examples/finetune -opt SGD (stochastic gradient descent) memory opt add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating m, v tensors. support finetune.cpp arg -opt SGD (or sgd). (default adamw as before) llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw. (wikipedia 100 lines finetune) ( using the same GPU memory, adamw can only do before OOM 512 batch/context, reaching: train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00 val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00 SGD is superior, though it converges slower, with max before OOM 1728 batch/context (esp see the better validation perf): train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00 val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00 ) note: when finetuning long enough (or w/ enough -lr), validation accuracy eventually drops ('catastrophic forgetting') -lr-half (halflife) option useful for SGD to avoid oscillation or super slow underdamped learning (makes setting -lr more forgiving). terminal -lr for now is set by lr-halvings i.e. if you want at most 1/8 the inital -lr you set -lr-halvings 3. note: objective loss not directly comparable between adamw, sgd? - check perplexity or accuracy or consider relative improvements for convergence new finetune args -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before) cache (1 - wdalpha) in 'adamw' opt struct - no noticeable perf benefit, disabled (still done for new SGD though) since opt. 
memory is pre-allocated, the ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch but would need to use adamw for the first (unconfirmed - no cmdline arg to set such a policy yet) test-opt checks adamw as before and now sgd (except for a few disabled tests for sgd only; probably just needs logging values and adding alternate reference values); tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs) Vulkan: Implement GGML_OP_OPT_STEP_SGD * tests: Fix OPT_STEP_SGD test-backend-ops * SGD op param store weight-decay and not 1-alphawd minor + cosmetic changes * fix vulkan sgd * try CI fix --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-18 20:30:45 +03:00
uvos	cbaec6c4ac	HIP: bump requirement to rocm 6.1 (llama/15296)	2025-08-18 20:30:45 +03:00
Judd	80ef57f0f0	ggml : update `ggml_rope_multi` (llama/12665) * update `rope_multi`: 1. add `ggml_rope_multi_inplace`; 1. use `GGML_MROPE_SECTIONS` instead of 4. * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-08-18 20:30:45 +03:00
Georgi Gerganov	0e8b244366	ggml : repack block_iq4_nlx8 (llama/14904) ggml-ci	2025-08-18 20:30:45 +03:00
Oliver Simons	b8b1b50c47	CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (llama/15132) * Factor out `reduce_rows_f32` from common.cuh This increases iteration cycle speed by not having to recompile every kernel all the time * Hide memory-latency by loop unrolling in reduce_rows_f32 * Further optimizations to `reduce_rows_f32` 1. Increase threadblock size to better hide latency of memory requests. As a consequence of bigger threadblocks, do 2-step summation, using shared memory to communicate results between invocations 2. Use sum_temp array to reduce waits on sum 3. Adjust num_unroll to reflext bigger threadblock 4. Improve default block_dims, increase support for more block_dims * Add perf tests for `reduce_rows_f32` kernel * Add heuristic to toggle 128/512 threads based on sm count Break even point was the minimum of the following multiples. \| GPU Model \| Nrow SM Count Multiple \| \| ----------- \| ----------- \| \| RTX 4000 SFF ADA \| 2.0x \| \| RTX 6000 ADA \| 2.5x \| \| RTX PRO 6000 Blackwell Max-Q \| 3.04x \| \| RTX PRO 4500 Blackwell \| 3.15x \| * Ensure perf gains also for small ncols and large nrows Alternative to this, one could have also made the number of unrollings template-able, but that would require compiling the kernel multiple times, increasing binary size unnecessarily * Modify perf and unit-tests * Apply auto-formatting by clang * Fix CI build failure See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486 Building with VS generator worked though. 
* Remove sm_count property from `ggml_backend_cuda_context` Requested by @JohannesGaessler, and should fix remaining CI issues as a side-effect * Add CUB-based implementation for GGML_OP_MEAN Currently this branch is only executed for nrows==1 * Add heuristics to execute CUB branch only when it brings perf Heuristics were determined on the following HW: * RTX 4000 SFF ADA * RTX 6000 ADA * RTX PRO 6000 Blackwell Max-Q * RTX PRO 4500 Blackwell * Add unit-test for CUB-based mean Tests should run with CUDA Graphs enabled per default on NVGPUs * Rename `USE_CUB` to `GGML_CUDA_USE_CUB` Suggested by @JohannesGaessler * Unindent Preprocessor directives See https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506	2025-08-18 20:30:45 +03:00
Tak-RS	4e234ac013	ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others) (llama/15188) * ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others). Fixes #15055 * ggml-rpc: rename RPC_IO_CHUNK->MAX_CHUNK_SIZE, use std::min() for cap, switch to GGML_LOG_ERROR, handle 0-length send/recv * rpc: drop n==0 special case in send_data(); retry in loop per review * rpc: remove trailing whitespace in send_data() --------- Co-authored-by: Shinnosuke Takagi <nosuke@nosukenoMacBook-Pro.local>	2025-08-18 20:30:45 +03:00
uvos	8df931b608	HIP: disable sync warp shuffle operators from clr amd_warp_sync_functions.h (llama/15273)	2025-08-18 20:30:45 +03:00
Romain Biessy	1334f434f3	sycl: Fix and disable more configurations of mul_mat (llama/15151) * sycl: Fix and disable more configurations of mul_mat * Disable more configurations	2025-08-18 20:30:45 +03:00
rmatif	139110701e	opencl: allow mixed f16/f32 `add` (llama/15140)	2025-08-18 20:30:45 +03:00
Aman Gupta	082c7ba67c	CUDA cmake: add `-lineinfo` for easier debug (llama/15260)	2025-08-18 20:30:45 +03:00
Chenguang Li	0effaad964	CANN: GGML_OP_CPY optimization (llama/15070) Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-18 20:30:45 +03:00
R0CKSTAR	8e2ddfec31	musa: fix failures in test-backend-ops for mul_mat_id op (llama/15236) * musa: fix failures in test-backend-ops for mul_mat_id op Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-08-18 20:30:45 +03:00
hipudding	3e2c262c08	CANN: Add broadcast for softmax and FA (llama/15208) * refactor softmax * fix fa * fix mask shape * format * add comments * Remove whitespace	2025-08-18 20:30:45 +03:00
Charles Xu	30cc11dc94	kleidiai: fix unsigned overflow bug (llama/15150) * kleidiai: fix unsigned overflow bug * address review comments	2025-08-18 20:30:45 +03:00
David Zhao	457eadfe6f	cuda: refactored ssm_scan and use CUB (llama/13291) * cuda: refactored ssm_scan to use CUB * fixed compilation error when not using CUB * assign L to constant and use size_t instead of int * deduplicated functions * change min blocks per mp to 1 * Use cub load and store warp transpose * suppress clang warning	2025-08-18 20:30:45 +03:00
Aman Gupta	93c7a08019	CUDA: add attention sinks for tile and wmma (llama/15178) * CUDA: add attention sinks for tile and wmma * Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma	2025-08-18 20:30:45 +03:00
compilade	62566a5436	gguf-py : add Numpy MXFP4 de/quantization support (llama/15111) * gguf-py : add MXFP4 de/quantization support * ggml-quants : handle zero amax for MXFP4	2025-08-18 20:30:45 +03:00
AN Long	573bf9d128	ggml : fix field name when new ggml_backend (llama/14944)	2025-08-18 20:30:45 +03:00
Johannes Gäßler	2baea5e4b3	CUDA: attention sinks for mma FlashAttention (llama/15157)	2025-08-18 20:30:45 +03:00
lhez	8a36cd924a	opencl: support sink in `soft_max` (attn sinks) (llama/15152)	2025-08-18 20:30:45 +03:00
Jeff Bolz	1984530710	vulkan: support fattn sinks (llama/15126)	2025-08-18 20:30:45 +03:00
Jeff Bolz	414e9074e0	vulkan: Add env var to disable host visible vidmem (llama/15109)	2025-08-18 20:30:45 +03:00
uvos	813ceb2a74	HIP: add cmake option to enable compiler output of kernel resource usage metrics (llama/15103)	2025-08-18 20:30:45 +03:00
Christian Kastner	6d7ffea292	ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (llama/15094) Any available libraries are found and loaded dynamically at runtime.	2025-08-18 20:30:45 +03:00
Johannes Gäßler	5caf8a1ea2	CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (llama/15131) * CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16	2025-08-18 20:30:45 +03:00
rmatif	b405fd88b3	fix profiling crash (llama/15072)	2025-08-18 20:30:45 +03:00
lhez	d153cfb507	opencl: add `swiglu_oai` and `add_id` (llama/15121) * opencl: add `swiglu-oai` * opencl: add `add_id` * opencl: add missing `add_id.cl`	2025-08-18 20:30:45 +03:00
Diego Devesa	6fb55d8f7c	ggml : fix fallback to CPU for unsupported ops (llama/15118)	2025-08-18 20:30:45 +03:00
Chenguang Li	e809e81e69	CANN: add support for ACL Graph (llama/15065) * feat(cann): add optional support for ACL Graph execution This commit adds support for executing ggml computational graphs using Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be enabled at compile time using the CMake option: -DUSE_CANN_GRAPH=ON By default, ACL graph execution is disabled, and the fallback path uses node-by-node execution. Key additions: - CMake option to toggle graph mode - Graph capture and execution logic using - Tensor property matching to determine whether graph update is required - Safe fallback and logging if the environment variable LLAMA_SET_ROWS is unset or invalid This prepares the backend for performance improvements in repetitive graph execution scenarios on Ascend devices. Signed-off-by: noemotiovon <757486878@qq.com> * Fix review comments Signed-off-by: noemotiovon <757486878@qq.com> * rename USE_CANN_GRAPH to USE_ACL_GRAPH Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-18 20:30:45 +03:00
Georgi Gerganov	d3aab3efde	llama : add gpt-oss (llama/15091) * oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (llama/7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (llama/1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (llama/11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (llama/6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (llama/13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for 
__nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>	2025-08-18 20:30:45 +03:00
Romain Biessy	6558022873	sycl: fix mul_mat selection (llama/15092)	2025-08-18 20:30:45 +03:00
Christian Kastner	349b9a2097	cmake: Add GGML_BACKEND_DIR option (llama/15074) * cmake: Add GGML_BACKEND_DIR option This can be used by distributions to specify where to look for backends when ggml is built with GGML_BACKEND_DL=ON. * Fix phrasing	2025-08-18 20:30:45 +03:00
Jeff Bolz	00ff38376a	vulkan: fix build when using glslang that does not support coopmat2 (llama/15062)	2025-08-18 20:30:45 +03:00
Jeff Bolz	abc971e69a	vulkan: Use coopmat2 for conv2d (llama/14982)	2025-08-18 20:30:45 +03:00
lhez	53d8c5179f	opencl: fix adreno compiler detection logic (llama/15029)	2025-08-18 20:30:45 +03:00
Johannes Gäßler	d6e7315717	CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (llama/15035)	2025-08-18 20:30:45 +03:00
leejet	a3123e105b	cuda: make im2col a little faster (llama/15025)	2025-08-18 20:30:45 +03:00
Georgi Gerganov	d119ecf0c1	cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (llama/15038) * cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 ggml-ci * cont : fix cont types ggml-ci * cont : adopt variable names and comment from the other branch	2025-08-18 20:30:45 +03:00
Jeff Bolz	b374fd6172	vulkan: coopmat2 mul_mat optimizations (llama/14934) - Increase tile size for k-quants, to match non-k-quants - Choose more carefully between large and medium tiles, considering how it interacts with split_k - Allow larger/non-power of two split_k, and make the splits a multiple of 256 - Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used	2025-08-18 20:30:45 +03:00
Jeff Bolz	97341224b2	vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (llama/15015)	2025-08-18 20:30:45 +03:00
Jeff Bolz	46e9e5b9a7	vulkan: optimizations for direct convolution (llama/14933) * vulkan: optimizations for direct convolution - Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too. - Fix shmem bank conflicts. 16B padding should work with coopmat. - Some explicit loop unrolling. - Skip math/stores work for parts of the tile that are OOB. - Apply fastdiv opt. - Disable shuffles for NV. * Three tiles sizes for CONV_2D, and a heuristic to choose * reallow collectives for pre-Turing * make SHMEM_PAD a spec constant * fixes for intel perf - no shmem padding, placeholder shader core count * shader variants with/without unrolling * 0cc4m's fixes for AMD perf Co-authored-by: 0cc4m <picard12@live.de> --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-08-18 20:30:45 +03:00
Johannes Gäßler	7e7557ac50	CUDA: fix MMQ nwarps for AMD with warp_size==32 (llama/15014)	2025-08-18 20:30:45 +03:00
lhez	ba6a81c9c9	opencl: add f16 for `add`, `sub`, `mul`, `div` (llama/14984)	2025-08-18 20:30:45 +03:00
Srihari-mcw	1c6cb7df47	ggml : Q2k interleaving implementation - x86/x64 SIMD (llama/14373) * Initial Q2_K Block Interleaving Implementation * Addressed review comments and clean up of the code * Post rebase fixes * Initial CI/CD fixes * Update declarations in arch-fallback.h * Changes for GEMV Q2_K in arch-fallback.h * Enable repacking only on AVX-512 machines * Update comments in repack.cpp * Address q2k comments --------- Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>	2025-08-18 20:30:45 +03:00
diannao	78668cb8d1	docker : add cann build pipeline (llama/14591) * docker: add cann build pipeline * docker: add cann build pipeline * docker: fix cann devops * cann : fix multi card hccl * Update ggml/src/ggml-cann/ggml-cann.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update ggml-cann.cpp --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2025-08-18 20:30:45 +03:00
Ruben Ortlam	41e161657e	Vulkan: Fix minor debug mode issues (llama/14899) * vulkan: fix debug mode issues * vulkan: remove broken check_results GGML_OP_SET_ROWS support	2025-08-18 20:30:45 +03:00
hipudding	572152d6af	CANN: Improve loading efficiency after converting weights to NZ format. (llama/14985) * CANN: Improve loading efficiency after converting weights to NZ format. * CANN: fix typo	2025-08-18 20:30:45 +03:00
lhez	4904bc3bda	opencl: add `mul_mat_f32_f32_l4_lm` and `mul_mat_f16_f32_l4_lm` (llama/14809)	2025-08-18 20:30:45 +03:00
uvos	8ed27b407d	HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (llama/14949)	2025-08-18 20:30:45 +03:00
Johannes Gäßler	113d88686b	CUDA: skip masked KV slices for all FA kernels (llama/14924)	2025-08-18 20:30:45 +03:00
uvos	4e624e42fa	HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (llama/14945)	2025-08-18 20:30:45 +03:00
uvos	7f203f41aa	HIP: add GGML_HIP_MMQ_MFMA option to allow disabling the MFMA path. (llama/14930) This is useful for testing for regressions on GCN with CDNA hardware. With GGML_HIP_MMQ_MFMA=Off and GGML_CUDA_FORCE_MMQ=On we can conveniently test the GCN code path on CDNA. As CDNA is just GCN renamed with MFMA added and limited use ACC registers, this provides a good alternative for regression testing when GCN hardware is not available.	2025-08-18 20:30:45 +03:00
uvos	a3899e78af	HIP: Ignore unsupported unroll transformation in fattn-vec (llama/14931) llvm with the amdgcn target does not support unrolling loops with conditional break statements when those statements cannot be resolved at compile time. Similar to other places in GGML, let's simply ignore this warning.	2025-08-18 20:30:45 +03:00
hipudding	c42e55e054	CANN: Add ggml_set_rows (llama/14943)	2025-08-18 20:30:45 +03:00
Sigbjørn Skjæret	682d659416	cuda : add softcap fusion (llama/14907)	2025-08-18 20:30:45 +03:00
Aman Gupta	577f47111e	CUDA: add roll (llama/14919) * CUDA: add roll * Make everything const, use __restrict__	2025-08-18 20:30:45 +03:00
xctan	4dca34a4de	ggml-cpu : deduplicate scalar implementations (llama/14897) * remove redundant code in riscv * remove redundant code in arm * remove redundant code in loongarch * remove redundant code in ppc * remove redundant code in s390 * remove redundant code in wasm * remove redundant code in x86 * remove fallback headers * fix x86 ggml_vec_dot_q8_0_q8_0	2025-08-18 20:30:45 +03:00
Akarshan Biswas	4908e9dd05	SYCL: Add set_rows support for quantized types (llama/14883) * SYCL: Add set_rows support for quantized types This commit adds support for GGML_OP_SET_ROWS operation for various quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16 type in the SYCL backend. The quantization/dequantization copy kernels were moved from cpy.cpp to cpy.hpp to make them available for set_rows.cpp. This addresses part of the TODOs mentioned in the code. * Use get_global_linear_id() instead ggml-ci * Fix formatting ggml-ci * Use const for ne11 and size_t variables in set_rows_sycl_q ggml-ci * Increase block size for q kernel to 256 ggml-ci * Cleanup imports * Add float.h to cpy.hpp	2025-08-18 20:30:45 +03:00
Johannes Gäßler	24d3524bfd	CUDA: fix pointer incrementation in FA (llama/14916)	2025-08-18 20:30:45 +03:00
Alberto Cabrera Pérez	923619ffd5	sycl: refactor quantization to q8_1 (llama/14815) * sycl: quantization to q8_1 refactor * Refactored src1 copy logic in op_mul_mat	2025-08-18 20:30:45 +03:00
Kai Pastor	45784c05ae	cmake : Fix BLAS link interface (ggml/1316)	2025-08-18 20:30:45 +03:00
Kai Pastor	01bdc522e0	vulkan : fix 32-bit builds (ggml/1313) The pipeline member can be cast to VkPipeline. This is a VkPipeline_T* on 64 bit but a uint64_t on 32 bit. Cf. VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.	2025-08-18 20:30:45 +03:00
Georgi Gerganov	28b39c624e	ggml : remove old kompute, cann (skip) (#3349 ) ggml-ci	2025-07-30 16:08:57 +03:00
Erik Scholz	d96f4d8ea1	vulkan : add fp16 support for the conv_2d kernel (llama/14872) * add f16 to conv_2d testing * weaken conv2d test error threshold	2025-07-28 13:02:32 +03:00
Jeff Bolz	5693b857d2	vulkan: skip empty set_rows to avoid invalid API usage (llama/14860)	2025-07-28 13:02:32 +03:00
deepsek	b275e52b46	HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (llama/14624) This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908 CDNA2/GFX90a and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k is only enabled on CDNA3 for now as it fails to outperform blas in all cases on the other devices. Blas is currently only consistently outperformed on CDNA3 due to issues in the amd-provided blas libraries. This commit also improves the awareness of MMQ towards different warp sizes and as a side effect improves the performance of all quant formats besides q4_0 and q4_1, which regress slightly, on GCN gpus.	2025-07-28 13:02:32 +03:00
hipudding	4692558a1f	CANN: Implement GLU ops (llama/14884) Implement REGLU, GEGLU, SWIGLU ops according to #14158	2025-07-28 13:02:32 +03:00
R0CKSTAR	8643960acc	musa: fix build warnings (unused variable) (llama/14869) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-28 13:02:32 +03:00
Aaron Teo	6629201471	ggml-cpu : disable GGML_NNPA by default due to instability (llama/14880) * docs: update s390x document for sentencepiece Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit e086c5e3a7ab3463d8e0906efcfa39352db0a48d) * docs: update huggingface links + reword Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 8410b085ea8c46e22be38266147a1e94757ef108) * ggml-cpu: disable ggml-nnpa compile flag by default fixes #14877 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 412f4c7c88894b8f55846b4719c76892a23cfe09) * docs: update s390x build docs to reflect nnpa disable Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit c1eeae1d0c2edc74ab9fbeff2707b0d357cf0b4d) --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-07-28 13:02:32 +03:00
Gabe Goodhart	0b0de0bbf2	metal: SSM_SCAN performance (llama/14743) * feat: Add s_off as a parameter in the args struct This may not be necessary, but it more closely mirrors the CUDA kernel Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state This is a first attempt at optimizing the metal kernel. The changes here are: - Launch the kernel with a thread group of size d_state - Use simd groups and shared memory to do the summation for the y computation When tested with G4 tiny preview, this shows roughly a 3x speedup on prefill and 15% speedup on decode. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Update logic to correctly do the multi-layer parallel sum Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Correctly size the shared memory bufer and assert expected size relationships Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Compute block offsets once rather than once per token Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Use local variable for state recursion Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Use a secondary simd_sum instead of a for loop Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add assertion and comment about relationship between simd size and num simd groups Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parallelize of d_state for mamba-1 Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Parallel sum in SSM_CONV Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * Revert "feat: Parallel sum in SSM_CONV" After discussion with @compilade, the size of the parallelism here is not worth the cost in complexity or overhead of the parallel for. 
https://github.com/ggml-org/llama.cpp/pull/14743#discussion_r2223395357 This reverts commit 16bc059660c1c59e566628201c0ca2c20c9f4bc3. Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Simplify shared memory sizing Branch: GraniteFourPerf Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-Authored-By: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-28 13:02:32 +03:00
lhez	d414c3f6ac	opencl: add fused `rms_norm_mul` (llama/14841) * opencl: add fused `rms_norm` + `mul` * opencl: improve workgroup size for `rms_norm_mul`	2025-07-28 13:02:32 +03:00
Oliver Simons	bbf2389919	ggml : remove invalid portPos specifiers from dot files (llama/14838) Neither "g" nor "x" is a valid portPos specifier per the official [graphviz documents](https://graphviz.org/docs/attr-types/portPos/): > If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_". I verified locally that it falls back to the default portPos specifier if an invalid portPos is specified. As a consequence, we can remove the associated code.	2025-07-28 13:02:32 +03:00
Chris Rohlf	56350ecc12	rpc : check for null buffers in get/set/copy tensor endpoints (llama/14868)	2025-07-28 13:02:32 +03:00
Diego Devesa	270fa9b25c	sched : fix multiple evaluations of the same graph with pipeline parallelism (llama/14855) ggml-ci	2025-07-28 13:02:32 +03:00
R0CKSTAR	89ae789450	musa: upgrade musa sdk to rc4.2.0 (llama/14498) * musa: apply mublas API changes Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: update musa version to 4.2.0 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: restore MUSA graph settings in CMakeLists.txt Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: disable mudnnMemcpyAsync by default Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: switch back to non-mudnn images Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * minor changes Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: restore rc in docker image tag Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-28 13:02:32 +03:00
Kai Pastor	5823eabc78	cmake : Indent ggml-config.cmake (ggml/1310)	2025-07-28 13:02:32 +03:00
Alberto Cabrera Pérez	7dc5ae2d6a	sycl: fixed semantics of block offset calculation (llama/14814)	2025-07-28 13:02:32 +03:00
Georgi Gerganov	faedce5dcb	metal : fix fusion across different encoders (llama/14849) * metal : fix fusion across different encoders ggml-ci * cont : add assertion ggml-ci	2025-07-28 13:02:32 +03:00
Donghyeon Jeong	e648f9f079	sycl: fix undefined variable in work group size check (llama/14843)	2025-07-28 13:02:32 +03:00
Johannes Gäßler	95efcf011d	CUDA: fix overflow in FA, tune performance (llama/14840)	2025-07-28 13:02:32 +03:00
Johannes Gäßler	8272aa9f14	CUDA: fix compilation with GGML_CUDA_F16 (llama/14837)	2025-07-28 13:02:32 +03:00
Johannes Gäßler	a65976fc3c	CUDA: fix quantized KV cache + multiple sequences (llama/14822) * CUDA: fix quantized KV cache + multiple sequences * Update ggml/src/ggml-cuda/fattn-common.cuh Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-28 13:02:32 +03:00
lixing-star	026d8a0c6e	ggml: fix loongarch quantize_row_q8_1 error (llama/14827)	2025-07-28 13:02:32 +03:00
chen fan	49d5540206	CANN: weight format to NZ for Ascend310P3 (llama/14407) * weight format to nz for 310p * remove quant weight format to nz * clean code * fix * make the conditions for converting weights to NZ format consistent * clean code	2025-07-28 13:02:32 +03:00
Aman Gupta	f8402d0a95	CUDA: add fused rms norm (llama/14800)	2025-07-28 13:02:32 +03:00
Jeff Bolz	c91361379a	vulkan: fix rms_norm_mul to handle broadcasting dim0 (llama/14817)	2025-07-28 13:02:32 +03:00
Sigbjørn Skjæret	810018a63a	cuda : implement bf16 cpy ops and enable bf16 cont (llama/14763) * implement bf16 cpy ops and enable bf16 cont * deduplicate copy functions * deduplicate checks	2025-07-28 13:02:32 +03:00
lhez	de49384ab3	opencl: remove unreachable `return` (llama/14806)	2025-07-28 13:02:32 +03:00
R0CKSTAR	9008410087	cuda: remove linking to cublasLt (llama/14790) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-28 13:02:32 +03:00
Sigbjørn Skjæret	e81e17b048	opencl: fix `im2col` when `KW!=KH` (llama/14803)	2025-07-28 13:02:32 +03:00
rmatif	a2a5612402	opencl: add conv2d kernel (llama/14403) * add conv2d kernel * fix trailing whitespace * whitespace fix * handle f16 input and f16 kernel, more opt * resolve conflicts * use enqueue_ndrange_kernel	2025-07-28 13:02:32 +03:00
Romain Biessy	52ad451c8a	sycl: Fix im2col (llama/14797)	2025-07-28 13:02:32 +03:00
Charles Xu	fc2ff438fd	kleidiai: add support for get_rows (llama/14676) * kleidiai: add support for get_rows * apply fixes based on code review * apply more fixes based on code review	2025-07-28 13:02:32 +03:00
Jeff Bolz	e3f4162a06	vulkan/cuda: Fix im2col when KW!=KH (llama/14789) The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.	2025-07-28 13:02:32 +03:00
Ervin Áron Tasnádi	92a9e85d8b	ggml: adds CONV_2D op and direct GEMM Vulkan implementation (llama/14316) * ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan * ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly with gemm (no need for im2col), * test-backend-ops: adds test_case_ref to check the validity/performance of ops against reference implementations having different graphs, adds tests * * Performance fixes: minimized branch divergence, uses collectives to eliminate redundant calculation, macros removed. * Kernel shared memory size check * Updates test-backend-ops to support graphs for performance measurement. * * Apple/Win32 compile errors fixed * Subgroup size used to determine tile size -> fixes llvmpipe errors. * Collectives disabled by default. * Intel support is disabled as the performance is poor. * Conv2d enabled for Intel with disabled collectives, disabled for Apple * test-backend-ops modifications are reverted * Trailing spaces and missing override fixed. * Triggering pipeline relaunch. * Code formatted with .clang-format.	2025-07-28 13:02:32 +03:00
Peter0x44	50f983a17e	vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274) (llama/14707)	2025-07-28 13:02:32 +03:00
0cc4m	b06f314667	Vulkan: Fix fprintf format-security warning (llama/14770)	2025-07-28 13:02:32 +03:00
Kai Pastor	5c3b794c51	cmake : fix usage issues (ggml/1257) * CMake config: Create target only once Fix error on repeated find_package(ggml). For simplicity, check only for the top-level ggml::ggml. * CMake config: Add CUDA link libs * CMake config: Add OpenCL link libs * CMake config: Use canonical find_dependency Use set and append to control link lib variables. Apply more $<LINK_ONLY...>. * CMake config: Wire OpenMP dependency	2025-07-28 13:02:32 +03:00
Daniel Bevenius	e238dc1bdd	ggml-cpu : remove stdlib include from repack.cpp (ggml/1276) This commit removes the inclusion of `<cstdlib>`. The motivation for this change is that this source file does not seem to use any functions from this header and the comment about `qsort` is a little misleading/confusing.	2025-07-28 13:02:32 +03:00
Georgi Gerganov	0ed687c6f1	metal : fuse add, mul + add tests (llama/14596) ggml-ci	2025-07-20 00:23:50 +03:00
Oliver Simons	d4a7ea1634	cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (llama/14741) * Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs Gemma3n uses Matrix-Matrix addition as part of their input processing, wrongly triggering CUDA_GRAPH disablement on NVGPUs even when batch-size of 1 is used. * Exclude `project_per_layer_input` by matching node names This ensures that all other graphs which don't exhibit this pattern do not have their behavior changed. * Revert unnecessary formatting changes	2025-07-20 00:23:50 +03:00
Aman Gupta	9a07cb064a	CUDA: set_rows + cpy.cu refactor (llama/14712)	2025-07-20 00:23:50 +03:00
Neo Zhang Jianyu	fed20b0682	use max work group size for device to replace the magic number (llama/14732)	2025-07-20 00:23:50 +03:00
Reese Levine	17c5411195	ggml: Add initial WebGPU backend (llama/14521) * Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults * Initialize webgpu device * Making progress on setting up the backend * Finish more boilerplate/utility functions * Organize file and work on alloc buffer * Add webgpu_context to prepare for actually running some shaders * Work on memset and add shader loading * Work on memset polyfill * Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it * Implement get_tensor and buffer_clear * Finish rest of setup * Start work on compute graph * Basic mat mul working * Work on emscripten build * Basic WebGPU backend instructions * Use EMSCRIPTEN flag * Work on passing ci, implement 4d tensor multiplication * Pass thread safety test * Implement permuting for mul_mat and cpy * minor cleanups * Address feedback * Remove division by type size in cpy op * Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends * Fix name * Fix macos dawn prefix path	2025-07-20 00:23:50 +03:00
Georgi Gerganov	ae1bb2c8ea	llama : add high-throughput mode (llama/14363) * kv-cache : prepare K/V buffers for separation ggml-ci * batched-bench : fix oob write ggml-ci * llama : add "virtual sequences" ggml-ci * llama : use "stream" vs "virtual sequence" ggml-ci * graph : fix stream splitting when KV cache is not used ggml-ci * kv-cache : add multi-stream save/load support ggml-ci * llama : add "--attn-streams" flag ggml-ci * kv-cache : fix handling when find_slot fails ggml-ci * kv-cache : restore find_slot impl ggml-ci * kv-cache : add comments * kv-cache : add bounds checks for sequence id ggml-ci * cont : add n_seq_max to batch allocr ggml-ci * kv-cache : perform stream copies lazily after llama_synchronize ggml-ci * kv-cache : avoid throwing exceptions across the C boundary ggml-ci * CUDA: 4D FlashAttention support (llama/14628) * CUDA: 4D FlashAttention support * CUDA: fix WMMA FA kernel * llama : rename attn_streams -> kv_unified ggml-ci * common : rename kv_split -> kv_unified ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-07-20 00:23:50 +03:00
Georgi Gerganov	9cc645fec0	ggml : add asserts (llama/14720) * ggml : add asserts ggml-ci * cont : fix constant type Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-07-20 00:23:50 +03:00
Jeff Bolz	8d1a0485f1	vulkan: fix noncontig check for mat_mul_id splitting (llama/14683) * vulkan: fix noncontig check for mat_mul_id splitting Remove supports_op check for > 4096 (splitting fixes this) * vulkan: fix batched matmul dequant for Q*_K	2025-07-20 00:23:50 +03:00
Jeff Bolz	b33841c453	vulkan: add RTE variants for glu/add/sub/mul/div (llama/14653)	2025-07-20 00:23:50 +03:00
R0CKSTAR	ab79c6c118	cuda: fix build warnings in set-rows.cu (unused variable) (llama/14687) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-20 00:23:50 +03:00
Anton Mitkov	a6b9271c2c	sycl: Hotfix for non dnnl codepath (llama/14677)	2025-07-20 00:23:50 +03:00
shalinib-ibm	ded2e3cf6d	ggml : refactor llamafile_sgemm PPC code (llama/14673) Remove unnecessary templates from class definition and packing functions. Reduce deeply nested conditionals and if-else switching in the mnpack function. Replace repetitive code with inline functions in packing functions. 2 ~ 7% improvement in Q8 model; 15 ~ 50% improvement in Q4 model. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-07-20 00:23:50 +03:00
Akarshan Biswas	ebb0e9d0ed	SYCL: use 1D kernel for set_rows (llama/14618) * SYCL: Use 1D kernel for set_rows * Remove dangling comment * Refactor and use ceil_div	2025-07-20 00:23:50 +03:00
Anton Mitkov	24803d62c6	sycl: Batched mulmat rework for oneDNN dispatch (llama/14617)	2025-07-20 00:23:50 +03:00
Sigbjørn Skjæret	0611387d17	cuda : add set rows for bf16 (llama/14664)	2025-07-20 00:23:50 +03:00
Yavor Ivanov	fe33572b22	cuda : add ELU support (llama/14657)	2025-07-20 00:23:50 +03:00
Georgi Gerganov	21308b4e6e	ggml : add build-time message to remind about ggml_set_rows (llama/14661) ggml-ci	2025-07-20 00:23:50 +03:00
Yavor Ivanov	3cad26d807	metal : Add missing unary ops Metal support (llama/14660)	2025-07-20 00:23:50 +03:00
Aman Gupta	66b3a39bdc	CUDA: add set rows for f32 and f16 (llama/14551) * CUDA: add set rows for f32 and f16 * Review: change kernel params, use strides from host * Use 1-d kernel * Review: use int64_t for blockDim.x, rename nb->s for clarity	2025-07-20 00:23:50 +03:00
Georgi Gerganov	3775c503d5	sync : resolve conflicts (#0) ggml-ci	2025-07-12 19:23:56 +03:00
Georgi Gerganov	85dcc74b88	sync : resolve conflicts (ggml/0) ggml-ci	2025-07-12 19:23:56 +03:00
Jeff Bolz	915fc153a5	vulkan: support SET_ROWS (llama/14587) * vulkan: support SET_ROWS Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now. * vulkan: optimize set_rows Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.	2025-07-12 19:23:56 +03:00
Jeff Bolz	8670a3fd5d	vulkan: optimizations for deepseek prompt processing (llama/14555) * vulkan: allow unclamped loads in coopmat2 mul_mat_id shader * vulkan: increase coopmat2 mul_mat_id tile size * vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path * vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)	2025-07-12 19:23:56 +03:00
Tarek Dakhran	74f6d47904	model : support LiquidAI LFM2 hybrid family (llama/14620) Important: LFM2 was [merged](https://github.com/huggingface/transformers/pull/39340) into transformers, but has not yet been released. To convert into gguf, install transformers from source: ```shell pip install "transformers @ git+https://github.com/huggingface/transformers.git@main" ```	2025-07-12 19:23:56 +03:00
Slobodan Josic	a4ff4ec9cb	HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (llama/14634)	2025-07-12 19:23:56 +03:00
rmatif	b0754136be	opencl: add tiled mul_mat_f16_f32 (llama/14535) * add tiled mul_mat_f16_f32 * fix trailing whitespace * add insightful comments	2025-07-12 19:23:56 +03:00
lhez	6f113cbcaa	opencl: add `set_rows` for `f16` and `f32` (llama/14547) * opencl: add `set_rows` for `f16` and `f32` * opencl: better choose workgroup size for `set_rows`	2025-07-12 19:23:56 +03:00
Akarshan Biswas	3c21cde540	SYCL: Initial set_rows kernel implementation (llama/14562) * SYCL: Initial set_rows kernel implementation * Revert max_threads to 256 * Refactor set_rows and address review comments * Deduplicate conversion function * Remove guard before kernel launch and refactor * Fix and add back SFINAE	2025-07-12 19:23:56 +03:00
compilade	fb885fa48b	cuda : support Falcon-H1 state size for SSM_SCAN (llama/14602)	2025-07-12 19:23:56 +03:00
Xuan-Son Nguyen	2021870fb8	ggml : add ggml_scale_bias (llama/14417) * ggml : add ggml_scale_bias * ggml_vec_mad1_f32 * add more simd * add CUDA * sycl * vulkan * cann (placeholder) * opencl * will this fix cpu? * fix cuda * suggestions from coderabbit * fix cann compile error * vDSP_vsmsa * rm __ARM_FEATURE_SVE * use memcpy for op params * make code look more consistent * use scalar for __ARM_FEATURE_SVE * add x param to ggml_vec_mad1_f32	2025-07-12 19:23:56 +03:00
Miaoqian Lin	48b18f9eb8	ggml : prevent integer overflow in gguf tensor size calculation (llama/14595)	2025-07-12 19:23:56 +03:00
Jeff Bolz	fadb3233b6	vulkan: optimize flash attention split_k_reduce (llama/14554) * vulkan: allow FA split_k with smaller KV values * vulkan: spread split_k_reduce work across more threads k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).	2025-07-12 19:23:56 +03:00
Jeff Bolz	9750e4c988	vulkan : fix rope with partial rotation and non-cont src (llama/14582)	2025-07-12 19:23:56 +03:00
Georgi Gerganov	c3942b3db6	cuda : fix rope with partial rotation and non-cont src (llama/14580) * cuda : fix rope non-cont ggml-ci * cont : fix multi-rope + add test ggml-ci * sycl : try fix ggml-ci * cont : fix sycl + clean-up cuda ggml-ci	2025-07-12 19:23:56 +03:00
Aman Gupta	98e7beac6c	CUDA: add bilinear interpolation for upscale (llama/14563)	2025-07-12 19:23:56 +03:00
R0CKSTAR	7e9c6bbab2	musa: fix build warnings (unused variable) (llama/14561) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-07-12 19:23:56 +03:00
Aman Gupta	8e545f466c	CUDA: add bf16 and i32 to getrows (llama/14529)	2025-07-12 19:23:56 +03:00
Eve	e753b9a952	vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (llama/14485) Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260 Co-authored-by: Rémy Oudompheng <remyoudompheng@gmail.com>	2025-07-12 19:23:56 +03:00
Jeff Bolz	9d0c408260	vulkan: fix rms_norm+mul fusion (llama/14545) The fused operation was grabbing the epsilon value from the wrong place. Add an env var to disable fusion. Add some missing checks for supported shapes/types. Handle fused rms_norm+mul in check_results.	2025-07-12 19:23:56 +03:00
Jeff Bolz	3aebb8d5d3	vulkan: Handle updated FA dim2/3 definition (llama/14518) * vulkan: Handle updated FA dim2/3 definition Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit. * handle null mask for gqa * allow gqa with dim3>1	2025-07-12 19:23:56 +03:00
Sigbjørn Skjæret	df5af1dc75	opencl: add GELU_ERF (llama/14476)	2025-07-12 19:23:56 +03:00
Georgi Gerganov	10d0d28f7c	metal : disable fast math in all quantize kernels (llama/14528) ggml-ci	2025-07-12 19:23:56 +03:00
luyhcsu	af304ef080	CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (llama/14002) Co-authored-by: luyuhong <luyuhong@kylinos.cn>	2025-07-12 19:23:56 +03:00
Sigbjørn Skjæret	e8138c51d2	ggml : implement GEGLU_ERF and GEGLU_QUICK ops (llama/14445)	2025-07-12 19:23:56 +03:00
lhez	7cec4cc83a	opencl : broadcast for soft_max (llama/14510)	2025-07-12 19:23:56 +03:00
Jeff Bolz	a432929d58	vulkan: support mixed/deepseekR1 FA head sizes (llama/14509) * vulkan: better parameterize FA by head sizes * vulkan: support mixed/deepseekR1 FA head sizes	2025-07-12 19:23:56 +03:00
Johannes Gäßler	4aaf8114e7	ggml: backward pass for split swiglu (llama/14483)	2025-07-12 19:23:56 +03:00
Nicolò Scipione	0ca760433c	Fix conditional enabling following arch checks for ggml-sycl (llama/14504) Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-07-12 19:23:56 +03:00
Georgi Gerganov	ed639c7f22	kv-cache : use ggml_set_rows (llama/14285) * kv-cache : use ggml_set_rows ggml-ci * graph : separate k and v indices ggml-ci * cont : remove redundant ifs ggml-ci * kv-cache : improve find_slot impl * kv-cache : bounds-check when accessing slot_info indices * kv-cache : add comments ggml-ci * ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends ggml-ci	2025-07-12 19:23:56 +03:00
Georgi Gerganov	0abd0660e1	ggml : fix FA mask dim 2 and 3 (llama/14505) * ggml : fix FA mask dim 2 and 3 ggml-ci * backends : unsupport batched FA in CUDA and Vulkan ggml-ci * vulkan : disable FA for mask->ne[2] != 1	2025-07-12 19:23:56 +03:00
Aman Gupta	9cde908c0a	CUDA: add dynamic shared mem to softmax, refactor general usage (llama/14497)	2025-07-12 19:23:56 +03:00
compilade	d2d120c256	llama : initial Mamba-2 support (llama/9126) * llama : initial Mamba-2 support * ggml : SIMD ggml_ssm_scan for Mamba-2 * ggml : improve ggml_mul speed when masking recurrent states * llama : support running Mamba-Codestral-7B-v0.1 * llama : fix Mamba-2 conv state saving * ggml : make the ggml_mul fast broadcast path more consistently formatted * llama : remove unused variable * llama : add missing break * convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly. * llama : avoid redundant state copy for Mamba 1 and 2 * metal : attempt to adapt SSM_SCAN for Mamba-2 * metal : fix SSM_SCAN pipeline scope * metal : use log and exp instead of log1pf and expf in SSM_SCAN * metal : remove unused arguments for SSM_SCAN The max index is 31, so trimming the arguments is necessary. * metal : add back n_seqs to SSM_SCAN args Whoops, this is needed for the offset in the concatenated output. * metal : fix SSM_SCAN state head offset * metal : fix wrong number of tokens per sequence in SSM_SCAN * ggml : remove unused fast broadcast path in GGML_MUL This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity. * ggml : avoid multiply by D in GGML_OP_SSM_SCAN This makes the weight buft detection in src/llama.cpp simpler. * convert : transpose Mamba-2 A, D and reshape SSM_NORM This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner. * llama : more appropriate SSM_SCAN and SSM_CONV buft support checks * convert : fix flake8 lint * metal : fix confusion between ; and , * metal : add missing args for nb references in ssm_scan_f32_group * metal : single-user mamba2 inference works * kv-cache : remove const_cast when setting inputs for s_copy And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy. * convert : avoid AutoConfig for Mamba and Mamba2 hparams * kv-cache : allow context shift for recurrent models * graph : fix recurrent state copies when avoiding copies Works, but using lambda functions might not be that clean. * ggml : fix mamba2 ssm scan when compiled with SVE * ggml-cpu : reorder SVE FMA for consistency with other SIMD arches * cuda : implement ssm scan for Mamba2 There is still room for improvement, but it works! * cuda : adapt Mamba1 ssm scan to shape changes from Mamba2 * mamba : fix mismatched new and delete size for llm_build_mamba Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled -DGGML_SANITIZE_ADDRESS=ON * cuda : graceful fallback for Mamba-1 models with weird embd size	2025-07-12 19:23:56 +03:00
Aman Gupta	fb5c4095ee	CUDA: add softmax broadcast (llama/14475) * CUDA: add softmax broadcast * Pass by const ref * Review: Use blockDims for indexing, remove designated initializers * Add TODO for noncontiguous input/output	2025-07-12 19:23:56 +03:00
Johannes Gäßler	70515ed728	CUDA: broadcasting for FlashAttention mask (llama/14500)	2025-07-12 19:23:56 +03:00
Jeff Bolz	1b3e06a400	vulkan: support softmax/FA batch and broadcast (llama/14449)	2025-07-12 19:23:56 +03:00
Georgi Gerganov	d1286cf32b	ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (llama/14435)	2025-07-12 19:23:56 +03:00
zhouwg	2e04b81f3e	opencl : fix possible buffer overflow in dump_tensor (llama/14490)	2025-07-12 19:23:56 +03:00
Eric Zhang	cd87a2f7e0	opencl : skip empty nodes on cgraph compute (llama/14491)	2025-07-12 19:23:56 +03:00
lhez	e43c38f9f1	opencl : update upscale to support align corners (llama/14488)	2025-07-12 19:23:56 +03:00
Björn Ganster	ab850d4680	ggml : Callback before abort (llama/14481) * Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed. * Return previous callback to allow callback chaining * style fixes --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-07-12 19:23:56 +03:00
Georgi Gerganov	cdf5e72163	ci : disable fast-math for Metal GHA CI (llama/14478) * ci : disable fast-math for Metal GHA CI ggml-ci * cont : remove -g flag ggml-ci	2025-07-12 19:23:56 +03:00
Chenguang Li	32d7c10766	CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (llama/14411) * [CANN]update to aclnnGroupedMatmulV2 Signed-off-by: noemotiovon <757486878@qq.com> * Support MUL_MAT_ID on 310p Signed-off-by: noemotiovon <757486878@qq.com> * fix editorconfig Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-07-12 19:23:56 +03:00
Jeff Bolz	3c7939cfe5	vulkan: Split large mul_mat_id to fit in shared memory (llama/14451)	2025-07-12 19:23:56 +03:00
Sigbjørn Skjæret	6fc80e8456	add GELU_ERF (llama/14455)	2025-07-12 19:23:56 +03:00
Acly	19b9aaf044	vulkan : implement bilinear interpolation for ggml_upscale/ggml_interpolate (ggml/1291) * supports GGML_SCALE_MODE_BILINEAR and GGML_SCALE_FLAG_ALIGN_CORNERS	2025-07-12 19:23:56 +03:00
Acly	f98cb6607b	vulkan : implement ggml_roll (ggml/1290) * vulkan : implement ggml_roll * vulkan : refactor vk_op_unary_push_constants initialization	2025-07-12 19:23:56 +03:00
Daniel Bevenius	5ea5c58768	ggml : add version function to get lib version (ggml/1286) * ggml : add version function to get lib version This commit adds a function `ggml_version()` to the ggml library that returns the version of the library as a string. The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used. Usage: ```c printf("GGML version: %s\n", ggml_version()); ``` Output: ```console GGML version: 0.0.2219 ``` * ggml : add ggml_commit() --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-12 19:23:56 +03:00
Georgi Gerganov	c4ea72be9a	ggml : remove trailing whitespace (llama/0)	2025-07-01 17:54:53 +03:00
lhez	1e930ab1b8	opencl : add GEGLU, REGLU, SWIGLU (llama/14456)	2025-07-01 17:54:53 +03:00
Aman Gupta	b5b237d49a	Add Conv2d for CPU (llama/14388) * Conv2D: Add CPU version * Half decent * Tiled approach for F32 * remove file * Fix tests * Support F16 operations * add assert about size * Review: further formatting fixes, add assert and use CPU version of fp32->fp16	2025-07-01 17:54:53 +03:00
Georgi Gerganov	679f31a9d1	metal : disable fast-math for some cpy kernels (llama/14460) * metal : disable fast-math for some cpy kernels ggml-ci * cont : disable for q4_1 ggml-ci * cont : disable for iq4_nl ggml-ci	2025-07-01 17:54:53 +03:00
Romain Biessy	e29e36aee7	ggml-cpu: sycl: Re-enable exp f16 (llama/14462)	2025-07-01 17:54:53 +03:00
xiaobing318	6bb1234a56	cmake : Remove redundant include path in CMakeLists.txt (llama/14452) * Update docker.yml Modify docker.yml so that the workflow stops running periodically; it can still be started manually when needed * Remove redundant include path in CMakeLists.txt The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths. * Enable scheduled Docker image builds Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.	2025-07-01 17:54:53 +03:00
Akarshan Biswas	e81be92931	SYCL: disable faulty fp16 exp kernel (llama/14395) * SYCL: disable faulty fp16 CPU exponent for now * Revert "SYCL: disable faulty fp16 CPU exponent for now" This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202. * SYCL: disable faulty fp16 CPU exponent for now * Fix logic of disabling exponent kernel	2025-07-01 17:54:53 +03:00
Sigbjørn Skjæret	130044f228	ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (llama/14443)	2025-07-01 17:54:53 +03:00
Sigbjørn Skjæret	8bc638ee56	ggml : implement REGLU/GEGLU/SWIGLU ops (llama/14158) * implement unary REGLU/GEGLU/SWIGLU cpu ops * relax constraints * duplicate shape of source * fix ggml_vec_geglu_f16 * special case gated ops * implement unary REGLU/GEGLU/SWIGLU cuda ops * tighten constraints again * refactor into GGML_GLU_OP * metal : add glu kernels ggml-ci * add CUDA_GLU_BLOCK_SIZE [no ci] * more constraints and use 64bit ints ggml-ci * 64bit multiplication [no ci] * implement swapped variants (cpu/cuda) * update comment [no ci] ggml-ci * Vulkan: Add GLU ops and shaders * SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate * ggml : implement GLU for split up/gate (llama/14181) * implement GLU for split up/gate * add tests for ggml_glu_split * Vulkan: Implement glu_split logic and shader support * add split to logging [no ci] * SYCL: refactor element_size ops and add split up and gate support to gated kernels * SYCL: switch GEGLU to use tanh approximation --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> * GGML: increase OP count in assertion * Refactor: Optimize SYCL element-wise operations with unary function inlining This commit refactors the SYCL element-wise operations to improve performance by: - Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead. - Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic. - Replacing direct kernel calls with calls to these inlined functions. - Using `__dpct_inline__` to encourage compiler inlining. - Minor code cleanup and consistency improvements. The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices. * vulkan: Increase workgroup size for GLU, for performance (llama/14345) * vulkan: Increase workgroup size for GLU, for performance * vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup * merge fix * metal : add support for split and swap ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-07-01 17:54:53 +03:00
Jeff Bolz	00b36237ba	vulkan: Add fusion support for RMS_NORM+MUL (llama/14366) * vulkan: Add fusion support for RMS_NORM+MUL - Add a use_count to ggml_tensor, so we can detect if an output is used more than once. - Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor. - Add detection logic and basic fusion logic in ggml-vulkan. - Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test. * extract some common fusion logic * fix -Winconsistent-missing-override * move ggml_can_fuse to a common function * build fix * C and C++ versions of can_fuse * move use count to the graph to avoid data races and double increments when used in multiple threads * use hash table lookup to find node index * change use_counts to be indexed by hash table slot * minimize hash lookups style fixes * last node doesn't need single use. fix type. handle mul operands being swapped. * remove redundant parameter --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-07-01 17:54:53 +03:00
Aman Gupta	b900ee424c	CUDA: add bf16 and f32 support to cublas_mul_mat_batched (llama/14361) * CUDA: add bf16 and f32 support to cublas_mul_mat_batched * Review: add type traits and make function more generic * Review: make check more explicit, add back comments, and fix formatting * Review: fix formatting, remove useless type conversion, fix naming for bools	2025-07-01 17:54:53 +03:00
Jeff Bolz	f641a4c410	vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (llama/14378)	2025-07-01 17:54:53 +03:00
Jeff Bolz	9e48afba2f	vulkan: lock accesses of pinned_memory vector (llama/14333)	2025-07-01 17:54:53 +03:00
Xinpeng Dou	f31ed384f4	fix async_mode bug (llama/14432)	2025-07-01 17:54:53 +03:00
Jeff Bolz	0b09f5bbad	vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (llama/14427) This setting needs to be passed through to vulkan-shaders-gen	2025-07-01 17:54:53 +03:00
Radoslav Gerganov	48fb51f314	ggml : add ggml_set_rows (llama/14274) * ggml : add ggml_set_rows Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'. ref: #8366 * use I64 for indices * ggml : add repeat impl for i64 * ggml : add ggml_is_contiguous_rows * ggml : ggml_set_rows support broadcast * ggml : ggml_set_rows support quantized dst ggml-ci * ggml : support GGML_TYPE_F32 ".from_float" trait * ggml : ggml_set_rows update comment + better index name * tests : add ggml_set_rows * metal : add ggml_set_rows implementation ggml-ci * ggml : simplify forward_dup_f32 * ggml : fix supports_op * tests : add comment to set_rows * ggml : leave the repeat_i64 for a separate PR ggml-ci * ggml : set_rows use std::min instead of MIN * ggml : better error message for set_rows unsupported type * metal : perform op->type check only once * tests : more consistent implementation + more tests ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-01 17:54:53 +03:00
bandoti	566462a5c0	cmake: regen vulkan shaders when shaders-gen sources change (llama/14398) * Add shaders-gen sources as target deps	2025-07-01 17:54:53 +03:00
Georgi Gerganov	c300f1e32d	metal : add special-case mat-vec mul for ne00 == 4 (llama/14385) ggml-ci	2025-07-01 17:54:53 +03:00
Georgi Gerganov	c848b9fbef	metal : batch rows copy in a single threadgroup (llama/14384) * metal : batch rows copy in a single threadgroup ggml-ci * metal : handle some edge cases when threadgroup size is not a power of 2 ggml-ci	2025-07-01 17:54:53 +03:00
R0CKSTAR	a5e6a3c953	musa: enable fp16 mma (all) and cublas on qy2 (llama/13842) * musa: enable fp16 mma (all) and cublas on qy2 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-07-01 17:54:53 +03:00
Aaron Teo	16aa7d151d	ggml-cpu: enable IBM NNPA Vector Intrinsics (llama/14317) * ggml-cpu: add nnpa compile flag Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1) * ggml-cpu: add fp16->fp32 nnpa first Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929) * ggml-cpu: add fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627) * ggml-cpu: better variable names Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f) * docs: update s390x docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7) * ggml-cpu: add debugging prints to see if dlf16 is correct Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix print vs printf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix float placeholder Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: ensure fp16 and fp32 load and stores are called Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fp16 load ensured to hit Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove sigint from fp16 store for some reason, the function is not getting a hit when debugged with gdb. 
we will need to investigate further Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: nnpa switch to vec_xst test Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to vec_xst for 4 element loops also Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rework noop Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove noop, general code cleanup Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: clarify variable naming Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add breakpoint for debugging Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: test fix for conversion failure Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: disable fp32->fp16 nnpa conversions for now there are some conversion failures in nnpa that requires the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to elif macro Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: reattempt fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix typo Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: reattempt fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix compiler types Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: change to typedef vector types Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add 4 element loops for fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: clarified vector naming Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back fp32->fp16 store nnpa Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add nnpa macro check in ggml-impl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add missing __func__ Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: diagnose why __NNPA__ macro is not being defined Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: import vecintrin.h to fix compiler errors Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update macro tests Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move s390x typedef to own header file Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: move s390x typedef to own header file" This reverts commit 157f856c34589566151630e294563a420702db39. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to importing ggml-cpu-impl instead Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix macro declaration Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: test more macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add debug prints Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bruteforce macro definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move macro definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add ggml-impl.h to cmakelists Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to private macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move s390x typedef to own header file Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 157f856c34589566151630e294563a420702db39) * ggml-cpu: move things around Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back compile macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to quotes for import Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add compiler error macro Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add s390x detection in ggml-src Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back compile definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: undo cmakelists work Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: move s390x typedef to own header file" This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove typedefs.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove typedef from cmakelists Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add ggml-impl.h future notes Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add todo comment for future reference Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: clarify naming of dlf16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove unnecessary target compile definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: update broken huggingface link for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix duplicate func names during compile Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: fix duplicate func names during compile" This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu" This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: refactor fp16<->fp32 simd to ggml-cpu Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix missing simd-mappings.h import in quants.c Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix missing simd-mappings.h within repack Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix amx mmq missing simd-mappings.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: attempt at fixing loongarch failing build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move nnpa together with other fp16<->fp32 simd Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix wrong refactor of ggml-base ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: remove dependency on ggml-cpu from ggml-base Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove mistaken fallback macro fallback logic was already implemented but i was too sleepy to realise Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move ggml_table_f32_f16 to ggml-cpu ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures" This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml: move ggml_table_f32_f16 to ggml-cpu" This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4. 
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move ggml_table_f32_f16 to ggml-cpu ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4) * ggml: move ggml_table_f32_f16 to ggml-cpu.c Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: extern c ggml_table_f32_f16 + chore docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h we rely on the variable declaration in ggml-cpu.c instead Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h" This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back ggml_table_f32_f16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: bring back ggml_table_f32_f16" This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * fix ggml time initialization * fix f32_f16 table init * remove extra line --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: slaren <slarengh@gmail.com>	2025-07-01 17:54:53 +03:00
Sigbjørn Skjæret	99764f5767	ggml : do not output unprintable characters on GGUF load failure (llama/14381)	2025-07-01 17:54:53 +03:00
Anton Mitkov	fc28594112	sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (llama/13973)	2025-07-01 17:54:53 +03:00
lhez	acfbf2921b	opencl: ref count `ggml_backend_opencl_context` and refactor profiling (llama/14254) * Move profiling info into `ggml_backend_opencl_context` * Add `enqueue_ndrange_kernel` to launch kernel	2025-07-01 17:54:53 +03:00
uvos	6a1d12a8ea	CUDA/HIP: optimize mmv paths taken for HIP devices (llama/14324) Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-07-01 17:54:53 +03:00
Johannes Gäßler	06b01ba87b	CUDA: mul_mat_v support for batch sizes > 1 (llama/14262) * CUDA: mul_mat_v support for batch sizes > 1 * use 64 bit math for initial offset calculation	2025-07-01 17:54:53 +03:00
uvos	791201a974	HIP: enable vec fattn on RDNA4 (llama/14323)	2025-07-01 17:54:53 +03:00
Aman Gupta	abb650c0ec	CUDA: add mean operation (llama/14313) * CUDA: add mean operation * add back sum_rows_f32_cuda * Review: early exit if col!=0	2025-07-01 17:54:53 +03:00
Markus Tavenrath	e036676795	Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (llama/13792) * Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled. * remove #ifdef for debug utils and add queue marker.	2025-07-01 17:54:53 +03:00
Georgi Gerganov	c1418b9906	metal : fix thread-safety (llama/14300) ggml-ci	2025-07-01 17:54:53 +03:00
Acly	9d7cb80f04	ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285) * add "align corners" mode for bilinear upscale, and allow downscaling * add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag * test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners	2025-07-01 17:54:53 +03:00
Daniel Bevenius	515df20351	ggml-quants : rename best_mad to best_error (ggml/1283) This commit renames the variable `best_mad` to `best_error` in the `make_qkx2_quants` function. The motivation for this is that the name `best_mad` can be somewhat confusing if mean absolute deviation (MAD) is not in use.	2025-07-01 17:54:53 +03:00
Aman Gupta	b68222f92c	CUDA: add conv_2d_transpose (llama/14287) * CUDA: add conv_2d_transpose * remove direct include of cuda_fp16 * Review: add brackets for readability, remove ggml_set_param and add asserts	2025-06-21 07:34:17 +03:00
Nicolò Scipione	a455dcb04c	sycl: add usage of enqueue_functions extension (llama/14244) * Add header and namespace to use enqueue_functions extension * Convert submit and parallel_for to use new extension in convert.cpp * Convert submit and parallel_for to use extension in ggml-sycl.cpp * Convert submit and parallel_for to use extension in gla.cpp * Convert submit and parallel_for in mmq.cpp * Convert submit and parallel_for in mmvq.cpp * Convert submit and parallel_for in remaining files * Convert all simple parallel_for to nd_launch from enqueue_functions extension * Wrapping extension in general function Create a general function that enables the enqueue_functions extension if it is enabled in the compiler, otherwise call the general SYCL function to launch kernels. --------- Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-06-21 07:34:17 +03:00
Christian Kastner	af7168174c	Implement GGML_CPU_ALL_VARIANTS for PowerPC (llama/14286) * Add PowerPC feature detection and scoring * ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC * ggml-cpu: Delay some initializations until function is called When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU. --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-06-21 07:34:17 +03:00
Diego Devesa	33d1f0a3e0	cuda : synchronize graph capture and cublas handle destruction (llama/14288) Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread	2025-06-21 07:34:17 +03:00
Georgi Gerganov	018b2d340e	ggml : fix repack work size for mul_mat_id (llama/14292) ggml-ci	2025-06-21 07:34:17 +03:00
Charles Xu	694f435d22	ggml: Update KleidiAI to v1.9.0 (llama/14277)	2025-06-21 07:34:17 +03:00
Aman Gupta	5efd43c956	CUDA: add conv_2d_dw (llama/14265) * CUDA: add conv_2d_dw * better naming * simplify using template * Review: fix operation ordering in ggml-cuda, use __forceinline__, use more const	2025-06-21 07:34:17 +03:00
Diego Devesa	71adde9203	ggml-cpu : remove unnecessary arm feature detection (llama/14281) Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.	2025-06-21 07:34:17 +03:00
fanyang	cef59c1e26	build : suppress gcc15 compile warnings (llama/14261) * Change _contains_any() substrs to std::string_view and fix the find comparison logic.	2025-06-21 07:34:17 +03:00
Anton Mitkov	a02a2d4240	sycl: Cleanup codepaths in Get Rows in sycl backend (llama/14215) Addresses unused reorder path	2025-06-21 07:34:17 +03:00
Aaron Teo	be4ea0826b	llamafile : support s390x SIMD instruction set (llama/14273)	2025-06-21 07:34:17 +03:00
0cc4m	1aca7b5c8a	Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (llama/14249)	2025-06-21 07:34:17 +03:00
Georgi Gerganov	b251d739ad	metal : add mean kernel (llama/14267) * metal : add mean kernel ggml-ci * cont : dedup implementation ggml-ci	2025-06-21 07:34:17 +03:00
Aaron Teo	203451bcba	ggml-cpu: reduce asm calls for hsum (llama/14037) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-06-21 07:34:17 +03:00
Aaron Teo	34940abe53	ggml-cpu: fix uncaught underscore terminators (llama/14023) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-06-21 07:34:17 +03:00
Charles Xu	4fc9c34126	ggml: Add Apple support for GGML_CPU_ALL_VARIANTS (llama/14258)	2025-06-21 07:34:17 +03:00
Acly	471df139fa	Add `ggml_roll` (ggml/1274) * ggml : add ggml_roll * use set/get_op_params & std::min	2025-06-21 07:34:17 +03:00
bandoti	0e068779c7	cmake: remove shader-gen step-targets from ggml-vulkan (llama/14226) * Remove step-targets from vulkan-shaders-gen * Unset DESTDIR when building vulkan-shaders-gen	2025-06-18 12:40:34 +03:00
xctan	ac8a303c9a	ggml-cpu : remove the weak alias trick (llama/14221)	2025-06-18 12:40:34 +03:00
R0CKSTAR	2a84593960	musa: fix build warning (unused variable) (llama/14231) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-06-18 12:40:34 +03:00
Diego Devesa	44871c8a3e	llama : add thread safety test (llama/14035) * llama : add thread safety test * llamafile : remove global state * llama : better LLAMA_SPLIT_MODE_NONE logic when main_gpu < 0 GPU devices are not used --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-18 12:40:34 +03:00
bandoti	ad6cd94a3a	cmake: clean up external project logic for vulkan-shaders-gen (llama/14179) * Remove install step for vulkan-shaders-gen * Add install step to normalize msvc with make * Regenerate modified shaders at build-time	2025-06-18 12:40:34 +03:00
uvos	dbad9d8fba	HIP: disable rocwmma on gfx12 by default until rocm 7.0 (llama/14202)	2025-06-18 12:40:34 +03:00
Charles Xu	518835ee56	ggml: Add Android support for GGML_CPU_ALL_VARIANTS (llama/14206)	2025-06-18 12:40:34 +03:00
Jeff Bolz	a3d1c55c66	vulkan: mutex around vkQueueSubmit (llama/14127) This fixes the remaining crash in test-thread-safety on my system.	2025-06-18 12:40:34 +03:00
xctan	0c25129d30	ggml-cpu : rework weak alias on apple targets (llama/14146) * ggml-cpu : rework weak alias on apple targets * fix powerpc detection * fix ppc detection * fix powerpc detection on darwin	2025-06-18 12:40:34 +03:00
uvos	a433680a2f	CUDA/HIP: fix ssm_scan on devices where warp size is not 32 (llama/14196)	2025-06-18 12:40:34 +03:00
uvos	aeaed9806f	HIP: Replace usage of deprecated preprocessor macro __AMDGCN_WAVEFRONT_SIZE__ (llama/14183)	2025-06-18 12:40:34 +03:00
Anton Mitkov	4ea599afdf	sycl: Adding additional cpy dbg print output (llama/14034)	2025-06-18 12:40:34 +03:00
Ewan Crawford	783cf0309f	SYCL: Bump oneMath commit (llama/14152) Update oneMath commit to merged PR https://github.com/uxlfoundation/oneMath/pull/669 which adds SYCL-Graph support for recording CUDA BLAS commands. With this change the `MUL_MAT` tests now pass on DPC++ CUDA backends with SYCL-Graph enabled. Prior to this change, an error would be thrown. ``` $ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2 UR CUDA ERROR: Value: 700 Name: CUDA_ERROR_ILLEGAL_ADDRESS Description: an illegal memory access was encountered Function: operator() Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154 Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN) Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator() SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code! in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598 $HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf ptrace: Operation not permitted. No stack. The program is not being run. ```	2025-06-18 12:40:34 +03:00
Anton Mitkov	0097eaf839	sycl: Remove not needed copy f16->f32 for dnnl mul mat (llama/14125)	2025-06-18 12:40:34 +03:00
Georgi Gerganov	a96a880f7b	cmake : handle whitespaces in path during metal build (llama/14126) * cmake : handle whitespaces in path during metal build ggml-ci * cont : proper fix ggml-ci --------- Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2025-06-18 12:40:34 +03:00
Christian Kastner	26c16ad6bd	Implement GGML_CPU_ALL_VARIANTS for ARM (llama/14080) * ggml-cpu: Factor out feature detection build from x86 * ggml-cpu: Add ARM feature detection and scoring This is analogous to cpu-feats-x86.cpp. However, to detect compile-time activation of features, we rely on GGML_USE_<FEAT> which need to be set in cmake, instead of GGML_<FEAT> that users would set for x86. This is because on ARM, users specify features with GGML_CPU_ARM_ARCH, rather than with individual flags. * ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM Like x86, however to pass around arch flags within cmake, we use GGML_INTERNAL_<FEAT> as we don't have GGML_<FEAT>. Some features are optional, so we may need to build multiple backends per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring function sort out which one can be used. * ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now The other platforms will need their own specific variants. This also fixes the bug that the variant-building branch was always being executed as the else-branch of GGML_NATIVE=OFF. The branch is moved to an elseif-branch which restores the previous behavior.	2025-06-18 12:40:34 +03:00
Jeff Bolz	40d0d47cf1	vulkan: Better thread-safety for command pools/buffers (llama/14116) This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.	2025-06-18 12:40:34 +03:00
Jeff Bolz	40c6525517	vulkan: Track descriptor pools/sets per-context (llama/14109) Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.	2025-06-18 12:40:34 +03:00
lhez	74c68067dc	opencl: add `mul_mv_id_q4_0_f32_8x_flat` (llama/14003)	2025-06-18 12:40:34 +03:00
0cc4m	794bf23994	Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (llama/14099)	2025-06-18 12:40:34 +03:00
Isaac McFadyen	26dcc196c7	rpc : nicer error messages for RPC server crash (llama/14076)	2025-06-18 12:40:34 +03:00
Daniel Bevenius	ffe5400d1b	ggml : disable warnings for tests when using MSVC (ggml/1273) * ggml : disable warnings for tests when using MSVC This commit disables warnings for tests on windows when using MSVC. The motivation for this is that this brings the build output more in line with what Linux/MacOS systems produce. There is still one warning generated for the tests which is: ```console Building Custom Rule C:/ggml/tests/CMakeLists.txt cl : command line warning D9025: overriding '/DNDEBUG' with '/UNDEBUG' [C:\ggml\build\tests\test-arange.vcxproj] test-arange.cpp test-arange.vcxproj -> C:\ggml\build\bin\Release\test-arange.exe ``` * ggml : fix typo in tests disable list	2025-06-18 12:40:34 +03:00
Daniel Bevenius	1b01c0cc4e	ggml : remove unused ggml_context_container (ggml/1272) This commit removes the unused `ggml_context_container` structure from the ggml library. It looks like the usage of this struct was removed in Commit 4757fe18d56ec11bf9c07feaca6e9d5b5357e7f4 ("ggml : alloc ggml_contexts on the heap (whisper/2525)"). The motivation for this change is to improve code clarity/readability.	2025-06-18 12:40:34 +03:00
Daniel Bevenius	db30f46761	examples : include examples in msvc disable warn (ggml/1270) This commit adds the examples in the "list" of targets to ignore MSVC warnings. The motivation for this is that currently the examples generate a number of warnings that are ignored/disabled for the core ggml project. This makes for a cleaner output when building.	2025-06-18 12:40:34 +03:00
Georgi Gerganov	93d543905e	ggml : fix weak alias win32 (#0) ggml-ci	2025-06-10 12:40:33 +03:00
Georgi Gerganov	175e7e4f1a	files : remove old sources (part 2)	2025-06-10 12:40:33 +03:00
Georgi Gerganov	38347a7dda	files : remove old sources	2025-06-10 12:40:33 +03:00
Georgi Gerganov	7a675807a2	metal : use less stack memory in FA kernel (llama/14088) * metal : use less stack memory in FA kernel ggml-ci * cont : fix BF16 variant	2025-06-10 12:40:33 +03:00
xctan	8cbc889f85	ggml-cpu : split arch-specific implementations (llama/13892) * move ggml-cpu-aarch64 to repack * split quantize_row_q8_0/1 * split helper functions * split ggml_vec_dot_q4_0_q8_0 * split ggml_vec_dot_q4_1_q8_1 * split ggml_vec_dot_q5_0_q8_0 * split ggml_vec_dot_q5_1_q8_1 * split ggml_vec_dot_q8_0_q8_0 * split ggml_vec_dot_tq1_0_q8_K * split ggml_vec_dot_tq2_0_q8_K * split ggml_vec_dot_q2_K_q8_K * split ggml_vec_dot_q3_K_q8_K * split ggml_vec_dot_q4_K_q8_K * split ggml_vec_dot_q5_K_q8_K * split ggml_vec_dot_q6_K_q8_K * split ggml_vec_dot_iq2_xxs_q8_K * split ggml_vec_dot_iq2_xs_q8_K * split ggml_vec_dot_iq2_s_q8_K * split ggml_vec_dot_iq3_xxs_q8_K * split ggml_vec_dot_iq3_s_q8_K * split ggml_vec_dot_iq1_s_q8_K * split ggml_vec_dot_iq1_m_q8_K * split ggml_vec_dot_iq4_nl_q8_0 * split ggml_vec_dot_iq4_xs_q8_K * fix typos * fix missing prototypes * rename ggml-cpu-quants.c * rename ggml-cpu-traits * rename arm folder * move cpu-feats-x86.cpp * rename ggml-cpu-hbm * update arm detection macro in quants.c * move iq quant tables * split ggml_quantize_mat_q8_0/K * split ggml_gemv_* * split ggml_gemm_* * rename namespace aarch64 to repack * use weak aliases to replace test macros * rename GGML_CPU_AARCH64 to GGML_CPU_REPACK * rename more aarch64 to repack * clean up rebase leftover * fix compilation errors * remove trailing spaces * try to fix clang compilation errors * try to fix clang compilation errors again * try to fix clang compilation errors, 3rd attempt * try to fix clang compilation errors, 4th attempt * try to fix clang compilation errors, 5th attempt * try to fix clang compilation errors, 6th attempt * try to fix clang compilation errors, 7th attempt * try to fix clang compilation errors, 8th attempt * try to fix clang compilation errors, 9th attempt * more cleanup * fix compilation errors * fix apple targets * fix a typo in arm version of ggml_vec_dot_q4_K_q8_K Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-10 12:40:33 +03:00
Diego Devesa	e16a84cd95	cuda : fix device sync on buffer clear (llama/14033)	2025-06-10 12:40:33 +03:00
Xinpeng Dou	26282282fa	CANN: Simplify the environment variable setting (#13104) * Simplify the environment variable setting to specify the memory pool type. * Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options. * update * fix CI * update * delete whitespace * fix according to review * update CANN.md * update CANN.md	2025-06-10 12:40:33 +03:00
Nicolò Scipione	4737a8c780	sycl: Add reorder to Q6_K mmvq implementation (llama/13885) * Add Reorder to Q6_K mmvq implementation * Address PR comments: clean up comments * Remove unused parameter after refactoring q4_k * Adding inline to function and removing unnecessary reference to int --------- Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-06-10 12:40:33 +03:00
Diego Devesa	8a70f4d18b	cuda : fix buffer type check with integrated GPUs (llama/14069)	2025-06-10 12:40:33 +03:00
Akarshan Biswas	489dc158a6	SYCL: Implement few same quantized type copy kernels (llama/13739) * SYCL: Implement few same quantized type copy kernels * Use memcpy for copying contiguous tensors ggml-ci * feat(sycl): add contiguous tensor copy support and device checks Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance. * refactor: replace specific block copy functions with template The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed. * Exclude BF16 support for COPY tensors for now ggml-ci * perf: adjust SYCL copy kernel block sizes for efficiency Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.	2025-06-10 12:40:33 +03:00
Masato Nakasaka	f0f5a9f7fb	vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (llama/14001) * allowing B580 and U9-288V * experimenting code to detect Xe2 * allowing coopmat only for Xe2 GPUs * fixed comment wording * fixed comment wording * removed unnecessary driver check	2025-06-10 12:40:33 +03:00
Diego Devesa	13a03c5d33	llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (llama/14013)	2025-06-10 12:40:33 +03:00
Jeff Bolz	6dd91d4f7e	vulkan: automatically deduce size of push constants (llama/13936)	2025-06-10 12:40:33 +03:00
Ervin Áron Tasnádi	5171b24f70	ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (llama/13813) * * ggml-vulkan: adds op CONV_TRANSPOSE_1D * test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D * Missing barrier added to shader. Number of additional tests reduced to 108. * * Fixes typo in variable name. * Removes extra whitespaces. * Adds int64->int32 casts to prevent possible warnings. * Problem size reduced in tests to pass tests with llvmpipe. * supports_op condition moved from unintended position	2025-06-10 12:40:33 +03:00
Diego Devesa	23e2fe0682	releases : use dl backend for linux release, remove arm64 linux release (llama/13996)	2025-06-10 12:40:33 +03:00
Johannes Gäßler	7f4d110f53	CUDA: fix FTZ in FA for Gemma 3 (llama/13991)	2025-06-10 12:40:33 +03:00
Jeff Bolz	ee0ef39fee	vulkan: fix warnings in perf logger querypool code (llama/13937)	2025-06-10 12:40:33 +03:00
lhez	62791ba2e6	opencl: add `backend_synchronize` (llama/13939) * This is not needed by the normal use where the result is read using `tensor_get`, but it allows perf mode of `test-backend-ops` to properly measure performance.	2025-06-10 12:40:33 +03:00
rmatif	e16ef08884	OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (llama/13840) * add concat, pad, repeat, tsembd, tanh, upscale * small fixes	2025-06-10 12:40:33 +03:00
Georgi Gerganov	c72d3ce935	metal : use F32 accumulators in FA kernels (llama/13975) ggml-ci	2025-06-10 12:40:33 +03:00
shalinib-ibm	126aeb4a49	cmake : Handle mixed-case 'Power' strings in POWER CPU detection (llama/13966) Some systems report the CPU implementation as "Power11" instead of "POWER11". The existing CMake logic uses a case-sensitive regular expression to extract the CPU generation, which fails when the casing doesn't exactly match "POWER". This patch provides a fix by first converting the string to uppercase before applying the regex. Signed-off-by: root <root@rheldb2v.pperf.tadn.ibm.com> Co-authored-by: root <root@rheldb2v.pperf.tadn.ibm.com>	2025-06-10 12:40:33 +03:00
Atharva Dubey	ef2a79d2b8	sycl: quantize and reorder the input to q8_1 when reorder is enabled (llama/13826) * [WIP]: fuse q8 quantization and reorder * wip2: fuse q8 quantization and reorder * working q8 reorder commit * restored common.hpp * remove debug prints * remove unnecessary headers and remove trailing whitespace * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com> --------- Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@intel.com>	2025-06-10 12:40:33 +03:00
Johannes Gäßler	9589645e72	gguf: fix failure on version == 0 (llama/13956)	2025-06-10 12:40:33 +03:00
Aaron Teo	20f913d119	ggml: check if non-native endian model is being loaded (llama/13943) * gguf: prevent non-native endian models from being loaded Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * gguf: update error message Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * gguf: make the non-native endian check more verbose Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move ggml_assert location Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: reword the endianness check error message Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-06-10 12:40:33 +03:00
Kai Pastor	b933d17c30	Add in-build ggml::ggml ALIAS library (ggml/1260) Enable uniform linking with subproject and with find_package.	2025-06-10 12:40:33 +03:00
Max Krasnyansky	1e16340f4b	threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling (llama/12995) * threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling We talked about adding LOW priority for GGML threads in the original threadpool PR. It might be useful for some cases to avoid contention. Latest Windows ARM64 releases started parking (offlining) the CPU cores more aggressively which results in suboptimal performance with n_threads > 4. To deal with that we now disable Power Throttling for our threads for the NORMAL and higher priorities. Co-authored-by: Diego Devesa <slarengh@gmail.com> * threading: disable SetThreadInfo() calls for older Windows versions * Update tools/llama-bench/llama-bench.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-06-01 15:14:44 +03:00
Shawn yang	4a50254998	CUDA: add a prop in ggml_cuda_device_info to distinguish iGPU or dGPU in cuda (#13856) (llama/13895) * 1. add "integrated" in ggml_cuda_device_info to distinguish whether it is integrated_gpu or discrete_gpu 2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted code indentation Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Fixed incorrect setting of variable types Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted the judgment logic Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()' * Update ggml/src/ggml-cuda/ggml-cuda.cu Add a defensive security assert Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/ggml-cuda.cu Adjusted the support judgment logic. Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * revoke the suggest commit changes due to it's not applicable in jetson_device * Update ggml/src/ggml-cuda/ggml-cuda.cu Add parentheses to enforce operator precedence Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update ggml/src/ggml-cuda/ggml-cuda.cu Fix ci bug: add a spaces Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: yangxiao <yang_xl@tju.edu.cn> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: yangxiao <yangxl_zz@qq.com> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-06-01 15:14:44 +03:00
Johannes Gäßler	a5aff28198	CUDA: fix typo in FlashAttention code (llama/13926)	2025-06-01 15:14:44 +03:00
Diego Devesa	6c0472ab8f	sched : avoid changing cur_copy when a graph is already allocated (llama/13922)	2025-06-01 15:14:44 +03:00
Diego Devesa	b14cee184a	cuda : prevent using split buffers with 3d/4d matrices (llama/13919)	2025-06-01 15:14:44 +03:00
Akarshan Biswas	f7f92d0aab	SYCL: Add mrope kernel (llama/13755) * SYCL: Add mrope kernel * feat: Optimize rope operations with vectorization Uses `sycl::vec` to load and store two elements at a time, significantly improving performance in `rope_norm`, `rope_neox`, and `rope_multi`. This reduces the number of memory accesses and leverages SIMD instructions for faster execution. * Use ceil_div	2025-06-01 15:14:44 +03:00
Christian Kastner	1893359cfd	cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (llama/13890)	2025-06-01 15:14:44 +03:00
Yibo Cai	ea643c6ae3	arm64: optimize q4_k_q8_k kernel with i8mm (llama/13886) This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction. Tested on neoverse-n2 with llama3 8b q4_k_m quantization model. - 34% ~ 50% S_PP uplift for all batch sizes - 12% ~ 37% S_TG uplift for batch size 4 and above Perplexity doesn't change with this PR. ``` // tested on neoverse-n2 $ llama-batched-bench \ -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \ --no-mmap -fa \ -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \ -npl 1,2,4,8,16,32 \ -t 64 --------------------------------------------------------------------- \| PP \| TG \| B \| S_PP t/s \| S_TG t/s \| \| \| \| \| original \| this pr \| original \| this pr \| \|-------\|--------\|------\|----------\|----------\|----------\|----------\| \| 128 \| 128 \| 1 \| 110.12 \| 147.83 \| 24.36 \| 24.28 \| \| 128 \| 128 \| 2 \| 121.16 \| 172.42 \| 46.36 \| 47.93 \| \| 128 \| 128 \| 4 \| 120.15 \| 169.75 \| 74.68 \| 84.00 \| \| 128 \| 128 \| 8 \| 130.97 \| 196.81 \| 91.04 \| 114.74 \| \| 128 \| 128 \| 16 \| 131.01 \| 196.88 \| 101.43 \| 135.79 \| \| 128 \| 128 \| 32 \| 130.85 \| 196.51 \| 106.97 \| 147.29 \| --------------------------------------------------------------------- ```	2025-06-01 15:14:44 +03:00
Christian Kastner	1d7b3c79f4	cmake: Factor out CPU architecture detection (llama/13883) * cmake: Define function for querying architecture The tests and results match exactly those of src/CMakeLists.txt * Switch arch detection over to new function	2025-06-01 15:14:44 +03:00
Vineel Abhinav	ccfaac2bb0	ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (llama/13882) * F32-Mamba-Seq_Scan-SVE * Fix formatting * ggml : missing space --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-01 15:14:44 +03:00
Vineel Abhinav	1230d37bca	ggml: aarch64: Implement SVE F32 kernels for vector functions (llama/13843) * F32-Mamba-SVE * F32-Mamba-SVE * Resolve test errors-1 * Resolve test errors-2 * F32-vec-SVE * F32-vec-SVE * F32-vec-SVE	2025-06-01 15:14:44 +03:00
Johannes Gäßler	9a500394ad	CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)	2025-06-01 15:14:44 +03:00
leo-pony	0035b8527c	CANN: Add SOC TYPE printing in cmake configuration (llama/13837)	2025-06-01 15:14:44 +03:00
lhez	3623186312	opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (llama/13787) * opencl: add `argsort` * opencl: add `div` * opencl: add `add_rows` * opencl: add `sub` * opencl: add `sigmoid`, both `f16` and `f32` * opencl: add `group_norm`	2025-06-01 15:14:44 +03:00
lhez	67beac47f3	opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (llama/13790)	2025-06-01 15:14:44 +03:00
Jeff Bolz	47a19bae25	vulkan: use timestamp queries for GGML_VULKAN_PERF (llama/13817) Also change it to be controlled by an env var rather than cmake flag	2025-06-01 15:14:44 +03:00
Akarshan Biswas	3d5c7ca4bc	SYCL: add gelu_erf kernel (llama/13749) * SYCL: add gelu_erf kernel * refactor code Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com> * Use scope_op_debug_print --------- Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>	2025-06-01 15:14:44 +03:00
Xuan-Son Nguyen	4dfb2c2215	ggml : add ggml_repeat_4d (llama/13824)	2025-06-01 15:14:44 +03:00
Kai Pastor	ad433403ce	vulkan : Remove unexpected ; (ggml/1253)	2025-06-01 15:14:44 +03:00
Kai Pastor	4064dd6484	cmake : Fix broken CMake error messages (ggml/1252)	2025-06-01 15:14:44 +03:00
Radoslav Gerganov	fd75c4995b	ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247) The implementation is already deleted with commit 9d0762e. closes: #1235	2025-06-01 15:14:44 +03:00
Daniel Tang	4d18e52f55	ggml : Fix backtrace breaking Windows build (#3203)	2025-05-29 13:26:58 +03:00
Radoslav Gerganov	48dddbbac1	ggml : install dynamic backends (ggml/1240)	2025-05-29 09:56:26 +03:00
Daniel Tang	5ea2c37a4c	ggml : Print backtrace on uncaught C++ exceptions (ggml/1232) The goal is to have what users call "full logs" contain the backtrace. This is registered upon ggml_init. Also fixes a minor fd leak on Linux.	2025-05-29 09:56:26 +03:00
Simon Booth	5720426d97	whisper : install shared libs when using GGML_BACKEND_DL (#3195)	2025-05-28 10:15:04 +02:00
xctan	15ae9dc2a4	ggml : riscv: add xtheadvector support (llama/13720) * ggml : riscv: add xtheadvector support * ggml : clean up some macro usage	2025-05-27 18:03:00 +03:00
Christian Kastner	2e7a1e3e43	ggml-cpu: x86 feature detection is specific to x86 (llama/13811)	2025-05-27 18:03:00 +03:00
Diego Devesa	b75babebb2	ggml : allow CUDA graphs when using pipeline parallelism (llama/13814)	2025-05-27 18:03:00 +03:00
Georgi Gerganov	cc7a0105ef	cuda : avoid cuGetErrorString (llama/13791) ggml-ci	2025-05-27 18:03:00 +03:00
Akarshan Biswas	195fde8804	SYCL: Add non contiguous support in RMS_NORM and NORM kernels (llama/13611) * SYCL: Add non contiguous input support to norm kernel * refactor and add RMS_NORM non contiguous input support ggml-ci * restore subgroup reduction for multi-subgroup thread blocks in norm kernels * Swap grid dims of nsamples and nrows ggml-ci * Revert "Swap grid dims of nsamples and nrows" This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf. * restore not required changes ggml-ci * address review comments: change it to more like SYCL * Use a common function to calculate offset * remove wrap around logic for handling broadcasts * remove static from calculate_offset fn and use ceil_div	2025-05-27 18:03:00 +03:00
Romain Biessy	25e27904ca	sycl: Add more debug prints (llama/13640)	2025-05-27 18:03:00 +03:00
Jeff Bolz	474f7be8b6	vulkan: mark IM2COL as supporting non-contig (llama/13783)	2025-05-27 18:03:00 +03:00
Bizhao Shi	e35fecc2a1	CANN: Add the basic supports of Flash Attention kernel (llama/13627) * cann: add the basic FA support * cann: update the readme * cann: update the FlashAttention with PSEShift * cann: update the input parameters in FA * cann: update the alibi with max_bias * cann: add the constraints of softcap * cann: update the docs CANN.md * cann: update the docs CANN.md * cann: fix typo of CANN.md * cann: add some comments and update the CANN.md * cann: update the CANN.md * cann: update the inner precise for fusedInferAttention * cann: update the constraints of flash_attn_ext on ggml-cann.cpp * cann: clean the whitespace * cann: clean the whitespace * cann: add a new endline	2025-05-27 18:03:00 +03:00
Akarshan Biswas	1cd7028428	SYCL: revert "sycl: simplify bin_bcast_kernel (ggml/13383)" (llama/13752) Temporarily reverted due to failing fp16 DIV operation This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5. ggml-ci	2025-05-27 18:03:00 +03:00
Diego Devesa	99596d6031	ggml-cpu : set openmp wait time if not set (llama/13758)	2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen	2d6c6862f7	ggml : add ggml_gelu_erf() CUDA kernel (llama/13719) * ggml : add ggml_gelu_erf() CUDA kernel * missing semicolon	2025-05-27 18:03:00 +03:00
Johannes Gäßler	f1576b2659	CUDA: fix race condition in FA vector kernels (llama/13742)	2025-05-27 18:03:00 +03:00
Chenguang Li	994b4f86ab	CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705) * [CANN]Support MUL_MAT_ID Q8 && Q4 Signed-off-by: noemotiovon <757486878@qq.com> * codestyle adjustment Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen	3e7eaccf55	ggml : fix the order of ggml_unary_op (llama/13718)	2025-05-27 18:03:00 +03:00
Jeff Bolz	191f040414	vulkan: support CPY from any type to itself (llama/13695) Reuse the f16/f32 copy shaders, and just scale the number of elements according to the type size.	2025-05-27 18:03:00 +03:00
Jeff Bolz	2d49d4a9b5	vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama/13696)	2025-05-27 18:03:00 +03:00
Judd	000d65befb	use LOG_WARN to replace `std::cerr` (llama/13657)	2025-05-27 18:03:00 +03:00
Nicolò Scipione	f0803e6646	sycl : Remove waits from function calls (llama/13702) * removes the waits in async memcpy functions	2025-05-27 18:03:00 +03:00
Ewan Crawford	730a00be8a	SYCL: Avoid using with SYCL-Graph for unsupported nodes (llama/13587) Currently on a CUDA backend to SYCL when running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there are two operations that throw an exception from the blocking waits during queue recording. * `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187 * `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074 We've noticed that `ggml-cuda.cu` has the [check_node_graph_compatibility_and_refresh_copy_ops](`39e73ae0d6/ggml/src/ggml-cuda/ggml-cuda.cu (L2458-L2458)`) method for checking if a graph can be used, even if enabled. I've taken a similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking if a graph can be used for the operations even if a user has asked for it to be enabled.	2025-05-27 18:03:00 +03:00
Henry Linjamäki	316600e8ee	opencl: Add support for multiple devices (llama/12622) * opencl: Add support for multiple devices ... but limited to one platform. A platform with a GPU will be preferred. Additionally: * Filter out devices that lack capabilities needed by the backend implementation (half support, OpenCL 2.0+, etc). * Make ggml_backend_opencl_reg() thread-safe. * fixup: fix an error in sync_with_other_backends ... when there is only one OpenCL device available.	2025-05-27 18:03:00 +03:00
Henry Linjamäki	42f2b3bb65	opencl: fix couple crashes (llama/12795) * opencl: fix couple crashes * fix kernel launches failed on devices which do not support non-uniform work-groups. When non-uniform work-groups are not supported, set `local_work_size` to NULL (= let driver choose the work-group sizes). This patch does not cover everything - just the cases tested by test-backend-ops. * fix sub-buffer creation failed due to `cl_buffer_region::origin` not being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`. * OpenCL: query non-uniform WG sizes only on OpenCL 3.0+	2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen	dd6ef64060	ggml : add ggml_gelu_erf() (llama/13667) * ggml : add ggml_gelu_na (not approximated) * fix naming order * rename na --> erf * apply review suggestions * revert naming order	2025-05-27 18:03:00 +03:00
R0CKSTAR	131ee546ca	musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (llama/13647) * musa: fix build warning (unused parameter) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: upgrade MUSA SDK version to rc4.0.1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Update ggml/src/ggml-cuda/cpy.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-05-27 18:03:00 +03:00
Eve	4712f7b663	vulkan: fix warnings (llama/13626) * small fixes * remove ifdef	2025-05-27 18:03:00 +03:00
Johannes Gäßler	926fe234e9	CUDA: skip fully masked-out KV in FA vec kernel (llama/13584) * CUDA: skip fully masked-out KV in FA vec kernel	2025-05-27 18:03:00 +03:00
Svetlozar Georgiev	f44b53480f	sycl: disable reorder for sycl mulmat (llama/13536)	2025-05-27 18:03:00 +03:00
Georgi Gerganov	e04e8f1c79	metal : fix typo in FA kernel comments (llama/13651)	2025-05-27 18:03:00 +03:00
Nicolò Scipione	ee3f177cba	sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482) * Remove mmap workaround on windows After some testing I found that mmap is supported on windows and for many GPUs on Linux. Therefore I remove the workaround for windows since it is not necessary. * Update llama-bench README SYCL backend introduced a workaround that allows execution of llama-bench also without specifying `--mmp 0` flag	2025-05-27 18:03:00 +03:00
0cc4m	0b69f74e15	Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607)	2025-05-27 18:03:00 +03:00
Chenguang Li	9da3fc27be	CANN: Support MOE Model MUL_MAT_ID (llama/13042) Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-19 14:58:39 +03:00
Gilad S.	2c13651e08	cmake: use the current build config for vulkan-shaders-gen (llama/13595) * fix: use the current build config for `vulkan-shaders-gen` * fix: only pass a valid build type to `--config`	2025-05-19 14:58:39 +03:00
Jeff Bolz	13dca86c56	vulkan: move common FA code to flash_attn_base.comp (llama/13556) * vulkan: move common FA code to flash_attn_base.comp * vulkan: move common FA index/stride setup code to flash_attn_base.comp * build fix	2025-05-19 14:58:39 +03:00
Jeff Bolz	6d61a09bc4	vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)	2025-05-19 14:58:39 +03:00
Georgi Gerganov	4fedad988b	metal : add FA-vec kernel for head size 64 (llama/13583) ggml-ci	2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk	a8e17a244d	sycl : fixed compilation warnings (llama/13582)	2025-05-19 14:58:39 +03:00
Diego Devesa	0c76acd08a	gguf : use ggml log system (llama/13571) * gguf : use ggml log system * llama : remove unnecessary new lines in exception messages	2025-05-19 14:58:39 +03:00
Atharva Dubey	27964db1be	sycl: simplify bin_bcast_kernel (llama/13383)	2025-05-19 14:58:39 +03:00
Svetlozar Georgiev	8081e7a23d	sycl: reordered Q4_K MMVQ (llama/13109)	2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk	d807c497a4	sycl: use oneDNN for matrices multiplication (llama/12972)	2025-05-19 14:58:39 +03:00
Yibo Cai	8e9bf548f4	arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519) This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction. Tested on neoverse-n2 with llama3 8b q6_k quantization model. - 40% ~ 54% S_PP uplift for all batch sizes - 16% ~ 47% S_TG uplift for batch size 4 and above Perplexity doesn't change with this PR. ``` // tested on neoverse-n2 $ llama-batched-bench \ -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \ --no-mmap -fa \ -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \ -npl 1,2,4,8,16,32 \ -t 64 --------------------------------------------------------------------- \| PP \| TG \| B \| S_PP t/s \| S_TG t/s \| \| \| \| \| original \| this pr \| original \| this pr \| \|-------\|--------\|------\|----------\|----------\|----------\|----------\| \| 128 \| 128 \| 1 \| 78.52 \| 109.18 \| 18.63 \| 18.88 \| \| 128 \| 128 \| 2 \| 84.62 \| 123.94 \| 34.54 \| 36.92 \| \| 128 \| 128 \| 4 \| 84.36 \| 122.49 \| 52.65 \| 61.32 \| \| 128 \| 128 \| 8 \| 90.52 \| 138.87 \| 63.46 \| 84.41 \| \| 128 \| 128 \| 16 \| 90.11 \| 138.56 \| 71.04 \| 101.33 \| \| 128 \| 128 \| 32 \| 89.81 \| 137.79 \| 75.14 \| 110.47 \| --------------------------------------------------------------------- ```	2025-05-19 14:58:39 +03:00
Johannes Gäßler	0dda27bc0b	CUDA: fix crash on large batch size for quant. MoE (llama/13537)	2025-05-19 14:58:39 +03:00
Johannes Gäßler	ffa4720f25	CUDA: faster Deepseek FA, add Turing support (llama/13435)	2025-05-19 14:58:39 +03:00
bandoti	9b8eea28b5	cmake: simplify vulkan shader test logic (llama/13263)	2025-05-19 14:58:39 +03:00
Jeff Bolz	162bbe8220	vulkan: KHR_coopmat flash attention (llama/13506) This shader uses coopmat1 to do the QK^T multiply. The PV multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.	2025-05-19 14:58:39 +03:00
Jeff Bolz	a221288dc6	vulkan: workaround FA compile failures on macos (llama/13517)	2025-05-19 14:58:39 +03:00
Georgi Gerganov	08436716ae	metal : use FA-vec kernel up to batch size 20 (llama/13496) * batched-bench : fix pp batch contents * metal : optimize multi-sequence FA vec kernel ggml-ci * metal : use FA-vec kernel up to batch size 20 ggml-ci	2025-05-19 14:58:39 +03:00
Georgi Gerganov	e11fc21e6c	metal : optimize multi-sequence FA vec kernel (llama/13493) * batched-bench : fix pp batch contents * metal : optimize multi-sequence FA vec kernel ggml-ci	2025-05-19 14:58:39 +03:00
Dan Johansson	a77a924b20	ggml-cpu: Update KleidiAI to v1.6 and fix include directives (llama/13509) Signed-off-by: Dan Johansson <dan.johansson@arm.com>	2025-05-19 14:58:39 +03:00
Johannes Gäßler	405b9c77ad	mnist: fix segmentation fault (ggml/1227)	2025-05-19 14:58:39 +03:00
Diego Devesa	9c3bfc1499	ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)	2025-05-19 14:58:39 +03:00
Daniel Tang	5b7797f674	ggml : Fix missing backtrace on Linux (ggml/1228) * Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1 * Fixed lldb attach * Simplify by having the child do ggml_print_backtrace_symbols	2025-05-19 14:58:39 +03:00
Xuan-Son Nguyen	75e9a840c5	ggml : add mrope kernel for metal (llama/13457)	2025-05-13 13:59:21 +03:00
Georgi Gerganov	41ed62bdbc	metal : optimize MoE for large batches (llama/13388)	2025-05-13 13:59:21 +03:00
lhez	029c8837f8	opencl: remove unnecessary assert for `add` (llama/13257)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	5d8b068249	llama/ggml: add LLM training support (llama/10544) * llama/ggml: add LLM training support more compact progress bar llama_save_model_to_file llama_opt_param_filter ggml_graph_dup force_grads refactor ggml_opt, fix test-opt * remove logits_all * refactor CUDA implementation for ACC * reset graph at beginning of opt period	2025-05-13 13:59:21 +03:00
Dan Johansson	93ef22657e	ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (llama/13053) * ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel Signed-off-by: Dan Johansson <dan.johansson@arm.com> * * code review fixes Signed-off-by: Dan Johansson <dan.johansson@arm.com> * * adds a comment that clarifies barrier usage Signed-off-by: Dan Johansson <dan.johansson@arm.com> --------- Signed-off-by: Dan Johansson <dan.johansson@arm.com> Co-authored-by: Charles Xu <charles.xu@arm.com>	2025-05-13 13:59:21 +03:00
Johannes Gäßler	866f685bbc	CUDA: fix misaligned synchronization in FA (llama/13469)	2025-05-13 13:59:21 +03:00
Atharva Dubey	250bcc041a	enable dpcpp nightly builds with libraries (llama/13406)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	90b17a99bf	CUDA: fix crash with partial offloading of MoE (llama/13439)	2025-05-13 13:59:21 +03:00
David Huang	e1b2ace0f8	Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (llama/13386)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	6db0e01db6	CUDA: fix race conditions FlashAttention kernels (llama/13438)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	16f3546f38	CUDA: fix FlashAttention on Turing (llama/13415)	2025-05-13 13:59:21 +03:00
Jeff Bolz	a04b329ad1	vulkan: scalar flash attention implementation (llama/13324) * vulkan: scalar flash attention implementation * vulkan: always use fp32 for scalar flash attention * vulkan: use vector loads in scalar flash attention shader * vulkan: remove PV matrix, helps with register usage * vulkan: reduce register usage in scalar FA, but perf may be slightly worse * vulkan: load each Q value once. optimize O reduction. more tuning * vulkan: support q4_0/q8_0 KV in scalar FA * CI: increase timeout to accommodate newly-supported tests * vulkan: for scalar FA, select between 1 and 8 rows * vulkan: avoid using Float16 capability in scalar FA	2025-05-13 13:59:21 +03:00
Alberto Cabrera Pérez	45d8b2352e	sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (llama/12858) * sycl : Implemented reorder Q4_0 mmvq Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> * sycl : Fixed mmvq being called when reorder is disabled * sycl : Improved comments in the quants header Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> * Use static_assert * safe_div -> ceil_div * Clarify qi comment * change the reorder tensor from init to execute OP * dbg * Undo changes to test-backend-ops * Refactor changes on top of q4_0 reorder fix * Missing Reverts * Refactored opt_for_reorder logic to simplify code path * Explicit inlining and unroll * Renamed mul_mat_algo enum for consistency --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> Co-authored-by: romain.biessy <romain.biessy@codeplay.com>	2025-05-13 13:59:21 +03:00
Johannes Gäßler	2d436bfbfb	CUDA: FA support for Deepseek (Ampere or newer) (llama/13306) * CUDA: FA support for Deepseek (Ampere or newer) * do loop unrolling via C++ template	2025-05-13 13:59:21 +03:00
Johannes Gäßler	4b7cbb62ef	CUDA: fix crash on large batch size for MoE models (llama/13384)	2025-05-13 13:59:21 +03:00
Radoslav Gerganov	e27c91f6d6	rpc : add rpc_msg_set_tensor_hash_req (llama/13353) * rpc : add rpc_msg_set_tensor_hash_req Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which makes the code cleaner. * fix	2025-05-13 13:59:21 +03:00
Jeff Bolz	e46df4850f	vulkan: Allow up to 4096 elements for mul_mat_id row_ids (llama/13326) This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf: GGML_ASSERT(nei0 * nei1 <= 3072); The tensor is 8 x 512. Increase this array size to accommodate.	2025-05-13 13:59:21 +03:00
Alberto Cabrera Pérez	e8a7f1b7bb	sycl: addressing non-contiguous src1 mul_mats (nc and batched) (llama/13343) * sycl: fixed non-contiguous src1 mul_mats (nc and batched) * Fixed wrong static_cast inside kernel	2025-05-13 13:59:21 +03:00
R0CKSTAR	09e6b66025	cuda : remove nrows_x in mul_mat_q_process_tile (llama/13325) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-05-07 21:00:32 +03:00
Johannes Gäßler	d41cf26a0f	CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (llama/13135)	2025-05-07 21:00:32 +03:00
Akarshan Biswas	3c67195be9	SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (llama/13254) * SYCL: Do not set tensor extras when reorder optimize is disabled * SYCL: Disable reorder optimize by default	2025-05-07 21:00:32 +03:00
Johannes Gäßler	f9f78a773f	CUDA: fix bad asserts for partial offload (llama/13337)	2025-05-07 21:00:32 +03:00
Johannes Gäßler	be55e25cac	CUDA: fix --split-mode row for MMQ (llama/13323)	2025-05-07 21:00:32 +03:00
Johannes Gäßler	2ffdda99e8	CUDA: fix logic for clearing padding with -ngl 0 (llama/13320)	2025-05-07 21:00:32 +03:00
Akarshan Biswas	9bbedc51cc	SYCL: Disable mul_mat kernels for noncontiguous tensor b (llama/13308) ggml-ci	2025-05-07 21:00:32 +03:00
Diego Devesa	1e1fa27add	rpc : use backend registry, support dl backends (llama/13304)	2025-05-07 21:00:32 +03:00
Aaron Teo	e1bdd148c5	ggml : activate s390x simd for Q3_K (llama/13301) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-05-07 21:00:32 +03:00
Johannes Gäßler	7fa8bb303f	CUDA: fix race condition in MMQ stream-k fixup (llama/13299)	2025-05-07 21:00:32 +03:00
Johannes Gäßler	7564f5e6f1	CUDA: fix race condition in MMQ ids_dst (llama/13294)	2025-05-07 21:00:32 +03:00
Jeff Bolz	22ba2e27ce	vulkan: Additional type support for unary, binary, and copy (llama/13266) Support f16->f32 copy. Support f16->f16 and f32->f32 unary ops. Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.	2025-05-07 21:00:32 +03:00
Georgi Gerganov	5eac2a3fbb	vulkan : fix lint (llama/0)	2025-05-07 15:39:32 +03:00
shalinib-ibm	42938398f9	ggml : Enable MMA for BF16 in llamafile_sgemm (llama/13148) This patch upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for BF16 data type. This change results in 9x - 40x gains in total speed S t/s (i.e. all tokens/total time), across various batch sizes tested using llama-batched-bench benchmark. The patch is tested with Meta-Llama-3-8B, and Mistral-7B models (BF16 models generated by using llama-quantize from corresponding FP32 models) on an IBM POWER10 machine. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-05-07 15:39:32 +03:00
Justin Santa Barbara	a8fe90ae15	rpc : avoid uninitialized memory in serialize_tensor (llama/13210) Zero out the name and padding buffers.	2025-05-07 15:39:32 +03:00
Jesse Gross	c5a5a2da5b	ggml: Don't assert fail when tensor data changes (llama/13222) The following scenario will cause an assertion failure in the graph allocator: - Build and allocate a graph containing a tensor with a non-NULL data pointer - Build and allocate a new graph where that data is NULL Result: ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed This happens during revalidation because we think that memory should have been previously allocated based on the current graph but in reality the previous graph was different. In this situation, we should do a full reallocation pass.	2025-05-07 15:39:32 +03:00
Diego Devesa	8316bfd82b	build : fix build info on windows (llama/13239) * build : fix build info on windows * fix cuda host compiler msg	2025-05-07 15:39:32 +03:00
Jeff Bolz	fd1cb9fc12	vulkan: Add bfloat16 support (llama/12554) * vulkan: Add bfloat16 support This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension. It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that. The coopmat support also requires a glslc that supports the extension, which currently requires a custom build. * vulkan: Support bf16 tensors without the bf16 extension or coopmat support Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available. * vulkan: bfloat16 fixes (really works without bfloat16 support now) * vulkan: fix spirv-val failure and reenable -O	2025-05-07 15:39:32 +03:00
Jeff Bolz	17f6b8225e	vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (llama/13191) * vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader	2025-05-07 15:39:32 +03:00
Acly	6374ea32ca	vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204) * vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW) * review: remove src_x/y < 0 checks; add performance tests	2025-05-07 15:39:32 +03:00
Daniel Bevenius	09846f4e12	whisper: remove MSVC warnings pragmas (#3090) * ggml : remove MSVC warnings pragmas This commit removes the MSVC-specific pragmas as these are now handled in CMakeLists.txt. * whisper : remove MSVC warning pragmas This commit removes the MSVC-specific pragmas. These are now handled in the CMakeLists.txt file.	2025-05-05 13:09:35 +02:00
Jared Tweed	9f540ad8cb	cmake : removed stdc++fs (#3097 ) * removed stdc++fs * kept line, but removed stdc++fs	2025-05-02 12:41:35 +03:00
Johannes Gäßler	d052e64d42	CUDA: batched+noncont MMQ, refactor bs>1 MoE code (llama/13199)	2025-05-01 13:29:02 +03:00
Jeff Bolz	780750a108	vulkan: use uint array index to avoid glslang bug (llama/13193)	2025-05-01 13:29:02 +03:00
shalinib-ibm	919c78e618	ggml : fix ppc64le build (llama/13176) Build fails with compilation error on power pc. This patch fixes the same. Tested with unit tests run via --build <build_dir> && cd <build_dir> && make test Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-05-01 13:29:02 +03:00
Aaron Teo	dc288f84cd	feat(ggml-cpu): enable z17 compile (llama/13182) z17 compilation requires GCC 15.1.0 and onwards Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-05-01 13:29:02 +03:00
Johannes Gäßler	1543a3600c	CUDA: fix non-cont. inputs for batched mat mul (llama/13155)	2025-05-01 13:29:02 +03:00
Ville Vesilehto	4872355f6e	fix(rpc): Improve input validation and error handling (llama/13069) * fix(rpc): Improve input validation and error handling The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process. This PR introduces robust input validation and replaces `abort()` calls with graceful error handling: - Type Validation: `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range before calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type. - Bounds Checks: Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds. - Size Checks: Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow. - Error Propagation: - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access. - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail. - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status. These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully. 
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): address pr comments removed comments and unnecessary returns Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): ambiguous nullptr from create_node rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr: - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links. - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): initial zero check in create_node The caller (`graph_compute`) already checks `id != 0` when handling a `nullptr` return from `create_node`, correctly distinguishing intentional null links from actual errors. This makes the initial `if (id == 0)` check redundant. Also removes the log message when a tensor ID is not found in the provided map which was added in this branch. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * fix(rpc): Handle get_alloc_size failure in server Check the return value of `server.get_alloc_size` in the RPC server loop. If the call fails, return early to close the connection. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): input size validation in graph_compute Removes detailed, step-by-step size calculations and overflow checks in favor of simpler direct comparisons, assuming 64-bit overflow is unlikely. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove extra status code setting Removes the explicit setting of `response.result = GGML_STATUS_FAILED` when `create_node` returns `nullptr` within `graph_compute`. 
Primary signal is the `false` return value in case of failure. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> * refactor(rpc): remove redundant check for tensor->type Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus the check is not needed. Signed-off-by: Ville Vesilehto <ville@vesilehto.fi> --------- Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>	2025-05-01 13:29:02 +03:00
Akarshan Biswas	1a76e97c28	SYCL: Add all missing unary kernels (llama/13074) * SYCL: Add all missing unary kernels ggml-ci * decouple kernel launch range from data size using strided loop * use ceil_div helper for num_blocks ggml-ci * clean auto imported header files	2025-05-01 13:29:02 +03:00
R0CKSTAR	7017c1d37d	musa: fix typo in cc control (llama/13144) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-05-01 13:29:02 +03:00
Johannes Gäßler	670bf02662	CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (llama/13137)	2025-05-01 13:29:02 +03:00
R0CKSTAR	9fff2f751c	musa: fix build warning (llama/13129) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-05-01 13:29:02 +03:00
SXX	46392f733f	ggml: move fp16/bf16 conversion optimizations to CPU backend + export conversion APIs (llama/13107) * ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion * move fp converter to ggml-cpu * Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32	2025-05-01 13:29:02 +03:00
Neo Zhang Jianyu	eeb259909e	change the reorder tensor from init to execute OP (llama/13003)	2025-05-01 13:29:02 +03:00
Radoslav Gerganov	fe21ddf0dc	rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (llama/12943) RPC_CMD_SET_TENSOR always returns an empty response and we send this 4 times per token. We can improve TG speed if we don't wait for this empty response. The performance impact of this change depends on the network latency.	2025-05-01 13:29:02 +03:00
Diego Devesa	33bdbfbb33	ggml : fix ggml_gallocr_ptr type (ggml/1205)	2025-05-01 13:29:02 +03:00
Daniel Bevenius	0f49edf0f3	whisper : add check that target name exists (#3103) This commit adds a check to make sure that the target exists before trying to add compile options to ignore warnings when using MSVC. The motivation for this is currently the build is broken depending on the cmake options provided. With this fix it should be possible to build even if the targets are not actually available. Refs: https://github.com/ggml-org/whisper.cpp/pull/3090#issuecomment-2842760104	2025-05-01 10:05:24 +02:00
Daniel Bevenius	55d73a13f5	ggml : suppress Windows compiler warnings (#3075) * whisper: suppress Windows compiler warnings This commit disables compiler warnings on Windows using MSVC. The motivation for these changes is that some compilers generate warnings for these conversions, for example Windows MSVC, and there are quite a few of them. This makes it a little difficult to spot new warnings that may be introduced and also can be difficult for users/embedders of ggml where these warnings are hard to separate from their own warnings. * squash! whisper: suppress Windows compiler warnings Move ggml related warnings into ggml. This commit also fixes the indentation and adds a missing whitespace to the if statement.	2025-04-29 15:47:55 +02:00
Georgi Gerganov	6c0d843f9d	cuda : fix unused variable compile warning (#0) ggml-ci	2025-04-24 20:39:16 +03:00
Georgi Gerganov	337becefb9	opencl : remove obsolete files (skip) (ggml/1200)	2025-04-24 20:39:16 +03:00
lhez	88c3cecd43	opencl: split ggml-opencl.cl into multiple files and cleanup (llama/12886) --------- Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>	2025-04-24 20:39:16 +03:00
Georgi Gerganov	fe4acb33e3	ggml : fix trailing whitespaces (llama/0)	2025-04-24 20:39:16 +03:00
Johannes Gäßler	fd5a3e1bc6	CUDA: use switch statements in constexpr functions (llama/13095)	2025-04-24 20:39:16 +03:00
Georgi Gerganov	01e1600edd	metal : fix floating-point range of attention scores in FA kernels (llama/13090) ggml-ci	2025-04-24 20:39:16 +03:00
Eve	cf3eb291ab	vulkan: matmul gcn tuning (llama/13016) * tune matmul for gcn * this one is more power efficient * Update ggml/src/ggml-vulkan/ggml-vulkan.cpp Co-authored-by: 0cc4m <picard12@live.de> * disable this tune for the proprietary driver --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-04-24 20:39:16 +03:00
Johannes Gäßler	3d54b68ea7	CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (llama/13014) * CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID * fix logic for RoPE support, CUDA graphs	2025-04-24 20:39:16 +03:00
Diego Devesa	11218294db	ggml : add SSE 4.2 and x64 base variant for CPUs without AVX (llama/12871) * ggml : add SSE 4.2 variant for CPUs without AVX * ggml : add x64 base ABI variant	2025-04-24 20:39:16 +03:00
Akarshan Biswas	33c89ade7d	SYCL: Add non-contiguous support in ROPE (llama/12993) ggml-ci	2025-04-24 20:39:16 +03:00
Jeff Bolz	27a56e7243	vulkan: support noncontiguous rms_norm (llama/13031)	2025-04-24 20:39:16 +03:00
Jeffrey Morgan	f4ca3e2f9c	metal: add neg operator (llama/13029)	2025-04-24 20:39:16 +03:00
Akarshan Biswas	0287a5c51b	SYCL: Refactor and enable FP16 in binary broadcast OPs (llama/12975) * SYCL: refactor move to a separate file * Fix binbcast * Remove duplicates * fix include formatting * fix typo	2025-04-24 20:39:16 +03:00
Radoslav Gerganov	24d29c55df	rpc : add RPC_CMD_HELLO (llama/12955) Add RPC_CMD_HELLO for getting the version of the protocol implemented by the server. Follow the semantic versioning rules at https://semver.org Hopefully this brings a better user experience when we make breaking changes at the protocol level and avoids issues like #12465	2025-04-24 20:39:16 +03:00
Georgi Gerganov	36019c35a3	graph : make FA compatible with MLA + add initial Metal kernels (llama/12953) * graph : make mla compatible with FA * metal : add exp FA kernels for DeepSeek models ggml-ci * llama : minor naming updates ggml-ci * ggml : disable FA for DS head sizes * tests : add FA tests for MLA shapes ggml-ci	2025-04-24 20:39:16 +03:00
Alan Gray	4e936e2afa	ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (llama/12970)	2025-04-24 20:39:16 +03:00
hipudding	314ce5981e	CANN: Add support for async operator submission (llama/12864) Submit operators using asynchronous threads to improve performance. Use the environment variable GGML_CANN_ASYNC_MODE to control whether asynchronous submission is enabled. It is disabled by default. Testing shows a 10%–20% performance improvement in scenarios with small parameter sizes, especially in quantized models.	2025-04-24 20:39:16 +03:00
kimminsu	cb7642b0f5	opencl: fix incorrect local_size index in profiling log (llama/12868)	2025-04-24 20:39:16 +03:00
Jeff Bolz	7db8f278f0	vulkan: enable coopmat2 FA gqa and split_k optimizations more often (llama/12931) The grouped query attention optimization doesn't require a power of two ratio, the only thing relying on it was the modulo operation written as bitwise &. split_k need not depend on gqa_ratio - enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coord, and multiple workgroups in the X dimension (pre-split) indicates a larger FA operation that wouldn't need splitting.	2025-04-24 20:39:16 +03:00
Chenguang Li	be42a19eab	CANN: Add 310P operator support check (llama/12962)	2025-04-24 20:39:16 +03:00
Georgi Gerganov	b8755670ca	metal : add FA-vec kernels for head size 96 (llama/12952) ggml-ci	2025-04-24 20:39:16 +03:00
hipudding	483eecae62	CANN: Add x86 build ci (llama/12950) * CANN: Add x86 build ci * CANN: fix code format	2025-04-24 20:39:16 +03:00
David Huang	43e3d25d93	CUDA/HIP: Share the same unified memory allocation logic. (llama/12934) Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.	2025-04-24 20:39:16 +03:00
Akarshan Biswas	e1dbf9a42e	SYCL: Add ROPE vision kernel (llama/12887) * SYCL: Add ROPE vision kernel * Add comment about rope mode	2025-04-24 20:39:16 +03:00
Srihari-mcw	ee0013865d	ggml : Add AVX512 implementation of GEMM - Q4_Kx8 (llama/12829) * Add AVX512 implementation of GEMM - q4kx8 * Update changes to remove unnecessary whitespaces	2025-04-24 20:39:16 +03:00
Chenguang Li	32a407166b	CANN: Opt ROPE optimization (llama/12865) * [CANN]Opt ROPE optimization * [CANN]Codestyle adjustment * [CANN]Fix the ROPE precision issue * [CANN]codestyle fix * [CANN]add rope unsupport case Signed-off-by: noemotiovon <noemotiovon@gmail.com>	2025-04-24 20:39:16 +03:00
Xinpeng Dou	622f981853	CANN: Optimize CANN buffer pool memory management (llama/12875) Multiple optional memory pools are provided for CANN, including VMM, priority queue-based, and traditional memory pools. 1. When the memory pool is available and GGML_CANN_DISABLE_VMM_POOL is not defined, the VMM pool is selected by default. 2. Otherwise, if GGML_CANN_ENABLE_BUF_PRIO_POOL is defined, the priority queue-based memory pool is used. 3. If neither condition is met, the default memory pool is used.	2025-04-24 20:39:16 +03:00
Akarshan Biswas	d049d67065	SYCL: Fix im2col (llama/12910) * SYCL: Fix im2col * restore local workgroup size adjustments for large inputs * restore format	2025-04-24 20:39:16 +03:00
Radoslav Gerganov	877308838e	rpc : use ggml_context_ptr (llama/12938)	2025-04-24 20:39:16 +03:00
Acly	d87dfcf7c0	ggml : Depthwise 2D convolution (ggml/1152) * ggml-cpu : kernels for faster depthwise 2D convolution * fix compile: remove static after moving to ops.cpp * add dilation for depthwise_conv_2d * review: rename to ggml_conv_2d_dw_direct, remove redundant struct keywords, pass by ref, whitespace * review: rename depthwise_conv_2d -> conv_2d_dw everywhere	2025-04-24 20:39:16 +03:00
SXX	915c14ef10	ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register (llama/12773) * ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register * simplifies the codebase by removing redundant functions	2025-04-24 20:39:16 +03:00
Alan Gray	5d33d3c929	ggml: disable CUDA graphs for unsupported DUP and CONT node types (llama/12891) Fixes #12798	2025-04-24 20:39:16 +03:00
Jeff Bolz	751e42b21e	vulkan: use aligned loads for flash attention mask (llama/12853) Rewrite the stride logic for the mask tensor in the FA shader to force the stride to be aligned, to allow using more efficient loads.	2025-04-24 20:39:16 +03:00
Ewan Crawford	e8ee32d12d	sycl: Support sycl_ext_oneapi_limited_graph (llama/12873) The current usage of the SYCL-Graph extension checks for the `sycl_ext_oneapi_graph` device aspect. However, it is also possible to support `sycl_ext_oneapi_limited_graph` devices that don't support update	2025-04-24 20:39:16 +03:00
Akarshan Biswas	e9ce285135	SYCL: Add fp16 type support to unary op kernels (llama/12788) * SYCL: Add fp16 support to some elementwise OP kernels * remove comment ggml-ci * Use static_cast directly * remove not needed cast from tanh * Use static cast and remove unneeded castings * Adjust device_support_op for unary OPs * Use cast_data and typed_data struct to deduplicate casting code	2025-04-24 20:39:16 +03:00
Aaron Teo	b942f451b6	ggml: fix compilation error s390x (llama/12848) * ggml: fixes #12846 compilation error Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com> * ggml: add documentation for code change Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com> * ggml: refactor to type-cast and update documentation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com> * ggml: update documentation to provide full issue link Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com> --------- Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@ibm.com>	2025-04-24 20:39:16 +03:00
cmdr2	e6410faf99	cpu: fix cpu backend's supports-op for GET_ROWS_BACK. fixes a fatal when running test-backend-ops with only the CPU backend (ggml/1190)	2025-04-24 20:39:16 +03:00
Chenguang Li	182df69384	CANN: Support more ops (llama/12841) * [CANN]Support Opt LOG && MEAN && PAD_REFLECT_1D * [CANN]Support COUNT_EQUAL && STEP && SGN * [CANN]codestyle adjustment * [CANN]codestyle adjustment --------- Signed-off-by: noemotiovon <noemotiovon@gmail.com>	2025-04-24 20:39:16 +03:00
Prajwal B Mehendarkar	3bf9691dfd	Fixes #12823 (llama/12830) * Including limits file on AIX * Fixes #12823	2025-04-24 20:39:16 +03:00
Piotr Kubaj	ba444e9c23	ggml-cpu-impl.h: do not redefine bool on POWER9 (llama/12856) error: unknown type name '_Bool'	2025-04-24 20:39:16 +03:00
Piotr Kubaj	c6caf8eef2	ggml-impl.h: fix build on POWER9 (llama/12855) error: ISO C++17 does not allow 'register' storage class specifier	2025-04-24 20:39:16 +03:00
Chenguang Li	6cae79a1d7	CANN: Support Opt CONV_TRANSPOSE_1D and ELU (llama/12786) * [CANN] Support ELU and CONV_TRANSPOSE_1D * [CANN]Modification review comments * [CANN]Modification review comments * [CANN]name adjustment * [CANN]remove lambda used in template * [CANN]Use std::func instead of template * [CANN]Modify the code according to the review comments --------- Signed-off-by: noemotiovon <noemotiovon@gmail.com>	2025-04-24 20:39:16 +03:00
Jeff Bolz	b9bfe0c693	vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (llama/12833) q4_k and q5_k had a lot of redundant global loads where the same 16B of scale information is repeatedly loaded and decoded during each loop iteration. This change restructures the loops to more explicitly iterate over whole blocks in the outer loop (with unrolled inner loop) and to copy/decode the scale data into shared memory once at the start of each outer loop. The copy is pipelined so the scale load from global memory is relatively cheap. This improves q4_k/q5_k model prompt processing performance by around 5-7%. I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k and hurt for q4_0. The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped variants isn't used as often as it originally was (e.g. due to the padded_N change), so I trimmed it down to offset some of the new complexity of the semi-manual loop unrolling.	2025-04-24 20:39:16 +03:00
Jeff Bolz	1d50c6ac22	vulkan: Use fp16 for the flash attention P*V multiplication (llama/12783) This is consistent with the ggml-cuda behavior and the mul_mat fallback.	2025-04-24 20:39:16 +03:00
Sigbjørn Skjæret	79f23d9132	cuda : add f32 to bf16 copy op (llama/12806) This allows BF16 KV-cache on CUDA.	2025-04-24 20:39:16 +03:00
Georgi Gerganov	ee2cbeeb74	llama : fix FA when KV cache is not used (i.e. embeddings) (llama/12825) * ggml : FA supports F32 V * graph : cast KV to F16 when the KV cache is not used ggml-ci * server : add test that exercises embeddings with FA enabled ggml-ci	2025-04-24 20:39:16 +03:00
cmdr2	868a5ce310	ggml: don't include arm_neon.h when using CUDA 12 with ARM Neon (ggml/1187) fix #1186	2025-04-24 20:39:16 +03:00
Diego Devesa	b9c71fae5a	ggml : add bilinear upscale support (ggml/1185)	2025-04-24 20:39:16 +03:00
Diego Devesa	6d67c6d93d	ggml : add more generic custom op, remove deprecated custom ops (ggml/1183) * ggml : add more generic ggml_custom op * ggml : remove deprecated custom ops	2025-04-24 20:39:16 +03:00
Neo Zhang Jianyu	12cade118e	Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (llama/12812) * Revert "sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_s…" This reverts commit 518a01480eb3a7c80a4951b430db9dee55428310. * Update ggml/src/ggml-sycl/ggml-sycl.cpp * Update ggml/src/ggml-sycl/ggml-sycl.cpp * rm tail space	2025-04-24 20:39:16 +03:00
lhez	fd1c725e65	opencl: better identify Adreno GPU (llama/12760)	2025-04-24 20:39:16 +03:00
Georgi Gerganov	d33fd00cfe	cuda : fix HIP and MUSA BF16 (llama/0) ggml-ci	2025-04-24 20:39:16 +03:00
zhouwg	3e0d89782a	sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (llama/12734)	2025-04-24 20:39:16 +03:00
zhouwg	7074b622eb	CANN: fix typo in ggml-cann (llama/12733)	2025-04-24 20:39:16 +03:00
hipudding	b8d3e45342	CANN: Refactor to reduce duplicate code (llama/12731) * CANN: Refactor to reduce duplicate code * CANN: fix review comment	2025-04-24 20:39:16 +03:00
R0CKSTAR	1901505138	musa: fix compilation warnings in mp_22/31 (llama/12780) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-04-24 20:39:16 +03:00
Jeff Bolz	3c26dd3353	vulkan: fix NaN issue in flash attention shader (llama/12776) Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.	2025-04-24 20:39:16 +03:00
Jeff Bolz	d792d2a2dc	vulkan: Use unclamped loads for flash attention mask (llama/12720) nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.	2025-04-24 20:39:16 +03:00
0cc4m	8add58aa5e	Vulkan: Tune Vulkan mmq int dot shader for performance (llama/12767)	2025-04-24 20:39:16 +03:00
Nicolò Scipione	8f8ede1b12	sycl: allow ggml-sycl configuration and compilation using Visual Studio project/solution (llama/12625)	2025-04-24 20:39:16 +03:00
Ronny Brendel	3a6fe8d767	cmake: fix ggml-shaders-gen compiler paths containing spaces (llama/12747) fixes error for compiler paths with spaces	2025-04-24 20:39:16 +03:00
Jeff Bolz	76231bda56	vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (llama/12630) There seems to be a bubble waking up from waitForFences, which costs a few percent performance and also increased variance in performance. This change inserts an "almost_ready" fence when the graph is about 80% complete and we waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting for the final fence to be signaled.	2025-04-24 20:39:16 +03:00
Jeff Bolz	785437c253	vulkan: set cmake minimum and project name in vulkan-shaders (llama/12744)	2025-04-24 20:39:16 +03:00
Gaurav Garg	2f0612cb1c	CUDA: Prefer vector flash decoding kernel for Gemma models (llama/12738) * Prefer vector flash decoding kernel for Gemma models Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category. Removing this limit improves e2e performance by up to 12% in gen phase throughput for Gemma models. * Update ggml/src/ggml-cuda/fattn.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-04-24 20:39:16 +03:00
Jeff Bolz	e944065d5b	vulkan: Fix missing cmake logic for dot product extension (llama/12721)	2025-04-24 20:39:16 +03:00
a3sh	ccc7b5df0b	fix MUSA compiler warning (llama/12704) * fix MUSA compiler warning * replace (void) with GGML_UNUSED	2025-04-24 20:39:16 +03:00
Chenguang Li	fbed36851e	CANN: Support operator SIN COS ARGMAX (llama/12709) * [CANN]support sin cos argmax Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]codestyle adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]Remove redundant code Signed-off-by: noemotiovon <noemotiovon@gmail.com> --------- Signed-off-by: noemotiovon <noemotiovon@gmail.com> Co-authored-by: noemotiovon <noemotiovon@gmail.com>	2025-04-24 20:39:16 +03:00
Alan Gray	d1d847f184	Simplify and improve CUDA graphs through use of indirect copy pointers (llama/9017) * CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers Previously there was complexity in the CUDA graphs implementation due to frequently changing parameters to copy kernels associated with K and V cache pointers. This patch simplifies by using indirection to avoid such parameters frequently changing, avoiding the need for frequent graph updates. Fixes #12152 * Addressed comments * fix HIP builds * properly sync to stream * removed ggml_cuda_cpy_fn_ptrs * move stream sync before free * guard to only use indirection with graphs * style fixes * check for errors --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-04-24 20:39:16 +03:00
hipudding	337f91d4a6	CANN: Fix failed test cases (llama/12708) * CANN: Fix memory waste in aclnn_tensor * CANN: fix backend ops fail * CANN: fix acl_tensor memory alloc. * CANN: format * CANN: remove trailing whitespace	2025-04-24 20:39:16 +03:00
lhez	317a0031f9	opencl: use `max_alloc_size` in backend ctx instead of querying again (llama/12705)	2025-04-24 20:39:16 +03:00
Jeff Bolz	b243416918	vulkan: Implement split_k for coopmat2 flash attention. (llama/12627) When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.	2025-04-24 20:39:16 +03:00
bandoti	6e532c7187	cmake: remove caching from vulkan coopmat checks (llama/12719)	2025-04-24 20:39:16 +03:00
Jeff Bolz	2105b110d3	vulkan: Implement grouped query attention in the coopmat2 FA shader (llama/12559) When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when: dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1)) previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each. This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.	2025-04-24 20:39:16 +03:00
0cc4m	f82622180f	Vulkan: Fix mmq int dot float cache size (llama/12722)	2025-04-24 20:39:16 +03:00
Diego Devesa	a71c64512a	llama : add option to override model tensor buffers (llama/11397) * llama : add option to override tensor buffers * ggml : fix possible underflow in ggml_nbytes	2025-04-24 20:39:16 +03:00
Georgi Gerganov	1e9c2f87f1	ggml : simplify Arm fp16 CPU logic (ggml/1177) * ggml : simplify Arm fp16 CPU logic ggml-ci * cont : bring back CUDA/MUSA checks ggml-ci	2025-04-24 20:39:16 +03:00
Sigbjørn Skjæret	06ce8f83e6	CUDA: don't convert BF16 weights to FP32 (ggml/1174) * add bf16 support * use convert_from_bf16_cuda instead of convert_unary_cuda for f32 * revert 7ec5085 * move functionality into convert_unary with constexpr	2025-04-24 20:39:16 +03:00
cmdr2	513ecf8dc0	cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167) * cpu: refactor SIMD mappings and vectorized op functions into separate files * Fix warning for ggml_float to float * Fix warnings * cpu: move all the operations (except mul_mat) to a separate c++ file * fix whitespace * Update ggml/src/ggml-cpu/vec.h Co-authored-by: Diego Devesa <slarengh@gmail.com> * Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp * Reverse the order of import for ops.h and vec.h, to match what was present in ggml-cpu.c previously --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-04-03 10:30:16 +03:00
Chenguang Li	d7a9346ab1	get_rows and dup optimization (llama/12671) * [CANN]get_rows and dup optimization. Co-authored-by: hipudding <huafengchun@gmail.com> Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]GET_ROWS and CPY/DUP optimization Co-authored-by: hipudding <huafengchun@gmail.com> Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> * [CANN]code style adjustment Signed-off-by: noemotiovon <noemotiovon@gmail.com> --------- Signed-off-by: noemotiovon <noemotiovon@gmail.com> Co-authored-by: noemotiovon <noemotiovon@gmail.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-04-02 15:51:57 +03:00
Junil Kim	b63d23f728	opencl : fix memory allocation size (llama/12649) issue: https://github.com/CodeLinaro/llama.cpp/pull/17#issuecomment-2760611283 This patch fixes the memory allocation size not exceeding the maximum size of the OpenCL device.	2025-04-02 15:51:57 +03:00
Georgi Gerganov	f6ce10e4a1	metal : use F32 prec in FA kernels (llama/12688) * metal : use F32 prec in FA kernels ggml-ci * cont : fix FA vec kernel ggml-ci	2025-04-02 15:51:57 +03:00
R0CKSTAR	6cb2b86581	Fix clang warning in gguf_check_reserved_keys (llama/12686) * Fix clang warning in gguf_check_reserved_keys Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix typo Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-04-02 15:51:57 +03:00
Wagner Bruna	801d6bd809	vulkan: fix build when glslc doesn't support coopmat (llama/12683)	2025-04-02 15:51:57 +03:00
Romain Biessy	ddf7e6a15d	SYCL: Rename oneMKL to oneMath (llama/12192) * Rename oneMKL Interface to oneMath * Use oneMath for Intel vendor * Rename occurrences to mkl * clang-format * Silence verbose warnings * Set oneMath HIP_TARGETS * Fix silence warnings * Remove step to build oneMath from build instructions * Use fixed oneMath version * Remove INTEL_CPU * Fold CMake oneDNN conditions * Use Intel oneMKL for Intel devices * Improve CMake message * Link against MKL::MKL_SYCL::BLAS only * Move oneMath documentation to Nvidia and AMD sections	2025-04-02 15:51:57 +03:00
Akarshan Biswas	0d42097fd3	SYCL: switch to SYCL namespace (llama/12674)	2025-04-02 15:51:57 +03:00
a3sh	842b9c984c	ggml : faster ssm scan (llama/10558) * faster ssm_scan * delete unused comment * clang format * add space * modify unnecessary calculations * faster ssm conv implementation * modify file name with dash	2025-04-02 15:51:57 +03:00
0cc4m	0810f02547	Vulkan: Add DP4A MMQ and Q8_1 quantization shader (llama/12135) * Vulkan: Add DP4A MMQ and Q8_1 quantization shader * Add q4_0 x q8_1 matrix matrix multiplication support * Vulkan: Add int8 coopmat MMQ support * Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code * Add GL_EXT_integer_dot_product check * Remove ggml changes, fix mmq pipeline picker * Remove ggml changes, restore Intel coopmat behaviour * Fix glsl compile attempt when integer vec dot is not supported * Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq * Remove redundant comment * Fix integer dot check * Fix compile issue with unsupported int dot glslc * Update Windows build Vulkan SDK version	2025-04-02 15:51:57 +03:00
Georgi Gerganov	8c13c78f9d	cmake : fix whitespace (llama/0)	2025-04-02 15:51:57 +03:00
Akarshan Biswas	2e2f0f954b	SYCL: Remove misleading ggml_sycl_op_flatten function (llama/12387) * SYCL: Remove misleading ggml_sycl_op_flatten function * remove trailing whitespace * Fix L2 norm from rebase * remove try catch block from element_wise.cpp * remove comment from common.hpp * ggml-sycl.cpp: Add try catch sycl::exception block in compute_forward * norm.cpp: remove try catch exception block	2025-03-31 14:56:53 +03:00
Georgi Gerganov	93631b2be6	metal : use constexpr in FA kernels + fix typedef (llama/12659) * metal : use constexpr in FA kernels ggml-ci * cont ggml-ci * cont : fix typedef ggml-ci	2025-03-31 14:56:53 +03:00
R0CKSTAR	f9015b585b	musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (llama/12611) * musa: fix all warnings Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: update ci doc (install ccache) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * fix Windows build issue Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-31 14:56:53 +03:00
Jay	1880ffd7ff	cmake : fix ccache conflict (llama/12522) If users already set CMAKE_C_COMPILER_LAUNCHER globally, setting it in cmake again will lead to conflict and compile fail. Signed-off-by: Jay <BusyJay@users.noreply.github.com>	2025-03-31 14:56:53 +03:00
Xuan-Son Nguyen	9173932c78	cpu : rm unused variable (ggml/1166)	2025-03-31 14:56:53 +03:00
cmdr2	94c3f3877f	cpu: de-duplicate some of the operators and refactor (ggml/1144) * cpu: de-duplicate some of the operators and refactor * Fix PR comments * Fix PR comments	2025-03-31 14:56:53 +03:00
Sandro Hanea	00086469fb	cmake: improve Vulkan cooperative matrix support checks (#2966 ) Co-authored-by: Sandro Hanea <me@sandro.rocks>	2025-03-31 13:44:36 +03:00
Georgi Gerganov	27533e7f63	metal : improve FA + improve MoE (llama/12612) * ggml : FA with different K, V head sizes (CPU) ggml-ci * metal : add FA with HS=192 * metal : extend FA to support different K and V head sizes ggml-ci * metal : add FA vector kernels for heads K 192 and V 128 ggml-ci * ggml : restrict op on other backends to equal head sizes ggml-ci * metal : optimize FA-vec kernel ggml-ci * metal : FA remove mq registers * metal : improve MoE mul_mat_id condition ggml-ci * metal : fix comments + remove unnecessary addition ggml-ci * metal : avoid too much shared memory usage with mul_mat_id ggml-ci	2025-03-28 21:47:42 +02:00
Icenowy Zheng	1b81415963	vulkan: fix coopmat shader generation when cross-compiling (llama/12272) * vulkan: fix coopmat shader generation when cross-compiling Previously the status of coopmat{,2} support isn't passed to the vulkan-shaders-gen project building on the host, which leads to build failure because of the cross-compiling code expecting coopmat{,2} shaders that didn't get generated. Fix this by passing the coopmat{,2} support status to vulkan-shaders subproject. Signed-off-by: Icenowy Zheng <uwu@icenowy.me> * Only call coop-mat shaders once * Fix whitespace --------- Signed-off-by: Icenowy Zheng <uwu@icenowy.me> Co-authored-by: bandoti <141645996+bandoti@users.noreply.github.com>	2025-03-28 21:47:42 +02:00
amritahs-ibm	0001ec075f	llamafile : ppc64le GEMV forwarding for FP32. (llama/12594) This patch enables usage of MMA when one of the dimensions of the matrix (i.e. either M or N) is 1. This is useful in case of token generation where N < 2. The concept of 'GEMV Forwarding' is used: when one of the matrices has a single row/column, the elements are broadcast instead of using the packing routine to prepack the matrix elements. This change results in a 5% - 15% improvement in total speed (i.e. all tokens/total time) across various batch sizes. This is in comparison with the corresponding dot product implementation. The patch is tested with FP32 models of Meta-Llama-3-8B, Mistral-7B, Llama-2-7B-chat-hf on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>	2025-03-28 21:47:42 +02:00
Radoslav Gerganov	5bad2e5099	rpc : send hash when tensor data is above some fixed threshold (llama/12496) * rpc : send hash when tensor data is above some fixed threshold ref #10095 * rpc : put cache under $HOME/.cache/llama.cpp * try to fix win32 build * another try to fix win32 build * remove llama as dependency	2025-03-28 21:47:42 +02:00
lhez	6fc0ae2f5a	opencl: add multi and vision rope, `gelu_quick` and `im2col` (llama/12600) * opencl: add `im2col` * opencl: add `gelu_quick` * opencl: add mrope * opencl: add vision rope	2025-03-28 21:47:42 +02:00
Georgi Gerganov	1fbdfb1d36	files : remove old wkv6 (#0 ) ggml-ci	2025-03-27 11:06:03 +02:00
Georgi Gerganov	8ca67df291	ggml : sync/merge cmake,riscv,powerpc, add common.cmake (ggml/0)	2025-03-27 11:06:03 +02:00
amritahs-ibm	fc6d343e76	llamafile : ppc64le MMA implementation for Q4_0. (llama/12489) This change upstreams llamafile's cpu matrix multiplication kernels for ppc64le ISA using MMA builtins. This patch handles matrix multiplication between quantised datatypes, block_q4_0 and block_q8_0. This change results in a 5% - 50% improvement in total speed (i.e. all tokens/total time) across various batch sizes. The patch is tested with Meta-Llama-3-8B, Mistral-7B, Llama-2-7B-chat-hf models on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>	2025-03-27 11:06:03 +02:00
Akarshan Biswas	3199356d3a	SYCL: implement memset ggml backend buffer interface (llama/12580) * SYCL: implement memset ggml backend buffer interface * use GGML_ABORT macro * Do not wait for all queues to finish for memset operation	2025-03-27 11:06:03 +02:00
Slobodan Josic	e0c43b0bbf	HIP: Add support for RDNA4 targets (llama/12372)	2025-03-27 11:06:03 +02:00
Georgi Gerganov	f4f619ea8e	metal : refactor mat-vec code (llama/12569) * metal : refactor mat-vec code ggml-ci * metal : rename all_sum -> sum_all ggml-ci * metal : fix comments [no ci] * metal : fix nr constant [no ci] * metal : mv q6_K support nr0 > 1 ggml-ci * metal : reduce register pressure ggml-ci * metal : fix typo [no ci] * metal : reduce register pressure ggml-ci	2025-03-27 11:06:03 +02:00
Georgi Gerganov	3c4d363872	ggml : fix MUL_MAT_ID repack with Q8_K (llama/12544) * ggml : fix MUL_MAT_ID repack with Q8_K ggml-ci * ggml : improve repack templates ggml-ci	2025-03-27 11:06:03 +02:00
Dan Johansson	15aa189329	ggml-cpu : update KleidiAI to v1.5.0 (llama/12568) ggml-cpu : bug fix related to KleidiAI LHS packing Signed-off-by: Dan Johansson <dan.johansson@arm.com>	2025-03-27 11:06:03 +02:00
Akarshan Biswas	c53d5c9e85	SYCL: disable Q4_0 reorder optimization (llama/12560) ggml-ci	2025-03-27 11:06:03 +02:00
lhez	ba6f584f30	opencl: simplify kernel embedding logic in cmakefile (llama/12503) Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2025-03-27 11:06:03 +02:00
R0CKSTAR	a219941812	CUDA: Fix clang warnings (llama/12540) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-27 11:06:03 +02:00
Jeff Bolz	a2cc8c2666	vulkan: fix mul_mat_vec failure in backend tests (llama/12529) The OOB calculation could be wrong if the last iteration was during one of the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple new backend tests that hit this failure on NVIDIA GPUs.	2025-03-27 11:06:03 +02:00
Georgi Gerganov	388ed98220	ggml : fix quantized cpy op (llama/12310) * ggml : fix quantized cpy op ggml-ci * tests : add cpy tests for all types ggml-ci * tests : add BF16 copy tests ggml-ci * tests : fix loop for same-type copy ggml-ci * tests : add option to permute the dst tensor ggml-ci	2025-03-27 11:06:03 +02:00
R0CKSTAR	d487a28ae1	musa: refine compute capability (llama/12493) * musa: refine compute capability Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-27 11:06:03 +02:00
Jeff Bolz	cbb88c4050	vulkan: Optimize mul_mat_vec p021 and nc shaders (llama/12505) * tests: add mul_mat perf/functional tests for p021/nc vulkan shaders * vulkan: Optimize mul_mat_vec p021 and nc shaders. These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches). Using subgroupAdd in the p021 shader also helps, use that conditionally.	2025-03-27 11:06:03 +02:00
stduhpf	13455c0b5f	Vulkan: RTE rounding for cpy to quant (llama/12480) * Vulkan: RTE rounding for cpy to quant Co-Authored-By: Jeff Bolz <jbolz@nvidia.com> * remove trailing whitespace * avoid duplicating pipeline_cpy_f32_quant * fix copypasting issue * remove duplicated code --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-03-27 11:06:03 +02:00
Eve	2f77a9e9bd	vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (llama/12472)	2025-03-27 11:06:03 +02:00
蕭澧邦	fa2b5249ff	Fix build on Windows when ccache enabled (ggml/9954) (llama/9976) * [SYCL] Fix build on Windows when ccache enabled (llama/9954) * take effect only on windows and force it to icl --------- Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>	2025-03-27 11:06:03 +02:00
Svetlozar Georgiev	5b854ebba5	sycl: cleanup oneDNN related code (llama/12097)	2025-03-27 11:06:03 +02:00
Srihari-mcw	8058f19d0b	ggml : block interleaving support for Q4_K quantization for x86 AVX2 architecture (llama/12332) * Add block interleaving support for Q4_K quantization * Remove whitespaces and fix CI/CD issues * Update pointer of bsums from int16_t to const int16_t * Add vector version of quantize_q8_K_4x8 function * Update code formatting based on review comments	2025-03-27 11:06:03 +02:00
Gaurav Garg	ae6a9bb9a5	CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183) - Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value. - Prefer vector flash attention kernels over MMA kernel for BS=1 Fixes Issue: #12182 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-03-27 11:06:03 +02:00
Jeff Bolz	24faba9e9b	vulkan: optimize iq1 coopmat2 dequant functions (llama/12427)	2025-03-27 11:06:03 +02:00
Guus Waals	c722ff84d3	Fix visionOS build and add CI (llama/12415) * ci: add visionOS build workflow Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode. * ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs * ci: remove define hacks for u_xxx system types --------- Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>	2025-03-27 11:06:03 +02:00
Jeff Bolz	102af79f63	vulkan: Submit once enough matmul work has been recorded (llama/12406) I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.	2025-03-27 11:06:03 +02:00
lhez	03c364557d	opencl: improve profiling (llama/12442) * opencl: more profiling timing * opencl: generate trace for profiling * opencl: reduce profiling overhead * Populate profiling timing info at the end rather than after each kernel run * opencl: fix for chrome tracing	2025-03-27 11:06:03 +02:00
R0CKSTAR	31b62276cf	musa: override warp_size of musa device to 32 (llama/12445) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-27 11:06:03 +02:00
Łukasz Ślusarczyk	97b5a3055d	SYCL: using graphs is configurable by environment variable and compile option (llama/12371) * alberto changes * enable sycl graphs by env variable * fixed compilation warnings in ggml-sycl.cpp * renamed graph variables * fix markdown in docs/backend/SYCL.md Co-authored-by: Romain Biessy <romain.biessy@codeplay.com> * fix markdown in docs/backend/SYCL.md again * compiling graphs by default, renamed graph_enable to graph_disable --------- Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>	2025-03-27 11:06:03 +02:00
fj-y-saito	9993c3f703	ggml : add SVE support for q6_K_q8_K (llama/12361)	2025-03-27 11:06:03 +02:00
0cc4m	fa72479cfb	Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (llama/12434)	2025-03-27 11:06:03 +02:00
Łukasz Ślusarczyk	6c15539c54	fixed compilation warnings in ggml-sycl (llama/12424)	2025-03-27 11:06:03 +02:00
Molly Sophia	52c4c03b0a	llama: Add support for RWKV v7 architecture (llama/12412) * ggml: Add op l2_norm Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add op rwkv_wkv7 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: Add support for RWKV7 and ARWKV7 models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix inference with RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: add more (a)rwkv7 variants in size Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code-format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * fix MUSA build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix shape error with rwkv using llama-parallel Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-03-27 11:06:03 +02:00
Gaurav Garg	cfc2560e41	cuda : enable CUDA Graph on CUDA Toolkit < 12.x (llama/12394) * Enable CUDA Graph on CTK < 12.x `cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA support in CTK version < 12.x by using older API if CTK < 12.x. * Fix compilation errors with MUSA * Disable CUDA Graph for MUSA	2025-03-27 11:06:03 +02:00
Guus Waals	db6e8056b5	ggml-vulkan: remove unused find_program(glslc) (llama/12416) It's already found by FindVulkan.cmake in the parent CMakeLists	2025-03-27 11:06:03 +02:00
Jeff Bolz	b3f3779c1b	vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (llama/12312)	2025-03-27 11:06:03 +02:00
Daniele	13eeebb1b2	vulkan: subgroup size tuning (llama/12087) * vulkan: subgroup size test * Vulkan: Add device architecture enum and logic to recognize AMD generations * vulkan: use new architecture logic to specify subgroup size * Initial vulkan subgroup size tuning for RDNA3 * vulkan: commonize RDNA subgroup tuning * vulkan: override subgroup size if required_subgroup_size = 0 * vulkan: disable warp 32 for RDNA3 * vulkan: fine tuned RDNA1 subgroup sizes * vulkan: adjusted subgroup size map * vulkan: fixed RDNA2 subgroup map --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-03-27 11:06:03 +02:00
Jeff Bolz	905b834af1	vulkan: use fp32 in coopmat2 q4_k dequant function (llama/12309)	2025-03-27 11:06:03 +02:00
Jeff Bolz	2cd3061a23	vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (llama/12273) * vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking	2025-03-27 11:06:03 +02:00
Jeff Bolz	88d59e21b2	vulkan: Adjust coopmat2 tile sizes and selection heuristic (llama/12258)	2025-03-27 11:06:03 +02:00
Christian Kastner	4917f122d4	cmake : enable building llama.cpp using system libggml (llama/12321) * cmake: Factor out compiler flag function from ggml llama.cpp's build requires it, too, and we may want to make use of it without add_subdirectory(ggml). * cmake: Enable building against system ggml This facilitates package maintenance for Linux distributions, where the libggml library most likely will be shipped as an individual package upon which a llama.cpp package depends.	2025-03-27 11:06:03 +02:00
Akarshan Biswas	16a1b77249	SYCL: set extras only on GGML_TYPE_Q4_0 (llama/12366) * SYCL: set extras only on GGML_TYPE_Q4_0 * release tensor_extras in reset buffer interface	2025-03-27 11:06:03 +02:00
aubreyli	51d1398a0a	SYCL: Delete redundant plus sign and space (llama/12391)	2025-03-27 11:06:03 +02:00
fairydreaming	3499dd83c0	SYCL : support non-contiguous tensors in binary ops (add, sub, etc) (llama/12399) * sycl : support non-contiguous tensors in binary ops * sycl : silence unused variable warning --------- Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2025-03-27 11:06:03 +02:00
Chenguang Li	7b7d9ae35e	MUL_MAT optimization (llama/12382)	2025-03-27 11:06:03 +02:00
Alberto Cabrera Pérez	2dcb7181ff	sycl : variable sg_size support for mmvq kernels (llama/12336)	2025-03-27 11:06:03 +02:00
uvos	96ab3b2465	CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (llama/12315) When fattn-wmma was ported over to warp64, various bits that also touch fattn-vec were converted to selectable warp size. However, the fattn-vec kernels don't work with 64-wide warps for now, so we need to avoid launching them with parameters for warp64	2025-03-27 11:06:03 +02:00
Jeff Bolz	08f32992d0	vulkan: fix bug in coopmat1 mul_mat_id (llama/12316) * tests: run mul_mat_id with a larger N * vulkan: fix bug in coopmat1 mul_mat_id	2025-03-27 11:06:03 +02:00
uvos	394fae57c3	CUDA/HIP: refactor mmqv to unify the calculation of nwarps and rows per block between host and device code. (llama/12177) refactor mmqv to unify the calculation of nwarps and rows per block between host and device code. --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-03-27 11:06:03 +02:00
jklincn	0708835301	ggml-backend : fix backend search path (llama/12330) * Fix backend search path * replace .native() with '/' * reverted .native()	2025-03-27 11:06:03 +02:00
BB-fat	774c519433	metal : Cache the Metal library at the device context level (llama/12265)	2025-03-27 11:06:03 +02:00
Eve	776cdceb9e	mat vec double buffer (llama/12188)	2025-03-27 11:06:03 +02:00
R0CKSTAR	03d050481e	musa: support new arch mp_31 and update doc (llama/12296) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-03-27 11:06:03 +02:00
Henry Linjamäki	3d60219622	opencl: use OpenCL C standard supported by the device (llama/12221) This patch nudges the llama.cpp a bit to be supported on PoCL which doesn't support OpenCL C CL2.0. The issue is solved by querying the device for the supported OpenCL C versions and using the highest one available.	2025-03-27 11:06:03 +02:00
Jason C.H	521d72d76e	ggml-backend : make path_str compatible with C++20 (llama/12269)	2025-03-27 11:06:03 +02:00
Daniel Bevenius	9fb9025a40	ggml : skip intermediate .air file when compiling .metallib (llama/12247) This commit updates the compilation of default.metallib to skip the intermediate .air (Apple Intermediate Representation) file. The motivation for this change is to simplify the custom command a little and avoid generating and then removing the .air file.	2025-03-27 11:06:03 +02:00
Christian Kastner	3c2abb01e8	cmake: Enable specifying exact PowerPC CPU architecture (ggml/1138) In the process, guard automatic CPU detection with GGML_NATIVE. https://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html#index-mcpu-10	2025-03-27 11:06:03 +02:00
Christian Kastner	efd9407e22	cmake: Comment out GGML_BIN_DIR for now (ggml/1139) Nothing installs to it yet, so when attempting to use the cmake package, set_and_check() triggers an error if the directory doesn't already exist for other reasons.	2025-03-27 11:06:03 +02:00
Daniel Bevenius	b82ac32a6c	ggml : add logging for native build options/vars (#2935 ) This commit adds debug level logging for the native build options and variables to ggml/CMakeLists.txt. The motivation for this is that it can be useful to see the effective result of `GGML_NATIVE`, `GGML_NATIVE_DEFAULT`, and `INS_ENB` for a cmake build. I've found myself adding similar logging a few times now, so I thought it might be a good idea to add this. Example output, specifying `-DCMAKE_MESSAGE_LOG_LEVEL=DEBUG` when running cmake produces the following output: ```console -- GGML_NATIVE : OFF -- GGML_NATIVE_DEFAULT : OFF -- INS_ENB : OFF ```	2025-03-24 09:53:38 +01:00
Daniel Bevenius	6e8242f7fe	examples : command.wasm updates (#2904 ) This commit updates the command.wasm example by adding a server.py script to make it easy to start a local http server to try out the example, updates the build instructions, and also addresses some of the compiler warnings that were being generated. * emscripten : fix TOTAL_STACK for wasm This commit moves the TOTAL_STACK setting from the compile flags to the linker flags. This is because the TOTAL_STACK setting is a linker setting. The motivation for this change is that currently the following warnings are generated when building: ```console em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'TOTAL_STACK' [-Wunused-command-line-argument] ``` * examples : suppress C++17 deprecation warning for std::codecvt_utf8 This commit suppresses the C++17 deprecation warning for std::codecvt_utf8 similar to what is done in examples/talk-llama/unicode.cpp. 
The motivation for this change is to suppress these warnings: ```console /Users/danbev/work/ai/whisper-work/examples/common.cpp:251:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations] 251 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here 193 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ /Users/danbev/work/ai/whisper-work/examples/common.cpp:251:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations] 251 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here 3145 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ 
/Users/danbev/work/ai/whisper-work/examples/common.cpp:257:31: warning: 'codecvt_utf8<wchar_t>' is deprecated [-Wdeprecated-declarations] 257 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/codecvt:193:28: note: 'codecvt_utf8<wchar_t>' has been explicitly marked deprecated here 193 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 codecvt_utf8 : public __codecvt_utf8<_Elem> { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ /Users/danbev/work/ai/whisper-work/examples/common.cpp:257:10: warning: 'wstring_convert<std::codecvt_utf8<wchar_t>>' is deprecated [-Wdeprecated-declarations] 257 \| std::wstring_convert<std::codecvt_utf8<wchar_t>> converter; \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/locale:3145:28: note: 'wstring_convert<std::codecvt_utf8<wchar_t>>' has been explicitly marked deprecated here 3145 \| class _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX17 wstring_convert { \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:723:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX17' 723 \| # define _LIBCPP_DEPRECATED_IN_CXX17 _LIBCPP_DEPRECATED \| ^ /Users/danbev/work/wasm/emsdk/upstream/emscripten/cache/sysroot/include/c++/v1/__config:688:49: note: expanded from macro '_LIBCPP_DEPRECATED' 688 \| # define _LIBCPP_DEPRECATED __attribute__((__deprecated__)) \| ^ 4 warnings generated. 
``` * ggml : suppress double-promotion warning in GGML_F16x4_REDUCE This commit adds a cast to `ggml_float` in the `GGML_F16x4_REDUCE` macro to suppress a double-promotion warning. Currently the following warning is generated when compiling the command.wasm example: ```console /whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:1592:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion] 1592 \| GGML_F16_VEC_REDUCE(sumf, sum); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /Users/danbev/work/ai/whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE' 932 \| #define GGML_F16_VEC_REDUCE GGML_F16x4_REDUCE \| ^ /Users/danbev/work/ai/whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE' 918 \| res = wasm_f32x4_extract_lane(x[0], 0) + \ \| ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 919 \| wasm_f32x4_extract_lane(x[0], 1) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 920 \| wasm_f32x4_extract_lane(x[0], 2) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~ 921 \| wasm_f32x4_extract_lane(x[0], 3); \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:1640:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion] 1640 \| GGML_F16_VEC_REDUCE(sumf[k], sum[k]); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /Users/danbev/work/ai/whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:932:37: note: expanded from macro 'GGML_F16_VEC_REDUCE' 932 \| #define GGML_F16_VEC_REDUCE GGML_F16x4_REDUCE \| ^ /Users/danbev/work/ai/whisper-work/ggml/src/ggml-cpu/ggml-cpu.c:920:44: note: expanded from macro 'GGML_F16x4_REDUCE' 918 \| res = wasm_f32x4_extract_lane(x[0], 0) + \ \| ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 919 \| wasm_f32x4_extract_lane(x[0], 1) + \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 920 \| wasm_f32x4_extract_lane(x[0], 2) + \ \| 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~ 921 \| wasm_f32x4_extract_lane(x[0], 3); \ \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2 warnings generated. ``` wasm_f32x4_extract_lane returns a 32-bit float and this is what the addition is performed on. But there is an implicit conversion from 32-bit float to 64-bit double when the result is assigned to `res`, which is of type `ggml_float`. My understanding here is that this is intentional and adding a cast to `ggml_float` should suppress the warning. * emscripten : add -Wno-deprecated for emscripten This commit adds -Wno-deprecated to the CMAKE_CXX_FLAGS for emscripten builds. The motivation for this is that currently there are a number of warnings generated like the following: ```console warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] em++: warning: warnings in JS library compilation [-Wjs-compiler] em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument] warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] warning: JS library symbol '$printErr' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] em++: warning: warnings in JS library compilation [-Wjs-compiler] warning: JS library symbol '$print' is deprecated. Please open a bug if you have a continuing need for this symbol [-Wdeprecated] warning: JS library symbol '$printErr' is deprecated. 
Please open a bug if you have a continuing need for this symbol [-Wdeprecated] em++: warning: warnings in JS library compilation [-Wjs-compiler] em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument] em++: warning: linker setting ignored during compilation: 'ENVIRONMENT' [-Wunused-command-line-argument] ``` The downside of this is that we might miss other deprecation warnings in the future so I'm not sure if this is acceptable. But it makes the wasm examples cleaner without the warnings. * examples : fix tautological-compare warning in stb_vorbis.c [no ci] This commit applies a fix to address a tautological-compare warning in stb_vorbis.c. The motivation for this is that currently the following warning is generated when compiling the command.wasm example: ```console /Users/danbev/work/ai/whisper-work/examples/stb_vorbis.c:1404:75: warning: pointer comparison always evaluates to false [-Wtautological-compare] 1404 \| if (f->stream_start + loc >= f->stream_end \|\| f->stream_start + loc < f->stream_start) { \| ^ 1 warning generated. ``` This fix was taken from an open pull request on the stb repository that addresses this issue: https://github.com/nothings/stb/pull/1746 * squash! examples : update command.wasm instructions [no ci] This commit adds a Python script to serve the wasm examples built in the `build-em` directory. Initially I thought that it would be enough to start a simple python server but I did not notice that there was an error in the browser console when I did that: ```console command.js:1 Uncaught (in promise) DataCloneError: Failed to execute 'postMessage' on 'Worker': SharedArrayBuffer transfer requires self.crossOriginIsolated. 
at command.js:1:1206224 at new Promise (<anonymous>) at loadWasmModuleToWorker (command.js:1:1204981) at Array.map (<anonymous>) at Object.loadWasmModuleToAllWorkers (command.js:1:1206428) at command.js:1:1204318 at callRuntimeCallbacks (command.js:1:1202062) at preRun (command.js:1:6136) at run (command.js:1:1294094) at removeRunDependency (command.js:1:7046) ``` We need a few CORS headers to be set, and to hopefully make this easy for users a Python script is added to the examples directory. This should be able to serve all the wasm examples provided they have been built. command.wasm's README.md is updated to reflect this change. * examples : remove unused functions This commit removes the unused functions convert_to_utf8 and convert_to_wstring from examples/common.cpp. * Revert "examples : fix tautological-compare warning in stb_vorbis.c [no ci]" This reverts commit `8e3c47d961`. We should not make this change here and instead when the upstream PR is merged we can sync with it. Refs: https://github.com/ggerganov/whisper.cpp/issues/2784	2025-03-20 07:02:18 +01:00
Georgi Gerganov	4ffb8e3e4d	cmake : fix ggml-config (ggml/0)	2025-03-08 15:13:01 +02:00
Rémy O	eebf6bc0bd	ggml-cpu: faster AVX2 variant for IQ1_M (llama/12216)	2025-03-08 15:13:01 +02:00
BB-fat	dc8f423b40	metal : simplify kernel arguments using a struct (ggml/3229) (llama/12194) * metal : refactor im2col parameters into a struct * metal: Change im2col offset types from int32_t to uint64_t to support larger memory offsets * metal : refactor sum_rows parameters into a struct * metal : refactor soft_max parameters into a struct * metal : refactor diag_mask_inf parameters into a struct * metal : refactor ssm_conv parameters into a struct * metal : refactor ssm_scan parameters into a struct * metal : refactor get_rows parameters into a struct * metal : refactor group_norm parameters into a struct * metal : refactor conv_transpose_1d parameters into a struct * metal : refactor upscale parameters into a struct * metal : refactor pad parameters into a struct * metal : refactor pad_reflect_1d parameters into a struct * metal : refactor arange parameters into a struct * metal : refactor timestep_embedding parameters into a struct * metal : refactor argsort parameters into a struct * metal : refactor leaky_relu parameters into a struct * metal : refactor pool_2d parameters into a struct * metal : fix trailing whitespace --------- Co-authored-by: alexju <alexju@tencent.com>	2025-03-08 15:13:01 +02:00
Daniel Bevenius	548e7052f1	metal : fix default.metallib build (llama/12224) This commit updates the custom command to build the default.metallib file to use the correct path to ../ggml-common.h by using the variable METALLIB_COMMON. The motivation for this change is that currently when building and specifying GGML_METAL_EMBED_LIBRARY=OFF the following error is generated: ```console [ 11%] Linking CXX shared library ../../bin/libggml.dylib [ 11%] Built target ggml make[2]: * No rule to make target `ggml/src/ggml-metal/ggml-common.h', needed by `bin/default.metallib'. Stop. make[1]: * [ggml/src/ggml-metal/CMakeFiles/ggml-metal-lib.dir/all] Error 2 ``` With the above change the build could progress, but there was a follow-on error about not being able to find the ggml-common.h file in ggml-metal.metal, where it was included as a relative path: ```console [ 11%] Compiling Metal kernels /Users/danbev/work/llama.cpp/build/bin/ggml-metal.metal:6:10: error: '../ggml-common.h' file not found, did you mean 'ggml-common.h'? ^~~~~~~~~~~~~~~~~~ "ggml-common.h" 1 error generated. ``` Removing the relative path then allowed the build to complete successfully.	2025-03-08 15:13:01 +02:00
lhez	a34cb73dc2	opencl: Noncontiguous `norm`, `rms_norm`, disable `fp16` for some ops (llama/12217) * opencl: support noncontiguous `norm` * opencl: support noncontiguous `rms_norm` * opencl: disable fp16 for `ADD`, `MUL`, `SCALE`, `RELU`, `GELU`, `SILU`, `CLAMP`	2025-03-08 15:13:01 +02:00
xiaofei	82f9496657	cmake : fix undefined reference errors for std::filesystem in ggml (#12092 ) (llama/12094) Signed-off-by: Ray Lee <hburaylee@gmail.com> Co-authored-by: Ray Lee <hburaylee@gmail.com>	2025-03-08 15:13:01 +02:00
Johannes Gäßler	e3c85e75bd	CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (llama/12222)	2025-03-08 15:13:01 +02:00
uvos	b9eab73fa2	HIP/CUDA: set the parameter value in maintain_cuda_graph instead of replacing it. (llama/12209) This avoids conflict with internal cuda/hip runtimes memory management behavior.	2025-03-08 15:13:01 +02:00
Henry Linjamäki	76385c8311	opencl : fix buffer alignment (llama/12197) Fix the following error: ``` ggml-alloc.c:99: not enough space in the buffer ggml_tallocr_alloc: not enough space in the buffer to allocate blk.17.ffn_down.weight (needed 27525120, available 27521024) ``` which occurs when `ggml_backend_opencl_context::alignment` is larger than `cl_ptr_base` (hard-coded to `0x1000`). Also, fix that `ggml_backend_opencl_context::alignment` was set to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`, which was treated as bytes although the value is reported in bits.	2025-03-08 15:13:01 +02:00
Henry Linjamäki	442cd1d2e7	opencl : fix `ulong` kernel args were set from `int` variables (llama/12174) ... which left garbage bits in the upper half of the kernel args. This caused segmentation faults when running PoCL.	2025-03-08 15:13:01 +02:00
simon886212	bc8cb97e02	opencl : fix profile-related errors (llama/12095) Co-authored-by: ubuntu <ubuntu@localhost.localdomain>	2025-03-08 15:13:01 +02:00
Rémy O	8dcadf736b	ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions (llama/12154) * ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions * cmake: Add GGML_BMI2 build option * ggml: enable BMI2 on relevant CPU variants * ggml-cpu: include BMI2 in backend score * ggml-cpu: register BMI2 in ggml_backend_cpu_get_features * ggml-cpu: add __BMI2__ define when using MSVC	2025-03-08 15:13:01 +02:00
Akarshan Biswas	93986b61e0	SYCL: Disable f16 Unary OPs as not supported by the kernels (llama/12201)	2025-03-08 15:13:01 +02:00
Plamen Minev	bd1a9e34c9	ggml : fix GGMLMetalClass ODR (llama/12200) -- it might happen if ggml is loaded from 2 separate libraries since each one of them will expose the class. This is more of a guard since we want to use only Metal as embedded library and don't care about the other case.	2025-03-08 15:13:01 +02:00
vmobilis	cc03608e78	ggml : ggml_compute_forward_concat() for arbitrary tensor type (ggml/1118) * ggml_compute_forward_concat() for arbitrary tensor type * Check that tensors' type match * ggml-cpu.c: check type of source tensors * ggml-cpu.c: move tensor type check to ggml_compute_forward_concat() * ggml.c: check concatenated tensor type * Remove tensor type check from ggml_compute_forward_concat() in ggml-cpu.c ..., as it was moved to ggml.c.	2025-03-08 15:13:01 +02:00
Georgi Gerganov	54a54faee4	vulkan : sync (llama/0) ggml-ci	2025-03-08 15:13:01 +02:00
mgroeber9110	96a92ecc4c	ggml : portability fixes for VS 2017 (llama/12150) * Add include files for std::min/max and std::toupper/tolower * win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined * Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode * win32: only use __restrict in MSVC if C11/C17 support is not enabled --------- Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>	2025-03-08 15:13:01 +02:00
David Huang	edd1d8686a	HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (llama/12032) Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check Adds rocWMMA support to fattn-wmma-f16	2025-03-08 15:13:01 +02:00
ag2s20150909	dc6f4e7c05	ggml : fix kleidiai build (llama/12159) The libggml API has changed, but this has not been updated.	2025-03-08 15:13:01 +02:00
Akarshan Biswas	74c85d154e	SYCL: Move CPY kernels to a separate file and add few missing kernels (llama/12133) * SYCL: refactor and move cpy kernels to a separate file * Add few missing cpy kernels * refactor and add debug logs	2025-03-08 15:13:01 +02:00
Diego Devesa	eb2d8b6ffd	ggml-backend : keep paths in native string type when possible (llama/12144)	2025-03-08 15:13:01 +02:00
Erik Scholz	b442dcd598	CUDA: compress mode option and default to size (llama/12029) cuda 12.8 added the option to specify stronger compression for binaries, so we now default to "size".	2025-03-08 15:13:01 +02:00
William Tambellini	c98681e6d5	ggml : upgrade init_tensor API to return a ggml_status (llama/11854) * Upgrade init_tensor API to return a ggml_status To prepare for an 'abort-free' ggml (ggml not to abort on OOMs but return an OOM status), as agreed with Diego in the ggml repo, upgrade the init_tensor() and view_init() APIs to return a ggml_status. * misc fixes --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-03-08 15:13:01 +02:00
Rémy O	3bab804981	vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (llama/11595) * vulkan: implement specialized MMV kernels for IQ2 quantizations * vulkan: add MMV kernels for IQ3 quants * vulkan: Increase MMV batch size and unroll IQ LUT setup * vulkan: fix init_iq_shmem for WG sizes larger than tables * vulkan: common batch size for all I-quants	2025-03-08 15:13:01 +02:00
Johannes Gäßler	c927830a70	CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (llama/12098)	2025-03-08 15:13:01 +02:00
Prashant Vithule	992b51b3d5	ggml: aarch64: implement SVE kernels for q2_k_q8_k vector dot (llama/12064) * Added SVE Support for Q2_K Quantized Models * Use 4-space indentation in the switch cases * removed comments lines * Remove the loop Retain the curly braces for better understanding of code * Remove the comment like added for q3_k_q8_k kernel --------- Co-authored-by: vithulep <p.m.vithule1517@gmail.com>	2025-03-08 15:13:01 +02:00
hipudding	2c882cbe4c	CANN: Fix build error with GCC 13 (llama/11990) Remove unused header file that causes compilation failure on ARM platform with GCC 13.	2025-03-08 15:13:01 +02:00
Eve	1fbb119b1e	vulkan: matmul dequantization improvements (llama/12015) * faster dequant for old quants * dont use unpack for iq4_nl * vec2 unpack for q8	2025-03-08 15:13:01 +02:00
Daniele	40dea850fd	vulkan: improve im2col (llama/11826) * vulkan: improve im2col performance	2025-03-08 15:13:01 +02:00
Vladimir Vuksanovic	8255a830a8	cmake: Fix ggml backend dependencies and installation (llama/11818) * Fix dependencies between ggml and backends ggml backends link only to ggml-base and ggml links to all backends. * Fix installation of ggml backends Set up GNUInstallDirs before setting the installation directory of ggml backends	2025-03-08 15:13:01 +02:00
Jeff Bolz	a0f76b2da7	vulkan: fix assertion when qy_needs_dequant (llama/12068) Looks like a copy/paste bug from qx_needs_dequant.	2025-03-08 15:13:01 +02:00
Molly Sophia	394768c48b	ggml-cpu: Fix build with sve (llama/12059) * ggml-cpu: Fix build with sve Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml-cpu: Remove unused variable in sve q3_k vec dot Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-03-08 15:13:01 +02:00
cmdr2	846e01b2c0	cuda: unary ops as float + de-duplicate (ggml/1130)	2025-03-08 15:13:01 +02:00
cmdr2	6ac8e6b2ce	cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129) * cuda: restrict SILU_BACK to fp32, since fp16 exceeds the desired test threshold * vulkan: specify fp32-only support for certain ops (that are now tested for fp16 as well) * f32 sigmoid in vulkan supports op * Revert "f32 sigmoid in vulkan supports op" This reverts commit c6f04b3c19bf4504c2776149c6d8cd84e0b48acb.	2025-03-08 15:13:01 +02:00
cmdr2	60d2ddebdf	cuda/cpu: Increase support for fp16 unary operations (ggml/1125) * Support fp16 unary operations in the CUDA backend * cpu: increase fp16 support for unary operators in the CPU backend * cuda: increase fp16 support for unary operators in the CUDA backend * Add test cases for fp16 unary operators * metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing * metal: fix PR comments for unary op support after fp16 unary tests	2025-03-08 15:13:01 +02:00
petterreinholdtsen	2e180184a8	Told cmake to install ggml-cpp.h as a public header file. (ggml/1126) It is used by Whisper talk-llama example. Co-authored-by: Petter Reinholdtsen <pere@debian.org>	2025-03-08 15:13:01 +02:00
Diego Devesa	339a1cba5d	whisper : support GGML_BACKEND_DL (#2843 ) * whisper : support GGML_BACKEND_DL * fix DTW crash * whisper.objc : fix build - add ggml-cpp.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-27 13:35:07 +01:00
cmdr2	cdaee8b4bd	Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121) * Support float16-to-float16 add/sub/mul/div operations in the CUDA backend * Add fp16 support for add/sub/mul/div on the CPU backend * Add test cases for fp16 add/sub/mul/div	2025-02-27 08:55:36 +02:00
Gian-Carlo Pascutto	4b60ff4f92	metal : copy kernels for quant to F32/F16 conversions (llama/12017) metal: use dequantize_q templates --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-27 08:55:36 +02:00
lhez	b43b9d928c	opencl: fix for small models (llama/11950) * opencl: fix small shape gemv, remove unused extensions * opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size * opencl: fix for token length < 4 * opencl: use wave size of 64 for all Adreno GPUs --------- Co-authored-by: Shawn Gu <quic_shawngu@quicinc.com> Co-authored-by: Skyler Szot <quic_sszot@quicinc.com>	2025-02-27 08:55:36 +02:00
Neo Zhang Jianyu	e3cb412a59	Optimize mul_mat for Q4_0 on Intel GPU (llama/12035) * opt performance by reorder for Intel GPU * detect hw type and save opt feature, and print opt feature * correct name * support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed * add env variable GGML_SYCL_DISABLE_OPT for debug * use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT * add performance data * mv getrows functions to separated files * fix global variables --------- Co-authored-by: arthw <14088817+arthw@users.noreply.github.com>	2025-02-27 08:55:36 +02:00
Akarshan Biswas	ac301a7d9b	SYCL: Fix GGML_SYCL_DEBUG macro (llama/11995)	2025-02-27 08:55:36 +02:00
Aaron Teo	82e04e7670	ggml-cpu: Support s390x SIMD Instruction Set (llama/12019) * ggml: add s390x ARCH_FLAGS for compilation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add SIMD for s390x using vector intrinsics SIMD is activated for: * ggml_vec_dot_f32 * ggml_vec_dot_f16 * ggml_vec_mad_f32 * ggml_vec_mad_f16 * ggml_vec_mad_f32_unroll * ggml_vec_scale_f32 * ggml_vec_scale_f16 SIMD is NOT activated for: * ggml_vec_dot_f16_unroll (pending bugfix) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix missing escape character in GGML_F32x4_REDUCE Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix s390x GGML_F32x4_REDUCE Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: full SIMD activation for F32,F16 s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add option to disable s390x VXE/VXE2 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: change vecintrin.h include to ggml-cpu-impl * add __VXE__ and __VXE2__ macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * cmake: add s390x target detection for VX/VXE/VXE2 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move s390x vector intrinsics to ggml-cpu-impl.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x Q8_0 SIMD Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: correct documentation for Q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x reduce code complexity Q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x bugfix typo Q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activated for Q4_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x inline vec_reve Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for Q4_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add VXE backend feature Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: remove test.py Signed-off-by: 
Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for quantize_row_q8_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for quantize_row_q8_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for iq4_xs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: bugfix iq4_xs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for iq4_nl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add float, double, and long vector data type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: clean up iq4_xs SIMD Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix improper use of restrict keyword Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: update warning message for ggml_vec_tbl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: switch to restrict for iq4_nl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: slight dot product speed improvement for q4_1_q8_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for q6_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add missing `_t` to ggml_int8x16x4_t Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix missing `_t` for ggml_vec_xl_s8x4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix more missing `_t` Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add unroll and prefetch to Q8_0 increase of 3.86% for prompt processing and 32.22% for token generation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: patch Q8_0 to use proper vector sizes Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: optimise Q8_0 dot prod compute kernel further Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: add unroll and prefetch to Q4_1 Signed-off-by: Aaron Teo 
<aaron.teo1@ibm.com> * ggml: refactor Q6_K variable naming for readability Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q6_K typos Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for Q5_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix wrong charx16_t naming Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> ggml: Q5_K y0 wrong signness Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q5_K invalid uchar type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q5_K invalid uchar type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: s390x SIMD activation for Q4_K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: fix Q4_K invalid vector intrinsics Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: simplify ggml_padd_s16 compute kernel Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: correct ggml-cpu vxe wording Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: change ggml_aligned_malloc alignment to 256 256 is the cache line size for s390x platforms Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: resolve pr merge via cherry-pick 225bbbf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701) * ggml: resolve pr merge via cherry-pick 4571953 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: cmake remove fork when determining s390x machine type thank you @ericcurtin Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Jinyang He <hejinyang@loongson.cn> Co-authored-by: junchao-zhao <68935141+junchao-loongson@users.noreply.github.com>	2025-02-27 08:55:36 +02:00
Johannes Gäßler	38ac47cd4d	CUDA: app option to compile without FlashAttention (llama/12025)	2025-02-27 08:55:36 +02:00
Johannes Gäßler	2d70cd36d7	CUDA: optimize FA for GQA + large batches (llama/12014)	2025-02-27 08:55:36 +02:00
Gian-Carlo Pascutto	98dab49b9a	cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (llama/12000)	2025-02-27 08:55:36 +02:00
PureJourney	b1385e9aa9	CUDA: correct the lowest Maxwell supported by CUDA 12 (llama/11984) * CUDA: correct the lowest Maxwell supported by CUDA 12 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-02-27 08:55:36 +02:00
Bodhi	48f5e893f5	MUSA: support ARM64 and enable dp4a .etc (llama/11843) * MUSA: support ARM64 and enable __dp4a .etc * fix cross entropy loss op for musa * update * add cc info log for musa * add comment for the MUSA .cc calculation block --------- Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>	2025-02-27 08:55:36 +02:00
Charles Xu	dc21871fcb	ggml-cpu: Add CPU backend support for KleidiAI library (llama/11390) * ggml-cpu: Add CPU backend support for KleidiAI library * Add environmental variable GGML_KLEIDIAI_SME * Add support for multithread LHS conversion * Switch kernel selection order to dotprod and i8mm * updates for review comments * More updates for review comments * Reorganize and rename KleidiAI files * Move ggml-cpu-traits.h to source file * Update cmake for SME build and add alignment for SME * Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list	2025-02-27 08:55:36 +02:00
Prashant Vithule	64a430bc81	ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (llama/11917) * Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file * Improved Formatting of code in ggml-cpu-quants.c file * style : minor fixes * style : less whitespaces * style : ptr spacing --------- Co-authored-by: vithulep <p.m.vithule1517@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-27 08:55:36 +02:00
Johannes Gäßler	51a3580c79	CUDA: use async data loading for FlashAttention (llama/11894) * CUDA: use async data loading for FlashAttention --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-27 08:55:36 +02:00
Rémy O	37a21dd43d	vulkan: implement several ops relevant for ggml_opt (llama/11769) * vulkan: support memset_tensor * vulkan: support GGML_OP_SUM * vulkan: implement GGML_OP_ARGMAX * vulkan: implement GGML_OP_SUB * vulkan: implement GGML_OP_COUNT_EQUAL * vulkan: implement GGML_OP_OPT_STEP_ADAMW * vulkan: fix check_results RWKV_WKV6 crash and memory leaks * vulkan: implement GGML_OP_REPEAT_BACK * tests: remove invalid test-backend-ops REPEAT_BACK tests * vulkan: fix COUNT_EQUAL memset using a fillBuffer command	2025-02-27 08:55:36 +02:00
Jeff Bolz	8a22a8b17f	vulkan: support multi/vision rope, and noncontiguous rope (llama/11902)	2025-02-27 08:55:36 +02:00
Hale Chan	fcbcad0c90	metal : fix the crash caused by the lack of residency set support on Intel Macs. (llama/11904)	2025-02-27 08:55:36 +02:00
Adrian Kretz	4444db7360	metal : optimize dequant q6_K kernel (llama/11892)	2025-02-27 08:55:36 +02:00
Georgi Gerganov	a7fc1038ca	repo : update links to new url (llama/11886) * repo : update links to new url ggml-ci * cont : more urls ggml-ci	2025-02-27 08:55:36 +02:00
Rémy O	1689aaf854	vulkan: initial support for IQ1_S and IQ1_M quantizations (llama/11528) * vulkan: initial support for IQ1_S and IQ1_M quantizations * vulkan: define MMV kernels for IQ1 quantizations * devops: increase timeout of Vulkan tests again * vulkan: simplify ifdef for init_iq_shmem	2025-02-27 08:55:36 +02:00
lhez	4b48fe449a	opencl: Fix rope and softmax (llama/11833) * opencl: fix `ROPE` * opencl: fix `SOFT_MAX` * Add fp16 variant * opencl: enforce subgroup size for `soft_max`	2025-02-27 08:55:36 +02:00
Diego Devesa	47cc043e69	cuda : add ampere to the list of default architectures (llama/11870)	2025-02-27 08:55:36 +02:00
Jinyang He	e3d9ffb98b	ggml: optimize some vec dot functions for LoongArch ASX (llama/11842) * Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX * Optimize mul_sum_i8_pairs_float for LoongArch ASX * Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX	2025-02-27 08:55:36 +02:00
Eve	e22d69839d	vulkan: linux builds + small subgroup size fixes (llama/11767) * mm subgroup size * upload vulkan x86 builds	2025-02-27 08:55:36 +02:00
Jeffrey Morgan	defe731263	llamafile: use member variable instead of constant for iq4nlt (llama/11780)	2025-02-27 08:55:36 +02:00
R0CKSTAR	4e07957bf9	musa: bump MUSA SDK version to rc3.1.1 (llama/11822) * musa: Update MUSA SDK version to rc3.1.1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: Remove workaround in PR #10042 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-02-27 08:55:36 +02:00
Diego Devesa	d2c5154bb5	ggml-cpu : add chunking support to mul_mat_id (llama/11666) * ggml-cpu : add chunking support to mul_mat_id * allocate chunk counter in wdata parallelize src1 quantization by column to allow parallelization even when there is only one row * disable for arm * cleanup * better way to disable for arm * fix uninitialized counter when using 1 thread only * revert test-backend-ops changes	2025-02-27 08:55:36 +02:00
Xuan-Son Nguyen	4fac43fe00	ggml : x2 speed for WASM by optimizing SIMD (llama/11453) * ggml : x2 speed for WASM by optimizing SIMD * fix bad merging * rm trailing spaces * rm redundant clamp * better quantize_row_q8_K Co-authored-by: camel-cdr <camel-cdr@protonmail.com> * remove memset that causes buffer overflow Co-authored-by: camel-cdr <camel-cdr@protonmail.com> --------- Co-authored-by: camel-cdr <camel-cdr@protonmail.com>	2025-02-27 08:55:36 +02:00
uvos	3be9670f17	HIP: Remove GCN from list of devices that avoid MMQ (llama/11831)	2025-02-27 08:55:36 +02:00
uvos	86729fcd6d	HIP: Switch to std::vector in rocblas version check (llama/11820)	2025-02-27 08:55:36 +02:00
bandoti	7fbca6304e	cleanup: fix compile warnings associated with gnu_printf (llama/11811)	2025-02-27 08:55:36 +02:00
Richard	d597f83e1a	ggml : fix multi-threaded clamp_f32 (llama/11824) * Bug fix for clamp_f32 When using tensors larger than 1d clamp operation does not work due to the restriction of returning if ith is not 0. * Bug fix for clamp_f32 * Bug fix for clamp_f32	2025-02-27 08:55:36 +02:00
Weizhao Ouyang	e5edcc6259	ggml-cpu: Fix duplicate MATMUL_INT8 (llama/11817) Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>	2025-02-27 08:55:36 +02:00
Johannes Gäßler	556f773d53	CUDA: fix CUDART_VERSION checks (llama/11821)	2025-02-27 08:55:36 +02:00
Sheldon Robinson	91d02de332	Fix #11802 : Compile bug - RegQueryValueExA changed to RegQueryValueEx (llama/11803) * Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx * Fix #11802: PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string	2025-02-27 08:55:36 +02:00
Johannes Gäßler	1b67d72f87	CUDA: use arch list for compatibility check (llama/11775) * CUDA: use arch list for feature availability check --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-27 08:55:36 +02:00
Maxim Evtush	14d7c0368d	fix: typos in documentation files (llama/11791) * Update ggml.c * Update arg.cpp * Update speculative.h	2025-02-27 08:55:36 +02:00
Danny Milosavljevic	db6e19188a	vulkan: Make Vulkan optional at runtime (ggml/11493). (llama/11494) Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-02-27 08:55:36 +02:00
Wagner Bruna	b4b063a5c9	vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (llama/11592)	2025-02-27 08:55:36 +02:00
Jeff Bolz	930b739e7a	vulkan: account for lookup tables when checking shared memory size (llama/11502)	2025-02-27 08:55:36 +02:00
Karol Kontny	5981352bb5	ggml: Fix data race in ggml threadpool (llama/11736) After the barrier in the last iteration is executed, the loop termination condition is still evaluated. However, the main thread may already have destroyed the cgraph object and its nodes, so another thread would then access memory that is already gone. Trouble can also happen when n_nodes == 0 or abort is called, but I'm not sure if the prior situation is possible. The last synchronization should be done after the loop to ensure the cgraph/cplan won't be accessed after the main thread exits from the function.	2025-02-27 08:55:36 +02:00
Johannes Gäßler	7561da244e	CUDA: fix min. version for movmatrix (llama/11751)	2025-02-27 08:55:36 +02:00
Jeff Bolz	be83f342fb	vulkan: print shared memory size (llama/11719)	2025-02-27 08:55:36 +02:00
Akarshan Biswas	fd369871f7	SYCL: remove XMX info from print devices (llama/11712)	2025-02-27 08:55:36 +02:00
Jinyang He	bbd8364f5e	ggml : optimize and build warning fix for LoongArch (llama/11709) * ggml : optimize convert f32<->f16 for loongarch_asx * ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16 * ggml : Fix warnings when run cpu CI locally on LoongArch	2025-02-27 08:55:36 +02:00
Akarshan Biswas	e4102440ef	SYCL: Adjust support condition for norm operators (llama/11674) SYCL does not support non contiguous tensors for norm operations	2025-02-27 08:55:36 +02:00
junchao-zhao	f8242ec483	ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)	2025-02-27 08:55:36 +02:00
Jeff Bolz	ef51b4cba4	vulkan: optimize coopmat2 iq2/iq3 callbacks (llama/11521) * vulkan: optimize coopmat2 iq2/iq3 callbacks * build: trigger CI on GLSL compute shader changes	2025-02-27 08:55:36 +02:00
Rémy O	6f08b24146	vulkan: initial support for IQ4_XS quantization (llama/11501)	2025-02-27 08:55:36 +02:00
Jeff Bolz	7c165d7fa8	vulkan: use smaller combined allocations to avoid fragmentation (llama/11551)	2025-02-27 08:55:36 +02:00
Charles Duffy	2f0cf44915	metal : avoid breaking build when metal API predates TARGET_OS_VISION (llama/11690) Avoids breakage in nix flake build introduced by b0569130c5e9c671152c913d82803b7c2f014ff9	2025-02-27 08:55:36 +02:00
Georgi Gerganov	b9c972fd0d	metal : adjust support conditions for norm operators (llama/11671) cont #11659 ggml-ci	2025-02-27 08:55:36 +02:00
Johannes Gäßler	01c9aafbfd	CUDA: support for mat. mul. with ne03 != ne13 (llama/11656)	2025-02-27 08:55:36 +02:00
Johannes Gäßler	bae6bbf487	CUDA: non-contiguous (RMS) norm support (llama/11659) * CUDA: non-contiguous (RMS) norm support --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-27 08:55:36 +02:00
fxzjshm	c310272fa0	HIP: force max threads per block to be 1024 (llama/11621) Some old or vendor-forked versions of llvm still use 256. Explicitly set it to 1024 to align with upstream llvm. Signed-off-by: fxzjshm <fxzjshm@163.com>	2025-02-27 08:55:36 +02:00
Jhen-Jie Hong	bd0b55dbe0	metal : use residency set for other platforms (llama/11648)	2025-02-27 08:55:36 +02:00
Patrick Peng	ba4645db2c	rpc: fix known RCE in rpc-server (ggml/1103) Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes + Check if `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer’s allocated region.	2025-02-27 08:55:36 +02:00
midnight	46d07b9c85	cmake : fix compile assumptions for power9/etc (#2777 ) * Add small comment re: VSX to readme Co-authored-by: midnight <midnight@example.com>	2025-02-05 14:41:10 +02:00
Christian Kastner	16245b35e4	cmake: Add ability to pass in GGML_BUILD_NUMBER (ggml/1096) This makes git as a dependency optional, and is useful in the case where ggml is built not from git, but from a tarball, or a distribution source package. This conditional also affects GGML_BUILD_COMMIT. Nothing seems to be using it, though, so there doesn't seem to be much value in factoring it out, or even requiring it.	2025-02-04 13:03:03 +02:00
Georgi Gerganov	b8ab126343	cmake : sync cmake scripts	2025-02-03 22:00:57 +02:00
Johannes Gäßler	dbeb7916b8	CUDA: fix Volta FlashAttention logic (llama/11615)	2025-02-03 22:00:57 +02:00
Johannes Gäßler	fad2806352	HIP: fix flash_attn_stream_k_fixup warning (llama/11604)	2025-02-03 22:00:57 +02:00
uvos	9906792ec3	CUDA/HIP: add support for selectable warp size to mmv (llama/11519) CUDA/HIP: add support for selectable warp size to mmv	2025-02-03 22:00:57 +02:00
uvos	c49ee07ff4	HIP: add GGML_CUDA_CC_IS_* for amd families as increasing cc architectures for amd gpus are not supersets of each other (llama/11601) This fixes a bug where RDNA1 gpus other than gfx1010 were not handled correctly	2025-02-03 22:00:57 +02:00
Johannes Gäßler	f8a831779e	CUDA: use mma PTX instructions for FlashAttention (llama/11583) * CUDA: use mma PTX instructions for FlashAttention * __shfl_sync workaround for movmatrix * add __shfl_sync to HIP Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-03 22:00:57 +02:00
Olivier Chafik	85451e3612	`ci`: use sccache on windows instead of ccache (llama/11545) * Use sccache on ci for windows * Detect sccache in cmake	2025-02-03 22:00:57 +02:00
uvos	43c744ce8b	HIP: require at least HIP 5.5	2025-02-03 22:00:57 +02:00
uvos	fc2e44490d	HIP: Prepare reduction operators for wave 64	2025-02-03 22:00:57 +02:00
uvos	f41fdad200	CUDA/HIP: add warp_size to cuda_device_info	2025-02-03 22:00:57 +02:00
Rémy Oudompheng	80fa576254	vulkan: implement initial support for IQ2 and IQ3 quantizations (llama/11360) * vulkan: initial support for IQ3_S * vulkan: initial support for IQ3_XXS * vulkan: initial support for IQ2_XXS * vulkan: initial support for IQ2_XS * vulkan: optimize Q3_K by removing branches * vulkan: implement dequantize variants for coopmat2 * vulkan: initial support for IQ2_S * vulkan: vertically realign code * port failing dequant callbacks from mul_mm * Fix array length mismatches * vulkan: avoid using workgroup size before it is referenced * tests: increase timeout for Vulkan llvmpipe backend --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-02-03 22:00:57 +02:00
Jeff Bolz	75e7d0585e	vulkan: Catch pipeline creation failure and print an error message (llama/11436) * vulkan: Catch pipeline creation failure and print an error message Also, fix some warnings from my on-demand compile change. * vulkan: fix pipeline creation logging	2025-02-03 22:00:57 +02:00
uvos	682a6f5f87	HIP: Supress transformation warning in softmax.cu loops with bounds not known at compile time can not be unrolled. when ncols_template == 0, the bounds of the loop are not constexpr, thus llvm cant unroll the loops here.	2025-02-03 22:00:57 +02:00
Nikita Sarychev	115716d109	HIP: Only call rocblas_initialize on rocblas versions with the multiple instantiation bug (llama/11080) This disables the workaround on rocblas fixed versions (>=4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.	2025-02-03 22:00:57 +02:00
someone13574	b2cfef655b	cmake : don't fail on `GGML_CPU=OFF` (llama/11457)	2025-02-03 22:00:57 +02:00
Akarshan Biswas	22e3df0afa	SYCL : SOFTMAX F16 mask support and other fixes (llama/11261) Implemented ggml_sycl_op_soft_max() F16 src1(mask) support for which a pragma deprecation warning was added during #5021. To do this, had to decouple it from ggml_sycl_op_flatten which always considered src1 to be of fp32 type(many OP functions are dependent on it). * SYCL: SOFTMAX F16 mask support and other fixes * test-backend-ops: Add F16 mask test cases	2025-02-03 22:00:57 +02:00
Haus1	028511d349	AMD: parse the architecture as supplied by gcnArchName (llama/11244) The value provided by minor doesn't include stepping for AMD, parse the value returned by gcnArchName instead to retrieve an accurate ID.	2025-02-03 22:00:57 +02:00
Ihar Hrachyshka	70c4038842	metal: Handle null returned from MTLCreateSystemDefaultDevice() (llama/11441) This fixes segmentation fault error when running tests when no metal devices are available (for example, when not linked with Core Graphics framework or otherwise).	2025-02-03 22:00:57 +02:00
Georgi Gerganov	8639c003a9	metal : use residency sets (llama/11427) * metal : use residency sets ggml-ci * metal : restore commandBufferWithUnretainedReferences calls [no ci] * metal : release descriptors ggml-ci * metal : check env GGML_METAL_NO_RESIDENCY ggml-ci * metal : fix build + clean-up ggml-ci	2025-02-03 22:00:57 +02:00
bandoti	d5d831da65	cmake: add ggml find package (llama/11369) * Add initial ggml cmake package * Add build numbers to ggml find-package * Expand variables with GGML_ prefix * Guard against adding to cache variable twice * Add git to msys2 workflow * Handle ggml-cpu-* variants * Link ggml/ggml-base libraries to their targets * Replace main-cmake-pkg with simple-cmake-pkg * Interface features require c_std_90 * Fix typo * Removed unnecessary bracket from status message * Update examples/simple-cmake-pkg/README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/simple-cmake-pkg/README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-03 22:00:57 +02:00
Jeff Bolz	7230a6e1c8	vulkan: compile shaders on-demand (llama/11406) Reduce first-run startup time and memory consumption. Should fix #11339.	2025-02-03 22:00:57 +02:00
uvos	a160fa0f3a	HIP: disable VMM on HIP as it seems that it doesn't work in some configurations (llama/11420)	2025-02-03 22:00:57 +02:00
uvos	0282ad8fd1	hip : Add hipGraph and VMM support to ROCM (llama/11362) * Add hipGraph support * Enable VMM on rocm	2025-02-03 22:00:57 +02:00
Johannes Gäßler	9e467815d4	CUDA: fix FP16 cuBLAS GEMM (llama/11396)	2025-02-03 22:00:57 +02:00
uvos	727891d9bf	rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (llama/11356)	2025-02-03 22:00:57 +02:00
Johannes Gäßler	c262dc80e2	CPU/CUDA: fix (GQA) mul mat back, add CUDA support (llama/11380)	2025-02-03 22:00:57 +02:00
Bernhard M. Wiedemann	30767b4c4e	cmake : avoid -march=native when reproducible build is wanted (llama/11366) See https://reproducible-builds.org/ for why this is good and https://reproducible-builds.org/specs/source-date-epoch/ for the definition of this variable. Without this patch, compiling on different machines produced different binaries, which made verification of results difficult. Fixes: #11317 This patch was done while working on reproducible builds for openSUSE.	2025-02-03 22:00:57 +02:00
amd-dwang	16eeb31933	Vulkan-run-test: fix mmq_wg_denoms (llama/11343) There should be a copy-and-paste error here. mmq_wg_denoms should be used together with warptile_mmq, instead of wg_denoms.	2025-02-03 22:00:57 +02:00
Jeff Bolz	ba523d5e22	vulkan: sort shaders for more deterministic binary (llama/11315) Fixes #11306.	2025-02-03 22:00:57 +02:00
Jeff Bolz	3736706139	vulkan: fix diag_mask_inf (llama/11323) With robustbufferaccess disabled, this shader was showing OOB stores. There is a bounds check in the code, but the workgroup dimensions were reversed vs CUDA and it was running the wrong number of threads. So fix the workgroup dimensions and disable robustness for this pipeline.	2025-02-03 22:00:57 +02:00
Radoslav Gerganov	58640aa456	rpc : better caching of the base buffer pointer (llama/11331) There is no need to use map, just store the base pointer in the buffer context.	2025-02-03 22:00:57 +02:00
Georgi Gerganov	5183a05e56	metal : fix out-of-bounds write (llama/11314) ggml-ci	2025-02-03 22:00:57 +02:00
Jeff Bolz	0dcada42d4	vulkan: fix coopmat2 validation failures (llama/11284) mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert; this is not in the inner loop and is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16. coopmat2 requires SPIR-V 1.6 (related to using LocalSizeId). LocalSizeId requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.	2025-02-03 22:00:57 +02:00
Nicolò Scipione	d507b4cebe	SYCL: Introducing memory host pool (llama/11251) * Implement host pool for matrix_info Creating a new memory pool on the host to store memory location for matrix_info needed to launch gemm_batch from oneMKL/oneMath. Removing complex support in gemm_batch since it is not used in llama.cpp * Remove unnecessary headers and cast * Reorder member variable to avoid warning on initialization * Formatting * Remove unused variable * Address PR review feedback - remove warning --------- Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-02-03 22:00:57 +02:00
Georgi Gerganov	90171055f3	cmake : add sanitizer flags for llama.cpp (llama/11279) * cmake : add sanitizer flags for llama.cpp ggml-ci * tests : fix compile warnings ggml-ci * cmake : move sanitizer flags to llama_add_compile_flags ggml-ci * cmake : move llama.cpp compile flags to top level lists ggml-ci * cmake : apply only sanitizer flags at top level ggml-ci * tests : fix gguf context use in same_tensor_data * gguf-test: tensor data comparison * dummy : trigger ggml-ci * unicode : silence gcc warnings ggml-ci * ci : use sanitizer builds only in Debug mode ggml-ci * cmake : add status messages [no ci] --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-02-03 22:00:57 +02:00
Jeff Bolz	668306ff2b	vulkan: fix coopmat2 flash attention for non-contiguous inputs (llama/11281) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes #11268.	2025-02-03 22:00:57 +02:00
Radoslav Gerganov	fdc21fc87b	rpc : early register backend devices (llama/11262) Early register RPC devices and do not propagate RPC specifics in the llama model structures. ref: #10609	2025-02-03 22:00:57 +02:00
Jeff Bolz	7183a1eb72	vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (llama/11166) * vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl Shaders are based on cpy.cu. * vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32 * ggml: copy q->f32 assumes some contiguity in the destination	2025-02-03 22:00:57 +02:00
Jeff Bolz	09f3c66648	vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (llama/11206) Do masking on whole dwords, fetch all scales at once.	2025-02-03 22:00:57 +02:00
Jeff Bolz	62e2414620	vulkan: optimize coopmat2 q2_k dequant function (llama/11130)	2025-02-03 22:00:57 +02:00
Johannes Gäßler	de49024e49	CUDA: backwards pass for misc. ops, add tests (llama/11257) * CUDA: backwards pass for misc. ops, add tests * remove restrict from pointers	2025-02-03 22:00:57 +02:00
fj-y-saito	db6383094c	ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot (llama/11227) * Add SVE support for q4_K_q8_K * Update ggml/src/ggml-cpu/ggml-cpu-quants.c change to use K_SCALE_SIZE Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-03 22:00:57 +02:00
Eve	164f13c6a9	vulkan: scale caching for k quants + misc fixes (llama/11081) * q6_k scale caching * 16 bit unpack * q4_k test (slow) * revert it * q3_k * q2_k * little stuff * try precalculating products of a and q2_k scales * Revert "try precalculating products of a and q2_k scales" This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b. * unpack should be u16, add vim swap to gitignore (about time) * better q4_k scales * q5_k * better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations * q2_k better dequant * q3_k optimizations * q3_k use hmask simd from cpu avx version * make the caches happy * q3_k separate out calculation * q2_k separate out * little stuff * use calc_superblock everywhere * q2_k optimize scale calculation * more barriers	2025-02-03 22:00:57 +02:00
Junil Kim	02aa86230a	fix: ggml: fix vulkan-shaders-gen build (llama/10448) * fix: ggml: fix vulkan-shaders-gen build The vulkan-shaders-gen target was not being built correctly in case of cross-compilation. Other outputs need to be built for the cross compile target, but vulkan-shaders-gen needs to be built for the host. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup - Add GGML_SHADERS_GEN_TOOLCHAIN CMake option. - Auto-detect host toolchain if not set. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup Use configure_file to generate host_toolchain.cmake from template * fix: ggml: Fix compile error Fix compile error not finding vulkan-shaders-gen * fix: vulkan-shaders-gen build and path handling Fix build issues with vulkan-shaders-gen: - Add target dependency for correct build order - Use CMAKE_HOST_SYSTEM_NAME for executable suffix - Fix MSVC output directory in host toolchain - Normalize path handling for cross-compilation * fix: improve host compiler detection in vulkan shader build Improve host compiler detection for vulkan shader generation: - Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches - Consolidate compiler detection logic - Fix Windows-specific MSVC detection - Ensure correct compiler search in cross-compilation * refactor: Simplify CMake function for detecting host compiler Simplified the CMake function to improve the process of detecting the host compiler. * fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt Since `vulkan-shader-gen.cpp` only requires the `glslc` executable and not the Vulkan headers or libraries, CMakeLists.txt needs to be corrected. (See: ecc93d0558fc3ecb8a5af69d2ece02fae4710ade) * refactor: Rename host_toolchain.cmake.in - Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in * refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN	2025-02-03 22:00:57 +02:00
Johannes Gäßler	54a2ee648f	RoPE: fix back, CUDA support for back + noncont. (llama/11240) * RoPE: fix back, CUDA support for back + noncont. * fix comments reg. non-cont. RoPE support [no-ci]	2025-02-03 22:00:57 +02:00
Akarshan Biswas	9700cfb0a3	SYCL: Add gated linear attention kernel (llama/11175) * SYCL: Add Gated Linear attention kernel * glahpp: add a space at the end of file * gla: Put the barrier inside the main logic loop	2025-02-03 22:00:57 +02:00
William Tambellini	8e0143e205	ggml : add option to not print stack on abort (ggml/1081) * Add option to not print stack on abort Add option/envvar to disable stack printing on abort. Also link some unittests with Threads to fix link errors on ubuntu/g++11. * Update ggml/src/ggml.c --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-03 22:00:57 +02:00
issixx	f12559d590	ggml-cpu : fix ggml_graph_compute_thread did not terminate on abort. (ggml/1065) some threads kept looping and failed to terminate properly after an abort during CPU execution. Co-authored-by: issi <issi@gmail.com>	2025-02-03 22:00:57 +02:00
Johannes Gäßler	d5ef1737d8	GGUF: C++ refactor, backend support, misc fixes (skip) (llama/11030) ggml-ci	2025-01-14 10:38:01 +02:00
lhez	1deb41f0e7	ggml : add opencl backend (skip) (llama/10693) --------- Co-authored-by: Skyler Szot <quic_sszot@quicinc.com> Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com> Co-authored-by: Alexander Angus <quic_aangus@quicinc.com> Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com> Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2025-01-14 10:38:01 +02:00
Andreas Kieslinger	2425caf4fd	cuda : CUDA Graph Compute Function Refactor (precursor for performance improvements) (llama/11042) * Refactor: Moves cuda graph executable update step to separate function. * Refactor: Moves cuda graph update check to separate function. * Refactor: Moves cuda graph maintenance (update or adjusting copy parameters) to separate function for improved readability. * Fix: Adds missing reference to maintain_cuda_graph() definition. * Refactor: Improves structure and abstractions by moving CUDA graph evaluation and capture to its own function. * Refactor: Moves node graph checks and copy ops into individual function for improved readability. * Refactor: Removes code permanently excluded from compilation to increase readability. * Style: Adds missing newline * Style: Consolidates several neighboring '#ifdef USE_CUDA_GRAPH' into a single one * Refactor: Makes 'cuda_graph_update_required' a local variable * remove double lines between functions --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-01-14 10:38:01 +02:00
Radoslav Gerganov	a4b00bcaaf	ggml : do not define GGML_USE_CUDA when building with GGML_BACKEND_DL (llama/11211) Build fails when using HIP and GGML_BACKEND_DL: ``` /usr/bin/ld: ../ggml/src/libggml.so: undefined reference to `ggml_backend_cuda_reg' collect2: error: ld returned 1 exit status ``` This patch fixes this.	2025-01-14 10:38:01 +02:00
0cc4m	cdb8aa2f2e	Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (llama/11161) * Vulkan: Remove float16 use in shaders * Fix validation error about subgroup_size_control extension	2025-01-14 10:38:01 +02:00
Molly Sophia	06209f6683	llama: add support for QRWKV6 model architecture (llama/11001) * WIP: Add support for RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Some graph simplification Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add support for RWKV6Qwen2 with cpu and cuda GLA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix some typos Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix wkv test & add gla test Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix cuda warning Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update README.md Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update ggml/src/ggml-cuda/gla.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix fused lerp weights loading with RWKV6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * better sanity check skipping for QRWKV6 in llama-quant thanks @compilade Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: compilade <git@compilade.net> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2025-01-14 10:38:01 +02:00
Akarshan Biswas	c3235bd81e	SYCL: Refactor ggml_sycl_compute_forward (llama/11121) * SYCL: refactor ggml_sycl_compute_forward * SYCL: add back GGML_USED(dst) to ggml_sycl_cpy * SYCL: add function name to noop debug * SYCL: Some device info print refactoring and add details of XMX availability	2025-01-14 10:38:01 +02:00
hydai	262d0abc87	fix: add missing msg in static_assert (llama/11143) Signed-off-by: hydai <z54981220@gmail.com>	2025-01-14 10:38:01 +02:00
amritahs-ibm	124eec1664	llamafile : ppc64le MMA INT8 implementation (llama/10912) This change upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for quantised int8 datatype. This change results in 10% - 70% improvement in total speed (i.e. all tokens/total time), across various batch sizes. The patch is tested with Meta-Llama-3-8B, Mistral-7B, Llama-2-7B-chat-hf models on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>	2025-01-14 10:38:01 +02:00
Mathieu Baudier	b08c3a88c8	Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (llama/11117) * Disable GL_KHR_cooperative_matrix Vulkan extension if not available. * Perform Vulkan extensions checks in a more sensible order * Remove unnecessary #ifdef directive	2025-01-14 10:38:01 +02:00
ag2s20150909	0afce25a69	fix: Vulkan shader gen binary path when Cross-compiling (llama/11096) * fix: Vulkan shader gen binary path when cross compiling	2025-01-14 10:38:01 +02:00
Johannes Gäßler	acdbe58631	GGUF: C++ refactor, backend support, misc fixes (llama/11030) * GGUF: C++ refactor, backend support, misc fixes remove ggml_tensor.backend update CODEOWNERS [no ci] remove gguf_get_data from API revise GGUF API data types	2025-01-14 10:38:01 +02:00
Diego Devesa	09fabffdf5	ggml-backend : only offload from host buffers (fix) (llama/11124)	2025-01-14 10:38:01 +02:00
Diego Devesa	3988d6396b	ggml-backend : only offload from host buffers (llama/11120)	2025-01-14 10:38:01 +02:00
Radoslav Gerganov	c8c63eeec0	rpc : code cleanup (llama/11107) Remove duplicated macros, use GGML_LOG_ERROR for errors	2025-01-14 10:38:01 +02:00
Akarshan Biswas	abf7f24410	SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6 (llama/11087) * SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6 * Revert "SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6" This reverts commit f62dc45f318e48d375e7734b34cbddee81deed52. * Reland: Use get_multi_ptr instead of deprecated get_pointer in wkv6	2025-01-14 10:38:01 +02:00
Johannes Gäßler	341f5c28e6	CUDA: add BF16 support (llama/11093) * CUDA: add BF16 support	2025-01-14 10:38:01 +02:00
0cc4m	5377099524	Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (llama/11074) * Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver * Add (TM) to AMD name check	2025-01-14 10:38:01 +02:00
matt23654	dcbb375779	Support for models with non-512-aligned tensors over RPC. (llama/11047) * Added init tensor calling code * Added get_alloc_size forwarding * Cleaned up and improved type/error handling. * fix: remove trailing whitespaces. * Cleanup and use GGML error logging functions. * Handle potentially dangerous edge cases. * Apply suggestions from code review Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-01-14 10:38:01 +02:00
Gilad S.	4334c71aed	fix: Vulkan shader gen binary path (llama/11037)	2025-01-14 10:38:01 +02:00
Radoslav Gerganov	e875a82473	ggml : allow loading backend with env variable (ggml/1059) ref: #1058	2025-01-14 10:38:01 +02:00
Georgi Gerganov	2e93cb6a2f	ggml : do not install metal source when embed library (ggml/1054)	2025-01-04 10:45:01 +02:00
Georgi Gerganov	de5cd60d1c	metal : avoid uint (llama/11019)	2025-01-04 10:45:01 +02:00
Srihari-mcw	3fcba3e58b	ggml : fixes for AVXVNNI instruction set with MSVC and Clang (llama/11027) * Fixes for clang AVX VNNI * enable AVX VNNI and alder lake build for MSVC * Apply suggestions from code review --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-01-04 10:45:01 +02:00
Jeff Bolz	cea5f1c52f	vulkan: optimize mul_mat for small values of N (llama/10991) Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better. Share some code for reducing the result values to memory in mul_mat_vec_base.	2025-01-04 10:45:01 +02:00
Jeff Bolz	2112462db4	vulkan: im2col and matmul optimizations for stable diffusion (llama/10942) * tests: Add im2col perf tests * vulkan: optimize im2col, more elements per thread * vulkan: increase small tile size for NV_coopmat2 * vulkan: change im2col to 512 elements per workgroup	2025-01-04 10:45:01 +02:00
Jeff Bolz	fc84ecd445	vulkan: Use push constant offset to handle misaligned descriptors (llama/10987)	2025-01-04 10:45:01 +02:00
Eve	8de1e99907	vulkan: multi-row k quants (llama/10846) * multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default	2025-01-04 10:45:01 +02:00
Peter	499af9294a	examples, ggml : fix GCC compiler warnings (llama/10983) Warning types fixed (observed under MSYS2 GCC 14.2.0): * format '%ld' expects argument of type 'long int', but argument has type 'size_t' * llama.cpp/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first)	2025-01-04 10:45:01 +02:00
Djip007	bcf937c216	ggml : more perfo with llamafile tinyblas on x86_64 (llama/10714) * more perfo with llamafile tinyblas on x86_64. - add bf16 support - change dispatch strategy (thanks: https://github.com/ikawrakow/ik_llama.cpp/pull/71 ) - reduce memory bandwidth simple tinyblas dispatch and more cache friendly * tinyblas dynamic dispatching * sgemm: add M blocks. * - git 2.47 use short id of len 9. - show-progress is not part of GNU Wget2 * remove not stable test	2025-01-04 10:45:01 +02:00
Diego Devesa	b8d90953d7	ggml : use wstring for backend search paths (llama/10960) ggml-ci	2025-01-04 10:45:01 +02:00
Diego Devesa	60a422147b	ggml : fix arm enabled features check (llama/10961)	2025-01-04 10:45:01 +02:00
Diego Devesa	3387415bad	ggml : fix const usage in SSE path (llama/10962)	2025-01-04 10:45:01 +02:00
yuri@FreeBSD	536ca3ec89	ggml : fix run-time on FreeBSD in get_executable_path() (llama/10948)	2025-01-04 10:45:01 +02:00
Jeff Bolz	a4bb983190	vulkan: build fixes for 32b (llama/10927) * vulkan: build fixes for 32b Should fix #10923 * vulkan: initialize some buffer/offset variables	2025-01-04 10:45:01 +02:00
Jeff Bolz	39c205f555	vulkan: optimize coopmat2 dequant functions (llama/10855) Change the code to do 16b loads when possible and extract the appropriate component late, so the code is effectively decoding a pair of elements and then selecting one. This can allow more commoning to happen in the compiler when neighboring elements are loaded.	2025-01-04 10:45:01 +02:00
Adrien Gallouët	6d502f33dc	ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() (llama/10874) * ggml-cpu: replace NEON asm with intrinsics in ggml_gemv_q4_0_4x8_q8_0() Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml-cpu: format code Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-01-04 10:45:01 +02:00
Akarshan Biswas	5ea27d089d	SYCL: Migrate away from deprecated ggml_tensor->backend (llama/10840) * Migrate to tensor->buffer for checking backend buffer type: 1 * SYCL: common.cpp try to migrate away from tensor->backend * SYCL: fix assertions and add proper comments * SYCL: remove extra space * SYCL: Add back static to ggml_backend_buffer_is_sycl_split function * SYCL: Add pragma directive to suppress warning spam * SYCL: Integrate debug logs with GGML_LOG and other fixes * Revert "SYCL: Integrate debug logs with GGML_LOG and other fixes" This reverts commit 2607b7de0f0d2f4f1f690226f86fa861aa39cb97. Let's keep the current SYCL specific logging mechanism for now * SYCL: Use GGML_SYCL_DEBUG after reverting * SYCL: reg_get_proc_address func, update to the current func signature * SYCL: Refactor SYCL buffer checks in ggml_sycl_cpy_tensor_2d	2025-01-04 10:45:01 +02:00
Diego Devesa	1462d92588	ggml : add test for SVE and disable when it fails (llama/10906)	2025-01-04 10:45:01 +02:00
Adrien Gallouët	7ba1a41f47	ggml: fix arm build with gcc (llama/10895) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-01-04 10:45:01 +02:00
Diego Devesa	5ea088636f	ggml : fix arm build (llama/10890) * ggml: GGML_NATIVE uses -mcpu=native on ARM Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml: Show detected features with GGML_NATIVE Signed-off-by: Adrien Gallouët <angt@huggingface.co> * remove msvc support, add GGML_CPU_ARM_ARCH option * disable llamafile in android example * march -> mcpu, skip adding feature macros ggml-ci --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co> Co-authored-by: Adrien Gallouët <angt@huggingface.co>	2025-01-04 10:45:01 +02:00
Georgi Gerganov	f32ddb3b1c	tts : add OuteTTS support (llama/10784) * server : add "tokens" output ggml-ci * server : output embeddings for all tokens when pooling = none ggml-ci * server : be explicit about the pooling type in the tests ggml-ci * server : do not normalize embeddings when there is no pooling ggml-ci * llama : add OuteTTS support (wip) * wip * extract features * first conv * group norm * resnet conv * resnet * attn * pos net * layer norm * convnext * head * hann window * fix n_embd + remove llama.cpp hacks * compute hann window * fft * spectrum processing * clean-up * tts : receive input text and generate codes * clip : fix new conv name * tts : minor fix * tts : add header + minor fixes ggml-ci * tts : add mathematical constant ggml-ci * tts : fix sampling + cut initial noise * tts : fixes * tts : update default samplers ggml-ci * tts : text pre-processing * tts : outetts-voc -> wavtokenizer-dec * tts : remove hardcoded constants ggml-ci * tts : fix tensor shapes * llama : refactor wavtokenizer tensors ggml-ci * cont ggml-ci * cont [no ci] * llama : update WavTokenizer to non-causal attn * llama : handle no-vocab detokenization * tts : add Python example for OuteTTS (wip) * tts : extend python example to generate spectrogram ggml-ci * server : fix rebase artifacts * tts : enable "return_tokens" in Python example ggml-ci * tts : minor fixes * common : support HF download for vocoder	2025-01-04 10:45:01 +02:00
Johannes Gäßler	79b75ece03	tests: add tests for GGUF (llama/10830)	2025-01-04 10:45:01 +02:00
Daniel Bevenius	6348d73e55	ggml : improve inputs log sched_print_assignments (ggml/1053) This commit attempts to improve the log message for the inputs of the splits in the sched_print_assignments function. The motivation for this change is that currently even if there are no inputs a colon is displayed at the end of the line, which can make it a little confusing when reading the output as it could be interpreted as the line below are inputs when they are in fact nodes. With this change the colon will only be printed if there actually are inputs.	2025-01-04 10:45:01 +02:00
Georgi Gerganov	6576af00d7	files : remove old sources	2024-12-18 12:52:16 +02:00
Georgi Gerganov	479499dc0e	ggml : update ggml_backend_cpu_device_supports_op (llama/10867) * ggml : fix cpy op for IQ-quants to use reference impl ggml-ci * ggml : disable tests involving i-matrix quantization * ggml : update ggml_backend_cpu_device_supports_op ggml-ci	2024-12-18 12:52:16 +02:00
Eve	d420a759c5	vulkan: bugfixes for small subgroup size systems + llvmpipe test (llama/10809) * ensure mul mat shaders work on systems with subgroup size less than 32 more fixes add test * only s_warptile_mmq needs to be run with 32 threads or more	2024-12-18 12:52:16 +02:00
Zhiyuan Li	a1ab9b5e91	rwkv6: add wkv6 support for Vulkan backend (llama/10829) * rwkv_wkv6 vulkan shader * RWKV_WKV6 Vulkan op tests passed Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * add [[unroll]] and remove unnecessary conditions * add uma support * fix errors in EditorConfig Checker --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Molly Sophia <mollysophia379@gmail.com>	2024-12-18 12:52:16 +02:00
HimariO	e22d38e4f2	llama : add Qwen2VL support + multimodal RoPE (llama/10361) * Barebone Qwen2VL LLM convertor * Add Qwen2VL cli entrypoint * [WIP] add qwen2vl arch * Verify m-rope output * Add vl-rope/2d-rope support for qwen2vl ViT * update qwen2vl cli tool * update 5D tensor op workaround * [WIP] qwen2vl vision model * make batch and clip utils compatible with qwen2vl * [WIP] create inference workflow, gguf convert script but fix * correcting vision-rope behavior, add the missing last layer back to ViT * add arg parser to qwen2vl_surgery * replace variable size array with vector * cuda-gdb cmake preset * add fp32 mrope, vision rope kernel * add fp16 support for qwen2vl and m-rope * add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION` * fix rope op mode switching, outdated func args * update `llama_hparams` * update to keep up stream changes * resolve linter, test errors * add makefile entry, update special image padding token * add mrope unit test, fix few compiler warnings * rename `mrope` related function, params * minor updates on debug util, bug fixes * add `m-rope` testcase to `test-backend-ops` * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix trailing whitespace * store `llama_hparams.rope_sections` with fixed size array * update position id tensor size check in GGML_OP_ROPE * minor updates * update `ggml_backend__supports_op` of unsupported backends remote old `rope_section` compare operator --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-18 12:52:16 +02:00
lhez	856fbaa92f	Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs (llama/10693) * [cl][adreno] Add Adreno GPU support Add new OpenCL backend to support Adreno GPUs --------- Co-authored-by: Skyler Szot <quic_sszot@quicinc.com> Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com> Co-authored-by: Alexander Angus <quic_aangus@quicinc.com> Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com> Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com> * [cl][ci] Add workflow for CL * [cl][adreno] Fix memory leak for non SMALL_ALLOC path * opencl: integrate backend dyn.load interface and fix compiler and format warnings * opencl: remove small-alloc support and fix build errors for non-opencl platforms * opencl: fixed merge conflict (MUSA added twice in cmake) * opencl-ci: use RUNNER_TEMP instead of github.workspace * opencl: fix embed tool invocation with python3 * opencl: CI workflow fixes * opencl: Clean up small-alloc in CMake files * opencl: cleanup ggml-opencl2 header file * opencl: use ulong for offsets and strides in ADD kernel * opencl: use cl_ulong for all offsets * opencl: use cl_ulong for sizes and strides * opencl: use `GGML_LOG_xxx` instead of `fprintf(stderr, ...)` * opencl: rename backend `opencl2` -> `opencl` * opencl: rename kernel files `ggml-opencl2` -> `ggml-opencl` * opencl: make OpenCL required, remove redundant lib and inc directories * `ggml-base`, `..` and `.` are added by `ggml_add_backend_library` * opencl: rename backend - funcs, structs, etc `opencl2` -> `opencl` * opencl: remove copyright marker since main license already covers * opencl: replace some more OPENCL2 leftovers * opencl: remove limits on `tensor_extra` * opencl: use pools for `tensor_extra` * opencl: fix compiler warnings with GCC and Clang Still getting the warning about clCreateCmdQueue being obsolete. Will fix that separately. * opencl: fail gracefully if opencl devices are not available Also for unsupported GPUs. * opencl: fix MSVC builds (string length error) * opencl: check for various requirements, allow deprecated API * opencl: update log message for unsupported GPUs --------- Co-authored-by: Skyler Szot <quic_sszot@quicinc.com> Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com> Co-authored-by: Alexander Angus <quic_aangus@quicinc.com> Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com> Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2024-12-18 12:52:16 +02:00
谢乃闻	2c05efa4b1	Fix crash caused by ggml_backend_load_all when launching on Android Activity (llama/10812) * Fix crash caused by ggml_backend_load_all when launching on AndroidActivity. Details: Calling ggml_backend_load_all during initialization in the AndroidActivity project leads to a crash with the error: terminating with uncaught exception of type std::__ndk1::__fs::filesystem::filesystem_error: filesystem error: in directory_iterator::directory_iterator(...): Permission denied [./]. This issue occurs because AndroidActivity restricts file access due to sandboxing. Reproduction: In the example folder, the LlamaAndroid project can reproduce the crash by calling ggml_backend_load_all first in Java_android_llama_cpp_LLamaAndroid_backend_1init. * Update ggml/src/ggml-backend-reg.cpp --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-12-18 12:52:16 +02:00
Eve	c21fb10b28	vulkan: small mul_mat_vec optimizations (llama/10665) * double the number of rows per workgroup * Update ggml-vulkan.cpp * Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats * only increase the number of rows for amd and subgroup size 64 * fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested * use subgroup min and max to check for gcn (requires https://github.com/ggerganov/llama.cpp/pull/10721) * manual merge ggml-vulkan.cpp * set min and max subgroup size in any case * Also double the number of rows for Intel GPUs	2024-12-18 12:52:16 +02:00
Akarshan Biswas	26c9fd0cdc	SYCL: Reduce most of the compiler warnings (llama/10748) * Try to reduce some unused and typecast warnings * Reduce compiler warnings step 2 * add a newline at the end of the file * Initialize nreduce as size_t * [SYCL] Remove pragma directives from mmq.cpp * SYCL: mmq add condition to prevent blocks_per_tile_x_row variable from becoming 0 * SYCL softmax: Initialize nreduce as size_t * ggml-sycl.cpp: fix some trailing whitespaces * SYCL: remove the unused variables instead of commenting it out * SYCL pool2d kernel: set NAN for invalid pooling op * SYCL gemm.hpp: remove pragma directives * SYCL gemm.hpp: use const cast to properly support dnnl::memory * SYCL: wkv6 remove a comment * SYCL: clean comments step 2 * SYCL: clean comments and variables step 3 * SYCL: Use GGML_UNUSED for unused variables * SYCL: remove extra empty lines and a comment * Remove TODO * cleanup spaces * add a stdout for unsupported op * use sycl printf over fprintf * remove prints for CI * SYCL ggml-sycl: pool2D use sycl::nan and remove if-else block --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-12-18 12:52:16 +02:00
Karol Kontny	e6eed605cf	ggml : Fix compilation issues on ARM platform when building without fp16 (llama/10811)	2024-12-18 12:52:16 +02:00
a3sh	abe3102cb7	CUDA: faster non-contiguous concat (llama/10760) * faster uncontiguous concat * Use a lambda to avoid code duplication Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update ggml/src/ggml-cuda/concat.cu * add constexpr and static assert --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-12-18 12:52:16 +02:00
Diego Devesa	1193e494a9	remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (llama/10797) other windows build fixes	2024-12-18 12:52:16 +02:00
0cc4m	e5e951672e	Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders (llama/10798)	2024-12-18 12:52:16 +02:00
0cc4m	0e24559ad9	Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats (llama/10721) * Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats * Fix subgroup size control extension support check Add accf32 and accf16 checks for coopmats * Also disable coopmats on amdvlk	2024-12-18 12:52:16 +02:00
Gilad S	527ac800cf	ggml: load all backends from a user-provided search path (llama/10699) * feat: load all backends from a user-provided search path * fix: Windows search path * refactor: rename `ggml_backend_load_all_in_search_path` to `ggml_backend_load_all_from_path` * refactor: rename `search_path` to `dir_path` * fix: change `NULL` to `nullptr` Co-authored-by: Diego Devesa <slarengh@gmail.com> * fix: change `NULL` to `nullptr` --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-12-18 12:52:16 +02:00
Jeff Bolz	479bd77169	vulkan: request round-to-even for fp16 in im2col/rope_head (llama/10767) Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls feature allows rounding mode to be requested if the implementation supports it.	2024-12-18 12:52:16 +02:00
Eve	d8bf63a41b	vulkan: dynamic subgroup size for the remaining k quants (llama/10745) * q5_k q4_k q3_k q2_k q6_k multi row example * revert as multi row isn't faster for k quants	2024-12-18 12:52:16 +02:00
Andreas Kieslinger	b82c8d76dc	CUDA: rename macros to avoid conflicts with WinAPI (llama/10736) * Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI. (e.g. CC_PASCAL, GPU architecture or WinAPI pascal compiler flag?) * Reverts erroneous rename in SYCL-code. * Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A. * Renames the rest of the compute capability macros for consistency.	2024-12-18 12:52:16 +02:00
Jeff Bolz	86346f811e	vulkan: disable spirv-opt for coopmat shaders (llama/10763) There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly necessary anyway. Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it changes. Fix coopmat support reporting when glslc doesn't support NV_coopmat2.	2024-12-18 12:52:16 +02:00
Daniel Bevenius	c635f40a34	ggml : remove return from ggml_gallocr_allocate_node (ggml/1048) This commit removes the return statement from ggml_gallocr_allocate_node function. The motivation behind this change is to make the code more readable and consistent.	2024-12-18 12:52:16 +02:00
Daniel Bevenius	e0be0de1ee	ggml : add check for grad_accs (ggml/1046) * ggml : add check for grad_accs This commit adds a check for grad_accs in ggml_graph_get_grad and ggml_graph_get_grad_acc functions. This is necessary to avoid segfaults when grad_accs is not initialized. The motivation for this change is that I find it nice to be able to print out a computation graph using ggml_graph_print but this function segfaults when grad_accs is not initialized: ```console (gdb) p g1 $2 = (ggml_cgraph ) 0x7ffff66004b0 (gdb) p g1 $3 = {size = 2048, n_nodes = 1, n_leafs = 2, nodes = 0x7ffff6600500, grads = 0x0, grad_accs = 0x0, leafs = 0x7ffff6604500, visited_hash_set = {size = 4099, used = 0x7ffff6610518, keys = 0x7ffff6608500}, order = GGML_CGRAPH_EVAL_ORDER_LEFT_TO_RIGHT} (gdb) p ggml_graph_print(g1) === GRAPH === n_nodes = 1 Program received signal SIGSEGV, Segmentation fault. 0x0000555555579775 in ggml_graph_get_grad (cgraph=0x7ffff66004b0,node=0x7ffff6600340) at /ggml/ggml/src/ggml.c:5990 5990 return igrad != GGML_HASHSET_FULL && ggml_bitset_get(cgraph->visited_hash_set.used, igrad) ? cgraph->grads[igrad] : NULL; ``` * squash! ggml : add check for grad_accs Fix the check in ggml_graph_get_grad. The check was incorrectly using cgraph->grad_accs instead of cgraph->grads.	2024-12-18 12:52:16 +02:00
Johannes Gäßler	eb27e0d834	CUDA: fix shared memory access condition for mmv (llama/10740)	2024-12-18 12:52:16 +02:00
Jeff Bolz	a682fdce0c	vulkan: fix compile warnings (llama/10731)	2024-12-18 12:52:16 +02:00
stduhpf	9ffbd3d969	Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (llama/10723) * Vulkan: fix NaN in tanh.comp * Faster NaN-free tanh	2024-12-18 12:52:16 +02:00
Jeff Bolz	6585a890b4	vulkan: compile a test shader in cmake to check for coopmat2 support (llama/10713)	2024-12-18 12:52:16 +02:00
Georgi Gerganov	d0a050b51f	ggml : disable iq4_nl interleave size 8 (llama/10709) ggml-ci	2024-12-18 12:52:16 +02:00
Djip007	e990d1b791	ggml : refactor online repacking (llama/10446) * rename ggml-cpu-aarch64.c to .cpp * reformat extra cpu backend. - clean Q4_0_N_M and IQ4_0_N_M - remove from "file" tensor type - allow only with dynamic repack - extract cpu extra bufts and convert to C++ - hbm - "aarch64" - more generic use of extra buffer - generalise extra_supports_op - new API for "cpu-accel": - amx - aarch64 * clang-format * Clean Q4_0_N_M ref Enable restrict on C++ * add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack * added/corrected control on tensor size for Q4 repacking. * Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add debug logs on repacks. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-18 12:52:16 +02:00
0cc4m	4a6d52efe6	Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processing (llama/10597) * Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader * Improve performance with better q4_k and q5_k dequant and store unrolling * Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection * Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device * Vulkan: Implement accumulator switch for specific mul mat mat shaders * Vulkan: Unroll more loops for more mul mat mat performance * Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic * Disable coopmat support on AMD proprietary driver * Remove redundant checks * Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support * Fix rebase typo * Fix coopmat2 MUL_MAT_ID pipeline selection	2024-12-18 12:52:16 +02:00
Robert Ormandi	8b841d430a	metal : Extend how Llama.cpp locates metal resources (llama/10676) * metal : Extend how Llama.cpp locates metal resources (llama/10675) * It searches the resource file in the directory where the current binary is located as well. * Resolves symbolic links. Rationale: When we plug this dependency into a Bazel build and run it in the context of Bazel (e.g. testing): * the execution directory is often very different from where the files are located and no direct control over this (Bazel sandboxing), * the Bazel sandbox often use symbolic links to make files available. With this patch, we can have the resource file added to the target, can build and run tests in the context of Bazel. * Update ggml/src/ggml-metal/ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-metal/ggml-metal.m Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-18 12:52:16 +02:00
Jeff Bolz	b74b68212a	vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (llama/10206)	2024-12-18 12:52:16 +02:00
Georgi Gerganov	94e7da1ff2	cmake : fix "amd64" processor string (#2638)	2024-12-17 18:34:32 +02:00
gn64	c4aed6831e	vulkan : fix soft_max.comp division by zero (#2633) This change prevents a division by zero error when p.KY is 0.	2024-12-16 12:34:38 +02:00
Georgi Gerganov	7d134e3737	ggml : remove old files (skip) (#0)	2024-12-08 23:04:26 +02:00
Georgi Gerganov	9df53b357e	ggml : sync remnants (skip) (#0)	2024-12-08 22:48:25 +02:00
Diego Devesa	a815940e0e	ggml : add predefined list of CPU backend variants to build (llama/10626) * ggml : add predefined list of CPU backend variants to build * update CPU dockerfiles	2024-12-08 20:14:35 +02:00
Diego Devesa	904e307bce	ggml-cpu : fix HWCAP2_I8MM value (llama/10646)	2024-12-08 20:14:35 +02:00
Jeff Bolz	491ec076b4	vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (llama/10642)	2024-12-08 20:14:35 +02:00
Nicolò Scipione	966433fdf2	SYCL : Move to compile time oneMKL interface backend selection for NVIDIA backend (llama/10584) * [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend Move to compile time selection of the backend to avoid latency at run time. Add it to all mkl gemm calls and only for NVIDIA backend. Signed-off-by: nscipione <nicolo.scipione@codeplay.com> * Formatting * Address PR comments to increase readability --------- Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2024-12-08 20:14:35 +02:00
Frankie Robertson	6f1ba9d82d	Avoid using __fp16 on ARM with old nvcc (llama/10616)	2024-12-08 20:14:35 +02:00
Jeff Bolz	015ecd0001	vulkan: optimize and reenable split_k (llama/10637) Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the shaders.	2024-12-08 20:14:35 +02:00
PAB	b7c64a4352	ggml: add `GGML_SET` Metal kernel + i32 CPU kernel (ggml/1037) * implemented cpu kernel * add i32 test cases in test-backend-ops * typedef `ggml_metal_kargs_set` * implemented `kernel_set` * memcpy	2024-12-08 20:14:35 +02:00
PAB	7895d39508	ggml : add `GGML_PAD_REFLECT_1D` operation (ggml/1034) * ggml_pad_reflect_1d defined in header * implemented on CPU * called the forward pass * impl Metal kernel * added Metal kernel * added OP_PAD_REFLECT_1D in test-backend-ops.cpp * add test-pad-reflect-1d test case * test case support multiple backend	2024-12-08 20:14:35 +02:00
Georgi Gerganov	22616f00f9	files : remove make artifacts	2024-12-08 20:14:35 +02:00
Diego Devesa	3daeacad24	ggml : move AMX to the CPU backend (llama/10570) ggml : automatic selection of best CPU backend (llama/10606)	2024-12-08 20:14:35 +02:00
Georgi Gerganov	4d73962da4	metal : small-batch mat-mul kernels (llama/10581) * metal : small-batch mat-mul kernels ggml-ci * metal : add rest of types ggml-ci * metal : final adjustments ggml-ci * metal : add comments ggml-ci	2024-12-08 20:14:35 +02:00
Akarshan Biswas	068812650e	SYCL: Fix and switch to GGML_LOG system instead of fprintf (llama/10579) * Switched to GGML_LOG * Fix missing semicolon	2024-12-08 20:14:35 +02:00
Adrien Gallouët	4b7e059e15	ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() (llama/10567) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2024-12-08 20:14:35 +02:00
Eve	30e35d7271	vulkan: Dynamic subgroup size support for Q6_K mat_vec (llama/10536) * subgroup 64 version with subgroup add. 15% faster scalable version tested for subgroup sizes 16-128 * check for subgroup multiple of 16 and greater than 16 * subgroup sizes are always a power of 2 (https://github.com/KhronosGroup/GLSL/issues/45) * force 16 sequential threads per block * make 16 subgroup size a constant	2024-12-08 20:14:35 +02:00
Georgi Gerganov	3623bd58f2	ggml : fix I8MM Q4_1 scaling factor conversion (llama/10562) ggml-ci	2024-12-08 20:14:35 +02:00
Shupei Fan	cb847c20a7	ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (llama/10580)	2024-12-08 20:14:35 +02:00
Alberto Cabrera Pérez	964b154a2a	sycl : offload of get_rows set to 0 (llama/10432)	2024-12-08 20:14:35 +02:00
Alberto Cabrera Pérez	d7c2a04bce	sycl : Reroute permuted mul_mats through oneMKL (llama/10408) This PR fixes the failing MUL_MAT tests for the sycl backend.	2024-12-08 20:14:35 +02:00
Chenguang Li	2bb4ca9cba	CANN: RoPE operator optimization (llama/10563) * [cann] RoPE operator optimization * [CANN]Code Formatting --------- Co-authored-by: noemotiovon <noemotiovon@gmail.com>	2024-12-08 20:14:35 +02:00
Jeff Bolz	a753a82462	vulkan: get the first command buffer submitted sooner (llama/10499) This is an incremental improvement over #9118 to get work to the GPU a bit sooner. The first part is to start with a smaller number of nodes before the first submit, and ramp it up to the current 100 nodes/submit. The second part is to reduce the dryrun overhead for all the nodes that just need to request descriptor space. With these changes I get around 1-2% speedup on RTX 4070 combined with my old Haswell-era CPU.	2024-12-08 20:14:35 +02:00
Georgi Gerganov	276b08d8f0	ggml : remove redundant copyright notice + update authors	2024-12-08 20:14:35 +02:00
Georgi Gerganov	4ca1e72fe0	ggml : fix row condition for i8mm kernels (llama/10561) ggml-ci	2024-12-08 20:14:35 +02:00
Georgi Gerganov	16a66f103f	cmake : fix ARM feature detection (llama/10543) ggml-ci	2024-12-08 20:14:35 +02:00
Shupei Fan	330273901f	ggml-cpu: support IQ4_NL_4_4 by runtime repack (llama/10541) * ggml-cpu: support IQ4_NL_4_4 by runtime repack * ggml-cpu: add __ARM_FEATURE_DOTPROD guard	2024-12-08 20:14:35 +02:00
Sergio López	42099a9342	kompute : improve backend to pass test_backend_ops (llama/10542) * kompute: op_unary: reject unsupported parameters Signed-off-by: Sergio Lopez <slp@redhat.com> * kompute: softmax: implement ALiBi support Signed-off-by: Sergio Lopez <slp@redhat.com> * kompute: rope: implement neox and phi3 support Signed-off-by: Sergio Lopez <slp@redhat.com> * kompute: op_mul_mat_q4_k permuted support Signed-off-by: Sergio Lopez <slp@redhat.com> * kompute: op_mul_mat_[q4_0|q4_1|q8_0] permuted support Signed-off-by: Sergio Lopez <slp@redhat.com> * kompute: op_mul_mat_f16 permuted support Signed-off-by: Sergio Lopez <slp@redhat.com> * kompute: op_mul_mat_q6_k permuted support Signed-off-by: Sergio Lopez <slp@redhat.com> --------- Signed-off-by: Sergio Lopez <slp@redhat.com>	2024-12-08 20:14:35 +02:00
leo-pony	90dd5fca9c	CANN: Fix SOC_TYPE compile bug (llama/10519) * CANN: Fix the build failure on Ascend310P in two cases: 1) SOC_TYPE is specified manually 2) Under some unusual compile environments * Update the cann backend News content: Support F16 and F32 data type model for Ascend 310P NPU. * fix CANN compile failure: the assert in the ascend kernel function isn't supported on some CANN versions	2024-12-08 20:14:35 +02:00
Chenguang Li	2490f2a7f8	CANN: ROPE operator optimization (llama/10540) * [cann] ROPE operator optimization Co-authored-by: noemotiovon <noemotiovon@gmail.com>	2024-12-08 20:14:35 +02:00
uvos	230e985633	Add some minimal optimizations for CDNA (llama/10498) * Add some minimal optimizations for CDNA * ggml_cuda: set launch bounds also for GCN as it helps there too	2024-12-08 20:14:35 +02:00
Georgi Gerganov	ae24083f23	metal : fix group_norm support condition (llama/0)	2024-12-08 20:14:35 +02:00
Jeff Bolz	6463e36369	vulkan: define all quant data structures in types.comp (llama/10440)	2024-12-08 20:14:35 +02:00
Jeff Bolz	b3301f7d82	vulkan: Handle GPUs with less shared memory (llama/10468) There have been reports of failure to compile on systems with <= 32KB of shared memory (e.g. #10037). This change makes the large tile size fall back to a smaller size if necessary, and makes mul_mat_id fall back to CPU if there's only 16KB of shared memory.	2024-12-08 20:14:35 +02:00
Jeff Bolz	ab5d4d93ec	vulkan: further optimize q5_k mul_mat_vec (llama/10479)	2024-12-08 20:14:35 +02:00
Jeff Bolz	2d6e9dd723	vulkan: skip integer div/mod in get_offsets for batch_idx==0 (llama/10506)	2024-12-08 20:14:35 +02:00
Jeff Bolz	2f16e51553	vulkan: optimize Q2_K and Q3_K mul_mat_vec (llama/10459)	2024-12-08 20:14:35 +02:00
R0CKSTAR	0f0994902f	mtgpu: Add MUSA_DOCKER_ARCH in Dockerfiles && update cmake and make (llama/10516) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-12-08 20:14:35 +02:00
Jeff Bolz	5e1fcc1780	vulkan: fix group_norm (llama/10496) Fix bad calculation of the end of the range. Add a backend test that covers the bad case (taken from stable diffusion). Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439.	2024-12-08 20:14:35 +02:00
Georgi Gerganov	48f421de23	cmake : enable warnings in llama (llama/10474) * cmake : enable warnings in llama ggml-ci * cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS * cmake : get_flags -> ggml_get_flags * speculative-simple : fix warnings * cmake : reuse ggml_get_flags ggml-ci * speculative-simple : fix compile warning ggml-ci	2024-12-08 20:14:35 +02:00
Charles Xu	e7afb2b991	ggml-cpu: cmake add arm64 cpu feature check for macos (llama/10487) * ggml-cpu: cmake add arm64 cpu feature check for macos * use vmmlaq_s32 for compile option i8mm check	2024-12-08 20:14:35 +02:00
Shanshan Shen	9a5ef7b169	CANN: Improve the Inferencing Performance for Ascend NPU Device (llama/10454) * improve inferencing performance for ascend npu. Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com> * some modification after review * some modifications after review * restore some modifications * restore some modifications --------- Co-authored-by: shanshan shen <shanshanshen333@gmail.com> Co-authored-by: Frank Mai <thxCode@thxcode0824@gmail.com>	2024-12-08 20:14:35 +02:00
Chenguang Li	453cc0fcf1	CANN: RoPE and CONCAT operator optimization (llama/10488) Co-authored-by: noemotiovon <noemotiovon@gmail.com>	2024-12-08 20:14:35 +02:00
Junil Kim	78dfec6bc5	vulkan: Fix a vulkan-shaders-gen argument parsing error (llama/10484) The vulkan-shaders-gen was not parsing the --no-clean argument correctly. The previous code only parsed arguments that take a value, and --no-clean takes none, so it was not parsed correctly. This commit correctly parses arguments that don't have values.	2024-12-08 20:14:35 +02:00
Georgi Gerganov	f6d518fc4c	metal : enable mat-vec kernels for bs <= 4 (llama/10491)	2024-12-08 20:14:35 +02:00
Diego Devesa	ac33379a35	llama : accept a list of devices to use to offload a model (llama/10497) * llama : accept a list of devices to use to offload a model * accept `--dev none` to completely disable offloading * fix dev list with dl backends * rename env parameter to LLAMA_ARG_DEVICE for consistency	2024-12-08 20:14:35 +02:00
Diego Devesa	77e3e4a090	ggml : add support for dynamic loading of backends (llama/10469) * ggml : add support for dynamic loading of backends --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-08 20:14:35 +02:00
Georgi Gerganov	b840bb09be	metal : minor code formatting	2024-12-08 20:14:35 +02:00
Diego Devesa	8b1c1c30a7	ggml : do not use ARM features not included in the build (llama/10457)	2024-12-08 20:14:35 +02:00
leo-pony	4b81335f75	CANN: Support Ascend310P to accelerate F32 and F16 Model (llama/10216) * CANN Support Ascend310P to accelerate F32 and F16 Model * Add compile option soc type macro ASCEND_310P to ggml-cann lib * Remove unused code * Remove the ascend soc_type hard code compile option in CMakelist.txt	2024-12-08 20:14:35 +02:00
Diego Devesa	2a4b5c9d7e	cuda : optimize argmax (llama/10441) * cuda : optimize argmax * remove unused parameter ggml-ci * fixup : use full warps ggml-ci * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * fix ub * ggml : check ne00 <= INT32_MAX in argmax and argsort --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-12-08 20:14:35 +02:00
Jeff Bolz	04662748aa	vulkan: predicate max operation in soft_max shaders/soft_max (llama/10437) Fixes #10434	2024-12-08 20:14:35 +02:00
Jeff Bolz	a117279e13	vulkan: copy iq4_nl LUT into shared memory (llama/10409)	2024-12-08 20:14:35 +02:00
Jeff Bolz	bbb292ed38	vulkan: further optimize mul_mat_vec using larger loads (llama/10387) * vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec. Add some early returns for nonexistent rows in mul_mat_vec shaders. These can only be hit when dispatching a 2D grid of workgroups. Fix the logic for the 2D grid of workgroups to round up. Enable the pipeline robustness extension if it's available, and use it to disable robustness for these pipelines. The instructions to do the bounds checking contend for the same ALU resources as the bit twiddling dequant instructions. * vulkan: Add GLSL structure aliases for quant types to allow larger loads In Vulkan it's not possible to cast pointer types, so instead you have to declare an aliased binding for the memory with a different type. This commit adds aliases for the quant formats using 16b ints, and in a few places where the struct size is a multiple of 4 also using 32b ints. Currently only q4_k's aliases are used, but others will be used in subsequent commits. * vulkan: use larger loads in q5_k and q6_k shaders. Similar to the optimization I did in q4_k recently, this vectorizes some loads and reduces the number of bit twiddling instructions. * vulkan: use larger K step per iteration in mul_mat_vec. Add vec4 dequantization functions, and use them to do K=8 per iteration in mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B which helps reduce the load on the memory system. The K_PER_ITER==2 logic is still there, just for F16/F32, and really only because they support unaligned sizes. Tweak the num_iters/unrolling logic to be simpler and catch a couple missed unrolling opportunities.	2024-12-08 20:14:35 +02:00
haopeng	95e8901e71	add cmake rvv support (llama/10411)	2024-12-08 20:14:35 +02:00
mahorozte	4af9626702	CUDA: remove unnecessary warp reduce in FA (ggml/1032) * kqmax_new_j is the same in every thread within the warp after the operation at line 199, so this reduce can be omitted * same problem in vec32 --------- Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>	2024-12-08 20:14:35 +02:00
PAB	c52d1035de	feat: add `GGML_UNARY_OP_ARGMAX` Metal kernel (ggml/1019) * implemented argmax kernel * tpig -> tgpig * change to strides * contiguous assertions * kernel working and tested * argmax simd parallel implementation * added 2 new tests for argmax in test-backend-ops * cosmit * added 3 tests cases for perf eval * add test_argmax in make_test_cases_perf * Update test-backend-ops.cpp Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-12-08 20:14:35 +02:00
PAB	5773a14980	metal : add `GGML_OP_CONV_TRANSPOSE_1D` kernels (ggml/1026) * wip * wip implementation f32 * kernel conv transpose 1d f32 working * initial commit	2024-12-08 20:14:35 +02:00
Frankie Robertson	6939147c47	Do not include arm_neon.h when compiling CUDA code (ggml/1028)	2024-12-08 20:14:35 +02:00
Johannes Gäßler	98f9916c9f	ggml-opt: fix data corruption (ggml/1022)	2024-12-08 20:14:35 +02:00
slaren	9db070a3c5	ggml/sched : do not skip views in pre-assignments	2024-11-20 21:00:08 +02:00
Georgi Gerganov	7fd8d9c220	whisper : adapt to new ggml (wip)	2024-11-20 21:00:08 +02:00
Georgi Gerganov	f4c1d7df39	ggml : sync resolve (skip) (#0)	2024-11-20 21:00:08 +02:00
bandoti	339b8e559c	Add required ggml-base and backend libs to cmake pkg (llama/10407)	2024-11-20 21:00:08 +02:00
Diego Devesa	5f6d6919b4	cuda : fix CUDA_FLAGS not being applied (llama/10403)	2024-11-20 21:00:08 +02:00
Romain Biessy	8ee767732f	sycl : Add option to set the SYCL architecture for all targets (llama/10266) * Add option to set the SYCL architecture for all targets * Convert GGML_SYCL_HIP_TARGET to the more generic GGML_SYCL_ARCH option * Document that setting GGML_SYCL_ARCH can improve the performance	2024-11-20 21:00:08 +02:00
Jeff Bolz	45f1f9144f	vulkan: Optimize soft_max (llama/10301) * vulkan: Optimize soft_max Large soft_max could already saturate memory, but small/medium sizes were pretty slow. The bulk of the gains for them comes from using a smaller workgroup size, and making the workgroup size match the subgroup size also makes the barriers much cheaper. Cache some values in locals to avoid refetching/recomputing. And stamp out a few "template instantiations" so smaller cases will fully unroll. Add a missing early return for OOB rows. This happens when there are more than 512 rows and the dispatch is 512 x H. * vulkan: Further soft_max optimizations Restore the workgroup size of 512 case, use it for >1024. Use unrollable loops for more iteration counts.	2024-11-20 21:00:08 +02:00
Alberto Cabrera Pérez	53589c8f12	sycl: Revert MUL_MAT_OP support changes (llama/10385)	2024-11-20 21:00:08 +02:00
Diego Devesa	7ac2f17fac	cuda : only use native when supported by cmake (llama/10389)	2024-11-20 21:00:08 +02:00
Jeff Bolz	48862c7b27	vulkan: remove use of null initializer (llama/10372) Seems like this isn't working for vulkan-over-metal when the array is sized by a spec constant. Maybe a spirv-cross limitation?	2024-11-20 21:00:08 +02:00
Plamen Minev	44f7d9f4e3	metal : fix offset integer overflows in im2col (ggml/1015) -- While running StableDiffusion.cpp locally with Metal some offsets overflow and result in incorrect calculations	2024-11-20 21:00:08 +02:00
0cc4m	fd12302587	Vulkan: Fix device info output format specifiers (llama/10366) * Vulkan: Fix device info output format specifiers * Vulkan: Use zu printf specifier for size_t instead of ld	2024-11-20 21:00:08 +02:00
PAB	f80bef4630	metal : add `GGML_UNARY_OP_ELU` kernel (ggml/1018)	2024-11-20 21:00:08 +02:00
Johannes Gäßler	161b443514	CUDA: fix MMV kernel being used for FP16 src1 (llama/10357)	2024-11-20 21:00:08 +02:00
Johannes Gäßler	ef7fbe1c66	CMake: fix typo in comment [no ci] (llama/10360)	2024-11-20 21:00:08 +02:00
Diego Devesa	0879d3599e	llama : only use default buffer types for the KV cache (llama/10358)	2024-11-20 21:00:08 +02:00
Georgi Gerganov	2a444dc5bd	metal : refactor kernel args into structs (llama/10238) * metal : add kernel arg structs (wip) * metal : fattn args ggml-ci * metal : cont + avoid potential int overflow [no ci] * metal : mul mat struct (wip) * cont : mul mat vec * cont : pass by reference * cont : args is first argument * cont : use char ptr * cont : shmem style * cont : thread counters style * cont : mul mm id ggml-ci * cont : int safety + register optimizations ggml-ci * metal : GGML_OP_CONCAT ggml-ci * metal : GGML_OP_ADD, GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV * metal : GGML_OP_REPEAT * metal : GGML_OP_CPY * metal : GGML_OP_RMS_NORM * metal : GGML_OP_NORM * metal : add TODOs for rest of ops * ggml : add ggml-metal-impl.h ggml-ci	2024-11-20 21:00:08 +02:00

2159 Commits