Commit Graph

3874 Commits

Aaron Teo 89a7b4d22c
ggml-cpu: implement MXFP4 SIMD for s390x (llama/16193)
* ggml-cpu: impl mxfp4 s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: missing s = sumf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix incorrect kval_mxfp4 type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rework mxfp4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: missing delta calc

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix typo

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix typo for vec_splats

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: expand to 2 blocks per loop

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add unroll to boost perf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: back to 1 block per loop to test perf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: back to 1 block per loop to test perf"

This reverts commit 1fe55724e2dc295701101bf838bdd4a512237492.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rm unroll from single block

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-29 15:18:11 +03:00
R0CKSTAR 98ac209ae1
musa: fix build warnings (llama/15611)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-29 15:18:10 +03:00
Aman Gupta d9bf63cfb8
CUDA: add a fused top-K MoE kernel (llama/16130)
* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
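The three steps listed above can be sketched in plain Python (a reference for the semantics only, not the CUDA kernel; the function name and the exact tie-breaking rule are illustrative assumptions):

```python
import math

def topk_moe(logits, k, renormalize=False):
    """Reference semantics of the fused softmax -> top-k -> gather
    pipeline for one token. logits holds one score per expert.
    Returns (weights, ids) for the k selected experts."""
    # 1. numerically stable softmax over all expert logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    probs = [e / s for e in exps]

    # 2. top-k selection; ties broken by lower expert id,
    #    mimicking a stable argmax reduce
    ids = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    weights = [probs[i] for i in ids]

    # 3. optional renormalization so the k kept weights sum to 1
    if renormalize:
        t = sum(weights)
        weights = [w / t for w in weights]
    return weights, ids
```

Fusing these steps into one kernel avoids materializing the full softmax output and the intermediate top-k tensors in global memory.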
2025-09-29 15:18:10 +03:00
junchao-zhao 24ea5476de
ggml : fix loongarch lsx compilation error (llama/15864) 2025-09-29 15:18:10 +03:00
Daniel Bevenius 611ff19f20
ggml : remove -dev suffix from release version (ggml/1355)
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
formatted as `MAJOR.MINOR.PATCH`.
2025-09-29 15:18:10 +03:00
Daniel Bevenius 06d7b3d124
ggml : bump version to 0.9.3 (ggml/1353) 2025-09-29 15:18:10 +03:00
Georgi Gerganov ac678efb35
metal : fuse NORM + MUL + ADD, support non-multiples of 4 (llama/16220)
* metal : fuse NORM + MUL + ADD

* metal : support norms of non-multiple of 4

* cont : fix comment [no ci]
2025-09-29 15:18:10 +03:00
Georgi Gerganov 268f1c961b
metal : relax reorder conditions (llama/16216) 2025-09-29 15:18:10 +03:00
Georgi Gerganov 0a5b811f2e
metal : restore im2col perf (llama/16219) 2025-09-29 15:18:10 +03:00
Radoslav Gerganov 0946619662
rpc : use ggml logging facilities
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
2025-09-29 15:18:10 +03:00
Johannes Gäßler cd431223e0
llama: print memory breakdown on exit (llama/15860)
* llama: print memory breakdown on exit
2025-09-29 15:18:10 +03:00
Acly 5069c08034
ggml : split graph allocations according to backend max buffer size (llama/15815)
* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
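The chunk-splitting behavior described above can be sketched as a toy planner (a simplification under stated assumptions: real gallocr keeps a per-chunk free-block list and reuses freed ranges, while this sketch only bump-allocates):

```python
def plan_chunks(sizes, max_chunk, align=32):
    """Toy sketch: place each allocation into backend buffer
    "chunks" no larger than max_chunk, preferring leftover space
    in earlier chunks before opening a new one.
    Returns ((chunk, offset) per tensor, used bytes per chunk)."""
    chunks = []       # used bytes per chunk
    placements = []   # (chunk index, offset) per allocation
    for size in sizes:
        size = (size + align - 1) // align * align
        # continue to use leftover unallocated space of previous chunks
        for ci, used in enumerate(chunks):
            if used + size <= max_chunk:
                placements.append((ci, used))
                chunks[ci] = used + size
                break
        else:
            # start a new chunk; a single allocation larger than
            # max_chunk still gets its own (oversized) chunk
            chunks.append(size)
            placements.append((len(chunks) - 1, 0))
    return placements, chunks
```

Tracking (chunk, offset) pairs instead of one global offset is what lets a graph that exceeds the backend's max buffer size span multiple backend buffers.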
2025-09-29 15:18:09 +03:00
Xiangyan Sun 41245891c1
ggml-cpu: Respect cpumask settings (llama/16164) 2025-09-29 15:18:09 +03:00
Sigbjørn Skjæret 73e8f3acb8
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (llama/15928)
* fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl

* change initialization to true
2025-09-29 15:18:09 +03:00
Aaron Teo c706a50746
zdnn: refactor codebase + add docs (llama/16178)
* zdnn: initial matmul refactor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm static from funcs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update ggml-zdnn.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: change header files to hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch to common.hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move mulmat forward around

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm inline from utils

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: add zDNN docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-29 15:18:09 +03:00
Daniel Bevenius d8d31e3638
ggml-cpu : fix typo in gemm comments [no ci] (llama/16189) 2025-09-29 15:18:09 +03:00
Sigbjørn Skjæret 4e32ee733b
ggml : implement set_rows with i32 index (llama/16159)
* implement set_rows with i32 index

* template fix

* test quantized path

warnings--

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* forgotten name change

* deduplicate cuda/sycl and test-fix

* indent++

* vulkan: support set_rows with i32 index type (llama/16162)

* disable i32 index for webgpu for now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
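The operation being extended here has simple semantics, sketched below in plain Python (the change above widens the supported index type from i64 to i32; the row-scatter behavior itself is unchanged):

```python
def set_rows(dst, src, idx):
    """Sketch of set_rows with an integer index vector:
    row i of src is written to row idx[i] of dst."""
    assert len(src) == len(idx)
    for i, row in enumerate(src):
        dst[idx[i]] = list(row)
    return dst
```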
2025-09-29 15:18:09 +03:00
Georgi Gerganov df672c6372
ggml : extend ggml_can_fuse to work with non-sequential nodes (llama/16123)
* ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph

* cont : fix wrong bounds check condition

* cont : remove unnecessary overload
2025-09-29 15:18:09 +03:00
Georgi Gerganov 973054a8cd
ggml : add ggml_op_is_empty (llama/16122)
* ggml : add ggml_op_is_empty

* ggml : move to ggml-impl.h
2025-09-29 15:18:09 +03:00
Shin-myoung-serp 9f673df08d
Vulkan: add conv_transpose_2d operation (llama/16022)
* Vulkan: add conv_transpose_2d operation

* Vulkan: fix typo in conv_transpose_2d shader (s0mp, s0L, s1mp, s1L)

* Vulkan: fix incorrect indentation in conv_transpose_2d shader

* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation

* Vulkan: revert the order of the index calculation and bound check in conv_2d shader

* Vulkan: explicitly check push constants limit in supports_op() for conv_transpose_2d operation.

* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
2025-09-29 15:18:09 +03:00
Jeff Bolz 14723f25a1
vulkan: add RTE variants of exp shader (llama/16165)
This fixes some failures on Turing where "round to zero" rounds to the max f16
value but the CPU reference value is infinite.
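The failure mode can be illustrated with a simplified model of half-precision overflow handling (only the overflow case is modelled here, not full f16 rounding; `f16_convert` is an illustrative helper, not shader code):

```python
F16_MAX = 65504.0  # largest finite half-precision value

def f16_convert(x, mode="rte"):
    """With round-to-nearest-even (RTE), a value beyond the f16
    range converts to infinity, matching the f32 CPU reference.
    With round-toward-zero (RTZ), it clamps to the largest finite
    f16 value instead, so exp() results that should be inf
    mismatch the reference."""
    if x > F16_MAX:
        return float("inf") if mode == "rte" else F16_MAX
    return x
```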
2025-09-29 15:18:08 +03:00
Ruben Ortlam 95b29fab78
vulkan: vec dot matrix multiplication fix (llama/16151)
* vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching

* add odd m/n + odd k test with batching
2025-09-29 15:18:08 +03:00
lhez 4b7f09ac0b
opencl: fix concat crash on win arm64 with Adreno (llama/15944) 2025-09-29 15:18:08 +03:00
lhez 0a7096f4f3
opencl: initial `q8_0` mv support (llama/15732) 2025-09-29 15:18:08 +03:00
Giuseppe Scrivano eae2be0ca2
vulkan: optimize UMA buffer operations and fix driver hangs (llama/16059)
* vulkan: optimize UMA buffer operations and fix driver hangs

The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.

[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang

* vulkan: implement deferred_memset on UMA

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2025-09-29 15:18:08 +03:00
Jeff Bolz 9a6c2036a9
vulkan: fix validation error about VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR (llama/16086) 2025-09-29 15:18:08 +03:00
Georgi Gerganov 8d10ded025
ggml : prepare for development of 0.9.2-dev 2025-09-29 15:18:08 +03:00
Georgi Gerganov d89164a08d
ggml : bump version to 0.9.1 2025-09-29 15:18:05 +03:00
Georgi Gerganov 36778bd8b8
talk-llama : sync llama.cpp 2025-09-20 13:58:28 +03:00
Georgi Gerganov 66ad624d5b
sync : ggml 2025-09-20 13:46:41 +03:00
Ruben Ortlam 76d0934287
vulkan: use vec dot for matrix matrix multiplications (llama/16056)
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions

* use fma instead of dot to fix Nvidia and Apple performance issues
2025-09-20 13:46:39 +03:00
Xuan-Son Nguyen 2ad00d5586
ggml : refactor forward_dup for cpu backend (llama/16062)
* ggml : refactor forward_dup for cpu backend

* clean up a bit

* add quant/dequant perf test
2025-09-20 13:46:39 +03:00
Adrien Gallouët 4d8cd07825
ggml-amx : fix ggml_amx_init() on generic Linux (llama/16049)
Generalize Linux check to `__linux__` to support non-glibc systems (like musl).
Also, return `false` on unknown/untested OS.

Without this commit, the code compiles (with warnings) but fails:

    register_backend: registered backend CPU (1 devices)
    register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
    build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
    system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
    ....
    print_info: n_ctx_orig_yarn  = 262144
    print_info: rope_finetuned   = unknown
    print_info: model type       = 4B
    Illegal instruction (core dumped)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-20 13:46:39 +03:00
Adrien Gallouët 4575f96873
cmake : fix static linking for OpenMP on Unix-like systems (llama/16031)
When compiling with GGML_STATIC=ON, the build process would produce a
binary that was still dynamically linked to OpenMP. This defeats the
purpose of a static build:

    $ cmake -B build \
            -DBUILD_SHARED_LIBS=OFF \
            -DLLAMA_CURL=OFF \
            -DGGML_CCACHE=OFF \
            -DGGML_NATIVE=OFF \
            -DGGML_STATIC=ON

    $ ldd llama-server
            linux-vdso.so.1 (0x0000e1a434e3b000)
            libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
            libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
            libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
            libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
            libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
            /lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)

This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES`
to prioritize `.a` files, forcing CMake to link the static version of
the library.
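The mechanism can be sketched as a CMake fragment (illustrative only; the actual change lives in ggml's build scripts, and restricting the suffix list affects every subsequent `find_library` call):

```cmake
# Prefer static archives when resolving libraries, so that
# find_package(OpenMP) links libgomp.a instead of libgomp.so.
if (GGML_STATIC)
    set(CMAKE_FIND_LIBRARY_SUFFIXES ".a")
endif()
find_package(OpenMP REQUIRED)
```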

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-20 13:46:39 +03:00
Shawn Gu f4a225cea6
opencl: optimize mxfp4 kernels (llama/16037)
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2025-09-20 13:46:39 +03:00
Jeff Bolz 7fcb7e83ec
rename optimize_graph to graph_optimize (llama/16082) 2025-09-20 13:46:39 +03:00
Bowen Han fce6354e0f
CUDA: Optimize PAD_REFLECT_1D (llama/15957)
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D

* use fast_div to improve performance

* Apply suggestion from JohannesGaessler

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Apply suggestion from JohannesGaessler

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* optimize

* use a concise expression to further speedup the cuda kernel

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-20 13:46:38 +03:00
Johannes Gäßler 05bdfd4380
CUDA: fix compilation on CC 6.0 (llama/16091) 2025-09-20 13:46:38 +03:00
Georgi Gerganov 960aaa9904
metal : use function constants for mul_mv_ext kernels (llama/16074)
* metal : use function constants for mul_mv_ext kernels

ggml-ci

* metal : remove NW template argument

ggml-ci

* metal : adjust constants

ggml-ci
2025-09-20 13:46:38 +03:00
Sigbjørn Skjæret 225d7c1d5a
cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (llama/16060) 2025-09-20 13:46:38 +03:00
Georgi Gerganov d37f590a77
metal : improve F32, F16 and BF16 mat-vec multiplication (llama/16057)
* metal : improve F32, F16 and BF16 mat-vec multiplication

ggml-ci

* metal : make the NSG a function constant in mul_mv kernels

ggml-ci
2025-09-20 13:46:38 +03:00
Jhen-Jie Hong 32b6d9c134
metal : avoid call free for non-owned buffer (llama/16067) 2025-09-20 13:46:38 +03:00
Georgi Gerganov 1f24b1df4d
metal : handle nil cv during pipeline creation (llama/16065)
ggml-ci
2025-09-20 13:46:38 +03:00
Chenguang Li c46adc0817
CANN: Remove print (llama/16044)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:46:38 +03:00
Reese Levine 1361f679cc
GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (llama/16018)
* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* some f32 tests passing

* Disable set_rows until it's implemented

* f32 add all tests passing

* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Add templated addition, clean up code

* Get addition and multiplication working

* Implement rms_norm

* Add get_rows implementation

* Add new get_rows files

* Refactor use of wg size entry

* Fix compilation

* Try manually unrolled q4_0 quant

* Revert "Try manually unrolled q4_0 quant"

This reverts commit 77f8b96515f7e640ae4b0e44f066321fbc4a6166.

* Move to constant max wg size

* Check for tensor size in supports_op

* Vectorize f32 and change default workgroup size

* Move f32 get_rows from < 4 to % 4 != 0

* fix linter errors

* Add in-place tests

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
2025-09-20 13:46:37 +03:00
Georgi Gerganov eb2c01f92e
metal : refactor + optimize v2 (llama/15995) 2025-09-20 13:46:10 +03:00
Georgi Gerganov 6458bac4c1
sync : ggml 2025-09-20 13:45:32 +03:00
Johannes Gäßler d452f0cf8c
CUDA: fix FA occupancy, optimize tile kernel (llama/15982) 2025-09-20 13:45:30 +03:00
Eve e96b285011
vulkan: automatically remove unsupported devices (llama/15976)
* remove unsupported vulkan devices

* make this happen during selection instead

* pass by reference
2025-09-20 13:45:30 +03:00
Chenguang Li e32c3b0fd3
CANN: Optimize ggml_cann_set_device (llama/15935)
* CANN: Fix ggml_cann_set_device to avoid redundant device switches

- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.

* CANN: add device default id
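The redundant-switch check described above follows a common pattern, sketched here in Python (`backend_set_device` stands in for the runtime call, aclrtSetDevice in CANN's case; the cached value would be thread-local in a real implementation):

```python
_current_device = None  # thread-local in a real implementation

def set_device(device_id, backend_set_device):
    """Only call into the runtime when the device actually
    changes; returns whether a switch was performed."""
    global _current_device
    if _current_device == device_id:
        return False  # already active, skip the context switch
    backend_set_device(device_id)
    _current_device = device_id
    return True
```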
2025-09-20 13:45:30 +03:00