whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Johannes Gäßler	0dda27bc0b	CUDA: fix crash on large batch size for quant. MoE (llama/13537)	2025-05-19 14:58:39 +03:00
Johannes Gäßler	ffa4720f25	CUDA: faster Deepseek FA, add Turing support (llama/13435)	2025-05-19 14:58:39 +03:00
bandoti	9b8eea28b5	cmake: simplify vulkan shader test logic (llama/13263)	2025-05-19 14:58:39 +03:00
Jeff Bolz	162bbe8220	vulkan: KHR_coopmat flash attention (llama/13506) This shader uses coopmat1 to do the QK^T multiply. The PV multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.	2025-05-19 14:58:39 +03:00
Jeff Bolz	a221288dc6	vulkan: workaround FA compile failures on macos (llama/13517)	2025-05-19 14:58:39 +03:00
Georgi Gerganov	08436716ae	metal : use FA-vec kernel up to batch size 20 (llama/13496) * batched-bench : fix pp batch contents * metal : optimize multi-sequence FA vec kernel ggml-ci * metal : use FA-vec kernel up to batch size 20 ggml-ci	2025-05-19 14:58:39 +03:00
Georgi Gerganov	e11fc21e6c	metal : optimize multi-sequence FA vec kernel (llama/13493) * batched-bench : fix pp batch contents * metal : optimize multi-sequence FA vec kernel ggml-ci	2025-05-19 14:58:39 +03:00
Dan Johansson	a77a924b20	ggml-cpu: Update KleidiAI to v1.6 and fix include directives (llama/13509) Signed-off-by: Dan Johansson <dan.johansson@arm.com>	2025-05-19 14:58:39 +03:00
Johannes Gäßler	405b9c77ad	mnist: fix segmentation fault (ggml/1227)	2025-05-19 14:58:39 +03:00
Diego Devesa	9c3bfc1499	ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)	2025-05-19 14:58:39 +03:00
Daniel Tang	5b7797f674	ggml : Fix missing backtrace on Linux (ggml/1228) * Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1 * Fixed lldb attach * Simplify by having the child do ggml_print_backtrace_symbols	2025-05-19 14:58:39 +03:00
Xuan-Son Nguyen	75e9a840c5	ggml : add mrope kernel for metal (llama/13457)	2025-05-13 13:59:21 +03:00
Georgi Gerganov	41ed62bdbc	metal : optimize MoE for large batches (llama/13388)	2025-05-13 13:59:21 +03:00
lhez	029c8837f8	opencl: remove unnecessary assert for `add` (llama/13257)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	5d8b068249	llama/ggml: add LLM training support (llama/10544) * llama/ggml: add LLM training support more compact progress bar llama_save_model_to_file llama_opt_param_filter ggml_graph_dup force_grads refactor ggml_opt, fix test-opt * remove logits_all * refactor CUDA implementation for ACC * reset graph at beginning of opt period	2025-05-13 13:59:21 +03:00
Dan Johansson	93ef22657e	ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (llama/13053) * ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel Signed-off-by: Dan Johansson <dan.johansson@arm.com> * code review fixes Signed-off-by: Dan Johansson <dan.johansson@arm.com> * adds a comment that clarifies barrier usage Signed-off-by: Dan Johansson <dan.johansson@arm.com> --------- Signed-off-by: Dan Johansson <dan.johansson@arm.com> Co-authored-by: Charles Xu <charles.xu@arm.com>	2025-05-13 13:59:21 +03:00
Johannes Gäßler	866f685bbc	CUDA: fix misaligned synchronization in FA (llama/13469)	2025-05-13 13:59:21 +03:00
Atharva Dubey	250bcc041a	enable dpcpp nightly builds with libraries (llama/13406)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	90b17a99bf	CUDA: fix crash with partial offloading of MoE (llama/13439)	2025-05-13 13:59:21 +03:00
David Huang	e1b2ace0f8	Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (llama/13386)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	6db0e01db6	CUDA: fix race conditions FlashAttention kernels (llama/13438)	2025-05-13 13:59:21 +03:00
Johannes Gäßler	16f3546f38	CUDA: fix FlashAttention on Turing (llama/13415)	2025-05-13 13:59:21 +03:00
Jeff Bolz	a04b329ad1	vulkan: scalar flash attention implementation (llama/13324) * vulkan: scalar flash attention implementation * vulkan: always use fp32 for scalar flash attention * vulkan: use vector loads in scalar flash attention shader * vulkan: remove PV matrix, helps with register usage * vulkan: reduce register usage in scalar FA, but perf may be slightly worse * vulkan: load each Q value once. optimize O reduction. more tuning * vulkan: support q4_0/q8_0 KV in scalar FA * CI: increase timeout to accommodate newly-supported tests * vulkan: for scalar FA, select between 1 and 8 rows * vulkan: avoid using Float16 capability in scalar FA	2025-05-13 13:59:21 +03:00
Alberto Cabrera Pérez	45d8b2352e	sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (llama/12858) * sycl : Implemented reorder Q4_0 mmvq Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> * sycl : Fixed mmvq being called when reorder is disabled * sycl : Improved comments in the quants header Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> * Use static_assert * safe_div -> ceil_div * Clarify qi comment * change the reorder tensor from init to execute OP * dbg * Undo changes to test-backend-ops * Refactor changes on top of q4_0 reorder fix * Missing Reverts * Refactored opt_for_reorder logic to simplify code path * Explicit inlining and unroll * Renamed mul_mat_algo enum for consistency --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com> Co-authored-by: romain.biessy <romain.biessy@codeplay.com>	2025-05-13 13:59:21 +03:00
Johannes Gäßler	2d436bfbfb	CUDA: FA support for Deepseek (Ampere or newer) (llama/13306) * CUDA: FA support for Deepseek (Ampere or newer) * do loop unrolling via C++ template	2025-05-13 13:59:21 +03:00
Johannes Gäßler	4b7cbb62ef	CUDA: fix crash on large batch size for MoE models (llama/13384)	2025-05-13 13:59:21 +03:00
Radoslav Gerganov	e27c91f6d6	rpc : add rpc_msg_set_tensor_hash_req (llama/13353) * rpc : add rpc_msg_set_tensor_hash_req Use a dedicated struct for the request of RPC_CMD_SET_TENSOR_HASH which makes the code cleaner. * fix	2025-05-13 13:59:21 +03:00
Jeff Bolz	e46df4850f	vulkan: Allow up to 4096 elements for mul_mat_id row_ids (llama/13326) This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf: GGML_ASSERT(nei0 * nei1 <= 3072); The tensor is 8 x 512. Increase this array size to accommodate.	2025-05-13 13:59:21 +03:00
Alberto Cabrera Pérez	e8a7f1b7bb	sycl: addressing non-contiguous src1 mul_mats (nc and batched) (llama/13343) * sycl: fixed non-contiguous src1 mul_mats (nc and batched) * Fixed wrong static_cast inside kernel	2025-05-13 13:59:21 +03:00
R0CKSTAR	09e6b66025	cuda : remove nrows_x in mul_mat_q_process_tile (llama/13325) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-05-07 21:00:32 +03:00
Johannes Gäßler	d41cf26a0f	CUDA: mix virt/real CUDA archs for GGML_NATIVE=OFF (llama/13135)	2025-05-07 21:00:32 +03:00
Akarshan Biswas	3c67195be9	SYCL: Disable reorder optimize by default and stop setting tensor extras when optimize is disabled (llama/13254) * SYCL: Do not set tensor extras when reorder optimize is disabled * SYCL: Disable reorder optimize by default	2025-05-07 21:00:32 +03:00
Johannes Gäßler	f9f78a773f	CUDA: fix bad asserts for partial offload (llama/13337)	2025-05-07 21:00:32 +03:00
Johannes Gäßler	be55e25cac	CUDA: fix --split-mode row for MMQ (llama/13323)	2025-05-07 21:00:32 +03:00
Johannes Gäßler	2ffdda99e8	CUDA: fix logic for clearing padding with -ngl 0 (llama/13320)	2025-05-07 21:00:32 +03:00
Akarshan Biswas	9bbedc51cc	SYCL: Disable mul_mat kernels for noncontiguous tensor b (llama/13308) ggml-ci	2025-05-07 21:00:32 +03:00
Diego Devesa	1e1fa27add	rpc : use backend registry, support dl backends (llama/13304)	2025-05-07 21:00:32 +03:00
Aaron Teo	e1bdd148c5	ggml : activate s390x simd for Q3_K (llama/13301) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-05-07 21:00:32 +03:00
Johannes Gäßler	7fa8bb303f	CUDA: fix race condition in MMQ stream-k fixup (llama/13299)	2025-05-07 21:00:32 +03:00
Johannes Gäßler	7564f5e6f1	CUDA: fix race condition in MMQ ids_dst (llama/13294)	2025-05-07 21:00:32 +03:00
Jeff Bolz	22ba2e27ce	vulkan: Additional type support for unary, binary, and copy (llama/13266) Support f16->f32 copy. Support f16->f16 and f32->f32 unary ops. Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.	2025-05-07 21:00:32 +03:00
Georgi Gerganov	5eac2a3fbb	vulkan : fix lint (llama/0)	2025-05-07 15:39:32 +03:00
shalinib-ibm	42938398f9	ggml : Enable MMA for BF16 in llamafile_sgemm (llama/13148) This patch upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for BF16 data type. This change results in 9x - 40x gains in total speed S t/s (i.e. all tokens/total time), across various batch sizes tested using llama-batched-bench benchmark. The patch is tested with Meta-Llama-3-8B, and Mistral-7B models (BF16 models generated by using llama-quantize from corresponding FP32 models) on an IBM POWER10 machine. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2025-05-07 15:39:32 +03:00
Justin Santa Barbara	a8fe90ae15	rpc : avoid uninitialized memory in serialize_tensor (llama/13210) Zero out the name and padding buffers.	2025-05-07 15:39:32 +03:00
Jesse Gross	c5a5a2da5b	ggml: Don't assert fail when tensor data changes (llama/13222) The following scenario will cause an assertion failure in the graph allocator: - Build and allocate a graph containing a tensor with a non-NULL data pointer - Build and allocate a new graph where that data is NULL Result: ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed This happens during revalidation because we think that memory should have been previously allocated based on the current graph but in reality the previous graph was different. In this situation, we should do a full reallocation pass.	2025-05-07 15:39:32 +03:00
Diego Devesa	8316bfd82b	build : fix build info on windows (llama/13239) * build : fix build info on windows * fix cuda host compiler msg	2025-05-07 15:39:32 +03:00
Jeff Bolz	fd1cb9fc12	vulkan: Add bfloat16 support (llama/12554) * vulkan: Add bfloat16 support This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension. It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that. The coopmat support also requires a glslc that supports the extension, which currently requires a custom build. * vulkan: Support bf16 tensors without the bf16 extension or coopmat support Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available. * vulkan: bfloat16 fixes (really works without bfloat16 support now) * vulkan: fix spirv-val failure and reenable -O	2025-05-07 15:39:32 +03:00
Jeff Bolz	17f6b8225e	vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (llama/13191) * vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader	2025-05-07 15:39:32 +03:00
Acly	6374ea32ca	vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204) * vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW) * review: remove src_x/y < 0 checks; add performance tests	2025-05-07 15:39:32 +03:00
Daniel Bevenius	09846f4e12	whisper: remove MSVC warnings pragmas (#3090) * ggml : remove MSVC warnings pragmas This commit removes the MSVC-specific pragmas as these are now handled in CMakeLists.txt. * whisper : remove MSVC warning pragmas This commit removes the MSVC-specific pragmas. These are now handled in the CMakeLists.txt file.	2025-05-05 13:09:35 +02:00


830 Commits