whisper.cpp

Commit Graph

Author	SHA1	Message	Date
uvos	081dc773a5	ci : add hip quality check (llama/20430) * CI: add hip quality check * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Revert "Update .github/workflows/hip-quality-check.yml" This reverts commit efa0bfcdb01dfac0feee674987a0482d50f46145. * scripts: gcn-cdna-vgpr-check.py: enforce int type for total_vgprs * scripts: gcn-cdna-vgpr-check.py: add flash attention instances to ignore list * Bump ccache version * Add missing separators to list --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-29 15:04:36 +03:00
Reese Levine	551bb82960	ggml webgpu: ops support for qwen3.5 (SET, TRI_SOLVE, SSM_CONV, GATED_DELTA_NET) + GET_ROWS optimization (llama/20687) * Implement l2_norm, set, tri * Add DIAG/SOLVE_TRI * Add SSM_CONV * Better get_rows and gated_delta_net to support qwen3.5 * Clean up, update ops.md * Fix binding_index type for wasm * Fix read write annotations * cleanups	2026-03-29 15:04:36 +03:00
Eve	43c7c0f86c	vulkan: dequantize iq4_xs 4 at a time (llama/20657)	2026-03-29 15:04:36 +03:00
Charles Xu	fea629d00f	cmake : fix build warning when kleidiai is enabled (llama/20457) * cmake : fix build warning when kleidiai is enabled * remove LLAMA_ARG_THREADS from KleidiAI backend	2026-03-29 15:04:36 +03:00
Chenguang Li	2a6de29364	CANN: handle in-place ROPE on non-contiguous f32 tensors (llama/20274) RotaryPositionEmbedding on CANN fails when src and dst share the same non-contiguous buffer (inplace + view), because the operator overwrites source data before it is fully read. Add a branch that detects this case and uses contiguous temporary buffers: copy src to temp, run ROPE into another temp, then copy back to the non-contiguous dst. Fixes 20 failing ROPE tests (f32, v=1, inplace=1). Signed-off-by: noemotiovon <757486878@qq.com>	2026-03-29 15:04:36 +03:00
Masashi Yoshimura	3d004fbf0a	ggml-webgpu: Update the `RMS_NORM` preprocessor and add `L2_NORM` (llama/20665) * Update the preprocessor of RMS_NORM and add L2_NORM. * Fix the name of rms_norm to row_norm.	2026-03-29 15:04:36 +03:00
Masashi Yoshimura	12015a2174	ggml-webgpu: Add supports for `DIAG` and `TRI` (llama/20664) * Add supports for DIAG and TRI. * Remove extra ttype and add a comment for TRI op.	2026-03-29 15:04:36 +03:00
Chenguang Li	dfba84cb47	CANN: support flash attention for head dim not multiple of 16, fix ALiBi slope offset (llama/20031) - Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2, then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp). - Fix aclnn_get_slope second-part offset: use ggml_type_size(dtype) instead of sizeof(float) so ALiBi slopes are correct when dtype is F16 (e.g. GQA with 48 heads); fixes buffer overflow and large numerical errors in those cases.	2026-03-29 15:04:36 +03:00
Reese Levine	d6a0f0d075	Move to no timeout for WaitAny in graph submission to avoid deadlocks in some cases on llvm-pipe backends (llama/20618)	2026-03-29 15:04:36 +03:00
Shaw Nguyen	14caedfa18	ggml-cpu/x86: fix unused changemask warning in repack (llama/20692)	2026-03-29 15:04:36 +03:00
uvos	61c7cd024d	HIP : ignore return of hipMemAdvise [no ci] (llama/20696)	2026-03-29 15:04:36 +03:00
Krishna Sridhar	e222814fc4	hexagon: add neg, exp, sigmoid, softplus ops, cont, repeat ops (llama/20701) Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear attention layers. These ops follow the existing unary-ops pattern with VTCM DMA double-buffering. - neg: negate via scale by -1.0 - exp: uses existing hvx_exp_f32 HVX intrinsics - sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics - softplus: log(1 + exp(x)) scalar fallback - CONT reuses the existing CPY infrastructure since making a tensor contiguous is equivalent to a same-type copy. - REPEAT implements tiled memory copy with multi-threaded execution via the worker pool, supporting f32 and f16 types. The kernel parallelizes across output rows and uses memcpy for each tile. Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-03-29 15:04:36 +03:00
Ruben Ortlam	16ca5e6fb1	vulkan: disable mmvq on Intel Windows driver (llama/20672) * vulkan: disable mmvq on Intel Windows driver * improve comment	2026-03-29 15:04:36 +03:00
Kevin Hannon	906aef3da8	ggml-blas: set mkl threads from thread context (llama/20602) * ggml blas: set mkl threads from thread context * add code to run blas locally	2026-03-29 15:04:36 +03:00
Taimur Ahmad	c890a9d9b4	ggml-cpu: fix RVV checks in quants and repacking (llama/20682) * ggml-cpu: refactor quants.c; add rvv check * ggml-cpu: refactor; disable generic fallback	2026-03-29 15:04:36 +03:00
Ruben Ortlam	0ad6ceef59	vulkan: async and event fixes (llama/20518) * vulkan: fix event wait submission, event command buffer reset * fix event command buffer reset validation error * also reset command buffers before reuse * use timeline semaphores instead of fences for event_synchronize * don't use initializer list for semaphore wait info * use multiple events to avoid reset issues * fix event reuse issue with multiple vectors * add semaphore wait condition also if compute_ctx already exists * remove event pending stage	2026-03-29 15:04:36 +03:00
Justin Bradford	ab7d305b75	kleidiai : fix MUL_MAT support for batched (3D) inputs (llama/20620) * kleidiai : fix MUL_MAT support for batched (3D) inputs The supports_op() check incorrectly rejected MUL_MAT operations with 3D inputs (ne[2] > 1), but the actual compute_forward_qx() implementation handles batched inputs correctly via a loop over ne12. This caused models with Q4_0/Q8_0 weights to crash during graph scheduling when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during loading (tested with 2D inputs) but the runtime used 3D inputs. Also relax the buffer check to allow supports_op() to be called during weight loading when src[0]->buffer is NULL. Fixes #20608 * Kleidiai support_ops should only return true for 3D inputs, not also 4D	2026-03-29 15:04:36 +03:00
Ruben Ortlam	49adc8b470	vulkan: allow graphics queue only through env var (llama/20599) * vulkan: avoid graphics queue on non-RADV AMD drivers * avoid graphics queues on small GPUs * change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE * reenable transfer queue if graphics queue is not used	2026-03-29 15:04:36 +03:00
Neo Zhang	6494251197	enhance UPSCALE to support all UT cases (llama/20637) * [SYCL] enhance UPSCALE to support more cases * rm test case result of SYCL1	2026-03-29 15:04:36 +03:00
Martin Klacer	9232af59ba	kleidiai: add data type check to get_tensor_traits (llama/20639) * kleidiai: add data type check to get_tensor_traits * Added check for F16 data type into get_tensor_traits path with input data not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8) Signed-off-by: Martin Klacer <martin.klacer@arm.com> Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7 * updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp updated kleidiai.cpp file as per suggestion Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-29 15:04:36 +03:00
Ruben Ortlam	724ea71cf9	vulkan: fix flash attention dot product precision (llama/20589)	2026-03-29 15:04:36 +03:00
Aman Gupta	dae7781052	CUDA: GDN hide memory latency (llama/20537)	2026-03-29 15:04:36 +03:00
Sigbjørn Skjæret	1335dfa785	sycl : fix for untransposed GDA recurrent state (llama/20583)	2026-03-29 15:04:36 +03:00
KITAITI Makoto	76684141a5	ruby : fix dangling pointers, memory leak, and SEGV on parallel transcription (#3715) * Prevent dangling pointers * Use proper free function * Free callback containers * Set default log callback when nil is passed to log_set * Raise error if callbacks set when parallel transcription * Bump version to 1.3.7 * Make tests follow spec change * Add note on parallel transcription and callbacks * Update signature of Whisper.log_set [skip ci]	2026-03-22 02:03:00 +09:00
Georgi Gerganov	9386f23940	release : v1.8.4	2026-03-19 10:40:13 +02:00
Georgi Gerganov	ef3463bb29	ci : update workflows	2026-03-18 22:43:38 +02:00
Georgi Gerganov	4bbce1e5b2	benches : update	2026-03-18 22:34:51 +02:00
Georgi Gerganov	f5b477ab09	sync : ggml	2026-03-18 15:18:24 +02:00
Georgi Gerganov	b2be16208d	ggml : bump version to 0.9.8 (ggml/1442)	2026-03-18 15:18:24 +02:00
Georgi Gerganov	945d3151d9	ggml : restore ggml_type_sizef() to avoid major version bump (ggml/1441)	2026-03-18 15:18:24 +02:00
lohopupa	dc96116622	fix: VAD time mapping timestamp drift caused by overlap samples (#3711) * whisper : fix VAD segment overlap boundary handling - Use original segment length (pre-overlap) for vad_end in the time mapping table, so segment boundaries are preserved accurately Claude Sonnet 4.6 (Low) * whisper : remove intermediate VAD time mapping points Now that segment boundaries are mapped accurately, the intermediate point interpolation is no longer necessary. --------- Co-authored-by: Lohopupa <lohopupa@gmail.com>	2026-03-17 07:19:08 +01:00
Alan	79218f51d0	go : handle EOF correctly in model download (#3671)	2026-03-16 13:44:18 +02:00
Aiudadadadf	975b979834	py : replace deprecated openvino-dev with openvino>=2023.3.0 (#3678) * models: replace deprecated openvino-dev with openvino>=2023.3.0 for Python 3.12+ compat * models: remove unused openvino.tools.mo import from convert-whisper-to-openvino.py	2026-03-16 13:41:54 +02:00
Gaël James	21665eab4c	examples : Allow max_len to be used for any output format (#3679)	2026-03-16 13:33:56 +02:00
Igor Loskutov	136dc2eb12	server: return proper HTTP status codes for error responses (#3707) Several error paths in the /inference and /load endpoints returned HTTP 200 with a JSON error body, making it impossible for clients to distinguish errors from successful responses by status code. Set 400 for client errors (missing file field, unreadable audio, missing/invalid model) and 500 for server errors (ffmpeg conversion failure). The two existing status-code sites (499 for client disconnect, 500 for processing failure) are unchanged.	2026-03-16 13:33:06 +02:00
Georgi Gerganov	27fa20774a	ggml : try fix arm build (#0)	2026-03-16 13:10:15 +02:00
Georgi Gerganov	2bc630f197	talk-llama : sync llama.cpp	2026-03-16 13:10:15 +02:00
Georgi Gerganov	ab1252c19e	sync : ggml	2026-03-16 13:10:15 +02:00
David366AI	d4bc312169	ggml : extend im2col f16 (ggml/1434) * examples/yolo: fix load_model memory leak * fix/issue-1433 ggml_compute_forward_im2col_f16 assert error * fix/issue-1433	2026-03-16 13:10:15 +02:00
Georgi Gerganov	81ea958719	common : add nvfp4 (ggml/0)	2026-03-16 13:10:15 +02:00
Johannes Gäßler	d7926e62d4	CUDA: limit number of FA stream-k CUDA blocks (llama/20586)	2026-03-16 13:10:15 +02:00
Pascal	2fb6aea8ad	ggml: avoid creating CUDA context during device init (llama/20595)	2026-03-16 13:10:15 +02:00
MoonShadow	b327a321a2	ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain (llama/20536) * ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain returns hipErrorInvalidValue because the hint is not applicable to UMA systems. The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on APU systems such as AMD Strix Halo (gfx1151). Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it without error checking and clear any resulting error with hipGetLastError(). Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory issues on APU systems, and store totalGlobalMem in device info. Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits hipMallocManaged to ~64GB regardless of available system RAM. A fix has been submitted upstream: https://github.com/ROCm/rocm-systems/pull/4077 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ggml/hip: remove unrelated changes, keep only hipMemAdviseSetCoarseGrain fix --------- Co-authored-by: moonshadow-25 <moonshadow-25@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-16 13:10:15 +02:00
Bartowski	6770239830	ggml : guard against sumq2 being 0 in IQ4_NL (llama/20460)	2026-03-16 13:10:15 +02:00
PikaPikachu	55c66106af	cuda : add RDNA4-specific MMVQ parameter table for bs=1 decode (llama/19478) * mmvq: add RDNA3/RDNA4-specific parameter table (nwarps=8, rows=1) * mmvq: add dedicated RDNA3 parameter table * mmvq: exclude RDNA3.5 (gfx1150/1151) from RDNA3 table	2026-03-16 13:10:15 +02:00
Ruben Ortlam	cd02195b8f	vulkan: use graphics queue on AMD (llama/20551) * vulkan: use graphics queue on AMD for slightly better performance * disable async transfer queue on AMD	2026-03-16 13:10:15 +02:00
Georgi Gerganov	b312018435	metal : add FA specialization for HSK = 320, HSV = 256 (llama/20549)	2026-03-16 13:10:15 +02:00
Max Krasnyansky	55f8cfdaed	hexagon: Q4_0 and MXFP4 repack fixes (llama/20527) * hexagon: fix tail corruption with rows sizes not multiple of 256 * hexagon: use different stride for repacking partial blocks * hex-mm: update repack and kernels to avoid shuffles for full 256-element blocks Previous commit changed the repacking to use even:odd (0:1,2:3,..) packing instead of the original (0:128,1:129,...) packing in order to fix tail corruption. Since the mm kernels already deal with partial tails we can use even:odd packing only for the last block. This avoids the performance penalty of having to shuffle to zip the elements in the common case. * hex-mm: update rmpy x8 for better optimizations * hex-mm: tighten supported MUL_MAT checks to avoid spurious failures * hex-mm: use vzero to init accumulators * hex-mm: properly call partial rmpy_x8	2026-03-16 13:10:15 +02:00
Neo Zhang	c5f9a49b51	add op gated_delta_net (llama/20455)	2026-03-16 13:10:15 +02:00
Adrien Gallouët	93d09fdb23	ggml : add native AVX512-FP16 support for F16 operations (llama/20529) The overall benchmark speed remains almost the same because the CPU is now calculating faster than the RAM can deliver the data. (See perf stat results below showing 2.7 billion fewer instructions). Also note that this path will be only enabled for native build or with custom flags. now: ``` Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128': 189,073.52 msec task-clock # 14.658 CPUs utilized 404 context-switches # 2.137 /sec 19 cpu-migrations # 0.100 /sec 372,390 page-faults # 1.970 K/sec 310,877,195,595 instructions # 0.54 insn per cycle 581,071,530,602 cycles # 3.073 GHz 19,352,107,994 branches # 102.352 M/sec 48,304,438 branch-misses # 0.25% of all branches 84,998,431,152 L1-dcache-loads # 449.552 M/sec 12,186,410,279 L1-dcache-load-misses # 14.34% of all L1-dcache accesses 12.899358742 seconds time elapsed 187.823044000 seconds user 1.253416000 seconds sys ``` before: ``` Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128': 190,594.56 msec task-clock # 14.652 CPUs utilized 436 context-switches # 2.288 /sec 22 cpu-migrations # 0.115 /sec 372,782 page-faults # 1.956 K/sec 313,574,921,966 instructions # 0.54 insn per cycle 586,064,970,425 cycles # 3.075 GHz 19,585,778,563 branches # 102.761 M/sec 48,437,488 branch-misses # 0.25% of all branches 86,219,336,628 L1-dcache-loads # 452.370 M/sec 12,232,085,771 L1-dcache-load-misses # 14.19% of all L1-dcache accesses 13.007923164 seconds time elapsed 189.395316000 seconds user 1.202612000 seconds sys ``` Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-16 13:10:15 +02:00
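
The CANN flash attention entry above (dfba84cb47) describes padding the head dimension D up to D_padded = GGML_PAD(D, 16), running the attention kernel, then slicing the output back to D. The following is a minimal sketch of that pad-then-slice idea in plain Python, not the CANN implementation: the helper names (`pad_to_multiple`, `pad_head_dim`, `slice_head_dim`) are hypothetical, rows are ordinary lists rather than device tensors, and the rounding formula assumes the alignment is a power of two, as the `GGML_PAD` macro does.

```python
def pad_to_multiple(x: int, n: int) -> int:
    """Round x up to the next multiple of n; n must be a power of two."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    return (x + n - 1) & ~(n - 1)


def pad_head_dim(rows, n=16):
    """Zero-pad every row of length D to D_padded; return (padded_rows, D)."""
    d = len(rows[0])
    d_padded = pad_to_multiple(d, n)
    padded = [row + [0.0] * (d_padded - d) for row in rows]
    return padded, d


def slice_head_dim(rows, d):
    """Drop the padding again so every row has the original length D."""
    return [row[:d] for row in rows]


if __name__ == "__main__":
    q = [[1.0] * 72, [2.0] * 72]       # head dim 72 is not a multiple of 16
    q_padded, d = pad_head_dim(q)      # rows now have 80 elements
    # ... an attention kernel would run on q_padded here (assumption) ...
    out = slice_head_dim(q_padded, d)  # back to 72 elements per row
    print(len(q_padded[0]), len(out[0]))
```

Zero padding is used here because extra zero columns do not change Q·K dot products; whether the real kernel needs further handling is not stated in the commit message.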


4174 Commits
