whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Nechama Krashinski	f2f7320817	sycl: add F16 support for GGML_OP_CEIL (llama/19306) * Fix SYCL CEIL operator * sycl: implement GGML_OP_CEIL	2026-02-08 09:29:10 +02:00
Jeff Bolz	cea22b3075	vulkan: For coopmat2 FA, use fp16 accumulators for the final result (llama/19376) The cpu and cuda backends use fp16 for the VKQ accumulator type, this change does the same for vulkan. This helps particularly with large head sizes which are very register-limited. I tried this for the coopmat1 path and it slowed down a bit. I didn't try for scalar. I applied the softmax bias that the cuda backend uses to avoid overflow, although I was not able to reproduce the original bug without it.	2026-02-08 09:29:10 +02:00
Jeff Bolz	c1b63354bb	vulkan: make FA mask/softcap enables spec constants (llama/19309) * vulkan: make FA mask/softcap enables spec constants * don't specialize for sinks * bump timeout a little bit	2026-02-08 09:29:10 +02:00
Georgi Gerganov	776cf61857	metal : skip loading all-zero mask (llama/19337) * metal : skip loading all-zero mask * cont : minor	2026-02-08 09:29:10 +02:00
Georgi Gerganov	2a7d5490f1	cuda : cuda graphs now compare all node params (llama/19383)	2026-02-08 09:29:10 +02:00
Georgi Gerganov	34d332aca5	metal : adaptive CPU/GPU interleave based on number of nodes (llama/19369)	2026-02-08 09:29:10 +02:00
Jeff Bolz	a567c140a3	vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281) Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases. Apply this optimization when the mask is relatively large (i.e. prompt processing).	2026-02-08 09:29:10 +02:00
Georgi Gerganov	0781df2518	metal : add diag (llama/19330)	2026-02-08 09:29:10 +02:00
Oleksandr Kuvshynov	932def3198	vulkan: fix GPU deduplication logic. (llama/19222) * vulkan: fix GPU deduplication logic. As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the (same uuid, same driver) logic is problematic for windows+intel igpu. Let's just avoid filtering for MoltenVK which is apple-specific, and keep the logic the same as before 88d23ad5 - just dedup based on UUID. Verified that MacOS + 4xVega still reports 4 GPUs with this version. * vulkan: only skip dedup when both drivers are moltenVk	2026-02-08 09:29:10 +02:00
Jeff Bolz	5a786f7648	vulkan: Set k_load_shmem to false when K is too large (llama/19301)	2026-02-08 09:29:10 +02:00
Jeff Bolz	e0a3f393ad	vulkan: fix non-contig rope (llama/19299)	2026-02-08 09:29:10 +02:00
will-lms	eecc9bfa69	metal : add missing includes (llama/19348)	2026-02-08 09:29:10 +02:00
Kevin Pouget	2763054f99	ggml-virtgpu: make the code thread safe (llama/19204) * ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function * ggml-virtgpu: deprecate buffer_type is_host remoting not necessary * ggml-virtgpu: stop using static vars as cache The static init isn't thread safe. * ggml-virtgpu: protect the use of the shared memory to transfer data * ggml-virtgpu: make the remote calls thread-safe * ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory * ggml-virtgpu: add a cleanup function for consistency * ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing * fix style and ordering * Remove the static variable in apir_device_get_count * ggml-virtgpu: improve the logging * fix review minor formatting changes	2026-02-08 09:29:10 +02:00
Aman Gupta	4685ec9555	ggml-cpu: use LUT for converting e8->f32 scales on x86 (llama/19288) * ggml-cpu: use LUT for converting e8->f32 scales on x86 * add dispatch based on macro	2026-02-08 09:29:10 +02:00
Georgi Gerganov	5dda94dd2e	metal : add solve_tri (llama/19302)	2026-02-08 09:29:10 +02:00
Ruben Ortlam	aa34558b6f	vulkan: disable coopmat1 fa on Nvidia Turing (llama/19290)	2026-02-08 09:29:10 +02:00
Aman Gupta	8eede801e3	CUDA: use mmvq for mul-mat-id for small batch sizes (llama/18958) * CUDA: use mmvq for mul-mat-id for small batch sizes * add mmvq too * Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs * templatize multi_token_path	2026-02-08 09:29:10 +02:00
Georgi Gerganov	ce8a2da620	metal : minor cleanup (llama/19251)	2026-02-08 09:29:10 +02:00
Oliver Simons	698265d754	CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (llama/19053) By providing stride_* variables as size_t (i.e., 64-bit) the compiler can correctly unroll the [two for-loops](`557515be1e/ggml/src/ggml-cuda/mmq.cuh (L3789-L3816)`) on BW. This gives some perf for prefill/pp phase on BW, while not affecting other SMs: \| GPU \| Model \| Test \| t/s master \| t/s osimons/fix_bw_mmq_fixup_kernel \| Speedup \| \|:--------------------------------------------------------\|:----------------------\|:-------\|-------------:\|--------------------------------------:\|----------:\| \| NVIDIA RTX 6000 Ada Generation \| gpt-oss 20B MXFP4 MoE \| pp8096 \| 8404.05 \| 8375.79 \| 1.00 \| \| NVIDIA RTX 6000 Ada Generation \| llama 3B Q4_K_M \| pp8096 \| 16148.93 \| 16019.60 \| 0.99 \| \| NVIDIA RTX 6000 Ada Generation \| llama 8B Q4_0 \| pp8096 \| 8008.29 \| 7978.80 \| 1.00 \| \| NVIDIA RTX 6000 Ada Generation \| nemotron_h 9B BF16 \| pp8096 \| 4263.16 \| 4248.53 \| 1.00 \| \| NVIDIA RTX 6000 Ada Generation \| nemotron_h 9B Q4_K_M \| pp8096 \| 5165.11 \| 5157.43 \| 1.00 \| \| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition \| gpt-oss 20B MXFP4 MoE \| pp8096 \| 12582.80 \| 12758.37 \| 1.01 \| \| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition \| llama 3B Q4_K_M \| pp8096 \| 16879.10 \| 17619.47 \| 1.04 \| \| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition \| llama 8B Q4_0 \| pp8096 \| 10649.90 \| 10982.65 \| 1.03 \| \| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition \| nemotron_h 9B BF16 \| pp8096 \| 7717.73 \| 7716.22 \| 1.00 \| \| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition \| nemotron_h 9B Q4_K_M \| pp8096 \| 7301.90 \| 7370.38 \| 1.01 \|	2026-02-08 09:29:10 +02:00
George	57107b2bf8	ggml: added cleanups in ggml_quantize_free (llama/19278) Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.	2026-02-08 09:29:10 +02:00
Gaurav Garg	6ec362d2e0	cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (llama/19227) Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.	2026-02-08 09:29:10 +02:00
lhez	591072fcc8	opencl: refactor some ops, concat, repeat, tanh and scale (llama/19226) * opencl: refactor concat * opencl: refactor repeat * opencl: refactor tanh * opencl: enable fp16 for tanh * opencl: refactor scale * opencl: fix unused variables	2026-02-08 09:29:10 +02:00
Aman Gupta	871063016d	ggml-cpu: FA split across kv for faster TG (llama/19209) * ggml-cpu: split across kv for faster TG * simplify sinks application * add ref impl	2026-02-08 09:29:10 +02:00
Neo Zhang	c4003da2b8	Remove support for Nvidia & AMD GPU, because the oneAPI plugin for Nvidia & AMD GPU is unavailable: download/installation channels are out of work. (llama/19246) User can't build up the software for Nvidia & AMD GPU. rm the oneMath since it is only used in NV and AMD code path.	2026-02-08 09:29:10 +02:00
Tamar	74353e90a1	sycl: implement GGML_OP_TOP_K (llama/19242)	2026-02-08 09:29:10 +02:00
Georgi Gerganov	73e04555eb	metal : support virtual devices (llama/18919) * metal : support virtual devices * cont : manage buffer type context memory * metal : add events * cont : implement cpy_tensor_async	2026-02-08 09:29:10 +02:00
Johannes Gäßler	625c8d863e	ggml-backend: fix async set/get fallback sync (llama/19179)	2026-02-08 09:29:10 +02:00
Christian Kastner	0e219ebf89	docs : Minor cleanups (llama/19252) * Update old URLs to github.com/ggml-org/ * Bump copyrights	2026-02-08 09:29:10 +02:00
Nikhil Jain	a0256b8159	Remove pipeline cache mutexes (llama/19195) * Remove mutex for pipeline caches, since they are now per-thread. * Add comment * Run clang-format * Cleanup * Run CI again * Run CI once more * Run clang-format	2026-02-08 09:29:10 +02:00
Max Krasnyansky	aca5953d8d	Bump cmake max version (needed for Windows on Snapdragon builds) (llama/19188) * Bump max cmake version (needed for Windows on Snapdragon builds) * cmake: move max version setting into ggml/CMakeLists	2026-02-08 09:29:10 +02:00
nullname	9b927dd849	ggml-hexagon: flash-attention and reduce-sum optimizations (llama/19141) * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * hexagon: optimize reduce-sum for v75+ * hexagon: always keep row_sums in sf/fp32 * ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT * fix compiling error after rebase --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-02-08 09:29:10 +02:00
shaofeiqi	db9c88744d	opencl: add optimized q8_0 mm kernel for adreno (llama/18871) * Add Q8_0 OpenCL kernel Co-authored-by: yunjie <yunjie@qti.qualcomm.com> * opencl: fix build for non-adreno * opencl: refactor q8_0 * opencl: enforce subgroup size of 64 for adreno for q8_0 * For A750 and older generations, subgroup size can be 64 or 128. This kernel assumes subgroup size 64. * opencl: suppress warning when adreno kernels are disabled --------- Co-authored-by: yunjie <yunjie@qti.qualcomm.com> Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-02-08 09:29:10 +02:00
Simon Redman	efd6344939	Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b (llama/19194)	2026-02-08 09:29:10 +02:00
Georgi Gerganov	06e3750407	ggml : bump version to 0.9.6 (ggml/1423)	2026-02-08 09:29:10 +02:00
Georgi Gerganov	fc1a3e579e	cmake : remove unused file (ggml/1419)	2026-02-08 09:29:10 +02:00
KITAITI Makoto	941bdabbe4	ruby : add `Whisper::Context::Params`, fix token memory management (#3647) * Don't convert to temporary VALUE * Define Whisper::Context::Params * Add test for Whisper::Context::Params * Implement Whisper::Context::Params * Add tests for Context::Params * Fix Whisper::Token memory management * Add test for token_timestamps * Make Context accept Context::Params * Make Context::Params.new accept keyword args * Add test for Context::Params.new with keyword args * Add signature of Context::Params * Add example for Whisper::Token * Fix typos * Revert "Don't convert to temporary VALUE" This reverts commit `dee66e7384`. * Hold Token#text as Ruby objectd * Don't use pointer for ruby_whisper_context_params.params * Use RUBY_DEFAULT_FREE instead of custom function * Update bindings/ruby/README.md Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> * Add document for Whisper::Context::Params --------- Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2026-02-04 20:33:09 +09:00
KITAITI Makoto	aa1bc0d1a6	ruby : add `VAD::Context#segments_from_samples`, allow Pathname, etc. (#3633) * ruby : Bump version to 1.3.6 * Fix code in example * Add sample code to transcribe from MemoryView * Define GetVADContext macro * Use GetVADContext * Extract parse_full_args function * Use parse_full_args in ruby_whisper_full_parallel * Free samples after use * Check return value of parse_full_args() * Define GetVADParams macro * Add VAD::Context#segments_from_samples * Add tests for VAD::Context#segments_from_samples * Add signature for VAD::Context#segments_from_samples * Add sample code for VAD::Context#segments_from_samples * Add test for Whisper::Context#transcribe with Pathname * Make Whisper::Context#transcribe and Whisper::VAD::Context#detect accept Pathname * Update signature of Whisper::Context#transcribe * Fix variable name * Don't free memory view * Make parse_full_args return struct * Fallback when failed to get MemoryView * Add num of samples when too long * Check members of MemoryView * Fix a typo * Remove unnecessary include * Fix a typo * Fix a typo * Care the case of MemoryView doesn't fit spec * Add TODO comment * Add optimazation option to compiler flags * Use ALLOC_N instead of malloc * Add description to sample code * Rename and change args: parse_full_args -> parse_samples * Free samples when exception raised * Assign type check result to a variable * Define wrapper function of whisper_full * Change signature of parse_samples for rb_ensure * Ensure release MemoryView * Extract fill_samples function * Free samples memory when filling it failed * Free samples memory when transcription failed * Prepare transcription in wrapper funciton * Change function name * Simplify function boundary	2026-01-30 22:59:36 +09:00
Frieder Bluemle	bf422cb704	scripts : Fix dSYMs path case for macOS xcframework build (#3630) The script creates dSYMs/ but references dSYMS/ for macOS, causing build failures on case-sensitive filesystems.	2026-01-30 15:57:26 +02:00
Georgi Gerganov	acbace0571	cuda : fix compile warnings (#0)	2026-01-30 15:56:40 +02:00
Georgi Gerganov	953e503fd9	talk-llama : sync llama.cpp	2026-01-30 15:56:40 +02:00
Georgi Gerganov	b529c0610f	sync : ggml	2026-01-30 15:56:40 +02:00
bssrdf	5dca0db99c	add tensor type checking as part of cuda graph properties (llama/19186)	2026-01-30 15:56:40 +02:00
s8322	2a16e7a67f	sycl: implement GGML_UNARY_OP_SOFTPLUS (llama/19114) * sycl: add softplus unary op implementation * sycl: add softplus unary op implementation * docs(ops): mark SYCL SOFTPLUS as supported * docs: update SYCL status for SOFTPLUS	2026-01-30 15:56:40 +02:00
RachelMantel	1b3c27efae	sycl: implement GGML_OP_TRI (llama/19089) * sycl: implement GGML_OP_TRI * docs: update ops.md for SYCL TRI * docs: regenerate ops.md * docs: update SYCL support for GGML_OP_TRI	2026-01-30 15:56:40 +02:00
Zheyuan Chen	829e70044b	ggml-webgpu: improve flastAttention performance by software pipelining (llama/19151) * webgpu : pipeline flash_attn Q/K loads in WGSL * ggml-webgpu: unroll QK accumlation inner loop ggml-webgpu: vectorization * ggml-webgpu: unrolling * ggml-webgpu: remove redundant unrolling * ggml-webgpu: restore the config * ggml-webgpu: remove redundant comments * ggml-webgpu: formatting * ggml-webgpu: formatting and remove vectorization * ggml-webgpu: remove unnecessary constants * ggml-webgpu: change QKV buffer to read_write to pass validation * ggml-webgpu: add explanation for the additional bracket around Q K accumulate * Indentation and for -> if for tail * Kick off CI on wgsl only commits --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-01-30 15:56:40 +02:00
Todor Boinovski	2a89a3f35c	hexagon: enable offloading to Hexagon on Windows on Snapdragon (llama/19150) * hexagon: updates to enable offloading to HTP on WoS * Update windows.md * Update windows.md * hexagon: enable -O3 optimizations * hexagon: move all _WINDOWS conditional compilation to _WIN32 * hexagon: updates to enable offloading to HTP on WoS * hexagon: use run-time vs load-time dynamic linking for cdsp driver interface * refactor htp-drv * hexagon: add run-bench.ps1 script * hexagon: htdrv refactor * hexagon: unify Android and Windows build readmes * hexagon: update README.md * hexagon: refactor htpdrv * hexagon: drv refactor * hexagon: more drv refactor * hexagon: fixes for android builds * hexagon: factor out dl into ggml-backend-dl * hexagon: add run-tool.ps1 script * hexagon: merge htp-utils in htp-drv and remove unused code * wos: no need for getopt_custom.h * wos: add missing CR in htpdrv * hexagon: ndev enforecement applies only to the Android devices * hexagon: add support for generating and signing .cat file * hexagon: add .inf file * hexagon: working auto-signing and improved windows builds * hexagon: futher improve skel build * hexagon: add rough WoS guide * hexagon: updated windows guide * hexagon: improve cmake handling of certs and logging * hexagon: improve windows setup/build doc * hexagon: more windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * Update windows.md * Update windows.md * snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon Also added a power shell script to simplify build env setup. * hexagon: remove trailing whitespace and move cmake requirement to user-presets * hexagon: fix CMakeUserPresets path in workflow yaml * hexagon: introduce local version of libdl.h * hexagon: fix src1 reuse logic gpt-oss needs a bigger lookahead window. The check for src[1] itself being quantized was wrong. --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-01-30 15:56:40 +02:00
Georgi Gerganov	b997e690ef	cuda : fix nkvo, offload and cuda graph node properties matching (llama/19165) * cuda : fix nkvo * cont : more robust cuda graph node property matching * cont : restore pre-leafs implementation * cont : comments + static_assert	2026-01-30 15:56:40 +02:00
yulo	34a3e28a08	HIP: add mmf for CDNA (llama/18896) * refactor mmf rows_per_block * speed up compile * pass cdna compile * fix cuda error * clean up mmf * f32 mmf * clean float mma * fix mmf error * faster mmf * extend tile k * fix compile error * Revert "extend tile k" This reverts commit 4d2ef3d483932659801a59a5af0b6b48f6ffd5c7. * fix smem overflow * speed up compiling mmf * speed up compile for hip * 512 block for cdna * config pad size * fix as comment * update select logic * move some code to cuh * fix as comment * correct cdna3 config --------- Co-authored-by: zhang hui <you@example.com>	2026-01-30 15:56:40 +02:00
Vishal Singh	e0a2182970	ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (llama/19159)	2026-01-30 15:56:40 +02:00
Aman Gupta	62ba8b537f	CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (llama/19126)	2026-01-30 15:56:40 +02:00
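
The entry for a567c140a3 (vulkan: Preprocess FA mask to detect all-neg-inf and all-zero) describes writing a 2-bit code per mask block so that the two common cases need not be loaded. The following is a minimal CPU-side sketch of that idea, not the Vulkan shader from the commit. The code values, the block shape, the packing of 16 codes per 32-bit word, and every function name here are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 2-bit codes; the values used by the real shader are not
// stated in the commit message.
enum MaskBlockCode : std::uint32_t {
    MASK_MIXED       = 0, // block has to be loaded and applied as usual
    MASK_ALL_NEG_INF = 1, // every element is -inf
    MASK_ALL_ZERO    = 2, // every element is 0
};

// Classify one tile of a row-major rows x cols mask, starting at (r0, c0).
// Tiles at the right/bottom edge are clipped to the mask bounds.
static MaskBlockCode classify_block(const std::vector<float>& mask,
                                    std::size_t rows, std::size_t cols,
                                    std::size_t r0, std::size_t c0,
                                    std::size_t block_rows, std::size_t block_cols) {
    bool all_neg_inf = true;
    bool all_zero    = true;
    const std::size_t r1 = std::min(r0 + block_rows, rows);
    const std::size_t c1 = std::min(c0 + block_cols, cols);
    for (std::size_t r = r0; r < r1; ++r) {
        for (std::size_t c = c0; c < c1; ++c) {
            const float v = mask[r * cols + c];
            all_neg_inf = all_neg_inf && std::isinf(v) && v < 0.0f;
            all_zero    = all_zero && v == 0.0f;
        }
    }
    if (all_neg_inf) return MASK_ALL_NEG_INF;
    if (all_zero)    return MASK_ALL_ZERO;
    return MASK_MIXED;
}

// Produce one 2-bit code per block, packed 16 codes per 32-bit word.
// Blocks are numbered row-major over the grid of tiles.
std::vector<std::uint32_t> preprocess_mask(const std::vector<float>& mask,
                                           std::size_t rows, std::size_t cols,
                                           std::size_t block_rows, std::size_t block_cols) {
    assert(block_rows > 0 && block_cols > 0);
    assert(mask.size() == rows * cols);
    const std::size_t nbr = (rows + block_rows - 1) / block_rows;
    const std::size_t nbc = (cols + block_cols - 1) / block_cols;
    std::vector<std::uint32_t> packed((nbr * nbc + 15) / 16, 0u);
    for (std::size_t br = 0; br < nbr; ++br) {
        for (std::size_t bc = 0; bc < nbc; ++bc) {
            const std::size_t i = br * nbc + bc;
            const std::uint32_t code = classify_block(
                mask, rows, cols, br * block_rows, bc * block_cols, block_rows, block_cols);
            packed[i / 16] |= code << (2 * (i % 16));
        }
    }
    return packed;
}

// Read back the code for a given block index.
std::uint32_t mask_block_code(const std::vector<std::uint32_t>& packed, std::size_t block_index) {
    return (packed[block_index / 16] >> (2 * (block_index % 16))) & 3u;
}
```

A consumer would check `mask_block_code` before touching the mask: skip the block entirely for `MASK_ALL_NEG_INF`, skip only the mask load for `MASK_ALL_ZERO`, and fall back to the normal path for `MASK_MIXED`. Per the commit message, the real change applies this only when the mask is relatively large (prompt processing), where the preprocessing pass is more likely to pay for itself.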

4210 Commits

All Branches