Commit Graph

4210 Commits

Author SHA1 Message Date
shalinib-ibm 78a23d4830 ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (llama/18837) 2026-01-30 15:56:40 +02:00
Max Krasnyansky 50b7ab3d46 hexagon: support for OP_CPY, host buffers now optional (llama/18822) 2026-01-30 15:56:40 +02:00
Oliver Simons bc09047405 CUDA: Factor out and re-use `block_reduce` function (llama/18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of the `__device__` function into the `__global__` function
allows for explicit smem reuse, as neither the compiler nor the CUDA runtime
seems to free it afterwards (`cudaFuncSetAttribute` fails when not accounting
for it once per call to `two_stage_warp_reduce`)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than a plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-30 15:56:40 +02:00
Jeff Bolz 4b155e9bfb vulkan: Check maxStorageBufferRange in supports_op (llama/18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-30 15:56:40 +02:00
Daniel Bevenius 25aeb66a4a CUDA : fix typo in clang pragma comment [no ci] (llama/18830) 2026-01-30 15:56:40 +02:00
Ruben Ortlam 49762e8fb3 vulkan: work around Intel fp16 bug in mmq (llama/18814) 2026-01-30 15:56:40 +02:00
Perry Naseck 17656e56dc ggml-metal: do not copy headers for embedded, use current binary dir for embedded (llama/18705) 2026-01-30 15:56:40 +02:00
yulo c6a495ae5d HIP: add fattn-mma-f16 for RDNA4 (llama/18481)
* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update as the review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-30 15:56:40 +02:00
Bráulio Oliveira 7aa8818647
examples : use -dev/--device and WHISPER_ARG_DEVICE (#3557)
Align device selection naming with llama.cpp.
2026-01-21 08:40:30 +01:00
Yshtola f53dc74843
whisper : Fix UTF-8 character boundary issue in segment wrapping (max_len) (#3592)
The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break multi-byte UTF-8 characters, resulting in invalid sequences displayed as `�` (U+FFFD replacement character).
2026-01-16 14:16:05 +02:00
Georgi Gerganov 2eeeba56e9
release : v1.8.3 2026-01-15 11:54:31 +02:00
Georgi Gerganov 21c1765fcb
benches : update 2026-01-15 11:53:09 +02:00
Georgi Gerganov 47af2fb70f sync : ggml 2026-01-14 09:11:59 +02:00
Georgi Gerganov 6ee0eaf531 CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (llama/18800) 2026-01-14 09:11:59 +02:00
Jeff Bolz ab1828dc1c vulkan: change memory_logger to be controlled by an env var (llama/18769) 2026-01-14 09:11:59 +02:00
Jeff Bolz aedf332ec5 vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-14 09:11:59 +02:00
Ruben Ortlam 716d68aca9 vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (llama/18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-14 09:11:59 +02:00
Ruben Ortlam c0433783c3 Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (llama/18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-14 09:11:59 +02:00
Georgi Gerganov ecfcc65fbf talk-llama : sync llama.cpp 2026-01-14 09:11:59 +02:00
Georgi Gerganov 13dc9a912b sync : ggml 2026-01-14 09:11:59 +02:00
shaofeiqi d4ce2e554f opencl: add SOFTPLUS op support (llama/18726) 2026-01-14 09:11:59 +02:00
Johannes Gäßler 3a1ea96373 HIP: adjust RDNA3.5 MMQ kernel selection logic (llama/18666) 2026-01-14 09:11:59 +02:00
Perry Naseck 484b17053a cmake : update blas logic (llama/18205) 2026-01-14 09:11:59 +02:00
Michael Wand 45be2cd27a Corrected: changed s13 = src1->nb[3] instead of nb[2] (llama/18724) 2026-01-14 09:11:59 +02:00
shaofeiqi 4af27bf2da opencl: add EXPM1 op (llama/18704) 2026-01-14 09:11:59 +02:00
Reese Levine 4ac8c3b478 Updates to webgpu get_memory (llama/18707) 2026-01-14 09:11:59 +02:00
Aaron Teo fff3ebd93d llama: use host memory if device reports 0 memory (llama/18587) 2026-01-14 09:11:59 +02:00
Masashi Yoshimura a71127dfd8 ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (llama/18628)
* Fix GGML_MEM_ALIGN to 8 for emscripten.

* Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten
2026-01-14 09:11:59 +02:00
Reese Levine 1bb903f599 ggml webgpu: initial flashattention implementation (llama/18610)
* FlashAttention (llama/13)

* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet, though

* neg passes backend test

* unary operators pass ggml tests

* rms_norm double declaration bug atoned

* abides by editor-config

* removed vestigial files

* fixed autoconfig

* All operators (including xielu) working

* removed unnecessary checking of whether node->src[1] exists for unary operators

* responded and dealt with PR comments

* implemented REPL_Template support and fixed a bug in the unary operators kernel

* formatted embed wgsl and ggml-webgpu.cpp

* Faster tensors (llama/8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (llama/9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still a multiple of the sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (llama/4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Refactored pipelines and workgroup calculations (llama/10)

* refactored pipelines

* refactored workgroup calculation

* removed commented out block of prior maps

* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention

* Shader structure set up (many bugs still)

* debugging

* Working first test

* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32

* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling

* Start work on integrating pre-wgsl

* Separate structs/initial shader compilation library into separate files

* Work on compilation choices for flashattention

* Work on subgroup matrix/tile size portability

* subgroup size agnostic online softmax

* Cleanups, quantization types

* more cleanup

* fix wasm build

* Refactor flashattention to increase parallelism, use direct loads for KV in some cases

* Checkpoint

* formatting

* Update to account for default kv cache padding

* formatting shader

* Add workflow for ggml-ci webgpu

* Try passing absolute path to dawn in ggml-ci

* Avoid error on device destruction, add todos for proper cleanup

* Fix unused warning

* Forgot one parameter unused

* Move some flashattn computation to f32 for correctness
2026-01-14 09:11:59 +02:00
Jeff Bolz 0bc0e5616e vulkan: fix push constant size for quantize_q8_1 (llama/18687)
I added an assert to catch further mismatches, and it found several.
Fix those, too.
2026-01-14 09:11:59 +02:00
Jeff Bolz 678c660e62 vulkan: optimize ssm_scan (llama/18630)
* vulkan: optimize ssm_scan

* fix warp vs subgroup naming
2026-01-14 09:11:59 +02:00
도로로도로또 f2d8588229 metal : add MoE kernel specialization for ne20=5 (llama/18667)
Add template specialization for kernel_mul_mm_id_map0 with ne20=5
to support models using 5 active experts (e.g., VAETKI).
2026-01-14 09:11:59 +02:00
Doctor Shotgun b9965c89a1 ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (llama/18535)
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-14 09:11:59 +02:00
shaofeiqi 85a329cb08 opencl: add FILL op support (llama/18682) 2026-01-14 09:11:59 +02:00
Oliver Walsh 4f2ca7c163 cuda : fix build on cuda 12.8 (llama/18672)
compute_121 requires CUDA 12.9

Signed-off-by: Oliver Walsh <owalsh@redhat.com>
2026-01-14 09:11:59 +02:00
Jeff Bolz a91ab72bd9 vulkan: reject ops when a tensor is too large to allocate (llama/18646) 2026-01-14 09:11:59 +02:00
virajwad 096e7e911a vulkan: Warptile tuning for Intel Xe2/Xe3 (llama/18178)
* modify warptile tuning for xe3

* intel vendor check w/ coopmat support

* fix back formatting

* fix formatting change 2

* move intel check to chip specific tuning part

* Change to support both windows and linux

* modify m_warptile to l_warptile for intel

* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)

* Code style changes

* Code style changes (2)

* Code style changes (3)
2026-01-14 09:11:59 +02:00
Eve a576ed944a vulkan: more mul mat optimizations (llama/18533)
* q4_k

* q5_k

* q2_k

* q4_1

* q5_1

* better buf index
2026-01-14 09:11:59 +02:00
hipudding 5c583f3c02 CANN: Fix rename for get_env (llama/18652)
In #18624, get_env in ggml-cann was renamed to get_env_as_lowercase
to accurately reflect the function’s behavior and reduce the chance
of misuse. However, the update missed renaming call sites in other
files. This commit fixes that oversight.
2026-01-14 09:11:59 +02:00
Raul Torres 47671c81db CANN: Rename `get_env` to `get_env_as_lowercase` (llama/18624) 2026-01-14 09:11:59 +02:00
Max Krasnyansky a5f51ac75b Hexagon: add support for f16/f32 flash attention, scale, set-rows and improve f16/f32 matmul (llama/18611)
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention

* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx

* hexagon: add support for SCALE fp32

* hexagon: replace scalar fp32 -> fp16 copy with HVX

* hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA

- Implements double-buffered DMA prefetching for K, V, and Mask tensors.
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.

* hexagon: use aligned mad_f16

* hexagon: flash_attn more aligned ops

* hexagon: optimize scale_f32 hvx helpers

* hexagon: unroll fa loops

* hexagon: remove unused set-rows log

* hexagon: flash_attn_ext add support for DMAing Q

- Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.

* hexagon: fix handling of NANs hvx dotproducts

* hexagon: cleanup spad allocation in flash-attn

* hexagon: improve fp16/fp32 matmul

- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
- Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.

* hexagon: fix HVX_ARCH check

* hexagon: matmul cleanup and fp16 fixes

Use aligned vec_dot_f16 for 2d matmuls and unaligned version for 4d.

* hexagon: fix fp16 x fp16 matmuls and some minor refactoring

* hexagon: add support for GET_ROWS f32 -> f32

Also optimize SET_ROWS threading a bit when we have just a few rows to process.

* hexagon: optimize set-rows threading

* hexagon: update adb/run-bench.sh to properly support experimental and verbose options

* hexagon: flash_attn use aligned vectors for dot products
2026-01-14 09:11:59 +02:00
Aadeshveer Singh 436f30d05f ggml : optimize cuda ssm_scan using warp-level reduction (llama/18505)
* ggml : optimize cuda ssm_scan using warp-level reduction

* ggml : apply code review suggestions (style, const, constexpr)

* ggml : add TODO regarding stride consistency
2026-01-14 09:11:59 +02:00
Jeff Bolz dbec71f6cf vulkan: support buffer_from_host_ptr (llama/18467)
* vulkan: support buffer_from_host_ptr

* hacky use of buffer_from_host_ptr for directio

* disable buffer_from_host_ptr cap

* use external memory for ggml_vk_host_malloc, revert model loader changes

* disable external_memory_host for MoltenVK

* take buffer memory types into account

* don't use external_memory_host for ggml_vk_host_malloc
2026-01-14 09:11:59 +02:00
Aman Gupta 575d894603 ggml-cuda: refactor cuda graph usage (llama/18637)
* ggml-cuda: refactor cuda graph usage

* use is_enabled() instead of enabled
2026-01-14 09:11:59 +02:00
Beinsezii ed674cfc10 mmq.cu: tune mmq/rocblas switching for RDNA (llama/18537)
* Patch perf regression for mmq kernels in ROCm

recover performance regression for https://github.com/ggml-org/llama.cpp/issues/17917

* add n_experts branch like the cdna path

* mmq.cu: tune mmq/wmma switching for RDNA

* mmq.cu: move amd wmma mmq/wmma switching behind IS_RDNA3

* Update ggml/src/ggml-cuda/mmq.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Jiacheng (Jason) Chen <76919340+jiachengjason@users.noreply.github.com>
Co-authored-by: jiachengjason <jasonchen.jiacheng@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-14 09:11:59 +02:00
Adrien Gallouët 5520f27363 ggml : fix avx512bf16 build (llama/18623)
- include `immintrin.h` when required
- remove unused m512bh

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-14 09:11:59 +02:00
Raul Torres 9a1a6685ba CANN: Make `valid_values` variable `static const` (llama/18627) 2026-01-14 09:11:59 +02:00
nwyin e563e239a7 ggml webgpu: add CEIL operation support (llama/18605)
* ggml-webgpu: add CEIL operation support

Add support for the CEIL unary operation in the WebGPU backend:
- Add CEIL_FUNC shader template in unary_op.wgsl
- Add 4 shader variants (f32, f16, inplace versions)
- Initialize CEIL pipelines in ggml-webgpu.cpp
- Register CEIL in supports_op function

* docs: update WebGPU ops support for CEIL
2026-01-14 09:11:59 +02:00
Johannes Gäßler 9956333361 CUDA: fix FA FP16 accumulator overflow for Granite (llama/18614) 2026-01-14 09:11:59 +02:00
Aman Gupta 804f545454 ggml-cuda: check for srcs outside the cgraph (llama/18583)
* ggml-cuda: check for srcs outside the cgraph

* review: use leafs instead
2026-01-14 09:11:59 +02:00