Commit Graph

4210 Commits

Author SHA1 Message Date
Jiacheng (Jason) Chen e3f3c6ead1
HIP: enable WMMA-MMQ INT kernels for RDNA 3 (llama/17576)
* enabled wmma instructions for most quantizations other than q2k

* fixed the last q2_k test case failure

* address comments: fix out-of-bounds write for RDNA4, add comments after #endif

* clean up rebase: fix ne error in half2

* fix the EditorConfig CI
2025-12-12 17:53:17 +02:00
Piotr Wilkin (ilintar) 8d44d6181a
Add support for CUMSUM and TRI for CUDA. (llama/17584)
* Add support for CUMSUM and TRI for CUDA.

* Minor optimizations.

* Correct warp_prefix_inclusive_sum in float2 variant to return float2 (see the sketch below)

* Optimize TRI

* Whitespace

* Fix strides.

* Implement double loop

* Whitespace

* Fix HIP compilation bugs

* Optimizations + big case performance tests

* Implement using CUB with fallback to custom kernel

* Remove error message.

* Fixes from code review

* Comment out CPU-unsupported F16/BF16 cases to fix CI

* Fine, you win :P

* Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS

* Vary warp-size based on physical warp size

* Add GGML_UNUSED_VARS in tri as well

* Use constexpr and call prefix_inclusive with warp_size template param

* Update ggml/src/ggml-cuda/cumsum.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Change to tid % warp_size

* Fix strides; hardcode mask; add ggml_lane_mask_t

* Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()

* Too hasty...
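
For reference, a warp-level inclusive scan of the kind `warp_prefix_inclusive_sum` implements can be sketched as a generic Kogge-Stone shuffle scan. This is illustrative CUDA only, not the actual ggml-cuda code, which also covers a float2 variant, CUB dispatch, and HIP warp sizes:

```cuda
template <int warp_size>
static __device__ float warp_prefix_inclusive_sum(float x) {
    const int lane = threadIdx.x % warp_size;
#pragma unroll
    for (int offset = 1; offset < warp_size; offset <<= 1) {
        // pull the running sum from the lane `offset` steps below
        const float y = __shfl_up_sync(0xFFFFFFFF, x, offset, warp_size);
        if (lane >= offset) {
            x += y;
        }
    }
    return x; // lane i now holds the sum of lanes 0..i
}
```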

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-12 17:53:17 +02:00
Gabe Goodhart 8902c9d976
metal: TRI, FILL, EXPM1, SOFTPLUS (llama/16623)
* feat(wip): Port initial TRI impl from previous work

The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove argument for constant val override

This was added in the original draft, but later removed. With this, the
kernel now passes tests.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Move the ttype conditional to templating to avoid conditional in kernel

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Type fixes

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* feat: Add softplus for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add EXPM1 for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add FILL for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Branchless version of tri using _ggml_vec_tri_cmp as a mask

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unused arguments

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use select instead of branch for softplus non-vec

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:17 +02:00
Alberto Cabrera Pérez f96ebc92d2
ggml-cpu : remove asserts always evaluating to false (llama/17728) 2025-12-12 17:53:17 +02:00
Georgi Gerganov 194d016456
metal : use params per pipeline instance (llama/17739) 2025-12-12 17:53:16 +02:00
Adrien Gallouët 92e50155c9
build : move _WIN32_WINNT definition to headers (llama/17736)
Previously, cmake was forcing `_WIN32_WINNT=0x0A00` for MinGW builds.
This caused "macro redefined" warnings with toolchains that already define the version.

This also removes the `GGML_WIN_VER` variable as it is no longer needed.
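
The usual header-side pattern for this kind of fallback looks roughly like the following; an illustrative guess, not the exact contents of the change:

```cpp
#if defined(_WIN32) && !defined(_WIN32_WINNT)
#    define _WIN32_WINNT 0x0A00 // Windows 10; defined only when the toolchain has not set it
#endif
```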

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 17:53:16 +02:00
Herman Semenoff 3794a0d3b6
ggml-cpu: remove duplicate conditional check 'iid' (llama/17650) 2025-12-12 17:53:16 +02:00
Johannes Gäßler 7adbcafb6c
CUDA: generalized (mma) FA, add Volta support (llama/17505)
* CUDA: generalized (mma) FA, add Volta support

* use struct for MMA FA kernel config

---------

Co-authored-by: Aman Gupta <aman>
2025-12-12 17:53:16 +02:00
Georgi Gerganov 4a00f2e3a4
metal : fix data race in pipeline library (llama/17731) 2025-12-12 17:53:16 +02:00
Reese Levine d263bdbfb6
ggml webgpu: add support for emscripten builds (llama/17184)
* Faster tensors (llama/8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (llama/9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still a multiple of the sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (llama/4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Fix .gitignore

* Add memory64 option and remove unneeded macros for setting threads to 1

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-12-12 17:53:16 +02:00
Jeff Bolz 86cb5ab93f
vulkan: Reduce temporary memory usage for TOP_K (llama/17623)
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output"; what used to be "ncols_output" is now "k"

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
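
As a sanity check on those figures: 3.2 MB over 200000 source elements works out to about 16 bytes of temporary storage per input element before the change, and 3.2 MB / 500 KB is roughly a 6.4x reduction from sizing the buffer by the first-pass output instead of the full row.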
2025-12-12 17:53:15 +02:00
xiaobing318 fffdf679d4
cmake : add utf8 compilation options for msvc (llama/17682) 2025-12-12 17:53:15 +02:00
Adrien Gallouët 16688c6d2c
ggml : use svcntb() for SVE vector length detection (llama/17474)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 17:53:15 +02:00
TianHao324 a64d46a529
CANN: Disable Ger operator of OUT_PROD on 310p device (llama/17563) 2025-12-12 17:53:15 +02:00
Daniel Bevenius 201b910743
ggml : remove redundant n_copies check when setting input/output (llama/17612)
This commit removes a redundant check for sched->n_copies > 1 when
setting input and output flags on tensor copies in
ggml_backend_sched_split_graph.

The motivation for this change is to clarify the code as the outer if
statement already performs this check.
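
Sketched concretely (hypothetical structure only; the real logic lives in ggml_backend_sched_split_graph):

```cpp
struct sched_t { int n_copies; };

void set_copy_flags(const sched_t * sched) {
    if (sched->n_copies > 1) {
        for (int c = 0; c < sched->n_copies; c++) {
            // inner check removed by this commit: the enclosing if already
            // guarantees that sched->n_copies > 1 holds here
            if (sched->n_copies > 1) {
                // ... set input/output flags on the c-th tensor copy ...
            }
        }
    }
}
```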
2025-12-12 17:53:15 +02:00
Adrien Gallouët e2537b4af3
ggml : add fallback definition for HWCAP2_SVE2 (llama/17683)
This aligns with other HWCAP2 feature flags.

See #17528

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 17:53:15 +02:00
Aman Gupta 4c89232b5c
ggml-cuda: reorder only relevant nodes (llama/17639) 2025-12-12 17:53:14 +02:00
Neo Zhang Jianyu 26732d28c4
enhance argsort for UT (llama/17573)
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2025-12-12 17:53:14 +02:00
Georgi Gerganov 32090930f7
metal : add FA head size 48 (llama/17619) 2025-12-12 17:53:14 +02:00
Georgi Gerganov 7cd3de89bf
ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler (llama/17617) 2025-12-12 17:53:14 +02:00
Aman Gupta 6cc2d0534f
llama-graph: avoid expand_forward for fusion (llama/17633) 2025-12-12 17:53:14 +02:00
Tarek Dakhran 0defeee679
model: LFM2-VL fixes (llama/17577)
* Adjust to pytorch

* Add antialiasing upscale

* Increase number of patches to 1024

* Handle default marker insertion for LFM2

* Switch to flag

* Reformat

* Cuda implementation of antialias kernel

* Change placement in ops.cpp

* consistent float literals

* Pad only for LFM2

* Address PR feedback

* Rollback default marker placement changes

* Fall back to the CPU implementation of the antialias upscale
2025-12-12 17:53:14 +02:00
Gilad S. 706647202e
ggml: fix: macOS build with `-DGGML_BACKEND_DL=ON` (llama/17581) 2025-12-12 17:53:13 +02:00
Aman Gupta e68ee6e281
CUDA: add stream-based concurrency (llama/16991)
* CUDA: add stream-based concurrency

* HIP: fix hipStreamWaitEvent define and nodiscard warnings

* ggml-cuda: fix fusion inside stream

* ggml-cuda: fix bug w.r.t first stream launch

* ggml-cuda: format

* ggml-cuda: improve assert message

* ggml-cuda: use lambda instead of duplicating code

* ggml-cuda: add some more comments

* ggml-cuda: add more detailed comments about concurrency

* ggml-cuda: rename + remove unused var

* ggml-cuda: fix condition for stream launch

* ggml-cuda: address review comments, add destructor

* common.cuh: add is_valid for concurrent events

* common.cuh: make comment better

* update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* common.cuh: fix lower_bound condition + remove join_node data from write_ranges

* ggml-cuda: fix overlap condition + shadowing parameter

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-12 17:53:13 +02:00
Mahekk Shaikh 2e4a7a21fa
cuda : add error checking for cudaMemcpyAsync in argsort (llama/17599)
* cuda : add error checking for cudaMemcpyAsync in argsort (llama/12836)

* fix indentation
2025-12-12 17:53:13 +02:00
Acly 2258930c2e
vulkan : fix FA mask load with bounds check (coopmat2) (llama/17606) 2025-12-12 17:53:13 +02:00
Neo Zhang a3459484bf
sycl : support allocating more than 4GB of memory on device, update the doc and script (llama/17566)
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2025-12-12 17:53:13 +02:00
ixgbe 28dff06555
ggml: replace hwcap with riscv_hwprobe for RVV detection (llama/17567)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-12-12 17:53:12 +02:00
Ruben Ortlam 2fcc0a3a9f
Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (llama/16900)
* vulkan: split mul_mmq_funcs for mul_mat_vecq use

* add mxfp4 mmvq

* add q2_k mmvq

* add q3_k mmvq

* add q4_k and q5_k mmvq

* add q6_k mmvq

* handle 4x4 quants per mmvq thread

* enable MUL_MAT_ID mmvq support

* enable subgroup optimizations for mul_mat_vec_id shaders

* device tuning

* request prealloc_y sync after quantization

* fix indentation

* fix llvmpipe test failures

* fix mul_mat_id mmvq condition

* fix unused variable warning
2025-12-12 17:53:12 +02:00
Jeff Bolz dbf8766ffa
vulkan: improve topk perf for large k, fix overflow in unit tests (llama/17582) 2025-12-12 17:53:12 +02:00
Diego Devesa 463003e76c
ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (llama/17276)
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.

* llama : update worst-case graph for unified cache

* ci : disable op offload in some tests

* fix spelling

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:12 +02:00
R0CKSTAR c372bdbb3c
enable fp16/fast_fp16/bf16_mma on PH1 (llama/17551)
* [MUSA] enable fp16/fast_fp16/bf16_mma on PH1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/fattn-vec.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/fattn-vec.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/fattn-tile.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-12 17:53:12 +02:00
Aman Gupta 90ca4e0a07
ggml-cuda: add stricter checking for fusion (llama/17568)
* ggml-cuda: make conditions for fusion more explicit

* ggml-cuda: remove size check as std::equal already does it
2025-12-12 17:53:12 +02:00
Piotr Wilkin (ilintar) 43441ff58a
model : Qwen3 Next (llama/16095)
* Qwen3 Next - cleaned up version

* Whitespaces and stuff

* Correct minor errors

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Misc. fixes.

* Clean up code, add missing hybrid qualifier

* Did someone transpose the SOLVE_TRI result matrix? Perhaps...

* Whitespace

* Proper tensors for cb calls

* Use llama-graph.h vertical alignment

* BROKEN: chunking

* Set new tensors as inputs.

* Proper chunk logic

* It's the circle of life...

* More shenanigans for n_seq > 1

* Nail in the coffin?

* Fix Windows build

* Eh, one fails on Windows, the other fails on Mac... just use general capture.

* quant : cleanup

* model : cleanup

* qwen3 : cleanup

* cont : cleanup

* cont : cleanup

* ggml : revert change

* qwen3 : cleanup

* cont : cleanup

* Readd cmath

* qwen3 : fix typo

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Usual suspects

* fix my bad suggestion

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:11 +02:00
Johannes Gäßler 37e4c2ed3a
CUDA: no FP16 arithmetic for vector FA kernel (llama/17558) 2025-12-12 17:53:11 +02:00
Jeff Bolz 7a20963140
vulkan: Implement GGML_OP_TRI (llama/17503)
* vulkan: Implement GGML_OP_TRI

* check types match
2025-12-12 17:53:11 +02:00
Radoslav Gerganov d26d1c8b85
rpc : cache and reuse compute graphs (llama/15405)
Store the last computed graph and reuse it when possible.
Also do not return a response from GRAPH_COMPUTE and assume it always
completes successfully. If this is not the case, the server closes
the connection. This saves us a network round trip to the server.
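
A minimal sketch of the caching idea (hypothetical helper names; not the actual rpc-server code):

```cpp
#include <cstdint>
#include <vector>

struct ggml_cgraph;                                            // from ggml.h
ggml_cgraph * deserialize_graph(const std::vector<uint8_t> &); // hypothetical
void          compute_graph(ggml_cgraph *);                    // hypothetical

static std::vector<uint8_t> last_bytes;           // last serialized graph
static ggml_cgraph *        last_graph = nullptr; // reusable deserialized graph

// Rebuild the graph only when the client sends different bytes.
static void handle_graph_compute(const std::vector<uint8_t> & bytes) {
    if (last_graph == nullptr || bytes != last_bytes) {
        last_graph = deserialize_graph(bytes);
        last_bytes = bytes;
    }
    compute_graph(last_graph); // no response is sent; if computation fails,
                               // the server simply closes the connection
}
```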
2025-12-12 17:53:11 +02:00
yulo f92d542d4d
HIP: enable mul_mat_f for RDNA4 (llama/17437)
* enable mmf for rdna4

* move some mmvf to mmf

* revert lds128 for wmma loading

* Revert "revert lds128 for wmma loading"

This reverts commit db9ae8b6b4738a5def5b393caa1611d52133e9b5.

* Revert "enable mmf for rdna4"

This reverts commit 698c9f24187b990e35c3b73a8067e5387e6ddbd4.

* Revert "move some mmvf to mmf"

This reverts commit 99b92bd6653cc8593607f641e44606391691792f.

* enable mul_mat for rdna4

---------

Co-authored-by: zhang hui <you@example.com>
2025-12-12 17:53:11 +02:00
Piotr Wilkin (ilintar) 51e842d106
SOLVE_TRI CUDA kernel for small matrices (llama/17457) 2025-12-12 17:53:11 +02:00
Neo Zhang Jianyu 93bc8dc5a8
refactor pad_reflect_1d to make the UT case pass (llama/17204)
Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
2025-12-12 17:53:10 +02:00
Jeff Bolz 3727a36c48
vulkan: Implement SOLVE_TRI (llama/17486)
* vulkan: Implement SOLVE_TRI

* load B matrix through shared memory

* use FLOAT_TYPE
2025-12-12 17:53:10 +02:00
matt23654 e682af7886
cuda : fix UMA detection on discrete GPUs. (llama/17537) 2025-12-12 17:53:10 +02:00
Alberto Cabrera Pérez 93f6cdb9c0
ggml-cpu: aarm64: q4_K repack gemm and gemv implementations (dotprod only) (llama/17494)
* Enabled q4_K_4x8 path

* Fixed generic Q4_K 8x4 implementation

* wip: dotprod gemm

* Working arm q4_K dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Undo acc rename

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Q4_K arm dotprod gemm

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix: q4_qs reinterpret from uint to int

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Removed comments

* Fixed macro guards

* Fixed unused vars in generic implementation

* Fixed unused vars in 8x4 repack

* Fixed unused vars in generic implementation, unneeded comment

* Missing arch fallback for x86

* minor : style

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:10 +02:00
Acly ac92424b59
vulkan : move contiguous checks to device_supports_op (llama/17490)
* vulkan : remove op_supports_incontiguous and add missing constraints in device_supports_op

* im2col: remove constraints on src0 (kernel input)
2025-12-12 17:53:10 +02:00
Jeff Bolz 310db24fca
vulkan: use a fixed 1KB buffer for the add_rms_fusion opt (llama/17514) 2025-12-12 17:53:10 +02:00
lhez 74ef5dd1a9
opencl: add sqr, sqrt, mean and ssm_conv (llama/17476)
* opencl: add sqr

* opencl: add sqrt

* opencl: add mean

* opencl: add ssm_conv

* opencl: add missing cl_khr_fp16

* opencl: do sqrt in f32 then convert to f16 for better precision
2025-12-12 17:53:09 +02:00
Alberto Cabrera Pérez 3de4372465
Fix chunks being too small with small matrix sizes (llama/17526) 2025-12-12 17:53:09 +02:00
Jeff Bolz c8050e5fdc
vulkan: allow graph_optimize for prompt processing workloads (llama/17475) 2025-12-12 17:53:09 +02:00
Jeff Bolz d8b61e05f8
vulkan: Implement top-k (llama/17418)
* vulkan: Implement top-k

Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discard all but the top K. Repeat until only K are left. There's also a fast
path when K==1 to just find the max value rather than sorting (see the sketch at
the end of this message).

* fix pipeline selection

* vulkan: Add N-ary search algorithm for topk

* microoptimizations
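
A host-side C++ sketch of that multi-pass scheme, assuming each "workgroup" sorts a fixed-size chunk and keeps its top K (illustrative only, not the Vulkan shader code):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

std::vector<float> topk_multipass(std::vector<float> v, size_t K, size_t chunk = 1024) {
    while (v.size() > chunk) {
        std::vector<float> kept;
        for (size_t i = 0; i < v.size(); i += chunk) {
            const size_t end = std::min(i + chunk, v.size());
            // one "workgroup": sort its 2^N slice in descending order ...
            std::sort(v.begin() + i, v.begin() + end, std::greater<float>());
            // ... then discard everything but that slice's top K
            kept.insert(kept.end(), v.begin() + i, v.begin() + i + std::min(K, end - i));
        }
        v = std::move(kept); // repeat on the survivors
    }
    std::sort(v.begin(), v.end(), std::greater<float>());
    v.resize(std::min(K, v.size()));
    return v; // the real kernel also special-cases K == 1 as a plain max
}
```

Each pass is safe because a globally top-K element is always within the top K of whichever chunk it lands in, so discarding the rest of the chunk cannot lose it.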
2025-12-12 17:53:09 +02:00
xctan fb31a19797
ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16 (llama/17448)
* ggml-cpu : add RISC-V Zvfh impl for ggml_vec_mad_f16

* ggml-cpu : dedup scalar impl

* Update ggml/src/ggml-cpu/vec.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:09 +02:00