whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Yibo Cai	ea643c6ae3	arm64: optimize q4_k_q8_k kernel with i8mm (llama/13886) This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction. Tested on neoverse-n2 with llama3 8b q4_k_m quantization model. - 34% ~ 50% S_PP uplift for all batch sizes - 12% ~ 37% S_TG uplift for batch size 4 and above Perplexity doesn't change with this PR. ``` // tested on neoverse-n2 $ llama-batched-bench \ -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \ --no-mmap -fa \ -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \ -npl 1,2,4,8,16,32 \ -t 64 --------------------------------------------------------------------- \| PP \| TG \| B \| S_PP t/s \| S_TG t/s \| \| \| \| \| original \| this pr \| original \| this pr \| \|-------\|--------\|------\|----------\|----------\|----------\|----------\| \| 128 \| 128 \| 1 \| 110.12 \| 147.83 \| 24.36 \| 24.28 \| \| 128 \| 128 \| 2 \| 121.16 \| 172.42 \| 46.36 \| 47.93 \| \| 128 \| 128 \| 4 \| 120.15 \| 169.75 \| 74.68 \| 84.00 \| \| 128 \| 128 \| 8 \| 130.97 \| 196.81 \| 91.04 \| 114.74 \| \| 128 \| 128 \| 16 \| 131.01 \| 196.88 \| 101.43 \| 135.79 \| \| 128 \| 128 \| 32 \| 130.85 \| 196.51 \| 106.97 \| 147.29 \| --------------------------------------------------------------------- ```	2025-06-01 15:14:44 +03:00
Christian Kastner	1d7b3c79f4	cmake: Factor out CPU architecture detection (llama/13883) * cmake: Define function for querying architecture The tests and results match exactly those of src/CMakeLists.txt * Switch arch detection over to new function	2025-06-01 15:14:44 +03:00
Vineel Abhinav	ccfaac2bb0	ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Algorithm (llama/13882) * F32-Mamba-Seq_Scan-SVE * Fix formatting * ggml : missing space --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-06-01 15:14:44 +03:00
Vineel Abhinav	1230d37bca	ggml: aarch64: Implement SVE F32 kernels for vector functions (llama/13843) * F32-Mamba-SVE * F32-Mamba-SVE * Resolve test errors-1 * Resolve test errors-2 * F32-vec-SVE * F32-vec-SVE * F32-vec-SVE	2025-06-01 15:14:44 +03:00
Johannes Gäßler	9a500394ad	CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)	2025-06-01 15:14:44 +03:00
leo-pony	0035b8527c	CANN: Add SOC TYPE printing in cmake configuration (llama/13837)	2025-06-01 15:14:44 +03:00
lhez	3623186312	opencl: add new ops - `argsort`, `div`, `sub`, `addrows`, `sigmoid`, `group_norm` (llama/13787) * opencl: add `argsort` * opencl: add `div` * opencl: add `add_rows` * opencl: add `sub` * opencl: add `sigmoid`, both `f16` and `f32` * opencl: add `group_norm`	2025-06-01 15:14:44 +03:00
lhez	67beac47f3	opencl: mark `mul_mat` `f32f32` as supporting non-contiguous tensors (llama/13790)	2025-06-01 15:14:44 +03:00
Jeff Bolz	47a19bae25	vulkan: use timestamp queries for GGML_VULKAN_PERF (llama/13817) Also change it to be controlled by an env var rather than cmake flag	2025-06-01 15:14:44 +03:00
Akarshan Biswas	3d5c7ca4bc	SYCL: add gelu_erf kernel (llama/13749) * SYCL: add gelu_erf kernel * refactor code Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com> * Use scope_op_debug_print --------- Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>	2025-06-01 15:14:44 +03:00
Xuan-Son Nguyen	4dfb2c2215	ggml : add ggml_repeat_4d (llama/13824)	2025-06-01 15:14:44 +03:00
Kai Pastor	ad433403ce	vulkan : Remove unexpected ; (ggml/1253)	2025-06-01 15:14:44 +03:00
Kai Pastor	4064dd6484	cmake : Fix broken CMake error messages (ggml/1252)	2025-06-01 15:14:44 +03:00
Radoslav Gerganov	fd75c4995b	ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247) The implementation is already deleted with commit 9d0762e. closes: #1235	2025-06-01 15:14:44 +03:00
Daniel Tang	4d18e52f55	ggml : Fix backtrace breaking Windows build (#3203 )	2025-05-29 13:26:58 +03:00
Radoslav Gerganov	48dddbbac1	ggml : install dynamic backends (ggml/1240)	2025-05-29 09:56:26 +03:00
Daniel Tang	5ea2c37a4c	ggml : Print backtrace on uncaught C++ exceptions (ggml/1232) The goal is to have what users call "full logs" contain the backtrace. This is registered upon ggml_init. Also fixes a minor fd leak on Linux.	2025-05-29 09:56:26 +03:00
Simon Booth	5720426d97	whisper : install shared libs when using GGML_BACKEND_DL (#3195 )	2025-05-28 10:15:04 +02:00
xctan	15ae9dc2a4	ggml : riscv: add xtheadvector support (llama/13720) * ggml : riscv: add xtheadvector support * ggml : clean up some macro usage	2025-05-27 18:03:00 +03:00
Christian Kastner	2e7a1e3e43	ggml-cpu: x86 feature detection is specific to x86 (llama/13811)	2025-05-27 18:03:00 +03:00
Diego Devesa	b75babebb2	ggml : allow CUDA graphs when using pipeline parallelism (llama/13814)	2025-05-27 18:03:00 +03:00
Georgi Gerganov	cc7a0105ef	cuda : avoid cuGetErrorString (llama/13791) ggml-ci	2025-05-27 18:03:00 +03:00
Akarshan Biswas	195fde8804	SYCL: Add non contiguous support in RMS_NORM and NORM kernels (llama/13611) * SYCL: Add non contiguous input support to norm kernel * refactor and add RMS_NORM non contiguous input support ggml-ci * restore subgroup reduction for multi-subgroup thread blocks in norm kernels * Swap grid dims of nsamples and nrows ggml-ci * Revert "Swap grid dims of nsamples and nrows" This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf. * restore not required changes ggml-ci * address review comments: change it to more like SYCL * Use a common function to calculate offset * remove wrap around logic for handling broadcasts * remove static from calculate_offset fn and use ceil_div	2025-05-27 18:03:00 +03:00
Romain Biessy	25e27904ca	sycl: Add more debug prints (llama/13640)	2025-05-27 18:03:00 +03:00
Jeff Bolz	474f7be8b6	vulkan: mark IM2COL as supporting non-contig (llama/13783)	2025-05-27 18:03:00 +03:00
Bizhao Shi	e35fecc2a1	CANN: Add the basic supports of Flash Attention kernel (llama/13627) * cann: add the basic FA support * cann: update the readme * cann: update the FlashAttention with PSEShift * cann: update the input parameters in FA * cann: update the alibi with max_bias * cann: add the constrints of softcap * cann: update the docs CANN.md * cann: update the docs CANN.md * cann: fix typo of CANN.md * cann: add some comments and update the CANN.md * cann: update the CANN.md * cann: update the inner precise for fusedInferAttention * cann: update the constraints of flash_attn_ext on ggml-cann.cpp * cann: clean the whitespace * cann: clean the whitespace * cann: add a new endline	2025-05-27 18:03:00 +03:00
Akarshan Biswas	1cd7028428	SYCL: revert "sycl: simplify bin_bcast_kernel (ggml/13383)" (llama/13752) Temporarily reverted due to failing fp16 DIV operation This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5. ggml-ci	2025-05-27 18:03:00 +03:00
Diego Devesa	99596d6031	ggml-cpu : set openmp wait time if not set (llama/13758)	2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen	2d6c6862f7	ggml : add ggml_gelu_erf() CUDA kernel (llama/13719) * ggml : add ggml_gelu_erf() CUDA kernel * missing semicolon	2025-05-27 18:03:00 +03:00
Johannes Gäßler	f1576b2659	CUDA: fix race condition in FA vector kernels (llama/13742)	2025-05-27 18:03:00 +03:00
Chenguang Li	994b4f86ab	CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705) * [CANN]Support MUL_MAT_ID Q8 && Q4 Signed-off-by: noemotiovon <757486878@qq.com> * codestyle adjustment Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen	3e7eaccf55	ggml : fix the order of ggml_unary_op (llama/13718)	2025-05-27 18:03:00 +03:00
Jeff Bolz	191f040414	vulkan: support CPY from any type to itself (llama/13695) Reuse the f16/f32 copy shaders, and just scale the number of elements according to the type size.	2025-05-27 18:03:00 +03:00
Jeff Bolz	2d49d4a9b5	vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama/13696)	2025-05-27 18:03:00 +03:00
Judd	000d65befb	use LOG_WARN to replace `std::cerr` (llama/13657)	2025-05-27 18:03:00 +03:00
Nicolò Scipione	f0803e6646	sycl : Remove waits from function calls (llama/13702) * removes the waits in async memcpy functions	2025-05-27 18:03:00 +03:00
Ewan Crawford	730a00be8a	SYCL: Avoid using with SYCL-Graph for unsupported nodes (llama/13587) Currently on a CUDA backend to SYCL when running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there are two operations that throw an exception from the blocking waits during queue recording. * `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187 * `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074 We've noticed that `ggml-cuda.cu` has the [check_node_graph_compatibility_and_refresh_copy_ops](`39e73ae0d6/ggml/src/ggml-cuda/ggml-cuda.cu (L2458-L2458)`) method for checking if a graph can be used, even if enabled. I've taken a similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking if a graph can be used for the operations even if a user has asked for it to be enabled.	2025-05-27 18:03:00 +03:00
Henry Linjamäki	316600e8ee	opencl: Add support for multiple devices (llama/12622) * opencl: Add support for multiple devices ... but limited to one platform. A platform with a GPU will be preferred. Additionally: * Filter out devices that lack capabilities needed by the backend implementation (half support, OpenCL 2.0+, etc). * Make ggml_backend_opencl_reg() thread-safe. * fixup: fix an error in sync_with_other_backends ... when there is only one OpenCL device available.	2025-05-27 18:03:00 +03:00
Henry Linjamäki	42f2b3bb65	opencl: fix couple crashes (llama/12795) * opencl: fix couple crashes * fix kernel launches failed on devices which do not support non-uniform work-groups. When non-uniform work-groups are not supported, set `local_work_size` to NULL (= let driver choose the work-group sizes). This patch does not cover everything - just the cases tested by test-backend-ops. * fix sub-buffer creation failed due to `cl_buffer_region::origin` not being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`. * OpenCL: query non-uniform WG sizes only on OpenCL 3.0+	2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen	dd6ef64060	ggml : add ggml_gelu_erf() (llama/13667) * ggml : add ggml_gelu_na (not approximated) * fix naming order * rename na --> erf * apply review suggesions * revert naming order	2025-05-27 18:03:00 +03:00
R0CKSTAR	131ee546ca	musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (llama/13647) * musa: fix build warning (unused parameter) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: upgrade MUSA SDK version to rc4.0.1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Update ggml/src/ggml-cuda/cpy.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-05-27 18:03:00 +03:00
Eve	4712f7b663	vulkan: fix warnings (llama/13626) * small fixes * remove ifdef	2025-05-27 18:03:00 +03:00
Johannes Gäßler	926fe234e9	CUDA: skip fully masked-out KV in FA vec kernel (llama/13584) * CUDA: skip fully masked-out KV in FA vec kernel	2025-05-27 18:03:00 +03:00
Svetlozar Georgiev	f44b53480f	sycl: disable reorder for sycl mulmat (llama/13536)	2025-05-27 18:03:00 +03:00
Georgi Gerganov	e04e8f1c79	metal : fix typo in FA kernel comments (llama/13651)	2025-05-27 18:03:00 +03:00
Nicolò Scipione	ee3f177cba	sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482) * Remove mmap workaround on windows After some testing I found that mmap is supported on windows and for many GPUs on Linux. Therefore I remove the workaround for windows since it is not necessary. * Update llama-bench README SYCL backend introduced a workaround that allows execution of llama-bench also without specifying `--mmp 0` flag	2025-05-27 18:03:00 +03:00
0cc4m	0b69f74e15	Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607)	2025-05-27 18:03:00 +03:00
Chenguang Li	9da3fc27be	CANN: Support MOE Model MUL_MAT_ID (llama/13042) Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-19 14:58:39 +03:00
Gilad S.	2c13651e08	cmake: use the current build config for vulkan-shaders-gen (llama/13595) * fix: use the current build config for `vulkan-shaders-gen` * fix: only pass a valid build type to `--config`	2025-05-19 14:58:39 +03:00
Jeff Bolz	13dca86c56	vulkan: move common FA code to flash_attn_base.comp (llama/13556) * vulkan: move common FA code to flash_attn_base.comp * vulkan: move common FA index/stride setup code to flash_attn_base.comp * build fix	2025-05-19 14:58:39 +03:00

1 2 3 4 5 ...

888 Commits