whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Akarshan Biswas	3d5c7ca4bc	SYCL: add gelu_erf kernel (llama/13749) * SYCL: add gelu_erf kernel * refactor code Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com> * Use scope_op_debug_print --------- Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>	2025-06-01 15:14:44 +03:00
Xuan-Son Nguyen	4dfb2c2215	ggml : add ggml_repeat_4d (llama/13824)	2025-06-01 15:14:44 +03:00
Kai Pastor	ad433403ce	vulkan : Remove unexpected ; (ggml/1253)	2025-06-01 15:14:44 +03:00
Kai Pastor	4064dd6484	cmake : Fix broken CMake error messages (ggml/1252)	2025-06-01 15:14:44 +03:00
Radoslav Gerganov	fd75c4995b	ggml : remove ggml_graph_import and ggml_graph_export declarations (ggml/1247) The implementation is already deleted with commit 9d0762e. closes: #1235	2025-06-01 15:14:44 +03:00
Daniel Tang	4d18e52f55	ggml : Fix backtrace breaking Windows build (#3203)	2025-05-29 13:26:58 +03:00
Radoslav Gerganov	48dddbbac1	ggml : install dynamic backends (ggml/1240)	2025-05-29 09:56:26 +03:00
Daniel Tang	5ea2c37a4c	ggml : Print backtrace on uncaught C++ exceptions (ggml/1232) The goal is to have what users call "full logs" contain the backtrace. This is registered upon ggml_init. Also fixes a minor fd leak on Linux.	2025-05-29 09:56:26 +03:00
Simon Booth	5720426d97	whisper : install shared libs when using GGML_BACKEND_DL (#3195)	2025-05-28 10:15:04 +02:00
xctan	15ae9dc2a4	ggml : riscv: add xtheadvector support (llama/13720) * ggml : riscv: add xtheadvector support * ggml : clean up some macro usage	2025-05-27 18:03:00 +03:00
Christian Kastner	2e7a1e3e43	ggml-cpu: x86 feature detection is specific to x86 (llama/13811)	2025-05-27 18:03:00 +03:00
Diego Devesa	b75babebb2	ggml : allow CUDA graphs when using pipeline parallelism (llama/13814)	2025-05-27 18:03:00 +03:00
Georgi Gerganov	cc7a0105ef	cuda : avoid cuGetErrorString (llama/13791) ggml-ci	2025-05-27 18:03:00 +03:00
Akarshan Biswas	195fde8804	SYCL: Add non contiguous support in RMS_NORM and NORM kernels (llama/13611) * SYCL: Add non contiguous input support to norm kernel * refactor and add RMS_NORM non contiguous input support ggml-ci * restore subgroup reduction for multi-subgroup thread blocks in norm kernels * Swap grid dims of nsamples and nrows ggml-ci * Revert "Swap grid dims of nsamples and nrows" This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf. * restore not required changes ggml-ci * address review comments: change it to more like SYCL * Use a common function to calculate offset * remove wrap around logic for handling broadcasts * remove static from calculate_offset fn and use ceil_div	2025-05-27 18:03:00 +03:00
Romain Biessy	25e27904ca	sycl: Add more debug prints (llama/13640)	2025-05-27 18:03:00 +03:00
Jeff Bolz	474f7be8b6	vulkan: mark IM2COL as supporting non-contig (llama/13783)	2025-05-27 18:03:00 +03:00
Bizhao Shi	e35fecc2a1	CANN: Add the basic supports of Flash Attention kernel (llama/13627) * cann: add the basic FA support * cann: update the readme * cann: update the FlashAttention with PSEShift * cann: update the input parameters in FA * cann: update the alibi with max_bias * cann: add the constraints of softcap * cann: update the docs CANN.md * cann: update the docs CANN.md * cann: fix typo of CANN.md * cann: add some comments and update the CANN.md * cann: update the CANN.md * cann: update the inner precise for fusedInferAttention * cann: update the constraints of flash_attn_ext on ggml-cann.cpp * cann: clean the whitespace * cann: clean the whitespace * cann: add a new endline	2025-05-27 18:03:00 +03:00
Akarshan Biswas	1cd7028428	SYCL: revert "sycl: simplify bin_bcast_kernel (ggml/13383)" (llama/13752) Temporarily reverted due to failing fp16 DIV operation This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5. ggml-ci	2025-05-27 18:03:00 +03:00
Diego Devesa	99596d6031	ggml-cpu : set openmp wait time if not set (llama/13758)	2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen	2d6c6862f7	ggml : add ggml_gelu_erf() CUDA kernel (llama/13719) * ggml : add ggml_gelu_erf() CUDA kernel * missing semicolon	2025-05-27 18:03:00 +03:00
Johannes Gäßler	f1576b2659	CUDA: fix race condition in FA vector kernels (llama/13742)	2025-05-27 18:03:00 +03:00
Chenguang Li	994b4f86ab	CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705) * [CANN]Support MUL_MAT_ID Q8 && Q4 Signed-off-by: noemotiovon <757486878@qq.com> * codestyle adjustment Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen	3e7eaccf55	ggml : fix the order of ggml_unary_op (llama/13718)	2025-05-27 18:03:00 +03:00
Jeff Bolz	191f040414	vulkan: support CPY from any type to itself (llama/13695) Reuse the f16/f32 copy shaders, and just scale the number of elements according to the type size.	2025-05-27 18:03:00 +03:00
Jeff Bolz	2d49d4a9b5	vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama/13696)	2025-05-27 18:03:00 +03:00
Judd	000d65befb	use LOG_WARN to replace `std::cerr` (llama/13657)	2025-05-27 18:03:00 +03:00
Nicolò Scipione	f0803e6646	sycl : Remove waits from function calls (llama/13702) * removes the waits in async memcpy functions	2025-05-27 18:03:00 +03:00
Ewan Crawford	730a00be8a	SYCL: Avoid using with SYCL-Graph for unsupported nodes (llama/13587) Currently on a CUDA backend to SYCL when running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` there are two operations that throw an exception from the blocking waits during queue recording. * `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187 * `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074 We've noticed that `ggml-cuda.cu` has the [check_node_graph_compatibility_and_refresh_copy_ops](`39e73ae0d6/ggml/src/ggml-cuda/ggml-cuda.cu (L2458-L2458)`) method for checking if a graph can be used, even if enabled. I've taken a similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking if a graph can be used for the operations even if a user has asked for it to be enabled.	2025-05-27 18:03:00 +03:00
Henry Linjamäki	316600e8ee	opencl: Add support for multiple devices (llama/12622) * opencl: Add support for multiple devices ... but limited to one platform. A platform with a GPU will be preferred. Additionally: * Filter out devices that lack capabilities needed by the backend implementation (half support, OpenCL 2.0+, etc). * Make ggml_backend_opencl_reg() thread-safe. * fixup: fix an error in sync_with_other_backends ... when there is only one OpenCL device available.	2025-05-27 18:03:00 +03:00
Henry Linjamäki	42f2b3bb65	opencl: fix couple crashes (llama/12795) * opencl: fix couple crashes * fix kernel launches failed on devices which do not support non-uniform work-groups. When non-uniform work-groups are not supported, set `local_work_size` to NULL (= let driver choose the work-group sizes). This patch does not cover everything - just the cases tested by test-backend-ops. * fix sub-buffer creation failed due to `cl_buffer_region::origin` not being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`. * OpenCL: query non-uniform WG sizes only on OpenCL 3.0+	2025-05-27 18:03:00 +03:00
Xuan-Son Nguyen	dd6ef64060	ggml : add ggml_gelu_erf() (llama/13667) * ggml : add ggml_gelu_na (not approximated) * fix naming order * rename na --> erf * apply review suggestions * revert naming order	2025-05-27 18:03:00 +03:00
R0CKSTAR	131ee546ca	musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (llama/13647) * musa: fix build warning (unused parameter) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: upgrade MUSA SDK version to rc4.0.1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Update ggml/src/ggml-cuda/cpy.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-05-27 18:03:00 +03:00
Eve	4712f7b663	vulkan: fix warnings (llama/13626) * small fixes * remove ifdef	2025-05-27 18:03:00 +03:00
Johannes Gäßler	926fe234e9	CUDA: skip fully masked-out KV in FA vec kernel (llama/13584) * CUDA: skip fully masked-out KV in FA vec kernel	2025-05-27 18:03:00 +03:00
Svetlozar Georgiev	f44b53480f	sycl: disable reorder for sycl mulmat (llama/13536)	2025-05-27 18:03:00 +03:00
Georgi Gerganov	e04e8f1c79	metal : fix typo in FA kernel comments (llama/13651)	2025-05-27 18:03:00 +03:00
Nicolò Scipione	ee3f177cba	sycl : Overcoming workaround for mmap() allocation on Windows (llama/13482) * Remove mmap workaround on windows After some testing I found that mmap is supported on windows and for many GPUs on Linux. Therefore I remove the workaround for windows since it is not necessary. * Update llama-bench README SYCL backend introduced a workaround that allows execution of llama-bench also without specifying `--mmp 0` flag	2025-05-27 18:03:00 +03:00
0cc4m	0b69f74e15	Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607)	2025-05-27 18:03:00 +03:00
Chenguang Li	9da3fc27be	CANN: Support MOE Model MUL_MAT_ID (llama/13042) Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-19 14:58:39 +03:00
Gilad S.	2c13651e08	cmake: use the current build config for vulkan-shaders-gen (llama/13595) * fix: use the current build config for `vulkan-shaders-gen` * fix: only pass a valid build type to `--config`	2025-05-19 14:58:39 +03:00
Jeff Bolz	13dca86c56	vulkan: move common FA code to flash_attn_base.comp (llama/13556) * vulkan: move common FA code to flash_attn_base.comp * vulkan: move common FA index/stride setup code to flash_attn_base.comp * build fix	2025-05-19 14:58:39 +03:00
Jeff Bolz	6d61a09bc4	vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)	2025-05-19 14:58:39 +03:00
Georgi Gerganov	4fedad988b	metal : add FA-vec kernel for head size 64 (llama/13583) ggml-ci	2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk	a8e17a244d	sycl : fixed compilation warnings (llama/13582)	2025-05-19 14:58:39 +03:00
Diego Devesa	0c76acd08a	gguf : use ggml log system (llama/13571) * gguf : use ggml log system * llama : remove unnecessary new lines in exception messages	2025-05-19 14:58:39 +03:00
Atharva Dubey	27964db1be	sycl: simplify bin_bcast_kernel (llama/13383)	2025-05-19 14:58:39 +03:00
Svetlozar Georgiev	8081e7a23d	sycl: reordered Q4_K MMVQ (llama/13109)	2025-05-19 14:58:39 +03:00
Łukasz Ślusarczyk	d807c497a4	sycl: use oneDNN for matrices multiplication (llama/12972)	2025-05-19 14:58:39 +03:00
Yibo Cai	8e9bf548f4	arm64: optimize q6_k_q8_k kernel with i8mm (llama/13519) This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction. Tested on neoverse-n2 with llama3 8b q6_k quantization model. - 40% ~ 54% S_PP uplift for all batch sizes - 16% ~ 47% S_TG uplift for batch size 4 and above Perplexity doesn't change with this PR. ``` // tested on neoverse-n2 $ llama-batched-bench \ -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \ --no-mmap -fa \ -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \ -npl 1,2,4,8,16,32 \ -t 64 --------------------------------------------------------------------- | PP | TG | B | S_PP t/s | S_TG t/s | | | | | original | this pr | original | this pr | |-------|--------|------|----------|----------|----------|----------| | 128 | 128 | 1 | 78.52 | 109.18 | 18.63 | 18.88 | | 128 | 128 | 2 | 84.62 | 123.94 | 34.54 | 36.92 | | 128 | 128 | 4 | 84.36 | 122.49 | 52.65 | 61.32 | | 128 | 128 | 8 | 90.52 | 138.87 | 63.46 | 84.41 | | 128 | 128 | 16 | 90.11 | 138.56 | 71.04 | 101.33 | | 128 | 128 | 32 | 89.81 | 137.79 | 75.14 | 110.47 | --------------------------------------------------------------------- ```	2025-05-19 14:58:39 +03:00
Johannes Gäßler	0dda27bc0b	CUDA: fix crash on large batch size for quant. MoE (llama/13537)	2025-05-19 14:58:39 +03:00


879 Commits