whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Akarshan Biswas	c5bb7c0078	sycl: Improve mul_mat_id memory efficiency and add BF16 fast path (llama/22119) * sycl: size mul_mat_id staging buffers by routed rows Previously src1_contiguous/dst_contiguous in ggml_sycl_mul_mat_id were sized to ggml_nelements(src1/dst), which over-allocates when ne12 > 1 and can fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero for MoE models (notably with --cpu-moe). Size them by the actual number of routed rows (ids->ne[1] * n_ids) instead. * sycl: add bf16 mul_mat fast path via DNNL When src0 is BF16 (commonly the case for lm_head / output.weight), the existing f16 path is skipped because bf16 isn't covered, and the f32 fallback dequantizes the entire src0 slab to f32 in a single pool alloc (row_diff*ne00 floats). For large-vocab models this can reach several GB and fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero. Add a bf16xbf16 -> f32 DNNL matmul fast path that uses the bf16 storage in place and only materializes a small src1 bf16 conversion buffer. bf16 matmul accumulates in f32, so it's correct even when the op requests GGML_PREC_F32 (as lm_head does). - gemm.hpp: map bfloat16 to dnnl::memory::data_type::bf16. - convert.{hpp,cpp}: expose ggml_get_to_bf16_sycl for f32/f16/bf16 -> bf16. - ggml-sycl.cpp: take the bf16 path early in ggml_sycl_op_mul_mat_sycl when DNNL and GGML_SYCL_HAS_BF16 are both available.	2026-04-30 11:29:16 +03:00
Sigbjørn Skjæret	4e32ee733b	ggml : implement set_rows with i32 index (llama/16159) * implement set_rows with i32 index * template fix * test quantized path warnings-- * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * forgotten name change * deduplicate cuda/sycl and test-fix * indent++ * vulkan: support set_rows with i32 index type (llama/16162) * disable i32 index for webgpu for now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-29 15:18:09 +03:00
Neo Zhang Jianyu	cd764eaf2b	Revert "sycl: add usage of enqueue_functions extension (llama/14244)" (llama/15910) * Revert "sycl: add usage of enqueue_functions extension (#14244)" This reverts commit 8308f98c7fb778e54bf75538f5234d8bd20915e9. * fix missed revert code, format the code	2025-09-20 13:45:28 +03:00
Akarshan Biswas	4908e9dd05	SYCL: Add set_rows support for quantized types (llama/14883) * SYCL: Add set_rows support for quantized types This commit adds support for GGML_OP_SET_ROWS operation for various quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and BF16 type in the SYCL backend. The quantization/dequantization copy kernels were moved from cpy.cpp to cpy.hpp to make them available for set_rows.cpp. This addresses part of the TODOs mentioned in the code. * Use get_global_linear_id() instead ggml-ci * Fix formatting ggml-ci * Use const for ne11 and size_t variables in set_rows_sycl_q ggml-ci * Increase block size for q kernel to 256 ggml-ci * Cleanup imports * Add float.h to cpy.hpp	2025-08-18 20:30:45 +03:00
Akarshan Biswas	ebb0e9d0ed	SYCL: use 1D kernel for set_rows (llama/14618) * SYCL: Use 1D kernel for set_rows * Remove dangling comment * Refactor and use ceil_div	2025-07-20 00:23:50 +03:00
Akarshan Biswas	3c21cde540	SYCL: Initial set_rows kernel implementation (llama/14562) * SYCL: Initial set_rows kernel implementation * Revert max_threads to 256 * Refactor set_rows and address review comments * Deduplicate conversion function * Remove guard before kernel launch and refactor * Fix and add back SFINAE	2025-07-12 19:23:56 +03:00

6 Commits