whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Radoslav Gerganov	3477fdb2e3	rpc : keep last_graph_uid in the device context (llama/23273) With the introduction of MTP we can have multiple compute contexts for the same RPC device. In this case last_graph_uid is not updated properly when contexts are being switched. This patch fixes this by moving last_graph_uid to the device context, making sure it is always updated. closes: #23242	2026-05-25 12:26:07 +03:00
Pranav Dhinakar	eb558f23cb	hexagon: add support for TRI op (llama/22822) * Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context * addressed PR review comments for TRI op * hexagon: clang format * hex-unary: remove merge conflict markers * hex-ggml: remove duplicate op cases (merge conflict) * hex-ggml: fix editor config errors --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com> Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-25 12:26:07 +03:00
Pranav Dhinakar	0a11c9fe83	ggml-hexagon: add PAD op HVX kernel (llama/23078) * ggml-hexagon: add PAD op HVX kernel Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized kernels. Supports zero-padding and circular padding across all 4 tensor dimensions. * hex-ggml: remove duplicate op cases (merge conflict) * hex-pad: fix editorconfig checks and macro alignment --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-25 12:26:07 +03:00
Intel AI Get-to Market Customer Success and Solutions	c65b082c94	sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (llama/22156) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com>	2026-05-25 12:26:07 +03:00
Intel AI Get-to Market Customer Success and Solutions	619262ad24	sycl: route small f32 matmuls to oneMKL, bypass oneDNN (llama/22150) Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com>	2026-05-25 12:26:07 +03:00
Gabe Goodhart	4fb3ccabd3	feat: Support d_conv=15 for ssm-conv.cu (llama/23017) Branch: ModalityConditionalAdapters AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2026-05-25 12:26:07 +03:00
Oliver Simons	53736a3f0e	CUDA: Continue directly including cuda/iterator (llama/23102) Cont of #22936, forgot to update one site	2026-05-25 12:26:07 +03:00
Jan Ekström	9e96e0eaf1	ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (llama/22009) * ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI For whatever reason, the files are under additional sub-path `vulkan/` under the cmake directory, which does not match either current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`), nor what gets installed when you run the cmake build+install for SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`). This allows for SPIRV-Headers to be found, as currently the CI runner's setup does not seem to include the relevant path in list of search locations. * ggml-vulkan/CMakeLists: add a check for SPIRV-Headers This is installed by the project if it is built and installed. Receiving an error during the configuration step is generally preferred to receiving an error in the middle of a build.	2026-05-25 12:26:07 +03:00
Pascal	50482cbd22	vulkan: add cpy bf16 -> f32 pipelines (llama/22677)	2026-05-25 12:26:07 +03:00
Jeff Bolz	e417ce7aeb	vulkan: Support unaligned tensors for ROPE (llama/22637)	2026-05-25 12:26:07 +03:00
Jeff Bolz	c7dd64c606	vulkan: fuse SSM_CONV + BIAS + SILU (llama/22653)	2026-05-25 12:26:07 +03:00
Winston Ma	ef5ddecff9	vulkan: removed duplicate #include <memory> in headers (llama/23144)	2026-05-25 12:26:07 +03:00
Ori Pekelman	e78e693017	ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500)	2026-05-25 12:26:07 +03:00
Dev-X25874	3583e35e0d	ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (ggml/1492)	2026-05-25 12:26:07 +03:00
Georgi Gerganov	587dca0eda	ggml : bump version to 0.12.0 (ggml/1494)	2026-05-25 12:26:07 +03:00
Aman Gupta	23f956de33	llama + spec: MTP Support (llama/22673) * spec: support MTP * fix batch size * rename files * cont : simplify (llama/7) * MTP: clean-up (llama/9) * MTP: clean-up * review: use llama_context_type instead of llama_graph_type * review: remove llama_model_has_mtp * review: fix convert issues * convert: fix pycheck * review: formatting * use `mtp-` for identifying mtp models * convert: fix mtp conversion * mtp -> draft-mtp * remove unused llama_arch * add need_embd in speculative * llama: allow partial seq_rm for GDN models for speculative decoding Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates. * fix pending state * vulkan: add GDN partial rollback * meta: extend check to axis 1 * metal: add GDN partial rollback Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend. - Add K (snapshot slot count) as a function constant - Read input state from slot 0 of the 3D state tensor - Write intermediate states to different slots during token loop - For K=1, maintain backward-compatible single-slot behavior Ref: `8c05923630` Assisted-by: llama.cpp:local pi * delta_net_base: use ggml_pad instead of new_tensor * review: add need_rs_seq * review: rename part_bounded to n_rs * review: deslop comments * review: rename, add asserts * server : adjust checkpoint logic (llama/11) * server : adjust checkpoint logic * cont : rm asserts * server-context: fix early exit * spec : fix compatibility with n-gram and add TODOs (llama/13) * metal : cleanup * llama : fix faulty bitwise check in recurrent memory * server : disable RS-based MTP in combination with other spec types * spec : add TODOs * cont : fix comment * cont : update comment * common : fix logic for ngram + mtp compat * llama-memory: enable checkpointing with partial rollback * cont: add test-case for loading into a dirty ctx * llama-memory-recurrent: clear rs_idx in clear * download: fix mtp path * llama-arch: fix enorm op * docs: update docs * conversion: fix type annotations --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-05-25 12:26:07 +03:00
Pranav Dhinakar	18a61f44b6	ggml-hexagon: cpy: add contiguous fast-path in reshape copy (llama/23076)	2026-05-25 12:26:07 +03:00
Johannes Gäßler	e62d5893f4	HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (llama/22880) Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with the FP16 accumulation for VKQ the tiles they need to be 32 logical units long in direction of the attention head; for head sizes 80 and 112 that are not exactly divided by 32 the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32 which is why it's also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout. I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 for CDNA. For RDNA3/4 I was not able to get better performance that the tile kernel for head sizes > 128.	2026-05-25 12:26:07 +03:00
Zheyuan Chen	13133ab299	ggml-webgpu: makes the flash attn vec path subgroup-aware (llama/23040) * ggml-webgpu: makes the flash attn vec path compile and size its split/reduce work from the device’s reported subgroup range instead of assuming 32 subgroup size. * ggml-webgpu: remove the extra max_wg_size >= max_subgroup_size guard. Remove hardcoded 32 when determine the value of reduce_wg_size and vec_nwg_cap	2026-05-25 12:26:07 +03:00
Georgi Gerganov	592a8cd15d	logs : reduce (llama/23021) * logs : reduce * args : fix envs * server : fix build * common : print verbosity level at start * server : clean-up logs * server : print prompt processing timings + sampling params * minor : whitespaces	2026-05-25 12:26:07 +03:00
alex-spacemit	f0223903aa	ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863)	2026-05-25 12:26:07 +03:00
Ruben Ortlam	97ba44338f	vulkan: fix matmul integer pipeline selection (llama/23005) * vulkan: fix matmul integer pipeline selection * gate pipeline creation with the right bools	2026-05-25 12:26:07 +03:00
Katostrofik	06cfc3653b	SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (llama/21597) * SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM. zeMemAllocDevice uses the SVM/P2P path with no host staging. On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes. With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with no performance regression. All Level Zero calls include automatic fallback to the original SYCL allocation path if Level Zero interop is unavailable. * SYCL: address review feedback - remove try/catch, check device types, deduplicate - Remove try/catch from malloc/free/memcpy helpers, check backend and device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu) - Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp and declare in common.hpp to eliminate code duplication - Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls - Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the host-staged path for iGPU-to-dGPU transfers - Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH) in CMakeLists.txt (co-authored with @arthw) * SYCL: add build/runtime flags for Level Zero, address review feedback Implements the architecture suggested by @arthw: compile-time and runtime flags to cleanly separate Level Zero and SYCL memory API paths. - Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level Zero code is wrapped in #ifdef so the build works on systems without the Level Zero SDK installed (e.g. CPU-only CI servers). Both the loader library and headers are checked before enabling. - Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls whether Level Zero or SYCL memory APIs are used. Only one API style is used per session, no mixing. If Level Zero is enabled but the devices don't support the Level Zero backend, it auto-disables with a warning. - Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory is not called anywhere in the backend) and used try/catch for flow control. - Update SYCL.md with documentation for both new parameters. Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development (Claude). Code reviewed and tested on my hardware. * SYCL: unify Level Zero malloc/free call sites, address review feedback Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device. Both functions are now unconditionally available — Level Zero code is #ifdef'd inside the functions, not at call sites. All call sites use uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks. Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack traces on failure, eliminate duplicated #ifdef/else patterns at 6 call sites (-29 lines net). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs so the Level Zero code path is compiled and tested in CI. Fix two bugs found during extended dual-GPU testing (no ONEAPI_DEVICE_SELECTOR set): - The Level Zero backend check was iterating all SYCL devices including CPU. The OpenCL CPU device caused Level Zero to be disabled for the GPUs, defeating the fix on multi-GPU systems. Added is_gpu() filter so only GPU devices are checked. - sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers) were still calling sycl::malloc/sycl::free directly, bypassing the Level Zero path. Routed through ggml_sycl_malloc_device/free_device for consistency with the other device memory call sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: address arthw review feedback on Level Zero memory API structure - Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp; only ggml_sycl_free_device (used by common.cpp) stays in common.cpp - Switch both helpers to use g_ggml_sycl_enable_level_zero global instead of per-call queue backend checks - Remove #ifdef wrapper from global definition; always declare at 0, add #else branch in init block so it stays 0 when L0 not compiled in - Update init loop comment to explain GPU-only device check - CMakeLists: message(STATUS) before the if block; align option wording AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed <5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * SYCL: remove unused cstdio/cstdlib includes from common.cpp Leftover from the deleted ggml_sycl_queue_supports_level_zero helper. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Apply suggestions from code review Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * SYCL: preserve Level Zero allocation path during early malloc * ci: fix Level Zero package conflict in Intel Docker build * ci: find Level Zero loader in oneAPI package step * ci: allow Windows SYCL package without Level Zero DLL --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-05-25 12:26:07 +03:00
Daniel Bevenius	e414ecf674	cmake : add CMakePresets.json [no ci] (#3808 ) This commit adds a CMakePresets.json file similar to the one in llama.cpp. The motivation for this is that this provides sharable named configuration which can be used with cmake --preset <name>. It also allows for extendins these preset with a CMakeUserPresets.json for specific hardware (like CPUs), architectures, and toolchains etc.	2026-05-25 12:25:15 +03:00
OrbisAI Security	b3877e10c0	fix: in bindings/ruby/test/jfk_reader/jfk_reader in jfk_reader.c (#3756 ) * fix: V-002 security vulnerability Automated security fix generated by Orbis Security AI * fix(ruby): use Ruby allocator macros in jfk_reader and fix memory leak - Replace calloc/free with ALLOC_N/xfree to match Ruby binding conventions (ALLOC_N handles overflow checking and raises NoMemoryError on failure) - Free temporary samples buffer after conversion loop (was leaked) - Add NULL check for fopen return value with rb_raise - Add comment clarifying n_samples is a compile-time constant Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(ruby): return false instead of rb_raise in memory_view callback rb_memory_view_get_func_t callbacks should communicate errors via return value (false), not exceptions. rb_memory_view_get has no exception-handling wrapper around get_func calls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * replacing ALLOC_N with rb_protect as ALLOC_N raises Ruby exceptions --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-25 08:19:23 +02:00
Pascal	0ccd896f5b	common : fix server /inference fails to decode in-memory audio (regression) (#3818 ) * common: add memory buffer overload of read_audio_data whisper-server /inference without --convert passed the uploaded file bytes to read_audio_data as a filename, so ma_decoder_init_file tried to open a path starting with "RIFF" and failed. every request returned HTTP 400 "Invalid request" on builds without WHISPER_FFMPEG, which is the default. factor the PCM extraction into a shared helper and add an overload that decodes straight from a memory buffer via ma_decoder_init_memory, which the function already used for the stdin path. server now calls it with the upload content. the filename overload behavior is unchanged.	2026-05-22 08:27:35 +02:00
Daniel Bevenius	8443cf05e3	ci : use github ubuntu-22.04-arm runner instead of qemu (#3815 ) * ci : use github ubuntu-22.04-arm runner instead of qemu This commit updates the ubuntu-22-gcc-arm64 job to use a arm github runner instead of QEMU. The motivation for this is that we get intermittent failure specifically related to QEMU. For example: ```console Segmentation fault (core dumped) qemu: uncaught target signal 11 (Segmentation fault) - core dumped Segmentation fault (core dumped) dpkg: error processing package libc-bin (--configure): installed libc-bin package post-installation script subprocess returned error exit status 139 Processing triggers for ca-certificates (20240203~22.04.1) ... Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done. Running hooks in /etc/ca-certificates/update.d... done. Errors were encountered while processing: libc-bin E: Sub-process /usr/bin/dpkg returned an error code (1) ``` This is an attempt to try to avoid QEMU and hence avoid this issue. * ci : remove QEMU where possible	2026-05-21 10:59:58 +02:00
Daniel Bevenius	afa2ea544f	whisper : set bench data for each iteration (#3812 ) * whisper : set bench data for each iteration This commit updates whisper_bench_ggml_mul_mat_str to intialize the tensors data for each iteration. The motivation for this is that is currently possible for a previous run's results, F32 values, to leak into the next run. When it is time for the F16 iteration then F32 results can cause NaN values to appear in the tensor values causing the F16 iteration to fail. Refs:https://github.com/ggml-org/whisper.cpp/actions/runs/25901678402/job/76152894644?pr=3735 * ci : set GGML_NATIVE=OFF if x86_64 This commit sets GGML_NATIVE=OFF for x86_64 architectures. The motivation for this is to try to get CI to pass and the theory is that the libggml-cpu.so library in the ccache might have been built by a runner that supports a different instruction set. When another runner that does not support that instruction set tries to use it, it will fail with a segmentation fault. I'm not sure about this yet but going to try this out and if it does not work I'll ssh into the runner to debug further.	2026-05-19 08:58:43 +02:00
petterreinholdtsen	47b9eb37a3	examples : fix memory leak in read_audio_data (#3810 ) This commit addresses a memory leak in the `read_audio_data` function where it is currently possible that a call to `ma_decoder_init_file` succeeds and the function returns early without calling `ma_decoder_uninit`. A similar situation can occur with `ma_decoder_init_memory`. Refs: https://bugs.debian.org/1124796 Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2026-05-18 12:16:39 +02:00
Andreas Lubbe	6227a0ef73	server : Return speaker information in JSON (#3782 )	2026-05-18 09:18:04 +02:00
Andreas Lubbe	968eebe772	server: add support for carry_initial_prompt (#3781 ) * Add support for carry_initial_prompt on the server * Update README	2026-05-15 14:03:17 +02:00
Georgi Gerganov	46ca43d639	talk-llama : sync llama.cpp	2026-05-14 21:26:48 +03:00
Georgi Gerganov	69500f5502	sync : ggml	2026-05-14 21:26:48 +03:00
Zheyuan Chen	e4ce42e55f	ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (llama/23020)	2026-05-14 21:26:48 +03:00
scutler-nv	97371e9285	Fix for issue #22974 . Cast intermediate results to float before adding and casting the result to the destination type. Avoids half+half operator ambiguity. (llama/22994)	2026-05-14 21:26:48 +03:00
shaofeiqi	d4a4d87f0e	opencl: add q5_0 and q5_1 MoE for Adreno (llama/22985) * opencl: add q5_0 moe support * opencl: add q5_1 moe support * opencl: avoid potential leak * opencl: suppress unused var warning when building for non-Adreno --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-05-14 21:26:48 +03:00
lhez	b19beb6027	opencl: fix crash when warming up MoE on Adreno (llama/22876)	2026-05-14 21:26:48 +03:00
Masashi Yoshimura	1cbbd0b6d0	flush the gpu profile timestamp before the queryset is overflowed (llama/22995)	2026-05-14 21:26:48 +03:00
Max Krasnyansky	cb7d38bf18	hexagon: add unary tanh op (llama/22999)	2026-05-14 21:26:48 +03:00
Sachin Sharma	8b288f5d96	ggml-zendnn : adaptive fallback to CPU backend for small batch sizes (llama/22681) * ggml-zendnn : add runtime env var GGML_ZENDNN_ADAPTIVE_FALLBACK to control adaptive fallback (default: enabled) * ggml-zendnn : restore original fallback logic when adaptive fallback is disabled	2026-05-14 21:26:48 +03:00
Trivikram Reddy	bcaf449826	hexagon: eliminate scalar VTCM loads via HVX splat helpers (llama/22993) * hexagon: add hvx_vec_repl helpers and use those for splat-from-vtcm usecase * hmx-mm: optimize per-group scale handling * hmx-fa: optimize slope load from vtcm * hmx-fa: use aligned access where possible in hmx-utils * hexagon: add hvx_vec_repl_2x_f16 helper and consolidate repl helpers --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-14 21:26:48 +03:00
yzyyzyhhh	1caed1d2ba	opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill (llama/22755) * ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill * ggml-opencl: address Adreno xmem review comments * ggml-opencl: align xmem gemm kernel naming --------- Co-authored-by: Your Name <your@email.com>	2026-05-14 21:26:48 +03:00
Masashi Yoshimura	e8a7cd314f	ggml-webgpu: Enables running gpt-oss-20b (llama/22906) * Enable to run gpt-oss-20b and refactor mulmat-q * disable test-backend-ops in ubuntu-24-webgpu	2026-05-14 21:26:48 +03:00
Chen Yuan	a9bcbf5595	ggml-webgpu: address precision issues for multimodal (llama/22808) * fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32 * fix(unary): correct the gelu, gelu quick and gelu erf functions * fix(flash-attn-tile): fix the hardcode v type * fix(flash_attn): fix tile path * fix: pass editorconfig and address the type conflicts * fix: remove reduant pipeline keys * fix: remove inline min/max group size functions and revert the flash attn path order * fix: use clamp to avoid NaN for GELU * fix: use the right range for exp, 80 is safer for f32 exp	2026-05-14 21:26:48 +03:00
Masato Nakasaka	be5a35ccee	vulkan: Fix Windows performance regression on Intel GPU BF16 workloads for Xe2 and newer (llama/22461) * refactor * Use l_warptile only when coopamt is available for BF16	2026-05-14 21:26:48 +03:00
Jeff Bolz	20895abdbd	vulkan: Check shared memory size for mmq shaders (llama/22693)	2026-05-14 21:26:48 +03:00
guyfischman	8ec91c91e1	metal : promote mul_mv/mul_mm batch divisors to function constants (llama/22711) * metal : promote mul_mv/mul_mm batch divisors to function constants * metal : take op directly in get_pipeline_mul_mv_ext	2026-05-14 21:26:48 +03:00
Shawn Gu	ea4652c427	opencl: add q4_1 MoE for Adreno (llama/22856) * Q4_1 MoE CLC pass sanity check * remove unnecessary code * opencl: remove unnecessary asserts and reformat * opencl: fix supports_op for q4_1 moe * q4_1 moe is supported by Adreno with certain shapes --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-05-14 21:26:48 +03:00
CrispStrobe	287f637fb1	CUDA: handle OW > 65535 in im2col (2D and 3D) (llama/22944) `im2col_cuda` and `im2col_3d_cuda` both dispatch with `block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet at 11 s lands at OW = 176000 -- and the launch returns `invalid configuration argument`. Clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside the kernel with stride `MAX_GRIDDIM_Y`. Same in-kernel stride pattern already used for the z axis (`MAX_GRIDDIM_Z`). Both 2D `im2col_kernel` and 3D `im2col_3d_kernel` need the same fix. Bit-identical for OW <= 65535 (single iteration of the new outer loop). Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s / 16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns `invalid configuration argument`, post-fix runs to completion. Existing test-backend-ops im2col cases unchanged.	2026-05-14 21:26:48 +03:00
Pascal	449b33fc8f	Ggml/cuda snake fusion hardening (llama/22912) * cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan) * cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review) * cuda: merge type_ok and types_ok into a single types_ok (address am17an review) * cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16 bin_bcast only dispatches F32/F16 type triplets, mirror the vulkan filter so unsupported types fall back through cpy instead of aborting. * test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases	2026-05-14 21:26:48 +03:00

1 2 3 4 5 ...

4503 Commits All Branches Search

4503 Commits

All Branches