whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Winston Ma	816c3029bc	vulkan: fix wrong index variable in inner loop (llama/23665)	2026-05-29 09:47:30 +03:00
Winston Ma	5db94bac04	vulkan: Fix memory logger unsafe iterator access (llama/23667)	2026-05-29 09:47:30 +03:00
fairydreaming	60e420ff6a	cuda : fix KQ mask offset integer overflow in fattn MMA kernel (llama/23610) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-05-29 09:47:30 +03:00
Martin Klacer	8e40325876	ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (llama/22841) * Updated vec.h/vec.cpp code to accumulate to F32 rather than F16 Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8 Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>	2026-05-29 09:47:30 +03:00
ymcki	d284e1c3aa	Hexagon: OP_GATED_DELTA_NET K>1 support (llama/23531) * K>1 state snapshot support * removed picky indent multiple of 4 fixes	2026-05-29 09:47:30 +03:00
ymcki	7e843a80e1	opencl: OP_GATED_DELTA_NET (llama/23312) * OP_GATED_DELTA_NET impl * add back lanes_per_column declaration * removed has_subgroup_arithmetic and has_subgroup_clustered_reduce * removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot * support for K>1 state snapshot * removed picky indent multiple of 4 fixes * removed return that won\'t be executed	2026-05-29 09:47:30 +03:00
Reese Levine	8c8f213dac	ggml-webgpu: remove legacy constants (llama/23672)	2026-05-29 09:47:30 +03:00
Max Krasnyansky	3bbe93378c	hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (llama/23647) * hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now * hmx-mm: add support for Q4_1 * hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot * hexagon: fix repack scratch buffer overflow * hex-mm: fix Q4_1 repack buffer sizing * hexagon: flip the build order for mm and fa (seems to help LTO) * hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1 * hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output * hexagon: resurrect early-wake and add support for polling for op-batch completions With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax. This is a good thing! But it does add extra latency for the pure benchmark runs. Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking. --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2026-05-29 09:47:30 +03:00
Masashi Yoshimura	a52bd385d6	ggml-webgpu: Fix how to dispatch WG to some ops (llama/23750)	2026-05-29 09:47:30 +03:00
Matt Corallo	8bce478ee8	vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (llama/22887) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 Against mesa git, this shows a 4.8% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. Note that this breaks some tests until the last commit which fixes OOB A reads. * vulkan: Use aligned loads in mul_mat_vec when available Against mesa git, this shows a 3.3% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. * Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec Mesa's UUB logic can't see through conditionals, limiting its ability to understand the bounds on the `num_rows` field in the cleanup run. Making it explicit that `num_rows` is, indeed, always <= `NUM_ROWS` helps mesa make slightly better codegen. Against mesa git, this currently shows a 1% performance improvement in tg128 on Qwen3.5-9B:BF16 on Intel BMG. * vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes There was a TODO to fix the OOB reads from the A matrix which we do here. It is within performance noise (+<0.1%) in tg128 for Qwen3.5-9B:BF16 on Intel BMG.	2026-05-29 09:47:30 +03:00
Jeff Bolz	1b590bbb9a	vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (llama/23541)	2026-05-29 09:47:30 +03:00
l8bloom	c5cde8c717	vulkan: add REPEAT op support for f16 to f16. (llama/23298) * feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op * chore: rename repeat_f* to repeat_i* * chore: rename repeat vulkan pipelines	2026-05-29 09:47:30 +03:00
Oliver Simons	98c6722fec	CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (llama/23742)	2026-05-29 09:47:30 +03:00
Winston Ma	80e87ec453	vulkan: avoid preferring transfer queue on AMD UMA devices (llama/22455)	2026-05-29 09:47:30 +03:00
Vladislav	6a249cd640	ggml-zendnn : fixed naming of matmul function (llama/20964) * ggml-zendnn: fixed naming of matmul function * ggml-zendnn: fixed naming of mul_mat_id function * ggml-zendnn: fixed print in mul_mat_id --------- Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>	2026-05-29 09:47:30 +03:00
Jeff Bolz	a0efd13f0f	vulkan: optimize conv2d and implement coopmat1 support (llama/22620) * vulkan: add CONV_SHAPE_64x128 for medium-K conv2d * vulkan: skip conv2d bounds checks when shapes align with tile sizes * vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d * vulkan: stage cm2 conv2d accumulator through shmem before global store * vulkan: add coopmat1 conv2d path * fallback when using too much shared memory. clean up comments * Require 16x16x16 and subgroup size 32 or 64 * check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values	2026-05-29 09:47:30 +03:00
Max Krasnyansky	f8df28d331	hexagon: add support for CONCAT op (llama/23648) * hexagon: add support for CONCAT with optimized concat_2d_transposed qwen3.5 models are quite heavy on the CONCAT with large and transposed src1. * hex-concat: use fastdiv in generic version * hex-concat: make checks for transposed a bit more readable * hex-concat: reoder dma ops for better pipelining * hex-cont/cpy: optimize CPY and CONT ops The primary change is to avoid scalar divs in the inner loops. We were calling hvx_copy_uu(... type_size) where type_size is non a constexpr. This causes runtime divs by that value which is normally just 4 or 2 (f32/f16). * hex-get-rows: optimize GET_ROWS for large rows We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models that do lots of GET_ROWS with huge (2MB+ rows). Also bump the DMA queue depth now that we can take advantage of it. * hex-concat: unroll the inner loops of concat_2d * hex-concat: more updates to concat_2d to improve perf a bit further * hex-cpy: fixed n_rows per thread checks in the copy ops * hmx-fa: fix alignment issues while computing dma sizes * hex-set-rows: add early returns for idle threads * hvx-rope: minor optimization to replace loops with fastdiv logic * hex-rope: replace scalar tail processing with HVX * hex-rope: optimize rope cache init with HVX Add hvx-utils sin/cos helpers that use an aprox method (similar to rsqrt, inverse, etc) Use the helpers to optimize ROPE.	2026-05-29 09:47:30 +03:00
Alexey Kopytko	049f0af339	SYCL: implement ggml_sycl_pool_vmm (llama/22862) * SYCL: implement ggml_sycl_pool_vmm * Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM * Clean up debugging logging * document GGML_SYCL_DISABLE_VMM * Multi-stream MoE optimization * Revert "Multi-stream MoE optimization" This reverts commit 938929c3f13a562ec67c59e87cc5d38595444cce. * Update common.hpp Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM * add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro) * Apply suggestions from code review Co-authored-by: Alexey Kopytko <alexey@kopytko.com> * Apply suggestion from @sanmai * Apply suggestion from @sanmai --------- Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>	2026-05-29 09:47:30 +03:00
Masashi Yoshimura	00a5110b19	ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (llama/23594) * ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K * Fix to editorconfig checking pass * Remove mul-mat-legacy pipeline * Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx	2026-05-29 09:47:30 +03:00
Nikhil Jain	bc77933c2d	Check batch_compute_passes before sending passes when not doing GPU profiling (llama/23457) * Only run webgpu CI on my fork * Add webgpu only workflow * refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled * restore build.yml	2026-05-29 09:47:30 +03:00
Johannes Gäßler	2307712d32	CUDA: missing PDL sync for FWHT, better fallback (llama/23690)	2026-05-29 09:47:30 +03:00
forforever73	1c477d4056	metal : add apple device id (llama/23566) Co-authored-by: lvyichen <lvyichen@stepfun.com>	2026-05-29 09:47:30 +03:00
Aman Gupta	205ee5a189	CUDA: add fast walsh-hadamard transform (llama/23615) * CUDA: add fast walsh-hadamard transform * review: add unrolls + change size_t -> int * warp size 64 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-05-29 09:47:30 +03:00
Georgi Gerganov	1cf8e3a903	ggml : bump version to 0.13.0 (ggml/1510)	2026-05-25 12:44:04 +03:00
Johannes Gäßler	bcff515150	TP: fix ggml context size calculation (llama/22616) * TP: fix ggml context size calculation, memory leak * move split state cache back into the context * revert to constant ggml context size for cgraphs * increase headroom for statically allocated tensors * remove obsolete include	2026-05-25 12:44:04 +03:00
Gilad S	2979e5f95f	ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (llama/22341) * ggml: implement `gguf_init_from_buffer` * test: `gguf_init_from_buffer` * fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer * fix: use `GGML_UNUSED` Co-authored-by: Copilot <copilot@github.com> * fix: remove `total_size` from `gguf_reader` * fix: file offset calculation, rename `offset` to `data_offset` Co-authored-by: Copilot <copilot@github.com> * refactor: extract model loader bug fixes to another PR * feat: add `gguf_init_from_callback` * fix: always require a max expected size * fix: change `gguf_reader_callback_t`'s `output` type to `void `, change `max_expected_size` and offsets to `uint64_t` fix: harden against offset overflow in buffer read * fix: remove seek behavior from the callback * feat: `max_chunk_read == 0` means `SIZE_MAX` * fix: seeking in a gguf file with no tensors --------- Co-authored-by: Copilot <copilot@github.com>	2026-05-25 12:44:04 +03:00
Georgi Gerganov	946d6813b9	ggml : bump version to 0.12.1 (ggml/1508)	2026-05-25 12:26:07 +03:00
Jeff Bolz	a369b3949c	ggml : Parallelize quant LUT init (llama/23595) - Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl. - Move the OpenMP detection from ggml-cpu to ggml-base. - Update OpenMP dependencies in ggml-config.cmake.in.	2026-05-25 12:26:07 +03:00
Johannes Gäßler	3306af62b1	TP: fix entirely zero-sized slices per device (llama/23525)	2026-05-25 12:26:07 +03:00
shaofeiqi	1435988ab3	opencl: batch profiling to improve speed and prevent memory leaks (llama/23495)	2026-05-25 12:26:07 +03:00
Yiwei Shao	b84d03487c	hexagon: apply repl optimization in flash attn softmax as #22993 (llama/23455)	2026-05-25 12:26:07 +03:00
dskwe	511f8602b1	ggml : Check the right iface method before using the fallback 2d get (llama/23514)	2026-05-25 12:26:07 +03:00
Jeff Bolz	6b85d73b33	vulkan: fix windows find_package of SPIRV-Headers (llama/23215) * vulkan: fix windows find_package of SPIRV-Headers * not windows-only	2026-05-25 12:26:07 +03:00
Shawn Gu	aefffa1fa5	opencl: generalize Adreno MoE kernels on M (llama/23449)	2026-05-25 12:26:07 +03:00
Alexey Kopytko	b0c9f90059	SYCL: improve MoE prefill throughput (llama/23142) - change `k_copy_src1_to_contiguous` so that uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a know starts and ends - switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity	2026-05-25 12:26:07 +03:00
Alexey Kopytko	21c65a78a3	sycl : Level Zero detection in ggml_sycl_init (llama/23097) * [SYCL] Centralize Level Zero detection in ggml_sycl_init * use the same wording * get back the warning	2026-05-25 12:26:07 +03:00
karavayev	0416feecfc	SYCL : gated_delta_net K>1 (llama/23174) * sycl_gated_delta_net K>1 * editor_config	2026-05-25 12:26:07 +03:00
Katostrofik	6fb7f1af2c	SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (llama/21580) * SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup BF16 models had no dedicated token generation kernel — they fell through to the generic full-GEMM path, resulting in ~14% memory bandwidth utilization on Intel Arc GPUs. This adds BF16 support to the DMMV (dequantize mul-mat-vec) path, matching the existing F16 implementation. Fixes #20478 * SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0 The qk=1 kernel (used for F16 and BF16) iterates with stride 2GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When ncols is a multiple of DMMV_X (32) but not of 2DMMV_X (64), the last warp iteration accesses elements at col >= ncols, producing NaN for the final row and wrong values for interior rows. Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] % (2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16 launcher to match. Quantized types use block-structured kernels with different access patterns and keep the existing DMMV_X check. Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70. Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 12:26:07 +03:00
Sachin Sharma	2d629533a5	ggml-zendnn : add Q8_0 quantization support (llama/23414) * ggml-zendnn : add Q8_0 quantization support * ggml-zendnn : sync with latest ZenDNN * ggml-zendnn : address review comments for Q8_0	2026-05-25 12:26:07 +03:00
Johannes Gäßler	ec183556c6	CUDA: fix PDL CC check for JIT compilation (llama/23471)	2026-05-25 12:26:07 +03:00
Pascal	8402c36039	vulkan: fuse snake activation (mul, sin, sqr, mul, add) (llama/22855) * vulkan: fuse snake activation (mul, sin, sqr, mul, add) Add snake.comp shader with F32 / F16 / BF16 pipelines and ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op decomposition emitted by audio decoders (BigVGAN, Vocos) for snake activation y = x + sin(ax)^2 inv_b and rewrites it to a single elementwise kernel. test_snake_fuse from the CUDA PR now also compares CPU naive vs Vulkan fused across F32 / F16 / BF16. * vulkan: address jeffbolznv review for fused snake activation Rename T / C to ne0 / ne1 in the shader and push constants to match the standard naming convention used across the Vulkan backend. Tighten ggml_vk_can_fuse_snake: require x and dst to be contiguous (the shader uses idx = i0 + i1 * ne0) and require a / inv_b to be tightly packed on the broadcast dim (the shader reads data_a[i1]). * vulkan: tighten snake fusion type checks for all operands (address jeffbolznv review) * vulkan: reject snake fusion when ne[2] or ne[3] > 1 (address jeffbolznv review) * vulkan: address 0cc4m review for fused snake activation snake.comp is renamed to follow the ggml DATA_A_* / A_TYPE convention. A_TYPE now applies to the activation tensor data_a instead of the broadcast multiplier, and the bindings become data_a (A_TYPE), data_b (float), data_c (float) and data_d (D_TYPE). A header at the top of the shader maps each buffer to its role in y = x + sin(b * x)^2 * c. On the C++ side, ggml_vk_can_fuse_snake reuses the existing snake_pattern constant instead of duplicating the op list, sin_node is extracted as a named local alongside the other chain nodes, and the broadcast operands a and inv_b are now required to be GGML_TYPE_F32 to match the hardcoded float bindings on data_b and data_c (the previous a->type == x->type would silently reject any future BF16 or F16 chain once the supports_op gate for SIN / SQR is lifted). ggml_vk_snake_dispatch_fused gets an explicit GGML_TYPE_F32 case and GGML_ABORT on default in place of the silent f32 fallback, and a stale comment about data_a[i1] / data_inv_b[i1] is refreshed to match the new binding names.	2026-05-25 12:26:07 +03:00
Chen Yuan	c436f1419f	fix(flash-attn): replace f32 with kv_type and q_type (llama/23372)	2026-05-25 12:26:07 +03:00
Georgi Gerganov	158d93c836	metal : optimize concat kernel and fix set kernel threads (llama/23411) * metal : fix GGML_OP_SET kernel threads * tests : extend test_cpy to support different src/dst shapes Extend test_cpy to support different source and destination tensor shapes for CPY operations (reshaping), where the total number of elements must match. - Renamed ne -> ne_src, added ne_dst parameter (default: use src shape) - Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions - Tests exercise 1024 boundary, small shapes, and large dimensionality changes - Fixed dangling reference bug (storing & to temporary std::array) - Updated all existing test calls with permute/transpose args for compatibility Assisted-by: llama.cpp:local pi * metal : optimize concat kernel with row batching for small widths When ne0 < 256, batch multiple rows into a single threadgroup to improve occupancy. This avoids underutilizing the GPU when processing narrow tensors. - Dispatch nth = min(256, ne0) threads per group - Calculate nrptg (rows per threadgroup) to fill up to 256 threads - Update kernel index calculation to handle the row batching - Add boundary check for i1 >= ne1 Assisted-by: llama.cpp:local pi * tests : clean-up * tests : refactor CPY shape tests to use dimension permutations Replace 75 hardcoded test cases with a loop over permutations of {3, 5, 7, 32} (total elements: 3360). Each src permutation is tested against canonical sorted and reverse dst, skipping identical shapes. Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32). Assisted-by: llama.cpp:local pi	2026-05-25 12:26:07 +03:00
Matt Corallo	03da9f17f4	ggml : Check the right iface method before using the fallback 2d get (llama/23306) Probably no backends implement only one of 2d get/set, but this might be annoying for some future backend developer trying to add 2d get/set.	2026-05-25 12:26:07 +03:00
Todor Boinovski	6d1d66de40	hexagon: ssm-conv fix for large prompts (llama/23307) * hexagon: remove gathers and better handling of vtcm in ssm-conv * hexagon: relax ssm-conv gating requirements * hexagon: add new prefill ssm-conv backend test * hexagon: remove trailing white space * hex-rope: uninline rope_cache_init, otherwise it breaks after rebaseing with SSM_CONV changes --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-05-25 12:26:07 +03:00
lhez	896718eacf	opencl: refactor backend initilization (llama/23318) * opencl: refactor initialization * opencl: refactor GPU identification * opencl: rename for consistency * opencl: cache global mem size in dev_ctx * opencl: adjust log level * opencl: load argsort and flash_attn kernels in supports_op * argsort kernel must be built for supports_op for querying the max workgroups * flash_attn kernel has many variants, only load them when needed	2026-05-25 12:26:07 +03:00
Daniele	ad717a6de0	vulkan: optimize operations in the IM2COL shader (llama/22685) * vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting	2026-05-25 12:26:07 +03:00
Max Krasnyansky	b93a5ba605	hexagon: HMX quantized matmul rework (llama/23368) * hmx-mm: update debug logging in hmx-mm * hmx-mm: update dequant logic to use HVX_vector_x2/4 * hmx-mm: remove non-pipelined version of the quantize matmul It seems that we don't reall need non-pipelined version * hmx-mm: use activation depth mode and update naming Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com> * hex-mm: minor hmx matmul naming updates * hmx-mm: remove unused vars * snapdragon: scripts bump default ubatch-size to 1K * hexagon: combine HMX and power and clock settings into a single set_power call * hmx-mm: remove leftover of the scale repl helper * hexagon: fix editconf error --------- Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>	2026-05-25 12:26:07 +03:00
Andreas Kieslinger	3fa19558f2	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522) * Adds initial PDL setup. * Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst. * Further optimization pass of the first half of kernels * Optimized PDL barriers for the second batch of kernels * Further refinements after rebase. * Moves pdl logic to separate function, removes some whitespace * Strips post-hoc PDL logic * Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to overlap execution with previous kernels * Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL * Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL * Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx, to enable hip/musa compatibility * Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32 * Enrolls flash_attn_combine_results * Fix: Drops needless and broken check of CUDA arch for PDL. PDL either works or is without effect. * Enrolls flash-attention kernels to pdl * Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for kernels args. This fixes PDL. * Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via and template alias and template expansion * Enrolls all remaining kernels for qwen3-coder-next into PDL * Remove all PDL LC calls to create a baseline * Added LC according to internal guidance and tested kernel performance. * Enrols missing qwen3-5 kernels passively into PDL. * Kernel optimizations (LC signals) for qwen3.5 * Enrolls ssm-scan kernels into PDL * Adds GGML_CUDA_PDL command line option to toggle PDL. * Fix: Ada and lower compilation by guarding PDL calls correctly * Cleanup: Removes commented out GGML_CUDA_PDL_LC * Cleanup: Removes experimental comments * Adds 90-virtual to build script so that Hopper GPUs can leverage PDL. * Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL. * Fix: Correct PDL en/disablement based on device-side arch check. Host side check is UB. Required moving from macros to inlined functions * Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1 * Enable PDL by default for Hopper+ devices * Enrolls softcap_f32 and two flash_attn kernels into PDL. * Improves flash attn PDL barrier placement * Fix: Perf regression on ada; excludes ada and below from PDL launches * Improves some sync barrier placements * Drops superfluous constructor * Adds #endif guard comments * Reverts experimental change to top-k-moe.cu, which moved expensive allocations in front of the PDL barrier. It did not have a meaningful impact. * Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0 PDL is disabled * Revert "Drops superfluous constructor". Adds const to remaining arguments This reverts commit 12b1d250da0089ae02a9bb71bbb3fd6d70f6f2f1. * Cleanup: Removes and fixes some comments and whitespace * Clarifies comment of sync-barrier position * Relocates and refactors PDL launch functions and accessories * Adds error checking to the regular kernel launch path * Drops "auto" in favor of "ggml_cuda_kernel_params" * Adds "const" to ggml_cuda_kernel_launch_params * [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy	2026-05-25 12:26:07 +03:00
Georgi Gerganov	c58fc465df	metal : optimize pad + cpy (llama/23354) * metal : optimize pad * metal : optinmize cpy * cont : better row packing in threadgroup	2026-05-25 12:26:07 +03:00

1 2 3 4 5 ...

2526 Commits