Commit Graph

4554 Commits

Author SHA1 Message Date
Nikhil Jain bc77933c2d Check batch_compute_passes before sending passes when not doing GPU profiling (llama/23457)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled

* restore build.yml
2026-05-29 09:47:30 +03:00
Johannes Gäßler 2307712d32 CUDA: missing PDL sync for FWHT, better fallback (llama/23690) 2026-05-29 09:47:30 +03:00
forforever73 1c477d4056 metal : add apple device id (llama/23566)
Co-authored-by: lvyichen <lvyichen@stepfun.com>
2026-05-29 09:47:30 +03:00
Aman Gupta 205ee5a189 CUDA: add fast walsh-hadamard transform (llama/23615)
* CUDA: add fast walsh-hadamard transform

* review: add unrolls + change size_t -> int

* warp size 64

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 09:47:30 +03:00
Daniel Bevenius c932729a30
ci : add ignore for bindings/{ruby, go} in build.yml [no ci] (#3837)
This commit adds an ignore for bindings-ruby and bindings-go in
build.yml as these are handled by separate .yml file (separate jobs)
and don't need to trigger a full CI build.
2026-05-28 18:06:04 +02:00
Daniel Bevenius e47a3eeb04
ci : fix include paths for bindings-go job [no ci] (#3835) 2026-05-28 14:53:34 +02:00
Daniel Bevenius f41562bdd6
ci : add on push/pull_request paths ruby job (#3833)
* ci : add on push/pull_request paths ruby job

This commit adds paths to bindings-ruby to only build if changes where
made to bindings/ruby or to include/whisper.h.

* ci : add additional paths [no ci]
2026-05-28 14:41:48 +02:00
Daniel Bevenius 9186e2453b
ci : renable arm64 docker builds (#3832)
This commit re-enables the arm64 docker images builds which were removed
in Commit 9366544991
("ci : fix arm builds"). It also uses ubuntu-24.04-arm as the runner
which enables us to avoid QEMU.

Resolves: https://github.com/ggml-org/whisper.cpp/issues/2859
2026-05-28 12:09:13 +02:00
Daniel Bevenius f6e617bab7
ci : set GGML_NATIVE=OFF for bindings-java (#3830)
* ci : set GGML_NATIVE=OFF for bindings-java

This commit attempts to address an issue with the bindings-java job
which is currently failing.

I've not been able to reproduce this locally my windows machine and I
suspect that what might be happning is that windows job compiles on a
runner where it has different CPU features, for example AVX512 and when
this dll is used on a different runner that does not have that feature
it will crash.

Refs: https://github.com/ggml-org/whisper.cpp/actions/runs/26496174929/job/78059073255?pr=3829

* ci : also disable BMI2
2026-05-28 07:21:25 +02:00
Daniel Bevenius 6dcdd65364
ci : only run docker jobs when pushed to master [no ci] (#3828) 2026-05-27 08:46:23 +02:00
Daniel Bevenius ee540bf0be
docs : add AGENTS.md and CONTRIBUTING.md [no ci] (#3826)
* docs : add AGENTS.md and CONTRIBUTING.md [no ci]

This commit add AGENTS.md and CONTRIBUTING.md which are based on the
same files in llama.cpp. They have been modified slightly to fit with
whisper.cpp.

The motivation for this is to clarify the contribution policy in
whisper.cpp so that contributers can have a better understanding of the
expectations and requirements for contributing to the project.
2026-05-27 06:22:38 +02:00
texasich 27101c01dc
cli : merge tokens split across UTF-8 boundaries in JSON output (#3751)
* cli : merge tokens split across UTF-8 boundaries in JSON output

When a multi-byte UTF-8 codepoint (most commonly a CJK character, 3 bytes)
is split across multiple whisper tokens, the -ojf/--output-json-full
writer emitted each token's partial bytes as its own JSON string, producing
invalid UTF-8 that chokes downstream parsers.

Merge adjacent tokens in output_json whenever the accumulated text still
ends on an incomplete UTF-8 sequence. The merged entry keeps the first
token's id/p/t_dtw and extends t1 to the last absorbed token, which
matches how segment text is assembled elsewhere.

Refs #1798

* fix: address review — add braces for consistency, use full issue URL

- Add braces to if/else chain for codebase consistency
- Use full URL for issue #1798 reference

Review: @danbev

---------

Co-authored-by: texasich <texasich@users.noreply.github.com>
Co-authored-by: texasich <texasich@gmail.com>
2026-05-26 06:23:41 +02:00
Georgi Gerganov e0fd1f6787
release : v1.8.5 2026-05-25 13:06:33 +03:00
Georgi Gerganov c245b3ec23
benches : update 2026-05-25 13:05:30 +03:00
Georgi Gerganov f14ae77f40
sync : ggml 2026-05-25 12:44:07 +03:00
Georgi Gerganov 1cf8e3a903
ggml : bump version to 0.13.0 (ggml/1510) 2026-05-25 12:44:04 +03:00
Johannes Gäßler bcff515150
TP: fix ggml context size calculation (llama/22616)
* TP: fix ggml context size calculation, memory leak

* move split state cache back into the context

* revert to constant ggml context size for cgraphs

* increase headroom for statically allocated tensors

* remove obsolete include
2026-05-25 12:44:04 +03:00
Gilad S 2979e5f95f
ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (llama/22341)
* ggml: implement `gguf_init_from_buffer`

* test: `gguf_init_from_buffer`

* fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer

* fix: use `GGML_UNUSED`

Co-authored-by: Copilot <copilot@github.com>

* fix: remove `total_size` from `gguf_reader`

* fix: file offset calculation, rename `offset` to `data_offset`

Co-authored-by: Copilot <copilot@github.com>

* refactor: extract model loader bug fixes to another PR

* feat: add `gguf_init_from_callback`

* fix: always require a max expected size

* fix: change `gguf_reader_callback_t`'s `output` type to `void *`, change `max_expected_size` and offsets to `uint64_t`

* fix: harden against offset overflow in buffer read

* fix: remove seek behavior from the callback

* feat: `max_chunk_read == 0` means `SIZE_MAX`

* fix: seeking in a gguf file with no tensors

---------

Co-authored-by: Copilot <copilot@github.com>
2026-05-25 12:44:04 +03:00
Kaihui-AMD 44a50ca41a
readme : add AMD ROCm/HIP GPU build instructions (#3823)
Signed-off-by: Kaihui-AMD <Kaihui.Tang@amd.com>
2026-05-25 12:27:42 +03:00
Georgi Gerganov 865ec171aa talk-llama : sync llama.cpp 2026-05-25 12:26:07 +03:00
Georgi Gerganov 0a62a579cc sync : ggml 2026-05-25 12:26:07 +03:00
Georgi Gerganov 946d6813b9 ggml : bump version to 0.12.1 (ggml/1508) 2026-05-25 12:26:07 +03:00
Jeff Bolz a369b3949c ggml : Parallelize quant LUT init (llama/23595)
- Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl.
- Move the OpenMP detection from ggml-cpu to ggml-base.
- Update OpenMP dependencies in ggml-config.cmake.in.
2026-05-25 12:26:07 +03:00
Johannes Gäßler 3306af62b1 TP: fix entirely zero-sized slices per device (llama/23525) 2026-05-25 12:26:07 +03:00
shaofeiqi 1435988ab3 opencl: batch profiling to improve speed and prevent memory leaks (llama/23495) 2026-05-25 12:26:07 +03:00
Yiwei Shao b84d03487c hexagon: apply repl optimization in flash attn softmax as #22993 (llama/23455) 2026-05-25 12:26:07 +03:00
dskwe 511f8602b1 ggml : Check the right iface method before using the fallback 2d get (llama/23514) 2026-05-25 12:26:07 +03:00
Jeff Bolz 6b85d73b33 vulkan: fix windows find_package of SPIRV-Headers (llama/23215)
* vulkan: fix windows find_package of SPIRV-Headers

* not windows-only
2026-05-25 12:26:07 +03:00
Shawn Gu aefffa1fa5 opencl: generalize Adreno MoE kernels on M (llama/23449) 2026-05-25 12:26:07 +03:00
Alexey Kopytko b0c9f90059 SYCL: improve MoE prefill throughput (llama/23142)
- change `k_copy_src1_to_contiguous` so that uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a know starts and ends
- switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity
2026-05-25 12:26:07 +03:00
Alexey Kopytko 21c65a78a3 sycl : Level Zero detection in ggml_sycl_init (llama/23097)
* [SYCL] Centralize Level Zero detection in ggml_sycl_init

* use the same wording

* get back the warning
2026-05-25 12:26:07 +03:00
karavayev 0416feecfc SYCL : gated_delta_net K>1 (llama/23174)
* sycl_gated_delta_net K>1

* editor_config
2026-05-25 12:26:07 +03:00
Katostrofik 6fb7f1af2c SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (llama/21580)
* SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup

BF16 models had no dedicated token generation kernel — they fell through
to the generic full-GEMM path, resulting in ~14% memory bandwidth
utilization on Intel Arc GPUs. This adds BF16 support to the DMMV
(dequantize mul-mat-vec) path, matching the existing F16 implementation.

Fixes #20478

* SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0

The qk=1 kernel (used for F16 and BF16) iterates with stride
2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When
ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last
warp iteration accesses elements at col >= ncols, producing NaN for the
final row and wrong values for interior rows.

Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] %
(2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16
launcher to match. Quantized types use block-structured kernels with
different access patterns and keep the existing DMMV_X check.

Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70.
Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 12:26:07 +03:00
Sachin Sharma 2d629533a5 ggml-zendnn : add Q8_0 quantization support (llama/23414)
* ggml-zendnn : add Q8_0 quantization support

* ggml-zendnn : sync with latest ZenDNN

* ggml-zendnn : address review comments for Q8_0
2026-05-25 12:26:07 +03:00
Johannes Gäßler ec183556c6 CUDA: fix PDL CC check for JIT compilation (llama/23471) 2026-05-25 12:26:07 +03:00
Pascal 8402c36039 vulkan: fuse snake activation (mul, sin, sqr, mul, add) (llama/22855)
* vulkan: fuse snake activation (mul, sin, sqr, mul, add)

Add snake.comp shader with F32 / F16 / BF16 pipelines and
ggml_vk_snake_dispatch_fused. The matcher recognizes the naive 5 op
decomposition emitted by audio decoders (BigVGAN, Vocos) for snake
activation y = x + sin(a*x)^2 * inv_b and rewrites it to a single
elementwise kernel.

test_snake_fuse from the CUDA PR now also compares CPU naive vs
Vulkan fused across F32 / F16 / BF16.

* vulkan: address jeffbolznv review for fused snake activation

Rename T / C to ne0 / ne1 in the shader and push constants to match
the standard naming convention used across the Vulkan backend.

Tighten ggml_vk_can_fuse_snake: require x and dst to be contiguous
(the shader uses idx = i0 + i1 * ne0) and require a / inv_b to be
tightly packed on the broadcast dim (the shader reads data_a[i1]).

* vulkan: tighten snake fusion type checks for all operands (address jeffbolznv review)

* vulkan: reject snake fusion when ne[2] or ne[3] > 1 (address jeffbolznv review)

* vulkan: address 0cc4m review for fused snake activation

snake.comp is renamed to follow the ggml DATA_A_* / A_TYPE convention.
A_TYPE now applies to the activation tensor data_a instead of the
broadcast multiplier, and the bindings become data_a (A_TYPE), data_b
(float), data_c (float) and data_d (D_TYPE). A header at the top of
the shader maps each buffer to its role in y = x + sin(b * x)^2 * c.

On the C++ side, ggml_vk_can_fuse_snake reuses the existing snake_pattern
constant instead of duplicating the op list, sin_node is extracted as a
named local alongside the other chain nodes, and the broadcast operands
a and inv_b are now required to be GGML_TYPE_F32 to match the hardcoded
float bindings on data_b and data_c (the previous a->type == x->type
would silently reject any future BF16 or F16 chain once the supports_op
gate for SIN / SQR is lifted). ggml_vk_snake_dispatch_fused gets an
explicit GGML_TYPE_F32 case and GGML_ABORT on default in place of the
silent f32 fallback, and a stale comment about data_a[i1] / data_inv_b[i1]
is refreshed to match the new binding names.
2026-05-25 12:26:07 +03:00
Chen Yuan c436f1419f fix(flash-attn): replace f32 with kv_type and q_type (llama/23372) 2026-05-25 12:26:07 +03:00
Georgi Gerganov 158d93c836 metal : optimize concat kernel and fix set kernel threads (llama/23411)
* metal : fix GGML_OP_SET kernel threads

* tests : extend test_cpy to support different src/dst shapes

Extend test_cpy to support different source and destination tensor shapes
for CPY operations (reshaping), where the total number of elements must match.

- Renamed ne -> ne_src, added ne_dst parameter (default: use src shape)
- Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions
- Tests exercise 1024 boundary, small shapes, and large dimensionality changes
- Fixed dangling reference bug (storing & to temporary std::array)
- Updated all existing test calls with permute/transpose args for compatibility

Assisted-by: llama.cpp:local pi

* metal : optimize concat kernel with row batching for small widths

When ne0 < 256, batch multiple rows into a single threadgroup to improve
occupancy. This avoids underutilizing the GPU when processing narrow tensors.

- Dispatch nth = min(256, ne0) threads per group
- Calculate nrptg (rows per threadgroup) to fill up to 256 threads
- Update kernel index calculation to handle the row batching
- Add boundary check for i1 >= ne1

Assisted-by: llama.cpp:local pi

* tests : clean-up

* tests : refactor CPY shape tests to use dimension permutations

Replace 75 hardcoded test cases with a loop over permutations of
{3, 5, 7, 32} (total elements: 3360). Each src permutation is tested
against canonical sorted and reverse dst, skipping identical shapes.
Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32).

Assisted-by: llama.cpp:local pi
2026-05-25 12:26:07 +03:00
Matt Corallo 03da9f17f4 ggml : Check the right iface method before using the fallback 2d get (llama/23306)
Probably no backends implement only one of 2d get/set, but this
might be annoying for some future backend developer trying to add
2d get/set.
2026-05-25 12:26:07 +03:00
Todor Boinovski 6d1d66de40 hexagon: ssm-conv fix for large prompts (llama/23307)
* hexagon: remove gathers and better handling of vtcm in ssm-conv

* hexagon: relax ssm-conv gating requirements

* hexagon: add new prefill ssm-conv backend test

* hexagon: remove trailing white space

* hex-rope: uninline rope_cache_init, otherwise it breaks after rebaseing with SSM_CONV changes

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-25 12:26:07 +03:00
lhez 896718eacf opencl: refactor backend initilization (llama/23318)
* opencl: refactor initialization

* opencl: refactor GPU identification

* opencl: rename for consistency

* opencl: cache global mem size in dev_ctx

* opencl: adjust log level

* opencl: load argsort and flash_attn kernels in supports_op

* argsort kernel must be built for supports_op for querying the max
  workgroups
* flash_attn kernel has many variants, only load them when needed
2026-05-25 12:26:07 +03:00
Daniele ad717a6de0 vulkan: optimize operations in the IM2COL shader (llama/22685)
* vulkan: optimize operations in the IM2COL shader

* Add comments and improve the code formatting
2026-05-25 12:26:07 +03:00
Max Krasnyansky b93a5ba605 hexagon: HMX quantized matmul rework (llama/23368)
* hmx-mm: update debug logging in hmx-mm

* hmx-mm: update dequant logic to use HVX_vector_x2/4

* hmx-mm: remove non-pipelined version of the quantize matmul

It seems that we don't reall need non-pipelined version

* hmx-mm: use activation depth mode and update naming

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>

* hex-mm: minor hmx matmul naming updates

* hmx-mm: remove unused vars

* snapdragon: scripts bump default ubatch-size to 1K

* hexagon: combine HMX and power and clock settings into a single set_power call

* hmx-mm: remove leftover of the scale repl helper

* hexagon: fix editconf error

---------

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
2026-05-25 12:26:07 +03:00
Andreas Kieslinger 3fa19558f2 Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)
* Adds initial PDL setup.

* Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst.

* Further optimization pass of the first half of kernels

* Optimized PDL barriers for the second batch of kernels

* Further refinements after rebase.

* Moves pdl logic to separate function, removes some whitespace

* Strips post-hoc PDL logic

* Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to
overlap execution with previous kernels

* Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL

* Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL

* Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx,
to enable hip/musa compatibility

* Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32

* Enrolls flash_attn_combine_results

* Fix: Drops needless and broken check of CUDA arch for PDL. PDL either
works or is without effect.

* Enrolls flash-attention kernels to pdl

* Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for
kernels args. This fixes PDL.

* Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via
and template alias and template expansion

* Enrolls all remaining kernels for qwen3-coder-next into PDL

* Remove all PDL LC calls to create a baseline

* Added LC according to internal guidance and tested kernel performance.

* Enrols missing qwen3-5 kernels passively into PDL.

* Kernel optimizations (LC signals) for qwen3.5

* Enrolls ssm-scan kernels into PDL

* Adds GGML_CUDA_PDL command line option to toggle PDL.

* Fix: Ada and lower compilation by guarding PDL calls correctly

* Cleanup: Removes commented out GGML_CUDA_PDL_LC

* Cleanup: Removes experimental comments

* Adds 90-virtual to build script so that Hopper GPUs can leverage PDL.

* Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL.

* Fix: Correct PDL en/disablement based on device-side arch check. Host
side check is UB. Required moving from macros to inlined functions

* Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1

* Enable PDL by default for Hopper+ devices

* Enrolls softcap_f32 and two flash_attn kernels into PDL.

* Improves flash attn PDL barrier placement

* Fix: Perf regression on ada; excludes ada and below from PDL launches

* Improves some sync barrier placements

* Drops superfluous constructor

* Adds #endif guard comments

* Reverts experimental change to top-k-moe.cu, which moved expensive allocations
in front of the PDL barrier. It did not have a meaningful impact.

* Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0
PDL is disabled

* Revert "Drops superfluous constructor". Adds const to remaining
arguments

This reverts commit 12b1d250da0089ae02a9bb71bbb3fd6d70f6f2f1.

* Cleanup: Removes and fixes some comments and whitespace

* Clarifies comment of sync-barrier position

* Relocates and refactors PDL launch functions and accessories

* Adds error checking to the regular kernel launch path

* Drops "auto" in favor of "ggml_cuda_kernel_params"

* Adds "const" to ggml_cuda_kernel_launch_params

* [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy
2026-05-25 12:26:07 +03:00
Georgi Gerganov c58fc465df metal : optimize pad + cpy (llama/23354)
* metal : optimize pad

* metal : optinmize cpy

* cont : better row packing in threadgroup
2026-05-25 12:26:07 +03:00
ravel7524 0a0a34287e ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (llama/23349) 2026-05-25 12:26:07 +03:00
shaofeiqi 37f17208c2 opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (llama/23303)
* opencl: add q4_k moe support

* opencl: add q5_k moe support

* opencl: add q6_k moe support

* opencl: adjust format

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-05-25 12:26:07 +03:00
Aparna M P aca63e7638 hexagon: add MROPE and IMROPE support in HTP rope op (llama/23317) 2026-05-25 12:26:07 +03:00
Aparna M P 459ff0707b hexagon: enable support for NORM op (llama/23319) 2026-05-25 12:26:07 +03:00
Reese Levine 6090f39f36 ggml-webgpu : extend GDN for K>1 (llama/23299) 2026-05-25 12:26:07 +03:00