Commit Graph

4596 Commits

Author SHA1 Message Date
danscMax 610e664ba7
whisper : catch C++ exceptions in whisper_init_with_params_no_state (#3831)
whisper_model_load() can throw instead of returning false: std::runtime_error
from this file (failed ggml context / no compatible buffer type), or
vk::SystemError / vk::OutOfDeviceMemoryError from the ggml-vulkan backend during
device/buffer allocation.

whisper_init_* are extern "C", so a C++ exception unwinding across that boundary
aborts non-C++ callers (Rust via whisper-rs, Go via cgo), on Windows with
STATUS_STACK_BUFFER_OVERRUN (0xC0000409), even though the function already
returns NULL on failure. Wrap whisper_model_load() in try/catch and route any
throw into the existing NULL-return path.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 13:25:29 +02:00
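
The exception-containment pattern described in commit 610e664ba7 can be sketched roughly as follows. This is a minimal illustration, not the actual whisper.cpp change: `demo_context`, `demo_model_load`, `demo_init`, and `demo_free` are hypothetical stand-ins, and the `simulate_throw` flag exists only to trigger the failure path.

```cpp
#include <cassert>
#include <cstdio>
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for a context type; not the real whisper.cpp struct.
struct demo_context {
    int loaded;
};

// Hypothetical loader that may throw instead of returning false,
// mirroring the behavior the commit message attributes to whisper_model_load().
static bool demo_model_load(demo_context & ctx, bool simulate_throw) {
    if (simulate_throw) {
        throw std::runtime_error("failed to allocate backend buffer");
    }
    ctx.loaded = 1;
    return true;
}

// C-linkage entry point: no exception is allowed to escape this function.
extern "C" demo_context * demo_init(int simulate_throw) {
    demo_context * ctx = nullptr;
    try {
        ctx = new demo_context{0};
        if (demo_model_load(*ctx, simulate_throw != 0)) {
            return ctx;
        }
    } catch (const std::exception & e) {
        std::fprintf(stderr, "demo_init: exception during model load: %s\n", e.what());
    } catch (...) {
        std::fprintf(stderr, "demo_init: unknown exception during model load\n");
    }
    // Both a false return and a throw end up on the same NULL-return path.
    delete ctx; // deleting nullptr is a no-op
    return nullptr;
}

extern "C" void demo_free(demo_context * ctx) {
    delete ctx;
}
```

A caller in C, Rust, or Go then only ever sees a valid pointer or NULL, which is the contract the commit message says the function already advertises.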
Noah Lyons e5d4412578
server : merge split utf-8 token text in verbose json (#3850) 2026-06-02 13:10:27 +02:00
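
The log does not show how commit e5d4412578 merges split token text, so the following is only a sketch of one plausible approach, assuming token pieces arrive as raw byte strings. The helper names `ends_with_incomplete_utf8` and `merge_split_utf8` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if s ends in the middle of a multi-byte UTF-8 sequence.
static bool ends_with_incomplete_utf8(const std::string & s) {
    // Walk back over at most 3 continuation bytes (10xxxxxx) to reach the lead byte.
    size_t i = s.size();
    size_t n_cont = 0;
    while (i > 0 && n_cont < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++n_cont;
    }
    if (i == 0) {
        return false; // empty, or only continuation bytes: nothing to wait for
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    size_t need = 0; // continuation bytes the lead byte announces
    if      ((lead & 0x80) == 0x00) need = 0; // ASCII
    else if ((lead & 0xE0) == 0xC0) need = 1;
    else if ((lead & 0xF0) == 0xE0) need = 2;
    else if ((lead & 0xF8) == 0xF0) need = 3;
    else return false; // invalid byte: do not hold the text back
    return n_cont < need;
}

// Concatenates adjacent pieces until the accumulated text is complete UTF-8.
static std::vector<std::string> merge_split_utf8(const std::vector<std::string> & pieces) {
    std::vector<std::string> out;
    std::string pending;
    for (const std::string & p : pieces) {
        pending += p;
        if (!ends_with_incomplete_utf8(pending)) {
            out.push_back(pending);
            pending.clear();
        }
    }
    if (!pending.empty()) {
        out.push_back(pending); // flush a dangling partial sequence unchanged
    }
    return out;
}
```

Emitting only complete sequences matters for JSON output because a string containing half of a multi-byte character is not valid UTF-8.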
Patrice Levesque ef24de1e58
cmake : do not assume /usr/lib library installation. (#3693)
Current `pkgconfig` configuration file installation path and its
contents assume libraries are installed under `/usr/lib` and this is not
always the case, for instance `/usr/lib64` is quite possible under
Gentoo Linux.

Thus use the `CMAKE_INSTALL_LIBDIR` variable instead of a hardcoded
`lib`.
2026-06-02 09:22:16 +02:00
Georgi Gerganov 23ee03506a
release : v1.8.6 2026-06-01 14:56:20 +03:00
Daniel Bevenius 0dff27498f
ci : fix path to whisper.h in examples.yml [no ci] (#3842)
This commit updates the include path to whisper.h and also ensures that
this is only built on pushes to master.
2026-06-01 07:20:19 +02:00
Georgi Gerganov fe69461618
ci : fix self-hosted paths to mnt 2026-05-31 16:06:32 +03:00
Georgi Gerganov 099af1c67d
pi : add config
[no ci]
2026-05-31 16:04:12 +03:00
Georgi Gerganov 2e045a967b
ci : remove obsolete self-hosted label 2026-05-31 15:49:14 +03:00
Georgi Gerganov 6c343e7a4e
common : pass sample rate to `ffmpeg_decode_audio()` 2026-05-31 15:49:13 +03:00
Georgi Gerganov f39cc71282
common : re-implement `ffmpeg-transcode.cpp` + clarify ffmpeg usage (#3846)
* examples : remove ffmpeg-transcode.cpp

* examples : implement ffmpeg-transcode.cpp

Assisted-by: llama.cpp:local pi

* common : switch from WHISPER_FFMPEG -> WHISPER_COMMON_FFMPEG
2026-05-31 15:44:07 +03:00
Georgi Gerganov f24588a272 sync : ggml 2026-05-29 09:47:30 +03:00
Georgi Gerganov 92fc3f2a58 ggml : bump version to 0.13.1 (ggml/1523) 2026-05-29 09:47:30 +03:00
Georgi Gerganov 5828fba79f talk-llama : sync llama.cpp 2026-05-29 09:47:30 +03:00
Georgi Gerganov cc65eb1816 sync : ggml 2026-05-29 09:47:30 +03:00
Andreas Kieslinger e90501e179 cuda : disables launch_fattn PDL enrollment due to compiler bug (llama/23825) 2026-05-29 09:47:30 +03:00
Matt Corallo f1b687da28 meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (llama/23480)
Without this at least the vulkan backend will skip the `* 0` for
!COMPUTE tensors, causing corrupt output.
2026-05-29 09:47:30 +03:00
Max Krasnyansky 442be1789d hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (llama/23835)
Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
2026-05-29 09:47:30 +03:00
lhez 94922ce12c opencl: move backend info printing into its own function (llama/23702)
* opencl: move backend info print into its own function

* opencl: move new log line

* opencl: fix for non adreno path
2026-05-29 09:47:30 +03:00
fl0rianr e1faa7cb4d ggml: auto apply iGPU flag CUDA/HIP if integrated device (llama/23007) 2026-05-29 09:47:30 +03:00
redfox 4e8af441e5 mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729)
* mmvq Optim:  add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING

* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 09:47:30 +03:00
Jaden_Mach 04795e6272 CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (llama/23227)
* CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware

The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8)
to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled
GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs
substantially by quant family because the per-row GEMV cost is dominated
by dequantisation, not the dot-product itself: K-quants pay a heavier
super-block decode and so MMQ wins sooner; legacy and IQ quants have
lean decode and stay ahead until the batch fully populates an MFMA tile.

This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool,
mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant
thresholds on amd_mfma_available(cc):

  Q3_K, Q4_K, Q5_K  : MMVQ <= 3   (MMQ wins from batch=4: +5% .. +76%)
  Q2_K, Q6_K        : MMVQ <= 5   (MMQ wins from batch=6: +8% .. +35%)
  others            : MMVQ <= 8   (legacy & IQ regress under MMQ; unchanged)

Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical
to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold
for A/B testing.

Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct,
llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps.
Full table in PR description.

  Selected pp512 throughput (tok/s, ub=8):
    Q4_K_S:  559 -> 940  (+68%)
    Q5_K_S:  503 -> 884  (+76%)
    Q3_K_S:  629 -> 879  (+40%)
    Q2_K  :  615 -> 809  (+32%)
    Q6_K  :  582 -> 776  (+33%)

  Selected pp512 throughput (tok/s, ub=4):
    Q4_K_S:  444 -> 480  (+ 8%)
    Q4_0  :  682 -> 685  (+ 0%)   (no regression - retains MMVQ)
    IQ4_XS:  706 -> 698  (- 1%)   (no regression - retains MMVQ)

* CUDA: address review — inline MMVQ batch table, drop env hatch & doc block

* tune kernel selection logic for CDNA1

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 09:47:30 +03:00
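
The per-quant threshold table from the first bullet of commit 04795e6272 can be sketched as below. This is an illustration under stated assumptions: the `quant_type` enum is a hypothetical subset, the `amd_mfma` flag stands in for the `amd_mfma_available(cc)` check, and the later review bullets (inlined table, CDNA1 tuning) mean the merged code may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of quantization types, for illustration only.
enum class quant_type { Q4_0, IQ4_XS, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K };

// Largest batch size (ne11) for which MMVQ is still preferred.
// Values follow the table in the commit message; amd_mfma is an assumed
// simplification of the compute-capability check.
constexpr int mmvq_max_batch(quant_type type, bool amd_mfma) {
    if (!amd_mfma) {
        return 8; // non-AMD-MFMA paths keep the global threshold
    }
    switch (type) {
        case quant_type::Q3_K:
        case quant_type::Q4_K:
        case quant_type::Q5_K:
            return 3; // MMQ wins from batch=4
        case quant_type::Q2_K:
        case quant_type::Q6_K:
            return 5; // MMQ wins from batch=6
        default:
            return 8; // legacy and IQ quants regress under MMQ
    }
}

// True when the per-row GEMV kernel (MMVQ) should be used for this batch size.
constexpr bool should_use_mmvq(quant_type type, bool amd_mfma, int64_t ne11) {
    return ne11 <= mmvq_max_batch(type, amd_mfma);
}

// Compile-time spot checks against the table.
static_assert( should_use_mmvq(quant_type::Q4_K, true, 3), "Q4_K keeps MMVQ at batch 3");
static_assert(!should_use_mmvq(quant_type::Q4_K, true, 4), "Q4_K switches to MMQ at batch 4");
```

Keeping the threshold in a single lookup makes it easy to see that only the K-quant rows change relative to the global value of 8.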
Max Krasnyansky 1b241b879c hexagon: minor refresh for HMX FA and MM (llama/23796)
* hex-fa: clean up qf32/fp32 handling and stride handling

* hex-fa: fix corner-case fp NaN issues that were causing bad output from gemma4 on v79

* hex-fa: vectorize leftover handling

* hex-fa: avoid HVX fallback during token gen, since HMX has more FP16 compute capacity

* hmx-mm: remove dead code

* hmx-mm: use fastdiv in x4x2 dequant

* hmx-mm: sandwich dequant and scatter to improve perf

* hmx-mm: fixed rebase conflicts

* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv

* hmx-mm: an even earlier dispatch for per-type dequant

* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

This is a bit faster than LUT.

* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-05-29 09:47:30 +03:00
Jeff Bolz b896e91f18 vulkan: fast path for walsh-hadamard transform (llama/23687)
* vulkan: fast path for walsh-hadamard transform

* disable for intel due to segfault
2026-05-29 09:47:30 +03:00
Winston Ma 816c3029bc vulkan: fix wrong index variable in inner loop (llama/23665) 2026-05-29 09:47:30 +03:00
Winston Ma 5db94bac04 vulkan: Fix memory logger unsafe iterator access (llama/23667) 2026-05-29 09:47:30 +03:00
fairydreaming 60e420ff6a cuda : fix KQ mask offset integer overflow in fattn MMA kernel (llama/23610)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-29 09:47:30 +03:00
Martin Klacer 8e40325876 ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (llama/22841)
* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16

Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
2026-05-29 09:47:30 +03:00
ymcki d284e1c3aa Hexagon: OP_GATED_DELTA_NET K>1 support (llama/23531)
* K>1 state snapshot support

* removed picky indent multiple of 4 fixes
2026-05-29 09:47:30 +03:00
ymcki 7e843a80e1 opencl: OP_GATED_DELTA_NET (llama/23312)
* OP_GATED_DELTA_NET impl

* add back lanes_per_column declaration

* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce

* removed trailing spaces and fixed indentation. Hard-coded subgroup size for Adreno and Intel. Return not supported for K>1 state snapshot

* support for K>1 state snapshot

* removed picky indent multiple of 4 fixes

* removed return that won't be executed
2026-05-29 09:47:30 +03:00
Reese Levine 8c8f213dac ggml-webgpu: remove legacy constants (llama/23672) 2026-05-29 09:47:30 +03:00
Max Krasnyansky 3bbe93378c hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (llama/23647)
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now

* hmx-mm: add support for Q4_1

* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot

* hexagon: fix repack scratch buffer overflow

* hex-mm: fix Q4_1 repack buffer sizing

* hexagon: flip the build order for mm and fa (seems to help LTO)

* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1

* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output

* hexagon: resurrect early-wake and add support for polling for op-batch completions

With Q4_1, ggml-hexagon now claims pretty much the entire graph, which gives the CPU more time to chillax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover some of that latency in normal runs, and op-batch polling is just for benchmarking.

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2026-05-29 09:47:30 +03:00
Masashi Yoshimura a52bd385d6 ggml-webgpu: Fix how to dispatch WG to some ops (llama/23750) 2026-05-29 09:47:30 +03:00
Matt Corallo 8bce478ee8 vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (llama/22887)
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32

Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

Note that this breaks some tests until the last commit which fixes
OOB A reads.

* vulkan: Use aligned loads in mul_mat_vec when available

Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec

Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.

Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes

There was a TODO to fix the OOB reads from the A matrix which we do
here.

It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
2026-05-29 09:47:30 +03:00
Jeff Bolz 1b590bbb9a vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (llama/23541) 2026-05-29 09:47:30 +03:00
l8bloom c5cde8c717 vulkan: add REPEAT op support for f16 to f16. (llama/23298)
* feat: extend repeat op for vulkan

* feat: add repeat_f16 vulkan pipeline

* fix: ensure same dst and src types

* fix: use type_size instead of data types

* fix: use int16 and int32 for repeat shader op

* chore: rename repeat_f* to repeat_i*

* chore: rename repeat vulkan pipelines
2026-05-29 09:47:30 +03:00
Oliver Simons 98c6722fec CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (llama/23742) 2026-05-29 09:47:30 +03:00
Winston Ma 80e87ec453 vulkan: avoid preferring transfer queue on AMD UMA devices (llama/22455) 2026-05-29 09:47:30 +03:00
Vladislav 6a249cd640 ggml-zendnn : fixed naming of matmul function (llama/20964)
* ggml-zendnn: fixed naming of matmul function

* ggml-zendnn: fixed naming of mul_mat_id function

* ggml-zendnn: fixed print in  mul_mat_id

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
2026-05-29 09:47:30 +03:00
Jeff Bolz a0efd13f0f vulkan: optimize conv2d and implement coopmat1 support (llama/22620)
* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d

* vulkan: skip conv2d bounds checks when shapes align with tile sizes

* vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d

* vulkan: stage cm2 conv2d accumulator through shmem before global store

* vulkan: add coopmat1 conv2d path

* fallback when using too much shared memory. clean up comments

* Require 16x16x16 and subgroup size 32 or 64

* check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values
2026-05-29 09:47:30 +03:00
Max Krasnyansky f8df28d331 hexagon: add support for CONCAT op (llama/23648)
* hexagon: add support for CONCAT with optimized concat_2d_transposed

qwen3.5 models are quite heavy on the CONCAT with large and transposed src1.

* hex-concat: use fastdiv in generic version

* hex-concat: make checks for transposed a bit more readable

* hex-concat: reorder dma ops for better pipelining

* hex-cont/cpy: optimize CPY and CONT ops

The primary change is to avoid scalar divs in the inner loops.
We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr.
This causes runtime divs by that value which is normally just 4 or 2 (f32/f16).

* hex-get-rows: optimize GET_ROWS for large rows

We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models
that do lots of GET_ROWS with huge (2MB+) rows.

Also bump the DMA queue depth now that we can take advantage of it.

* hex-concat: unroll the inner loops of concat_2d

* hex-concat: more updates to concat_2d to improve perf a bit further

* hex-cpy: fixed n_rows per thread checks in the copy ops

* hmx-fa: fix alignment issues while computing dma sizes

* hex-set-rows: add early returns for idle threads

* hvx-rope: minor optimization to replace loops with fastdiv logic

* hex-rope: replace scalar tail processing with HVX

* hex-rope: optimize rope cache init with HVX

Add hvx-utils sin/cos helpers that use an approximate method (similar to rsqrt, inverse, etc.).
Use the helpers to optimize ROPE.
2026-05-29 09:47:30 +03:00
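
The "avoid scalar divs in the inner loops" point from commit f8df28d331 can be illustrated with a small sketch. It assumes one common technique (dispatching once to a version specialized on a compile-time element size) and is not taken from the Hexagon sources. All function names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Element size is a template parameter, so the division is by a compile-time
// constant; for powers of two compilers typically lower this to a shift.
template <size_t TypeSize>
static size_t count_elems_fixed(size_t n_bytes) {
    return n_bytes / TypeSize;
}

// Generic fallback: the divisor is only known at run time,
// so a real division instruction is emitted.
static size_t count_elems_generic(size_t n_bytes, size_t type_size) {
    return n_bytes / type_size;
}

// Dispatch once, outside any inner loop, on the sizes that normally occur
// (4 for f32, 2 for f16, as the commit message notes).
static size_t count_elems(size_t n_bytes, size_t type_size) {
    switch (type_size) {
        case 4:  return count_elems_fixed<4>(n_bytes);
        case 2:  return count_elems_fixed<2>(n_bytes);
        default: return count_elems_generic(n_bytes, type_size);
    }
}
```

The design choice is to pay for one branch per call instead of one division per element, which matters most on cores where integer division is slow.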
Alexey Kopytko 049f0af339 SYCL: implement ggml_sycl_pool_vmm (llama/22862)
* SYCL: implement ggml_sycl_pool_vmm

* Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM

* Clean up debugging logging

* document GGML_SYCL_DISABLE_VMM

* Multi-stream MoE optimization

* Revert "Multi-stream MoE optimization"

This reverts commit 938929c3f13a562ec67c59e87cc5d38595444cce.

* Update common.hpp

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM

* add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro)

* Apply suggestions from code review

Co-authored-by: Alexey Kopytko <alexey@kopytko.com>

* Apply suggestion from @sanmai

* Apply suggestion from @sanmai

---------

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2026-05-29 09:47:30 +03:00
Masashi Yoshimura 00a5110b19 ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (llama/23594)
* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K

* Fix to make the editorconfig check pass

* Remove mul-mat-legacy pipeline

* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
2026-05-29 09:47:30 +03:00
Nikhil Jain bc77933c2d Check batch_compute_passes before sending passes when not doing GPU profiling (llama/23457)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled

* restore build.yml
2026-05-29 09:47:30 +03:00
Johannes Gäßler 2307712d32 CUDA: missing PDL sync for FWHT, better fallback (llama/23690) 2026-05-29 09:47:30 +03:00
forforever73 1c477d4056 metal : add apple device id (llama/23566)
Co-authored-by: lvyichen <lvyichen@stepfun.com>
2026-05-29 09:47:30 +03:00
Aman Gupta 205ee5a189 CUDA: add fast walsh-hadamard transform (llama/23615)
* CUDA: add fast walsh-hadamard transform

* review: add unrolls + change size_t -> int

* warp size 64

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 09:47:30 +03:00
Daniel Bevenius c932729a30
ci : add ignore for bindings/{ruby, go} in build.yml [no ci] (#3837)
This commit adds an ignore for bindings-ruby and bindings-go in
build.yml, as these are handled by separate .yml files (separate jobs)
and don't need to trigger a full CI build.
2026-05-28 18:06:04 +02:00
Daniel Bevenius e47a3eeb04
ci : fix include paths for bindings-go job [no ci] (#3835) 2026-05-28 14:53:34 +02:00
Daniel Bevenius f41562bdd6
ci : add on push/pull_request paths ruby job (#3833)
* ci : add on push/pull_request paths ruby job

This commit adds paths to bindings-ruby to only build if changes were
made to bindings/ruby or to include/whisper.h.

* ci : add additional paths [no ci]
2026-05-28 14:41:48 +02:00
Daniel Bevenius 9186e2453b
ci : re-enable arm64 docker builds (#3832)
This commit re-enables the arm64 docker image builds which were removed
in commit 9366544991 ("ci : fix arm builds"). It also uses
ubuntu-24.04-arm as the runner, which enables us to avoid QEMU.

Resolves: https://github.com/ggml-org/whisper.cpp/issues/2859
2026-05-28 12:09:13 +02:00