Commit Graph

4515 Commits

Author SHA1 Message Date
Todor Boinovski 6d1d66de40 hexagon: ssm-conv fix for large prompts (llama/23307)
* hexagon: remove gathers and better handling of vtcm in ssm-conv

* hexagon: relax ssm-conv gating requirements

* hexagon: add new prefill ssm-conv backend test

* hexagon: remove trailing white space

* hex-rope: uninline rope_cache_init, otherwise it breaks after rebaseing with SSM_CONV changes

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-25 12:26:07 +03:00
lhez 896718eacf opencl: refactor backend initilization (llama/23318)
* opencl: refactor initialization

* opencl: refactor GPU identification

* opencl: rename for consistency

* opencl: cache global mem size in dev_ctx

* opencl: adjust log level

* opencl: load argsort and flash_attn kernels in supports_op

* argsort kernel must be built for supports_op for querying the max
  workgroups
* flash_attn kernel has many variants, only load them when needed
2026-05-25 12:26:07 +03:00
Daniele ad717a6de0 vulkan: optimize operations in the IM2COL shader (llama/22685)
* vulkan: optimize operations in the IM2COL shader

* Add comments and improve the code formatting
2026-05-25 12:26:07 +03:00
Max Krasnyansky b93a5ba605 hexagon: HMX quantized matmul rework (llama/23368)
* hmx-mm: update debug logging in hmx-mm

* hmx-mm: update dequant logic to use HVX_vector_x2/4

* hmx-mm: remove non-pipelined version of the quantize matmul

It seems that we don't reall need non-pipelined version

* hmx-mm: use activation depth mode and update naming

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>

* hex-mm: minor hmx matmul naming updates

* hmx-mm: remove unused vars

* snapdragon: scripts bump default ubatch-size to 1K

* hexagon: combine HMX and power and clock settings into a single set_power call

* hmx-mm: remove leftover of the scale repl helper

* hexagon: fix editconf error

---------

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
2026-05-25 12:26:07 +03:00
Andreas Kieslinger 3fa19558f2 Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)
* Adds initial PDL setup.

* Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst.

* Further optimization pass of the first half of kernels

* Optimized PDL barriers for the second batch of kernels

* Further refinements after rebase.

* Moves pdl logic to separate function, removes some whitespace

* Strips post-hoc PDL logic

* Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to
overlap execution with previous kernels

* Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL

* Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL

* Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx,
to enable hip/musa compatibility

* Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32

* Enrolls flash_attn_combine_results

* Fix: Drops needless and broken check of CUDA arch for PDL. PDL either
works or is without effect.

* Enrolls flash-attention kernels to pdl

* Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for
kernels args. This fixes PDL.

* Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via
and template alias and template expansion

* Enrolls all remaining kernels for qwen3-coder-next into PDL

* Remove all PDL LC calls to create a baseline

* Added LC according to internal guidance and tested kernel performance.

* Enrols missing qwen3-5 kernels passively into PDL.

* Kernel optimizations (LC signals) for qwen3.5

* Enrolls ssm-scan kernels into PDL

* Adds GGML_CUDA_PDL command line option to toggle PDL.

* Fix: Ada and lower compilation by guarding PDL calls correctly

* Cleanup: Removes commented out GGML_CUDA_PDL_LC

* Cleanup: Removes experimental comments

* Adds 90-virtual to build script so that Hopper GPUs can leverage PDL.

* Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL.

* Fix: Correct PDL en/disablement based on device-side arch check. Host
side check is UB. Required moving from macros to inlined functions

* Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1

* Enable PDL by default for Hopper+ devices

* Enrolls softcap_f32 and two flash_attn kernels into PDL.

* Improves flash attn PDL barrier placement

* Fix: Perf regression on ada; excludes ada and below from PDL launches

* Improves some sync barrier placements

* Drops superfluous constructor

* Adds #endif guard comments

* Reverts experimental change to top-k-moe.cu, which moved expensive allocations
in front of the PDL barrier. It did not have a meaningful impact.

* Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0
PDL is disabled

* Revert "Drops superfluous constructor". Adds const to remaining
arguments

This reverts commit 12b1d250da0089ae02a9bb71bbb3fd6d70f6f2f1.

* Cleanup: Removes and fixes some comments and whitespace

* Clarifies comment of sync-barrier position

* Relocates and refactors PDL launch functions and accessories

* Adds error checking to the regular kernel launch path

* Drops "auto" in favor of "ggml_cuda_kernel_params"

* Adds "const" to ggml_cuda_kernel_launch_params

* [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy
2026-05-25 12:26:07 +03:00
Georgi Gerganov c58fc465df metal : optimize pad + cpy (llama/23354)
* metal : optimize pad

* metal : optinmize cpy

* cont : better row packing in threadgroup
2026-05-25 12:26:07 +03:00
ravel7524 0a0a34287e ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (llama/23349) 2026-05-25 12:26:07 +03:00
shaofeiqi 37f17208c2 opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (llama/23303)
* opencl: add q4_k moe support

* opencl: add q5_k moe support

* opencl: add q6_k moe support

* opencl: adjust format

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-05-25 12:26:07 +03:00
Aparna M P aca63e7638 hexagon: add MROPE and IMROPE support in HTP rope op (llama/23317) 2026-05-25 12:26:07 +03:00
Aparna M P 459ff0707b hexagon: enable support for NORM op (llama/23319) 2026-05-25 12:26:07 +03:00
Reese Levine 6090f39f36 ggml-webgpu : extend GDN for K>1 (llama/23299) 2026-05-25 12:26:07 +03:00
Intel AI Get-to Market Customer Success and Solutions 28edd0cb36 sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (llama/22153)
* sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Use async mem ops for correctness when SYCL graphs are explicitly on.

Signed-off-by: Tao, Chun <chun.tao@intel.com>

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Tao, Chun <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-25 12:26:07 +03:00
Radoslav Gerganov 3477fdb2e3 rpc : keep last_graph_uid in the device context (llama/23273)
With the introduction of MTP we can have multiple compute contexts for
the same RPC device. In this case last_graph_uid is not updated properly
when contexts are being switched. This patch fixes this by moving
last_graph_uid to the device context, making sure it is always updated.

closes: #23242
2026-05-25 12:26:07 +03:00
Pranav Dhinakar eb558f23cb hexagon: add support for TRI op (llama/22822)
* Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context

* addressed PR review comments for TRI op

* hexagon: clang format

* hex-unary: remove merge conflict markers

* hex-ggml: remove duplicate op cases (merge conflict)

* hex-ggml: fix editor config errors

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-25 12:26:07 +03:00
Pranav Dhinakar 0a11c9fe83 ggml-hexagon: add PAD op HVX kernel (llama/23078)
* ggml-hexagon: add PAD op HVX kernel

Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized
kernels. Supports zero-padding and circular padding across all 4 tensor
dimensions.

* hex-ggml: remove duplicate op cases (merge conflict)

* hex-pad: fix editorconfig checks and macro alignment

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-25 12:26:07 +03:00
Intel AI Get-to Market Customer Success and Solutions c65b082c94 sycl: scalar SWAR byte-subtract in Q6_K MMVQ dot product (llama/22156)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-25 12:26:07 +03:00
Intel AI Get-to Market Customer Success and Solutions 619262ad24 sycl: route small f32 matmuls to oneMKL, bypass oneDNN (llama/22150)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
2026-05-25 12:26:07 +03:00
Gabe Goodhart 4fb3ccabd3 feat: Support d_conv=15 for ssm-conv.cu (llama/23017)
Branch: ModalityConditionalAdapters
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2026-05-25 12:26:07 +03:00
Oliver Simons 53736a3f0e CUDA: Continue directly including cuda/iterator (llama/23102)
Cont of #22936, forgot to update one site
2026-05-25 12:26:07 +03:00
Jan Ekström 9e96e0eaf1 ggml-vulkan/CMakeLists: add a check for SPIRV-Headers (llama/22009)
* ci/run: set explicit SPIR-V Headers search path for macOS vulkan CI

For whatever reason, the files are under additional sub-path
`vulkan/` under the cmake directory, which does not match either
current LunarG macOS Vulkan SDK structure (`lib/cmake/SPIRV-Headers`),
nor what gets installed when you run the cmake build+install for
SPIRV-Headers itself on at least Linux (`share/cmake/SPIRV-Headers`).

This allows for SPIRV-Headers to be found, as currently the CI
runner's setup does not seem to include the relevant path in
list of search locations.

* ggml-vulkan/CMakeLists: add a check for SPIRV-Headers

This is installed by the project if it is built and installed.
Receiving an error during the configuration step is generally
preferred to receiving an error in the middle of a build.
2026-05-25 12:26:07 +03:00
Pascal 50482cbd22 vulkan: add cpy bf16 -> f32 pipelines (llama/22677) 2026-05-25 12:26:07 +03:00
Jeff Bolz e417ce7aeb vulkan: Support unaligned tensors for ROPE (llama/22637) 2026-05-25 12:26:07 +03:00
Jeff Bolz c7dd64c606 vulkan: fuse SSM_CONV + BIAS + SILU (llama/22653) 2026-05-25 12:26:07 +03:00
Winston Ma ef5ddecff9 vulkan: removed duplicate #include <memory> in headers (llama/23144) 2026-05-25 12:26:07 +03:00
Ori Pekelman e78e693017 ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500) 2026-05-25 12:26:07 +03:00
Dev-X25874 3583e35e0d ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (ggml/1492) 2026-05-25 12:26:07 +03:00
Georgi Gerganov 587dca0eda ggml : bump version to 0.12.0 (ggml/1494) 2026-05-25 12:26:07 +03:00
Aman Gupta 23f956de33 llama + spec: MTP Support (llama/22673)
* spec: support MTP

* fix batch size

* rename files

* cont : simplify (llama/7)

* MTP: clean-up (llama/9)

* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion

* mtp -> draft-mtp

* remove unused llama_arch

* add need_embd in speculative

* llama: allow partial seq_rm for GDN models for speculative decoding

Currently speculative checkpoint needs to restart from a checkpoint
after some draft tokens are not accepted, this leads to some wastage in
running the target again. This PR adds the ability to rollback upto
`draft_max` by storing the GDN intermediates.

* fix pending state

* vulkan: add GDN partial rollback

* meta: extend check to axis 1

* metal: add GDN partial rollback

Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: 8c05923630

Assisted-by: llama.cpp:local pi

* delta_net_base: use ggml_pad instead of new_tensor

* review: add need_rs_seq

* review: rename part_bounded to n_rs

* review: deslop comments

* review: rename, add asserts

* server : adjust checkpoint logic (llama/11)

* server : adjust checkpoint logic

* cont : rm asserts

* server-context: fix early exit

* spec : fix compatibility with n-gram and add TODOs (llama/13)

* metal : cleanup

* llama : fix faulty bitwise check in recurrent memory

* server : disable RS-based MTP in combination with other spec types

* spec : add TODOs

* cont : fix comment

* cont : update comment

* common : fix logic for ngram + mtp compat

* llama-memory: enable checkpointing with partial rollback

* cont: add test-case for loading into a dirty ctx

* llama-memory-recurrent: clear rs_idx in clear

* download: fix mtp path

* llama-arch: fix enorm op

* docs: update docs

* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-25 12:26:07 +03:00
Pranav Dhinakar 18a61f44b6 ggml-hexagon: cpy: add contiguous fast-path in reshape copy (llama/23076) 2026-05-25 12:26:07 +03:00
Johannes Gäßler e62d5893f4 HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (llama/22880)
Adds RDNA3 support to the CUDA mma FA kernel. To make the RDNA3 tensor cores work with the FP16 accumulation for VKQ the tiles they need to be 32 logical units long in direction of the attention head; for head sizes 80 and 112 that are not exactly divided by 32 the regular length of 16 with FP32 accumulation is used instead. The longer tiles also enable more efficient transposition for a warp size of 32 which is why it's also used for RDNA4. However, this scrambles the data layout of the accumulators along the attention head dimension. To prevent accidental misuse I added another entry to ggml_cuda_mma::data_layout.

I also tuned the kernel parameters for RDNA3, RDNA4, and CDNA1 in general, during which I discovered that the kernel can be made to work for head sizes up to 256 for CDNA. For RDNA3/4 I was not able to get better performance that the tile kernel for head sizes > 128.
2026-05-25 12:26:07 +03:00
Zheyuan Chen 13133ab299 ggml-webgpu: makes the flash attn vec path subgroup-aware (llama/23040)
* ggml-webgpu: makes the flash attn vec path compile and size its split/reduce work from the device’s reported subgroup range instead of assuming 32 subgroup size.

* ggml-webgpu: remove the extra max_wg_size >= max_subgroup_size guard. Remove hardcoded 32 when determine the value of reduce_wg_size and vec_nwg_cap
2026-05-25 12:26:07 +03:00
Georgi Gerganov 592a8cd15d logs : reduce (llama/23021)
* logs : reduce

* args : fix envs

* server : fix build

* common : print verbosity level at start

* server : clean-up logs

* server : print prompt processing timings + sampling params

* minor : whitespaces
2026-05-25 12:26:07 +03:00
alex-spacemit f0223903aa ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863) 2026-05-25 12:26:07 +03:00
Ruben Ortlam 97ba44338f vulkan: fix matmul integer pipeline selection (llama/23005)
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools
2026-05-25 12:26:07 +03:00
Katostrofik 06cfc3653b SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (llama/21597)
* SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations

Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
zeMemAllocDevice uses the SVM/P2P path with no host staging.

On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
no performance regression.

All Level Zero calls include automatic fallback to the original SYCL
allocation path if Level Zero interop is unavailable.

* SYCL: address review feedback - remove try/catch, check device types, deduplicate

- Remove try/catch from malloc/free/memcpy helpers, check backend and
  device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
- Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp
  and declare in common.hpp to eliminate code duplication
- Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls
- Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the
  host-staged path for iGPU-to-dGPU transfers
- Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH)
  in CMakeLists.txt (co-authored with @arthw)

* SYCL: add build/runtime flags for Level Zero, address review feedback

Implements the architecture suggested by @arthw: compile-time and runtime
flags to cleanly separate Level Zero and SYCL memory API paths.

- Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level
  Zero code is wrapped in #ifdef so the build works on systems without
  the Level Zero SDK installed (e.g. CPU-only CI servers). Both the
  loader library and headers are checked before enabling.

- Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls
  whether Level Zero or SYCL memory APIs are used. Only one API style is
  used per session, no mixing. If Level Zero is enabled but the devices
  don't support the Level Zero backend, it auto-disables with a warning.

- Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory
  is not called anywhere in the backend) and used try/catch for flow control.

- Update SYCL.md with documentation for both new parameters.

Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both
GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development
(Claude). Code reviewed and tested on my hardware.

* SYCL: unify Level Zero malloc/free call sites, address review feedback

Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device.
Both functions are now unconditionally available — Level Zero code is
#ifdef'd inside the functions, not at call sites. All call sites use
uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks.

Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack
traces on failure, eliminate duplicated #ifdef/else patterns at 6 call
sites (-29 lines net).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths

Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs
so the Level Zero code path is compiled and tested in CI.

Fix two bugs found during extended dual-GPU testing (no
ONEAPI_DEVICE_SELECTOR set):

- The Level Zero backend check was iterating all SYCL devices
  including CPU. The OpenCL CPU device caused Level Zero to be
  disabled for the GPUs, defeating the fix on multi-GPU systems.
  Added is_gpu() filter so only GPU devices are checked.

- sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers)
  were still calling sycl::malloc/sycl::free directly, bypassing the
  Level Zero path. Routed through ggml_sycl_malloc_device/free_device
  for consistency with the other device memory call sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: address arthw review feedback on Level Zero memory API structure

- Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp;
  only ggml_sycl_free_device (used by common.cpp) stays in common.cpp
- Switch both helpers to use g_ggml_sycl_enable_level_zero global
  instead of per-call queue backend checks
- Remove #ifdef wrapper from global definition; always declare at 0,
  add #else branch in init block so it stays 0 when L0 not compiled in
- Update init loop comment to explain GPU-only device check
- CMakeLists: message(STATUS) before the if block; align option wording

AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro
B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU
Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed
<5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* SYCL: remove unused cstdio/cstdlib includes from common.cpp

Leftover from the deleted ggml_sycl_queue_supports_level_zero helper.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Apply suggestions from code review

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* SYCL: preserve Level Zero allocation path during early malloc

* ci: fix Level Zero package conflict in Intel Docker build

* ci: find Level Zero loader in oneAPI package step

* ci: allow Windows SYCL package without Level Zero DLL

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2026-05-25 12:26:07 +03:00
Daniel Bevenius e414ecf674
cmake : add CMakePresets.json [no ci] (#3808)
This commit adds a CMakePresets.json file similar to the one in
llama.cpp.

The motivation for this is that this provides sharable named
configuration which can be used with cmake --preset <name>.

It also allows for extendins these preset with a
CMakeUserPresets.json for specific hardware (like CPUs),
architectures, and toolchains etc.
2026-05-25 12:25:15 +03:00
OrbisAI Security b3877e10c0
fix: in bindings/ruby/test/jfk_reader/jfk_reader in jfk_reader.c (#3756)
* fix: V-002 security vulnerability

Automated security fix generated by Orbis Security AI

* fix(ruby): use Ruby allocator macros in jfk_reader and fix memory leak

- Replace calloc/free with ALLOC_N/xfree to match Ruby binding conventions
  (ALLOC_N handles overflow checking and raises NoMemoryError on failure)
- Free temporary samples buffer after conversion loop (was leaked)
- Add NULL check for fopen return value with rb_raise
- Add comment clarifying n_samples is a compile-time constant

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(ruby): return false instead of rb_raise in memory_view callback

rb_memory_view_get_func_t callbacks should communicate errors via
return value (false), not exceptions. rb_memory_view_get has no
exception-handling wrapper around get_func calls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* replacing ALLOC_N with rb_protect as ALLOC_N raises Ruby exceptions

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-25 08:19:23 +02:00
Pascal 0ccd896f5b
common : fix server /inference fails to decode in-memory audio (regression) (#3818)
* common: add memory buffer overload of read_audio_data

whisper-server /inference without --convert passed the uploaded file
bytes to read_audio_data as a filename, so ma_decoder_init_file tried
to open a path starting with "RIFF" and failed. every request returned
HTTP 400 "Invalid request" on builds without WHISPER_FFMPEG, which is
the default.

factor the PCM extraction into a shared helper and add an overload that
decodes straight from a memory buffer via ma_decoder_init_memory, which
the function already used for the stdin path. server now calls it with
the upload content. the filename overload behavior is unchanged.
2026-05-22 08:27:35 +02:00
Daniel Bevenius 8443cf05e3
ci : use github ubuntu-22.04-arm runner instead of qemu (#3815)
* ci : use github ubuntu-22.04-arm runner instead of qemu

This commit updates the ubuntu-22-gcc-arm64 job to use a arm github
runner instead of QEMU.

The motivation for this is that we get intermittent failure specifically
related to QEMU. For example:
```console
Segmentation fault (core dumped)
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault (core dumped)
dpkg: error processing package libc-bin (--configure):
installed libc-bin package post-installation script subprocess returned error exit status 139
Processing triggers for ca-certificates (20240203~22.04.1) ...
Updating certificates in /etc/ssl/certs...
0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
Errors were encountered while processing:
libc-bin
E: Sub-process /usr/bin/dpkg returned an error code (1)
```
This is an attempt to try to avoid QEMU and hence avoid this issue.

* ci : remove QEMU where possible
2026-05-21 10:59:58 +02:00
Daniel Bevenius afa2ea544f
whisper : set bench data for each iteration (#3812)
* whisper : set bench data for each iteration

This commit updates whisper_bench_ggml_mul_mat_str to intialize the
tensors data for each iteration.

The motivation for this is that is currently possible for a previous
run's results, F32 values, to leak into the next run. When it is time
for the F16 iteration then F32 results can cause NaN values to appear
in the tensor values causing the F16 iteration to fail.

Refs:https://github.com/ggml-org/whisper.cpp/actions/runs/25901678402/job/76152894644?pr=3735

* ci : set GGML_NATIVE=OFF if x86_64

This commit sets GGML_NATIVE=OFF for x86_64 architectures.

The motivation for this is to try to get CI to pass and the theory is
that the libggml-cpu.so library in the ccache might have been built by a
runner that supports a different instruction set. When another runner
that does not support that instruction set tries to use it, it will fail
with a segmentation fault.

I'm not sure about this yet but going to try this out and if it does not
work I'll ssh into the runner to debug further.
2026-05-19 08:58:43 +02:00
petterreinholdtsen 47b9eb37a3
examples : fix memory leak in read_audio_data (#3810)
This commit addresses a memory leak in the `read_audio_data` function
where it is currently possible that a call to `ma_decoder_init_file`
succeeds and the function returns early without calling
`ma_decoder_uninit`. A similar situation can occur with
`ma_decoder_init_memory`.

Refs: https://bugs.debian.org/1124796

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2026-05-18 12:16:39 +02:00
Andreas Lubbe 6227a0ef73
server : Return speaker information in JSON (#3782) 2026-05-18 09:18:04 +02:00
Andreas Lubbe 968eebe772
server: add support for carry_initial_prompt (#3781)
* Add support for carry_initial_prompt on the server

* Update README
2026-05-15 14:03:17 +02:00
Georgi Gerganov 46ca43d639 talk-llama : sync llama.cpp 2026-05-14 21:26:48 +03:00
Georgi Gerganov 69500f5502 sync : ggml 2026-05-14 21:26:48 +03:00
Zheyuan Chen e4ce42e55f ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (llama/23020) 2026-05-14 21:26:48 +03:00
scutler-nv 97371e9285 Fix for issue #22974. Cast intermediate results to float before adding and casting the result to the destination type. Avoids half+half operator ambiguity. (llama/22994) 2026-05-14 21:26:48 +03:00
shaofeiqi d4a4d87f0e opencl: add q5_0 and q5_1 MoE for Adreno (llama/22985)
* opencl: add q5_0 moe support

* opencl: add q5_1 moe support

* opencl: avoid potential leak

* opencl: suppress unused var warning when building for non-Adreno

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-05-14 21:26:48 +03:00
lhez b19beb6027 opencl: fix crash when warming up MoE on Adreno (llama/22876) 2026-05-14 21:26:48 +03:00
Masashi Yoshimura 1cbbd0b6d0 flush the gpu profile timestamp before the queryset is overflowed (llama/22995) 2026-05-14 21:26:48 +03:00