4174 Commits

Author SHA1 Message Date
uvos 081dc773a5 ci : add hip quality check (llama/20430)
* CI: add hip quality check

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Revert "Update .github/workflows/hip-quality-check.yml"

This reverts commit efa0bfcdb01dfac0feee674987a0482d50f46145.

* scripts: gcn-cdna-vgpr-check.py: enforce int type for total_vgprs

* scripts: gcn-cdna-vgpr-check.py: add flash attention instances to ignore list

* Bump ccache version

* Add missing separators to list

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-29 15:04:36 +03:00
Reese Levine 551bb82960 ggml webgpu: ops support for qwen3.5 (SET, TRI_SOLVE, SSM_CONV, GATED_DELTA_NET) + GET_ROWS optimization (llama/20687)
* Implement l2_norm, set, tri

* Add DIAG/SOLVE_TRI

* Add SSM_CONV

* Better get_rows and gated_delta_net to support qwen3.5

* Clean up, update ops.md

* Fix binding_index type for wasm

* Fix read write annotations

* cleanups
2026-03-29 15:04:36 +03:00
Eve 43c7c0f86c vulkan: dequantize iq4_xs 4 at a time (llama/20657) 2026-03-29 15:04:36 +03:00
Charles Xu fea629d00f cmake : fix build warning when kleidiai is enabled (llama/20457)
* cmake : fix build warning when kleidiai is enabled

* remove LLAMA_ARG_THREADS from KleidiAI backend
2026-03-29 15:04:36 +03:00
Chenguang Li 2a6de29364 CANN: handle in-place ROPE on non-contiguous f32 tensors (llama/20274)
RotaryPositionEmbedding on CANN fails when src and dst share the same
non-contiguous buffer (inplace + view), because the operator overwrites
source data before it is fully read.

Add a branch that detects this case and uses contiguous temporary
buffers: copy src to temp, run ROPE into another temp, then copy back
to the non-contiguous dst. Fixes 20 failing ROPE tests (f32, v=1,
inplace=1).
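
A minimal sketch of the temporary-buffer strategy described above, written in Python rather than the CANN backend's C++. The flat list standing in for device memory, the `rope_pairs` and `rope_inplace_strided` helpers, and the single shared rotation angle are simplifying assumptions for illustration; none of these names come from the commit, and real RoPE uses position-dependent angles.

```python
import math

def rope_pairs(src, theta):
    """Rotate consecutive pairs (x0, x1) by theta and return a new contiguous list."""
    c, s = math.cos(theta), math.sin(theta)
    out = [0.0] * len(src)
    for i in range(0, len(src), 2):
        x0, x1 = src[i], src[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out

def rope_inplace_strided(buf, offset, stride, n, theta):
    """Apply the rotation to a strided (non-contiguous) view of buf, in place.

    Hypothetical helper: gathers the view into a contiguous temporary,
    computes into a second temporary, then scatters the result back, so the
    source is fully read before any destination element is written.
    """
    idx = [offset + k * stride for k in range(n)]
    tmp_src = [buf[i] for i in idx]          # copy src to contiguous temp
    tmp_dst = rope_pairs(tmp_src, theta)     # run the op into another temp
    for i, value in zip(idx, tmp_dst):       # copy back to the strided dst
        buf[i] = value

# The view covers every second element; the -1.0 entries are outside the view.
buf = [1.0, -1.0, 0.0, -1.0, 2.0, -1.0, 0.0, -1.0]
rope_inplace_strided(buf, offset=0, stride=2, n=4, theta=math.pi / 2)
```

The extra copies cost memory bandwidth, which is presumably why the commit limits this path to the detected in-place, non-contiguous case.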

Signed-off-by: noemotiovon <757486878@qq.com>
2026-03-29 15:04:36 +03:00
Masashi Yoshimura 3d004fbf0a ggml-webgpu: Update the `RMS_NORM` preprocessor and add `L2_NORM` (llama/20665)
* Update the preprocessor of RMS_NORM and add L2_NORM.

* Fix the name of rms_norm to row_norm.
2026-03-29 15:04:36 +03:00
Masashi Yoshimura 12015a2174 ggml-webgpu: Add support for `DIAG` and `TRI` (llama/20664)
* Add support for DIAG and TRI.

* Remove extra ttype and add a comment for TRI op.
2026-03-29 15:04:36 +03:00
Chenguang Li dfba84cb47 CANN: support flash attention for head dim not multiple of 16, fix ALiBi slope offset (llama/20031)
- Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by
  padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2,
  then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp).
- Fix aclnn_get_slope second-part offset: use ggml_type_size(dtype) instead of
  sizeof(float) so ALiBi slopes are correct when dtype is F16 (e.g. GQA with
  48 heads); fixes buffer overflow and large numerical errors in those cases.
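
The pad-then-slice idea in the first bullet can be illustrated with a small pure-Python sketch. It is not the CANN implementation: `attention` is a plain single-head softmax attention written for this example, `pad_rows` is a hypothetical helper, and `ggml_pad` mirrors the round-up-to-multiple behaviour of the `GGML_PAD` macro under the assumption that the alignment is a power of two.

```python
import math

def ggml_pad(x, n):
    """Round x up to the next multiple of n (n assumed to be a power of two)."""
    return (x + n - 1) & ~(n - 1)

def pad_rows(rows, width):
    """Zero-pad every row on the right up to the given width."""
    return [row + [0.0] * (width - len(row)) for row in rows]

def attention(q, k, v, scale):
    """Reference single-head softmax attention over lists of rows."""
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(weights[j] * v[j][c] for j in range(len(v))) / total
            for c in range(len(v[0]))
        ])
    return out

D = 3                                   # head dim that is not a multiple of 16
q = [[0.1, 0.2, 0.3], [0.5, -0.4, 0.0]]
k = [[0.3, 0.1, -0.2], [0.0, 0.7, 0.4]]
v = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
scale = 1.0 / math.sqrt(D)              # scale still uses the original D

reference = attention(q, k, v, scale)

d_padded = ggml_pad(D, 16)
padded = attention(pad_rows(q, d_padded), pad_rows(k, d_padded),
                   pad_rows(v, d_padded), scale)
sliced = [row[:D] for row in padded]    # slice the output back to D
```

Zero padding leaves every Q·K dot product unchanged and only adds zero columns to the output, so slicing recovers the unpadded result.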
2026-03-29 15:04:36 +03:00
Reese Levine d6a0f0d075 Move to no timeout for WaitAny in graph submission to avoid deadlocks in some cases on llvm-pipe backends (llama/20618) 2026-03-29 15:04:36 +03:00
Shaw Nguyen 14caedfa18 ggml-cpu/x86: fix unused changemask warning in repack (llama/20692) 2026-03-29 15:04:36 +03:00
uvos 61c7cd024d HIP : ignore return of hipMemAdvise [no ci] (llama/20696) 2026-03-29 15:04:36 +03:00
Krishna Sridhar e222814fc4 hexagon: add neg, exp, sigmoid, softplus ops, cont, repeat ops (llama/20701)
Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear
attention layers. These ops follow the existing unary-ops pattern
with VTCM DMA double-buffering.

- neg: negate via scale by -1.0
- exp: uses existing hvx_exp_f32 HVX intrinsics
- sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics
- softplus: log(1 + exp(x)) scalar fallback
- CONT reuses the existing CPY infrastructure since making a tensor
  contiguous is equivalent to a same-type copy.
- REPEAT implements tiled memory copy with multi-threaded execution via
  the worker pool, supporting f32 and f16 types. The kernel parallelizes
  across output rows and uses memcpy for each tile.
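
For reference, the math behind the ops listed above can be sketched in a few lines of Python. This is only a scalar illustration of the definitions; it says nothing about the HVX intrinsics, VTCM DMA double-buffering, or worker-pool threading, and the function names are chosen for this example.

```python
import math

def neg(xs):
    """Negate by scaling with -1.0, as the commit describes."""
    return [x * -1.0 for x in xs]

def sigmoid(xs):
    """Logistic function 1 / (1 + exp(-x))."""
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

def softplus(xs):
    """log(1 + exp(x)), the scalar formula named in the commit."""
    return [math.log(1.0 + math.exp(x)) for x in xs]

def repeat_rows(src, reps_rows, reps_cols):
    """Tile a 2D list: each output row copies one source row reps_cols times.

    Output rows are independent of each other, which is what makes a
    row-parallel implementation straightforward.
    """
    n_src = len(src)
    return [list(src[r % n_src]) * reps_cols for r in range(n_src * reps_rows)]

tiled = repeat_rows([[1, 2], [3, 4]], reps_rows=2, reps_cols=2)
```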

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-03-29 15:04:36 +03:00
Ruben Ortlam 16ca5e6fb1 vulkan: disable mmvq on Intel Windows driver (llama/20672)
* vulkan: disable mmvq on Intel Windows driver

* improve comment
2026-03-29 15:04:36 +03:00
Kevin Hannon 906aef3da8 ggml-blas: set mkl threads from thread context (llama/20602)
* ggml blas: set mkl threads from thread context

* add code to run blas locally
2026-03-29 15:04:36 +03:00
Taimur Ahmad c890a9d9b4 ggml-cpu: fix RVV checks in quants and repacking (llama/20682)
* ggml-cpu: refactor quants.c; add rvv check

* ggml-cpu: refactor; disable generic fallback
2026-03-29 15:04:36 +03:00
Ruben Ortlam 0ad6ceef59 vulkan: async and event fixes (llama/20518)
* vulkan: fix event wait submission, event command buffer reset

* fix event command buffer reset validation error

* also reset command buffers before reuse

* use timeline semaphores instead of fences for event_synchronize

* don't use initializer list for semaphore wait info

* use multiple events to avoid reset issues

* fix event reuse issue with multiple vectors

* add semaphore wait condition also if compute_ctx already exists

* remove event pending stage
2026-03-29 15:04:36 +03:00
Justin Bradford ab7d305b75 kleidiai : fix MUL_MAT support for batched (3D) inputs (llama/20620)
* kleidiai : fix MUL_MAT support for batched (3D) inputs

The supports_op() check incorrectly rejected MUL_MAT operations with 3D
inputs (ne[2] > 1), but the actual compute_forward_qx() implementation
handles batched inputs correctly via a loop over ne12.

This caused models with Q4_0/Q8_0 weights to crash during graph scheduling
when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during
loading (tested with 2D inputs) but the runtime used 3D inputs.

Also relax the buffer check to allow supports_op() to be called during
weight loading when src[0]->buffer is NULL.

Fixes #20608

* Kleidiai support_ops should only return true for 3D inputs, not also 4D
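
A rough sketch of the acceptance rule these two bullets describe, in Python. It is an assumption-heavy illustration, not the KleidiAI backend's code: `supports_mul_mat`, the `src1_ne` shape list, and the string buffer labels are all hypothetical stand-ins for the real tensor and buffer types.

```python
def supports_mul_mat(src1_ne, src0_buffer):
    """Hypothetical predicate for MUL_MAT support.

    src1_ne:     4-entry shape in ggml order (ne[0]..ne[3]) of the input.
    src0_buffer: None while weights are still being loaded, otherwise a
                 label naming the buffer type that holds the weights.
    """
    # Relaxed buffer check: an unset buffer is allowed during weight loading.
    if src0_buffer is not None and src0_buffer != "kleidiai":
        return False
    # Batched 3D inputs (ne[2] > 1) are accepted; 4D inputs are not.
    if src1_ne[3] != 1:
        return False
    return True
```

With this rule a weight accepted at load time with a 2D probe is still accepted at run time when the input gains a batch dimension.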
2026-03-29 15:04:36 +03:00
Ruben Ortlam 49adc8b470 vulkan: allow graphics queue only through env var (llama/20599)
* vulkan: avoid graphics queue on non-RADV AMD drivers

* avoid graphics queues on small GPUs

* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE

* reenable transfer queue if graphics queue is not used
2026-03-29 15:04:36 +03:00
Neo Zhang 6494251197 enhance UPSCALE to support all UT cases (llama/20637)
* [SYCL] enhance UPSCALE to support more cases

* rm test case result of SYCL1
2026-03-29 15:04:36 +03:00
Martin Klacer 9232af59ba kleidiai: add data type check to get_tensor_traits (llama/20639)
* kleidiai: add data type check to get_tensor_traits

 * Added check for F16 data type into get_tensor_traits path with input data
   not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8)

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7

* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp

updated kleidiai.cpp file as per suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-29 15:04:36 +03:00
Ruben Ortlam 724ea71cf9 vulkan: fix flash attention dot product precision (llama/20589) 2026-03-29 15:04:36 +03:00
Aman Gupta dae7781052 CUDA: GDN hide memory latency (llama/20537) 2026-03-29 15:04:36 +03:00
Sigbjørn Skjæret 1335dfa785 sycl : fix for untransposed GDA recurrent state (llama/20583) 2026-03-29 15:04:36 +03:00
KITAITI Makoto 76684141a5 ruby : fix dangling pointers, memory leak, and SEGV on parallel transcription (#3715)
* Prevent dangling pointers

* Use proper free function

* Free callback containers

* Set default log callback when nil is passed to log_set

* Raise error if callbacks are set during parallel transcription

* Bump version to 1.3.7

* Make tests follow spec change

* Add note on parallel transcription and callbacks

* Update signature of Whisper.log_set [skip ci]
2026-03-22 02:03:00 +09:00
Georgi Gerganov 9386f23940 release : v1.8.4 2026-03-19 10:40:13 +02:00
Georgi Gerganov ef3463bb29 ci : update workflows 2026-03-18 22:43:38 +02:00
Georgi Gerganov 4bbce1e5b2 benches : update 2026-03-18 22:34:51 +02:00
Georgi Gerganov f5b477ab09 sync : ggml 2026-03-18 15:18:24 +02:00
Georgi Gerganov b2be16208d ggml : bump version to 0.9.8 (ggml/1442) 2026-03-18 15:18:24 +02:00
Georgi Gerganov 945d3151d9 ggml : restore ggml_type_sizef() to avoid major version bump (ggml/1441) 2026-03-18 15:18:24 +02:00
lohopupa dc96116622 fix: VAD time mapping timestamp drift caused by overlap samples (#3711)
* whisper : fix VAD segment overlap boundary handling

 - Use original segment length (pre-overlap) for vad_end in the time
   mapping table, so segment boundaries are preserved accurately

Claude Sonnet 4.6 (Low)

* whisper : remove intermediate VAD time mapping points

Now that segment boundaries are mapped accurately, the intermediate
point interpolation is no longer necessary.

---------

Co-authored-by: Lohopupa <lohopupa@gmail.com>
2026-03-17 07:19:08 +01:00
Alan 79218f51d0 go : handle EOF correctly in model download (#3671) 2026-03-16 13:44:18 +02:00
Aiudadadadf 975b979834 py : replace deprecated openvino-dev with openvino>=2023.3.0 (#3678)
* models: replace deprecated openvino-dev with openvino>=2023.3.0 for Python 3.12+ compat

* models: remove unused openvino.tools.mo import from convert-whisper-to-openvino.py
2026-03-16 13:41:54 +02:00
Gaël James 21665eab4c examples : Allow max_len to be used for any output format (#3679) 2026-03-16 13:33:56 +02:00
Igor Loskutov 136dc2eb12 server: return proper HTTP status codes for error responses (#3707)
Several error paths in the /inference and /load endpoints returned
HTTP 200 with a JSON error body, making it impossible for clients
to distinguish errors from successful responses by status code.

Set 400 for client errors (missing file field, unreadable audio,
missing/invalid model) and 500 for server errors (ffmpeg conversion
failure). The two existing status-code sites (499 for client
disconnect, 500 for processing failure) are unchanged.
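
The mapping described above can be summarised in a short Python sketch. The `ServerError` enum, its member names, and `error_response` are hypothetical and exist only for this illustration; the actual server is C++ and its identifiers and JSON body shape are not shown in this commit message.

```python
from enum import Enum

class ServerError(Enum):
    # Client-side problems
    MISSING_FILE_FIELD = "missing file field"
    UNREADABLE_AUDIO = "unreadable audio"
    INVALID_MODEL = "missing or invalid model"
    # Server-side problems
    FFMPEG_CONVERSION_FAILED = "ffmpeg conversion failure"
    PROCESSING_FAILED = "processing failure"
    # Pre-existing special case
    CLIENT_DISCONNECTED = "client disconnect"

STATUS_CODES = {
    ServerError.MISSING_FILE_FIELD: 400,
    ServerError.UNREADABLE_AUDIO: 400,
    ServerError.INVALID_MODEL: 400,
    ServerError.FFMPEG_CONVERSION_FAILED: 500,
    ServerError.PROCESSING_FAILED: 500,
    ServerError.CLIENT_DISCONNECTED: 499,
}

def error_response(err):
    """Return (status, body) so clients can branch on the status code alone."""
    return STATUS_CODES[err], {"error": err.value}

status, body = error_response(ServerError.UNREADABLE_AUDIO)
```

A client can then treat any 4xx as a request to fix and any 5xx as a retry candidate without parsing the JSON body.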
2026-03-16 13:33:06 +02:00
Georgi Gerganov 27fa20774a ggml : try fix arm build (#0) 2026-03-16 13:10:15 +02:00
Georgi Gerganov 2bc630f197 talk-llama : sync llama.cpp 2026-03-16 13:10:15 +02:00
Georgi Gerganov ab1252c19e sync : ggml 2026-03-16 13:10:15 +02:00
David366AI d4bc312169 ggml : extend im2col f16 (ggml/1434)
* examples/yolo: fix load_model memory leak

* fix/issue-1433 ggml_compute_forward_im2col_f16 assert error

* fix/issue-1433
2026-03-16 13:10:15 +02:00
Georgi Gerganov 81ea958719 common : add nvfp4 (ggml/0) 2026-03-16 13:10:15 +02:00
Johannes Gäßler d7926e62d4 CUDA: limit number of FA stream-k CUDA blocks (llama/20586) 2026-03-16 13:10:15 +02:00
Pascal 2fb6aea8ad ggml: avoid creating CUDA context during device init (llama/20595) 2026-03-16 13:10:15 +02:00
MoonShadow b327a321a2 ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain (llama/20536)
* ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain

On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain
returns hipErrorInvalidValue because the hint is not applicable to UMA systems.
The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on
APU systems such as AMD Strix Halo (gfx1151).

Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it
without error checking and clear any resulting error with hipGetLastError().

Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory
issues on APU systems, and store totalGlobalMem in device info.

Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits
hipMallocManaged to ~64GB regardless of available system RAM. A fix has been
submitted upstream: https://github.com/ROCm/rocm-systems/pull/4077

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ggml/hip: remove unrelated changes, keep only hipMemAdviseSetCoarseGrain fix

---------

Co-authored-by: moonshadow-25 <moonshadow-25@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-16 13:10:15 +02:00
Bartowski 6770239830 ggml : guard against sumq2 being 0 in IQ4_NL (llama/20460) 2026-03-16 13:10:15 +02:00
PikaPikachu 55c66106af cuda : add RDNA4-specific MMVQ parameter table for bs=1 decode (llama/19478)
* mmvq: add RDNA3/RDNA4-specific parameter table (nwarps=8, rows=1)

* mmvq: add dedicated RDNA3 parameter table

* mmvq: exclude RDNA3.5 (gfx1150/1151) from RDNA3 table
2026-03-16 13:10:15 +02:00
Ruben Ortlam cd02195b8f vulkan: use graphics queue on AMD (llama/20551)
* vulkan: use graphics queue on AMD for slightly better performance

* disable async transfer queue on AMD
2026-03-16 13:10:15 +02:00
Georgi Gerganov b312018435 metal : add FA specialization for HSK = 320, HSV = 256 (llama/20549) 2026-03-16 13:10:15 +02:00
Max Krasnyansky 55f8cfdaed hexagon: Q4_0 and MXFP4 repack fixes (llama/20527)
* hexagon: fix tail corruption with rows sizes not multiple of 256

* hexagon: use different stride for repacking partial blocks

* hex-mm: update repack and kernels to avoid shuffles for full 256-element blocks

Previous commit changed the repacking to use even:odd (0:1,2:3,..) packing
instead of the original (0:128,1:129,...) packing in order to fix tail corruption.
Since the mm kernels already deal with partial tails we can use even:odd
packing only for the last block.
This avoids the performance penalty of having to shuffle to zip the elements
in the common case.

* hex-mm: update rmpy x8 for better optimizations

* hex-mm: tighten supported MUL_MAT checks to avoid spurious failures

* hex-mm: use vzero to init accumulators

* hex-mm: properly call partial rmpy_x8
2026-03-16 13:10:15 +02:00
Neo Zhang c5f9a49b51 add op gated_delta_net (llama/20455) 2026-03-16 13:10:15 +02:00
Adrien Gallouët 93d09fdb23 ggml : add native AVX512-FP16 support for F16 operations (llama/20529)
The overall benchmark speed remains almost the same because the CPU is
now calculating faster than the RAM can deliver the data. (See perf stat
results below showing 2.7 billion fewer instructions).

Also note that this path is only enabled for native builds or with
custom flags.

now:
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        189,073.52 msec task-clock                       #   14.658 CPUs utilized
               404      context-switches                 #    2.137 /sec
                19      cpu-migrations                   #    0.100 /sec
           372,390      page-faults                      #    1.970 K/sec
   310,877,195,595      instructions                     #    0.54  insn per cycle
   581,071,530,602      cycles                           #    3.073 GHz
    19,352,107,994      branches                         #  102.352 M/sec
        48,304,438      branch-misses                    #    0.25% of all branches
    84,998,431,152      L1-dcache-loads                  #  449.552 M/sec
    12,186,410,279      L1-dcache-load-misses            #   14.34% of all L1-dcache accesses

      12.899358742 seconds time elapsed

     187.823044000 seconds user
       1.253416000 seconds sys
```

before:
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        190,594.56 msec task-clock                       #   14.652 CPUs utilized
               436      context-switches                 #    2.288 /sec
                22      cpu-migrations                   #    0.115 /sec
           372,782      page-faults                      #    1.956 K/sec
   313,574,921,966      instructions                     #    0.54  insn per cycle
   586,064,970,425      cycles                           #    3.075 GHz
    19,585,778,563      branches                         #  102.761 M/sec
        48,437,488      branch-misses                    #    0.25% of all branches
    86,219,336,628      L1-dcache-loads                  #  452.370 M/sec
    12,232,085,771      L1-dcache-load-misses            #   14.19% of all L1-dcache accesses

      13.007923164 seconds time elapsed

     189.395316000 seconds user
       1.202612000 seconds sys
```
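
As a quick check of the figures quoted above, the two `perf stat` runs can be compared with a few lines of Python. The numbers are copied from the listings; the dictionary layout and variable names are just for this example.

```python
# Counters copied from the two perf stat listings above.
before = {
    "instructions": 313_574_921_966,
    "cycles": 586_064_970_425,
    "elapsed_s": 13.007923164,
}
now = {
    "instructions": 310_877_195_595,
    "cycles": 581_071_530_602,
    "elapsed_s": 12.899358742,
}

# Instructions retired drop by about 2.7 billion.
saved_instructions = before["instructions"] - now["instructions"]
saved_billions = saved_instructions / 1e9

# Wall-clock time barely moves, consistent with a memory-bound workload.
elapsed_gain_pct = 100.0 * (before["elapsed_s"] - now["elapsed_s"]) / before["elapsed_s"]

print(f"instructions saved: {saved_instructions:,} (~{saved_billions:.1f} billion)")
print(f"elapsed time improvement: {elapsed_gain_pct:.2f}%")
```

The instruction count falls by roughly 0.9% while elapsed time improves by under 1%, which matches the commit's remark that overall benchmark speed stays almost the same.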

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-16 13:10:15 +02:00