whisper.cpp/ggml
Max Krasnyansky f8df28d331 hexagon: add support for CONCAT op (llama/23648)
* hexagon: add support for CONCAT with optimized concat_2d_transposed

qwen3.5 models are quite heavy on the CONCAT with large and transposed src1.

* hex-concat: use fastdiv in generic version

* hex-concat: make checks for transposed a bit more readable

* hex-concat: reoder dma ops for better pipelining

* hex-cont/cpy: optimize CPY and CONT ops

The primary change is to avoid scalar divs in the inner loops.
We were calling hvx_copy_uu(... type_size) where type_size is non a constexpr.
This causes runtime divs by that value which is normally just 4 or 2 (f32/f16).

* hex-get-rows: optimize GET_ROWS for large rows

We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models
that do lots of GET_ROWS with huge (2MB+ rows).

Also bump the DMA queue depth now that we can take advantage of it.

* hex-concat: unroll the inner loops of concat_2d

* hex-concat: more updates to concat_2d to improve perf a bit further

* hex-cpy: fixed n_rows per thread checks in the copy ops

* hmx-fa: fix alignment issues while computing dma sizes

* hex-set-rows: add early returns for idle threads

* hvx-rope: minor optimization to replace loops with fastdiv logic

* hex-rope: replace scalar tail processing with HVX

* hex-rope: optimize rope cache init with HVX

Add hvx-utils sin/cos helpers that use an aprox method (similar to rsqrt, inverse, etc)
Use the helpers to optimize ROPE.
2026-05-29 09:47:30 +03:00
..
cmake ggml : Parallelize quant LUT init (llama/23595) 2026-05-25 12:26:07 +03:00
include ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (llama/22341) 2026-05-25 12:44:04 +03:00
src hexagon: add support for CONCAT op (llama/23648) 2026-05-29 09:47:30 +03:00
.gitignore whisper : reorganize source code + improve CMake (#2256) 2024-06-26 19:34:09 +03:00
CMakeLists.txt ggml : bump version to 0.13.0 (ggml/1510) 2026-05-25 12:44:04 +03:00