* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16
Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
* OP_GATED_DELTA_NET impl
* add back lanes_per_column declaration
* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce
* removed trailing spaces and fixed indentation. Hard-coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot
* support for K>1 state snapshot
* removed picky indent multiple of 4 fixes
* removed return that won't be executed
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now
* hmx-mm: add support for Q4_1
* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot
* hexagon: fix repack scratch buffer overflow
* hex-mm: fix Q4_1 repack buffer sizing
* hexagon: flip the build order for mm and fa (seems to help LTO)
* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1
* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output
* hexagon: resurrect early-wake and add support for polling for op-batch completions
With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in the normal runs and op-batch polling is just for benchmarking.
---------
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32
Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
Note that this breaks some tests until the last commit which fixes
OOB A reads.
* vulkan: Use aligned loads in mul_mat_vec when available
Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec
Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.
Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.
* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes
There was a TODO to fix the OOB reads from the A matrix which we do
here.
It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
* feat: extend repeat op for vulkan
* feat: add repeat_f16 vulkan pipeline
* fix: ensure same dst and src types
* fix: use type_size instead of data types
* fix: use int16 and int32 for repeat shader op
* chore: rename repeat_f* to repeat_i*
* chore: rename repeat vulkan pipelines
* ggml-zendnn: fixed naming of matmul function
* ggml-zendnn: fixed naming of mul_mat_id function
* ggml-zendnn: fixed print in mul_mat_id
---------
Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d
* vulkan: skip conv2d bounds checks when shapes align with tile sizes
* vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d
* vulkan: stage cm2 conv2d accumulator through shmem before global store
* vulkan: add coopmat1 conv2d path
* fallback when using too much shared memory. clean up comments
* Require 16x16x16 and subgroup size 32 or 64
* check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values
* hexagon: add support for CONCAT with optimized concat_2d_transposed
qwen3.5 models are quite heavy on the CONCAT with large and transposed src1.
* hex-concat: use fastdiv in generic version
* hex-concat: make checks for transposed a bit more readable
* hex-concat: reorder dma ops for better pipelining
* hex-cont/cpy: optimize CPY and CONT ops
The primary change is to avoid scalar divs in the inner loops.
We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr.
This causes runtime divs by that value which is normally just 4 or 2 (f32/f16).
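A minimal sketch of the idea, not the actual hexagon code: branch once on the runtime type size and call a helper templated on it, so the divisor inside the helper is a compile-time constant. The names `copy_elems` and `copy_row` are hypothetical and only illustrate the pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// ELEM_SIZE is a template parameter, so the division below is by a
// compile-time constant and the compiler can lower it to a shift.
template <size_t ELEM_SIZE>
static void copy_elems(uint8_t * dst, const uint8_t * src, size_t n_bytes) {
    assert(n_bytes % ELEM_SIZE == 0);
    const size_t n_elems = n_bytes / ELEM_SIZE; // constant divisor
    std::memcpy(dst, src, n_elems * ELEM_SIZE);
}

// Hypothetical dispatcher: the runtime type_size is inspected once,
// outside of any inner loop. Returns false for unsupported sizes.
static bool copy_row(uint8_t * dst, const uint8_t * src, size_t n_bytes, size_t type_size) {
    switch (type_size) {
        case 2: copy_elems<2>(dst, src, n_bytes); return true; // f16
        case 4: copy_elems<4>(dst, src, n_bytes); return true; // f32
        default: return false;
    }
}
```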
* hex-get-rows: optimize GET_ROWS for large rows
We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models
that do lots of GET_ROWS with huge (2MB+) rows.
Also bump the DMA queue depth now that we can take advantage of it.
* hex-concat: unroll the inner loops of concat_2d
* hex-concat: more updates to concat_2d to improve perf a bit further
* hex-cpy: fixed n_rows per thread checks in the copy ops
* hmx-fa: fix alignment issues while computing dma sizes
* hex-set-rows: add early returns for idle threads
* hvx-rope: minor optimization to replace loops with fastdiv logic
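For readers unfamiliar with the fastdiv technique, here is a generic sketch of division by a precomputed "magic" multiplier and shift. It assumes 32-bit unsigned operands; the struct and function names are hypothetical and the hexagon helpers may differ in detail.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing by a fixed divisor d.
struct fastdiv_t {
    uint32_t mp; // magic multiplier
    uint32_t l;  // shift, ceil(log2(d))
};

// Computed once per divisor, outside the hot loop.
static fastdiv_t fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t l = 0;
    while (l < 32 && (uint64_t(1) << l) < d) {
        l++;
    }
    const uint64_t mp = ((uint64_t(1) << 32) * ((uint64_t(1) << l) - d)) / d + 1;
    return { uint32_t(mp), l };
}

// n / d using one multiply, one add and one shift.
// The sum is kept in 64 bits so it cannot overflow for large n.
static uint32_t fastdiv(uint32_t n, fastdiv_t fd) {
    const uint64_t hi = (uint64_t(n) * fd.mp) >> 32;
    return uint32_t((hi + n) >> fd.l);
}

// n % d derived from the quotient.
static uint32_t fastmod(uint32_t n, uint32_t d, fastdiv_t fd) {
    return n - fastdiv(n, fd) * d;
}
```

The constants only pay off when the same divisor is reused many times, which is the case for tensor dimensions inside index calculations.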
* hex-rope: replace scalar tail processing with HVX
* hex-rope: optimize rope cache init with HVX
Add hvx-utils sin/cos helpers that use an approx method (similar to rsqrt, inverse, etc)
Use the helpers to optimize ROPE.
* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K
* Fix to make the editorconfig check pass
* Remove mul-mat-legacy pipeline
* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
* Only run webgpu CI on my fork
* Add webgpu only workflow
* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled
* restore build.yml
This commit adds an ignore for bindings-ruby and bindings-go in
build.yml as these are handled by separate .yml files (separate jobs)
and don't need to trigger a full CI build.
* ci : add on push/pull_request paths ruby job
This commit adds paths to bindings-ruby to only build if changes were
made to bindings/ruby or to include/whisper.h.
* ci : add additional paths [no ci]
This commit re-enables the arm64 docker image builds which were removed
in commit 9366544991 ("ci : fix arm builds"). It also uses
ubuntu-24.04-arm as the runner, which enables us to avoid QEMU.
Resolves: https://github.com/ggml-org/whisper.cpp/issues/2859
* ci : set GGML_NATIVE=OFF for bindings-java
This commit attempts to address an issue with the bindings-java job
which is currently failing.
I've not been able to reproduce this locally on my Windows machine and I
suspect that what might be happening is that the Windows job compiles on a
runner where it has different CPU features, for example AVX512 and when
this dll is used on a different runner that does not have that feature
it will crash.
Refs: https://github.com/ggml-org/whisper.cpp/actions/runs/26496174929/job/78059073255?pr=3829
* ci : also disable BMI2
* docs : add AGENTS.md and CONTRIBUTING.md [no ci]
This commit adds AGENTS.md and CONTRIBUTING.md which are based on the
same files in llama.cpp. They have been modified slightly to fit with
whisper.cpp.
The motivation for this is to clarify the contribution policy in
whisper.cpp so that contributors can have a better understanding of the
expectations and requirements for contributing to the project.
* cli : merge tokens split across UTF-8 boundaries in JSON output
When a multi-byte UTF-8 codepoint (most commonly a CJK character, 3 bytes)
is split across multiple whisper tokens, the -ojf/--output-json-full
writer emitted each token's partial bytes as its own JSON string, producing
invalid UTF-8 that chokes downstream parsers.
Merge adjacent tokens in output_json whenever the accumulated text still
ends on an incomplete UTF-8 sequence. The merged entry keeps the first
token's id/p/t_dtw and extends t1 to the last absorbed token, which
matches how segment text is assembled elsewhere.
Refs #1798
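The merge rule described above can be sketched roughly as follows. This is an illustration only: `token_piece` is a hypothetical stand-in for the per-token JSON fields, and the real output_json code may track more fields (p, t_dtw) and detect incomplete sequences differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// True if s ends in a UTF-8 lead byte that is still missing continuation bytes.
static bool ends_with_incomplete_utf8(const std::string & s) {
    // Walk back over at most 3 continuation bytes (10xxxxxx) to find the lead byte.
    size_t i = s.size();
    size_t n_cont = 0;
    while (i > 0 && n_cont < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        i--;
        n_cont++;
    }
    if (i == 0) {
        return false; // empty, or no lead byte to complete
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    size_t expected = 1;
    if      ((lead & 0xE0) == 0xC0) expected = 2;
    else if ((lead & 0xF0) == 0xE0) expected = 3;
    else if ((lead & 0xF8) == 0xF0) expected = 4;
    return n_cont + 1 < expected;
}

// Hypothetical stand-in for one token entry in the JSON output.
struct token_piece {
    std::string text;
    int         id;
    int64_t     t0;
    int64_t     t1;
};

// Absorb following tokens while the accumulated text is still incomplete.
// The merged entry keeps the first token's id and t0 and extends t1.
static std::vector<token_piece> merge_split_tokens(const std::vector<token_piece> & in) {
    std::vector<token_piece> out;
    bool pending = false;
    for (const token_piece & tok : in) {
        if (pending) {
            out.back().text += tok.text;
            out.back().t1    = tok.t1;
        } else {
            out.push_back(tok);
        }
        pending = ends_with_incomplete_utf8(out.back().text);
    }
    return out;
}
```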
* fix: address review — add braces for consistency, use full issue URL
- Add braces to if/else chain for codebase consistency
- Use full URL for issue #1798 reference
Review: @danbev
---------
Co-authored-by: texasich <texasich@users.noreply.github.com>
Co-authored-by: texasich <texasich@gmail.com>
* TP: fix ggml context size calculation, memory leak
* move split state cache back into the context
* revert to constant ggml context size for cgraphs
* increase headroom for statically allocated tensors
* remove obsolete include
* ggml: implement `gguf_init_from_buffer`
* test: `gguf_init_from_buffer`
* fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer
* fix: use `GGML_UNUSED`
Co-authored-by: Copilot <copilot@github.com>
* fix: remove `total_size` from `gguf_reader`
* fix: file offset calculation, rename `offset` to `data_offset`
Co-authored-by: Copilot <copilot@github.com>
* refactor: extract model loader bug fixes to another PR
* feat: add `gguf_init_from_callback`
* fix: always require a max expected size
* fix: change `gguf_reader_callback_t`'s `output` type to `void *`, change `max_expected_size` and offsets to `uint64_t`
* fix: harden against offset overflow in buffer read
* fix: remove seek behavior from the callback
* feat: `max_chunk_read == 0` means `SIZE_MAX`
* fix: seeking in a gguf file with no tensors
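As an illustration of the overflow hardening and the `max_chunk_read == 0` convention, a bounds check can be written so that `offset + len` is never computed. The `buffer_reader` struct and both helpers below are hypothetical and are not the gguf API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical in-memory reader state (not the real gguf_reader).
struct buffer_reader {
    const uint8_t * data;
    uint64_t        size;
    uint64_t        offset;
};

// Copies len bytes to output, or fails without changing the reader state.
static bool buffer_read(buffer_reader & r, void * output, uint64_t len) {
    // Compare against the remaining size instead of computing offset + len,
    // which could wrap around for a hostile or corrupt length.
    if (r.offset > r.size || len > r.size - r.offset) {
        return false;
    }
    std::memcpy(output, r.data + r.offset, size_t(len));
    r.offset += len;
    return true;
}

// A max chunk size of 0 is treated as "no limit".
static size_t effective_chunk(size_t max_chunk_read) {
    return max_chunk_read == 0 ? SIZE_MAX : max_chunk_read;
}
```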
---------
Co-authored-by: Copilot <copilot@github.com>
- Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl.
- Move the OpenMP detection from ggml-cpu to ggml-base.
- Update OpenMP dependencies in ggml-config.cmake.in.
- change `k_copy_src1_to_contiguous` so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with known start and end
- switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity
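A host-side sketch of such a counting sort, assuming each routed row carries the index of the expert it is assigned to. `expert_mapping` and `build_expert_mapping` are hypothetical names; the actual implementation may run on the device and use a different layout.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct expert_mapping {
    std::vector<int32_t> starts;  // size n_as + 1; expert e owns [starts[e], starts[e + 1])
    std::vector<int32_t> row_ids; // routed row indices, grouped by expert
};

static expert_mapping build_expert_mapping(const std::vector<int32_t> & expert_of_row, int32_t n_as) {
    expert_mapping m;
    m.starts.assign(n_as + 1, 0);

    // Pass 1: histogram of rows per expert, O(n_routed_rows).
    for (int32_t e : expert_of_row) {
        assert(e >= 0 && e < n_as);
        m.starts[e + 1]++;
    }
    // Pass 2: prefix sum turns counts into slice starts, O(n_as).
    for (int32_t e = 0; e < n_as; e++) {
        m.starts[e + 1] += m.starts[e];
    }
    // Pass 3: stable scatter of each row into its expert's slice, O(n_routed_rows).
    std::vector<int32_t> cursor(m.starts.begin(), m.starts.end() - 1);
    m.row_ids.resize(expert_of_row.size());
    for (int32_t r = 0; r < (int32_t) expert_of_row.size(); r++) {
        m.row_ids[cursor[expert_of_row[r]]++] = r;
    }
    return m;
}
```

For example, rows routed to experts {2, 0, 2, 1, 0} with n_as = 3 give starts {0, 2, 3, 5} and row_ids {1, 4, 3, 0, 2}.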
* SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup
BF16 models had no dedicated token generation kernel — they fell through
to the generic full-GEMM path, resulting in ~14% memory bandwidth
utilization on Intel Arc GPUs. This adds BF16 support to the DMMV
(dequantize mul-mat-vec) path, matching the existing F16 implementation.
Fixes #20478
* SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0
The qk=1 kernel (used for F16 and BF16) iterates with stride
2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When
ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last
warp iteration accesses elements at col >= ncols, producing NaN for the
final row and wrong values for interior rows.
Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] %
(2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16
launcher to match. Quantized types use block-structured kernels with
different access patterns and keep the existing DMMV_X check.
Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70.
Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005).
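The tightened condition amounts to the following check. This is a sketch: `dmmv_cols_ok` is a hypothetical helper, and `DMMV_X = 32` is the value quoted above for Intel targets rather than something read from the build configuration.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t DMMV_X = 32; // assumed value of GGML_SYCL_DMMV_X on Intel targets

// The qk=1 kernel (F16/BF16) strides by 2*DMMV_X, so ncols must be a
// multiple of that stride. Quantized types keep the DMMV_X requirement.
static bool dmmv_cols_ok(int64_t ncols, bool is_f16_or_bf16) {
    assert(ncols >= 0);
    const int64_t stride = is_f16_or_bf16 ? 2 * DMMV_X : DMMV_X;
    return ncols % stride == 0;
}
```

With k=1056 (33 * 32) the F16/BF16 check now fails and the op falls back to another path, while the same width is still accepted for quantized types.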
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>