In #18624, get_env in ggml-cann was renamed to get_env_as_lowercase
to accurately reflect the function’s behavior and reduce the chance
of misuse. However, the update missed renaming call sites in other
files. This commit fixes that oversight.
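As a rough illustration of the behavior the new name describes, the helper can be sketched as below. This is a Python stand-in, not the ggml-cann C++ code; returning `None` for an unset variable is an assumption made for the sketch.

```python
import os
from typing import Optional


def get_env_as_lowercase(name: str) -> Optional[str]:
    """Return the value of environment variable `name` in lowercase.

    Returns None when the variable is unset (an assumption of this sketch).
    """
    value = os.environ.get(name)
    if value is None:
        return None
    return value.lower()
```

A caller that compares the result against `"on"` or `"true"` gets case-insensitive matching for free, which is why the explicit name helps avoid misuse.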
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention
* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx
* hexagon: add support for SCALE fp32
* hexagon: replace scalar fp32 -> fp16 copy with HVX
* hexagon: optimize flash_attn_ext with aligned VTCM buffers and DMA
- Implements double-buffered DMA prefetching for K, V, and Mask tensors.
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.
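The 128-byte row padding mentioned above amounts to rounding each row size up to the next multiple of the vector width. A minimal sketch in Python follows; the function and constant names are hypothetical and are not taken from the Hexagon backend.

```python
HVX_VECTOR_BYTES = 128  # alignment stated in the commit message


def pad_row_size(row_bytes: int, align: int = HVX_VECTOR_BYTES) -> int:
    """Round row_bytes up to the next multiple of align.

    align must be a positive power of two so the bit mask trick is valid.
    """
    if align <= 0 or align & (align - 1):
        raise ValueError("align must be a positive power of two")
    return (row_bytes + align - 1) & ~(align - 1)
```

For example, a row of 160 bytes would occupy 256 bytes in the scratchpad, so every row starts on a 128-byte boundary.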
* hexagon: use aligned mad_f16
* hexagon: flash_attn more aligned ops
* hexagon: optimize scale_f32 hvx helpers
* hexagon: unroll fa loops
* hexagon: remove unused set-rows log
* hexagon: flash_attn_ext add support for DMAing Q
- Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.
* hexagon: fix handling of NaNs in HVX dot products
* hexagon: cleanup spad allocation in flash-attn
* hexagon: improve fp16/fp32 matmul
- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
- Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM.
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.
* hexagon: fix HVX_ARCH check
* hexagon: matmul cleanup and fp16 fixes
Use aligned vec_dot_f16 for 2d matmuls and unaligned version for 4d.
* hexagon: fix fp16 x fp16 matmuls and some minor refactoring
* hexagon: add support for GET_ROWS f32 -> f32
Also optimize SET_ROWS threading a bit when we have just a few rows to process.
* hexagon: optimize set-rows threading
* hexagon: update adb/run-bench.sh to properly support experimental and verbose options
* hexagon: flash_attn use aligned vectors for dot products
* vulkan: support buffer_from_host_ptr
* hacky use of buffer_from_host_ptr for directio
* disable buffer_from_host_ptr cap
* use external memory for ggml_vk_host_malloc, revert model loader changes
* disable external_memory_host for MoltenVK
* take buffer memory types into account
* don't use external_memory_host for ggml_vk_host_malloc
* ggml-webgpu: add CEIL operation support
Add support for the CEIL unary operation in the WebGPU backend:
- Add CEIL_FUNC shader template in unary_op.wgsl
- Add 4 shader variants (f32, f16, inplace versions)
- Initialize CEIL pipelines in ggml-webgpu.cpp
- Register CEIL in supports_op function
* docs: update WebGPU ops support for CEIL
This commit implements operator fusion for ADD + RMS_NORM operations
in the CANN backend to reduce memory access overhead and improve
performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION
environment variable (default: false).
Changes:
- Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm
- Add ggml_cann_can_fuse() to check fusion eligibility
- Integrate fusion logic into computation graph evaluation
- Add test cases for ADD + RMS_NORM fusion
- Update documentation with new environment variable
The fusion combines ADD and RMS_NORM into a single kernel call,
which is more efficient than executing them separately.
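To make the equivalence concrete, here is a small Python reference of what the fused operation computes. It is only a sketch: it does not call ACLNN, the function names are hypothetical, and the `eps` default is an assumption.

```python
import math


def add(a, b):
    """Element-wise ADD of two equally sized rows."""
    return [x + y for x, y in zip(a, b)]


def rms_norm(x, eps=1e-6):
    """RMS_NORM: divide each element by sqrt(mean(x^2) + eps)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x]


def add_rms_norm_fused(a, b, eps=1e-6):
    """Fused variant: form the sum and accumulate its squares in one pass,
    so the intermediate ADD result is not read back a second time."""
    summed = []
    sq_acc = 0.0
    for x, y in zip(a, b):
        s = x + y
        summed.append(s)
        sq_acc += s * s
    inv_rms = 1.0 / math.sqrt(sq_acc / len(summed) + eps)
    return [s * inv_rms for s in summed]
```

Both paths should agree to within floating-point tolerance, which is the kind of property the added fusion test cases would be expected to check.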
* sampling : add support for backend sampling
This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.
The motivation is to let some or all of the sampling run directly on
the backend, as part of the computation graph being executed, instead
of running entirely on the CPU afterwards.
For example, the backend sampler chain might select/sample a token
directly in which case only the sampled token needs to be transferred
from device memory to host memory.
It is also possible for the backend samplers to perform filtering of
the logits, or compute and filter the probability distribution, in
which case only the filtered logits or probabilities need to be
transferred back to system memory for further processing by CPU
samplers.
Currently, backend sampling works in a similar manner to pooling: it is
a function called by build_graph, and the sampler operations become
part of the model's computation graph.
* llama-cli : add backend sampler configuration
* server : add backend sampling options/configuration
* webui : add backend sampling options
* ggml : add initial cumsum implementation for CUDA
* sampling : enable all backend sampler tests
This commit enables all existing backend sampler tests in the
test-backend-sampler. Previously, some tests were disabled because
there were missing ggml operation implementations.
* graph : do not include llama-model.h
* sampling : always expose sampled_ids
This commit precomputes and caches the full-vocab token id list in
llama_context's constructor, so llama_get_backend_sampled_token_ids_ith
always returns a valid pointer.
This lets both common/sampling.cpp and src/llama-sampling.cpp simplify
their logic.
Not all backend samplers that process logits need to set the
sampled_tokens_id, as they may not change the order of the logits. For
example, the temperature sampler only scales the logits but does not
change their order. Similarly, the logit bias sampler only adds a bias
to specific token ids but does not change the order of the logits. In
these cases there is no device-to-host copy of the sampled token ids,
and this is the use case where the precomputed list is useful.
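The idea can be sketched in Python as follows. All names here (`Context`, `temperature_sampler`, `top_k_sampler`) are hypothetical stand-ins for illustration, not the llama.cpp API.

```python
def temperature_sampler(logits, temp):
    """Scale logits only; the order is unchanged, so no id list is produced."""
    return [l / temp for l in logits], None


def top_k_sampler(logits, k):
    """Keep the k largest logits and report which token ids they belong to."""
    ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [logits[i] for i in ids], ids


class Context:
    def __init__(self, n_vocab):
        # Built once at construction, like the cached full-vocab list above.
        self.full_vocab_ids = list(range(n_vocab))

    def sampled_token_ids(self, sampler_ids):
        # Fall back to the identity mapping when the sampler did not
        # select or reorder tokens, so callers always get a valid list.
        return sampler_ids if sampler_ids is not None else self.full_vocab_ids
```

With this fallback, downstream code can index the id list unconditionally instead of branching on whether a given sampler produced one.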
* sampling : ensure at most one output token per seq
This commit adds a check in the batch allocator to ensure that when
backend sampling is enabled, at most one output token is specified per
sequence.
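A check of this kind can be sketched as below. This is a simplified Python illustration with a hypothetical function name; it assumes each token belongs to exactly one sequence, which the real batch structure does not require.

```python
from collections import Counter


def has_single_output_per_seq(seq_ids, is_output):
    """Return True if every sequence has at most one token flagged as output.

    seq_ids[i] is the sequence of token i; is_output[i] says whether
    token i requests an output.
    """
    counts = Counter(s for s, out in zip(seq_ids, is_output) if out)
    return all(c <= 1 for c in counts.values())
```

A batch that flags two tokens of the same sequence as outputs would be rejected when backend sampling is enabled.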
* CUDA: Optimize argsort for gpu-based token sampling
Argsort is currently used for top-k. We optimize argsort in two ways:
1. Use `DeviceRadixSort` for single-row/sequence to parallelize it
across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence, as this is the
correct entrypoint (the function chooses between different execution
paths, including `DeviceSegmentedRadixSort`, according to heuristics).
https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview
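For reference, the result both paths have to produce is a per-row argsort, where each row is an independent segment. The following Python sketch shows only these semantics on the CPU; it makes no CUB calls and the function names are hypothetical.

```python
def argsort_row(row, descending=False):
    """Indices that sort a single row (the single-segment case)."""
    return sorted(range(len(row)), key=lambda i: row[i], reverse=descending)


def argsort_rows(data, ncols, descending=False):
    """Argsort each contiguous segment of ncols values independently.

    Indices are relative to the start of their own row.
    """
    if ncols <= 0 or len(data) % ncols != 0:
        raise ValueError("data length must be a positive multiple of ncols")
    out = []
    for start in range(0, len(data), ncols):
        out.extend(argsort_row(data[start:start + ncols], descending))
    return out
```

With a single row there is only one segment, so a segmented sort has no parallelism across segments to exploit, which is consistent with using a different entrypoint for that case.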
Some perf numbers for an RTX PRO 6000:
On the kernel level, tested with
`GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run
```
After:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run
```
* CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta)
* CUDA: Explicitly casted some of the int alloc counts before multiplication in argsort
---------
Co-authored-by: pl752 <maximpl752@gmail.com>
* vulkan: Optimize GGML_OP_CUMSUM
There are two paths: the preexisting one, which does a whole row per workgroup
in a single shader, and one that splits each row into multiple blocks and does
two passes. The first pass computes partials within a block; the second adds
the block partials to compute the final result. The multipass shader is used
when there is a small number of large rows.
In the whole-row shader, handle multiple elements per invocation.
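The two-pass scheme can be sketched sequentially in Python as follows. This only illustrates the arithmetic, not the shader; `block_size` and the function name are hypothetical.

```python
from itertools import accumulate


def cumsum_two_pass(row, block_size):
    """Inclusive cumulative sum computed block by block in two passes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = [row[i:i + block_size] for i in range(0, len(row), block_size)]

    # Pass 1: independent inclusive scan inside each block
    # (on a GPU, each block could be handled by its own workgroup).
    partials = [list(accumulate(b)) for b in blocks]

    # Pass 2: add the total of all preceding blocks to each block.
    out = []
    offset = 0
    for p in partials:
        out.extend(v + offset for v in p)
        offset += p[-1]
    return out
```

Because pass 1 has no dependency between blocks, a single long row can occupy many workgroups, which is the situation where the whole-row path would leave most of the GPU idle.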
* use 2 ELEM_PER_THREAD for AMD/Intel
* address feedback
* ggml-cuda: fixed assertion in ggml_cuda_cpy (llama/18140)
* ggml-cuda: changes in data types to int64_t
* ggml-cuda: added asserts for CUDA block numbers
* ggml-cuda: changed the condition for y and z dimension
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron
Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).
Fewer pipeline variants and spec constants, just use push constants.
In test_topk_moe, change exp_probs_b to be 1D, matching real networks.
Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.
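As a rough reference for the fused pattern, the sketch below follows the sigmoid-plus-bias routing as it is commonly described for DeepSeek-style MoE layers. It is an assumption-laden Python illustration, not the Vulkan shader: the function name is hypothetical, and it assumes the bias affects only expert selection while the weights come from the unbiased probabilities, followed by normalization and a final scale.

```python
import math


def topk_moe_sigmoid(logits, exp_probs_b, k, scale=1.0):
    """Return (expert ids, expert weights), the two outputs of the fusion."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # The bias is used for selection only (assumption of this sketch).
    biased = [p + b for p, b in zip(probs, exp_probs_b)]
    ids = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:k]
    # Weights come from the unbiased probabilities of the selected experts.
    weights = [probs[i] for i in ids]
    total = sum(weights)
    # Normalize, then apply the trailing scale (the GGML_OP_SCALE step).
    weights = [scale * w / total for w in weights]
    return ids, weights
```

Since there are two outputs and the order of the selected experts is not significant, a test is better off comparing the id sets and the id-to-weight mapping than the raw output order.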
* change test_topk_moe to allow results in arbitrary order
* disable sigmoid fusion for moltenvk
* cmake:
- added `whisper-` prefix to unprefixed targets: `quantize`, `lsp`,
`vad-speech-segments`
- added `install(TARGETS ${TARGET} RUNTIME)` where it was missing
Signed-off-by: Peter A. <ink.splatters@pm.me>
* .github/workflows/build.yml: quantize -> whisper-quantize
Signed-off-by: Peter A. <ink.splatters@pm.me>
---------
Signed-off-by: Peter A. <ink.splatters@pm.me>
* add count equal for metal
* remove trailing whitespace
* updated doc ops table
* changed shmem to i32
* added multi tg and templating
* removed BLAS support from Metal docs
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add memset to set dst to 0
* metal : cleanup
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x
* [AI] sycl: auto-detect and skip incompatible IntelSYCL package
Automatically detect compiler versions with incompatible IntelSYCL
CMake configuration files and fall back to manual SYCL flags instead
of requiring users to set options manually.
Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
has SYCL_FEATURE_TEST_EXTRACT invocation errors.
* refactor: improve SYCL provider handling and error messages in CMake configuration
* refactor: enhance SYCL provider validation and error handling in CMake configuration
* ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes