* opencl: add `copy_to_contiguous` and utilize mm kernels
* opencl: only copy to cont for f32 and f16 tensors
* opencl: use cont mm for fallback when dst is large
* opencl: use nb local to copy-to-cont
* opencl: use local offset as well
* vulkan: Remove transfer_ctx, do everything in compute_ctx.
We had a bug where a set_tensor_async (using transfer_ctx) didn't get
submitted before the graph_compute (using compute_ctx) that came after
it. To avoid this sort of issue, just do everything in compute_ctx.
Remove transfer_cmd_pool, which was already unused.
* fix crash with perf logger
Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and
update the indexing calculations in get_offsets.
Mat-vec is faster than mat-mat for small values of n. We don't get the same
reuse of the weights as in the non-ID path, but with this the cost is linear
in n rather than n>1 being far slower than n==1.
* CUDA: Replace `init_offsets` with iterators in argsort
This is a QOL improvement, saving us the cost of materializing the
iterator
* Remove unnecessary include from top-k.cu
* CANN: fix an issue where get_env was not fully renamed
* ci: add cann with acl group
* ci: define use_acl_graph using GitHub Action
* ci: update cann dockerfile with acl graph
* CANN: support gated linear attn
This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator.
The feature was implemented by YushengZhao. Because the previous
submission was based on an outdated codebase, this PR was rebased to
merge.
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
* CANN: optimize OP gla
Optimize gla for high preformance
* Remove unused comments
---------
Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain>
Co-authored-by: YushengZhao <yusheng.chao@outlook.com>
* CUDA: Refactor and expose two_stage_warp_reduce_* function
* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it
Moving smem out of `__device__` function to `__global__` function
allows for explicit smem reuse, as either compiler or cuda rt seem to not
free it afterwards (`cudaFuncSetAttribute` fails when not accounting for
it once for each call to two_stage_warp_reduce)
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Use two_stage_warp_reduce in group_norm_f32
* Use two_stage_warp_reduce in rms_norm_f32
* Fix smem calculation which expects bytes
* Make `two_stage_warp_reduce` accept all values warp_reduce accepts
Also integrate it into norm_f32 function
* Use two_stage_warp_reduce in l2_norm_f32
* Use type traits for block reduction for better legibility
Also adresss other requests by @am17an such as variable renaming
* Make norm tests cover all cuda paths
* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK
Unit-tests passed locally, let's see if they pass in the CI as well
* Use `enum class` for `block_reduce_method`
This is more type-safe than plain enum
* Rename variables as suggested in code review by @am17an
* Rename two_stage_warp_reduce -> block_reduce
* Fix trailing whitespace in common.cuh
* Make condition of static_assert type-dependent
This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:
https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
* Inline definitions
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break multi-byte UTF-8 characters, resulting in invalid sequences displayed as `�` (U+FFFD replacement character).
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.
This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.
- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.
64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
* vulkan: Enable and optimize large matmul parameter combination for AMD
* limit tuning to AMD GPUs with coopmat support
* use tx_m values instead of _l