if GGML_VULKAN_MIN_1_1 is defined:
- rte shaders will be built with 1.1 API and "SPV_KHR_float_controls"
spirv extension.
- all non-_cm2 shaders will be built with 1.1 API
No changes if GGML_VULKAN_MIN_1_1 is not defined (default).
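A minimal sketch of the selection rule described above, assuming the shader generator chooses a target environment per shader. The helper names, the `_rte` name check, and the `default_env` / `cm2_env` parameters are hypothetical and are not taken from the actual build scripts; `min_1_1` stands in for whether GGML_VULKAN_MIN_1_1 is defined.

```cpp
#include <string>

// Hypothetical helper: returns the target environment for one shader.
// Assumption: _cm2 shaders keep whatever target they already use (cm2_env),
// and every other shader drops to "vulkan1.1" only when min_1_1 is set.
inline std::string target_env_for(const std::string & shader_name, bool min_1_1,
                                  const std::string & default_env,
                                  const std::string & cm2_env) {
    const bool is_cm2 = shader_name.find("_cm2") != std::string::npos;
    if (is_cm2) {
        return cm2_env;
    }
    return min_1_1 ? std::string("vulkan1.1") : default_env;
}

// Hypothetical helper: with a 1.1 target, rte shaders have to request
// SPV_KHR_float_controls explicitly as an extension.
// Assumption: rte shaders can be recognised by "_rte" in their name.
inline bool needs_float_controls_ext(const std::string & shader_name, bool min_1_1) {
    return min_1_1 && shader_name.find("_rte") != std::string::npos;
}
```

With `min_1_1` false, both helpers fall back to the existing behaviour, which matches the "no changes by default" statement.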
"The members of VkPhysicalDeviceVulkan12Properties must have the same
values as the corresponding members of ...
VkPhysicalDeviceFloatControlsProperties ..."
Instead of VkPhysicalDeviceVulkan11Properties, which was added in Vulkan 1.2.
"The members of VkPhysicalDeviceVulkan11Properties have the same values
as the corresponding members of ... VkPhysicalDeviceSubgroupProperties ..."
The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break multi-byte UTF-8 characters, resulting in invalid sequences displayed as `�` (U+FFFD replacement character).
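One possible way to avoid splitting a multi-byte character is to keep the byte limit but move the cut point back to the nearest character boundary. The sketch below illustrates that idea only; `utf8_safe_cut` and `is_utf8_continuation` are hypothetical names, and this is not necessarily how `whisper_wrap_segment()` was fixed.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool is_utf8_continuation(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Returns the largest cut position <= max_len (in bytes) such that s[cut]
// is not a continuation byte, so no multi-byte character is split.
inline std::size_t utf8_safe_cut(const std::string & s, std::size_t max_len) {
    if (max_len >= s.size()) {
        return s.size();
    }
    std::size_t cut = max_len;
    while (cut > 0 && is_utf8_continuation(static_cast<unsigned char>(s[cut]))) {
        --cut;
    }
    return cut;
}
```

A caller has to handle a result of 0, which happens when `max_len` is smaller than the first character; otherwise a splitting loop would make no progress.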
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.
This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is about
2^32*LOAD_VEC_A elements.
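The arithmetic behind these limits can be checked with a small sketch. It assumes Q8_0 blocks of 32 elements and treats the 128 dimension as the batch (expert) index; the function names are hypothetical and only model 32-bit index math, not the actual shader code.

```cpp
#include <cstdint>

constexpr uint32_t QUANT_K = 32;  // elements per Q8_0 block (assumed)

constexpr uint64_t total_elements = 8192ull * 5120ull * 128ull;
constexpr uint64_t total_blocks   = total_elements / QUANT_K;
constexpr uint64_t limit_32b      = 1ull << 32;

// The element count no longer fits in 32 bits, but the block count does.
static_assert(total_elements == 5368709120ull, "element count");
static_assert(total_elements >= limit_32b, "elements overflow 32 bits");
static_assert(total_blocks == 167772160ull, "block count");
static_assert(total_blocks < limit_32b, "blocks fit in 32 bits");

// Dividing after the multiply: the 32-bit product wraps modulo 2^32 first.
constexpr uint32_t block_index_late_divide(uint32_t batch, uint32_t stride_elems) {
    return (batch * stride_elems) / QUANT_K;
}

// Dividing the stride first keeps the product in range.
constexpr uint32_t block_index_early_divide(uint32_t batch, uint32_t stride_elems) {
    return batch * (stride_elems / QUANT_K);
}
```

For the last batch (index 127) with a per-batch stride of 8192 * 5120 elements, the early-divide form yields the expected block index, while the late-divide form wraps.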
- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.
64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
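The variant selection described in the list above could look roughly like the following. This is a sketch under stated assumptions: `pipeline_sketch`, its fields, and `select_variant` are hypothetical stand-ins and do not reproduce the real vk_pipeline_struct layout.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a pipeline with a linked list of variants.
struct pipeline_sketch {
    bool uses_64b_indexing = false;
    std::shared_ptr<pipeline_sketch> next;  // next variant, or null
};

// Walks the variant list and returns the variant whose indexing mode matches
// what the A matrix needs. The 64-bit variant is only requested when the
// matrix is larger than maxStorageBufferRange, so smaller matrices keep the
// cheaper 32-bit path. Returns nullptr if no matching variant exists.
inline const pipeline_sketch * select_variant(const pipeline_sketch & base,
                                              uint64_t a_matrix_bytes,
                                              uint64_t max_storage_buffer_range) {
    const bool need_64b = a_matrix_bytes > max_storage_buffer_range;
    for (const pipeline_sketch * p = &base; p != nullptr; p = p->next.get()) {
        if (p->uses_64b_indexing == need_64b) {
            return p;
        }
    }
    return nullptr;
}
```

Keeping the decision per dispatch rather than per device is what avoids paying the 64-bit indexing cost for matrices that do not need it.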
* vulkan: Enable and optimize large matmul parameter combination for AMD
* limit tuning to AMD GPUs with coopmat support
* use tx_m values instead of _l
* FlashAttention (llama/13)
* Add inplace softmax
* Move rms_norm to split row approach
* Update debug for supports_op
* clean up debug statements
* neg f16xf32xip builds and runs, haven't actually run a model that uses neg kernel yet though
* neg passes backend test
* unary operators pass ggml tests
* rms_norm double declaration bug atoned
* abides by editor-config
* removed vestigial files
* fixed autoconfig
* All operators (including xielu) working
* removed unnecessary checking if node->src[1] exists for unary operators
* responded and dealt with PR comments
* implemented REPL_Template support and removed bug in unary operators kernel
* formatted embed wgsl and ggml-webgpu.cpp
* Faster tensors (llama/8)
Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (llama/9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (llama/4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Refactored pipelines and workgroup calculations (llama/10)
* refactored pipelines
* refactored workgroup calculation
* removed commented out block of prior maps
* Clean up ceiling division pattern
---------
Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
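For reference, the ceiling division pattern mentioned in the workgroup refactoring above is typically wrapped in a small helper when computing workgroup counts. The sketch below uses a hypothetical `ceil_div` name and an overflow-safe formulation; the backend's actual helper may differ.

```cpp
#include <cstdint>

// Rounds n / d up to the next integer. Written as quotient plus a remainder
// check so it cannot overflow, unlike the common (n + d - 1) / d form.
// d must be non-zero.
constexpr uint32_t ceil_div(uint32_t n, uint32_t d) {
    return n / d + (n % d != 0 ? 1u : 0u);
}

// Example: number of workgroups needed to cover n_elements items when each
// workgroup handles wg_size of them.
constexpr uint32_t workgroups_for(uint32_t n_elements, uint32_t wg_size) {
    return ceil_div(n_elements, wg_size);
}
```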
* Start work on flash attention
* Shader structure set up (many bugs still)
* debugging
* Working first test
* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32
* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling
* Start work on integrating pre-wgsl
* Separate structs/initial shader compilation library into separate files
* Work on compilation choices for flashattention
* Work on subgroup matrix/tile size portability
* subgroup size agnostic online softmax
* Cleanups, quantization types
* more cleanup
* fix wasm build
* Refactor flashattention to increase parallelism, use direct loads for KV in some cases
* Checkpoint
* formatting
* Update to account for default kv cache padding
* formatting shader
* Add workflow for ggml-ci webgpu
* Try passing absolute path to dawn in ggml-ci
* Avoid error on device destruction, add todos for proper cleanup
* Fix unused warning
* Forgot one parameter unused
* Move some flashattn computation to f32 for correctness
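The online softmax named in the flash attention items above keeps a running maximum and a running sum, rescaling the sum whenever a new chunk raises the maximum. The following is a scalar CPU sketch of that general technique, with hypothetical names; it does not reflect the WGSL shader's subgroup or tiling structure.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>

// Running state: maximum seen so far and sum of exp(x - max).
struct softmax_state {
    float max = -std::numeric_limits<float>::infinity();
    float sum = 0.0f;
};

// Folds one chunk of scores into the running state.
inline void online_softmax_update(softmax_state & st, const float * x, std::size_t n) {
    float new_max = st.max;
    for (std::size_t i = 0; i < n; ++i) {
        new_max = std::max(new_max, x[i]);
    }
    if (new_max == -std::numeric_limits<float>::infinity()) {
        return;  // nothing finite seen yet, keep the empty state
    }
    // Rescale the old sum to the new maximum, then add the new terms.
    st.sum *= std::exp(st.max - new_max);
    for (std::size_t i = 0; i < n; ++i) {
        st.sum += std::exp(x[i] - new_max);
    }
    st.max = new_max;
}

// Final probability for one score once all chunks have been folded in.
inline float online_softmax_prob(const softmax_state & st, float x) {
    return std::exp(x - st.max) / st.sum;
}
```

Processing the scores in chunks gives the same result as a two-pass softmax, up to floating-point rounding, without needing all scores in memory at once.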
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32
* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx
* cann: forward declaration of device context struct
* cann: move offload op check after device context declaration
* cuda: fix whitespace
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
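A minimal sketch of reading GGML_OP_OFFLOAD_MIN_BATCH once and caching it in a device context, as the change above describes. The names `parse_min_batch` and `device_ctx_sketch` are hypothetical, and the validation rules (falling back to 32 on unparsable or negative input) are an assumption of this sketch, not confirmed behaviour.

```cpp
#include <climits>
#include <cstdlib>

// Parses the env var value; returns `fallback` when the value is missing,
// not a plain decimal integer, negative, or out of int range.
inline int parse_min_batch(const char * value, int fallback = 32) {
    if (value == nullptr || *value == '\0') {
        return fallback;
    }
    char * end = nullptr;
    const long parsed = std::strtol(value, &end, 10);
    if (end == value || *end != '\0' || parsed < 0 || parsed > INT_MAX) {
        return fallback;
    }
    return static_cast<int>(parsed);
}

// Hypothetical device context: the env var is read once at construction,
// so the per-op offload check only compares against a stored integer.
struct device_ctx_sketch {
    int op_offload_min_batch = parse_min_batch(std::getenv("GGML_OP_OFFLOAD_MIN_BATCH"));

    bool should_offload(int batch_size) const {
        return batch_size >= op_offload_min_batch;
    }
};
```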
* modify warptile tuning for xe3
* intel vendor check w/ coopmat support
* fix back formatting
* fix formatting change 2
* move intel check to chip specific tuning part
* Change to support both windows and linux
* modify m_warptile to l_warptile for intel
* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)
* Code style changes
* Code style changes (2)
* Code style changes (3)
In #18624, get_env in ggml-cann was renamed to get_env_as_lowercase
to accurately reflect the function’s behavior and reduce the chance
of misuse. However, the update missed renaming call sites in other
files. This commit fixes that oversight.
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention
* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx
* hexagon: add support for SCALE fp32
* hexagon: replace scalar fp32 -> fp16 copy with HVX
* hexagon: optimize flash_atten_ext with aligned VTCM buffers and DMA
- Implements double-buffered DMA prefetching for K, V, and Mask tensors.
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.
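The 128-byte row padding mentioned above amounts to rounding each row size up to the next multiple of the alignment. The sketch below shows that rounding; `pad_to` and `ROW_ALIGN_BYTES` are hypothetical names, and the fp16 element size of 2 bytes is the only other assumption.

```cpp
#include <cstddef>

constexpr std::size_t ROW_ALIGN_BYTES = 128;  // alignment named in the change

// Rounds n up to the next multiple of align (align must be non-zero).
constexpr std::size_t pad_to(std::size_t n, std::size_t align) {
    return (n + align - 1) / align * align;
}

// Padded row size in bytes for a row of fp16 values (2 bytes each).
constexpr std::size_t padded_f16_row_bytes(std::size_t n_elements) {
    return pad_to(n_elements * 2, ROW_ALIGN_BYTES);
}
```

For example, a row of 64 fp16 values already occupies exactly 128 bytes, while a row of 80 values (160 bytes) is padded to 256 bytes.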
* hexagon: use aligned mad_f16
* hexagon: flash_atten more aligned ops
* hexagon: optimize scale_f32 hvx helpers
* hexagon: unroll fa loops
* hexagon: remove unused set-rows log
* hexagon: flash_attn_ext add support for DMAing Q
- Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.
* hexagon: fix handling of NaNs in HVX dot products
* hexagon: cleanup spad allocation in flash-atten
* hexagon: improve fp16/fp32 matmul
- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
- Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.
* hexagon: fix HVX_ARCH check
* hexagon: matmul cleanup and fp16 fixes
Use aligned vec_dot_f16 for 2d matmuls and unaligned version for 4d.
* hexagon: fix fp16 x fp16 matmuls and some minor refactoring
* hexagon: add support for GET_ROWS f32 -> f32
Also optimize SET_ROWS threading a bit when we have just a few rows to process.
* hexagon: optimize set-rows threading
* hexagon: update adb/run-bench.sh to properly support experimental and verbose options
* hexagon: flash_atten use aligned vectors for dot products
* vulkan: support buffer_from_host_ptr
* hacky use of buffer_from_host_ptr for directio
* disable buffer_from_host_ptr cap
* use external memory for ggml_vk_host_malloc, revert model loader changes
* disable external_memory_host for MoltenVK
* take buffer memory types into account
* don't use external_memory_host for ggml_vk_host_malloc
* ggml-webgpu: add CEIL operation support
Add support for the CEIL unary operation in the WebGPU backend:
- Add CEIL_FUNC shader template in unary_op.wgsl
- Add 4 shader variants (f32, f16, inplace versions)
- Initialize CEIL pipelines in ggml-webgpu.cpp
- Register CEIL in supports_op function
* docs: update WebGPU ops support for CEIL