Commit Graph

99 Commits

Author SHA1 Message Date
Gaurav Garg 1a1900f90c Remove padding and multiple D2D copies for MTP (llama/24086)
* Make ggml_gated_delta_net take only the initial recurrent state (D, 1, n_seqs) and pass the snapshot count K as an op parameter instead of inferring it from state->ne[1].

Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy

* Make GDN changes in all backends. Address review comments.

* Fix CI build errors
2026-06-15 10:33:53 +03:00
Reese Levine e69e5138fe ggml-webgpu: Add clang-format job (llama/24308)
* Add clang-format job

* try local formatting
2026-06-15 10:33:53 +03:00
Masashi Yoshimura aa42b48312 ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants (llama/24225)
* ggml-webgpu: Improve prefill speeds + refactor matmul for quants

* Fixes for editorconfig checker
2026-06-15 10:33:53 +03:00
Nikhil Jain 15e5d401d1 Handle buffer overlap / buffer aliasing for concat operator (llama/24000)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* handle buffer overlap case for concat operator

* restore build-webgpu.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Run clang-format

* Update ggml/src/ggml-webgpu/wgsl-shaders/concat.wgsl

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-06-15 10:33:53 +03:00
Nikhil Jain 490e50056c Implement 2D workgroups for scale, binary, and unary ops (llama/24044)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* Implement 2d workgroups for more operations

* fix

* Fix type

* Move back to global_invocation_id
2026-06-15 10:33:53 +03:00
Reese Levine e9dbd0c18a ggml-webgpu: FlashAttention refactor + standardize quantization support (llama/23834)
* Start work on flash_attn refactor

* Refactor

* Split k/v quantization

* Refactor and abstract quantization logic for flash_attn and mul_mat

* Add quantization support to tile path

* formatting

* Move to functions, add a check
2026-06-08 14:36:36 +03:00
Masashi Yoshimura db2a39507c revert to using global_invocation_id for cpy shader (llama/23955) 2026-06-08 14:36:36 +03:00
Reese Levine 9147a9676b ggml-webgpu: Check earlier for WebGPU required features (llama/23879) 2026-06-08 14:36:36 +03:00
Reese Levine acd91d2c38 ggml-webgpu: add q4_0/q8_0 SET_ROWS (llama/23760)
* Add q8_0 and q4_0 set_rows

* Add fast(er) quantization set_rows path

* formatting/naming

* a little more naming

* Remove unused constant

* Don't override other override

* Avoid bitcast

* Narrow relaxation
2026-06-08 14:36:36 +03:00
Reese Levine 8c8f213dac ggml-webgpu: remove legacy constants (llama/23672) 2026-05-29 09:47:30 +03:00
Masashi Yoshimura a52bd385d6 ggml-webgpu: Fix how to dispatch WG to some ops (llama/23750) 2026-05-29 09:47:30 +03:00
Masashi Yoshimura 00a5110b19 ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (llama/23594)
* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K

* Fix to pass the editorconfig check

* Remove mul-mat-legacy pipeline

* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
2026-05-29 09:47:30 +03:00
Nikhil Jain bc77933c2d Check batch_compute_passes before sending passes when not doing GPU profiling (llama/23457)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled

* restore build.yml
2026-05-29 09:47:30 +03:00
Chen Yuan c436f1419f fix(flash-attn): replace f32 with kv_type and q_type (llama/23372) 2026-05-25 12:26:07 +03:00
Reese Levine 6090f39f36 ggml-webgpu : extend GDN for K>1 (llama/23299) 2026-05-25 12:26:07 +03:00
Zheyuan Chen 13133ab299 ggml-webgpu: makes the flash attn vec path subgroup-aware (llama/23040)
* ggml-webgpu: make the flash attn vec path compile and size its split/reduce work from the device’s reported subgroup range instead of assuming a subgroup size of 32.

* ggml-webgpu: remove the extra max_wg_size >= max_subgroup_size guard. Remove the hardcoded 32 when determining the values of reduce_wg_size and vec_nwg_cap
2026-05-25 12:26:07 +03:00
Zheyuan Chen e4ce42e55f ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (llama/23020) 2026-05-14 21:26:48 +03:00
Masashi Yoshimura 1cbbd0b6d0 flush the gpu profile timestamp before the queryset is overflowed (llama/22995) 2026-05-14 21:26:48 +03:00
Masashi Yoshimura e8a7cd314f ggml-webgpu: Enables running gpt-oss-20b (llama/22906)
* Enable running gpt-oss-20b and refactor mulmat-q

* disable test-backend-ops in ubuntu-24-webgpu
2026-05-14 21:26:48 +03:00
Chen Yuan a9bcbf5595 ggml-webgpu: address precision issues for multimodal (llama/22808)
* fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32

* fix(unary): correct the gelu, gelu quick and gelu erf functions

* fix(flash-attn-tile): fix the hardcoded v type

* fix(flash_attn): fix tile path

* fix: pass editorconfig and address the type conflicts

* fix: remove redundant pipeline keys

* fix: remove inline min/max group size functions and revert the flash attn path order

* fix: use clamp to avoid NaN for GELU

* fix: use the right range for exp, 80 is safer for f32 exp
2026-05-14 21:26:48 +03:00
Chen Yuan d1d0dc2348 ggml-webgpu: add layer norm ops (llama/22406)
* shader(norm): add layer norm ops

* shader(norm): stabilize floating point computation with Kahan summation and handle mixed types

* shader(norm): remove the non-contiguous strides

* shader(norm): use the original implementation rather than the kahan summation
2026-05-14 21:26:48 +03:00
Georgi Gerganov bbdaa21aa7 ggml : remove obsolete rms_norm.wgsl (ggml/0) 2026-05-02 15:02:42 +03:00
Georgi Gerganov a5a8496d31 ggml : remove obsolete wgsl templates (ggml/0) 2026-05-02 15:02:42 +03:00
Masashi Yoshimura 9623c1203b ggml-webgpu: Fix vectorized handling in mul-mat and mul-mat-id (llama/22578)
* Fix vectorized condition of mul-mat-fast pipeline and add vectorized variant to mul-mat-id

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-02 15:02:42 +03:00
Chen Yuan ccd04522f9
ggml-webgpu: add the upscale shader (llama/22419)
* shader(upscale): add the upscale shader with nearest, bilinear and bicubic implementations

* shader(upscale): use macro
2026-05-01 13:07:36 +03:00
Masashi Yoshimura b34a9f3d83
ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (llama/22464)
* Add mat-vec fast path of MUL_MAT_ID.

* Add shared accumulation vec logic and the other types supports.

* Add i-quant mat-mat for MUL_MAT_ID and fix some parts

* Remove n_experts from shader_lib_context.
2026-05-01 13:07:35 +03:00
Ruben Ortlam 0c7c3ba570
vulkan: add get/set tensor 2d functions (llama/22514)
* vulkan: add get/set_tensor_2d functions

* fix backend interface comments

* Update ggml/src/ggml-metal/ggml-metal.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-01 13:07:35 +03:00
Rithik Sharma d74c56862b
add fast matmul iquants (llama/22504) 2026-05-01 13:07:35 +03:00
Reese Levine fa20229eeb
ggml-webgpu: Fix bug in FlashAttention support check (llama/22492)
* Fix flashattention support check for devices that don't support subgroups

* set path to none if kv_tile doesn't fit
2026-04-30 11:29:23 +03:00
Reese Levine 4ea5b6febc
ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (llama/22456)
* Refactor buffer aliasing to be part of shader lib decisions

* cleanup

* formatting
2026-04-30 11:29:22 +03:00
Rithik Sharma 9c233f11f0
ggml-webgpu: add Q1_0 support (llama/22374)
* add fast matmul matvec q1_0 kernel

* ggml-webgpu: drop redundant zero-fills in Q1_0 shmem init
2026-04-30 11:29:21 +03:00
Rithik Sharma f675a8c926
add fast mat-vec kernels for i-quants (llama/22344) 2026-04-30 11:29:21 +03:00
Rithik Sharma 1478450e61
add performance-portable tuning for register-tile and subgroup matmul (llama/22241) 2026-04-30 11:29:20 +03:00
Reese Levine c235b05d8a
ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (llama/22327)
* Implement ssm_scan

* Remove blocking in graph_compute and check for set rows

* Fix bindings

* Update op support
2026-04-30 11:29:19 +03:00
Zheyuan Chen 35d679a4f8
ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (llama/22199)
* ggml-webgpu: add tile flash attention fallback

* ggml-webgpu: add new fields and discard usage of mnk for tile version

* ggml-webgpu: modify the vec path to discard the mnk parameter

* ggml-webgpu: enable flash attention vec and tile version for browser

* ggml-webgpu: staging KV for flash attention tile version

* formatting

* turn on subgroup uniformity check

* remove Q_TILE as it is always 1 for vec path

* make row_max and exp_sum to local register

* make different bindings with same underlying buffer to have the same usage flags

* move path selection into the shader library and have the host consume a single flash-attn decision object.

* turn off skip_validation and address buffer overlapping when nwg==1

* formatting

* merge binding when kv overlap
2026-04-30 11:29:18 +03:00
Chen Yuan 641998f558
fix(shader): handle the buffer aliasing for rms fuse (llama/22266) 2026-04-30 11:29:17 +03:00
Chen Yuan df528c4f71
ggml-webgpu: add support for im2col (llama/22259)
* shader(im2col): implement the im2col shader

* shader(im2col): clean the formatting issues

* shader(im2col): clean the editorconfig checker warning

* fix(shader): address the workgroup issues of im2col and conv2d
2026-04-30 11:29:17 +03:00
Nikhil Jain d2a26dc8e2
Implement async tensor api and event api (llama/22099)
* Only run webgpu CI on my fork

* Implement set_tensor_async

* Implement synchronize api

* Implement event creation and deletion API

* Cleanup

* Cleanup

* Comment out jobs for local CI run

* Add webgpu only workflow

* Delete .github/workflows/build-webgpu.yml

* Cleanup

* Cleanup

* Update API with function handlers

* Run clang-format

* Replace one-shot buffer with a direct queue.WriteBuffer using the buffer context
2026-04-30 11:29:16 +03:00
Masashi Yoshimura 0fbe4c4ca7
ggml-webgpu: Add fused RMS_NORM + MUL (llama/21983)
* fused rms_norm_mul + mul

* Add GGML_WEBGPU_DISABLE_FUSION for being able to disable kernel fusion.

* Decouple num_fused_ops from webgpu_context; misc cleanup

* Fix eps handling and remove disable_fusion.

* Fix not to use c++20 initializers.
2026-04-30 11:29:16 +03:00
Chen Yuan 447be522e9
ggml-webgpu(shader): support conv2d kernels. (llama/21964)
* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log

* Merge with upstream

* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants

* Update Unary wgsl EXP and EXPM1 for f16 stability

* Fix GET_ROWS IQ4_XS struct for NaN f16 canonicalization

* Fix numerical precision for unary sqrt when working with f16

* Fix NaN canonicalization for packed integers using f16

* Update err threshold for binary div ops when using f16

* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend

* clean: uncomment existing code logs

* clean: clean the unnecessary debug info

* Refactor and generalize dequant helpers

* Remove deprecated quant structs

* Refactor shader defines to reduce repetition

* Remove error override for F16 type

* fix: fix the accidental removal of the proper initialization of ctx

* clean: clean legacy and format code

* fix: did not modify tests ops

* shader(conv2d): add conv2d shader kernels and pass f32 and f16 tests

* shader(conv2d): fix the out of bounds memory access in the weight indexing

* shader(conv2d): clean unused variables and optimize the computation

* merge: use the new entries function

* clean: address the formatting issues

* clean: address the warning issues

* clear: clean the shader editorconfig-checker issues

* clear: clean the shader editorconfig-checker with utf-8

---------

Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
2026-04-30 11:29:16 +03:00
Masashi Yoshimura 2e5eb6e951
ggml-webgpu: reset CPU/GPU profiling time when freeing context (llama/22050)
* Reset the CPU/GPU profiling time when freeing context.

* move GPU profiling time from global context to webgpu_context.
2026-04-30 11:29:15 +03:00
neha-ha 5f21fdcbb9
ggml-webgpu: updated matrix-vector multiplication (llama/21738)
* merged properly, but slow q3_k and q5_k with u32 indexing

* Start on new mat-vec

* New format float paths working

* Working q4_0

* Work on remaining legacy q-types

* port k-quants to new matvec

* remove old shader

* Remove old constants, format

* remove accidental file

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-04-30 11:29:13 +03:00
Reese Levine cbbe935765
ggml-webgpu: fix compiler warnings and refactor FlashAttention encoding (llama/21052)
* Update workflows to remove dependence on llvmpipe

* Try setting Dawn_DIR

* remove c++20 initializers

* Move to proper guid

* Try avoiding segfaults on vulkan backend process exit

* Remove compiler warnings on parameter casting

* Fix soft_max and update reg_tile accumulation to f32 for better precision

* Refactor flash_attn a bit

* remove c++20 initializers and format

* Increase div precision for NVIDIA

* revert div precision and comment out ggml-ci node for now

* Formatting

* Try debugging on a failing CI node

* Revert "Try debugging on a failing CI node"

This reverts commit 1971e33cba919915e12bcfd5828abfbd54ca942e.
2026-04-30 11:29:12 +03:00
Reese Levine 092330b474
ggml-webgpu: compute pass batching and removing profiling overhead (llama/21873)
* Update register tiling matmul to use f32 accumulation

* fix profiling code

* Fix register tiling matmul for chrome, i'm blaming dawn

* Update batch tuning value for iOS

* compile fix

* Fix use of new load function

* Move to a single query set for GPU profiling

* Move to batching compute passes when not profiling

* Refactor build_multi

* remove iOS throttling now that we're batching compute passes
2026-04-30 11:29:10 +03:00
Reese Levine 2a785c5969
ggml-webgpu: Fix dequantization helpers to not pass in pointers (llama/21872)
* Fix dequantization helpers to not pass in pointers

* Increase XIELU precision
2026-04-30 11:29:10 +03:00
Georgi Gerganov 7024f7e5c1
ci : re-enable mac workflows (llama/21894)
* ci : re-enable mac workflows

* vulkan : fix compile warning
2026-04-30 11:29:08 +03:00
Reese Levine b732f4d9b5
ggml-webgpu: Update register tiling matmul to use f32 accumulation (llama/21644)
* Update register tiling matmul to use f32 accumulation

* fix profiling code

* Fix register tiling matmul for chrome, i'm blaming dawn

* Update batch tuning value for iOS

* compile fix

* Fix use of new load function
2026-04-30 11:29:07 +03:00
Masashi Yoshimura 36b7bb3d95
Remove extra conditional check on debug mode. (llama/21798) 2026-04-30 11:29:07 +03:00
Rithik Sharma 2580cfc703
ggml-webgpu: support non-square subgroup matrix configs for Intel GPUs (llama/21669) 2026-04-30 11:29:05 +03:00
Chen Yuan 3fc738a8c2
ggml-webgpu: address quantization precision and backend lifecycle management (llama/21521)
* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log

* Merge with upstream

* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants

* Update Unary wgsl EXP and EXPM1 for f16 stability

* Fix GET_ROWS IQ4_XS struct for NaN f16 canonicalization

* Fix numerical precision for unary sqrt when working with f16

* Fix NaN canonicalization for packed integers using f16

* Update err threshold for binary div ops when using f16

* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend

* clean: uncomment existing code logs

* clean: clean the unnecessary debug info

* Refactor and generalize dequant helpers

* Remove deprecated quant structs

* Refactor shader defines to reduce repetition

* Remove error override for F16 type

* fix: fix the accidental removal of the proper initialization of ctx

* clean: clean legacy and format code

* fix: did not modify tests ops

---------

Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
2026-04-30 11:29:05 +03:00