whisper.cpp

Commit Graph

Author	SHA1	Message	Date
Gaurav Garg	1a1900f90c	Remove padding and multiple D2D copies for MTP (llama/24086) * Make ggml_gated_delta_net take only the initial recurrent state (D, 1, n_seqs) and passes the snapshot count K as an op parameter instead of inferring it from state->ne[1]. Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy * Make GDN changes in all backends. Address review comments. * Fix CI build errors	2026-06-15 10:33:53 +03:00
Reese Levine	e69e5138fe	ggml-webgpu: Add clang-format job (llama/24308) * Add clang-format job * try local formatting	2026-06-15 10:33:53 +03:00
Masashi Yoshimura	aa42b48312	ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants (llama/24225) * ggml-webgpu: Improve prefill speeds + refactor matmul for quants * Fixes for editorconfig checker	2026-06-15 10:33:53 +03:00
Nikhil Jain	15e5d401d1	Handle buffer overlap / buffer aliasing for concat operator (llama/24000) * Only run webgpu CI on my fork * Add webgpu only workflow * handle buffer overlap case for concat operator * restore build-webgpu.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Run clang-format * Update ggml/src/ggml-webgpu/wgsl-shaders/concat.wgsl --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-06-15 10:33:53 +03:00
Nikhil Jain	490e50056c	Implement 2D workgroups for scale, binary, and unary ops (llama/24044) * Only run webgpu CI on my fork * Add webgpu only workflow * Implement 2d workgroups for more operations * fix * Fix type * Move back to global_invocation_id	2026-06-15 10:33:53 +03:00
Reese Levine	e9dbd0c18a	ggml-webgpu: FlashAttention refactor + standardize quantization support (llama/23834) * Start work on flash_attn refactor * Refactor * Split k/v quantization * Refactor and abstract quantization logic for flash_attn and mul_mat * Add quantization support to tile path * formatting * Move to functions, add a check	2026-06-08 14:36:36 +03:00
Masashi Yoshimura	db2a39507c	revert to using global_invocation_id for cpy shader (llama/23955)	2026-06-08 14:36:36 +03:00
Reese Levine	9147a9676b	ggml-webgpu: Check earlier for WebGPU required features (llama/23879)	2026-06-08 14:36:36 +03:00
Reese Levine	acd91d2c38	ggml-webgpu: add q4_0/q8_0 SET_ROWS (llama/23760) * Add q8_0 and q4_0 set_rows * Add fast(er) quantization set_rows path * formatting/naming * a little more naming * Remove unused constant * Don't override other override * Avoid bitcast * Narrow relaxation	2026-06-08 14:36:36 +03:00
Reese Levine	8c8f213dac	ggml-webgpu: remove legacy constants (llama/23672)	2026-05-29 09:47:30 +03:00
Masashi Yoshimura	a52bd385d6	ggml-webgpu: Fix how to dispatch WG to some ops (llama/23750)	2026-05-29 09:47:30 +03:00
Masashi Yoshimura	00a5110b19	ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (llama/23594) * ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K * Fix to editorconfig checking pass * Remove mul-mat-legacy pipeline * Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx	2026-05-29 09:47:30 +03:00
Nikhil Jain	bc77933c2d	Check batch_compute_passes before sending passes when not doing GPU profiling (llama/23457) * Only run webgpu CI on my fork * Add webgpu only workflow * refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled * restore build.yml	2026-05-29 09:47:30 +03:00
Chen Yuan	c436f1419f	fix(flash-attn): replace f32 with kv_type and q_type (llama/23372)	2026-05-25 12:26:07 +03:00
Reese Levine	6090f39f36	ggml-webgpu : extend GDN for K>1 (llama/23299)	2026-05-25 12:26:07 +03:00
Zheyuan Chen	13133ab299	ggml-webgpu: makes the flash attn vec path subgroup-aware (llama/23040) * ggml-webgpu: makes the flash attn vec path compile and size its split/reduce work from the device’s reported subgroup range instead of assuming 32 subgroup size. * ggml-webgpu: remove the extra max_wg_size >= max_subgroup_size guard. Remove hardcoded 32 when determine the value of reduce_wg_size and vec_nwg_cap	2026-05-25 12:26:07 +03:00
Zheyuan Chen	e4ce42e55f	ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (llama/23020)	2026-05-14 21:26:48 +03:00
Masashi Yoshimura	1cbbd0b6d0	flush the gpu profile timestamp before the queryset is overflowed (llama/22995)	2026-05-14 21:26:48 +03:00
Masashi Yoshimura	e8a7cd314f	ggml-webgpu: Enables running gpt-oss-20b (llama/22906) * Enable to run gpt-oss-20b and refactor mulmat-q * disable test-backend-ops in ubuntu-24-webgpu	2026-05-14 21:26:48 +03:00
Chen Yuan	a9bcbf5595	ggml-webgpu: address precision issues for multimodal (llama/22808) * fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32 * fix(unary): correct the gelu, gelu quick and gelu erf functions * fix(flash-attn-tile): fix the hardcode v type * fix(flash_attn): fix tile path * fix: pass editorconfig and address the type conflicts * fix: remove redundant pipeline keys * fix: remove inline min/max group size functions and revert the flash attn path order * fix: use clamp to avoid NaN for GELU * fix: use the right range for exp, 80 is safer for f32 exp	2026-05-14 21:26:48 +03:00
Chen Yuan	d1d0dc2348	ggml-webgpu: add layer norm ops (llama/22406) * shader(norm): add layer norm ops * shader(norm): stabilize floating point computation with Kahan summation and handle mixed types * shader(norm): remove the non-contiguous strides * shader(norm): use the original implementation rather than the kahan summation	2026-05-14 21:26:48 +03:00
Georgi Gerganov	bbdaa21aa7	ggml : remove obsolete rms_norm.wgsl (ggml/0)	2026-05-02 15:02:42 +03:00
Georgi Gerganov	a5a8496d31	ggml : remove obsolete wgsl templates (ggml/0)	2026-05-02 15:02:42 +03:00
Masashi Yoshimura	9623c1203b	ggml-webgpu: Fix vectorized handling in mul-mat and mul-mat-id (llama/22578) * Fix vectorized condition of mul-mat-fast pipeline and add vectorized variant to mul-mat-id * Apply suggestion from @CISC Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-02 15:02:42 +03:00
Chen Yuan	ccd04522f9	ggml-webgpu: add the upscale shader (llama/22419) * shader(upscale): add the upscale shader with nearest, bilinear and bicubic implementations * shader(upscale): use macro	2026-05-01 13:07:36 +03:00
Masashi Yoshimura	b34a9f3d83	ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (llama/22464) * Add mat-vec fast path of MUL_MAT_ID. * Add shared accumulation vec logic and the other types supports. * Add i-quant mat-mat for MUL_MAT_ID and fix some parts * Remove n_experts from shader_lib_context.	2026-05-01 13:07:35 +03:00
Ruben Ortlam	0c7c3ba570	vulkan: add get/set tensor 2d functions (llama/22514) * vulkan: add get/set_tensor_2d functions * fix backend interface comments * Update ggml/src/ggml-metal/ggml-metal.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-05-01 13:07:35 +03:00
Rithik Sharma	d74c56862b	add fast matmul iquants (llama/22504)	2026-05-01 13:07:35 +03:00
Reese Levine	fa20229eeb	ggml-webgpu: Fix bug in FlashAttention support check (llama/22492) * Fix flashattention support check for devices that don't support subgroups * set path to none if kv_tile doesn't fit	2026-04-30 11:29:23 +03:00
Reese Levine	4ea5b6febc	ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (llama/22456) * Refactor buffer aliasing to be part of shader lib decisions * cleanup * formatting	2026-04-30 11:29:22 +03:00
Rithik Sharma	9c233f11f0	ggml-webgpu: add Q1_0 support (llama/22374) * add fast matmul matvec q1_0 kernel * ggml-webgpu: drop redundant zero-fills in Q1_0 shmem init	2026-04-30 11:29:21 +03:00
Rithik Sharma	f675a8c926	add fast mat-vec kernels for i-quants (llama/22344)	2026-04-30 11:29:21 +03:00
Rithik Sharma	1478450e61	add performance-portable tuning for register-tile and subgroup matmul (llama/22241)	2026-04-30 11:29:20 +03:00
Reese Levine	c235b05d8a	ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (llama/22327) * Implement ssm_scan * Remove blocking in graph_compute and check for set rows * Fix bindings * Update op support	2026-04-30 11:29:19 +03:00
Zheyuan Chen	35d679a4f8	ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (llama/22199) * ggml-webgpu: add tile flash attention fallback * ggml-webgpu: add new fields and discard usage of mnk for tile version * ggml-webgpu: modify the vec path to discard the mnk parameter * ggml-webgpu: enable flash attention vec and tile version for browser * ggml-webgpu: staging KV for flash attention tile version * formatting * turn on subgroup uniformity check * remove Q_TILE as it is always 1 for vec path * make row_max and exp_sum to local register * make different bindings with same underlying buffer to have the same usage flags * move path selection into the shader library and have the host consume a single flash-attn decision object. * turn off skip_validation and address buffer overlapping when nwg==1 * formatting * merge binding when kv overlap	2026-04-30 11:29:18 +03:00
Chen Yuan	641998f558	fix(shader): handle the buffer aliasing for rms fuse (llama/22266)	2026-04-30 11:29:17 +03:00
Chen Yuan	df528c4f71	ggml-webgpu: add support for im2col (llama/22259) * shader(im2col): implement the im2col shader * shader(im2col): clean the formatting issues * shader(im2col): clean the editorconfig checker warning * fix(shader): address the workgroup issues of im2col and conv2d	2026-04-30 11:29:17 +03:00
Nikhil Jain	d2a26dc8e2	Implement async tensor api and event api (llama/22099) * Only run webgpu CI on my fork * Implement set_tensor_async * Implement synchronize api * Implement event creation and deletion API * Cleanup * Cleanup * Comment out jobs for local CI run * Add webgpu only workflow * Delete .github/workflows/build-webgpu.yml * Cleanup * Cleanup * Update API with function handlers * Run clang-format * Replace one-shot buffer with a direct queue.WriteBuffer using the buffer context	2026-04-30 11:29:16 +03:00
Masashi Yoshimura	0fbe4c4ca7	ggml-webgpu: Add fused RMS_NORM + MUL (llama/21983) * fused rms_norm_mul + mul * Add GGML_WEBGPU_DISABLE_FUSION for being able to disable kernel fusion. * Decouple num_fused_ops from webgpu_context; misc cleanup * Fix eps handling and remove disable_fusion. * Fix not to use c++20 initializers.	2026-04-30 11:29:16 +03:00
Chen Yuan	447be522e9	ggml-webgpu(shader): support conv2d kernels. (llama/21964) * ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log * Merge with upstream * Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants * Update Unary wgsl EXP and EXPM1 for f16 stability * Fix GET_ROWS IQ4_XS struct for NaN f16 canonicalization * Fix numerical precision for unary sqrt when working with f16 * Fix NaN canonicalization for packed integers using f16 * Update err threshold for binary div ops when using f16 * backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend * clean: uncomment existing code logs * clean: clean the unnecessary debug info * Refactor and generalize dequant helpers * Remove deprecated quant structs * Refactor shader defines to reduce repetition * Remove error override for F16 type * fix: fix the accidental removal of the proper initialization of ctx * clean: clean legacy and format code * fix: did not modify tests ops * shader(conv2d): add conv2d shader kernels and pass f32 and f16 tests * shader(conv2d): fix the out of bounds memory access in the weight indexing * shader(conv2d): clean unused variables and optimize the computation * merge: use the new entries function * clean: address the formatting issues * clean: address the warning issues * clear: clean the shader editorconfig-checker issues * clear: clean the shader editorconfig-checker with utf-8 --------- Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>	2026-04-30 11:29:16 +03:00
Masashi Yoshimura	2e5eb6e951	ggml-webgpu: reset CPU/GPU profiling time when freeing context (llama/22050) * Reset the CPU/GPU profiling time when freeing context. * move GPU profiling time from global context to webgpu_context.	2026-04-30 11:29:15 +03:00
neha-ha	5f21fdcbb9	ggml-webgpu: updated matrix-vector multiplication (llama/21738) * merged properly, but slow q3_k and q5_k with u32 indexing * Start on new mat-vec * New format float paths working * Working q4_0 * Work on remaining legacy q-types * port k-quants to new matvec * remove old shader * Remove old constants, format * remove accidental file --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-04-30 11:29:13 +03:00
Reese Levine	cbbe935765	ggml-webgpu: fix compiler warnings and refactor FlashAttention encoding (llama/21052) * Update workflows to remove dependence on llvmpipe * Try setting Dawn_DIR * remove c++20 initializers * Move to proper guid * Try avoiding segfaults on vulkan backend process exit * Remove compiler warnings on parameter casting * Fix soft_max and update reg_tile accumulation to f32 for better precision * Refactor flash_attn a bit * remove c++20 initializers and format * Increase div precision for NVIDIA * revert div precision and comment out ggml-ci node for now * Formatting * Try debugging on a failing CI node * Revert "Try debugging on a failing CI node" This reverts commit 1971e33cba919915e12bcfd5828abfbd54ca942e.	2026-04-30 11:29:12 +03:00
Reese Levine	092330b474	ggml-webgpu: compute pass batching and removing profiling overhead (llama/21873) * Update register tiling matmul to use f32 accumulation * fix profiling code * Fix register tiling matmul for chrome, i'm blaming dawn * Update batch tuning value for iOS * compile fix * Fix use of new load function * Move to a single query set for GPU profiling * Move to batching compute passes when not profiling * Refactor build_multi * remove iOS throttling now that we're batching compute passes	2026-04-30 11:29:10 +03:00
Reese Levine	2a785c5969	ggml-webgpu: Fix dequantization helpers to not pass in pointers (llama/21872) * Fix dequantization helpers to not pass in pointers * Increase XIELU precision	2026-04-30 11:29:10 +03:00
Georgi Gerganov	7024f7e5c1	ci : re-enable mac workflows (llama/21894) * ci : re-enable mac workflows * vulkan : fix compile warning	2026-04-30 11:29:08 +03:00
Reese Levine	b732f4d9b5	ggml-webgpu: Update register tiling matmul to use f32 accumulation (llama/21644) * Update register tiling matmul to use f32 accumulation * fix profiling code * Fix register tiling matmul for chrome, i'm blaming dawn * Update batch tuning value for iOS * compile fix * Fix use of new load function	2026-04-30 11:29:07 +03:00
Masashi Yoshimura	36b7bb3d95	Remove extra conditional check on debug mode. (llama/21798)	2026-04-30 11:29:07 +03:00
Rithik Sharma	2580cfc703	ggml-webgpu: support non-square subgroup matrix configs for Intel GPUs (llama/21669)	2026-04-30 11:29:05 +03:00
Chen Yuan	3fc738a8c2	ggml-webgpu: address quantization precision and backend lifecycle management (llama/21521) * ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log * Merge with upstream * Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants * Update Unary wgsl EXP and EXPM1 for f16 stability * Fix GET_ROWS IQ4_XS struct for NaN f16 canonicalization * Fix numerical precision for unary sqrt when working with f16 * Fix NaN canonicalization for packed integers using f16 * Update err threshold for binary div ops when using f16 * backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend * clean: uncomment existing code logs * clean: clean the unnecessary debug info * Refactor and generalize dequant helpers * Remove deprecated quant structs * Refactor shader defines to reduce repetition * Remove error override for F16 type * fix: fix the accidental removal of the proper initialization of ctx * clean: clean legacy and format code * fix: did not modify tests ops --------- Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>	2026-04-30 11:29:05 +03:00

99 Commits