whisper.cpp

CrispStrobe 5f08683bb6 metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477)

For a given output position j on the time axis, only input positions i such that i*s0 <= j < i*s0 + K contribute -- i.e. i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1]. That's at most ceil(K/s0) values (typically 2 for stride==K/2 transposed convs).

The current kernel iterates the full IL range and filters with an `if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320, K=10, s0=5 -- a representative codec-decoder shape). On Apple M1 the wasted work trips the macOS GPU watchdog (kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long graphs.

Compute i_min, i_max analytically before the inner loop and iterate only [i_min, i_max]. Output is bit-identical (same multiplies and adds in the same order); loop bound shrinks by IL/ceil(K/s0).

Tested on M1 with a downstream consumer running a TTS codec at full T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits across long synthesis runs vs ~30% pre-patch.

2026-05-14 21:26:48 +03:00
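
The index arithmetic in the commit message can be checked on the CPU. The following is a minimal Python sketch, not the Metal kernel itself: it assumes a single input channel and a single output channel, no padding, and dilation 1, so the output length is (IL - 1)*s0 + K. The function names (`conv_transpose_1d_full`, `conv_transpose_1d_tight`, `input_range`) are hypothetical and exist only for this illustration. It compares the "scan all of IL and filter" loop with the "iterate only [i_min, i_max]" loop.

```python
def conv_transpose_1d_full(x, w, s0):
    """Reference form: scan every input position i and filter with an `if`."""
    IL, K = len(x), len(w)
    out = [0.0] * ((IL - 1) * s0 + K)  # assumes no padding, dilation 1
    for j in range(len(out)):
        acc = 0.0
        for i in range(IL):
            # Input i touches outputs i*s0 .. i*s0 + K - 1.
            if i * s0 <= j < i * s0 + K:
                acc += x[i] * w[j - i * s0]
        out[j] = acc
    return out


def input_range(j, IL, K, s0):
    """Closed range [i_min, i_max] of input positions contributing to output j."""
    # ceil(a / b) for b > 0, written with floor division so negatives are handled.
    i_min = max(0, -(-(j - K + 1) // s0))
    i_max = min(IL - 1, j // s0)
    return i_min, i_max


def conv_transpose_1d_tight(x, w, s0):
    """Tightened form: visit only the contributing input positions."""
    IL, K = len(x), len(w)
    out = [0.0] * ((IL - 1) * s0 + K)
    for j in range(len(out)):
        i_min, i_max = input_range(j, IL, K, s0)
        acc = 0.0
        # Same ascending order of i as the filtered loop, so the same
        # multiplies and adds happen in the same order.
        for i in range(i_min, i_max + 1):
            acc += x[i] * w[j - i * s0]
        out[j] = acc
    return out


if __name__ == "__main__":
    IL, K, s0 = 320, 10, 5  # the shape quoted in the commit message
    x = [float(i % 7 - 3) for i in range(IL)]
    w = [0.5 * k - 2.0 for k in range(K)]
    same = conv_transpose_1d_full(x, w, s0) == conv_transpose_1d_tight(x, w, s0)
    print("outputs identical:", same)
```

For the quoted shape, each output position visits at most ceil(K/s0) = 2 input positions instead of all 320, which is the ~160x reduction in inner-loop iterations the message describes. The real kernel also accumulates over input channels; that outer loop is unaffected by the bound and is omitted here.
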
..
cmake	cmake : add FindNCCL.cmake (ggml/0)	2026-05-02 15:02:42 +03:00
include	CUDA: lower-case PCI bus id, standardize for ggml (llama/22820)	2026-05-14 21:26:48 +03:00
src	metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477)	2026-05-14 21:26:48 +03:00
.gitignore	…
CMakeLists.txt	ggml: install ggml.pc in <libdir>/pkgconfig (ggml/1480)	2026-05-14 21:26:48 +03:00