whisper.cpp

ZihaoMu 882736f886 ggml: support concat for scalar types at cuda backend (llama/24011) * cuda: support concat for scalar types * Update concat.cu * fix metal ci issue	2026-06-15 10:33:53 +03:00
template-instances	Add flash attention MMA / Tiles to support MiMo-V2.5 (llama/22812)	2026-05-14 21:26:48 +03:00
vendors	HIP: add gfx1152 and gfx1153 to RDNA3.5 (llama/24129)	2026-06-08 14:36:36 +03:00
CMakeLists.txt	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)	2026-05-25 12:26:07 +03:00
acc.cu	llama/ggml: add LLM training support (llama/10544)	2025-05-13 13:59:21 +03:00
acc.cuh	…
add-id.cu	musa: fix build warnings (llama/15258)	2025-09-20 13:42:38 +03:00
add-id.cuh	llama : add gpt-oss (llama/15091)	2025-08-18 20:30:45 +03:00
allreduce.cu	Fix for issue #22974 . Cast intermediate results to float before adding and casting the result to the destination type. Avoids half+half operator ambiguity. (llama/22994)	2026-05-14 21:26:48 +03:00
allreduce.cuh	internal AllReduce kernel for CUDA provider (llama/22299)	2026-05-14 21:26:48 +03:00
arange.cu	…
arange.cuh	…
argmax.cu	ggml : use WARP_SIZE/2 for argmax reduction offset (llama/18092)	2025-12-18 08:20:56 +02:00
argmax.cuh	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-05 15:23:51 +03:00
argsort.cu	CUDA: directly include cuda/iterator (llama/22936)	2026-05-14 21:26:48 +03:00
argsort.cuh	sampling : add support for backend sampling (llama/17004)	2026-01-14 09:11:59 +02:00
binbcast.cu	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)	2026-05-25 12:26:07 +03:00
binbcast.cuh	CUDA: fuse muls (llama/21665)	2026-04-30 11:29:05 +03:00
clamp.cu	cuda: unary ops as float + de-duplicate (ggml/1130)	2025-03-08 15:13:01 +02:00
clamp.cuh	…
common.cuh	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
concat.cu	ggml: support concat for scalar types at cuda backend (llama/24011)	2026-06-15 10:33:53 +03:00
concat.cuh	…
conv-transpose-1d.cu	musa: add GGML_UNUSED_VARS (llama/15446)	2025-09-20 13:42:38 +03:00
conv-transpose-1d.cuh	…
conv2d-dw.cu	CUDA: add conv_2d_dw (llama/14265)	2025-06-21 07:34:17 +03:00
conv2d-dw.cuh	CUDA: add conv_2d_dw (llama/14265)	2025-06-21 07:34:17 +03:00
conv2d-transpose.cu	CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D` (llama/17094)	2026-03-29 15:04:36 +03:00
conv2d-transpose.cuh	CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D` (llama/17094)	2026-03-29 15:04:36 +03:00
conv2d.cu	CUDA: fix build error from ambiguous __half conversions in conv2d (llama/15690)	2025-09-20 13:42:46 +03:00
conv2d.cuh	CUDA: add conv2d (llama/15635)	2025-09-20 13:42:44 +03:00
convert.cu	cuda: Q1_0 initial backend (llama/21629)	2026-04-30 11:29:10 +03:00
convert.cuh	CUDA: fix BF16 FA compilation (llama/20865)	2026-03-29 15:04:36 +03:00
count-equal.cu	ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213 )	2024-11-15 15:21:04 +02:00
count-equal.cuh	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-05 15:23:51 +03:00
cp-async.cuh	CUDA: FA support for Deepseek (Ampere or newer) (llama/13306)	2025-05-13 13:59:21 +03:00
cpy-utils.cuh	cuda : support non-contiguous i32 to i32 copy (llama/17326)	2025-12-12 17:53:06 +02:00
cpy.cu	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)	2026-05-25 12:26:07 +03:00
cpy.cuh	cuda : remove legacy copy-op pointer indirection code (llama/16485)	2025-10-15 09:29:17 +03:00
cross-entropy-loss.cu	CUDA: add dynamic shared mem to softmax, refactor general usage (llama/14497)	2025-07-12 19:23:56 +03:00
cross-entropy-loss.cuh	ggml/examples: add backend support for numerical optimization (ggml/949)	2024-09-24 19:45:08 +03:00
cumsum.cu	sampling : add support for backend sampling (llama/17004)	2026-01-14 09:11:59 +02:00
cumsum.cuh	Add support for CUMSUM and TRI for CUDA. (llama/17584)	2025-12-12 17:53:17 +02:00
dequantize.cuh	cuda: Q1_0 initial backend (llama/21629)	2026-04-30 11:29:10 +03:00
diag.cu	Add DIAG for CUDA (llama/17873)	2025-12-12 17:53:23 +02:00
diag.cuh	Add DIAG for CUDA (llama/17873)	2025-12-12 17:53:23 +02:00
diagmask.cu	…
diagmask.cuh	…
fattn-common.cuh	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
fattn-mma-f16.cuh	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
fattn-tile.cu	Add flash attention MMA / Tiles to support MiMo-V2.5 (llama/22812)	2026-05-14 21:26:48 +03:00
fattn-tile.cuh	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
fattn-vec.cuh	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
fattn-wmma-f16.cu	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
fattn-wmma-f16.cuh	chore : correct typos [no ci] (llama/20041)	2026-03-16 13:10:15 +02:00
fattn.cu	cuda: reserve space for quantize kv-cache at startup (llama/23907)	2026-06-08 14:36:36 +03:00
fattn.cuh	cuda: reserve space for quantize kv-cache at startup (llama/23907)	2026-06-08 14:36:36 +03:00
fill.cu	ggml : allow fill node alloc inplace (llama/17870)	2025-12-12 17:53:22 +02:00
fill.cuh	cuda : add FILL op support (llama/17851)	2025-12-12 17:53:22 +02:00
fwht.cu	CUDA: missing PDL sync for FWHT, better fallback (llama/23690)	2026-05-29 09:47:30 +03:00
fwht.cuh	CUDA: missing PDL sync for FWHT, better fallback (llama/23690)	2026-05-29 09:47:30 +03:00
gated_delta_net.cu	Remove padding and multiple D2D copies for MTP (llama/24086)	2026-06-15 10:33:53 +03:00
gated_delta_net.cuh	ggml: add GATED_DELTA_NET op (llama/19504)	2026-03-16 13:10:15 +02:00
getrows.cu	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
getrows.cuh	CUDA: batched+noncont MMQ, refactor bs>1 MoE code (llama/13199)	2025-05-01 13:29:02 +03:00
ggml-cuda.cu	ggml: support concat for scalar types at cuda backend (llama/24011)	2026-06-15 10:33:53 +03:00
gla.cu	llama: add support for QRWKV6 model architecture (llama/11001)	2025-01-14 10:38:01 +02:00
gla.cuh	llama: add support for QRWKV6 model architecture (llama/11001)	2025-01-14 10:38:01 +02:00
im2col.cu	CUDA: handle OW > 65535 in im2col (2D and 3D) (llama/22944)	2026-05-14 21:26:48 +03:00
im2col.cuh	ggml: add ops for WAN video model (cuda && cpu) (llama/15669)	2025-09-20 13:42:49 +03:00
mean.cu	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)	2026-05-25 12:26:07 +03:00
mean.cuh	CUDA: add mean operation (llama/14313)	2025-07-01 17:54:53 +03:00
mma.cuh	HIP: RDNA3 mma FA, faster AMD transpose, tune AMD (llama/22880)	2026-05-25 12:26:07 +03:00
mmf.cu	HIP: add mmf for CDNA (llama/18896)	2026-01-30 15:56:40 +02:00
mmf.cuh	clean up unused variables warnings (llama/23975)	2026-06-08 14:36:36 +03:00
mmid.cu	CUDA: add fp kernel for larger batch size MoE (llama/16512)	2025-10-15 09:29:17 +03:00
mmid.cuh	CUDA: add fp kernel for larger batch size MoE (llama/16512)	2025-10-15 09:29:17 +03:00
mmq.cu	ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (llama/22196)	2026-04-30 11:29:22 +03:00
mmq.cuh	ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (llama/22196)	2026-04-30 11:29:22 +03:00
mmvf.cu	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
mmvf.cuh	CUDA: use mmvq for mul-mat-id for small batch sizes (llama/18958)	2026-02-08 09:29:10 +02:00
mmvq.cu	Remove case for GGML_TYPE_Q4_K in mvvq.cu (llama/23528)	2026-06-15 10:33:53 +03:00
mmvq.cuh	CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (llama/23227)	2026-05-29 09:47:30 +03:00
norm.cu	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)	2026-05-25 12:26:07 +03:00
norm.cuh	CUDA: fuse adds, fuse add with rms norm (llama/15631)	2025-09-20 13:42:44 +03:00
opt-step-adamw.cu	ggml: new optimization interface (ggml/988)	2024-11-20 21:00:08 +02:00
opt-step-adamw.cuh	ggml/examples: add backend support for numerical optimization (ggml/949)	2024-09-24 19:45:08 +03:00
opt-step-sgd.cu	finetune: SGD optimizer, more CLI args (llama/13873)	2025-08-18 20:30:45 +03:00
opt-step-sgd.cuh	finetune: SGD optimizer, more CLI args (llama/13873)	2025-08-18 20:30:45 +03:00
out-prod.cu	CUDA: batch out_prod inner loop with cublasSgemmStridedBatched (llama/22651)	2026-05-14 21:26:48 +03:00
out-prod.cuh	ggml/examples: add backend support for numerical optimization (ggml/949)	2024-09-24 19:45:08 +03:00
pad.cu	cuda : extend GGML_OP_PAD to work with non-cont src0 (llama/19429)	2026-02-15 21:44:37 +02:00
pad.cuh	…
pad_reflect_1d.cu	musa: fix build warnings (llama/15611)	2025-09-29 15:18:10 +03:00
pad_reflect_1d.cuh	cuda : add Pad Reflect 1D support (llama/14659)	2025-09-20 13:42:39 +03:00
pool2d.cu	…
pool2d.cuh	…
quantize.cu	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
quantize.cuh	ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (llama/22196)	2026-04-30 11:29:22 +03:00
reduce_rows.cuh	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
roll.cu	CUDA: add roll (llama/14919)	2025-08-18 20:30:45 +03:00
roll.cuh	CUDA: add roll (llama/14919)	2025-08-18 20:30:45 +03:00
rope.cu	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)	2026-05-25 12:26:07 +03:00
rope.cuh	CUDA: fuse rope + set_rows (llama/16884)	2025-11-17 21:05:46 +02:00
scale.cu	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)	2026-05-25 12:26:07 +03:00
scale.cuh	…
set-rows.cu	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
set-rows.cuh	CUDA: add set rows for f32 and f16 (llama/14551)	2025-07-20 00:23:50 +03:00
set.cu	cuda: add SET operation support (llama/16804)	2025-11-09 23:38:03 +02:00
set.cuh	cuda: add SET operation support (llama/16804)	2025-11-09 23:38:03 +02:00
snake.cu	cuda: fuse snake activation (mul, sin, sqr, mul, add) (llama/22667)	2026-05-14 21:26:48 +03:00
snake.cuh	cuda: fuse snake activation (mul, sin, sqr, mul, add) (llama/22667)	2026-05-14 21:26:48 +03:00
softcap.cu	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)	2026-05-25 12:26:07 +03:00
softcap.cuh	cuda : add softcap fusion (llama/14907)	2025-08-18 20:30:45 +03:00
softmax.cu	chore : correct typos [no ci] (llama/20041)	2026-03-16 13:10:15 +02:00
softmax.cuh	CUDA: backwards pass for misc. ops, add tests (llama/11257)	2025-02-03 22:00:57 +02:00
solve_tri.cu	chore : correct typos [no ci] (llama/20041)	2026-03-16 13:10:15 +02:00
solve_tri.cuh	SOLVE_TRI CUDA kernel for small matrices (llama/17457)	2025-12-12 17:53:11 +02:00
ssm-conv.cu	Avoid PDL race conditions by disabling __restrict__ when PDL is used (llama/24030)	2026-06-08 14:36:36 +03:00
ssm-conv.cuh	CUDA: fuse SSM_CONV + ADD(bias) + SILU (llama/22478)	2026-05-01 13:07:35 +03:00
ssm-scan.cu	CUDA: Fix ssm_scan_f32 data-races (llama/24360)	2026-06-15 10:33:53 +03:00
ssm-scan.cuh	ggml : faster ssm scan (llama/10558)	2025-04-02 15:51:57 +03:00
sum.cu	CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (llama/15132)	2025-08-18 20:30:45 +03:00
sum.cuh	tests: add gradient tests for all backends (ggml/932)	2024-09-24 19:45:08 +03:00
sumrows.cu	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)	2026-05-25 12:26:07 +03:00
sumrows.cuh	CUDA: add mean operation (llama/14313)	2025-07-01 17:54:53 +03:00
top-k.cu	CUDA: Continue directly including cuda/iterator (llama/23102)	2026-05-25 12:26:07 +03:00
top-k.cuh	sampling : add support for backend sampling (llama/17004)	2026-01-14 09:11:59 +02:00
topk-moe.cu	clean up unused variables warnings (llama/23975)	2026-06-08 14:36:36 +03:00
topk-moe.cuh	CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (llama/19126)	2026-01-30 15:56:40 +02:00
tri.cu	Add support for CUMSUM and TRI for CUDA. (llama/17584)	2025-12-12 17:53:17 +02:00
tri.cuh	Add support for CUMSUM and TRI for CUDA. (llama/17584)	2025-12-12 17:53:17 +02:00
tsembd.cu	ggml : fix padding in timestep embedding kernels (llama/15932)	2025-09-20 13:45:30 +03:00
tsembd.cuh	…
unary.cu	Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (llama/22522)	2026-05-25 12:26:07 +03:00
unary.cuh	CUDA: fuse relu + sqr (llama/22249)	2026-04-30 11:29:16 +03:00
upscale.cu	model: LFM2-VL fixes (llama/17577)	2025-12-12 17:53:14 +02:00
upscale.cuh	…
vecdotq.cuh	cuda: Q1_0 initial backend (llama/21629)	2026-04-30 11:29:10 +03:00
wkv.cu	llama: Add support for RWKV v7 architecture (llama/12412)	2025-03-27 11:06:03 +02:00
wkv.cuh	llama: Add support for RWKV v7 architecture (llama/12412)	2025-03-27 11:06:03 +02:00
