whisper.cpp

Aman Gupta 23f956de33 llama + spec: MTP Support (llama/22673)	2026-05-25 12:26:07 +03:00

* spec: support MTP
* fix batch size
* rename files
* cont : simplify (llama/7)
* MTP: clean-up (llama/9)
* MTP: clean-up
* review: use llama_context_type instead of llama_graph_type
* review: remove llama_model_has_mtp
* review: fix convert issues
* convert: fix pycheck
* review: formatting
* use `mtp-` for identifying mtp models
* convert: fix mtp conversion
* mtp -> draft-mtp
* remove unused llama_arch
* add need_embd in speculative
* llama: allow partial seq_rm for GDN models for speculative decoding

  Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates.
* fix pending state
* vulkan: add GDN partial rollback
* meta: extend check to axis 1
* metal: add GDN partial rollback

  Extend the gated delta net kernel to store intermediate states for partial rollback support on the Metal backend.

  - Add K (snapshot slot count) as a function constant
  - Read input state from slot 0 of the 3D state tensor
  - Write intermediate states to different slots during token loop
  - For K=1, maintain backward-compatible single-slot behavior

  Ref: `8c05923630`

  Assisted-by: llama.cpp:local pi
* delta_net_base: use ggml_pad instead of new_tensor
* review: add need_rs_seq
* review: rename part_bounded to n_rs
* review: deslop comments
* review: rename, add asserts
* server : adjust checkpoint logic (llama/11)
* server : adjust checkpoint logic
* cont : rm asserts
* server-context: fix early exit
* spec : fix compatibility with n-gram and add TODOs (llama/13)
* metal : cleanup
* llama : fix faulty bitwise check in recurrent memory
* server : disable RS-based MTP in combination with other spec types
* spec : add TODOs
* cont : fix comment
* cont : update comment
* common : fix logic for ngram + mtp compat
* llama-memory: enable checkpointing with partial rollback
* cont: add test-case for loading into a dirty ctx
* llama-memory-recurrent: clear rs_idx in clear
* download: fix mtp path
* llama-arch: fix enorm op
* docs: update docs
* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

amx	ggml : use 64 bytes aligned tile buffers (llama/21058)	2026-04-30 11:29:20 +03:00
arch	ggml-cpu: Optimized risc-v cpu q1_0 dot	2026-05-14 21:26:48 +03:00
cmake	ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863)	2026-05-25 12:26:07 +03:00
kleidiai	kleidiai : fix MUL_MAT support for batched (3D) inputs (llama/20620)	2026-03-29 15:04:36 +03:00
llamafile	ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (llama/22293)	2026-05-01 13:07:34 +03:00
spacemit	ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863)	2026-05-25 12:26:07 +03:00
CMakeLists.txt	ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863)	2026-05-25 12:26:07 +03:00
arch-fallback.h	ggml-cpu: Optimized risc-v cpu q1_0 dot	2026-05-14 21:26:48 +03:00
binary-ops.cpp	ggml : extend bin bcast for permuted src1 (llama/19484)	2026-02-15 21:44:37 +02:00
binary-ops.h	cpu: de-duplicate some of the operators and refactor (ggml/1144)	2025-03-31 14:56:53 +03:00
common.h	ggml-cpu: FA add GEMM microkernel (llama/19422)	2026-02-27 20:57:58 +02:00
ggml-cpu-impl.h	ggml : fix ARM NEON nvfp4 dot product on non-dotprod targets (llama/21559)	2026-04-30 11:29:08 +03:00
ggml-cpu.c	llama + spec: MTP Support (llama/22673)	2026-05-25 12:26:07 +03:00
ggml-cpu.cpp	vulkan: add get/set tensor 2d functions (llama/22514)	2026-05-01 13:07:35 +03:00
hbm.cpp	ggml-cpu : split arch-specific implementations (llama/13892)	2025-06-10 12:40:33 +03:00
hbm.h	ggml-cpu : split arch-specific implementations (llama/13892)	2025-06-10 12:40:33 +03:00
ops.cpp	llama + spec: MTP Support (llama/22673)	2026-05-25 12:26:07 +03:00
ops.h	ggml-cpu: fuse RMS_NORM + MUL on CPU backend (llama/22423)	2026-05-14 21:26:48 +03:00
quants.c	ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) (llama/21636)	2026-04-30 11:29:14 +03:00
quants.h	ggml: add Q1_0 1-bit quantization support (CPU) (llama/21273)	2026-04-30 11:29:01 +03:00
repack.cpp	ggml-cpu: fix RVV checks in quants and repacking (llama/20682)	2026-03-29 15:04:36 +03:00
repack.h	ggml-cpu: add RVV repack GEMM and GEMV for quantization types (llama/19121)	2026-03-16 13:10:15 +02:00
simd-gemm.h	ggml : implemented simd_gemm kernel for riscv vector extension (llama/20627)	2026-04-30 11:29:11 +03:00
simd-mappings.h	ggml : add native AVX512-FP16 support for F16 operations (llama/20529)	2026-03-16 13:10:15 +02:00
traits.cpp	ggml : fix fallback to CPU for ununsupported ops (llama/15118)	2025-08-18 20:30:45 +03:00
traits.h	ggml : fix fallback to CPU for ununsupported ops (llama/15118)	2025-08-18 20:30:45 +03:00
unary-ops.cpp	ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/19511)	2026-02-15 21:44:37 +02:00
unary-ops.h	ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (llama/17063)	2025-11-17 21:05:46 +02:00
vec.cpp	ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (llama/19399)	2026-02-27 20:57:58 +02:00
vec.h	ggml-cpu : re-enable fast gelu_quick_f16 (llama/22339)	2026-04-30 11:29:20 +03:00