whisper.cpp/ggml/src/ggml-cpu
Aman Gupta 23f956de33 llama + spec: MTP Support (llama/22673)
* spec: support MTP

* fix batch size

* rename files

* cont : simplify (llama/7)

* MTP: clean-up (llama/9)

* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion

* mtp -> draft-mtp

* remove unused llama_arch

* add need_embd in speculative

* llama: allow partial seq_rm for GDN models for speculative decoding

Currently, speculative decoding has to restart from a checkpoint when
some draft tokens are not accepted, which wastes work re-running the
target model. This PR adds the ability to roll back up to `draft_max`
tokens by storing the GDN intermediates.

* fix pending state

* vulkan: add GDN partial rollback

* meta: extend check to axis 1

* metal: add GDN partial rollback

Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior
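
The slot scheme in the list above can be illustrated with a small CPU-side sketch. This is not the Metal kernel. `toy_step` is a made-up stand-in for the gated delta net update, and the slot mapping (slot `j` holds the state from `j` tokens before the end of the batch) is an assumption chosen for illustration. The names `toy_step` and `run_tokens` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a recurrent state update. NOT the real gated delta rule.
static void toy_step(std::vector<float> & s, float x) {
    for (float & v : s) {
        v = 0.5f * v + x;
    }
}

// states holds K snapshot slots; slot 0 contains the input state on entry.
// Assumed layout on exit (hypothetical): slot j holds the state as it was
// j tokens before the end of the batch, so slot 0 is the final state.
static void run_tokens(std::vector<std::vector<float>> & states,
                       const std::vector<float> & tokens) {
    assert(!states.empty());
    const std::size_t K = states.size();
    const std::size_t n = tokens.size();

    std::vector<float> cur = states[0]; // read the input state from slot 0

    for (std::size_t t = 0; t < n; ++t) {
        toy_step(cur, tokens[t]);
        const std::size_t slot = n - 1 - t; // tokens remaining after this one
        if (slot < K) {
            states[slot] = cur; // keep only the last K intermediate states
        }
    }
}
```

Under this assumed layout, rejecting the last `j` draft tokens (with `j < K`) would amount to resuming from `states[j]` instead of re-running the batch. With `K = 1` only slot 0 is written, which matches the single-slot behavior described in the last bullet.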

Ref: 8c05923630

Assisted-by: llama.cpp:local pi

* delta_net_base: use ggml_pad instead of new_tensor

* review: add need_rs_seq

* review: rename part_bounded to n_rs

* review: deslop comments

* review: rename, add asserts

* server : adjust checkpoint logic (llama/11)

* server : adjust checkpoint logic

* cont : rm asserts

* server-context: fix early exit

* spec : fix compatibility with n-gram and add TODOs (llama/13)

* metal : cleanup

* llama : fix faulty bitwise check in recurrent memory

* server : disable RS-based MTP in combination with other spec types

* spec : add TODOs

* cont : fix comment

* cont : update comment

* common : fix logic for ngram + mtp compat

* llama-memory: enable checkpointing with partial rollback

* cont: add test-case for loading into a dirty ctx

* llama-memory-recurrent: clear rs_idx in clear

* download: fix mtp path

* llama-arch: fix enorm op

* docs: update docs

* conversion: fix type annotations

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-25 12:26:07 +03:00
..
amx ggml : use 64 bytes aligned tile buffers (llama/21058) 2026-04-30 11:29:20 +03:00
arch ggml-cpu: Optimized risc-v cpu q1_0 dot 2026-05-14 21:26:48 +03:00
cmake ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863) 2026-05-25 12:26:07 +03:00
kleidiai kleidiai : fix MUL_MAT support for batched (3D) inputs (llama/20620) 2026-03-29 15:04:36 +03:00
llamafile ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (llama/22293) 2026-05-01 13:07:34 +03:00
spacemit ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863) 2026-05-25 12:26:07 +03:00
CMakeLists.txt ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (llama/22863) 2026-05-25 12:26:07 +03:00
arch-fallback.h ggml-cpu: Optimized risc-v cpu q1_0 dot 2026-05-14 21:26:48 +03:00
binary-ops.cpp ggml : extend bin bcast for permuted src1 (llama/19484) 2026-02-15 21:44:37 +02:00
binary-ops.h cpu: de-duplicate some of the operators and refactor (ggml/1144) 2025-03-31 14:56:53 +03:00
common.h ggml-cpu: FA add GEMM microkernel (llama/19422) 2026-02-27 20:57:58 +02:00
ggml-cpu-impl.h ggml : fix ARM NEON nvfp4 dot product on non-dotprod targets (llama/21559) 2026-04-30 11:29:08 +03:00
ggml-cpu.c llama + spec: MTP Support (llama/22673) 2026-05-25 12:26:07 +03:00
ggml-cpu.cpp vulkan: add get/set tensor 2d functions (llama/22514) 2026-05-01 13:07:35 +03:00
hbm.cpp ggml-cpu : split arch-specific implementations (llama/13892) 2025-06-10 12:40:33 +03:00
hbm.h ggml-cpu : split arch-specific implementations (llama/13892) 2025-06-10 12:40:33 +03:00
ops.cpp llama + spec: MTP Support (llama/22673) 2026-05-25 12:26:07 +03:00
ops.h ggml-cpu: fuse RMS_NORM + MUL on CPU backend (llama/22423) 2026-05-14 21:26:48 +03:00
quants.c ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) (llama/21636) 2026-04-30 11:29:14 +03:00
quants.h ggml: add Q1_0 1-bit quantization support (CPU) (llama/21273) 2026-04-30 11:29:01 +03:00
repack.cpp ggml-cpu: fix RVV checks in quants and repacking (llama/20682) 2026-03-29 15:04:36 +03:00
repack.h ggml-cpu: add RVV repack GEMM and GEMV for quantization types (llama/19121) 2026-03-16 13:10:15 +02:00
simd-gemm.h ggml : implemented simd_gemm kernel for riscv vector extension (llama/20627) 2026-04-30 11:29:11 +03:00
simd-mappings.h ggml : add native AVX512-FP16 support for F16 operations (llama/20529) 2026-03-16 13:10:15 +02:00
traits.cpp ggml : fix fallback to CPU for ununsupported ops (llama/15118) 2025-08-18 20:30:45 +03:00
traits.h ggml : fix fallback to CPU for ununsupported ops (llama/15118) 2025-08-18 20:30:45 +03:00
unary-ops.cpp ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/19511) 2026-02-15 21:44:37 +02:00
unary-ops.h ggml : add ops SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM (llama/17063) 2025-11-17 21:05:46 +02:00
vec.cpp ggml-cpu: optimize ggml_vec_dot_bf16 for s390x (llama/19399) 2026-02-27 20:57:58 +02:00
vec.h ggml-cpu : re-enable fast gelu_quick_f16 (llama/22339) 2026-04-30 11:29:20 +03:00