Commit Graph

62 Commits

Author SHA1 Message Date
dskwe 511f8602b1 ggml : Check the right iface method before using the fallback 2d get (llama/23514) 2026-05-25 12:26:07 +03:00
Matt Corallo 03da9f17f4 ggml : Check the right iface method before using the fallback 2d get (llama/23306)
Probably no backends implement only one of 2d get/set, but this
might be annoying for some future backend developer trying to add
2d get/set.
2026-05-25 12:26:07 +03:00
Max Krasnyansky eb38a02de1 ggml: update SCHED_DEBUG output to use ggml_op_desc() (llama/22825) 2026-05-14 21:26:48 +03:00
Aman Gupta 820438ae2c ggml: add graph_reused (llama/21764)
* ggml: add graph_reused

* use versioning instead of reuse flag

* increment version with atomic

* use top bits for split numbering

* add assert

* move counter to ggml.c

* set uid in split_graph only

* fix windows

* address further review comments

* get next_uid rather than doing bit manipulation

* rename + add comment about uid
2026-04-30 11:29:11 +03:00
Johannes Gäßler bb895c843d ggml: backend-agnostic tensor parallelism (experimental) (llama/19378)
* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (llama/8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type

* Fix crashes due to KV cache serialization (llama/9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (llama/7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (llama/11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies

* Enable the previous allreduce implementation. It is better in both perf and stability (llama/12)

* delay AllReduce for MoE for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (llama/16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVino, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (llama/17)

* meta : formatting, naming, indentation (llama/18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-30 11:29:05 +03:00
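The Qwen-30B-A3B Q4_0 crash fix in the commit above comes down to simple arithmetic: an intermediate dimension of 768 holds only three blocks at a granularity of 256, and three blocks cannot be shared equally. A minimal sketch of that check follows. The function name `splits_evenly` is hypothetical and not part of ggml, the two-GPU setup is an assumption, and 32 is used as the finer granularity on the assumption that it matches the Q4_0 block size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a ggml function): returns true if `dim` elements
// can be cut into `n_devs` equal parts, each a whole number of
// `granularity`-sized blocks.
bool splits_evenly(int64_t dim, int64_t granularity, int64_t n_devs) {
    assert(granularity > 0 && n_devs > 0);
    if (dim % granularity != 0) {
        return false; // the dimension is not a whole number of blocks
    }
    const int64_t n_blocks = dim / granularity;
    return n_blocks % n_devs == 0;
}
```

With two devices, `splits_evenly(768, 256, 2)` is false (3 blocks), while `splits_evenly(768, 32, 2)` is true (24 blocks), which is consistent with deciding the block size from the quantization type.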
Georgi Gerganov 2ed6dc0222 llama : disable graph reuse with pipeline parallelism (llama/20463) 2026-03-16 13:10:15 +02:00
Andreas Kieslinger 51f397c1af CUDA: Improve performance via fewer synchronizations between tokens (llama/17795)
* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()

* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)

* Exchanges synchronous copy with async copy function.

* Adds macro guards to allow compilation in non-CUDA builds

* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts

* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues

* Minor cleanup

* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.

* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.

* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization

* Simplifies synchronizations to adhere to `saaasg` pattern.

* Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-16 13:10:15 +02:00
Johannes Gäßler 625c8d863e ggml-backend: fix async set/get fallback sync (llama/19179) 2026-02-08 09:29:10 +02:00
Georgi Gerganov 47f3e3b927 ggml : add ggml_build_forward_select (llama/18550)
* ggml : add ggml_build_forward_select

* cuda : adapt CUDA graph compat to new feature

* vulkan : update logic to handle command buffer closing

* ggml : check compute for fusion

* ggml : add comment
2026-01-30 15:56:40 +02:00
Jeff Bolz b1f65a4a7e vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron (llama/18295)
* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
2026-01-14 09:11:59 +02:00
Johannes Gäßler aaf3f39b4a llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (llama/16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-18 08:20:56 +02:00
Daniel Bevenius 201b910743 ggml : remove redundant n_copies check when setting input/output (llama/17612)
This commit removes a redundant check for sched->n_copies > 1 when
setting input and output flags on tensor copies in
ggml_backend_sched_split_graph.

The motivation for this change is to clarify the code as the outer if
statement already performs this check.
2025-12-12 17:53:15 +02:00
Georgi Gerganov 7cd3de89bf ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler (llama/17617) 2025-12-12 17:53:14 +02:00
Diego Devesa 463003e76c ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (llama/17276)
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.

* llama : update worst-case graph for unified cache

* ci : disable op offload in some tests

* fix spelling

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:12 +02:00
Diego Devesa 210f0f860b sched : fix reserve ignoring user tensor assignments (llama/17232) 2025-11-17 21:05:46 +02:00
Johannes Gäßler cd431223e0 llama: print memory breakdown on exit (llama/15860)
* llama: print memory breakdown on exit
2025-09-29 15:18:10 +03:00
Jeff Bolz 7fcb7e83ec rename optimize_graph to graph_optimize (llama/16082) 2025-09-20 13:46:39 +03:00
Jeff Bolz c29cd54818 vulkan: sort graph to allow more parallel execution (llama/15850)
* vulkan: sort graph to allow more parallel execution

Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.

With #15489, this reduces the number of synchronizations needed.

* call optimize_graph per-split
2025-09-20 13:42:52 +03:00
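The reordering idea described in the commit above (group nodes that do not depend on each other, without changing any node) can be sketched as follows. This is a simplified illustration, not the Vulkan backend's code: the function name `greedy_group_order` and the dependency-list representation are assumptions, and the input is assumed to be topologically sorted already.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: deps[i] lists the indices of the nodes that node i
// reads from. Every dependency must have a smaller index than its user.
// Returns a new node order in which mutually independent nodes are adjacent.
std::vector<int> greedy_group_order(const std::vector<std::vector<int>> & deps) {
    const size_t n = deps.size();
    for (size_t i = 0; i < n; ++i) {
        for (int d : deps[i]) {
            assert(d >= 0 && (size_t) d < i); // input must be topologically sorted
        }
    }

    std::vector<bool> placed(n, false);
    std::vector<int>  order;
    order.reserve(n);

    while (order.size() < n) {
        // Collect every unplaced node whose inputs were all placed in earlier
        // groups. `placed` is only updated after the scan, so members of one
        // group never depend on each other.
        std::vector<int> group;
        for (size_t i = 0; i < n; ++i) {
            if (placed[i]) {
                continue;
            }
            bool ready = true;
            for (int d : deps[i]) {
                if (!placed[d]) {
                    ready = false;
                    break;
                }
            }
            if (ready) {
                group.push_back((int) i);
            }
        }
        for (int i : group) {
            placed[i] = true;
            order.push_back(i);
        }
    }
    return order;
}
```

For two independent chains 0 -> 1 and 2 -> 3 feeding a final node 4, the order becomes 0, 2, 1, 3, 4, so a synchronization is needed only between groups rather than between every dependent pair.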
Johannes Gäßler f20a7b0e99 ggml-backend: raise GGML_MAX_SPLIT_INPUTS (llama/15722) 2025-09-20 13:42:47 +03:00
Diego Devesa b11c972b88 llama : separate compute buffer reserve from fattn check (llama/15696)
Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.
2025-09-20 13:42:45 +03:00
Johannes Gäßler f6ba3949b6 llama: use FA + max. GPU layers by default (llama/15434)
* llama: use max. GPU layers by default, auto -fa

* ggml-backend: abort instead of segfault
2025-09-20 13:42:44 +03:00
Diego Devesa 554f96f385 sched : fix possible use of wrong ids tensor when offloading moe prompt processing (llama/15488) 2025-09-20 13:42:39 +03:00
Diego Devesa 622dec5bf6 sched : copy only the used experts when offloading prompt processing (llama/15346) 2025-09-20 13:42:38 +03:00
Diego Devesa 6fb55d8f7c ggml : fix fallback to CPU for unsupported ops (llama/15118) 2025-08-18 20:30:45 +03:00
Diego Devesa 270fa9b25c sched : fix multiple evaluations of the same graph with pipeline parallelism (llama/14855)
ggml-ci
2025-07-28 13:02:32 +03:00
Georgi Gerganov 0ed687c6f1 metal : fuse add, mul + add tests (llama/14596)
ggml-ci
2025-07-20 00:23:50 +03:00
Jeff Bolz 00b36237ba vulkan: Add fusion support for RMS_NORM+MUL (llama/14366)
* vulkan: Add fusion support for RMS_NORM+MUL

- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.

* extract some common fusion logic

* fix -Winconsistent-missing-override

* move ggml_can_fuse to a common function

* build fix

* C and C++ versions of can_fuse

* move use count to the graph to avoid data races and double increments when used in multiple threads

* use hash table lookup to find node index

* change use_counts to be indexed by hash table slot

* minimize hash lookups

style fixes

* last node doesn't need single use.
fix type.
handle mul operands being swapped.

* remove redundant parameter

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-07-01 17:54:53 +03:00
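The use-count check that the fusion commit above describes can be illustrated with a small sketch. The types and function names here (`Op`, `Node`, `count_uses`, `can_fuse_rms_norm_mul`) are hypothetical stand-ins, not the ggml API; the sketch only assumes what the message states: use counts live with the graph, the intermediate result must be used exactly once, and the MUL operands may appear in either order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature graph: srcs holds indices of producer nodes,
// with -1 standing for an external input (weights, activations, ...).
enum class Op { RMS_NORM, MUL, ADD };

struct Node {
    Op               op;
    std::vector<int> srcs;
};

// Count how many times each node's output is read by other nodes.
// The counts are kept outside the nodes, alongside the graph.
std::vector<int> count_uses(const std::vector<Node> & graph) {
    std::vector<int> uses(graph.size(), 0);
    for (const Node & node : graph) {
        for (int s : node.srcs) {
            if (s >= 0 && (size_t) s < graph.size()) {
                uses[s]++;
            }
        }
    }
    return uses;
}

// RMS_NORM at index i can be fused with the MUL at i + 1 only if the MUL is
// the sole consumer of the RMS_NORM output. Either MUL operand may be the
// RMS_NORM result.
bool can_fuse_rms_norm_mul(const std::vector<Node> & graph,
                           const std::vector<int> &  uses,
                           size_t                    i) {
    assert(uses.size() == graph.size());
    if (i + 1 >= graph.size()) {
        return false;
    }
    if (graph[i].op != Op::RMS_NORM || graph[i + 1].op != Op::MUL) {
        return false;
    }
    if (uses[i] != 1) {
        return false; // another node also needs the unfused intermediate
    }
    for (int s : graph[i + 1].srcs) {
        if (s == (int) i) {
            return true;
        }
    }
    return false;
}
```

The final MUL node is not required to have a single use, since its output still exists after fusion; only the intermediate RMS_NORM result disappears.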
Diego Devesa 6c0472ab8f sched : avoid changing cur_copy when a graph is already allocated (llama/13922) 2025-06-01 15:14:44 +03:00
Diego Devesa b75babebb2 ggml : allow CUDA graphs when using pipeline parallelism (llama/13814) 2025-05-27 18:03:00 +03:00
Johannes Gäßler 5d8b068249 llama/ggml: add LLM training support (llama/10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-13 13:59:21 +03:00
David Huang e1b2ace0f8 Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (llama/13386) 2025-05-13 13:59:21 +03:00
Johannes Gäßler 2ffdda99e8 CUDA: fix logic for clearing padding with -ngl 0 (llama/13320) 2025-05-07 21:00:32 +03:00
mgroeber9110 96a92ecc4c ggml : portability fixes for VS 2017 (llama/12150)
* Add include files for std::min/max and std::toupper/tolower

* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined

* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode

* win32: only use __restrict in MSVC if C11/C17 support is not enabled

---------

Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>
2025-03-08 15:13:01 +02:00
William Tambellini c98681e6d5 ggml : upgrade init_tensor API to return a ggml_status (llama/11854)
* Upgrade init_tensor API to return a ggml_status

To prepare for an 'abort-free' ggml
(ggml should not abort on OOMs but return an OOM status),
as agreed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.

* misc fixes

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-03-08 15:13:01 +02:00
Diego Devesa 09fabffdf5 ggml-backend : only offload from host buffers (fix) (llama/11124) 2025-01-14 10:38:01 +02:00
Diego Devesa 3988d6396b ggml-backend : only offload from host buffers (llama/11120) 2025-01-14 10:38:01 +02:00
Daniel Bevenius 6348d73e55 ggml : improve inputs log sched_print_assignments (ggml/1053)
This commit attempts to improve the log message for the inputs of the
splits in the sched_print_assignments function.

The motivation for this change is that currently even if there are no
inputs a colon is displayed at the end of the line, which can make it a
little confusing when reading the output as it could be interpreted as
the line below are inputs when they are in fact nodes. With this change
the colon will only be printed if there actually are inputs.
2025-01-04 10:45:01 +02:00
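The logging change in the commit above amounts to emitting the trailing colon only when a split actually has inputs. A minimal sketch, assuming an illustrative header format; the function name `format_split_header` and the exact text layout are hypothetical and not taken from the ggml source:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical formatter: builds the header line for one scheduler split.
// The colon and the input list are appended only if there are inputs, so a
// split without inputs is not mistaken for one whose inputs follow.
std::string format_split_header(int split_id,
                                const std::string & backend,
                                const std::vector<std::string> & inputs) {
    assert(split_id >= 0);
    std::string line = "## SPLIT #" + std::to_string(split_id) + ": " + backend +
                       " # " + std::to_string(inputs.size()) + " inputs";
    if (!inputs.empty()) {
        line += ":";
        for (const std::string & name : inputs) {
            line += " [" + name + "]";
        }
    }
    return line;
}
```

A split with no inputs then ends right after the input count, and the node lines printed below it cannot be misread as inputs.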
Diego Devesa 3daeacad24 ggml : move AMX to the CPU backend (llama/10570)
ggml : automatic selection of best CPU backend (llama/10606)
2024-12-08 20:14:35 +02:00
Johannes Gäßler 98f9916c9f ggml-opt: fix data corruption (ggml/1022) 2024-12-08 20:14:35 +02:00
slaren 9db070a3c5 ggml/sched : do not skip views in pre-assignments 2024-11-20 21:00:08 +02:00
Georgi Gerganov f4c1d7df39 ggml : sync resolve (skip) (#0) 2024-11-20 21:00:08 +02:00
Diego Devesa 0879d3599e llama : only use default buffer types for the KV cache (llama/10358) 2024-11-20 21:00:08 +02:00
Diego Devesa 24ad19d0e9 ggml : fix possible buffer use after free in sched reserve (llama/9930) 2024-11-20 21:00:08 +02:00
Johannes Gäßler c9541741e6 ggml: new optimization interface (ggml/988)
* ggml: new optimization interface

remove test2.c, test3.c

store adamw params in tensor

move grads from tensor to graph

* avoid segfault upon API misuse

* add ggml-opt.h to public headers

* remove dependence of ggml-opt.cpp on ggml-cpu.h
2024-11-20 21:00:08 +02:00
Georgi Gerganov bb12cd9b77 ggml : tmp workaround for whisper.cpp (skip) (#2565) 2024-11-16 20:21:24 +02:00
Diego Devesa 9c817edb48 ggml : move CPU backend to a separate file (llama/10144) 2024-11-15 15:21:04 +02:00
Diego Devesa 3e231ab9cc llama : fix buffer checks for mamba and rwkv (llama/10111)
* llama : fix buffer checks for mamba and rwkv

* llama : fix missing worst case flag during reserve

* cuda : fix supports_op for norm

* disable sched SET_CAUSE
2024-11-15 15:21:04 +02:00
Sergio López 1e122d66f9 kompute: add backend registry / device interfaces (llama/10045)
Get in line with the other backends by supporting the newer
backend/device registry interfaces.

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-11-15 15:21:04 +02:00
Diego Devesa 1d48457aa6 llama : refactor model loader with backend registry (llama/10026) 2024-11-15 15:21:04 +02:00
leo-pony 13db492f83 Adapt to dynamically loadable backends mechanism (llama/9970)
* [CANN] Adapt to dynamically loadable backends mechanism

* Fix the bug: the inference result is garbled when running in debug mode for LM models whose type is of the Q4_0 class

* Handle the review comments of this pull request
2024-11-01 10:19:05 +02:00