Commit Graph

52 Commits

Author SHA1 Message Date
Johannes Gäßler aaf3f39b4a llama: automatically set parameters not set by the user in such a way that maximizes GPU utilization (llama/16653)
* llama: automatically fit args to free memory

llama-fit-params tool

* fix CI

* hints for bug reports, ensure no reallocation

* fix segfault with Vulkan

* add llama-fit-params to CI

* fix CI

* fix CI

* fix CI

* minor adjustments

* fix assignment of 1 dense layer

* fix logger not being reset on model load failure

* remove --n-gpu-layer hint on model load failure

* fix llama-fit-params verbosity

* fix edge case

* fix typo [no ci]
2025-12-18 08:20:56 +02:00
Daniel Bevenius 201b910743
ggml : remove redundant n_copies check when setting input/output (llama/17612)
This commit removes a redundant check for sched->n_copies > 1 when
setting input and output flags on tensor copies in
ggml_backend_sched_split_graph.

The motivation for this change is to clarify the code as the outer if
statement already performs this check.
2025-12-12 17:53:15 +02:00
Georgi Gerganov 7cd3de89bf
ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler (llama/17617) 2025-12-12 17:53:14 +02:00
Diego Devesa 463003e76c
ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (llama/17276)
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.

* llama : update worst-case graph for unified cache

* ci : disable op offload in some tests

* fix spelling

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 17:53:12 +02:00
Diego Devesa 210f0f860b sched : fix reserve ignoring user tensor assignments (llama/17232) 2025-11-17 21:05:46 +02:00
Johannes Gäßler cd431223e0
llama: print memory breakdown on exit (llama/15860)
* llama: print memory breakdown on exit
2025-09-29 15:18:10 +03:00
Jeff Bolz 7fcb7e83ec
rename optimize_graph to graph_optimize (llama/16082) 2025-09-20 13:46:39 +03:00
Jeff Bolz c29cd54818
vulkan: sort graph to allow more parallel execution (llama/15850)
* vulkan: sort graph to allow more parallel execution

Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.

With #15489, this reduces the number of synchronizations needed.

* call optimize_graph per-split
2025-09-20 13:42:52 +03:00
Johannes Gäßler f20a7b0e99
ggml-backend: raise GGML_MAX_SPLIT_INPUTS (llama/15722) 2025-09-20 13:42:47 +03:00
Diego Devesa b11c972b88
llama : separate compute buffer reserve from fattn check (llama/15696)
Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.
2025-09-20 13:42:45 +03:00
Johannes Gäßler f6ba3949b6
llama: use FA + max. GPU layers by default (llama/15434)
* llama: use max. GPU layers by default, auto -fa

* ggml-backend: abort instead of segfault
2025-09-20 13:42:44 +03:00
Diego Devesa 554f96f385
sched : fix possible use of wrong ids tensor when offloading moe prompt processing (llama/15488) 2025-09-20 13:42:39 +03:00
Diego Devesa 622dec5bf6
sched : copy only the used experts when offloading prompt processing (llama/15346) 2025-09-20 13:42:38 +03:00
Diego Devesa 6fb55d8f7c ggml : fix fallback to CPU for ununsupported ops (llama/15118) 2025-08-18 20:30:45 +03:00
Diego Devesa 270fa9b25c sched : fix multiple evaluations of the same graph with pipeline parallelism (llama/14855)
ggml-ci
2025-07-28 13:02:32 +03:00
Georgi Gerganov 0ed687c6f1 metal : fuse add, mul + add tests (llama/14596)
ggml-ci
2025-07-20 00:23:50 +03:00
Jeff Bolz 00b36237ba vulkan: Add fusion support for RMS_NORM+MUL (llama/14366)
* vulkan: Add fusion support for RMS_NORM+MUL

- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.

* extract some common fusion logic

* fix -Winconsistent-missing-override

* move ggml_can_fuse to a common function

* build fix

* C and C++ versions of can_fuse

* move use count to the graph to avoid data races and double increments when used in multiple threads

* use hash table lookup to find node index

* change use_counts to be indexed by hash table slot

* minimize hash lookups

style fixes

* last node doesn't need single use.
fix type.
handle mul operands being swapped.

* remove redundant parameter

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-07-01 17:54:53 +03:00
Diego Devesa 6c0472ab8f sched : avoid changing cur_copy when a graph is already allocated (llama/13922) 2025-06-01 15:14:44 +03:00
Diego Devesa b75babebb2 ggml : allow CUDA graphs when using pipeline parallelism (llama/13814) 2025-05-27 18:03:00 +03:00
Johannes Gäßler 5d8b068249 llama/ggml: add LLM training support (llama/10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-13 13:59:21 +03:00
David Huang e1b2ace0f8 Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (llama/13386) 2025-05-13 13:59:21 +03:00
Johannes Gäßler 2ffdda99e8 CUDA: fix logic for clearing padding with -ngl 0 (llama/13320) 2025-05-07 21:00:32 +03:00
mgroeber9110 96a92ecc4c ggml : portability fixes for VS 2017 (llama/12150)
* Add include files for std::min/max and std::toupper/tolower

* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined

* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode

* win32: only use __restrict in MSVC if C11/C17 support is not enabled

---------

Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>
2025-03-08 15:13:01 +02:00
William Tambellini c98681e6d5 ggml : upgrade init_tensor API to return a ggml_status (llama/11854)
* Upgrade init_tensor API to return a ggml_status

To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return a OOM status),
as agreeed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.

* misc fixes

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-03-08 15:13:01 +02:00
Diego Devesa 09fabffdf5 ggml-backend : only offload from host buffers (fix) (llama/11124) 2025-01-14 10:38:01 +02:00
Diego Devesa 3988d6396b ggml-backend : only offload from host buffers (llama/11120) 2025-01-14 10:38:01 +02:00
Daniel Bevenius 6348d73e55 ggml : improve inputs log sched_print_assignments (ggml/1053)
This commit attempts to improve the log message for the inputs of the
splits in the sched_print_assignments function.

The motivation for this change is that currently even if there are no
inputs a colon is displayed at the end of the line, which can make it a
little confusing when reading the output as it could be interpreted as
the line below are inputs when they are in fact nodes. With this change
the colon will only be printed if there actually are inputs.
2025-01-04 10:45:01 +02:00
Diego Devesa 3daeacad24 ggml : move AMX to the CPU backend (llama/10570)
ggml : automatic selection of best CPU backend (llama/10606)
2024-12-08 20:14:35 +02:00
Johannes Gäßler 98f9916c9f ggml-opt: fix data corruption (ggml/1022) 2024-12-08 20:14:35 +02:00
slaren 9db070a3c5 ggml/sched : do not skip views in pre-assignments 2024-11-20 21:00:08 +02:00
Georgi Gerganov f4c1d7df39 ggml : sync resolve (skip) (#0) 2024-11-20 21:00:08 +02:00
Diego Devesa 0879d3599e llama : only use default buffer types for the KV cache (llama/10358) 2024-11-20 21:00:08 +02:00
Diego Devesa 24ad19d0e9 ggml : fix possible buffer use after free in sched reserve (llama/9930) 2024-11-20 21:00:08 +02:00
Johannes Gäßler c9541741e6 ggml: new optimization interface (ggml/988)
* ggml: new optimization interface

remove test2.c, test3.c

store adamw params in tensor

move grads from tensor to graph

* avoid segfault upon API misuse

* add ggml-opt.h to public headers

* remove dependence of ggml-opt.cpp on ggml-cpu.h
2024-11-20 21:00:08 +02:00
Georgi Gerganov bb12cd9b77
ggml : tmp workaround for whisper.cpp (skip) (#2565) 2024-11-16 20:21:24 +02:00
Diego Devesa 9c817edb48 ggml : move CPU backend to a separate file (llama/10144) 2024-11-15 15:21:04 +02:00
Diego Devesa 3e231ab9cc llama : fix buffer checks for mamba and rwk (llama/10111)
* llama : fix buffer checks for mamba and rwk

* llama : fix missing worst case flag during reserve

* cuda : fix supports_op for norm

* disable sched SET_CAUSE
2024-11-15 15:21:04 +02:00
Sergio López 1e122d66f9 kompute: add backend registry / device interfaces (llama/10045)
Get in line with the other backends by supporting the newer
backend/device registry interfaces.

Signed-off-by: Sergio Lopez <slp@redhat.com>
2024-11-15 15:21:04 +02:00
Diego Devesa 1d48457aa6 llama : refactor model loader with backend registry (llama/10026) 2024-11-15 15:21:04 +02:00
leo-pony 13db492f83 Adapt to dynamically loadable backends mechanism (llama/9970)
* [CANN] Adapt to dynamically loadable backends mechanism

* Fix the Bug: inference running result is garbled in debug running model for LM models who's type is Q4_0 class

* Handle the review comments of this pull request
2024-11-01 10:19:05 +02:00
Ouadie EL FAROUKI a4a22daa8f Add SYCL Backend registry, device and Event Interfaces (llama/9705)
* implemented missing SYCL event APIs

* sycl : Added device and backend reg interfaces

* Restructured ggml-sycl.cpp
2024-11-01 10:19:05 +02:00
Ma Mingfei e1936eb2a5 add amx kernel for gemm (llama/8998)
add intel amx isa detection

add vnni kernel for gemv cases

add vnni and amx kernel support for block_q8_0

code cleanup

fix packing B issue

enable openmp

fine tune amx kernel

switch to aten parallel pattern

add error message for nested parallelism

code cleanup

add f16 support in ggml-amx

add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS

update CMakeList

update README

fix some compilation warning

fix compiler warning when amx is not enabled

minor change

ggml-ci

move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp

ggml-ci

update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16

ggml-ci

add amx as an ggml-backend

update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h

minor change

update CMakeLists.txt

minor change

apply weight prepacking in set_tensor method in ggml-backend

fix compile error

ggml-ci

minor change

ggml-ci

update CMakeLists.txt

ggml-ci

add march dependency

minor change

ggml-ci

change ggml_backend_buffer_is_host to return false for amx backend

ggml-ci

fix supports_op

use device reg for AMX backend

ggml-ci

minor change

ggml-ci

minor change

fix rebase

set .buffer_from_host_ptr to be false for AMX backend
2024-11-01 10:19:05 +02:00
Diego Devesa 28b044dad9 vulkan : add backend registry / device interfaces (llama/9721)
* vulkan : add backend registry / device interfaces

* llama : print devices used on model load
2024-11-01 10:19:05 +02:00
Gilad S b8f11a0a17 fix: allocating CPU buffer with size `0` (llama/9917) 2024-11-01 10:19:05 +02:00
Gilad S ff5a838099 fix: use `vm_allocate` to allocate CPU backend buffer on macOS (llama/9875)
* fix: use `vm_allocate` to allocate CPU backend buffer on macOS

* fix: switch to `posix_memalign` to keep existing `free()` usages work

* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS

* style: formatting

* fix: move const outside of `#ifndef`

* style: formatting

* fix: unused var

* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`

* fix: unused var

* fix: page align to `GGUF_DEFAULT_ALIGNMENT`

* fix: page align to `TENSOR_ALIGNMENT`

* fix: convert `TENSOR_ALIGNMENT` to a macro

* fix: increase page size to `32` on iOS

* fix: iOS page size

* fix: `hbw_posix_memalign` alignment
2024-11-01 10:19:05 +02:00
Diego Devesa 81110c0174 ggml : move more prints to the ggml log system (llama/9839)
* ggml : move more prints to the ggml log system

* show BLAS OpenMP warnings in all builds using debug print
2024-11-01 10:19:05 +02:00
Diego Devesa c313723860 rpc : add backend registry / device interfaces (llama/9812)
* rpc : add backend registry / device interfaces

* llama : add llama_supports_rpc API

* ggml_backend_rpc_start_rpc_server -> ggml_backend_rpc_start_server
2024-11-01 10:19:05 +02:00
Diego Devesa 1531259b2c ggml : fix BLAS with unsupported types (llama/9775)
* ggml : do not use BLAS with types without to_float

* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies

* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits

it's not really internal if everybody uses it
2024-11-01 10:19:05 +02:00
Diego Devesa 44bc2767fd ggml : add backend registry / device interfaces to BLAS backend (llama/9752)
* ggml : add backend registry / device interfaces to BLAS backend

* fix mmap usage when using host buffers
2024-11-01 10:19:05 +02:00
Georgi Gerganov 315364d7de ggml : add metal backend registry / device (llama/9713)
* ggml : add metal backend registry / device

ggml-ci

* metal : fix names [no ci]

* metal : global registry and device instances

ggml-ci

* cont : alternative initialization of global objects

ggml-ci

* llama : adapt to backend changes

ggml-ci

* fixes

* metal : fix indent

* metal : fix build when MTLGPUFamilyApple3 is not available

ggml-ci

* fix merge

* metal : avoid unnecessary singleton accesses

ggml-ci

* metal : minor fix [no ci]

* metal : g_state -> g_ggml_ctx_dev_main [no ci]

* metal : avoid reference of device context in the backend context

ggml-ci

* metal : minor [no ci]

* metal : fix maxTransferRate check

* metal : remove transfer rate stuff

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-11-01 10:19:05 +02:00