Commit Graph

3361 Commits

Author SHA1 Message Date
Aaron Teo e902731ccc
ggml-zdnn: fix #15414, activate FP16 and BF16 acceleration, and fix an incorrect zTensor free (llama/15839) 2025-09-20 13:45:28 +03:00
Ruben Ortlam 424c85f22a
Vulkan iGPU device selection overhaul and PCI ID API support (llama/15947)
* vulkan: implement ggml igpu device type, implement pci id support

* fix compiler warning

* prevent printf overflow warning
2025-09-20 13:45:28 +03:00
Mathieu Baudier 5a752bab84
vulkan: Make device memory check more portable (llama/15939) 2025-09-20 13:45:28 +03:00
Neo Zhang Jianyu cd764eaf2b
Revert "sycl: add usage of enqueue_functions extension (llama/14244)" (llama/15910)
* Revert "sycl: add usage of enqueue_functions extension (#14244)"

This reverts commit 8308f98c7fb778e54bf75538f5234d8bd20915e9.

* fix missed revert code, format the code
2025-09-20 13:45:28 +03:00
Diego Devesa 555dcb3e01
ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (llama/15797)
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type

ggml-backend : add device id to device props

llama : only use iGPU devices if there are no GPU devices

llama : do not use multiple devices from different backends with the same device id
2025-09-20 13:45:28 +03:00
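
The selection rules in this commit lend themselves to a short sketch. Below is a hedged C++ illustration of the policy described above (iGPUs only as a fallback, and no duplicate device ids across backends); the types and names are illustrative, not llama.cpp's actual code.

```cpp
#include <set>
#include <string>
#include <vector>

// illustrative stand-ins for the real backend device enumeration
enum DeviceType { GPU, IGPU, CPU_DEV };

struct Device {
    DeviceType  type;
    std::string backend; // e.g. "CUDA", "Vulkan"
    std::string id;      // device id from the new device props (e.g. PCI id)
};

std::vector<Device> select_devices(const std::vector<Device> & all) {
    bool has_gpu = false;
    for (const auto & d : all) {
        has_gpu = has_gpu || d.type == GPU;
    }

    std::vector<Device> out;
    std::set<std::string> seen_ids;
    for (const auto & d : all) {
        if (d.type == CPU_DEV)             continue;
        if (d.type == IGPU && has_gpu)     continue; // iGPUs only when no GPU exists
        if (!seen_ids.insert(d.id).second) continue; // same device via another backend
        out.push_back(d);
    }
    return out;
}
```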
Johannes Gäßler f0768eb575
CUDA: larger SRAM reads for tile FA, AMD FP16 dot (llama/15927)
* CUDA: larger SRAM reads for tile FA, AMD FP16 dot

* fix logic for availability of v_dot2_f32_f16
2025-09-20 13:45:28 +03:00
Daniel Bevenius 020eb19eb3
ggml-cpu : add check for ARM MATMUL_INT8/i8mm support (llama/15922)
This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling
MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when
the target hardware actually supports them.

The motivation for this is to fix ggml CI build failures where the
feature detection correctly identifies that i8mm is not supported,
adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are
still enabled, causing the compiler to attempt to use vmmlaq_s32
intrinsics without i8mm support.

Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499
2025-09-20 13:45:28 +03:00
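
For illustration, a minimal sketch of the compiler-side analogue of that build check, using the standard ACLE feature macro; the function itself is hypothetical, not ggml's actual kernel code.

```cpp
#include <arm_neon.h>

// hypothetical kernel fragment: vmmlaq_s32 is only referenced when the
// compiler actually advertises i8mm support (the +noi8mm case compiles
// the fallback branch instead)
int32x4_t accumulate_i8(int32x4_t acc, int8x16_t a, int8x16_t b) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return vmmlaq_s32(acc, a, b); // int8 matrix-multiply-accumulate intrinsic
#else
    (void) a; (void) b;           // scalar/dotprod fallback omitted for brevity
    return acc;
#endif
}
```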
Charles Xu b079d9c8b0
kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed (llama/15614)
* kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed

* removes the Whisper-specific check for GET_ROWS support
2025-09-20 13:45:27 +03:00
hipudding dadf73665a
CANN: Disable acl_graph for prefill stage (llama/15933)
Since the prefill length is not fixed, graphs constructed for the
prefill stage cannot be reused. For this reason, ACL graph
execution is disabled by default during prefill.
2025-09-20 13:45:27 +03:00
Oliver Simons f5ef0e25e2
CUDA: Add `fastdiv` to `k_bin_bcast*`, giving a 1-3% E2E performance gain (llama/15872)
* Add fastdiv and fastmodulo to k_bin_bcast kernel

* Address review comments

* `prod_` instead of `prod` suffix

* Add test case for `k_bin_bcast_unravel` in CUDA backend
2025-09-20 13:45:27 +03:00
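
fastdiv replaces a runtime division by a kernel-invariant divisor with a multiply and a shift. A plain C++ sketch of the classic Granlund-Montgomery round-up variant follows; it is not the actual CUDA kernel code, and the names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

struct FastDiv {
    uint32_t mp; // magic multiplier
    uint32_t L;  // shift, L = ceil(log2(d))
    uint32_t d;  // original divisor
};

FastDiv fastdiv_init(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint64_t(1) << L) < d) {
        ++L;
    }
    const uint32_t mp = (uint32_t)(((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
    return {mp, L, d};
}

uint32_t fastdiv(uint32_t n, FastDiv f) {
    // 64-bit intermediate keeps the (hi + n) sum from overflowing
    const uint64_t hi = ((uint64_t) f.mp * n) >> 32;
    return (uint32_t)((hi + n) >> f.L);
}

uint32_t fastmod(uint32_t n, FastDiv f) {
    return n - fastdiv(n, f) * f.d;
}

int main() {
    const FastDiv f = fastdiv_init(7);
    assert(fastdiv(48, f) == 48 / 7);
    assert(fastmod(48, f) == 48 % 7);
}
```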
Daniel Bevenius 3617008c37
ggml-cpu : fix padding in ggml_timestep_embedding (llama/15917)
This commit fixes the zero padding for odd dimensions in
ggml_compute_forward_timestep_embedding_f32.
The motivation for this is that currently if an odd dimension is used,
the padding check incorrectly uses the dimension value for indexing.
For example, with dim=15:

- Elements 0-6 are set to cosine values
- Elements 7-13 are set to sine values
- Element 14 is left uninitialized (contains garbage)
- Element 15 is correctly set to zero

This fix changes embed_data[dim] to embed_data[2 * half] so that
element 14 (the first unused element) is properly set to zero, as well
as the last element.

Resolves: https://github.com/ggml-org/ggml/issues/1324
2025-09-20 13:45:27 +03:00
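
A simplified sketch of the corrected padding logic, assuming the loop structure the message describes (this is not the verbatim ggml source):

```cpp
#include <math.h>

// writes cos values to [0, half), sin values to [half, 2*half), and for an
// odd dim zeroes the single trailing element at index 2*half (not dim)
static void timestep_embedding(float * embed_data, int dim, float timestep, int max_period) {
    const int half = dim / 2;
    for (int i = 0; i < half; i++) {
        const float freq = expf(-logf((float) max_period) * i / half);
        const float arg  = timestep * freq;
        embed_data[i]        = cosf(arg);
        embed_data[half + i] = sinf(arg);
    }
    if (dim % 2 != 0) {
        embed_data[2 * half] = 0.0f; // previously embed_data[dim], off by one for odd dim
    }
}
```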
Georgi Gerganov c974f63057
sync : ggml 2025-09-20 13:44:48 +03:00
Georgi Gerganov 7eae055e61
metal : make the backend async (llama/15906) 2025-09-20 13:44:27 +03:00
Georgi Gerganov e2c7f1cccd
sync : ggml 2025-09-20 13:43:01 +03:00
Chenguang Li 4d453b14a9
CANN: Add ROPE sin/cos cache for reuse (llama/15912)
* CANN: Add ROPE sin/cos cache for reuse

Introduce sin/cos caching mechanism in ROPE to avoid redundant
computation across layers. The cache is built on the first layer
per device and reused by subsequent layers if parameters match.

- Added sin_cache / cos_cache pointers and position_length tracking
- Introduced cache validity flags and properties:
  (ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
- Accelerates ROPE by eliminating repeated sin/cos generation

This change reduces overhead in multi-layer scenarios while
preserving correctness by verifying parameter consistency.

Co-authored-by: hipudding <huafengchun@gmail.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-20 13:42:53 +03:00
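
A hedged sketch of the cache-validity check described above; the struct layout and names are illustrative, not the CANN backend's actual code.

```cpp
#include <cstdint>

struct RopeCache {
    void *  sin_cache = nullptr;
    void *  cos_cache = nullptr;
    int64_t position_length = 0;
    bool    valid = false;
    // parameters the cached tables were built with
    float ext_factor, theta_scale, freq_scale, attn_factor;
    bool  is_neox;
};

// reuse is allowed only when every parameter matches and the cached tables
// cover at least as many positions as the current layer needs
static bool rope_cache_matches(const RopeCache & c, float ext_factor, float theta_scale,
                               float freq_scale, float attn_factor, bool is_neox,
                               int64_t pos_len) {
    return c.valid
        && c.ext_factor      == ext_factor
        && c.theta_scale     == theta_scale
        && c.freq_scale      == freq_scale
        && c.attn_factor     == attn_factor
        && c.is_neox         == is_neox
        && c.position_length >= pos_len;
}
```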
Chenguang Li 9b773acac0
CANN: implement LRU cache for ACL graphs (llama/15814)
* CANN: implement LRU cache for ACL graphs in CANN backend

- Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
- Graphs are loaded on demand and evicted using an LRU policy when capacity is exceeded.
- Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
- Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend.

* fix typo

* The LRU cache capacity can be configured via an env variable

Signed-off-by: noemotiovon <757486878@qq.com>

* refactor acl graph

* refactor && fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-20 13:42:53 +03:00
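
A minimal generic sketch of the policy described above (illustrative only, not the actual ggml_cann_graph_lru_cache; the environment variable name is also made up for the example):

```cpp
#include <cstdlib>
#include <list>
#include <memory>

template <typename Graph>
class LruCache {
    size_t capacity;
    std::list<std::shared_ptr<Graph>> items; // front = most recently used

public:
    LruCache() {
        const char * env = std::getenv("GRAPH_LRU_CAPACITY"); // illustrative name
        capacity = env ? std::strtoul(env, nullptr, 10) : 8;
    }

    void push(std::shared_ptr<Graph> g) {
        if (!items.empty() && items.size() >= capacity) {
            items.pop_back(); // evict the least recently used graph
        }
        items.push_front(std::move(g));
    }

    // a cache hit moves the graph to the front without copying it
    void move_to_front(typename std::list<std::shared_ptr<Graph>>::iterator it) {
        items.splice(items.begin(), items, it);
    }

    void clear() { items.clear(); }
};
```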
Ruben Ortlam 7abe187860
vulkan: throw the oom error instead of no memory type found (llama/15905) 2025-09-20 13:42:53 +03:00
Jeff Bolz d0e98656c3
vulkan: Fix OOB accesses in soft_max_back (llama/15861) 2025-09-20 13:42:52 +03:00
Johannes Gäßler e35d1375ee
HIP: use v_dot2_f32_f16 instruction for FA (llama/15884) 2025-09-20 13:42:52 +03:00
lksj92hs 7fbbb67b47
Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (llama/15886) 2025-09-20 13:42:52 +03:00
Aman Gupta 621764b1a5
CUDA: Add mul_mat_id support for the mmf kernel (llama/15767)
* CUDA: Add mul_mat_id support to the mmf kernel

Add support for mul_mat_id for bs < 16

* Review: use warp_size, fix should_use_mmf condition

* Launch one block per expert, stride along n_expert_used

* templatize mul_mat_id

* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids

* Reduce compile times by dividing mmf into f16, bf16 and f32 variants

* Divide mmf by ncols_dst

* Add missing files

* Fix MUSA/HIP builds
2025-09-20 13:42:52 +03:00
Johannes Gäßler 260982232c
CUDA: fix GET_ROWS for large tensors (llama/15882) 2025-09-20 13:42:52 +03:00
Jeff Bolz c29cd54818
vulkan: sort graph to allow more parallel execution (llama/15850)
* vulkan: sort graph to allow more parallel execution

Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.

With #15489, this reduces the number of synchronizations needed.

* call optimize_graph per-split
2025-09-20 13:42:52 +03:00
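
The greedy reordering can be pictured as a batched topological sort: repeatedly emit every node whose inputs have already been emitted, so mutually independent nodes end up adjacent. A minimal illustrative C++ sketch (not the actual Vulkan backend code; assumes an acyclic graph):

```cpp
#include <vector>

struct Node {
    std::vector<int> inputs; // indices of nodes this node depends on
};

std::vector<int> reorder(const std::vector<Node> & nodes) {
    std::vector<int>  order;
    std::vector<bool> emitted(nodes.size(), false);

    while (order.size() < nodes.size()) {
        // one batch = every pending node whose inputs are all emitted;
        // nodes within a batch cannot depend on each other
        std::vector<int> batch;
        for (int i = 0; i < (int) nodes.size(); ++i) {
            if (emitted[i]) continue;
            bool ready = true;
            for (int in : nodes[i].inputs) ready = ready && emitted[in];
            if (ready) batch.push_back(i);
        }
        for (int i : batch) emitted[i] = true;
        order.insert(order.end(), batch.begin(), batch.end());
    }
    return order;
}
```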
Aman Gupta 70ee808f3d
CUDA: generate_cu_files.py - add missing mxfp4 (llama/15880) 2025-09-20 13:42:52 +03:00
Georgi Gerganov ae6cc6a386
cuda : fix supports_op condition for get_rows when number of blocks is too large (llama/15868)
* cuda : fix supports_op condition for get_rows when src1->ne2 > 1

ggml-ci

* ggml : add comment about ggml_get_rows

ggml-ci

* cuda : add FIXME [no ci]

* cuda : update support condition

ggml-ci
2025-09-20 13:42:52 +03:00
Georgi Gerganov e9cb59e970
metal : refactor + optimize (llama/15857) 2025-09-20 13:42:51 +03:00
Xuan-Son Nguyen 40bcd1a469
ggml: allow casting between f32 and i32 (llama/15783)
* ggml: allow casting between f32 and i32

* fix cuda

* add vulkan

* fix CPU non-cont

* add non-cont test case

* add note

* extend test number range

* correct note

* add cont version for vulkan
2025-09-20 13:42:51 +03:00
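
Assuming the standard ggml_cast API, using the newly allowed conversion pair would look roughly like this (hedged usage sketch):

```cpp
#include "ggml.h"

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * t_f32 = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 16);
    struct ggml_tensor * t_i32 = ggml_cast(ctx, t_f32, GGML_TYPE_I32); // f32 -> i32
    struct ggml_tensor * t_rt  = ggml_cast(ctx, t_i32, GGML_TYPE_F32); // i32 -> f32 round trip
    (void) t_rt;

    ggml_free(ctx);
}
```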
Sigbjørn Skjæret 0175a1df8d
CUDA: non-contiguous src0 not supported for PAD (llama/15869) 2025-09-20 13:42:51 +03:00
Chenguang Li d9c0ead2ab
CANN: Stream sync between devices for acl_graph (llama/15809)
* CANN: Switch to stream synchronization

Switch to stream synchronization because events are not effective.

Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: add Comments

---------

Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-20 13:42:51 +03:00
Jeff Bolz dfa7722e2e
vulkan: support im2col_3d (llama/15795) 2025-09-20 13:42:51 +03:00
Aaron Teo db4f504b69
ggml-cpu: clean up s390x SIMD (llama/15855)
* ggml-cpu: clean up s390x simd

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0da4b6aa07d96b758812d17b2c82267632fa4ba5)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix hsum data types

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-20 13:42:51 +03:00
Jeff Bolz 9523fd8de6
vulkan: Support pad_ext (llama/15794) 2025-09-20 13:42:51 +03:00
Jeff Bolz 647e2d7de5
vulkan: Use larger loads in scalar/coopmat1 matmul (llama/15729)
I think glslang will translate an access like x[i][1].z to
OpAccessChain ... x, i, 1, 2
OpLoad float16_t ...

rather than loading all of x[i] in a single OpLoad. Change the
code to explicitly load the vector/matrix.
2025-09-20 13:42:51 +03:00
Daniel Bevenius cda7d4e5ac
ggml WebGPU: remove userdata from request adapter callback (llama/15527)
* ggml WebGPU: remove userdata from request adapter callback

This commit removes the `userdata` parameter from the WebGPU request
adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function
captures the `webgpu_context` directly.

The motivation for this change is to simplify the code and improve
readability.

* inline the callback lambda into the RequestAdapter call

This commit removes the callback lambda variable and inlines it directly
into the RequestAdapter call.
2025-09-20 13:42:50 +03:00
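
The pattern is the usual trade of a C-style userdata pointer for a capturing lambda. An illustrative sketch (the function below is a stand-in, not the real webgpu.h/webgpu.hpp API):

```cpp
#include <functional>
#include <memory>

struct webgpu_context_t { bool adapter_ready = false; };
using webgpu_context = std::shared_ptr<webgpu_context_t>;

// hypothetical stand-in for instance.RequestAdapter(...)
static void request_adapter(std::function<void(bool ok)> on_done) { on_done(true); }

void init_adapter(webgpu_context ctx) {
    // before: request_adapter_c(callback_fn, /*userdata=*/ctx.get());
    // after: the lambda captures ctx directly, so no userdata is threaded through
    request_adapter([ctx](bool ok) {
        ctx->adapter_ready = ok;
    });
}
```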
Johannes Gäßler cd70d89628
CUDA: faster tile FA (Pascal/AMD), headsize 256 (llama/15769) 2025-09-20 13:42:50 +03:00
Charles Xu be2676bb1c
kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16 (llama/15817) 2025-09-20 13:42:50 +03:00
Johannes Gäßler 69400f16f1
ggml-cpu: document use of "free" memory [no ci] (llama/15834) 2025-09-20 13:42:50 +03:00
Aaron Teo f499271c4e
ggml-cpu: drop support for nnpa intrinsics (llama/15821) 2025-09-20 13:42:50 +03:00
Johannes Gäßler 6ff468cfaa
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (llama/15802)
* CUDA: fastdiv, launch bounds for mmvq + q8_1 quant
2025-09-20 13:42:50 +03:00
Daniel Bevenius 4d6e1144b1
ggml : introduce semantic versioning (ggml/1336)
* ggml : introduce semantic versioning

This commit introduces semantic versioning for the GGML library.

The motivation for this is that the current versioning, using build
numbers, makes it difficult to track changes and releases for projects
that use ggml.

The release steps are the following:
1. Sync the changes from llama.cpp using sync-llama-am.sh; after the
   PR has been approved and merged, move to step 2.
2. Run scripts/release.sh and specify the type of release, major, minor,
   or patch. This script will handle incrementing the version
   (major|minor|patch), create a new commit with the version change,
   create a tag for the version, and prepare for the next development
   iteration.
3. Inspect the commits/tag and push to master. This will trigger the
   GitHub release workflow, which runs for new tags and publishes a new
   release on GitHub.

Example usage:
```console
$ ./scripts/release.sh major --dry-run
[dry-run] - No changes will be made

Step 1: Reading current version...
Current version: 0.9.0-dev
New release version: 1.0.0

Step 2: Updating version in CMakeLists.txt...
  [dry-run] Would update GGML_VERSION_MAJOR to 1
  [dry-run] Would update GGML_VERSION_MINOR to 0
  [dry-run] Would update GGML_VERSION_PATCH to 0
  [dry-run] Would remove -dev suffix

Step 3: Committing version bump...
  [dry-run] Would commit: 'ggml : bump version to 1.0.0'

Step 4: Creating git tag...
  [dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0'

Step 5: Preparing for next development cycle...
  [dry-run] Would update GGML_VERSION_MINOR to 1
  [dry-run] Would add -dev suffix back

Step 6: Committing development version...
  [dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev'

[dry-run] Summary (no changes were made):
  • Would have released version: 1.0.0
  • Would have created tag: v1.0.0
  • Would have set next development version: 1.1.0-dev
```

Refs: https://github.com/ggml-org/ggml/issues/1333

* ggml: create branch for release candidate and check master

* ggml : sign the git tag
2025-09-20 13:42:50 +03:00
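
If the CMake version variables shown in the dry run are also exposed as preprocessor definitions (an assumption; the output above only shows the CMake side), a downstream project could pin a minimum ggml version at compile time, roughly:

```cpp
// hedged sketch: assumes GGML_VERSION_MAJOR is visible to the preprocessor
#if !defined(GGML_VERSION_MAJOR)
#error "ggml version macros not available in this build"
#elif GGML_VERSION_MAJOR < 1
#error "this project requires ggml >= 1.0.0"
#endif
```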
Gregor Jasny c80f78cc7b
CUDA : conditionally add cuda architectures (ggml/1341) 2025-09-20 13:42:50 +03:00
Gabe Goodhart ffe560cbb1
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (llama/15799)
Branch: GGMLMetalNE20

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-09-20 13:42:49 +03:00
Chenguang Li 3780a3c917
CANN: Refactor ND to NZ workspace to be per-device (llama/15763)
* CANN: Refactor ND to NZ workspace to be per-device in Ascend backend

- Replaced the previous single global ND→NZ workspace with a per-device
  cache using unordered_map keyed by device ID.
- Functions `release_nz_workspace`, `relloc_nz_workspace`, and
  `get_nz_workspace` now manage workspace independently for each device,
  preventing memory conflicts in multi-device / pipeline parallel scenarios.
- This change fixes potential precision issues caused by workspace
  overwrites when multiple devices perform ND→NZ conversions concurrently.

Co-authored-by: hipudding <huafengchun@gmail.com>

* refactor

Signed-off-by: noemotiovon <757486878@qq.com>

* rename

Signed-off-by: noemotiovon <757486878@qq.com>

* fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-20 13:42:49 +03:00
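
A sketch of the per-device workspace pattern described above (names illustrative; real code would go through the device allocator):

```cpp
#include <cstddef>
#include <unordered_map>

struct nz_workspace {
    void * buf  = nullptr;
    size_t size = 0;
};

// one ND -> NZ buffer per device id instead of a single global one, so
// concurrent devices cannot overwrite each other's conversion scratch space
static std::unordered_map<int, nz_workspace> g_nz_workspaces;

nz_workspace & get_nz_workspace(int device, size_t required) {
    nz_workspace & ws = g_nz_workspaces[device];
    if (ws.size < required) {
        // (re)allocation through the device allocator omitted in this sketch
        ws.size = required;
    }
    return ws;
}

void release_nz_workspace(int device) {
    g_nz_workspaces.erase(device); // drops only this device's entry
}
```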
leejet 2228462b19
ggml: add ops for WAN video model (cuda && cpu) (llama/15669)
* add conv3d support

* add ggml_pad_ext for cpu & cuda backend

* cuda/cpu: add im2col_3d support

* cuda: make im2col a little faster

* fix cuda pad/scale/im2col3d

* make im2col_3d faster

* gguf: support loading tensors which n_dims > GGML_MAX_DIMS

* fix cuda get_rows

* avoid ggml_conv_3d conflict

* correct GGML_OP_COUNT assertion

* avoid build failure

* avoid build failure on macOS

* cuda: remove unnecessary MIN define

* fix cpu im2col_3d

* adjust the code style

* cuda: use simpler loop in get_rows

* add test_im2col_3d to test-backend-ops

* test-backend-ops.cpp: remove trailing whitespace

* cpu: im2col_3d supports non-contiguous src

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* fix test_im2col_3d

* remove unused variables

* cuda: get_rows: dfloat2 -> float2

* add test_pad_ext to test-backend-ops.cpp

* add gguf_init_from_file_ext impl

* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS"

This reverts commit d8377a0a37f314bd3713fe043b4333ad661610c1.

* Revert "add gguf_init_from_file_ext impl"

This reverts commit d9f1d13208c68ef83b3538201ac7f31614fb1994.

* update ggml_backend_vk_device_supports_op

* fix ggml_backend_vk_device_supports_op

* update other backend supports op for ggml_pad_ext

* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-09-20 13:42:49 +03:00
hipudding 96efb472b4
CANN: Fix precision issue on 310I DUO multi-devices (llama/15784) 2025-09-20 13:42:49 +03:00
rmatif 1569daf524
opencl: add hs=40 to FA (llama/15758) 2025-09-20 13:42:49 +03:00
Chenguang Li 5c860e94c6
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (llama/15760)
Fixes #15330

Adjust the allocation size of acl_rstd. The parameter `dims` is set to 3 according to the CANN documentation.

Co-authored-by: Yuchuan <yuchuan-cao@users.noreply.github.com>
2025-09-20 13:42:49 +03:00
Ruben Ortlam 719a05c665
vulkan: fix mmv subgroup16 selection (llama/15775) 2025-09-20 13:42:49 +03:00
Jeff Bolz 4a702a867c
vulkan: don't use std::string in load_shaders, to improve compile time (llama/15724)
* vulkan: don't use std::string in load_shaders, to improve compile time

* keep the string version for those calls that use it
2025-09-20 13:42:49 +03:00
Daniel Bevenius 4144ae10e9
vulkan : update ggml_vk_instance_validation_ext_available (llama/15666)
* vulkan : update ggml_vk_instance_validation_ext_available

This commit updates ggml_vk_instance_validation_ext_available() to
check for VK_EXT_validation_features instead of
VK_KHR_portability_enumeration.

Based on how the returned boolean is used later in the code (to enable
both the validation layer and the VK_EXT_validation_features extension),
it appears the function may have been intended to check for the
validation layer features extension.

* remove try/catch

This was a leftover from a previous iteration where I was explicitly
querying for a specific validation layer first, which would throw.

* update warning message about validation layers
2025-09-20 13:42:48 +03:00
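
For reference, a minimal sketch of such a check against the standard Vulkan instance-extension enumeration (this assumes the stock Vulkan API, not ggml's exact code):

```cpp
#include <cstring>
#include <vector>
#include <vulkan/vulkan.h>

// true if the VK_EXT_validation_features instance extension is available
bool validation_features_available() {
    uint32_t count = 0;
    vkEnumerateInstanceExtensionProperties(nullptr, &count, nullptr);

    std::vector<VkExtensionProperties> exts(count);
    vkEnumerateInstanceExtensionProperties(nullptr, &count, exts.data());

    for (const auto & e : exts) {
        if (std::strcmp(e.extensionName, VK_EXT_VALIDATION_FEATURES_EXTENSION_NAME) == 0) {
            return true;
        }
    }
    return false;
}
```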