Commit Graph

4079 Commits

Author SHA1 Message Date
Georgi Gerganov de949fb1db metal : consolidate unary ops (llama/19490) 2026-02-15 21:44:37 +02:00
Oliver Simons 57c620b4b1 CUDA : Update CCCL-tag for 3.2 to final release from RC (llama/19486)
CCCL 3.2 has been released since it was added to llama.cpp as part of
the backend-sampling PR, so it makes sense to update from the RC to the
final released version.

https://github.com/NVIDIA/cccl/releases/tag/v3.2.0
2026-02-15 21:44:37 +02:00
Nikhil Jain 562255fd77 Plug memory leaks and free resources on shutdown (llama/19315)
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool

* Free pools

* Cleanup

* More cleanup

* Run clang-format

* Fix arg-parser and tokenizer test errors that free an unallocated buffer

* Fix device lost callback to not print on device teardown

* Fix include and run clang-format

* remove unused code

* Update binary ops

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-02-15 21:44:37 +02:00
Alberto Cabrera Pérez d77265c818 ggml-cpu: arm64: q6_K repack gemm and gemv (and generic) implementations (dotprod) (llama/19360)
* First working version of GEMM and GEMV

* interleave loads and compute

* Clang-format

* Added missing fallback. Removed tested TODO.

* Swap M and N to be consistent with the repack template convention
2026-02-15 21:44:37 +02:00
k4ss4n b0fe2e84fa ggml : use noexcept overload for is_regular_file in backend registration (llama/19452)
Using the noexcept std::filesystem::directory_entry::is_regular_file
overload prevents abnormal termination when an error would otherwise be
thrown (as caused by symlinks to non-existent folders on Linux).

Resolves: #18560
2026-02-15 21:44:37 +02:00
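
A minimal sketch of the pattern the commit above describes, assuming a directory scan during backend registration: the std::error_code overload of is_regular_file is noexcept, so a symlink to a missing folder yields false plus an error code instead of a thrown filesystem_error (the scanned path below is illustrative).

```cpp
#include <filesystem>
#include <iostream>
#include <system_error>

int main() {
    namespace fs = std::filesystem;
    // Illustrative search path; the backend registry would scan its own
    // plugin directory here.
    for (const fs::directory_entry & entry : fs::directory_iterator(".")) {
        std::error_code ec;
        // noexcept overload: a dangling symlink sets `ec` and returns false
        // instead of throwing fs::filesystem_error and aborting the scan.
        if (entry.is_regular_file(ec) && !ec) {
            std::cout << "candidate backend file: " << entry.path().string() << "\n";
        }
    }
}
```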
Raul Torres 2de2fc9270 CANN: Remove unnecessary wrapper for `ggml_backend_buft_is_cann` (llama/18968) 2026-02-15 21:44:37 +02:00
hipudding 6a74f56212 CANN: implement quantized MUL_MAT_ID for MoE models (llama/19228)
Implement ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture of Experts (MoE) architectures on CANN backend.

Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes

Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
2026-02-15 21:44:37 +02:00
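
Not the CANN kernel itself, but a hedged reference sketch of what MUL_MAT_ID routing computes: an index tensor picks which expert's weight matrix multiplies each token, which is the selection the commit implements with IndexSelect plus WeightQuantBatchMatmulV2 on quantized weights (names and shapes below are illustrative).

```cpp
#include <cstdint>
#include <vector>

// Illustrative reference: y[t] = W[ids[t]] * x[t], i.e. per-token expert
// selection followed by an ordinary matrix-vector product.
void mul_mat_id_ref(const std::vector<std::vector<float>> & experts, // [n_expert] of [n_out*n_in]
                    const float * x, const int32_t * ids, float * y,
                    int n_tokens, int n_in, int n_out) {
    for (int t = 0; t < n_tokens; ++t) {
        const float * W = experts[ids[t]].data(); // route token t to its expert
        for (int o = 0; o < n_out; ++o) {
            float acc = 0.0f;
            for (int i = 0; i < n_in; ++i) {
                acc += W[o * n_in + i] * x[t * n_in + i];
            }
            y[t * n_out + o] = acc;
        }
    }
}
```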
Georgi Gerganov a36210c836 cuda : extend GGML_OP_PAD to work with non-cont src0 (llama/19429)
* cuda : extend GGML_OP_PAD to work with non-cont src0

* tests : add permuted pad
2026-02-15 21:44:37 +02:00
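
Supporting a non-contiguous src0 generally comes down to addressing elements through the tensor's per-dimension byte strides (ggml's nb[] fields) instead of assuming a flat layout. A hedged sketch of that addressing pattern, not the actual CUDA kernel:

```cpp
#include <cstdint>
#include <cstring>

// Illustrative: read element (i0, i1, i2, i3) of an f32 tensor whose layout
// is described by byte strides nb[0..3], as in ggml. A permuted or otherwise
// non-contiguous view simply carries different nb values; the indexing code
// does not change.
static float get_f32_strided(const char * data, const size_t nb[4],
                             int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    const char * p = data + i0*nb[0] + i1*nb[1] + i2*nb[2] + i3*nb[3];
    float v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}
```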
Oliver Simons 808904277e CUDA: Fix non-contig rope (llama/19338)
* Rename variables + fix rope_neox

Seems memory layout is shared with Vulkan so we can port fix from
https://github.com/ggml-org/llama.cpp/pull/19299

* Fix rope_multi

* Fix rope_vision

* Fix rope_norm

* Rename ne* to ne0* for consistent variable naming

* cont : consistent stride names

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-02-15 21:44:37 +02:00
Nuno 764482c317 ci: add vulkan docker image (#3644)
Signed-off-by: rare-magma <rare-magma@posteo.eu>
2026-02-09 12:33:06 +02:00
Pádraic Slattery 052066c4f7 chore: Update outdated GitHub Actions versions (#3646) 2026-02-09 12:32:46 +02:00
Christian Kastner 525be69a66 cmake: Drop obsolete build-time configuration of backends (#3649)
The backend configuration now happens in ggml.

This updated configuration mirrors that of llama.cpp.
2026-02-09 12:32:18 +02:00
Sid Mohan eb27fa2252 server : fix hardcoded /inference path in default HTML page (#3639)
Closes #3596
2026-02-09 10:10:13 +02:00
Georgi Gerganov 193f7cdaaf ci : try fix mirrors (#3655) 2026-02-09 09:59:22 +02:00
Georgi Gerganov 4b23ff249e talk-llama : sync llama.cpp 2026-02-08 09:29:10 +02:00
Georgi Gerganov b0e81c1a2e sync : ggml 2026-02-08 09:29:10 +02:00
Georgi Gerganov 55d7cb2e93 metal : consolidate bin kernels (llama/19390)
* metal : refactor bin kernels

* cont

* cont : fix cv
2026-02-08 09:29:10 +02:00
Georgi Gerganov a9a0a51fba metal : fix event synchronization in cpy_tensor_async (llama/19402) 2026-02-08 09:29:10 +02:00
Abhijit Ramesh 1739af663a ggml-webgpu: JIT compile binary operators and handle binding overlaps (llama/19310)
* ggml webgpu: port binary operators to use pre-wgsl

* Add binary.wgsl: unified shader with conditionals for all 4 ops

* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor

* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)

* Update CMake to generate binary operator shaders at build time

* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling

* port binary operators from AOT to pre-wgsl JIT compilation

* add src1=dst overlap handling for binary ops

* use compile-time workgroup size defines instead of runtime overrides

* ggml-webgpu: complete overlap handling for binary ops

* add support for inplace & overlap case in binding setup

* restructure conditional logic to handle all overlap cases

* ensure all buffer bindings are correctly assigned for edge cases

* ggml-webgpu: remove unused binary overlap cases

Remove src0==src1 binary overlap case that never occurs in practice.

* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT

* remove unused src0==src1 and all-same variant

* refactor wgsl to eliminate duplication
2026-02-08 09:29:10 +02:00
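
A hedged sketch of the case selection described above: the kept variants are INPLACE (src0 == dst), OVERLAP (src1 == dst) and DEFAULT, chosen by comparing the buffers a binary op reads and writes (the enum and function below are illustrative, not the backend's actual API).

```cpp
enum class bin_binding_case { DEFAULT, INPLACE, OVERLAP };

// Illustrative: pick the shader/binding variant for a binary op based on
// which input aliases the destination buffer.
static bin_binding_case classify_binary_bindings(const void * src0,
                                                 const void * src1,
                                                 const void * dst) {
    if (src0 == dst) { return bin_binding_case::INPLACE; } // result written over src0
    if (src1 == dst) { return bin_binding_case::OVERLAP; } // result aliases src1
    return bin_binding_case::DEFAULT;                      // three distinct buffers
}
```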
Nechama Krashinski f2f7320817 sycl: add F16 support for GGML_OP_CEIL (llama/19306)
* Fix SYCL CEIL operator

* sycl: implement GGML_OP_CEIL
2026-02-08 09:29:10 +02:00
Jeff Bolz cea22b3075 vulkan: For coopmat2 FA, use fp16 accumulators for the final result (llama/19376)
The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change
does the same for Vulkan. This helps particularly with large head sizes, which
are very register-limited.

I tried this for the coopmat1 path and it slowed down a bit. I didn't try for
scalar.

I applied the softmax bias that the cuda backend uses to avoid overflow,
although I was not able to reproduce the original bug without it.
2026-02-08 09:29:10 +02:00
Jeff Bolz c1b63354bb vulkan: make FA mask/softcap enables spec constants (llama/19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-08 09:29:10 +02:00
Georgi Gerganov 776cf61857 metal : skip loading all-zero mask (llama/19337)
* metal : skip loading all-zero mask

* cont : minor
2026-02-08 09:29:10 +02:00
Georgi Gerganov 2a7d5490f1 cuda : cuda graphs now compare all node params (llama/19383) 2026-02-08 09:29:10 +02:00
Georgi Gerganov 34d332aca5 metal : adaptive CPU/GPU interleave based on number of nodes (llama/19369) 2026-02-08 09:29:10 +02:00
Jeff Bolz a567c140a3 vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (llama/19281)
Write out a 2-bit code per block and avoid loading the mask when it
matches these two common cases.

Apply this optimization when the mask is relatively large (i.e. prompt
processing).
2026-02-08 09:29:10 +02:00
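
A hedged CPU-side illustration of the preprocessing idea: scan each mask block once, emit a 2-bit code (mixed / all-zero / all -inf), and let the FA kernel skip loading blocks whose code marks one of the two trivial cases (block size and code values below are illustrative).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

enum : uint8_t { MASK_MIXED = 0, MASK_ALL_ZERO = 1, MASK_ALL_NEG_INF = 2 };

// Illustrative: one code per block of `block` consecutive mask values.
static std::vector<uint8_t> classify_mask_blocks(const float * mask, size_t n, size_t block) {
    std::vector<uint8_t> codes;
    for (size_t start = 0; start < n; start += block) {
        bool all_zero = true, all_neg_inf = true;
        const size_t end = std::min(start + block, n);
        for (size_t i = start; i < end; ++i) {
            all_zero    = all_zero    && mask[i] == 0.0f;
            all_neg_inf = all_neg_inf && std::isinf(mask[i]) && mask[i] < 0.0f;
        }
        codes.push_back(all_zero ? MASK_ALL_ZERO : all_neg_inf ? MASK_ALL_NEG_INF : MASK_MIXED);
    }
    return codes;
}
```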
Georgi Gerganov 0781df2518 metal : add diag (llama/19330) 2026-02-08 09:29:10 +02:00
Oleksandr Kuvshynov 932def3198 vulkan: fix GPU deduplication logic. (llama/19222)
* vulkan: fix GPU deduplication logic.

As reported in https://github.com/ggml-org/llama.cpp/issues/19221, the
(same uuid, same driver) logic is problematic for windows+intel igpu.

Let's just avoid filtering for MoltenVK, which is Apple-specific, and
keep the logic the same as before 88d23ad5 - just dedup based on UUID.

Verified that macOS + 4xVega still reports 4 GPUs with this version.

* vulkan: only skip dedup when both drivers are MoltenVK
2026-02-08 09:29:10 +02:00
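
A hedged sketch of the rule the fix above settles on: deduplicate devices by UUID alone, and only keep apparent duplicates when both entries report MoltenVK, the Apple-specific case where duplicates are expected (the struct and names are illustrative, not the Vulkan backend's actual code).

```cpp
#include <string>
#include <vector>

struct gpu_info {
    std::string uuid;
    std::string driver_name;
};

// Illustrative: dedup on UUID, except when both the already-kept device and
// the candidate report MoltenVK.
static std::vector<gpu_info> dedup_gpus(const std::vector<gpu_info> & devs) {
    std::vector<gpu_info> out;
    for (const gpu_info & d : devs) {
        bool is_dup = false;
        for (const gpu_info & kept : out) {
            const bool both_moltenvk = kept.driver_name == "MoltenVK" && d.driver_name == "MoltenVK";
            if (kept.uuid == d.uuid && !both_moltenvk) {
                is_dup = true;
                break;
            }
        }
        if (!is_dup) { out.push_back(d); }
    }
    return out;
}
```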
Jeff Bolz 5a786f7648 vulkan: Set k_load_shmem to false when K is too large (llama/19301) 2026-02-08 09:29:10 +02:00
Jeff Bolz e0a3f393ad vulkan: fix non-contig rope (llama/19299) 2026-02-08 09:29:10 +02:00
will-lms eecc9bfa69 metal : add missing includes (llama/19348) 2026-02-08 09:29:10 +02:00
Kevin Pouget 2763054f99 ggml-virtgpu: make the code thread safe (llama/19204)
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function

* ggml-virtgpu: deprecate buffer_type is_host remoting

not necessary

* ggml-virtgpu: stop using static vars as cache

The static init isn't thread safe.

* ggml-virtgpu: protect the use of the shared memory to transfer data

* ggml-virtgpu: make the remote calls thread-safe

* ggml-virtgpu: backend: don't continue if couldn't allocate the tensor memory

* ggml-virtgpu: add a cleanup function for consistency

* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing

* fix style and ordering

* Remove the static variable in apir_device_get_count

* ggml-virtgpu: improve the logging

* fix review minor formatting changes
2026-02-08 09:29:10 +02:00
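
One common way to make a "static variable used as cache" thread-safe, sketched here with std::call_once; this is a hedged illustration of the general pattern, not the virtgpu code, which for apir_device_get_count simply dropped the static (the query function below is a hypothetical stand-in for the remote call).

```cpp
#include <mutex>

int query_device_count_from_host(); // hypothetical stand-in for the remote call

// Illustrative: instead of an unguarded static that multiple threads may race
// to fill, cache the value exactly once under std::call_once.
int get_device_count_cached() {
    static int            count = -1;
    static std::once_flag once;
    std::call_once(once, [] { count = query_device_count_from_host(); });
    return count;
}
```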
Aman Gupta 4685ec9555 ggml-cpu: use LUT for converting e8->f32 scales on x86 (llama/19288)
* ggml-cpu: use LUT for converting e8->f32 scales on x86

* add dispatch based on macro
2026-02-08 09:29:10 +02:00
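
A hedged sketch of the LUT idea: if the e8 scale is an E8M0-style biased exponent byte, each of its 256 possible values maps to a fixed f32 (2^(e-127)), so a precomputed table turns the conversion into a single load; the exact format and bias here are an assumption for illustration, not a statement about the ggml-cpu kernels.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Illustrative: 256-entry table mapping an E8M0-style exponent byte to an
// f32 scale, assuming value = 2^(e - 127). Build once, then each conversion
// is a single table lookup instead of per-element exponent math.
static std::array<float, 256> make_e8_lut() {
    std::array<float, 256> lut{};
    for (int e = 0; e < 256; ++e) {
        lut[e] = std::ldexp(1.0f, e - 127);
    }
    return lut;
}

static inline float e8_to_f32(uint8_t e) {
    static const std::array<float, 256> lut = make_e8_lut();
    return lut[e];
}
```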
Georgi Gerganov 5dda94dd2e metal : add solve_tri (llama/19302) 2026-02-08 09:29:10 +02:00
Ruben Ortlam aa34558b6f vulkan: disable coopmat1 fa on Nvidia Turing (llama/19290) 2026-02-08 09:29:10 +02:00
Aman Gupta 8eede801e3 CUDA: use mmvq for mul-mat-id for small batch sizes (llama/18958)
* CUDA: use mmvq for mul-mat-id for small batch sizes

* add mmvq too

* Fix perf issue on ampere. Use mmvf mm-id only for non-nvidia GPUs

* templatize multi_token_path
2026-02-08 09:29:10 +02:00
Georgi Gerganov ce8a2da620 metal : minor cleanup (llama/19251) 2026-02-08 09:29:10 +02:00
Oliver Simons 698265d754 CUDA: Fix loop unrolling for BW in mul_mat_q_stream_k_fixup (llama/19053)
By providing the stride_* variables as size_t (i.e., 64-bit), the compiler can
correctly unroll the [two for-loops](557515be1e/ggml/src/ggml-cuda/mmq.cuh (L3789-L3816))
on BW. This gives some perf gain in the prefill/pp phase on BW, while not affecting
other SMs:

| GPU                                                     | Model                 | Test   |   t/s master |   t/s osimons/fix_bw_mmq_fixup_kernel |   Speedup |
|:--------------------------------------------------------|:----------------------|:-------|-------------:|--------------------------------------:|----------:|
| NVIDIA RTX 6000 Ada Generation                          | gpt-oss 20B MXFP4 MoE | pp8096 |      8404.05 |                               8375.79 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | llama 3B Q4_K_M       | pp8096 |     16148.93 |                              16019.60 |      0.99 |
| NVIDIA RTX 6000 Ada Generation                          | llama 8B Q4_0         | pp8096 |      8008.29 |                               7978.80 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B BF16    | pp8096 |      4263.16 |                               4248.53 |      1.00 |
| NVIDIA RTX 6000 Ada Generation                          | nemotron_h 9B Q4_K_M  | pp8096 |      5165.11 |                               5157.43 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 |     12582.80 |                              12758.37 |      1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M       | pp8096 |     16879.10 |                              17619.47 |      1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0         | pp8096 |     10649.90 |                              10982.65 |      1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16    | pp8096 |      7717.73 |                               7716.22 |      1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M  | pp8096 |      7301.90 |                               7370.38 |      1.01 |
2026-02-08 09:29:10 +02:00
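
A hedged, host-side illustration of the change's shape (not the mmq.cuh code): the strides are passed as size_t (64-bit) and the fixup loop has a fixed trip count, which per the commit above is what lets the compiler fully unroll it on Blackwell (names below are illustrative).

```cpp
#include <cstddef>

// Illustrative: strides taken as size_t instead of int; the loop has a fixed
// trip count so the compiler can unroll it (in the CUDA kernel this would be
// a #pragma unroll loop inside mul_mat_q_stream_k_fixup).
static void fixup_accumulate(float * dst, const float * partial,
                             size_t stride_dst, size_t stride_partial, int nrows) {
    for (int row = 0; row < 8; ++row) { // fixed trip count, unroll-friendly
        if (row < nrows) {
            dst[row * stride_dst] += partial[row * stride_partial];
        }
    }
}
```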
George 57107b2bf8 ggml: added cleanups in ggml_quantize_free (llama/19278)
Add missing cleanup calls for the IQ2_S and IQ1_M quantization types, and for IQ3XS with 512 blocks, in ggml_quantize_free.
2026-02-08 09:29:10 +02:00
Gaurav Garg 6ec362d2e0 cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (llama/19227)
Hangs were reported on Jetson Orin AGX when CUDA_SCALE_LAUNCH_QUEUES=4x was set. This reverts the previous PR (#19042) and updates the documentation to suggest setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
2026-02-08 09:29:10 +02:00
lhez 591072fcc8 opencl: refactor some ops, concat, repeat, tanh and scale (llama/19226)
* opencl: refactor concat

* opencl: refactor repeat

* opencl: refactor tanh

* opencl: enable fp16 for tanh

* opencl: refactor scale

* opencl: fix unused variables
2026-02-08 09:29:10 +02:00
Aman Gupta 871063016d ggml-cpu: FA split across kv for faster TG (llama/19209)
* ggml-cpu: split across kv for faster TG

* simplify sinks application

* add ref impl
2026-02-08 09:29:10 +02:00
Neo Zhang c4003da2b8 Remove support for Nvidia & AMD GPUs, because the oneAPI plugin for Nvidia & AMD GPUs is unavailable: the download/installation channels no longer work. (llama/19246)
Users can't build the software for Nvidia & AMD GPUs.
Remove oneMath since it is only used in the NV and AMD code paths.
2026-02-08 09:29:10 +02:00
Tamar 74353e90a1 sycl: implement GGML_OP_TOP_K (llama/19242) 2026-02-08 09:29:10 +02:00
Georgi Gerganov 73e04555eb metal : support virtual devices (llama/18919)
* metal : support virtual devices

* cont : manage buffer type context memory

* metal : add events

* cont : implement cpy_tensor_async
2026-02-08 09:29:10 +02:00
Johannes Gäßler 625c8d863e ggml-backend: fix async set/get fallback sync (llama/19179) 2026-02-08 09:29:10 +02:00
Christian Kastner 0e219ebf89 docs : Minor cleanups (llama/19252)
* Update old URLs to github.com/ggml-org/

* Bump copyrights
2026-02-08 09:29:10 +02:00
Nikhil Jain a0256b8159 Remove pipeline cache mutexes (llama/19195)
* Remove mutex for pipeline caches, since they are now per-thread.

* Add comment

* Run clang-format

* Cleanup

* Run CI again

* Run CI once more

* Run clang-format
2026-02-08 09:29:10 +02:00
Max Krasnyansky aca5953d8d Bump cmake max version (needed for Windows on Snapdragon builds) (llama/19188)
* Bump max cmake version (needed for Windows on Snapdragon builds)

* cmake: move max version setting into ggml/CMakeLists
2026-02-08 09:29:10 +02:00
nullname 9b927dd849 ggml-hexagon: flash-attention and reduce-sum optimizations (llama/19141)
* wip

* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation

* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations

* wip

* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance

* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability

* optimize vector dot product functions to use unified reduction for improved performance

* hexagon: optimize reduce-sum for v75+

* hexagon: always keep row_sums in sf/fp32

* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT

* fix compiling error after rebase

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-02-08 09:29:10 +02:00