* CUDA: optimize get_int_from_table_16
* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs
* revise documentation
---------
Co-authored-by: xix <xiapc@outlook.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* vulkan: use subgroup function for mul_mat_id shader even without coopmat
* vulkan: fix compile warnings
* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id
* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.
Store the FA pipelines in a map, indexed by the pipeline state.
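A minimal sketch of that map-based lookup in host C++; the key fields and pipeline type below are hypothetical stand-ins for the real Vulkan backend structures:
```cpp
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical key: whatever state selects a distinct FA pipeline variant.
struct fa_pipeline_state {
    uint32_t hsk, hsv;   // head sizes, padded as described above
    uint32_t rows;       // rows per workgroup
    bool     small_rows;
    bool     f32_acc;

    bool operator<(const fa_pipeline_state & o) const {
        return std::tie(hsk, hsv, rows, small_rows, f32_acc) <
               std::tie(o.hsk, o.hsv, o.rows, o.small_rows, o.f32_acc);
    }
};

struct vk_pipeline;  // opaque handle to a compiled pipeline

// Pipelines are created on demand and cached by their full state.
static std::map<fa_pipeline_state, vk_pipeline *> fa_pipelines;
```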
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs
There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups; it just becomes a simple per-element
multiply.
The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.
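As a rough host-side model of that fused path (the real code is a Vulkan compute shader; names and structure here are illustrative only), the add step accumulates the sum of squares as a side effect and the rms_norm step then only scales:
```cpp
#include <atomic>
#include <cmath>
#include <vector>

// CAS loop standing in for the shader's atomic float add.
static void atomic_add_float(std::atomic<float> & acc, float v) {
    float cur = acc.load(std::memory_order_relaxed);
    while (!acc.compare_exchange_weak(cur, cur + v)) {
        // cur is reloaded with the latest value on failure; retry
    }
}

// "add shader": dst = a + b, and sum(dst^2) is accumulated as a side effect.
void fused_add(const std::vector<float> & a, const std::vector<float> & b,
               std::vector<float> & dst, std::atomic<float> & sum_sq) {
    for (size_t i = 0; i < a.size(); ++i) {
        dst[i] = a[i] + b[i];
        atomic_add_float(sum_sq, dst[i] * dst[i]);
    }
}

// "rms_norm shader": with the sum already available, each element is just scaled.
void rms_norm_from_sum(std::vector<float> & x, float sum_sq, float eps = 1e-6f) {
    const float scale = 1.0f / std::sqrt(sum_sq / (float) x.size() + eps);
    for (float & v : x) {
        v *= scale;
    }
}
```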
* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.
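A corresponding sketch of the deterministic two-pass scheme, again host-side C++ with illustrative names (the real reduction uses subgroupAdd in the shader):
```cpp
#include <cmath>
#include <vector>

// Pass 1: each "workgroup" owns one chunk and writes its partial sum of squares
// to a fixed slot, so the result is independent of execution order.
std::vector<float> partial_sums(const std::vector<float> & x, size_t chunk) {
    std::vector<float> partials((x.size() + chunk - 1) / chunk, 0.0f);
    for (size_t i = 0; i < x.size(); ++i) {
        partials[i / chunk] += x[i] * x[i];
    }
    return partials;
}

// Pass 2: reduce the partials (subgroupAdd in the real shader) and scale.
void rms_norm_from_partials(std::vector<float> & x,
                            const std::vector<float> & partials, float eps = 1e-6f) {
    float sum_sq = 0.0f;
    for (float p : partials) {
        sum_sq += p;
    }
    const float scale = 1.0f / std::sqrt(sum_sq / (float) x.size() + eps);
    for (float & v : x) {
        v *= scale;
    }
}
```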
* complete rebase against fused adds - multi_add shader can also compute partial sums
* fix validation errors
* disable add_rms_fusion for Intel due to possible driver bug
* resolve against #15489, sync after clearing partial sums
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.
The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether it is in use and needs to be synced. The flag should be
checked before the allocation is written to, and set to true after its contents
are done being consumed.
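A rough host-side model of that bookkeeping (placeholder types; the real backend tracks ggml graph nodes and Vulkan memory barriers):
```cpp
#include <algorithm>
#include <vector>

struct node;  // placeholder for a graph node

struct sync_tracker {
    std::vector<const node *> pending;  // nodes written since the last barrier

    // Called before recording a node: a sync is needed only if the new node
    // reads from, or overwrites, one of the pending nodes.
    bool needs_sync(const std::vector<const node *> & reads, const node * writes) const {
        for (const node * n : pending) {
            if (n == writes || std::find(reads.begin(), reads.end(), n) != reads.end()) {
                return true;
            }
        }
        return false;
    }

    void barrier() { pending.clear(); }             // a sync retires all pending writes
    void record (const node * n) { pending.push_back(n); }
};

// Auxiliary allocations (dequant buffers, split_k scratch, ...) carry a flag instead:
struct scratch_buffer {
    bool need_sync = false;  // checked before the next write; set once the
                             // current contents have been consumed
};
```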
* vulkan : support ggml_mean
* vulkan : support sum, sum_rows and mean with non-contiguous tensors
* vulkan : fix subbuffer size not accounting for misalign offset
* tests : add backend-op tests for non-contiguous sum_rows
* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support
* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
- Spread the work across the whole workgroup. The benefit of using more threads
seems to far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
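A sketch of the power-of-two specialization in host C++ (the real version is a shader variant; names are illustrative):
```cpp
#include <cstdint>

struct div_info {
    uint32_t d;       // divisor
    bool     pow2;    // is d a power of two?
    uint32_t shift;   // log2(d), valid only when pow2
};

inline div_info make_div(uint32_t d) {
    div_info di = { d, d != 0 && (d & (d - 1)) == 0, 0 };
    if (di.pow2) {
        while ((1u << di.shift) < d) {
            di.shift++;
        }
    }
    return di;
}

// Division and modulo collapse to a shift and a mask in the power-of-two case.
inline uint32_t fast_div(uint32_t x, const div_info & di) { return di.pow2 ? (x >> di.shift) : (x / di.d); }
inline uint32_t fast_mod(uint32_t x, const div_info & di) { return di.pow2 ? (x & (di.d - 1)) : (x % di.d); }
```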
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Work on templating for different types in shaders
* Work on shader type generation
* Working q4_0 mul_mat and some templating for different types
* Add q4_0_f16 matmul and fix device init
* Add matmul support for basic quantization types
* Add q2_k and q3_k quantization
* Add rest of k-quants
* Get first i-quant working
* Closer to supporting all i-quants
* Support rest of i-quants
* Cleanup code
* Fix python formatting
* debug
* Bugfix for memset
* Add padding to end of buffers on creation
* Simplify bit-shifting
* Update usage of StringView
* Add Pad Reflect 1D CUDA support
* Update ggml/src/ggml-cuda/pad_reflect_1d.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* vulkan: Reuse conversion results in prealloc_y
Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.
* don't use shared pointer for prealloc_y_last_pipeline_used
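A simplified sketch of the caching check; the context struct and the tensor field name are assumptions (only prealloc_y_last_pipeline_used appears above):
```cpp
// Simplified stand-in for the backend context; raw pointer as noted above.
struct context {
    const void * prealloc_y_last_pipeline_used = nullptr;
    const void * prealloc_y_last_tensor_used   = nullptr;  // assumed companion field
};

// Returns true if prealloc_y must be (re)filled for this pipeline/tensor pair.
bool prealloc_y_needs_fill(context & ctx, const void * pipeline, const void * tensor) {
    if (ctx.prealloc_y_last_pipeline_used == pipeline &&
        ctx.prealloc_y_last_tensor_used   == tensor) {
        return false;  // conversion result is already sitting in prealloc_y
    }
    ctx.prealloc_y_last_pipeline_used = pipeline;
    ctx.prealloc_y_last_tensor_used   = tensor;
    return true;
}
```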
* musa: fix build warnings
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
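For reference, a generic instance of this class of warning and the usual fix (not the actual MUSA code):
```cpp
#include <vector>

void scan(const std::vector<float> & v) {
    // for (int i = 0; i < v.size(); ++i)  -> signed 'int' vs the unsigned size_t
    // returned by size() triggers -Wsign-compare; matching the types fixes it.
    for (size_t i = 0; i < v.size(); ++i) {
        (void) v[i];
    }
}
```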
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
While working on the [whisper-cpp](https://conan.io/center/recipes/whisper-cpp) Conan package for ConanCenter, I noticed that enabling the `with_blas` option fails to build due to an issue in the _MKL_ detection logic.
The problem is that the CMake condition currently expands `BLAS_INCLUDE_DIRS` without quotes:
```cmake
if (${BLAS_INCLUDE_DIRS} MATCHES "mkl" AND (${GGML_BLAS_VENDOR} MATCHES "Generic" OR ${GGML_BLAS_VENDOR} MATCHES "Intel"))
```
When `BLAS_INCLUDE_DIRS` is a list (as Conan provides it), the `if()` command receives multiple arguments and produces a CMake error:
```bash
...
-- BLAS found, Includes: /root/.conan2/p/b/openb034c5a6ca927b/p/include;/root/.conan2/p/b/openb034c5a6ca927b/p/include/openblas
CMake Error at ggml/src/ggml-blas/CMakeLists.txt:77 (if):
if given arguments:
"/root/.conan2/p/b/openb034c5a6ca927b/p/include" "/root/.conan2/p/b/openb034c5a6ca927b/p/include/openblas" "MATCHES" "mkl" "AND" "(" "OpenBLAS" "MATCHES" "Generic" "OR" "OpenBLAS" "MATCHES" "Intel" ")"
Unknown arguments specified
...
```
This PR fixes the issue by quoting the variable:
```cmake
if ("${BLAS_INCLUDE_DIRS}" MATCHES "mkl" AND (${GGML_BLAS_VENDOR} MATCHES "Generic" OR ${GGML_BLAS_VENDOR} MATCHES "Intel"))
```
With this change, the whole list is treated as a single string and the regex still works correctly.
On busybox-based systems like Alpine Linux, wget does not support certain CLI
flags such as '--no-config'. Thus, search the PATH for 'curl' before falling
back to wget. wget2 is still the preferred download tool.
This commit removes the brew install of cmake for macos-latest as cmake now
seems to be pre-installed on the runner.
The motivation for this is that this job is failing with the following
error:
```console
Error: cmake was installed from the local/pinned tap
but you are trying to install it from the homebrew/core tap.
Formulae with the same name from different taps cannot be installed at the same time.
```
This commit modifies the test-vad and test-vad-full tests to use CMake
definitions for the model and sample paths.
The motivation for this is that currently the tests use relative paths
which might not always be correct depending on the working directory.
With the changes in this commit the tests can be run using ctest:
```console
$ ctest -R ^test-vad$ --test-dir build
```
Or directly (which is not currently possible without this fix):
```console
./build/bin/test-vad
```
Resolves: https://github.com/ggml-org/whisper.cpp/issues/3404
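As an illustration of what the CMake-provided definitions can look like from the test's side (the macro names here are hypothetical, not necessarily the ones used in the commit):
```cpp
#include <cstdio>

// Hypothetical: the build passes absolute paths via target_compile_definitions,
// so the test no longer depends on the working directory.
#ifndef TEST_VAD_MODEL_PATH
#define TEST_VAD_MODEL_PATH "model-path-not-configured"
#endif
#ifndef TEST_VAD_SAMPLE_PATH
#define TEST_VAD_SAMPLE_PATH "sample-path-not-configured"
#endif

int main() {
    std::printf("VAD model : %s\n", TEST_VAD_MODEL_PATH);
    std::printf("VAD sample: %s\n", TEST_VAD_SAMPLE_PATH);
    return 0;
}
```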
This might happen depending on how $stderr.winsize is defined. If the expression "$stderr.winsize[1] - line.size" in line 114 becomes negative, we get a "negative argument" exception in the padding calculation.