Commit Graph

2532 Commits

Author SHA1 Message Date
John Balis 3eea171cab feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854)
* conv transpose 1d passing test for 1d input and kernel

* working for different input and output channel counts, added test for variable stride

* initial draft appears to work with stride other than 1

* working with all old and new conv1d  tests

* added a test for large tensors

* removed use cuda hardcoding

* restored test-conv-transpose.c

* removed unused arugments, and fixed bug where test failure would cause subsequent tests to fail

* fixed accumulator bug

* added test to test-backend-ops

* fixed mistake

* addressed review

* fixed includes

* removed blank lines

* style and warning fixes

* return failure when test fails

* fix supports_op

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-07-08 14:53:55 +03:00
Georgi Gerganov 64a56ebf13
ci : disable java build 2024-07-08 14:26:59 +03:00
Emmanuel Schmidbauer bec9836849
server : add inference path to make OAI API compatible (#2270) 2024-07-08 14:24:58 +03:00
Georgi Gerganov c118733a29
sync : ggml + fix sync script 2024-06-26 23:20:19 +03:00
Georgi Gerganov bb3dd45524
make : disable CUDA graphs 2024-06-26 23:20:13 +03:00
slaren 04e7fa6f4f
ggml : add GGML_CUDA_USE_GRAPHS option, restore GGML_CUDA_FORCE_CUBLAS (cmake) (llama/8140) 2024-06-26 23:18:11 +03:00
Georgi Gerganov 9f7f36d4c9
make : disable CUDA mel build 2024-06-26 22:25:25 +03:00
Georgi Gerganov 4a62efbb95
cmake : minor fixes 2024-06-26 21:42:39 +03:00
Georgi Gerganov 0a55a70b9b
make : fix missing -O3
same as https://github.com/ggerganov/llama.cpp/pull/8143
2024-06-26 21:21:12 +03:00
Georgi Gerganov dc8cc2dd6f
whisper : disable CUDA mel + fix FFMPEG 2024-06-26 20:11:38 +03:00
Georgi Gerganov 3efedb9511
sync : ggml 2024-06-26 19:40:23 +03:00
Georgi Gerganov e30c679928
whisper : reorganize source code + improve CMake (#2256)
* scripts : update sync [no ci]

* files : reorganize [no ci]

* sync : llama.cpp

* cmake : link math library

* cmake : build normal ggml library

* files : move headers to include

* objc : fix path to ggml-metal.h

* ci : fix WHISPER_CUDA -> GGML_CUDA

* scripts : sync LICENSE [no ci]
2024-06-26 19:34:09 +03:00
mky_coder bf4cb4abad
whisper : optimize fft() function (#2242)
Co-authored-by: Mike Fan <60965742+mike-fzy@users.noreply.github.com>
2024-06-18 18:10:33 +03:00
Georgi Gerganov e293f17d34
talk-llama : sync llama.cpp 2024-06-18 09:45:37 +03:00
Georgi Gerganov 5d950c4b8d whisper : use ggml_backend_sched (#2239)
* whisper : use ggml_backend_sched (wip)

* use sched in whisper_allocr

* whisper : single backend in whisper_context

* whisper : remove whisper_state->backends_used

* whisper : remove whisper_context->backend

* whisper : reset scheduler after init

* whisper : fix external encoder (e.g. CoreML)

* whisper : cleanup

* whisper : handle null GPU buffer types + fix sycl

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-18 09:39:40 +03:00
Georgi Gerganov 820446e230 fix : remove extra files 2024-06-18 09:39:40 +03:00
Georgi Gerganov 54d5823ebe scripts : sync ggml-blas 2024-06-18 09:39:40 +03:00
Georgi Gerganov 5181494e9f build : update make / cmake 2024-06-18 09:39:40 +03:00
Georgi Gerganov 4a6e6e8b30 sync : ggml 2024-06-18 09:39:40 +03:00
slaren de29b193f6 move BLAS to a separate backend (cont) (llama/6210)
ggml-ci
2024-06-18 09:39:40 +03:00
0cc4m 922971041b Vulkan Shader Refactor, Memory Debugging Option (llama/7947)
* Refactor shaders, extract GLSL code from ggml_vk_generate_shaders.py into vulkan-shaders directory

* Improve debug log code

* Add memory debug output option

* Fix flake8

* Fix unnecessary high llama-3 VRAM use
2024-06-18 09:39:40 +03:00
Georgi Gerganov 63a767a134 scripts : stop sync whisper example from ggml 2024-06-18 09:39:40 +03:00
Georgi Gerganov 30841fa786 cmake : fix sycl build (#0) 2024-06-16 18:19:48 +03:00
Georgi Gerganov 3b1ac03828 ggml : remove OpenCL (#0) 2024-06-16 18:19:48 +03:00
Georgi Gerganov 990de617b5 sycl : sync (#0) 2024-06-16 18:19:48 +03:00
Georgi Gerganov 6975600b4b cuda : enable CUDA graphs (#0) 2024-06-16 18:19:48 +03:00
Georgi Gerganov 061eeb9f61 talk-llama : sync llama.cpp 2024-06-16 18:19:48 +03:00
Georgi Gerganov 4942b1b428 cmake : fix CUDA build (#0) 2024-06-16 18:19:48 +03:00
Georgi Gerganov 3c7cc5c437 sync : ggml
ggml-ci
2024-06-16 18:19:48 +03:00
Hong Bo PENG 5cd42ee2cc ggml : fix and optimize ppc64le (ggml/849)
* fix compile issues introduced by loongarch_asx

* restore quant changes to merge

* fix compile issues introduced by loongarch_asx

* further optimize by using vec_msum & vec_sum4s on ppc64le
2024-06-16 18:19:48 +03:00
Daniel Bevenius ee718f3da6 ggml : remove duplicate include of ggml-common.h (ggml/853)
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-06-16 18:19:48 +03:00
Meng, Hengyu 63eac1f608 remove global variables (llama/7710)
* separate DPCT helpers outside

* replace global variables with context

* remove useless extra

* update mul_mat condition

* remove duplicate buft initialization

* remove duplicate extra and global work group size

* remove useless backend check

* remove duplicated extras

* use macro for group_size and remove cuda-related
2024-06-16 18:19:48 +03:00
Johannes Gäßler b17ba2815b CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)
* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores

* try CI fix

* try CI fix

* try CI fix

* fix data race

* rever q2_K precision related changes
2024-06-16 18:19:48 +03:00
Georgi Gerganov 7a489af2f3 metal : utilize max shared memory for mul_mat_id (llama/7935) 2024-06-16 18:19:48 +03:00
Radoslav Gerganov 4a4ea13d6d rpc : fix ggml_backend_rpc_supports_buft() (llama/7918) 2024-06-16 18:19:48 +03:00
slaren 174a461fc6 move BLAS to a separate backend (llama/6210)
* move BLAS to a separate backend

* rename GGML_USE_OPENBLAS to GGML_USE_BLAS

* alloc : reuse same buffer when the same buffer type if used multiple times

* set number of threads automatically for openblas and blis

* sched : print assignments when GGML_SCHED_DEBUG env variable is set

* sched : allow ops with weights on an incompatible buffer type

This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-16 18:19:48 +03:00
Johannes Gäßler d8b7a24bc9 CUDA: fix broken oob check for FA vec f32 kernel (llama/7904) 2024-06-16 18:19:48 +03:00
Georgi Gerganov acf3832c9c tests : add non-cont unary tests (llama/7857)
* tests : add non-cont unary tests

* ggml : update unary asserts and "supports_op"

ggml-ci
2024-06-16 18:19:48 +03:00
Georgi Gerganov d29ac44303 ggml : improve ggml_is_contiguous logic (llama/7856)
* ggml : improve ggml_is_contiguous logic

ggml-ci

* ggml : support more contiguous cases

ggml-ci
2024-06-16 18:19:48 +03:00
k.h.lai 12638dfef0 vulkan: select only one device for single gpu with multiple drivers (llama/7582) 2024-06-16 18:19:48 +03:00
0cc4m f100b3b523 Update Vulkan RoPE implementation (llama/7818)
* Update Vulkan RoPE implementation

* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception

Minor fixes

* Fix segfault when running out of VRAM

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-16 18:19:48 +03:00
Johannes Gäßler a99e213a82 CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860) 2024-06-16 18:19:48 +03:00
Johannes Gäßler 7483d2b61c CUDA: use tensor cores for MMQ (llama/7676)
* CUDA: int8 tensor cores for MMQ (legacy quants)

* fix out-of-bounds writes

* __builtin_assume -> GGML_CUDA_ASSUME

* fix writeback returning too early
2024-06-16 18:19:48 +03:00
Ben Ashbaugh 1fe5948227 use the correct SYCL context for host USM allocations (llama/7777)
Signed-off-by: Ben Ashbaugh <ben.ashbaugh@intel.com>
2024-06-16 18:19:48 +03:00
Johannes Gäßler 760497e1ab CUDA: revise q8_1 data layout for mul_mat_q (llama/7824) 2024-06-16 18:19:48 +03:00
slaren b172e7714c vulkan : reuse parent extra for views (llama/7806)
* vulkan : reuse parent extra for views

* Fix validation error when multiple compute contexts are used in a graph

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-06-16 18:19:48 +03:00
pengxin99 dc01aadb18 fix softmax r2r result wrong issue (llama/7811) 2024-06-16 18:19:48 +03:00
Johannes Gäßler e08c62149b CUDA: refactor mmq, dmmv, mmvq (llama/7716)
* CUDA: refactor mmq, dmmv, mmvq

* fix out-of-bounds write

* struct for qk, qr, qi

* fix cmake build

* mmq_type_traits
2024-06-16 18:19:48 +03:00
Georgi Gerganov abab4500fa ggml : refactor rope norm/neox (llama/7634)
* ggml : unify rope norm/neox (CPU)

* ggml : fix compile warning

* ggml : remove GLM rope mode

ggml-ci

* metal : better rope implementation

ggml-ci

* cuda : better rope implementation

ggml-ci

* naming : n_orig_ctx -> n_ctx_orig

ggml-ci

* dev : add reminders to update backends

ggml-ci

* vulkan : fix ggml_rope_ext() usage

* cuda : fix array size + indents

ggml-ci
2024-06-16 18:19:48 +03:00
agray3 e666315fa8 Allow number of nodes in CUDA graph to change (llama/7738)
Previously the code would have failed to cope in the case that the
number of nodes changes in an existing CUDA graph. This fixes the
issue by removing an unnecessary conditional.
2024-06-16 18:19:48 +03:00