whisper.cpp

Commit Graph

Author	SHA1	Message	Date
PureJourney	b1385e9aa9	CUDA: correct the lowest Maxwell supported by CUDA 12 (llama/11984) * CUDA: correct the lowest Maxwell supported by CUDA 12 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-02-27 08:55:36 +02:00
Bodhi	48f5e893f5	MUSA: support ARM64 and enable dp4a etc. (llama/11843) * MUSA: support ARM64 and enable __dp4a etc. * fix cross entropy loss op for musa * update * add cc info log for musa * add comment for the MUSA .cc calculation block --------- Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>	2025-02-27 08:55:36 +02:00
Charles Xu	dc21871fcb	ggml-cpu: Add CPU backend support for KleidiAI library (llama/11390) * ggml-cpu: Add CPU backend support for KleidiAI library * Add environmental variable GGML_KLEIDIAI_SME * Add support for multithread LHS conversion * Switch kernel selection order to dotprod and i8mm * updates for review comments * More updates for review comments * Reorganize and rename KleidiAI files * Move ggml-cpu-traits.h to source file * Update cmake for SME build and add alignment for SME * Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list	2025-02-27 08:55:36 +02:00
Prashant Vithule	64a430bc81	ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (llama/11917) * Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file * Improved Formatting of code in ggml-cpu-quants.c file * style : minor fixes * style : less whitespaces * style : ptr spacing --------- Co-authored-by: vithulep <p.m.vithule1517@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-27 08:55:36 +02:00
Johannes Gäßler	51a3580c79	CUDA: use async data loading for FlashAttention (llama/11894) * CUDA: use async data loading for FlashAttention --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-27 08:55:36 +02:00
Rémy O	37a21dd43d	vulkan: implement several ops relevant for ggml_opt (llama/11769) * vulkan: support memset_tensor * vulkan: support GGML_OP_SUM * vulkan: implement GGML_OP_ARGMAX * vulkan: implement GGML_OP_SUB * vulkan: implement GGML_OP_COUNT_EQUAL * vulkan: implement GGML_OP_OPT_STEP_ADAMW * vulkan: fix check_results RWKV_WKV6 crash and memory leaks * vulkan: implement GGML_OP_REPEAT_BACK * tests: remove invalid test-backend-ops REPEAT_BACK tests * vulkan: fix COUNT_EQUAL memset using a fillBuffer command	2025-02-27 08:55:36 +02:00
Jeff Bolz	8a22a8b17f	vulkan: support multi/vision rope, and noncontiguous rope (llama/11902)	2025-02-27 08:55:36 +02:00
Hale Chan	fcbcad0c90	metal : fix the crash caused by the lack of residency set support on Intel Macs. (llama/11904)	2025-02-27 08:55:36 +02:00
Adrian Kretz	4444db7360	metal : optimize dequant q6_K kernel (llama/11892)	2025-02-27 08:55:36 +02:00
Georgi Gerganov	a7fc1038ca	repo : update links to new url (llama/11886) * repo : update links to new url ggml-ci * cont : more urls ggml-ci	2025-02-27 08:55:36 +02:00
Rémy O	1689aaf854	vulkan: initial support for IQ1_S and IQ1_M quantizations (llama/11528) * vulkan: initial support for IQ1_S and IQ1_M quantizations * vulkan: define MMV kernels for IQ1 quantizations * devops: increase timeout of Vulkan tests again * vulkan: simplify ifdef for init_iq_shmem	2025-02-27 08:55:36 +02:00
lhez	4b48fe449a	opencl: Fix rope and softmax (llama/11833) * opencl: fix `ROPE` * opencl: fix `SOFT_MAX` * Add fp16 variant * opencl: enforce subgroup size for `soft_max`	2025-02-27 08:55:36 +02:00
Diego Devesa	47cc043e69	cuda : add ampere to the list of default architectures (llama/11870)	2025-02-27 08:55:36 +02:00
Jinyang He	e3d9ffb98b	ggml: optimize some vec dot functions for LoongArch ASX (llama/11842) * Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX * Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX * Optimize mul_sum_i8_pairs_float for LoongArch ASX * Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX	2025-02-27 08:55:36 +02:00
Eve	e22d69839d	vulkan: linux builds + small subgroup size fixes (llama/11767) * mm subgroup size * upload vulkan x86 builds	2025-02-27 08:55:36 +02:00
Jeffrey Morgan	defe731263	llamafile: use member variable instead of constant for iq4nlt (llama/11780)	2025-02-27 08:55:36 +02:00
R0CKSTAR	4e07957bf9	musa: bump MUSA SDK version to rc3.1.1 (llama/11822) * musa: Update MUSA SDK version to rc3.1.1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: Remove workaround in PR #10042 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-02-27 08:55:36 +02:00
Diego Devesa	d2c5154bb5	ggml-cpu : add chunking support to mul_mat_id (llama/11666) * ggml-cpu : add chunking support to mul_mat_id * allocate chunk counter in wdata; parallelize src1 quantization by column to allow parallelization even when there is only one row * disable for arm * cleanup * better way to disable for arm * fix uninitialized counter when using 1 thread only * revert test-backend-ops changes	2025-02-27 08:55:36 +02:00
Xuan-Son Nguyen	4fac43fe00	ggml : x2 speed for WASM by optimizing SIMD (llama/11453) * ggml : x2 speed for WASM by optimizing SIMD * fix bad merging * rm trailing spaces * rm redundant clamp * better quantize_row_q8_K Co-authored-by: camel-cdr <camel-cdr@protonmail.com> * remove memset that causes buffer overflow Co-authored-by: camel-cdr <camel-cdr@protonmail.com> --------- Co-authored-by: camel-cdr <camel-cdr@protonmail.com>	2025-02-27 08:55:36 +02:00
uvos	3be9670f17	HIP: Remove GCN from list of devices that avoid MMQ (llama/11831)	2025-02-27 08:55:36 +02:00
uvos	86729fcd6d	HIP: Switch to std::vector in rocblas version check (llama/11820)	2025-02-27 08:55:36 +02:00
Richard	d597f83e1a	ggml : fix multi-threaded clamp_f32 (llama/11824) * Bug fix for clamp_f32 When using tensors larger than 1d clamp operation does not work due to the restriction of returning if ith is not 0. * Bug fix for clamp_f32 * Bug fix for clamp_f32	2025-02-27 08:55:36 +02:00
Weizhao Ouyang	e5edcc6259	ggml-cpu: Fix duplicate MATMUL_INT8 (llama/11817) Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>	2025-02-27 08:55:36 +02:00
Johannes Gäßler	556f773d53	CUDA: fix CUDART_VERSION checks (llama/11821)	2025-02-27 08:55:36 +02:00
Sheldon Robinson	91d02de332	Fix #11802 : Compile bug - RegQueryValueExA changed to RegQueryValueEx (llama/11803) * Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx * Fix #11802: PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string	2025-02-27 08:55:36 +02:00
Johannes Gäßler	1b67d72f87	CUDA: use arch list for compatibility check (llama/11775) * CUDA: use arch list for feature availability check --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-27 08:55:36 +02:00
Maxim Evtush	14d7c0368d	fix: typos in documentation files (llama/11791) * Update ggml.c * Update arg.cpp * Update speculative.h	2025-02-27 08:55:36 +02:00
Danny Milosavljevic	db6e19188a	vulkan: Make Vulkan optional at runtime (ggml/11493). (llama/11494) Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-02-27 08:55:36 +02:00
Wagner Bruna	b4b063a5c9	vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (llama/11592)	2025-02-27 08:55:36 +02:00
Jeff Bolz	930b739e7a	vulkan: account for lookup tables when checking shared memory size (llama/11502)	2025-02-27 08:55:36 +02:00
Karol Kontny	5981352bb5	ggml: Fix data race in ggml threadpool (llama/11736) After the barrier in the last iteration is executed, the loop termination condition is still evaluated. However, the main thread may already have destroyed the cgraph object and its nodes, so another thread would then access memory that is already gone. Trouble can also happen when n_nodes == 0 or abort is called, but I'm not sure if the prior situation is possible. The last synchronization should be done after the loop to ensure the cgraph/cplan won't be accessed after the main thread exits from the function.	2025-02-27 08:55:36 +02:00
Johannes Gäßler	7561da244e	CUDA: fix min. version for movmatrix (llama/11751)	2025-02-27 08:55:36 +02:00
Jeff Bolz	be83f342fb	vulkan: print shared memory size (llama/11719)	2025-02-27 08:55:36 +02:00
Akarshan Biswas	fd369871f7	SYCL: remove XMX info from print devices (llama/11712)	2025-02-27 08:55:36 +02:00
Jinyang He	bbd8364f5e	ggml : optimize and build warning fix for LoongArch (llama/11709) * ggml : optimize convert f32<->f16 for loongarch_asx * ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16 * ggml : Fix warnings when run cpu CI locally on LoongArch	2025-02-27 08:55:36 +02:00
Akarshan Biswas	e4102440ef	SYCL: Adjust support condition for norm operators (llama/11674) SYCL does not support non contiguous tensors for norm operations	2025-02-27 08:55:36 +02:00
junchao-zhao	f8242ec483	ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)	2025-02-27 08:55:36 +02:00
Jeff Bolz	ef51b4cba4	vulkan: optimize coopmat2 iq2/iq3 callbacks (llama/11521) * vulkan: optimize coopmat2 iq2/iq3 callbacks * build: trigger CI on GLSL compute shader changes	2025-02-27 08:55:36 +02:00
Rémy O	6f08b24146	vulkan: initial support for IQ4_XS quantization (llama/11501)	2025-02-27 08:55:36 +02:00
Jeff Bolz	7c165d7fa8	vulkan: use smaller combined allocations to avoid fragmentation (llama/11551)	2025-02-27 08:55:36 +02:00
Charles Duffy	2f0cf44915	metal : avoid breaking build when metal API predates TARGET_OS_VISION (llama/11690) Avoids breakage in nix flake build introduced by b0569130c5e9c671152c913d82803b7c2f014ff9	2025-02-27 08:55:36 +02:00
Georgi Gerganov	b9c972fd0d	metal : adjust support conditions for norm operators (llama/11671) cont #11659 ggml-ci	2025-02-27 08:55:36 +02:00
Johannes Gäßler	01c9aafbfd	CUDA: support for mat. mul. with ne03 != ne13 (llama/11656)	2025-02-27 08:55:36 +02:00
Johannes Gäßler	bae6bbf487	CUDA: non-contiguous (RMS) norm support (llama/11659) * CUDA: non-contiguous (RMS) norm support --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-27 08:55:36 +02:00
fxzjshm	c310272fa0	HIP: force max threads per block to be 1024 (llama/11621) Some old or vendor-forked versions of LLVM still use 256. Explicitly set it to 1024 to align with upstream LLVM. Signed-off-by: fxzjshm <fxzjshm@163.com>	2025-02-27 08:55:36 +02:00
Jhen-Jie Hong	bd0b55dbe0	metal : use residency set for other platforms (llama/11648)	2025-02-27 08:55:36 +02:00
Patrick Peng	ba4645db2c	rpc: fix known RCE in rpc-server (ggml/1103) Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes + Check if `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer’s allocated region.	2025-02-27 08:55:36 +02:00
midnight	46d07b9c85	cmake : fix compile assumptions for power9/etc (#2777) * Add small comment re: VSX to readme Co-authored-by: midnight <midnight@example.com>	2025-02-05 14:41:10 +02:00
Johannes Gäßler	dbeb7916b8	CUDA: fix Volta FlashAttention logic (llama/11615)	2025-02-03 22:00:57 +02:00
Johannes Gäßler	fad2806352	HIP: fix flash_attn_stream_k_fixup warning (llama/11604)	2025-02-03 22:00:57 +02:00
uvos	9906792ec3	CUDA/HIP: add support for selectable warp size to mmv (llama/11519)	2025-02-03 22:00:57 +02:00
uvos	c49ee07ff4	HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing cc architectures for AMD GPUs are not supersets of each other (llama/11601) This fixes a bug where RDNA1 GPUs other than gfx1010 were not handled correctly	2025-02-03 22:00:57 +02:00
Johannes Gäßler	f8a831779e	CUDA: use mma PTX instructions for FlashAttention (llama/11583) * CUDA: use mma PTX instructions for FlashAttention * __shfl_sync workaround for movmatrix * add __shfl_sync to HIP Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-03 22:00:57 +02:00
Olivier Chafik	85451e3612	`ci`: use sccache on windows instead of ccache (llama/11545) * Use sccache on ci for windows * Detect sccache in cmake	2025-02-03 22:00:57 +02:00
uvos	43c744ce8b	HIP: require at least HIP 5.5	2025-02-03 22:00:57 +02:00
uvos	fc2e44490d	HIP: Prepare reduction operators for wave 64	2025-02-03 22:00:57 +02:00
uvos	f41fdad200	CUDA/HIP: add warp_size to cuda_device_info	2025-02-03 22:00:57 +02:00
Rémy Oudompheng	80fa576254	vulkan: implement initial support for IQ2 and IQ3 quantizations (llama/11360) * vulkan: initial support for IQ3_S * vulkan: initial support for IQ3_XXS * vulkan: initial support for IQ2_XXS * vulkan: initial support for IQ2_XS * vulkan: optimize Q3_K by removing branches * vulkan: implement dequantize variants for coopmat2 * vulkan: initial support for IQ2_S * vulkan: vertically realign code * port failing dequant callbacks from mul_mm * Fix array length mismatches * vulkan: avoid using workgroup size before it is referenced * tests: increase timeout for Vulkan llvmpipe backend --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-02-03 22:00:57 +02:00
Jeff Bolz	75e7d0585e	vulkan: Catch pipeline creation failure and print an error message (llama/11436) * vulkan: Catch pipeline creation failure and print an error message Also, fix some warnings from my on-demand compile change. * vulkan: fix pipeline creation logging	2025-02-03 22:00:57 +02:00
uvos	682a6f5f87	HIP: Suppress transformation warning in softmax.cu: loops with bounds not known at compile time cannot be unrolled. When ncols_template == 0, the bounds of the loop are not constexpr, thus LLVM can't unroll the loops here.	2025-02-03 22:00:57 +02:00
Nikita Sarychev	115716d109	HIP: Only call rocblas_initialize on rocblas versions with the multiple instantiation bug (llama/11080) This disables the workaround on rocblas fixed versions (>=4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.	2025-02-03 22:00:57 +02:00
someone13574	b2cfef655b	cmake : don't fail on `GGML_CPU=OFF` (llama/11457)	2025-02-03 22:00:57 +02:00
Akarshan Biswas	22e3df0afa	SYCL : SOFTMAX F16 mask support and other fixes (llama/11261) Implemented ggml_sycl_op_soft_max() F16 src1(mask) support for which a pragma deprecation warning was added during #5021. To do this, had to decouple it from ggml_sycl_op_flatten which always considered src1 to be of fp32 type(many OP functions are dependent on it). * SYCL: SOFTMAX F16 mask support and other fixes * test-backend-ops: Add F16 mask test cases	2025-02-03 22:00:57 +02:00
Haus1	028511d349	AMD: parse the architecture as supplied by gcnArchName (llama/11244) The value provided by minor doesn't include stepping for AMD, parse the value returned by gcnArchName instead to retrieve an accurate ID.	2025-02-03 22:00:57 +02:00
Ihar Hrachyshka	70c4038842	metal: Handle null returned from MTLCreateSystemDefaultDevice() (llama/11441) This fixes segmentation fault error when running tests when no metal devices are available (for example, when not linked with Core Graphics framework or otherwise).	2025-02-03 22:00:57 +02:00
Georgi Gerganov	8639c003a9	metal : use residency sets (llama/11427) * metal : use residency sets ggml-ci * metal : restore commandBufferWithUnretainedReferences calls [no ci] * metal : release descriptors ggml-ci * metal : check env GGML_METAL_NO_RESIDENCY ggml-ci * metal : fix build + clean-up ggml-ci	2025-02-03 22:00:57 +02:00
bandoti	d5d831da65	cmake: add ggml find package (llama/11369) * Add initial ggml cmake package * Add build numbers to ggml find-package * Expand variables with GGML_ prefix * Guard against adding to cache variable twice * Add git to msys2 workflow * Handle ggml-cpu-* variants * Link ggml/ggml-base libraries to their targets * Replace main-cmake-pkg with simple-cmake-pkg * Interface features require c_std_90 * Fix typo * Removed unnecessary bracket from status message * Update examples/simple-cmake-pkg/README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update examples/simple-cmake-pkg/README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-03 22:00:57 +02:00
Jeff Bolz	7230a6e1c8	vulkan: compile shaders on-demand (llama/11406) Reduce first-run startup time and memory consumption. Should fix #11339.	2025-02-03 22:00:57 +02:00
uvos	a160fa0f3a	HIP: disable VMM on HIP as it seems that it doesn't work in some configurations (llama/11420)	2025-02-03 22:00:57 +02:00
uvos	0282ad8fd1	hip : Add hipGraph and VMM support to ROCM (llama/11362) * Add hipGraph support * Enable VMM on rocm	2025-02-03 22:00:57 +02:00
Johannes Gäßler	9e467815d4	CUDA: fix FP16 cuBLAS GEMM (llama/11396)	2025-02-03 22:00:57 +02:00
uvos	727891d9bf	rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (llama/11356)	2025-02-03 22:00:57 +02:00
Johannes Gäßler	c262dc80e2	CPU/CUDA: fix (GQA) mul mat back, add CUDA support (llama/11380)	2025-02-03 22:00:57 +02:00
amd-dwang	16eeb31933	Vulkan-run-test: fix mmq_wg_denoms (llama/11343) There appears to be a copy-and-paste error here. mmq_wg_denoms should be used together with warptile_mmq, instead of wg_denoms.	2025-02-03 22:00:57 +02:00
Jeff Bolz	ba523d5e22	vulkan: sort shaders for more deterministic binary (llama/11315) Fixes #11306.	2025-02-03 22:00:57 +02:00
Jeff Bolz	3736706139	vulkan: fix diag_mask_inf (llama/11323) With robustbufferaccess disabled, this shader was showing OOB stores. There is a bounds check in the code, but the workgroup dimensions were reversed vs CUDA and it was running the wrong number of threads. So fix the workgroup dimensions and disable robustness for this pipeline.	2025-02-03 22:00:57 +02:00
Radoslav Gerganov	58640aa456	rpc : better caching of the base buffer pointer (llama/11331) There is no need to use map, just store the base pointer in the buffer context.	2025-02-03 22:00:57 +02:00
Georgi Gerganov	5183a05e56	metal : fix out-of-bounds write (llama/11314) ggml-ci	2025-02-03 22:00:57 +02:00
Jeff Bolz	0dcada42d4	vulkan: fix coopmat2 validation failures (llama/11284) mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert; this is not in the inner loop and is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16. coopmat2 requires SPIR-V 1.6 (related to using LocalSizeId). LocalSizeId requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.	2025-02-03 22:00:57 +02:00
Nicolò Scipione	d507b4cebe	SYCL: Introducing memory host pool (llama/11251) * Implement host pool for matrix_info Creating a new memory pool on the host to store memory location for matrix_info needed to launch gemm_batch from oneMKL/oneMath. Removing complex support in gemm_batch since it is not used in llama.cpp * Remove unnecessary headers and cast * Reorder member variable to avoid warning on initialization * Formatting * Remove unused variable * Address PR review feedback - remove warning --------- Signed-off-by: nscipione <nicolo.scipione@codeplay.com>	2025-02-03 22:00:57 +02:00
Georgi Gerganov	90171055f3	cmake : add sanitizer flags for llama.cpp (llama/11279) * cmake : add sanitizer flags for llama.cpp ggml-ci * tests : fix compile warnings ggml-ci * cmake : move sanitizer flags to llama_add_compile_flags ggml-ci * cmake : move llama.cpp compile flags to top level lists ggml-ci * cmake : apply only sanitizer flags at top level ggml-ci * tests : fix gguf context use in same_tensor_data * gguf-test: tensor data comparison * dummy : trigger ggml-ci * unicode : silence gcc warnings ggml-ci * ci : use sanitizer builds only in Debug mode ggml-ci * cmake : add status messages [no ci] --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-02-03 22:00:57 +02:00
Jeff Bolz	668306ff2b	vulkan: fix coopmat2 flash attention for non-contiguous inputs (llama/11281) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes #11268.	2025-02-03 22:00:57 +02:00
Radoslav Gerganov	fdc21fc87b	rpc : early register backend devices (llama/11262) Early register RPC devices and do not propagate RPC specifics in the llama model structures. ref: #10609	2025-02-03 22:00:57 +02:00
Jeff Bolz	7183a1eb72	vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (llama/11166) * vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl Shaders are based on cpy.cu. * vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32 * ggml: copy q->f32 assumes some contiguity in the destination	2025-02-03 22:00:57 +02:00
Jeff Bolz	09f3c66648	vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (llama/11206) Do masking on whole dwords, fetch all scales at once.	2025-02-03 22:00:57 +02:00
Jeff Bolz	62e2414620	vulkan: optimize coopmat2 q2_k dequant function (llama/11130)	2025-02-03 22:00:57 +02:00
Johannes Gäßler	de49024e49	CUDA: backwards pass for misc. ops, add tests (llama/11257) * CUDA: backwards pass for misc. ops, add tests * remove restrict from pointers	2025-02-03 22:00:57 +02:00
fj-y-saito	db6383094c	ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot (llama/11227) * Add SVE support for q4_K_q8_K * Update ggml/src/ggml-cpu/ggml-cpu-quants.c change to use K_SCALE_SIZE Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-03 22:00:57 +02:00
Eve	164f13c6a9	vulkan: scale caching for k quants + misc fixes (llama/11081) * q6_k scale caching * 16 bit unpack * q4_k test (slow) * revert it * q3_k * q2_k * little stuff * try precalculating products of a and q2_k scales * Revert "try precalculating products of a and q2_k scales" This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b. * unpack should be u16, add vim swap to gitignore (about time) * better q4_k scales * q5_k * better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations * q2_k better dequant * q3_k optimizations * q3_k use hmask simd from cpu avx version * make the caches happy * q3_k separate out calculation * q2_k separate out * little stuff * use calc_superblock everywhere * q2_k optimize scale calculation * more barriers	2025-02-03 22:00:57 +02:00
Junil Kim	02aa86230a	fix: ggml: fix vulkan-shaders-gen build (llama/10448) * fix: ggml: fix vulkan-shaders-gen build The vulkan-shaders-gen target was not being built correctly in case of cross-compilation. Other outputs need to be built for the cross compile target, but vulkan-shaders-gen needs to be built for the host. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup - Add GGML_SHADERS_GEN_TOOLCHAIN CMake option. - Auto-detect host toolchain if not set. * refactor: ggml: Improve vulkan-shaders-gen toolchain setup Use configure_file to generate host_toolchain.cmake from template * fix: ggml: Fix compile error Fix compile error not finding vulkan-shaders-gen * fix: vulkan-shaders-gen build and path handling Fix build issues with vulkan-shaders-gen: - Add target dependency for correct build order - Use CMAKE_HOST_SYSTEM_NAME for executable suffix - Fix MSVC output directory in host toolchain - Normalize path handling for cross-compilation * fix: improve host compiler detection in vulkan shader build Improve host compiler detection for vulkan shader generation: - Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches - Consolidate compiler detection logic - Fix Windows-specific MSVC detection - Ensure correct compiler search in cross-compilation * refactor: Simplify CMake function for detecting host compiler Simplified the CMake function to improve the process of detecting the host compiler. * fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt Since `vulkan-shader-gen.cpp` only requires the `glslc` executable and not the Vulkan headers or libraries, CMakeLists.txt needs to be corrected. (See: ecc93d0558fc3ecb8a5af69d2ece02fae4710ade) * refactor: Rename host_toolchain.cmake.in - Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in * refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN	2025-02-03 22:00:57 +02:00
Johannes Gäßler	54a2ee648f	RoPE: fix back, CUDA support for back + noncont. (llama/11240) * RoPE: fix back, CUDA support for back + noncont. * fix comments reg. non-cont. RoPE support [no-ci]	2025-02-03 22:00:57 +02:00
Akarshan Biswas	9700cfb0a3	SYCL: Add gated linear attention kernel (llama/11175) * SYCL: Add Gated Linear attention kernel * glahpp: add a space at the end of file * gla: Put the barrier inside the main logic loop	2025-02-03 22:00:57 +02:00
William Tambellini	8e0143e205	ggml : add option to not print stack on abort (ggml/1081) * Add option to not print stack on abort Add option/envvar to disable stack printing on abort. Also link some unittests with Threads to fix link errors on ubuntu/g++11. * Update ggml/src/ggml.c --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-03 22:00:57 +02:00
issixx	f12559d590	ggml-cpu : fix ggml_graph_compute_thread did not terminate on abort. (ggml/1065) some threads kept looping and failed to terminate properly after an abort during CPU execution. Co-authored-by: issi <issi@gmail.com>	2025-02-03 22:00:57 +02:00
Johannes Gäßler	d5ef1737d8	GGUF: C++ refactor, backend support, misc fixes (skip) (llama/11030) ggml-ci	2025-01-14 10:38:01 +02:00
lhez	1deb41f0e7	ggml : add opencl backend (skip) (llama/10693) --------- Co-authored-by: Skyler Szot <quic_sszot@quicinc.com> Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com> Co-authored-by: Alexander Angus <quic_aangus@quicinc.com> Co-authored-by: Hongqiang Wang <quic_wangh@quicinc.com> Co-authored-by: Max Krasnyansky <quic_maxk@quicinc.com>	2025-01-14 10:38:01 +02:00
Andreas Kieslinger	2425caf4fd	cuda : CUDA Graph Compute Function Refactor (precursor for performance improvements) (llama/11042) * Refactor: Moves cuda graph executable update step to separate function. * Refactor: Moves cuda graph update check to separate function. * Refactor: Moves cuda graph maintenance (update or adjusting copy parameters) to separate function for improved readability. * Fix: Adds missing reference to maintain_cuda_graph() definition. * Refactor: Improves structure and abstractions by moving CUDA graph evaluation and capture to its own function. * Refactor: Moves node graph checks and copy ops into individual function for improved readability. * Refactor: Removes code permanently excluded from compilation to increase readability. * Style: Adds missing newline * Style: Consolidates several neighboring '#ifdef USE_CUDA_GRAPH' into a single one * Refactor: Makes 'cuda_graph_update_required' a local variable * remove double lines between functions --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-01-14 10:38:01 +02:00
Radoslav Gerganov	a4b00bcaaf	ggml : do not define GGML_USE_CUDA when building with GGML_BACKEND_DL (llama/11211) Build fails when using HIP and GGML_BACKEND_DL: ``` /usr/bin/ld: ../ggml/src/libggml.so: undefined reference to `ggml_backend_cuda_reg' collect2: error: ld returned 1 exit status ``` This patch fixes this.	2025-01-14 10:38:01 +02:00
0cc4m	cdb8aa2f2e	Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (llama/11161) * Vulkan: Remove float16 use in shaders * Fix validation error about subgroup_size_control extension	2025-01-14 10:38:01 +02:00
Molly Sophia	06209f6683	llama: add support for QRWKV6 model architecture (llama/11001) * WIP: Add support for RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Some graph simplification Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add support for RWKV6Qwen2 with cpu and cuda GLA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix some typos Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix wkv test & add gla test Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix cuda warning Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update README.md Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update ggml/src/ggml-cuda/gla.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix fused lerp weights loading with RWKV6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * better sanity check skipping for QRWKV6 in llama-quant thanks @compilade Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: compilade <git@compilade.net> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2025-01-14 10:38:01 +02:00
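
Note on d597f83e1a (multi-threaded clamp_f32): the commit message says the clamp operation failed on tensors larger than 1d because the function returned whenever `ith` was not 0. The sketch below is not the ggml kernel. It is a minimal illustration of row-strided work splitting, assuming each of `nth` threads (with `nth >= 1`) owns rows `ith`, `ith + nth`, and so on. The helper name `clamp_rows` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper, not the ggml implementation: clamp a row-major
// matrix in place. Thread `ith` of `nth` handles rows ith, ith + nth, ...
// Assumes nth >= 1.
static void clamp_rows(float * data, size_t nrows, size_t ncols,
                       float lo, float hi, size_t ith, size_t nth) {
    // No early return for ith != 0: with a strided split, every thread
    // must process its own share of the rows or those rows stay unclamped.
    for (size_t r = ith; r < nrows; r += nth) {
        float * row = data + r * ncols;
        for (size_t c = 0; c < ncols; ++c) {
            row[c] = std::min(std::max(row[c], lo), hi);
        }
    }
}
```

Calling `clamp_rows` once per thread index covers every row exactly once. If only thread 0 did any work, as the commit message describes, rows owned by the other threads would be left untouched.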


594 Commits
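
Note on ba4645db2c (rpc-server bounds check): the commit message describes checking that `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer's allocated region. The sketch below shows one way such a check could be written. It is an illustration under assumptions, not the code from the commit: `buffer_region` and `write_fits` are hypothetical names, and the region is modeled as a plain base pointer plus a size in bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a destination buffer: base address and size in bytes.
struct buffer_region {
    const void * base;
    size_t       size;
};

// Returns true when writing `nbytes` bytes starting at `dst` stays inside `region`.
// Addresses are compared as integers, and the final comparison uses subtraction
// so that a huge `nbytes` cannot wrap around and pass the check.
static bool write_fits(const buffer_region & region, const void * dst, size_t nbytes) {
    const uintptr_t base = reinterpret_cast<uintptr_t>(region.base);
    const uintptr_t p    = reinterpret_cast<uintptr_t>(dst);
    if (p < base) {
        return false; // destination starts before the region
    }
    const uintptr_t offset = p - base;
    if (offset > region.size) {
        return false; // destination starts past the end of the region
    }
    return nbytes <= region.size - offset;
}
```

A caller would refuse the copy when `write_fits` returns false. Writing the test as `nbytes <= size - offset` rather than `offset + nbytes <= size` is a deliberate choice here, since the addition could overflow for attacker-controlled sizes.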