whisper.cpp/ggml
shalinib-ibm 0630539c8a
ggml : Enable MMA for BF16 in llamafile_sgemm (llama/13148)
This patch upstreams llamafile's cpu matrix multiplication kernels for ppc64le using MMA builtins for BF16 data type.

This change results in 9x - 40x gains
in total speed S t/s (ie all tokens/total time), across various batch sizes tested using llama-batched-bench benchmark.

The patch is tested with Meta-Lllama-3-8B,
and Mistral-7B models (BF16 models generated by using llama-quantize from corresponding FP32 models) on an IBM POWER10 machine.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2025-05-07 13:17:41 +03:00
..
cmake ggml : sync/merge cmake,riscv,powerpc, add common.cmake (ggml/0) 2025-03-27 11:06:03 +02:00
include CUDA: fix q_nope_absorbed prec for DS 2 Lite f16 (llama/13137) 2025-05-01 13:29:02 +03:00
src ggml : Enable MMA for BF16 in llamafile_sgemm (llama/13148) 2025-05-07 13:17:41 +03:00
.gitignore whisper : reorganize source code + improve CMake (#2256) 2024-06-26 19:34:09 +03:00
CMakeLists.txt whisper: remove MSVC warnings pragmas (#3090) 2025-05-05 13:09:35 +02:00