hexagon: improved Op queuing, buffer and cache management (llama/21705)

* hexagon: introduce op request batching and rewrite buffer management

The host now prepares batches of requests and dispatches them via a single dspqueue message.

Buffers are mapped explicitly by NPU while processing batches.

* hex-dma: disable l2 bypass to work around a new issue caused by no flushes between Ops

* hex-utils: add explicit l2flush and l2clear helpers

* hex-opreq: use fine-grain per tensor l2 management

* hex-opreq: avoid redundant invalidates for tensors we already flushed

* hex-opreq: update debug messages

* htp-opreq: reuse ops_context

* hex-opreq: do not flush or invalidate cache lines beyond buffer boundary

* hex-opreq: fix errors in log message

* Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundary"

This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.

* hexagon: limit l2 flushes to 1MB which covers l2 cache

* hex-opreq: limit cache flush to 4MB

Looks like 4MB of contiguous virtual space should cover the 1MB cache.

* hexagon: drop cache flush size to 2MB

* hex-opreq: start reworking opreq packing

* hex-opreq: introduce new way of packing opbatch where tensors are stored separately

* hex-opreq: add a simple fastrpc call to force unmap all buffers

* hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size

* hex-opreq: bump opreq batch size to 256

* hex-mm: place src1 spad at the top of vtcm for easy reuse

* hex-ops: introduce internal types and disable src1 reuse for now

Nothing new just formalizing the repack / qyn.quant types we've been using.

* htp-opreq: use tensor pointers instead of copies

* hex-opreq: introduce more robust way for tracking vtcm/spad reuse

This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.

* hex-cumsum: fix error post opreq merge

* hex-opreq: move request batch handling into the session

Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.

* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx

* hex-bufs: introduce pinned mappings and use non-pinned ones for model buffers

* hex-buf: add support for allocating shared/pinned buffer for opreqs

* hex-opbatch: make opbatches configurable

* hex-naming: better name for ggml_hexagon_shared_buffer

* hex-naming: add session->c_name() helper

* hex-opbatch: start using shm but still copy for now

* hex-opbatch: use shared buffer for packing opbatch

* hex-opbatch: better naming for opbatch related classes and code

* hex-opbatch: reuse batched tensors with same data/dims/strides

* hex-opbatch: update logging

* hex-opbatch: add support for vmem limit for op batching

* hex-opbatch: update htp side to properly support dynamic mmap/unmap

* hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing

* hex-opbatch: fixed src1 handling in act ops

* hex-act: fix empty src1 handling in swiglu and friends

Simplify preamble macro while at it

* hex-mm: minor fix vtcm and dma handling in matmul

cleaning up some left-overs from merges

* hex-opbatch: allocate extra 1KB for dspqueue overhead

* hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc

* hex-mm: properly handle hmx_disabled flag

* hex-ops: update comments

* hex-ops: add debug output for get/set-rows

* hex-mmap: optimize un/mapping of buffers

* hex-opreq: global cache flush and invalidate beyond 128KB threshold

* hex-ops: add super simple opfilter regex for debugging

If an Op matches the regex, the hex backend will reject it.

* hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future

* hexagon: improved vtcm acquisition to remove inter-op overhead

Fully compatible with QNN-HTP coex

* hex-mm: fixed hvx fallback path

* hex-mm: lower the vmem threshold a bit further to ~3GB

* hexagon: update debug & error logs

This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointers to distinguish between buffer types.

* hexagon: move ops context into main context

Just a cleanup. We don't need separate contexts at this point.

* hex-opbatch: cleanup naming and headers for opbatch and related descriptors

* hex-fa: it's now better to enable FA during TG to reduce graph splits

* hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var

It's no longer useful. Please use the more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.

* hexagon: fixed editorconfig check

* Update ggml/src/ggml-hexagon/ggml-hexagon.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Max Krasnyansky 2026-04-10 15:47:43 -07:00 committed by Georgi Gerganov
parent 2580cfc703
commit 28ce072f59
24 changed files with 1786 additions and 2595 deletions
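
The commit message describes op batches in which tensors are stored separately from the ops and reused when data, dims and strides match. The following is a minimal sketch of that idea only, not the structures from this commit: every name prefixed with `sketch_` and both capacity constants are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_TENSORS 8   /* hypothetical capacity */
#define SKETCH_MAX_OPS     4   /* hypothetical capacity */

/* Hypothetical tensor descriptor: address plus dims and strides. */
struct sketch_tensor {
    uint64_t data;
    uint32_t ne[4];
    uint32_t nb[4];
};

/* Hypothetical op descriptor: refers to tensors by index, -1 means unused. */
struct sketch_op {
    uint32_t opcode;
    int32_t  src[2];
    int32_t  dst;
};

/* Tensors and ops live in separate tables inside one batch. */
struct sketch_batch {
    uint32_t             n_tensors;
    uint32_t             n_ops;
    struct sketch_tensor tensors[SKETCH_MAX_TENSORS];
    struct sketch_op     ops[SKETCH_MAX_OPS];
};

static int sketch_tensor_same(const struct sketch_tensor * a, const struct sketch_tensor * b) {
    return a->data == b->data &&
           memcmp(a->ne, b->ne, sizeof(a->ne)) == 0 &&
           memcmp(a->nb, b->nb, sizeof(a->nb)) == 0;
}

/* Returns the index of an identical tensor already in the batch,
 * appends a new entry otherwise, and returns -1 for a missing tensor. */
static int32_t sketch_batch_add_tensor(struct sketch_batch * b, const struct sketch_tensor * t) {
    if (!t) {
        return -1;
    }
    for (uint32_t i = 0; i < b->n_tensors; i++) {
        if (sketch_tensor_same(&b->tensors[i], t)) {
            return (int32_t) i;
        }
    }
    assert(b->n_tensors < SKETCH_MAX_TENSORS);
    b->tensors[b->n_tensors] = *t;
    return (int32_t) b->n_tensors++;
}

/* Appends one op. Returns 0 on success, or -1 when the batch may not have
 * room, in which case a caller would dispatch the batch and start a new one. */
static int sketch_batch_add_op(struct sketch_batch * b, uint32_t opcode,
                               const struct sketch_tensor * src0,
                               const struct sketch_tensor * src1,
                               const struct sketch_tensor * dst) {
    if (b->n_ops == SKETCH_MAX_OPS || b->n_tensors + 3 > SKETCH_MAX_TENSORS) {
        return -1;
    }
    struct sketch_op * op = &b->ops[b->n_ops++];
    op->opcode = opcode;
    op->src[0] = sketch_batch_add_tensor(b, src0);
    op->src[1] = sketch_batch_add_tensor(b, src1);
    op->dst    = sketch_batch_add_tensor(b, dst);
    return 0;
}
```

In this sketch, when one op's output feeds the next op, both ops end up pointing at the same tensor slot, so the batch carries each descriptor once.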

File diff suppressed because it is too large

@ -14,59 +14,42 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
#define htp_act_preamble3 \
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
const uint32_t ne02 = src0->ne[2]; \
const uint32_t ne03 = src0->ne[3]; \
\
const uint32_t ne10 = src1->ne[0]; \
const uint32_t ne11 = src1->ne[1]; \
const uint32_t ne12 = src1->ne[2]; \
const uint32_t ne13 = src1->ne[3]; \
\
const uint32_t ne0 = dst->ne[0]; \
const uint32_t ne1 = dst->ne[1]; \
const uint32_t ne2 = dst->ne[2]; \
const uint32_t ne3 = dst->ne[3]; \
\
const uint32_t nb00 = src0->nb[0]; \
const uint32_t nb01 = src0->nb[1]; \
const uint32_t nb02 = src0->nb[2]; \
const uint32_t nb03 = src0->nb[3]; \
\
const uint32_t nb10 = src1->nb[0]; \
const uint32_t nb11 = src1->nb[1]; \
const uint32_t nb12 = src1->nb[2]; \
const uint32_t nb13 = src1->nb[3]; \
\
const uint32_t nb0 = dst->nb[0]; \
const uint32_t nb1 = dst->nb[1]; \
const uint32_t nb2 = dst->nb[2]; \
const uint32_t nb3 = dst->nb[3];
#define htp_act_preamble2 \
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
const uint32_t ne02 = src0->ne[2]; \
const uint32_t ne03 = src0->ne[3]; \
\
const uint32_t ne0 = dst->ne[0]; \
const uint32_t ne1 = dst->ne[1]; \
const uint32_t ne2 = dst->ne[2]; \
const uint32_t ne3 = dst->ne[3]; \
\
const uint32_t nb00 = src0->nb[0]; \
const uint32_t nb01 = src0->nb[1]; \
const uint32_t nb02 = src0->nb[2]; \
const uint32_t nb03 = src0->nb[3]; \
\
const uint32_t nb0 = dst->nb[0]; \
const uint32_t nb1 = dst->nb[1]; \
const uint32_t nb2 = dst->nb[2]; \
#define htp_act_preamble \
const struct htp_tensor * src0 = actx->octx->src[0]; \
const struct htp_tensor * src1 = actx->octx->src[1]; \
const struct htp_tensor * dst = actx->octx->dst; \
\
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
const uint32_t ne02 = src0->ne[2]; \
const uint32_t ne03 = src0->ne[3]; \
\
const uint32_t nb00 = src0->nb[0]; \
const uint32_t nb01 = src0->nb[1]; \
const uint32_t nb02 = src0->nb[2]; \
const uint32_t nb03 = src0->nb[3]; \
\
const uint32_t ne10 = src1 ? src1->ne[0] : 0; \
const uint32_t ne11 = src1 ? src1->ne[1] : 0; \
const uint32_t ne12 = src1 ? src1->ne[2] : 0; \
const uint32_t ne13 = src1 ? src1->ne[3] : 0; \
\
const uint32_t nb10 = src1 ? src1->nb[0] : 0; \
const uint32_t nb11 = src1 ? src1->nb[1] : 0; \
const uint32_t nb12 = src1 ? src1->nb[2] : 0; \
const uint32_t nb13 = src1 ? src1->nb[3] : 0; \
\
const uint32_t ne0 = dst->ne[0]; \
const uint32_t ne1 = dst->ne[1]; \
const uint32_t ne2 = dst->ne[2]; \
const uint32_t ne3 = dst->ne[3]; \
\
const uint32_t nb0 = dst->nb[0]; \
const uint32_t nb1 = dst->nb[1]; \
const uint32_t nb2 = dst->nb[2]; \
const uint32_t nb3 = dst->nb[3];
struct htp_act_context {
@ -97,10 +80,7 @@ struct htp_act_context {
static void glu_swiglu_f32_per_thread(unsigned int nth, unsigned int ith, void * data) {
struct htp_act_context * actx = (struct htp_act_context *) data;
const struct htp_tensor * src0 = &actx->octx->src0;
const struct htp_tensor * src1 = &actx->octx->src1;
const struct htp_tensor * dst = &actx->octx->dst;
htp_act_preamble3;
htp_act_preamble;
size_t src0_row_size = actx->src0_row_size;
size_t src1_row_size = actx->src1_row_size;
@ -207,10 +187,7 @@ static void glu_swiglu_f32_per_thread(unsigned int nth, unsigned int ith, void *
static void glu_swiglu_oai_f32_per_thread(unsigned int nth, unsigned int ith, void * data) {
struct htp_act_context * actx = (struct htp_act_context *) data;
const struct htp_tensor * src0 = &actx->octx->src0;
const struct htp_tensor * src1 = &actx->octx->src1;
const struct htp_tensor * dst = &actx->octx->dst;
htp_act_preamble3;
htp_act_preamble;
uint64_t t1, t2;
t1 = HAP_perf_get_qtimer_count();
@ -332,9 +309,7 @@ static void glu_swiglu_oai_f32_per_thread(unsigned int nth, unsigned int ith, vo
static void unary_gelu_f32_per_thread(unsigned int nth, unsigned int ith, void * data) {
struct htp_act_context * actx = (struct htp_act_context *) data;
const struct htp_tensor * src0 = &actx->octx->src0;
const struct htp_tensor * dst = &actx->octx->dst;
htp_act_preamble2;
htp_act_preamble;
uint64_t t1, t2;
t1 = HAP_perf_get_qtimer_count();
@ -433,9 +408,7 @@ static void unary_gelu_f32_per_thread(unsigned int nth, unsigned int ith, void *
static void unary_silu_f32_per_thread(unsigned int nth, unsigned int ith, void * data) {
struct htp_act_context * actx = (struct htp_act_context *) data;
const struct htp_tensor * src0 = &actx->octx->src0;
const struct htp_tensor * dst = &actx->octx->dst;
htp_act_preamble2;
htp_act_preamble;
uint64_t t1, t2;
t1 = HAP_perf_get_qtimer_count();
@ -533,10 +506,7 @@ static const float SQRT_2_OVER_PI = 0.79788456080286535587989211986876f;
static void glu_geglu_f32_per_thread(unsigned int nth, unsigned int ith, void * data) {
struct htp_act_context * actx = (struct htp_act_context *) data;
const struct htp_tensor * src0 = &actx->octx->src0;
const struct htp_tensor * src1 = &actx->octx->src1;
const struct htp_tensor * dst = &actx->octx->dst;
htp_act_preamble3;
htp_act_preamble;
size_t src0_row_size = actx->src0_row_size;
size_t src1_row_size = actx->src1_row_size;
@ -652,9 +622,9 @@ static void glu_geglu_f32_per_thread(unsigned int nth, unsigned int ith, void *
}
static int execute_op_activations_f32(struct htp_ops_context * octx) {
const struct htp_tensor * src0 = &octx->src0;
const struct htp_tensor * src1 = &octx->src1;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * dst = octx->dst;
if (((src0->ne[0] * SIZEOF_FP32) != src0->nb[1]) || ((dst->ne[0] * SIZEOF_FP32) != dst->nb[1])) {
FARF(ERROR, "Non-contiguous tensors are not supported at this time \n");
@ -697,25 +667,20 @@ static int execute_op_activations_f32(struct htp_ops_context * octx) {
const uint32_t n_threads = MIN(octx->n_threads, src0_nrows);
size_t src0_row_size = src0->nb[1];
size_t src1_row_size = src1->nb[1]; // zero bytes if src1 is not used
size_t src1_row_size = src1 ? src1->nb[1] : src0->nb[1];
size_t dst_row_size = dst->nb[1];
const bool src1_valid = src1->ne[0];
if (!src1_valid) {
src1_row_size = src0_row_size;
}
const size_t src0_row_size_aligned = hex_round_up(src0_row_size, VLEN);
const size_t src1_row_size_aligned = hex_round_up(src1_row_size, VLEN);
const size_t dst_row_size_aligned = hex_round_up(dst_row_size, VLEN);
// VTCM scratchpads for all tensors
// N rows per thread, padded to HVX vector size
size_t spad_size_per_row = (src0_row_size_aligned + src1_row_size_aligned) + dst_row_size_aligned;
size_t vtcm_row_per_thread = (octx->ctx->vtcm_size)/ (n_threads* spad_size_per_row);
// Make sure the reserved vtcm size is sufficient
if(vtcm_row_per_thread ==0){
if (vtcm_row_per_thread == 0) {
FARF(ERROR, "act-%s : current VTCM reservation %zu is too small for even 1 row per thread, needed at least %zu\n", op_type, octx->ctx->vtcm_size,
spad_size_per_row * n_threads);
return HTP_STATUS_VTCM_TOO_SMALL;
@ -733,7 +698,11 @@ static int execute_op_activations_f32(struct htp_ops_context * octx) {
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->dst_spad.data = octx->src1_spad.data + octx->src1_spad.size;
if (src1->ne[0]) {
octx->src0_spad.src = NULL;
octx->src1_spad.src = NULL;
octx->dst_spad.src = NULL;
if (src1) {
FARF(HIGH, "%s: %ux%ux%ux%u x %ux%ux%ux%u -> %ux%ux%ux%u : src0-spad-size %u src1-spad-size %u dst-spad-size %u\n",
op_type, src0->ne[0], src0->ne[1], src0->ne[2], src0->ne[3], src1->ne[0], src1->ne[1], src1->ne[2],
src1->ne[3], dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3], octx->src0_spad.size, octx->src1_spad.size,
@ -773,9 +742,9 @@ static int execute_op_activations_f32(struct htp_ops_context * octx) {
// Pointers and GLU logic
const uint8_t * data_src0 = (const uint8_t *) src0->data;
const uint8_t * data_src1 = (const uint8_t *) src1->data;
const uint8_t * data_src1 = src1 ? (const uint8_t *) src1->data : NULL;
if (!src1_valid && (octx->op == HTP_OP_GLU_SWIGLU || octx->op == HTP_OP_GLU_SWIGLU_OAI || octx->op == HTP_OP_GLU_GEGLU)) {
if (!src1 && (octx->op == HTP_OP_GLU_SWIGLU || octx->op == HTP_OP_GLU_SWIGLU_OAI || octx->op == HTP_OP_GLU_GEGLU)) {
const int32_t swapped = octx->op_params[1];
data_src1 = data_src0;
actx.src1_row_size = actx.src0_row_size;
@ -799,7 +768,7 @@ static int execute_op_activations_f32(struct htp_ops_context * octx) {
int op_activations(struct htp_ops_context * octx) {
int err = HTP_STATUS_OK;
switch (octx->src0.type) {
switch (octx->src[0]->type) {
case HTP_TYPE_F32:
err = execute_op_activations_f32(octx);
break;
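
The hunks above fold the two- and three-tensor preambles into a single `htp_act_preamble` that reads tensors through `octx->src[]` pointers and treats a NULL `src1` as absent. A reduced sketch of that pattern follows. The `sketch_` types are simplified stand-ins (assumptions), not the real `htp_tensor` or `htp_ops_context`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the real tensor and ops context (assumptions). */
struct sketch_tensor {
    uint32_t ne[4];
    uint32_t nb[4];
};

struct sketch_octx {
    const struct sketch_tensor * src[2];  /* src[1] is NULL for unary ops */
    const struct sketch_tensor * dst;
};

/* One preamble for unary and GLU ops: src1 fields default to 0 when absent. */
#define sketch_act_preamble                                \
    const struct sketch_tensor * src0 = octx->src[0];      \
    const struct sketch_tensor * src1 = octx->src[1];      \
    const uint32_t ne00 = src0->ne[0];                     \
    const uint32_t nb01 = src0->nb[1];                     \
    const uint32_t nb11 = src1 ? src1->nb[1] : 0;

/* Mirrors the row-size fallback in the diff: use src0's row size without src1. */
static size_t sketch_src1_row_size(const struct sketch_octx * octx) {
    sketch_act_preamble;
    assert(ne00 > 0);
    return src1 ? nb11 : nb01;
}
```

With a pointer per source, the "is src1 present" question becomes a NULL check rather than a test on `ne[0]` of an embedded struct.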

@ -12,7 +12,7 @@
#include "hex-dma.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
#ifndef MIN
@ -175,8 +175,8 @@ static void htp_argsort_f32(unsigned int n, unsigned int i, void * data) {
struct htp_ops_context * octx = actx->octx;
// Unpack context
const struct htp_tensor * src0 = &octx->src0;
const struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * dst = octx->dst;
// Scratchpad memory
uint8_t * spad = octx->src0_spad.data + octx->src0_spad.size_per_thread * i;
@ -249,16 +249,16 @@ static void htp_argsort_f32(unsigned int n, unsigned int i, void * data) {
int op_argsort(struct htp_ops_context * octx) {
// Check supported types
if (octx->src0.type != HTP_TYPE_F32) {
if (octx->src[0]->type != HTP_TYPE_F32) {
return HTP_STATUS_NO_SUPPORT;
}
const uint32_t total_rows = octx->src0.ne[1] * octx->src0.ne[2] * octx->src0.ne[3];
const uint32_t total_rows = octx->src[0]->ne[1] * octx->src[0]->ne[2] * octx->src[0]->ne[3];
const uint32_t n_threads = MIN(total_rows, octx->n_threads);
// Allocate scratchpad
// We need 1 row of float + 1 row of int32 per thread.
uint32_t ne00 = octx->src0.ne[0];
uint32_t ne00 = octx->src[0]->ne[0];
size_t values_size = hex_round_up(ne00 * sizeof(float), 128);
size_t indices_size = hex_round_up(ne00 * sizeof(int32_t), 128);
size_t spad_per_thread = values_size + indices_size;
@ -278,9 +278,9 @@ int op_argsort(struct htp_ops_context * octx) {
octx->src0_spad.size_per_thread = spad_per_thread;
FARF(HIGH, "argsort: %ux%ux%ux%u -> %ux%ux%ux%u (0x%x, 0x%x)",
octx->src0.ne[0], octx->src0.ne[1], octx->src0.ne[2], octx->src0.ne[3],
octx->dst.ne[0], octx->dst.ne[1], octx->dst.ne[2], octx->dst.ne[3],
octx->src0.data, octx->dst.data);
octx->src[0]->ne[0], octx->src[0]->ne[1], octx->src[0]->ne[2], octx->src[0]->ne[3],
octx->dst->ne[0], octx->dst->ne[1], octx->dst->ne[2], octx->dst->ne[3],
octx->src[0]->data, octx->dst->data);
struct htp_argsort_context actx;
actx.octx = octx;
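
The argsort scratchpad above reserves one row of floats and one row of int32 indices per thread, each rounded up to 128 bytes. The arithmetic can be checked with the sketch below. `sketch_round_up` is a hypothetical stand-in for `hex_round_up`, whose definition is not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for hex_round_up: round n up to a multiple of m. */
static size_t sketch_round_up(size_t n, size_t m) {
    assert(m > 0);
    return ((n + m - 1) / m) * m;
}

/* Per-thread scratchpad bytes for a row of ne00 elements:
 * one row of float values plus one row of int32 indices, each 128-byte aligned. */
static size_t sketch_argsort_spad_per_thread(uint32_t ne00) {
    const size_t values_size  = sketch_round_up(ne00 * sizeof(float), 128);
    const size_t indices_size = sketch_round_up(ne00 * sizeof(int32_t), 128);
    return values_size + indices_size;
}
```

For a 100-element row this gives 512 + 512 = 1024 bytes per thread, and the total reservation is that figure times the thread count.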

@ -14,7 +14,7 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
#ifndef MIN
@ -43,10 +43,10 @@ struct htp_binary_context {
bool split_at_ne02;
};
#define htp_binary_preamble \
const struct htp_tensor * src0 = &octx->src0; \
const struct htp_tensor * src1 = &octx->src1; \
struct htp_tensor * dst = &octx->dst; \
#define htp_binary_preamble \
const struct htp_tensor * src0 = octx->src[0]; \
const struct htp_tensor * src1 = octx->src[1]; \
const struct htp_tensor * dst = octx->dst; \
\
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
@ -181,7 +181,7 @@ static void binary_job_scalar(unsigned int nth, unsigned int ith, void * data) {
struct htp_ops_context * octx = bctx->octx;
htp_binary_preamble;
const uint32_t src0_type = octx->src0.type;
const uint32_t src0_type = octx->src[0]->type;
const uint32_t row_size_bytes = (src0_type == HTP_TYPE_F32) ? ne00 * sizeof(float) : ne00 * sizeof(_Float16);
const uint32_t total_rows = ne01 * ne02 * ne03;
const uint32_t start_row = bctx->nrows_per_thread * ith;
@ -274,7 +274,7 @@ static void binary_job_vector_same_shape(unsigned int nth, unsigned int ith, voi
struct htp_ops_context * octx = bctx->octx;
htp_binary_preamble;
const uint32_t src0_type = octx->src0.type;
const uint32_t src0_type = octx->src[0]->type;
const uint32_t row_size_bytes = (src0_type == HTP_TYPE_F32) ? ne00 * sizeof(float) : ne00 * sizeof(_Float16);
const uint32_t total_rows = ne01 * ne02 * ne03;
const uint32_t start_row = bctx->nrows_per_thread * ith;
@ -374,7 +374,7 @@ static void binary_job_vector_row_broadcast(unsigned int nth, unsigned int ith,
struct htp_ops_context * octx = bctx->octx;
htp_binary_preamble;
const uint32_t src0_type = octx->src0.type;
const uint32_t src0_type = octx->src[0]->type;
const uint32_t row_size_bytes = (src0_type == HTP_TYPE_F32) ? ne00 * sizeof(float) : ne00 * sizeof(_Float16);
const uint32_t total_rows = ne01 * ne02 * ne03;
const uint32_t start_row = bctx->nrows_per_thread * ith;
@ -455,7 +455,7 @@ static void binary_job_vector_complex(unsigned int nth, unsigned int ith, void *
struct htp_ops_context * octx = bctx->octx;
htp_binary_preamble;
const uint32_t src0_type = octx->src0.type;
const uint32_t src0_type = octx->src[0]->type;
const uint32_t row_size_bytes = (src0_type == HTP_TYPE_F32) ? ne00 * sizeof(float) : ne00 * sizeof(_Float16);
const uint32_t total_rows = ne01 * ne02 * ne03;
const uint32_t start_row = bctx->nrows_per_thread * ith;
@ -540,7 +540,7 @@ static void binary_job_element_repeat(unsigned int nth, unsigned int ith, void *
struct htp_ops_context * octx = bctx->octx;
htp_binary_preamble;
const uint32_t src0_type = octx->src0.type;
const uint32_t src0_type = octx->src[0]->type;
const uint32_t elem_size_bytes = (src0_type == HTP_TYPE_F32) ? sizeof(float) : sizeof(_Float16);
const uint32_t row_size_bytes = ne00 * elem_size_bytes;;
const uint32_t total_rows = ne01 * ne02 * ne03;
@ -629,10 +629,10 @@ static void binary_job_add_id(unsigned int nth, unsigned int ith, void * data) {
struct htp_binary_context * bctx = (struct htp_binary_context *) data;
struct htp_ops_context * octx = bctx->octx;
const struct htp_tensor * src0 = &octx->src0;
const struct htp_tensor * src1 = &octx->src1;
const struct htp_tensor * src2 = &octx->src2;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * src2 = octx->src[2];
const struct htp_tensor * dst = octx->dst;
const uint32_t ne00 = src0->ne[0];
const uint32_t ne01 = src0->ne[1];
@ -723,15 +723,15 @@ static void binary_job_add_id(unsigned int nth, unsigned int ith, void * data) {
}
static int execute_op_binary(struct htp_ops_context * octx) {
const struct htp_tensor * src0 = &octx->src0;
const struct htp_tensor * src1 = &octx->src1;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * dst = octx->dst;
const uint32_t src0_nrows = src0->ne[1] * src0->ne[2] * src0->ne[3];
const uint32_t n_threads = MIN(octx->n_threads, src0_nrows);
// Use packed row sizes for VTCM allocation
const uint32_t src0_type = octx->src0.type;
const uint32_t src0_type = octx->src[0]->type;
const size_t elem_size = (src0_type == HTP_TYPE_F32) ? sizeof(float) : sizeof(_Float16);
const size_t src0_row_size = src0->ne[0] * elem_size;
const size_t src1_row_size = src1->ne[0] * elem_size;
@ -799,9 +799,9 @@ static int execute_op_binary(struct htp_ops_context * octx) {
return HTP_STATUS_VTCM_TOO_SMALL;
}
octx->src0_spad.data = octx->ctx->vtcm_base;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->dst_spad.data = octx->src1_spad.data + octx->src1_spad.size;
octx->src0_spad.data = octx->ctx->vtcm_base; octx->src0_spad.src = NULL;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size; octx->src1_spad.src = NULL;
octx->dst_spad.data = octx->src1_spad.data + octx->src1_spad.size; octx->dst_spad.src = NULL;
if ((octx->flags & HTP_OPFLAGS_SKIP_COMPUTE)) {
return HTP_STATUS_OK;
@ -857,12 +857,12 @@ static int execute_op_binary(struct htp_ops_context * octx) {
int op_binary(struct htp_ops_context * octx) {
// Does not support permutations of src1
const struct htp_tensor * src1 = &octx->src1;
const struct htp_tensor * src1 = octx->src[1];
if (src1->nb[1] < src1->nb[0]) {
return HTP_STATUS_NO_SUPPORT;
}
const uint32_t src0_type = octx->src0.type;
const uint32_t src0_type = octx->src[0]->type;
if ((src0_type == HTP_TYPE_F32) || (src0_type == HTP_TYPE_F16)) {
return execute_op_binary(octx);
}
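
Several ops in this diff carve their scratchpads out of VTCM back to back and reset each scratchpad's `src` field to NULL. A sketch of that layout step is below. The struct is a simplified stand-in, and the reading of `src` as "which tensor's data this scratchpad currently holds" is an assumption drawn from the commit message's note on tracking vtcm/spad reuse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified scratchpad descriptor (assumption, not the real struct). */
struct sketch_spad {
    uint8_t *    data;  /* start of this scratchpad inside VTCM */
    size_t       size;  /* bytes reserved */
    const void * src;   /* assumed meaning: tensor whose data is cached here */
};

/* Places n scratchpads consecutively from vtcm_base and clears their src tags.
 * Returns 0 on success, or -1 if the combined size exceeds vtcm_size. */
static int sketch_layout_spads(uint8_t * vtcm_base, size_t vtcm_size,
                               struct sketch_spad * spads, size_t n) {
    assert(spads != NULL || n == 0);

    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += spads[i].size;
    }
    if (total > vtcm_size) {
        return -1;
    }

    uint8_t * p = vtcm_base;
    for (size_t i = 0; i < n; i++) {
        spads[i].data = p;
        spads[i].src  = NULL;  /* nothing cached yet, so no reuse is possible */
        p += spads[i].size;
    }
    return 0;
}
```

Clearing the tag on every layout means a later op cannot mistake stale scratchpad contents for a reusable copy of its input.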

@ -11,7 +11,7 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
#include "hvx-utils.h"
@ -32,10 +32,10 @@ struct htp_copy_context {
void (*copy)(struct htp_copy_context * ct, struct htp_ops_context * octx, int nth, int ith);
};
#define cpy_preamble \
struct htp_tensor *src0 = &octx->src0; \
struct htp_tensor *dst = &octx->dst; \
\
#define cpy_preamble \
const struct htp_tensor *src0 = octx->src[0]; \
const struct htp_tensor *dst = octx->dst; \
\
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
const uint32_t ne02 = src0->ne[2]; \

@ -13,9 +13,9 @@
#include "hvx-utils.h"
#include "hex-dma.h"
#define htp_cumsum_tensors_preamble \
struct htp_tensor * restrict src0 = &octx->src0; \
struct htp_tensor * restrict dst = &octx->dst; \
#define htp_cumsum_tensors_preamble \
const struct htp_tensor * restrict src0 = octx->src[0]; \
const struct htp_tensor * restrict dst = octx->dst; \
\
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
@ -206,8 +206,8 @@ static void cumsum_thread_f32(unsigned int nth, unsigned int ith, void * data) {
}
int op_cumsum_f32(struct htp_ops_context * octx) {
const struct htp_tensor * src0 = &octx->src0;
const struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * dst = octx->dst;
if (octx->flags & HTP_OPFLAGS_SKIP_COMPUTE) {
return HTP_STATUS_OK;
@ -226,10 +226,12 @@ int op_cumsum_f32(struct htp_ops_context * octx) {
octx->src0_spad.size_per_thread = src_row_size_aligned * 2;
octx->dst_spad.size_per_thread = dst_row_size_aligned * 2;
octx->src0_spad.size = n_threads * octx->src0_spad.size_per_thread;
octx->dst_spad.size = n_threads * octx->dst_spad.size_per_thread;
octx->src0_spad.data = octx->ctx->vtcm_base;
octx->dst_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->src0_spad.size = n_threads * octx->src0_spad.size_per_thread;
octx->dst_spad.size = n_threads * octx->dst_spad.size_per_thread;
octx->src0_spad.data = octx->ctx->vtcm_base; octx->src0_spad.src = NULL;
octx->dst_spad.data = octx->src0_spad.data + octx->src0_spad.size; octx->dst_spad.src = NULL;
struct htp_cumsum_context cctx = {
.octx = octx,
@ -251,8 +253,9 @@ int op_cumsum_f32(struct htp_ops_context * octx) {
}
int op_cumsum(struct htp_ops_context * octx) {
int err = HTP_STATUS_OK;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * dst = octx->dst;
int err = HTP_STATUS_OK;
switch (dst->type) {
case HTP_TYPE_F32:

@ -15,7 +15,7 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
// Must be multiple of 32
@ -278,12 +278,12 @@ static inline void hvx_scale_vec_f32_aa(uint8_t * restrict dst, const uint8_t *
static void flash_attn_ext_f16_thread(unsigned int nth, unsigned int ith, void * data) {
struct htp_fa_context * factx = (struct htp_fa_context *) data;
const struct htp_ops_context * octx = factx->octx;
const struct htp_tensor * q = &octx->src0;
const struct htp_tensor * k = &octx->src1;
const struct htp_tensor * v = &octx->src2;
const struct htp_tensor * mask = (octx->src3.data) ? &octx->src3 : NULL;
const struct htp_tensor * sinks = (octx->src4.data) ? &octx->src4 : NULL;
const struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * q = octx->src[0];
const struct htp_tensor * k = octx->src[1];
const struct htp_tensor * v = octx->src[2];
const struct htp_tensor * mask = octx->src[3];
const struct htp_tensor * sinks = octx->src[4];
const struct htp_tensor * dst = octx->dst;
const uint32_t neq0 = q->ne[0];
const uint32_t neq1 = q->ne[1];
@ -610,11 +610,11 @@ static void flash_attn_ext_f16_thread(unsigned int nth, unsigned int ith, void *
}
int op_flash_attn_ext(struct htp_ops_context * octx) {
const struct htp_tensor * q = &octx->src0;
const struct htp_tensor * k = &octx->src1;
const struct htp_tensor * v = &octx->src2;
const struct htp_tensor * mask = (octx->src3.data) ? &octx->src3 : NULL;
const struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * q = octx->src[0];
const struct htp_tensor * k = octx->src[1];
const struct htp_tensor * v = octx->src[2];
const struct htp_tensor * mask = octx->src[3];
const struct htp_tensor * dst = octx->dst;
// Check support
if ((q->type != HTP_TYPE_F16 && q->type != HTP_TYPE_F32) || k->type != HTP_TYPE_F16 || v->type != HTP_TYPE_F16) {
@ -701,13 +701,11 @@ int op_flash_attn_ext(struct htp_ops_context * octx) {
return HTP_STATUS_VTCM_TOO_SMALL;
}
octx->src0_spad.data = octx->ctx->vtcm_base;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->src2_spad.data = octx->src1_spad.data + octx->src1_spad.size;
octx->src3_spad.data = octx->src2_spad.data + octx->src2_spad.size;
octx->dst_spad.data = octx->src3_spad.data + octx->src3_spad.size;
// FARF(ERROR, "fa: qrows-per-thread %u", factx.qrows_per_thread);
octx->src0_spad.data = octx->ctx->vtcm_base; octx->src0_spad.src = NULL;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size; octx->src1_spad.src = NULL;
octx->src2_spad.data = octx->src1_spad.data + octx->src1_spad.size; octx->src2_spad.src = NULL;
octx->src3_spad.data = octx->src2_spad.data + octx->src2_spad.size; octx->src3_spad.src = NULL;
octx->dst_spad.data = octx->src3_spad.data + octx->src3_spad.size; octx->dst_spad.src = NULL;
if (!(octx->flags & HTP_OPFLAGS_SKIP_COMPUTE)) {
worker_pool_run_func(octx->ctx->worker_pool, flash_attn_ext_f16_thread, &factx, octx->n_threads);

@ -11,7 +11,7 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
#include "hvx-utils.h"
@ -23,27 +23,33 @@ struct get_rows_context {
};
#define get_rows_preamble \
const uint32_t ne00 = octx->src0.ne[0]; \
const uint32_t ne01 = octx->src0.ne[1]; \
const uint32_t ne02 = octx->src0.ne[2]; \
const uint32_t ne03 = octx->src0.ne[3]; \
\
const uint32_t ne10 = octx->src1.ne[0]; \
const uint32_t ne11 = octx->src1.ne[1]; \
const uint32_t ne12 = octx->src1.ne[2]; \
\
const uint32_t nb01 = octx->src0.nb[1]; \
const uint32_t nb02 = octx->src0.nb[2]; \
const uint32_t nb03 = octx->src0.nb[3]; \
\
const uint32_t nb10 = octx->src1.nb[0]; \
const uint32_t nb11 = octx->src1.nb[1]; \
const uint32_t nb12 = octx->src1.nb[2]; \
\
const uint32_t nb1 = octx->dst.nb[1]; \
const uint32_t nb2 = octx->dst.nb[2]; \
const uint32_t nb3 = octx->dst.nb[3]; \
\
const uint32_t ne00 = octx->src[0]->ne[0]; \
const uint32_t ne01 = octx->src[0]->ne[1]; \
const uint32_t ne02 = octx->src[0]->ne[2]; \
const uint32_t ne03 = octx->src[0]->ne[3]; \
\
const uint32_t ne10 = octx->src[1]->ne[0]; \
const uint32_t ne11 = octx->src[1]->ne[1]; \
const uint32_t ne12 = octx->src[1]->ne[2]; \
const uint32_t ne13 = octx->src[1]->ne[3]; \
\
const uint32_t ne0 = octx->dst->ne[0]; \
const uint32_t ne1 = octx->dst->ne[1]; \
const uint32_t ne2 = octx->dst->ne[2]; \
const uint32_t ne3 = octx->dst->ne[3]; \
\
const uint32_t nb01 = octx->src[0]->nb[1]; \
const uint32_t nb02 = octx->src[0]->nb[2]; \
const uint32_t nb03 = octx->src[0]->nb[3]; \
\
const uint32_t nb10 = octx->src[1]->nb[0]; \
const uint32_t nb11 = octx->src[1]->nb[1]; \
const uint32_t nb12 = octx->src[1]->nb[2]; \
\
const uint32_t nb1 = octx->dst->nb[1]; \
const uint32_t nb2 = octx->dst->nb[2]; \
const uint32_t nb3 = octx->dst->nb[3]; \
\
const uint32_t nr = ne10 * ne11 * ne12;
static void get_rows_thread_f32_f32(unsigned int nth, unsigned int ith, void *data) {
@ -51,12 +57,14 @@ static void get_rows_thread_f32_f32(unsigned int nth, unsigned int ith, void *da
struct htp_ops_context * octx = grctx->octx;
get_rows_preamble;
uint64_t qt = HAP_perf_get_qtimer_count();
// parallelize by src1 elements (which correspond to dst rows)
const uint32_t dr = grctx->src1_nrows_per_thread;
const uint32_t ir0 = dr * ith;
const uint32_t ir1 = (ir0 + dr < nr) ? (ir0 + dr) : nr;
const bool is_i32 = (octx->src1.type == HTP_TYPE_I32);
const bool is_i32 = (octx->src[1]->type == HTP_TYPE_I32);
for (uint32_t i = ir0; i < ir1; ++i) {
const uint32_t i12 = fastdiv(i, &grctx->get_rows_div_ne10_ne11);
@ -64,7 +72,7 @@ static void get_rows_thread_f32_f32(unsigned int nth, unsigned int ith, void *da
const uint32_t i11 = fastdiv(rem, &grctx->get_rows_div_ne10);
const uint32_t i10 = rem - i11 * ne10;
const uintptr_t src1_addr = octx->src1.data + i10*nb10 + i11*nb11 + i12*nb12;
const uintptr_t src1_addr = octx->src[1]->data + i10*nb10 + i11*nb11 + i12*nb12;
uint32_t i01 = is_i32 ? *(int32_t *)src1_addr : *(int64_t *)src1_addr;
@ -73,10 +81,14 @@ static void get_rows_thread_f32_f32(unsigned int nth, unsigned int ith, void *da
continue;
}
const uintptr_t src0_ptr = octx->src0.data + i01*nb01 + i11*nb02 + i12*nb03;
const uintptr_t dst_ptr = octx->dst.data + i10*nb1 + i11*nb2 + i12*nb3;
const uintptr_t src0_ptr = octx->src[0]->data + i01*nb01 + i11*nb02 + i12*nb03;
const uintptr_t dst_ptr = octx->dst->data + i10*nb1 + i11*nb2 + i12*nb3;
hvx_copy_f32_uu((uint8_t *)dst_ptr, (const uint8_t *)src0_ptr, ne00);
}
qt = HAP_perf_qtimer_count_to_us(HAP_perf_get_qtimer_count() - qt);
FARF(HIGH, "get-rows-f32-f32 %d/%d: %ux%ux%ux%u (%u:%u) x %ux%ux%ux%u -> %ux%ux%ux%u usec %u\n", ith, nth,
ne00, ne01, ne02, ne03, ir0, ir1, ne10, ne11, ne12, ne13, ne0, ne1, ne2, ne3, (unsigned) qt);
}
int op_get_rows(struct htp_ops_context * octx) {
@ -84,15 +96,15 @@ int op_get_rows(struct htp_ops_context * octx) {
const uint32_t n_threads = MIN(nr, octx->n_threads);
if (octx->src0.type != HTP_TYPE_F32) {
if (octx->src[0]->type != HTP_TYPE_F32) {
return HTP_STATUS_NO_SUPPORT;
}
if (octx->dst.type != HTP_TYPE_F32) {
if (octx->dst->type != HTP_TYPE_F32) {
return HTP_STATUS_NO_SUPPORT;
}
if (octx->src1.type != HTP_TYPE_I32 && octx->src1.type != HTP_TYPE_I64) {
if (octx->src[1]->type != HTP_TYPE_I32 && octx->src[1]->type != HTP_TYPE_I64) {
return HTP_STATUS_NO_SUPPORT;
}
@ -102,8 +114,8 @@ int op_get_rows(struct htp_ops_context * octx) {
struct get_rows_context grctx;
grctx.octx = octx;
grctx.get_rows_div_ne10 = init_fastdiv_values(octx->src1.ne[0]);
grctx.get_rows_div_ne10_ne11 = init_fastdiv_values(octx->src1.ne[0] * octx->src1.ne[1]);
grctx.get_rows_div_ne10 = init_fastdiv_values(octx->src[1]->ne[0]);
grctx.get_rows_div_ne10_ne11 = init_fastdiv_values(octx->src[1]->ne[0] * octx->src[1]->ne[1]);
grctx.src1_nrows_per_thread = (nr + n_threads - 1) / n_threads;
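
The get-rows worker splits the `nr` index elements across threads and then decomposes each flat index into `(i10, i11, i12)`. The sketch below reproduces that arithmetic with plain division in place of `fastdiv`. The `rem` computation falls between the hunks shown, so its form here is an assumption, and all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_range {
    uint32_t ir0;  /* first element for this thread (inclusive) */
    uint32_t ir1;  /* last element for this thread (exclusive) */
};

/* Ceil-divide nr elements across n_threads and clamp the tail to nr. */
static struct sketch_range sketch_thread_range(uint32_t nr, uint32_t n_threads, uint32_t ith) {
    assert(n_threads > 0);
    const uint32_t dr  = (nr + n_threads - 1) / n_threads;
    const uint32_t ir0 = dr * ith;
    const uint32_t ir1 = (ir0 + dr < nr) ? (ir0 + dr) : nr;
    struct sketch_range r = { ir0, ir1 };
    return r;
}

struct sketch_idx {
    uint32_t i10, i11, i12;
};

/* Split flat index i into coordinates of an ne10 x ne11 x ne12 index tensor. */
static struct sketch_idx sketch_decompose(uint32_t i, uint32_t ne10, uint32_t ne11) {
    assert(ne10 > 0 && ne11 > 0);
    const uint32_t i12 = i / (ne10 * ne11);
    const uint32_t rem = i - i12 * ne10 * ne11;  /* assumed, not shown in the diff */
    const uint32_t i11 = rem / ne10;
    const uint32_t i10 = rem - i11 * ne10;
    struct sketch_idx r = { i10, i11, i12 };
    return r;
}
```

With 10 elements and 4 threads each thread gets at most 3 elements, and the last thread handles only element 9.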

@ -3,8 +3,10 @@
#include <stdbool.h>
#include <stdint.h>
#include <qurt_memory.h>
#include "hexagon_types.h"
#include "hexagon_protos.h"
#include "hex-fastdiv.h"
#include "hex-dump.h"
@ -68,4 +70,23 @@ static inline void hex_l2fetch(const void * p, uint32_t width, uint32_t stride,
Q6_l2fetch_AP((void *) p, control);
}
#define HEX_L2_LINE_SIZE 64
#define HEX_L2_FLUSH_SIZE (128 * 1024)
static inline void hex_l2flush(void * addr, size_t size)
{
if (size > HEX_L2_FLUSH_SIZE) {
qurt_mem_cache_clean((qurt_addr_t) 0, 0, QURT_MEM_CACHE_FLUSH_INVALIDATE_ALL, QURT_MEM_DCACHE);
} else {
const uint32_t s = (uint32_t) addr;
const uint32_t e = s + size;
for (uint32_t i = s; i < e; i += HEX_L2_LINE_SIZE * 4) {
Q6_dccleaninva_A((void *) i + HEX_L2_LINE_SIZE * 0);
Q6_dccleaninva_A((void *) i + HEX_L2_LINE_SIZE * 1);
Q6_dccleaninva_A((void *) i + HEX_L2_LINE_SIZE * 2);
Q6_dccleaninva_A((void *) i + HEX_L2_LINE_SIZE * 3);
}
}
}
#endif /* HEX_UTILS_H */
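A host-side sketch of the address stepping used by `hex_l2flush` above, assuming a stand-in `sim_touch_line()` in place of `Q6_dccleaninva_A` (all `sim_*` / `SIM_*` names are hypothetical and not part of the patch). It only records which 64-byte lines the 4x-unrolled loop would visit, which makes it easy to see that the loop works in groups of four lines and can reach past the end of a short buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_LINE_SIZE 64u /* mirrors HEX_L2_LINE_SIZE */

/* Hypothetical record of the cache lines a flush would touch. */
struct sim_touched {
    uintptr_t first; /* lowest line address touched  */
    uintptr_t last;  /* highest line address touched */
    unsigned  count; /* number of line operations    */
};

/* Stand-in for Q6_dccleaninva_A: notes the line containing addr. */
static void sim_touch_line(struct sim_touched * t, uintptr_t addr) {
    const uintptr_t line = addr & ~(uintptr_t) (SIM_LINE_SIZE - 1);
    if (t->count == 0 || line < t->first) t->first = line;
    if (t->count == 0 || line > t->last)  t->last  = line;
    t->count++;
}

/* Same stepping as the else-branch of hex_l2flush: four lines per iteration. */
static struct sim_touched sim_l2flush(uintptr_t addr, size_t size) {
    struct sim_touched t = { 0, 0, 0 };
    assert(size > 0);
    const uintptr_t s = addr;
    const uintptr_t e = s + size;
    for (uintptr_t i = s; i < e; i += SIM_LINE_SIZE * 4) {
        for (unsigned k = 0; k < 4; k++) {
            sim_touch_line(&t, i + SIM_LINE_SIZE * k);
        }
    }
    return t;
}
```

For a 64-byte buffer at 0x1000, `sim_l2flush(0x1000, 64)` reports four line operations with the last one at 0x10c0, so three lines beyond the buffer are visited. This is the behaviour the reverted "do not flush or invalidate cache lines beyond buffer boundry" commit was aimed at.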

@ -20,7 +20,7 @@
#include "hvx-dump.h"
#include "worker-pool.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "hmx-utils.h"
#include "hmx-ops.h"
@ -821,7 +821,7 @@ int hmx_mat_mul_permuted_w16a32_batched(struct htp_context *ctx, const hmx_matmu
// and each q_head is computed individually to avoid tile-major packing
// issues. m_chunk_n_rows is always a multiple of 32 (from
// hmx_compute_chunks), so per-head tile arrays don't overlap.
const size_t vtcm_budget = ctx->vtcm_scratch_size;
const size_t vtcm_budget = ctx->vtcm_size;
const size_t vec_dot_size = params->k * sizeof(__fp16);
// When the activation has a large stride (e.g. permuted Q tensor with
@ -998,7 +998,7 @@ int hmx_mat_mul_permuted_w16a32(struct htp_context *ctx, float *restrict dst, co
}
// --- Dynamic VTCM layout ---
const size_t vtcm_budget = ctx->vtcm_scratch_size;
const size_t vtcm_budget = ctx->vtcm_size;
const size_t vec_dot_size = k * sizeof(__fp16);
// DMA-based activation gather for strided tensors (see batched path comment).
@ -1182,7 +1182,7 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
FARF(MEDIUM, "hmx_matmul_qk: STANDARD path m=%d k=%d n=%d type=%d", m, k, n, weight_type);
// --- Dynamic VTCM layout ---
const size_t vtcm_budget = ctx->vtcm_scratch_size;
const size_t vtcm_budget = ctx->vtcm_size;
const size_t vec_dot_size = k * sizeof(__fp16);
const bool use_pipeline = (m >= 128) && (k <= n);
@ -1273,9 +1273,6 @@ int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx, float *restrict ds
void *buf_curr = vtcm_scratch0;
void *buf_next = vtcm_scratch1;
// issue async DDR data transfer for the first weight chunk
// NOTE: use 2D DMA (n_cols rows x row_stride bytes) instead of 1D
// because UDMA roiwidth is 16-bit and total size can exceed 65535.
{
const size_t n_cols_first = hex_smin(n, n_chunk_n_cols);
dma_queue_push(ctx->dma[0], dma_make_ptr(buf_curr, permuted_weight), row_stride, row_stride, row_stride, n_cols_first);
@ -1533,20 +1530,15 @@ void transfer_activation_chunk_threaded(struct htp_context *ctx, __fp16 *dst, co
worker_pool_run_func(ctx->worker_pool, transfer_activation_chunk_worker_fn, &state, ctx->n_threads);
}
int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict out, const float *restrict x, const uint8_t *restrict w, int m,
int k, int n, int weight_type) {
// Runtime check -- k >= 16384 exceeds 2D DMA limit
if (k >= 16384) {
FARF(HIGH, "%s: k=%d exceeds 2D DMA limit", __func__, k);
return -1;
}
int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict out, const float *restrict x, const uint8_t *restrict w,
int m, int k, int n, int weight_type) {
// assume k % 32 == 0 && n % 32 == 0
const size_t row_stride = get_x4x2_row_stride(weight_type, k);
if (row_stride == 0) {
return -1;
}
const size_t vtcm_budget = ctx->vtcm_scratch_size;
const size_t vtcm_budget = ctx->vtcm_size;
const size_t M_BLOCK_SIZE = 512;
const size_t N_BLOCK_SIZE = 512;
@ -1576,8 +1568,7 @@ int mat_mul_qk_0_d16a32_out_stationary(struct htp_context *ctx, float *restrict
__fp16 *vtcm_scales = (__fp16 *) vtcm_seq_alloc(&vtcm_ptr, 256);
assert((size_t)(vtcm_ptr - (uint8_t *)ctx->vtcm_base) <= vtcm_budget);
FARF(MEDIUM, "%s: m=%d k=%d n=%d wtype=%d vtcm=%zu/%zu",
__func__, m, k, n, weight_type,
FARF(MEDIUM, "%s: m=%d k=%d n=%d wtype=%d vtcm=%zu/%zu", __func__, m, k, n, weight_type,
(size_t)(vtcm_ptr - (uint8_t *)ctx->vtcm_base), vtcm_budget);
// initialize eye tile (32x32 identity matrix)

@ -7,16 +7,12 @@
#include <stddef.h>
#include <stdint.h>
#ifndef restrict
# define restrict __restrict
#endif
#include "htp-ops.h"
#ifdef __cplusplus
extern "C" {
#endif
struct htp_context; // forward declaration
typedef struct {
float *dst;
const float *activation;

@ -2,6 +2,7 @@
#define HTP_CTX_H
#include "hex-dma.h"
#include "htp-ops.h"
#include "worker-pool.h"
#include <assert.h>
@ -10,38 +11,85 @@
#include <stdint.h>
#define HTP_MAX_NTHREADS 10
#define HTP_MAX_MMAPS 16
// Memory mapping
struct htp_mmap {
uint64_t size;
uint64_t base;
uint32_t fd;
uint32_t pinned;
};
// Scratchpad state
struct htp_spad {
const struct htp_tensor * src; // original src of the data (for reuse)
uint8_t * data; // pointer to an area in vtcm
uint32_t stride; // stride used inside this spad
uint32_t size; // total size
uint32_t size_per_thread; // size per thread
};
// Context while processing an Op
// TODO: fold this into the main context
struct htp_ops_context {
struct htp_context * ctx;
enum htp_op_code op; // FIXME: rename to opcode
int32_t op_params[HTP_OP_MAX_PARAMS];
const struct htp_tensor * src[HTP_OP_MAX_INPUTS];
const struct htp_tensor * dst;
// TODO convert these to an array
struct htp_spad src0_spad;
struct htp_spad src1_spad;
struct htp_spad src2_spad;
struct htp_spad src3_spad;
struct htp_spad dst_spad;
uint32_t n_threads;
uint32_t flags;
};
// Main context for htp DSP backend
struct htp_context {
dspqueue_t queue;
dma_queue * dma[HTP_MAX_NTHREADS];
worker_pool_context_t worker_pool;
uint32_t n_threads;
dspqueue_t queue;
dma_queue * dma[HTP_MAX_NTHREADS];
struct htp_mmap mmap[HTP_MAX_MMAPS];
worker_pool_context_t worker_pool;
uint32_t n_threads;
int thread_id;
int thread_prio;
int thread_id;
int thread_prio;
uint8_t * vtcm_base;
size_t vtcm_size;
uint32_t vtcm_rctx;
int hmx_enabled;
atomic_bool vtcm_valid;
atomic_bool vtcm_inuse;
atomic_bool vtcm_needs_release;
uint8_t * vtcm_base;
size_t vtcm_size;
uint32_t vtcm_rctx;
atomic_bool vtcm_valid;
atomic_bool vtcm_needs_release;
uint32_t opmask;
// Cached src1 spad position from the last quantize pass.
// When SKIP_QUANTIZE is set the Q8 activation data is already in VTCM
// at this address; the matmul must read from here instead of recomputing
// the offset (which depends on the current op's src0 size).
uint8_t * prev_src1_spad;
// HMX acceleration fields (v73+, enabled by compile-time HTP_HAS_HMX)
#ifdef HTP_HAS_HMX
int hmx_enabled; // Runtime flag: HMX initialisation succeeded
size_t vtcm_scratch_size; // Usable dynamic scratch (vtcm_size minus tail reservation)
#endif
struct htp_ops_context octx;
};
int op_matmul(struct htp_ops_context * octx);
int op_matmul_id(struct htp_ops_context * octx);
int op_binary(struct htp_ops_context * octx);
int op_unary(struct htp_ops_context * octx);
int op_sum_rows(struct htp_ops_context * octx);
int op_activations(struct htp_ops_context * octx);
int op_softmax(struct htp_ops_context * octx);
int op_add_id(struct htp_ops_context * octx);
int op_rope(struct htp_ops_context * octx);
int op_flash_attn_ext(struct htp_ops_context * octx);
int op_set_rows(struct htp_ops_context * octx);
int op_get_rows(struct htp_ops_context * octx);
int op_cpy(struct htp_ops_context * octx);
int op_repeat(struct htp_ops_context * octx);
int op_argsort(struct htp_ops_context * octx);
int op_ssm_conv(struct htp_ops_context * octx);
int op_cumsum(struct htp_ops_context * octx);
#endif /* HTP_CTX_H */

@ -1,166 +0,0 @@
#ifndef HTP_MSG_H
#define HTP_MSG_H
#include <assert.h>
// ggml-common.h must be included prio to this header
// Mask to enable various stages of the Ops.
// Used for debugging and profiling.
enum {
HTP_OPMASK_QUEUE = (1 << 0), // Enable Queueing (ie calls into the DSP)
HTP_OPMASK_QUANTIZE = (1 << 1), // Enable Quantize
HTP_OPMASK_COMPUTE = (1 << 2), // Enable Compute
};
// Op flags
enum {
HTP_OPFLAGS_SKIP_QUANTIZE = (1 << 0), // Skip dynamic quantization (reuse quantized tensors)
HTP_OPFLAGS_SKIP_COMPUTE = (1 << 1), // Skip actual computation (used for profiling)
HTP_OPFLAGS_EARLY_WAKEUP = (1 << 2) // Send early wakeup notification
};
enum htp_status {
HTP_STATUS_OK = 1,
HTP_STATUS_INTERNAL_ERR = 2,
HTP_STATUS_NO_SUPPORT = 3,
HTP_STATUS_INVAL_PARAMS = 4,
HTP_STATUS_VTCM_TOO_SMALL = 5,
};
// The values must match the ggml_type.
// Duplicated here because we can't include full ggml.h in the htp build.
// We have some static_asserts in the cpp code to ensure things are in sync.
enum htp_data_type {
HTP_TYPE_F32 = 0,
HTP_TYPE_F16 = 1,
HTP_TYPE_Q4_0 = 2,
HTP_TYPE_Q8_0 = 8,
HTP_TYPE_IQ4_NL = 20,
HTP_TYPE_I32 = 26,
HTP_TYPE_I64 = 27,
HTP_TYPE_MXFP4 = 39,
HTP_TYPE_COUNT
};
// Do not reorder first 4 (used as an index)
enum htp_op {
HTP_OP_MUL = 0,
HTP_OP_ADD = 1,
HTP_OP_SUB = 2,
HTP_OP_DIV = 3,
HTP_OP_MUL_MAT,
HTP_OP_MUL_MAT_ID,
HTP_OP_RMS_NORM,
HTP_OP_UNARY_SILU,
HTP_OP_UNARY_GELU,
HTP_OP_UNARY_SIGMOID,
HTP_OP_UNARY_EXP,
HTP_OP_UNARY_NEG,
HTP_OP_UNARY_SOFTPLUS,
HTP_OP_GLU_SWIGLU,
HTP_OP_GLU_SWIGLU_OAI,
HTP_OP_GLU_GEGLU,
HTP_OP_SOFTMAX,
HTP_OP_ADD_ID,
HTP_OP_ROPE,
HTP_OP_FLASH_ATTN_EXT,
HTP_OP_SET_ROWS,
HTP_OP_GET_ROWS,
HTP_OP_SCALE,
HTP_OP_CPY,
HTP_OP_ARGSORT,
HTP_OP_SQR,
HTP_OP_SQRT,
HTP_OP_SUM_ROWS,
HTP_OP_SSM_CONV,
HTP_OP_REPEAT,
HTP_OP_CUMSUM,
INVALID
};
static inline size_t htp_t_block_size(uint32_t t) {
switch (t) {
case HTP_TYPE_F32:
return 1;
case HTP_TYPE_F16:
return 1;
case HTP_TYPE_Q4_0:
return QK4_0;
case HTP_TYPE_Q8_0:
return QK8_0;
case HTP_TYPE_IQ4_NL:
return QK4_NL;
case HTP_TYPE_MXFP4:
return QK_MXFP4;
default:
assert(0 && "unsupported HTP data type");
}
return 0;
}
static inline size_t htp_type_nbytes(uint32_t t) {
switch (t) {
case HTP_TYPE_F32:
return 4;
case HTP_TYPE_F16:
return 2;
case HTP_TYPE_Q4_0:
return sizeof(block_q4_0);
case HTP_TYPE_Q8_0:
return sizeof(block_q8_0);
case HTP_TYPE_IQ4_NL:
return sizeof(block_iq4_nl);
case HTP_TYPE_MXFP4:
return sizeof(block_mxfp4);
default:
assert(0 && "unsupported HTP data type");
}
return 0;
}
// Internal types
#define QK_Q4_0x4x2 256 // 4x Q4_0 blocks packed with next 4x Q4_0 blocks (size in bytes 128)
#define QK_Q8_0x4x2 256 // 4x Q8_0 blocks concat with next 4x Q8_0 blocks
#define QK_MXFP4x4x2 256 // 4x MXFP4 blocks concat with next 4x MXFP4 blocks
#define HTP_MAX_DIMS 4
struct htp_tensor {
uint32_t data; // Buffer offset in the messages, and data pointer on the NSP
uint32_t type; // Data type
uint32_t ne[HTP_MAX_DIMS]; // Number of elements
uint32_t nb[HTP_MAX_DIMS]; // Stride in bytes (see ggml.h ggml_tensor)
};
#define HTP_MAX_OP_PARAMS 64
struct htp_general_req {
uint32_t op; // GGML/HTP Op
int32_t op_params[HTP_MAX_OP_PARAMS / sizeof(int32_t)];
// Params for the op, e.g. epsilon of RMS norm
uint32_t flags; // Request flags
struct htp_tensor src0; // Input0 tensor
struct htp_tensor src1; // Input1 tensor
struct htp_tensor src2; // Input2 tensor
struct htp_tensor src3; // Input3 tensor
struct htp_tensor src4; // Input4 tensor
struct htp_tensor dst; // Output tensor
// should be multiple of 64 bytes (cacheline)
};
struct htp_general_rsp {
uint32_t op; // GGML/HTP Op
uint32_t status; // HTP_STATUS_...
uint32_t prof_usecs; // Number of usec per request
uint32_t prof_cycles; // Number of cycles per request
uint32_t prof_pkts; // Number of instruction packets per request
uint8_t unused[44]; // Pad to 64 bytes
};
#define HTP_MAX_MESSAGE_SIZE sizeof(struct htp_general_req)
#define HTP_MAX_PACKET_BUFFERS 8
#endif /* HTP_MSG_H */

@ -1,65 +1,154 @@
#ifndef HTP_OPS_H
#define HTP_OPS_H
#include "htp-ctx.h"
#include "htp-msg.h"
#include "worker-pool.h"
#include <assert.h>
#include <stdint.h>
#include <hex-fastdiv.h>
// ggml-common.h must be included prio to this header
// ggml-common.h must be included prior to this header
struct htp_spad {
uint8_t * data;
size_t stride;
size_t size;
size_t size_per_thread;
enum htp_status {
HTP_STATUS_OK = 1,
HTP_STATUS_INTERNAL_ERR = 2,
HTP_STATUS_NO_SUPPORT = 3,
HTP_STATUS_INVAL_PARAMS = 4,
HTP_STATUS_VTCM_TOO_SMALL = 5,
};
struct htp_ops_context {
struct htp_context * ctx;
// First set of values must match the ggml_type.
// Duplicated here because we can't include full ggml.h in the htp build.
// We have some static_asserts in the cpp code to ensure things are in sync.
enum htp_data_type {
HTP_TYPE_F32 = 0,
HTP_TYPE_F16 = 1,
HTP_TYPE_Q4_0 = 2,
HTP_TYPE_Q8_0 = 8,
HTP_TYPE_IQ4_NL = 20,
HTP_TYPE_I32 = 26,
HTP_TYPE_I64 = 27,
HTP_TYPE_MXFP4 = 39,
enum htp_op op;
int32_t op_params[HTP_MAX_OP_PARAMS / sizeof(int32_t)];
// types used internally for repack, dyn.quant, etc
HTP_TYPE_Q4_0x4x2 = 200,
HTP_TYPE_Q8_0x4x2,
HTP_TYPE_MXFP4x4x2,
struct htp_tensor src0;
struct htp_tensor src1;
struct htp_tensor src2;
struct htp_tensor src3;
struct htp_tensor src4;
struct htp_tensor dst;
struct htp_spad src0_spad;
struct htp_spad src1_spad;
struct htp_spad src2_spad;
struct htp_spad src3_spad;
struct htp_spad dst_spad;
worker_pool_context_t * wpool; // worker pool
uint32_t n_threads; // num threads
uint32_t flags;
HTP_TYPE_INVALID
};
int op_matmul(struct htp_ops_context * octx);
int op_matmul_id(struct htp_ops_context * octx);
int op_binary(struct htp_ops_context * octx);
int op_unary(struct htp_ops_context * octx);
int op_sum_rows(struct htp_ops_context * octx);
int op_activations(struct htp_ops_context * octx);
int op_softmax(struct htp_ops_context * octx);
int op_add_id(struct htp_ops_context * octx);
int op_rope(struct htp_ops_context * octx);
int op_flash_attn_ext(struct htp_ops_context * octx);
int op_set_rows(struct htp_ops_context * octx);
int op_get_rows(struct htp_ops_context * octx);
int op_cpy(struct htp_ops_context * octx);
int op_repeat(struct htp_ops_context * octx);
int op_argsort(struct htp_ops_context * octx);
int op_ssm_conv(struct htp_ops_context * octx);
int op_cumsum(struct htp_ops_context * octx);
// Constants for internal types
#define QK_Q4_0x4x2 256 // 4x Q4_0 blocks packed with next 4x Q4_0 blocks (size in bytes 128)
#define QK_Q8_0x4x2 256 // 4x Q8_0 blocks concat with next 4x Q8_0 blocks
#define QK_MXFP4x4x2 256 // 4x MXFP4 blocks concat with next 4x MXFP4 blocks
// Mask to enable various stages of the Ops.
// Used for debugging and profiling.
enum htp_op_mask {
HTP_OPMASK_QUEUE = (1 << 0), // Enable Queueing (ie calls into the DSP)
HTP_OPMASK_COMPUTE = (1 << 1), // Enable Compute
};
// Do not reorder first 4 (used as an index)
enum htp_op_code {
HTP_OP_MUL = 0,
HTP_OP_ADD = 1,
HTP_OP_SUB = 2,
HTP_OP_DIV = 3,
HTP_OP_MUL_MAT,
HTP_OP_MUL_MAT_ID,
HTP_OP_RMS_NORM,
HTP_OP_UNARY_SILU,
HTP_OP_UNARY_GELU,
HTP_OP_UNARY_SIGMOID,
HTP_OP_UNARY_EXP,
HTP_OP_UNARY_NEG,
HTP_OP_UNARY_SOFTPLUS,
HTP_OP_GLU_SWIGLU,
HTP_OP_GLU_SWIGLU_OAI,
HTP_OP_GLU_GEGLU,
HTP_OP_SOFTMAX,
HTP_OP_ADD_ID,
HTP_OP_ROPE,
HTP_OP_FLASH_ATTN_EXT,
HTP_OP_SET_ROWS,
HTP_OP_GET_ROWS,
HTP_OP_SCALE,
HTP_OP_CPY,
HTP_OP_ARGSORT,
HTP_OP_SQR,
HTP_OP_SQRT,
HTP_OP_SUM_ROWS,
HTP_OP_SSM_CONV,
HTP_OP_REPEAT,
HTP_OP_CUMSUM,
HTP_OP_INVALID
};
#define HTP_OP_MAX_DIMS 4 // aka GGML_MAX_DIMS
#define HTP_OP_MAX_INPUTS 6 // aka GGML_MAX_SRCS
#define HTP_OP_MAX_PARAMS 16 // aka GGML_MAX_OP_PARAMS
#define HTP_OP_MAX_BUFS 8
#define HTP_OP_MAX_REQS 256
#define HTP_OP_MAX_TENSORS (HTP_OP_MAX_REQS * HTP_OP_MAX_INPUTS + HTP_OP_MAX_REQS)
#define HTP_OP_MAX_VMEM (3221225472u)
enum htp_tensor_flags {
HTP_TENSOR_COMPUTE = (1U << 0), // Tensor buffer holds temporary compute data (not weights)
HTP_TENSOR_FLUSHED = (1U << 1) // Tensor buffer has been flushed (set by the NPU)
};
// Tensor descriptor
struct htp_tensor {
uint32_t data; // Buffer offset in the messages, and data pointer on the NPU
uint32_t size; // Data size in bytes
uint32_t flags; // Buffer / tensor flags
uint16_t type; // Data type
uint16_t bi; // Buffer index
uint32_t ne[HTP_OP_MAX_DIMS]; // Number of elements
uint32_t nb[HTP_OP_MAX_DIMS]; // Stride in bytes (see ggml.h ggml_tensor)
};
// Buffer descriptor
struct htp_buf_desc {
uint64_t base; // base address
uint64_t size; // total size
uint32_t flags; // buffer flags (unused)
uint32_t fd; // file descriptor
};
enum htp_op_flags {
HTP_OPFLAGS_SKIP_COMPUTE = (1U << 0), // Skip actual computation (used for profiling)
};
// Op descriptor
struct htp_op_desc {
uint32_t opcode; // GGML/HTP Op
uint32_t flags; // Op flags
int32_t params[HTP_OP_MAX_PARAMS]; // Params for the op, e.g. epsilon of RMS norm
uint16_t src[HTP_OP_MAX_INPUTS]; // Input tensors indices
uint16_t dst; // Output tensor index
// the rest is filled in-place by the NPU
uint32_t prof_usecs; // Number of usec per request
uint32_t prof_cycles; // Number of cycles per request
uint32_t prof_pkts; // Number of instruction packets per request
uint32_t unused;
};
struct htp_opbatch_req {
uint32_t n_bufs; // Number of buffers
uint32_t n_tensors; // Number of tensors
uint32_t n_ops; // Number of ops
uint32_t flags; // unused
// struct htp_buf_desc bufs[]; -- dspqueue buf 0
// struct htp_tensor tensors[]; -- dspqueue buf 0
// struct htp_op_desc ops[]; -- dspqueue buf 0
};
struct htp_opbatch_rsp {
uint32_t status; // HTP_STATUS_...
// struct htp_op_req ops[]; -- dspqueue buf 0
};
#endif /* HTP_OPS_H */
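The comments in `htp_opbatch_req` say that the buffer, tensor, and op descriptor arrays all travel in dspqueue buffer 0. A minimal sketch of how such a buffer could be carved up, assuming the arrays are packed back to back in the order listed with no padding in between (the packing code itself is in the suppressed part of the diff, so this is an assumption; `opbatch_layout` and `opbatch_layout_init` are hypothetical names). The descriptor structs are copied from the header above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HTP_OP_MAX_DIMS   4
#define HTP_OP_MAX_INPUTS 6
#define HTP_OP_MAX_PARAMS 16

/* Descriptors copied from htp-ops.h. */
struct htp_tensor {
    uint32_t data, size, flags;
    uint16_t type, bi;
    uint32_t ne[HTP_OP_MAX_DIMS];
    uint32_t nb[HTP_OP_MAX_DIMS];
};

struct htp_buf_desc {
    uint64_t base, size;
    uint32_t flags, fd;
};

struct htp_op_desc {
    uint32_t opcode, flags;
    int32_t  params[HTP_OP_MAX_PARAMS];
    uint16_t src[HTP_OP_MAX_INPUTS];
    uint16_t dst;
    uint32_t prof_usecs, prof_cycles, prof_pkts, unused;
};

/* Hypothetical helper: byte offsets of each array inside the batch buffer. */
struct opbatch_layout {
    size_t bufs_off;
    size_t tensors_off;
    size_t ops_off;
    size_t total;
};

static struct opbatch_layout opbatch_layout_init(uint32_t n_bufs, uint32_t n_tensors, uint32_t n_ops) {
    struct opbatch_layout l;
    assert(n_tensors <= 65536u); /* src/dst tensor indices are uint16_t */
    l.bufs_off    = 0;
    l.tensors_off = l.bufs_off    + (size_t) n_bufs    * sizeof(struct htp_buf_desc);
    l.ops_off     = l.tensors_off + (size_t) n_tensors * sizeof(struct htp_tensor);
    l.total       = l.ops_off     + (size_t) n_ops     * sizeof(struct htp_op_desc);
    return l;
}
```

Calling `opbatch_layout_init(HTP_OP_MAX_BUFS, HTP_OP_MAX_TENSORS, HTP_OP_MAX_REQS)` with the limits defined in the header would give an upper bound on the buffer size for a full batch under these assumptions. Storing tensors once and referencing them by `uint16_t` index from each op is what lets consecutive ops share a tensor descriptor.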

@ -9,6 +9,8 @@
interface htp_iface : remote_handle64 {
AEEResult start(in uint32 sess_id, in uint64 dsp_queue_id, in uint32 n_hvx, in uint32 use_hmx);
AEEResult stop();
AEEResult mmap(in uint32 fd, in uint32 size, in uint32 pinned);
AEEResult munmap(in uint32 fd);
AEEResult enable_etm();
AEEResult disable_etm();
};

File diff suppressed because it is too large

@ -16,8 +16,9 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
#include "hmx-ops.h"
#define MM_SPAD_SRC0_NROWS 16
#define MM_SPAD_SRC1_NROWS 16
@ -1897,11 +1898,11 @@ static void vec_dot_f16_f32_uu_1x1(const int n, float * restrict s, const void *
hvx_vec_store_u(&s[0], 4, rsum);
}
#define htp_matmul_tensors_preamble \
struct htp_tensor * restrict src0 = &octx->src0; \
struct htp_tensor * restrict src1 = &octx->src1; \
struct htp_tensor * restrict src2 = &octx->src2; \
struct htp_tensor * restrict dst = &octx->dst; \
#define htp_matmul_tensors_preamble \
const struct htp_tensor * restrict src0 = octx->src[0]; \
const struct htp_tensor * restrict src1 = octx->src[1]; \
const struct htp_tensor * restrict src2 = octx->src[2]; \
const struct htp_tensor * restrict dst = octx->dst; \
struct htp_spad * restrict src0_spad = &octx->src0_spad; \
struct htp_spad * restrict src1_spad = &octx->src1_spad; \
struct htp_spad * restrict dst_spad = &octx->dst_spad; \
@ -2223,8 +2224,8 @@ struct mmid_row_mapping {
static void matmul_id(unsigned int nth, unsigned int ith, void * data) {
htp_matmul_preamble;
struct htp_tensor * restrict ids = &octx->src2;
struct htp_spad * restrict src2_spad = &octx->src2_spad;
const struct htp_tensor * restrict ids = octx->src[2];
struct htp_spad * restrict src2_spad = &octx->src2_spad;
uint64_t t1, t2;
t1 = HAP_perf_get_qtimer_count();
@ -2342,8 +2343,8 @@ static void matmul_id(unsigned int nth, unsigned int ith, void * data) {
static void matvec_id(unsigned int nth, unsigned int ith, void * data) {
htp_matmul_preamble;
struct htp_tensor * restrict ids = &octx->src2;
struct htp_spad * restrict src2_spad = &octx->src2_spad;
const struct htp_tensor * restrict ids = octx->src[2];
struct htp_spad * restrict src2_spad = &octx->src2_spad;
uint64_t t1, t2;
t1 = HAP_perf_get_qtimer_count();
@ -2612,7 +2613,7 @@ static void quantize_f32_q8x4x2(unsigned int nth, unsigned int ith, void * data)
struct htp_matmul_context * mmctx = data;
struct htp_ops_context * octx = mmctx->octx;
const struct htp_tensor * src = &octx->src1;
const struct htp_tensor * src = octx->src[1];
uint8_t * restrict dst = octx->src1_spad.data;
struct htp_spad * spad = &octx->src0_spad;
uint32_t nrows_per_thread = mmctx->src1_nrows_per_thread;
@ -2659,7 +2660,7 @@ static void quantize_f32_f16(unsigned int nth, unsigned int ith, void * data) {
struct htp_matmul_context * mmctx = data;
struct htp_ops_context * octx = mmctx->octx;
const struct htp_tensor * src = &octx->src1;
const struct htp_tensor * src = octx->src[1];
uint8_t * restrict dst = octx->src1_spad.data;
uint32_t nrows_per_thread = mmctx->src1_nrows_per_thread;
uint32_t dst_stride = octx->src1_spad.stride;
@ -2701,7 +2702,7 @@ static void quantize_f16_f16(unsigned int nth, unsigned int ith, void * data) {
struct htp_matmul_context * mmctx = data;
struct htp_ops_context * octx = mmctx->octx;
const struct htp_tensor * src = &octx->src1;
const struct htp_tensor * src = octx->src[1];
uint8_t * restrict dst = octx->src1_spad.data;
uint32_t nrows_per_thread = mmctx->src1_nrows_per_thread;
uint32_t dst_stride = octx->src1_spad.stride;
@ -2800,7 +2801,7 @@ static void htp_mminit_spad(struct htp_ops_context * octx,
octx->dst_spad.size = octx->dst_spad.size_per_thread * octx->n_threads;
}
int op_matmul(struct htp_ops_context * octx) {
static int op_matmul_hvx(struct htp_ops_context * octx) {
htp_matmul_tensors_preamble;
struct htp_matmul_context mmctx_struct = {0};
@ -2824,7 +2825,7 @@ int op_matmul(struct htp_ops_context * octx) {
worker_callback_t quant_job_func;
worker_callback_t matmul_job_func = src1_nrows > 1 ? matmul_2d : matvec_2d;
bool need_quant = !(octx->flags & HTP_OPFLAGS_SKIP_QUANTIZE);
bool need_quant = true;
if (src0->type == HTP_TYPE_F16) {
// Try optimized f16-f16 path first (src1 in VTCM)
@ -2838,7 +2839,7 @@ int op_matmul(struct htp_ops_context * octx) {
// Default matmul implementation does not support multi-batch src0 (N-vs-N broadcasting).
// It only supports 1-vs-N broadcasting (src0 is 2D) or standard 2D matmul.
const bool is_batched = (ne02 > 1) || (ne03 > 1);
const bool is_permuted = htp_is_permuted(&octx->src0) || htp_is_permuted(&octx->src1);
const bool is_permuted = htp_is_permuted(octx->src[0]) || htp_is_permuted(octx->src[1]);
if (!is_batched && !is_permuted && f16_total_size <= octx->ctx->vtcm_size) {
// Optimized path
@ -2915,34 +2916,172 @@ int op_matmul(struct htp_ops_context * octx) {
return HTP_STATUS_VTCM_TOO_SMALL;
}
octx->src0_spad.data = octx->ctx->vtcm_base;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->dst_spad.data = octx->src1_spad.data + octx->src1_spad.size;
// Place src1 spad first. We use it for dyn.quant and may reuse between ops
octx->src1_spad.data = octx->ctx->vtcm_base;
octx->src0_spad.data = octx->src1_spad.data + octx->src1_spad.size;
octx->dst_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->src1_spad.src = (src1 == octx->src1_spad.src) ? src1 : NULL;
octx->src0_spad.src = NULL;
octx->dst_spad.src = NULL;
octx->src0_spad.stride = src0_row_size_padded;
octx->src1_spad.stride = src1_row_size;
if (need_quant) {
if (octx->flags & HTP_OPFLAGS_SKIP_COMPUTE)
return HTP_STATUS_OK;
if (need_quant && !octx->src1_spad.src) {
const uint32_t n_quant_jobs = MIN(src1_nrows, octx->n_threads);
mmctx->src1_nrows_per_thread = (src1_nrows + n_quant_jobs - 1) / n_quant_jobs;
worker_pool_run_func(octx->ctx->worker_pool, quant_job_func, mmctx, n_quant_jobs);
// Cache where src1 was written so subsequent SKIP_QUANTIZE ops can find it
octx->ctx->prev_src1_spad = octx->src1_spad.data;
} else {
// SKIP_QUANTIZE: Q8 data lives at the address written by the previous
// quantize pass. The current op may have a different src0 size (e.g.
// IQ4_NL vs MXFP4), so src1_spad.data computed above could be wrong.
octx->src1_spad.data = octx->ctx->prev_src1_spad;
octx->src1_spad.src = src1;
}
if (!(octx->flags & HTP_OPFLAGS_SKIP_COMPUTE)) {
const uint32_t n_matmul_jobs = octx->n_threads;
worker_pool_run_func(octx->ctx->worker_pool, matmul_job_func, mmctx, n_matmul_jobs);
}
const uint32_t n_matmul_jobs = octx->n_threads;
worker_pool_run_func(octx->ctx->worker_pool, matmul_job_func, mmctx, n_matmul_jobs);
return HTP_STATUS_OK;
}
int op_matmul(struct htp_ops_context * octx) {
htp_matmul_tensors_preamble;
#ifndef HTP_HAS_HMX
return op_matmul_hvx(octx);
#else
if (!octx->ctx->hmx_enabled) {
return op_matmul_hvx(octx);
}
// HMX weight tile requires N to be 32-aligned.
if (src0->ne[1] % 32 != 0) {
return op_matmul_hvx(octx);
}
// HMX supports F16, Q4_0, Q8_0, IQ4_NL, MXFP4 weights.
// Other types fall back to HVX.
uint32_t wtype = src0->type;
if (wtype != HTP_TYPE_F16 && wtype != HTP_TYPE_Q4_0 && wtype != HTP_TYPE_Q8_0 && wtype != HTP_TYPE_IQ4_NL && wtype != HTP_TYPE_MXFP4) {
return op_matmul_hvx(octx);
}
// Quantised HMX path requires K aligned to 256 (x4x2 super-block).
// F16 HMX path requires K aligned to 32 (tile width).
if (wtype != HTP_TYPE_F16 && src0->ne[0] % 256 != 0) {
return op_matmul_hvx(octx);
}
if (wtype == HTP_TYPE_F16 && src0->ne[0] % 32 != 0) {
return op_matmul_hvx(octx);
}
const bool is_batched = (src0->ne[2] * src0->ne[3] > 1 || src1->ne[2] * src1->ne[3] > 1);
// Quantised HMX kernels only handle flat 2D matmul (host already rejects
// batched quantised, but guard here too). F16 batched matmul is handled
// by the dedicated wrapper in hmx-matmul-ops.c.
if (is_batched && src0->type != HTP_TYPE_F16) {
return op_matmul_hvx(octx);
}
// HMX assumes contiguous row-major layout. Fall back for permuted
// tensors where strides are non-monotonic (e.g. transposed KV cache).
if (src0->nb[0] > src0->nb[1] || src1->nb[0] > src1->nb[1]) {
return op_matmul_hvx(octx);
}
// M alignment: when M > 32 but not 32-aligned, we split into
// HMX (first m_hmx = M & ~31 rows) + HVX (remaining m_tail rows).
// When M <= 32 and not 32-aligned, fall back entirely to HVX.
const int m_total = (int) src1->ne[1];
const int m_tail = m_total % 32;
const int m_hmx = m_total - m_tail;
if (m_hmx == 0) {
return op_matmul_hvx(octx);
}
// Always re-quantize src1 since HMX kernel overwrites vtcm/spad,
// so any previously cached quantized data is invalid.
octx->src1_spad.src = NULL;
int k = (int) src0->ne[0]; // inner dimension
int n = (int) src0->ne[1]; // weight columns
// --- Phase 1: HMX on the first m_hmx (32-aligned) rows ---
int ret = -1;
// Row strides in elements. For compact tensors these equal k; for
// permuted attention views they can be larger, so pass the real stride.
const int act_stride = (int)(src1->nb[1] / sizeof(float));
const int wgt_stride = (int)(src0->nb[1] / sizeof(__fp16));
if (src0->type == HTP_TYPE_F16) {
if (is_batched) {
hmx_matmul_w16a32_batched_params_t batch_params = {
.dst = (float *) dst->data,
.activation = (float *) src1->data,
.permuted_weight = (const __fp16 *) src0->data,
.m = m_hmx,
.k = k,
.n = n,
.act_stride = act_stride,
.weight_stride = wgt_stride,
.dst_stride = (int) (dst->nb[1] / sizeof(float)),
.ne02 = ne02,
.ne03 = ne03,
.ne12 = ne12,
.ne13 = ne13,
.src0_nb2 = src0->nb[2],
.src0_nb3 = src0->nb[3],
.src1_nb2 = src1->nb[2],
.src1_nb3 = src1->nb[3],
.dst_nb2 = dst->nb[2],
.dst_nb3 = dst->nb[3],
};
ret = hmx_mat_mul_permuted_w16a32_batched(octx->ctx, &batch_params);
} else {
ret = hmx_mat_mul_permuted_w16a32(octx->ctx,
(float*) dst->data, (float*) src1->data, (const __fp16 *) src0->data,
m_hmx, k, n, act_stride, wgt_stride);
}
} else {
ret = hmx_mat_mul_permuted_qk_0_d16a32(octx->ctx,
(float*) dst->data, (float*) src1->data, (const uint8_t *) src0->data,
m_hmx, k, n, (int) src0->type);
}
if (ret != 0) {
FARF(HIGH, "HMX matmul failed (ret=%d), falling back to HVX", ret);
return op_matmul_hvx(octx);
}
// --- Phase 2: HVX on the remaining m_tail rows ---
if (m_tail > 0) {
// copy of src1 and dst
struct htp_tensor src1_tail = *src1;
struct htp_tensor dst_tail = *dst;
src1_tail.ne[1] = m_tail; // only tail rows
dst_tail.ne[1] = m_tail; // only tail rows
// Offset activation and dst pointers past the HMX-processed rows.
// Use nb[1] (row stride in bytes) to compute the byte offset.
src1_tail.data += (uint32_t) m_hmx * src1->nb[1];
dst_tail.data += (uint32_t) m_hmx * dst->nb[1];
octx->src[1] = &src1_tail;
octx->dst = &dst_tail;
FARF(HIGH, "hmx-matmul: HVX tail m_tail %d src1 %p dst %p", m_tail, (void *) src1_tail.data, (void *) dst_tail.data);
return op_matmul_hvx(octx);
}
return HTP_STATUS_OK;
#endif // HTP_HAS_HMX
}
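The M-dimension split in `op_matmul` can be checked in isolation. A small sketch, assuming a hypothetical `sk_split_m` helper that mirrors the `m_tail` / `m_hmx` arithmetic above (the 32-row tile height comes from the comments in the patch):

```c
#include <assert.h>

/* Hypothetical result type: rows handled by HMX and rows left for HVX. */
struct sk_m_split {
    int m_hmx;  /* largest multiple of 32 not exceeding m_total */
    int m_tail; /* remaining rows, always in [0, 31]            */
};

static struct sk_m_split sk_split_m(int m_total) {
    struct sk_m_split s;
    assert(m_total >= 0);
    s.m_tail = m_total % 32;
    s.m_hmx  = m_total - s.m_tail;
    return s;
}
```

With `m_total = 100` this gives 96 rows for HMX and a 4-row HVX tail. With `m_total = 31` it gives `m_hmx == 0`, which is the case where `op_matmul` falls back to HVX for the whole op.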
int op_matmul_id(struct htp_ops_context * octx) {
htp_matmul_tensors_preamble;
@ -2950,7 +3089,7 @@ int op_matmul_id(struct htp_ops_context * octx) {
struct htp_matmul_context * mmctx = &mmctx_struct;
mmctx->octx = octx;
struct htp_tensor * restrict ids = &octx->src2;
const struct htp_tensor * restrict ids = octx->src[2];
const size_t src0_row_size = nb01;
const size_t dst_row_size = nb1;
@ -3003,11 +3142,17 @@ int op_matmul_id(struct htp_ops_context * octx) {
return HTP_STATUS_VTCM_TOO_SMALL;
}
octx->src0_spad.data = octx->ctx->vtcm_base;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->src2_spad.data = octx->src1_spad.data + octx->src1_spad.size;
// Place src1 spad first. We use it for dyn.quant and may reuse in subseq ops.
octx->src1_spad.data = octx->ctx->vtcm_base;
octx->src0_spad.data = octx->src1_spad.data + octx->src1_spad.size;
octx->src2_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->dst_spad.data = octx->src2_spad.data + octx->src2_spad.size;
octx->src1_spad.src = (src1 == octx->src1_spad.src) ? src1 : NULL;
octx->src0_spad.src = NULL;
octx->src2_spad.src = NULL;
octx->dst_spad.src = NULL;
octx->src0_spad.stride = src0_row_size_padded;
octx->src1_spad.stride = src1_row_size;
@ -3031,20 +3176,18 @@ int op_matmul_id(struct htp_ops_context * octx) {
}
}
// Setup worker pool callbacks
if (!(octx->flags & HTP_OPFLAGS_SKIP_QUANTIZE)) {
if (octx->flags & HTP_OPFLAGS_SKIP_COMPUTE)
return HTP_STATUS_OK;
if (octx->src1_spad.src != src1) {
const uint32_t n_quant_jobs = MIN(src1_nrows, octx->n_threads);
mmctx->src1_nrows_per_thread = (src1_nrows + n_quant_jobs - 1) / n_quant_jobs;
worker_pool_run_func(octx->ctx->worker_pool, quant_job_func, mmctx, n_quant_jobs);
octx->ctx->prev_src1_spad = octx->src1_spad.data;
} else {
octx->src1_spad.data = octx->ctx->prev_src1_spad;
octx->src1_spad.src = src1;
}
if (!(octx->flags & HTP_OPFLAGS_SKIP_COMPUTE)) {
const uint32_t n_matmul_jobs = octx->n_threads;
worker_pool_run_func(octx->ctx->worker_pool, matmul_id_job_func, mmctx, n_matmul_jobs);
}
const uint32_t n_matmul_jobs = octx->n_threads;
worker_pool_run_func(octx->ctx->worker_pool, matmul_id_job_func, mmctx, n_matmul_jobs);
return HTP_STATUS_OK;
}
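The reuse tracking in `op_matmul_hvx` and `op_matmul_id` boils down to comparing the tensor pointer recorded in the spad against the current `src1`. A minimal sketch of that pattern, assuming simplified stand-in types (`sk_tensor`, `sk_spad`, and `sk_prepare_src1` are hypothetical names, and the quantize pass is replaced by a counter):

```c
#include <assert.h>
#include <stddef.h>

struct sk_tensor { int id; };                        /* stand-in for struct htp_tensor */
struct sk_spad   { const struct sk_tensor * src; };  /* stand-in for struct htp_spad   */

static int sk_quant_calls = 0; /* counts simulated quantize passes */

/* Returns 1 if src1 had to be (re)quantized into the spad, 0 if it was reused. */
static int sk_prepare_src1(struct sk_spad * spad, const struct sk_tensor * src1) {
    assert(spad != NULL && src1 != NULL);
    /* Keep the recorded owner only if it is the very same tensor. */
    spad->src = (src1 == spad->src) ? src1 : NULL;
    if (spad->src == NULL) {
        sk_quant_calls++;  /* the real code runs quant_job_func here */
        spad->src = src1;  /* remember whose data now lives in the spad */
        return 1;
    }
    return 0;
}
```

Any op that overwrites the same VTCM region has to reset the recorded `src` to `NULL`, as the HMX path in `op_matmul` and the rope op do in this patch. Otherwise a later matmul would treat stale data as already quantized.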

@ -12,7 +12,7 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
struct htp_repeat_context {
@ -32,8 +32,8 @@ struct htp_repeat_context {
static void repeat_job_per_thread(unsigned int nth, unsigned int ith, void * data) {
const struct htp_repeat_context * rctx = (const struct htp_repeat_context *) data;
struct htp_ops_context * octx = rctx->octx;
const struct htp_tensor * src = &octx->src0;
const struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src = octx->src[0];
const struct htp_tensor * dst = octx->dst;
const uint32_t ne00 = src->ne[0];
const uint32_t ne01 = src->ne[1];
@ -98,8 +98,8 @@ static void repeat_job_per_thread(unsigned int nth, unsigned int ith, void * dat
}
int op_repeat(struct htp_ops_context * octx) {
const struct htp_tensor * src0 = &octx->src0;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * dst = octx->dst;
// Validate that dst dims are multiples of src dims
if (dst->ne[0] % src0->ne[0] != 0 ||

@ -15,7 +15,7 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
// Redefined the types GGML_ROPE_TYPE_NORMAL & GGML_ROPE_TYPE_NEOX as we can't include ggml.h
@ -253,10 +253,10 @@ static void rope_job_f32(unsigned int nth, unsigned int ith, void * data) {
struct htp_rope_context * rctx = (struct htp_rope_context *) data;
struct htp_ops_context * octx = rctx->octx;
const struct htp_tensor * src0 = &octx->src0;
const struct htp_tensor * src1 = &octx->src1;
const struct htp_tensor * src2 = &octx->src2;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * src2 = octx->src[2];
const struct htp_tensor * dst = octx->dst;
htp_rope_preamble;
@ -284,7 +284,7 @@ static void rope_job_f32(unsigned int nth, unsigned int ith, void * data) {
dma_queue * dma_queue = octx->ctx->dma[ith];
const int32_t * pos = (const int32_t *) src1->data;
const float * freq_factors = src2->data ? (const float *) src2->data : NULL;
const float * freq_factors = src2 ? (const float *) src2->data : NULL;
uint32_t ir = 0;
uint32_t prev_i2 = (uint32_t) -1;
@ -384,10 +384,10 @@ done:
static int execute_op_rope_f32(struct htp_ops_context * octx) {
int err = HTP_STATUS_OK;
const struct htp_tensor * src0 = &octx->src0;
const struct htp_tensor * src1 = &octx->src1;
const struct htp_tensor * src2 = &octx->src2;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * src2 = octx->src[2];
const struct htp_tensor * dst = octx->dst;
const char * op_type = "rope-f32";
@ -424,19 +424,16 @@ static int execute_op_rope_f32(struct htp_ops_context * octx) {
return HTP_STATUS_VTCM_TOO_SMALL;
}
// Assign sizes
octx->src0_spad.size_per_thread = src0_spad_per_thread;
octx->dst_spad.size_per_thread = dst_spad_per_thread;
octx->src0_spad.size = n_threads * src0_spad_per_thread;
octx->dst_spad.size = n_threads * dst_spad_per_thread;
octx->src1_spad.size = 0;
// Assign pointers
octx->src0_spad.data = octx->ctx->vtcm_base;
octx->src1_spad.data = NULL;
octx->dst_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->src0_spad.data = octx->ctx->vtcm_base; octx->src0_spad.src = NULL;
octx->src1_spad.data = NULL; octx->src1_spad.src = NULL;
octx->dst_spad.data = octx->src0_spad.data + octx->src0_spad.size; octx->dst_spad.src = NULL;
// Fill context
struct htp_rope_context rctx;
memset(&rctx, 0, sizeof(struct htp_rope_context));
@ -483,7 +480,7 @@ static int execute_op_rope_f32(struct htp_ops_context * octx) {
int op_rope(struct htp_ops_context * octx) {
int err = HTP_STATUS_OK;
switch (octx->src0.type) {
switch (octx->src[0]->type) {
case HTP_TYPE_F32:
err = execute_op_rope_f32(octx);
break;
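
The hunks above switch the op context from embedded tensors (`octx->src0`) to tensor pointers (`octx->src[0]`), so an absent optional input such as the rope frequency factors is detected with a NULL pointer check (`src2 ? ... : NULL`) instead of by inspecting an empty tensor. A minimal sketch of that pattern, where `struct tensor`, `struct op_ctx` and `optional_data` are hypothetical stand-ins and not the real `htp_tensor` / `htp_ops_context` definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SRC 3  // assumption: three source slots, enough for this sketch

// Hypothetical, simplified stand-ins for the real structs (assumption).
struct tensor {
    const void * data;
    uint32_t     ne[4];
};

struct op_ctx {
    const struct tensor * src[MAX_SRC];  // a missing optional input is a NULL entry
    const struct tensor * dst;
};

// Returns the data pointer of an optional input, or NULL when the input is absent.
static const float * optional_data(const struct op_ctx * ctx, size_t idx) {
    assert(idx < MAX_SRC);
    const struct tensor * t = ctx->src[idx];
    return t ? (const float *) t->data : NULL;
}
```

With pointers, a caller has to test the pointer itself before touching any field, which is why the patch replaces checks such as `src2->data` and `src1->ne[0]` with `src2` and `src1`.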


@ -14,33 +14,37 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
#define set_rows_preamble \
const uint32_t ne00 = octx->src0.ne[0]; \
const uint32_t ne01 = octx->src0.ne[1]; \
const uint32_t ne02 = octx->src0.ne[2]; \
const uint32_t ne03 = octx->src0.ne[3]; \
\
const uint32_t ne10 = octx->src1.ne[0]; \
const uint32_t ne11 = octx->src1.ne[1]; \
const uint32_t ne12 = octx->src1.ne[2]; \
\
const uint32_t nb01 = octx->src0.nb[1]; \
const uint32_t nb02 = octx->src0.nb[2]; \
const uint32_t nb03 = octx->src0.nb[3]; \
\
const uint32_t nb10 = octx->src1.nb[0]; \
const uint32_t nb11 = octx->src1.nb[1]; \
const uint32_t nb12 = octx->src1.nb[2]; \
\
const uint32_t nb1 = octx->dst.nb[1]; \
const uint32_t nb2 = octx->dst.nb[2]; \
const uint32_t nb3 = octx->dst.nb[3]; \
\
const uint32_t ne1 = octx->dst.ne[1]; \
\
#define set_rows_preamble \
const uint32_t ne00 = octx->src[0]->ne[0]; \
const uint32_t ne01 = octx->src[0]->ne[1]; \
const uint32_t ne02 = octx->src[0]->ne[2]; \
const uint32_t ne03 = octx->src[0]->ne[3]; \
\
const uint32_t ne10 = octx->src[1]->ne[0]; \
const uint32_t ne11 = octx->src[1]->ne[1]; \
const uint32_t ne12 = octx->src[1]->ne[2]; \
const uint32_t ne13 = octx->src[1]->ne[3]; \
\
const uint32_t nb01 = octx->src[0]->nb[1]; \
const uint32_t nb02 = octx->src[0]->nb[2]; \
const uint32_t nb03 = octx->src[0]->nb[3]; \
\
const uint32_t nb10 = octx->src[1]->nb[0]; \
const uint32_t nb11 = octx->src[1]->nb[1]; \
const uint32_t nb12 = octx->src[1]->nb[2]; \
\
const uint32_t nb1 = octx->dst->nb[1]; \
const uint32_t nb2 = octx->dst->nb[2]; \
const uint32_t nb3 = octx->dst->nb[3]; \
\
const uint32_t ne0 = octx->dst->ne[0]; \
const uint32_t ne1 = octx->dst->ne[1]; \
const uint32_t ne2 = octx->dst->ne[2]; \
const uint32_t ne3 = octx->dst->ne[3]; \
\
const uint32_t nr = ne01;
struct htp_set_rows_context {
@ -56,12 +60,14 @@ static void set_rows_thread_f32_f32(unsigned int nth, unsigned int ith, void *da
set_rows_preamble;
uint64_t qt = HAP_perf_get_qtimer_count();
// parallelize by rows of src0
const uint32_t dr = srctx->src0_nrows_per_thread;
const uint32_t ir0 = dr * ith;
const uint32_t ir1 = (ir0 + dr < nr) ? (ir0 + dr) : nr;
const bool is_i32 = (octx->src1.type == HTP_TYPE_I32);
const bool is_i32 = (octx->src[1]->type == HTP_TYPE_I32);
for (uint32_t i03 = 0; i03 < ne03; ++i03) {
for (uint32_t i02 = 0; i02 < ne02; ++i02) {
@ -70,7 +76,7 @@ static void set_rows_thread_f32_f32(unsigned int nth, unsigned int ith, void *da
const uint32_t i11 = fastmodulo(i02, ne11, &srctx->div_ne11);
const uint32_t i10 = i;
const uintptr_t src1_addr = octx->src1.data + i10*nb10 + i11*nb11 + i12*nb12;
const uintptr_t src1_addr = octx->src[1]->data + i10*nb10 + i11*nb11 + i12*nb12;
uint32_t i1 = is_i32 ? *(int32_t *)src1_addr : *(int64_t *)src1_addr;
if (i1 >= ne1) {
@ -78,14 +84,18 @@ static void set_rows_thread_f32_f32(unsigned int nth, unsigned int ith, void *da
continue;
}
const uintptr_t src0_ptr = octx->src0.data + i*nb01 + i02*nb02 + i03*nb03;
const uintptr_t dst_ptr = octx->dst.data + i1*nb1 + i02*nb2 + i03*nb3;
const uintptr_t src0_ptr = octx->src[0]->data + i*nb01 + i02*nb02 + i03*nb03;
const uintptr_t dst_ptr = octx->dst->data + i1*nb1 + i02*nb2 + i03*nb3;
// copy row
hvx_copy_f32_uu((uint8_t *)dst_ptr, (const uint8_t *)src0_ptr, ne00);
}
}
}
qt = HAP_perf_qtimer_count_to_us(HAP_perf_get_qtimer_count() - qt);
FARF(HIGH, "set-rows-f32-f32 %d/%d: %ux%ux%ux%u (%u:%u) x %ux%ux%ux%u -> %ux%ux%ux%u usec %u\n", ith, nth,
ne00, ne01, ne02, ne03, ir0, ir1, ne10, ne11, ne12, ne13, ne0, ne1, ne2, ne3, (unsigned) qt);
}
static void set_rows_thread_f16_f32(unsigned int nth, unsigned int ith, void *data) {
@ -94,12 +104,14 @@ static void set_rows_thread_f16_f32(unsigned int nth, unsigned int ith, void *da
set_rows_preamble;
uint64_t qt = HAP_perf_get_qtimer_count();
// parallelize by rows of src0
const uint32_t dr = srctx->src0_nrows_per_thread;
const uint32_t ir0 = dr * ith;
const uint32_t ir1 = (ir0 + dr < nr) ? (ir0 + dr) : nr;
const bool is_i32 = (octx->src1.type == HTP_TYPE_I32);
const bool is_i32 = (octx->src[1]->type == HTP_TYPE_I32);
for (uint32_t i03 = 0; i03 < ne03; ++i03) {
for (uint32_t i02 = 0; i02 < ne02; ++i02) {
@ -108,7 +120,7 @@ static void set_rows_thread_f16_f32(unsigned int nth, unsigned int ith, void *da
const uint32_t i11 = fastmodulo(i02, ne11, &srctx->div_ne11);
const uint32_t i10 = i;
const uintptr_t src1_addr = octx->src1.data + i10*nb10 + i11*nb11 + i12*nb12;
const uintptr_t src1_addr = octx->src[1]->data + i10*nb10 + i11*nb11 + i12*nb12;
uint32_t i1 = is_i32 ? *(int32_t *)src1_addr : *(int64_t *)src1_addr;
if (i1 >= ne1) {
@ -116,13 +128,17 @@ static void set_rows_thread_f16_f32(unsigned int nth, unsigned int ith, void *da
continue;
}
const uint8_t* src0_ptr = (const uint8_t *) octx->src0.data + i*nb01 + i02*nb02 + i03*nb03;
uint8_t* dst_ptr = (uint8_t *) octx->dst.data + i1*nb1 + i02*nb2 + i03*nb3;
const uint8_t* src0_ptr = (const uint8_t *) octx->src[0]->data + i*nb01 + i02*nb02 + i03*nb03;
uint8_t* dst_ptr = (uint8_t *) octx->dst->data + i1*nb1 + i02*nb2 + i03*nb3;
hvx_copy_f16_f32_uu(dst_ptr, src0_ptr, ne00);
}
}
}
qt = HAP_perf_qtimer_count_to_us(HAP_perf_get_qtimer_count() - qt);
FARF(HIGH, "set-rows-f16-f32 %d/%d: %ux%ux%ux%u (%u:%u) x %ux%ux%ux%u -> %ux%ux%ux%u usec %u\n", ith, nth,
ne00, ne01, ne02, ne03, ir0, ir1, ne10, ne11, ne12, ne13, ne0, ne1, ne2, ne3, (unsigned) qt);
}
int op_set_rows(struct htp_ops_context * octx) {
@ -130,15 +146,15 @@ int op_set_rows(struct htp_ops_context * octx) {
const uint32_t n_threads = MIN(nr, octx->n_threads);
if (octx->src0.type != HTP_TYPE_F32) {
if (octx->src[0]->type != HTP_TYPE_F32) {
return HTP_STATUS_NO_SUPPORT;
}
if (octx->dst.type != HTP_TYPE_F32 && octx->dst.type != HTP_TYPE_F16) {
if (octx->dst->type != HTP_TYPE_F32 && octx->dst->type != HTP_TYPE_F16) {
return HTP_STATUS_NO_SUPPORT;
}
if (octx->src1.type != HTP_TYPE_I32 && octx->src1.type != HTP_TYPE_I64) {
if (octx->src[1]->type != HTP_TYPE_I32 && octx->src[1]->type != HTP_TYPE_I64) {
return HTP_STATUS_NO_SUPPORT;
}
@ -153,7 +169,7 @@ int op_set_rows(struct htp_ops_context * octx) {
srctx.src0_nrows_per_thread = (nr + n_threads - 1) / n_threads;
switch(octx->dst.type) {
switch(octx->dst->type) {
case HTP_TYPE_F32:
worker_pool_run_func(octx->ctx->worker_pool, set_rows_thread_f32_f32, &srctx, n_threads);
break;
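
The set-rows workers above split the rows of src0 evenly: each thread takes `dr` rows starting at `dr * ith`, and the last range is clipped to `nr`. A minimal standalone sketch of that split follows. `struct row_range` and `rows_for_thread` are hypothetical names, and the extra clamp on `ir0` is an addition for this sketch (the patch relies on the loop bounds instead).

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper type (not part of the patch): half-open row range [ir0, ir1).
struct row_range {
    uint32_t ir0;
    uint32_t ir1;
};

// Computes the rows handled by worker `ith` out of `n_threads` workers.
static struct row_range rows_for_thread(uint32_t nr, uint32_t n_threads, uint32_t ith) {
    assert(n_threads > 0);
    const uint32_t dr = (nr + n_threads - 1) / n_threads;  // rows per thread, rounded up

    uint32_t ir0 = dr * ith;
    if (ir0 > nr) {
        ir0 = nr;  // sketch-only clamp: trailing workers get an empty range
    }
    const uint32_t ir1 = (ir0 + dr < nr) ? (ir0 + dr) : nr;

    struct row_range r = { ir0, ir1 };
    return r;
}
```

Rounding `dr` up means the ranges cover every row, at the cost of the last workers possibly receiving fewer rows or none.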


@ -15,68 +15,89 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
#define htp_softmax_preamble3 \
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
const uint32_t ne02 = src0->ne[2]; \
const uint32_t ne03 = src0->ne[3]; \
\
const uint32_t nb00 = src0->nb[0]; \
const uint32_t nb01 = src0->nb[1]; \
const uint32_t nb02 = src0->nb[2]; \
const uint32_t nb03 = src0->nb[3]; \
\
const uint32_t ne10 = (src1->ne[0]) ? src1->ne[0] : 1; \
const uint32_t ne11 = (src1->ne[0]) ? src1->ne[1] : 1; \
const uint32_t ne12 = (src1->ne[0]) ? src1->ne[2] : 1; \
const uint32_t ne13 = (src1->ne[0]) ? src1->ne[3] : 1; \
\
const uint32_t nb10 = (src1->ne[0]) ? src1->nb[0] : 1; \
const uint32_t nb11 = (src1->ne[0]) ? src1->nb[1] : 1; \
const uint32_t nb12 = (src1->ne[0]) ? src1->nb[2] : 1; \
const uint32_t nb13 = (src1->ne[0]) ? src1->nb[3] : 1; \
\
const uint32_t ne0 = dst->ne[0]; \
const uint32_t ne1 = dst->ne[1]; \
const uint32_t ne2 = dst->ne[2]; \
const uint32_t ne3 = dst->ne[3]; \
\
const uint32_t nb0 = dst->nb[0]; \
const uint32_t nb1 = dst->nb[1]; \
const uint32_t nb2 = dst->nb[2]; \
#define htp_softmax_preamble3 \
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
const uint32_t ne02 = src0->ne[2]; \
const uint32_t ne03 = src0->ne[3]; \
\
const uint32_t nb00 = src0->nb[0]; \
const uint32_t nb01 = src0->nb[1]; \
const uint32_t nb02 = src0->nb[2]; \
const uint32_t nb03 = src0->nb[3]; \
\
const uint32_t ne10 = src1 ? src1->ne[0] : 1; \
const uint32_t ne11 = src1 ? src1->ne[1] : 1; \
const uint32_t ne12 = src1 ? src1->ne[2] : 1; \
const uint32_t ne13 = src1 ? src1->ne[3] : 1; \
\
const uint32_t nb10 = src1 ? src1->nb[0] : 1; \
const uint32_t nb11 = src1 ? src1->nb[1] : 1; \
const uint32_t nb12 = src1 ? src1->nb[2] : 1; \
const uint32_t nb13 = src1 ? src1->nb[3] : 1; \
\
const uint32_t ne0 = dst->ne[0]; \
const uint32_t ne1 = dst->ne[1]; \
const uint32_t ne2 = dst->ne[2]; \
const uint32_t ne3 = dst->ne[3]; \
\
const uint32_t nb0 = dst->nb[0]; \
const uint32_t nb1 = dst->nb[1]; \
const uint32_t nb2 = dst->nb[2]; \
const uint32_t nb3 = dst->nb[3];
struct htp_softmax_context {
struct htp_ops_context * octx;
bool use_f16;
bool use_src1;
uint32_t n_head;
uint32_t n_head_log2;
float scale;
float max_bias;
float m0;
float m1;
float scale;
float max_bias;
float m0;
float m1;
uint32_t src0_nrows_per_thread;
struct fastdiv_values fastdiv_ne01;
struct fastdiv_values fastdiv_ne02;
struct fastdiv_values fastdiv_ne12; // For mask broadcasting
struct fastdiv_values fastdiv_ne13; // For mask broadcasting
size_t spad_stride;
struct htp_ops_context * octx;
uint32_t src0_nrows_per_thread;
};
static void apply_mask(float * restrict wp0,
const float * restrict mp_f32,
const __fp16 * restrict mp_f16,
uint32_t ne00,
float slope,
bool use_f16) {
if (!mp_f32) {
return;
}
if (use_f16) {
for (uint32_t i = 0; i < ne00; ++i) {
wp0[i] += slope * (float) mp_f16[i];
}
} else {
for (uint32_t i = 0; i < ne00; ++i) {
wp0[i] += slope * mp_f32[i];
}
}
}
static void init_softmax_ctx(struct htp_softmax_context * smctx, struct htp_ops_context * octx) {
const struct htp_tensor * src0 = &octx->src0;
const struct htp_tensor * src1 = &octx->src1;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
memset(smctx, 0, sizeof(struct htp_softmax_context));
memcpy(&smctx->scale, (float *) octx->op_params, sizeof(float));
memcpy(&smctx->scale, (float *) octx->op_params, sizeof(float));
memcpy(&smctx->max_bias, (float *) octx->op_params + 1, sizeof(float));
smctx->n_head = src0->ne[2];
@ -85,8 +106,8 @@ static void init_softmax_ctx(struct htp_softmax_context * smctx, struct htp_ops_
smctx->m0 = powf(2.0f, -(smctx->max_bias) / smctx->n_head_log2);
smctx->m1 = powf(2.0f, -(smctx->max_bias / 2.0f) / smctx->n_head_log2);
smctx->use_src1 = (src1->ne[0] != 0);
smctx->use_f16 = (src1->ne[0] != 0) && (src1->type == HTP_TYPE_F16);
smctx->use_src1 = (src1 != 0);
smctx->use_f16 = (src1 != 0) && (src1->type == HTP_TYPE_F16);
smctx->octx = octx;
@ -97,8 +118,8 @@ static void init_softmax_ctx(struct htp_softmax_context * smctx, struct htp_ops_
if (ne01 > 0) smctx->fastdiv_ne01 = init_fastdiv_values(ne01);
if (ne02 > 0) smctx->fastdiv_ne02 = init_fastdiv_values(ne02);
const uint32_t ne12 = (src1->ne[0]) ? src1->ne[2] : 1;
const uint32_t ne13 = (src1->ne[0]) ? src1->ne[3] : 1;
const uint32_t ne12 = src1 ? src1->ne[2] : 1;
const uint32_t ne13 = src1 ? src1->ne[3] : 1;
if (ne12 > 0) smctx->fastdiv_ne12 = init_fastdiv_values(ne12);
if (ne13 > 0) smctx->fastdiv_ne13 = init_fastdiv_values(ne13);
@ -139,10 +160,7 @@ static void hvx_fast_softmax_prep_f32(const uint8_t * restrict src,
}
}
static void hvx_fast_softmax_f32(const uint8_t * restrict src,
uint8_t * restrict dst,
uint8_t * restrict pad,
const int num_elems) {
static void hvx_fast_softmax_f32(const uint8_t * restrict src, uint8_t * restrict dst, uint8_t * restrict pad, const int num_elems) {
const HVX_Vector * restrict v_src = (HVX_Vector *) src;
HVX_Vector * restrict v_pad = (HVX_Vector *) pad;
HVX_Vector * restrict v_dst = (HVX_Vector *) dst;
@ -188,27 +206,20 @@ static void hvx_fast_softmax_f32(const uint8_t * restrict src,
}
}
static float hvx_softmax_f32(const uint8_t * restrict src,
uint8_t * restrict dst,
uint8_t * restrict spad,
const int num_elems,
const float max) {
static float hvx_softmax_f32(const uint8_t * restrict src, uint8_t * restrict dst, uint8_t * restrict spad, const int num_elems, const float max) {
hvx_sub_scalar_f32(spad, src, max, num_elems);
hvx_exp_f32(dst, spad, num_elems, false);
float sum = hvx_reduce_sum_f32(dst, num_elems);
return sum;
return hvx_reduce_sum_f32(dst, num_elems);
}
static void softmax_job_f32(unsigned int nth, unsigned int ith, void * data) {
struct htp_softmax_context * smctx = (struct htp_softmax_context *) data;
struct htp_ops_context * octx = smctx->octx;
const struct htp_tensor * src0 = &octx->src0;
const struct htp_tensor * src1 = &octx->src1;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * dst = octx->dst;
htp_softmax_preamble3;
@ -223,22 +234,26 @@ static void softmax_job_f32(unsigned int nth, unsigned int ith, void * data) {
return;
}
uint64_t t1, t2;
t1 = HAP_perf_get_qtimer_count();
uint64_t qt = HAP_perf_get_qtimer_count();
int is_aligned = 1;
int opt_path = 0;
if (!hex_is_aligned((void *) src0->data, VLEN) || !hex_is_aligned((void *) dst->data, VLEN)) {
is_aligned = 0;
FARF(HIGH, "softmax-f32: unaligned addresses in elementwise op, possibly slower execution\n");
}
// Only use the fast path when aligned AND row size is multiple of VLEN (128 bytes)
// The fast path (hvx_fast_softmax_f32) doesn't handle tail elements
// The non-opt path uses hvx_softmax_f32 which properly handles all sizes via its helper functions
if ((1 == is_aligned) && !(nb01 & (VLEN - 1))) {
opt_path = 1;
}
uint8_t * src0_spad_data = octx->src0_spad.data + (ith * smctx->spad_stride);
uint8_t * src1_spad_data = octx->src1_spad.data + (ith * smctx->spad_stride);
uint8_t * dst_spad_data = octx->dst_spad.data + (ith * smctx->spad_stride);
uint8_t * src0_spad_data = octx->src0_spad.data + (ith * octx->src0_spad.size_per_thread);
uint8_t * src1_spad_data = octx->src1_spad.data + (ith * octx->src1_spad.size_per_thread);
uint8_t * dst_spad_data = octx->dst_spad.data + (ith * octx->dst_spad.size_per_thread);
float * wp0 = (float *) src0_spad_data;
float * wp1 = (float *) src1_spad_data;
@ -278,47 +293,29 @@ static void softmax_job_f32(unsigned int nth, unsigned int ith, void * data) {
// ALiBi
if (i2 != prev_i2) {
const uint32_t h = i2; // head
slope = (smctx->max_bias > 0.0f) ?
h < smctx->n_head_log2 ?
powf(smctx->m0, h + 1) :
powf(smctx->m1, 2 * (h - smctx->n_head_log2) + 1) :
1.0f;
slope = (smctx->max_bias > 0.0f) ? h < smctx->n_head_log2 ? powf(smctx->m0, h + 1) : powf(smctx->m1, 2 * (h - smctx->n_head_log2) + 1) : 1.0f;
prev_i2 = i2;
}
float * sp = (float *) ((char *) octx->src0.data + i1 * nb01 + i2 * nb02 + i3 * nb03);
float * dp = (float *) ((char *) octx->dst.data + i1 * nb1 + i2 * nb2 + i3 * nb3);
float * sp = (float *) ((char *) src0->data + i1 * nb01 + i2 * nb02 + i3 * nb03);
float * dp = (float *) ((char *) dst->data + i1 * nb1 + i2 * nb2 + i3 * nb3);
// broadcast the mask across rows
__fp16 * mp_f16 = (smctx->use_src1) ?
(__fp16 *) ((char *) octx->src1.data + i11 * nb11 + i12 * nb12 + i13 * nb13) :
NULL;
float * mp_f32 = (smctx->use_src1) ?
(float *) ((char *) octx->src1.data + i11 * nb11 + i12 * nb12 + i13 * nb13) :
NULL;
__fp16 * mp_f16 = (smctx->use_src1) ? (__fp16 *) ((char *) src1->data + i11 * nb11 + i12 * nb12 + i13 * nb13) : NULL;
float * mp_f32 = (smctx->use_src1) ? (float *) ((char *) src1->data + i11 * nb11 + i12 * nb12 + i13 * nb13) : NULL;
if ((1 == opt_path) && (mp_f32) && !(smctx->use_f16)) {
hvx_fast_softmax_prep_f32((const uint8_t *) sp, (uint8_t *) wp0, ne00, smctx->scale,
(const uint8_t *) mp_f32, slope);
} else {
hvx_fast_softmax_prep_f32((const uint8_t *) sp, (uint8_t *) wp0, ne00, smctx->scale, (const uint8_t *) mp_f32, slope);
hvx_fast_softmax_f32((const uint8_t *) wp0, (uint8_t *) dp, (uint8_t *) wp1, ne00);
} else if (1 == opt_path) {
hvx_scale_f32((uint8_t *) wp0, (const uint8_t *) sp, ne00, smctx->scale);
if (mp_f32) {
if (smctx->use_f16) {
for (int i = 0; i < ne00; ++i) {
wp0[i] += slope * (float) mp_f16[i];
}
} else {
for (int i = 0; i < ne00; ++i) {
wp0[i] += slope * mp_f32[i];
}
}
}
}
if (1 == opt_path) {
apply_mask(wp0, mp_f32, mp_f16, ne00, slope, smctx->use_f16);
hvx_fast_softmax_f32((const uint8_t *) wp0, (uint8_t *) dp, (uint8_t *) wp1, ne00);
} else {
// Non-optimized path: uses HVX helper functions that properly handle all tensor sizes
// including non-multiples of 32 (the HVX vector lane count for f32)
hvx_scale_f32((uint8_t *) wp0, (const uint8_t *) sp, ne00, smctx->scale);
apply_mask(wp0, mp_f32, mp_f16, ne00, slope, smctx->use_f16);
float max = hvx_reduce_max_f32((const uint8_t *) wp0, ne00);
float sum = hvx_softmax_f32((const uint8_t *) wp0, (uint8_t *) wp2, (uint8_t *) wp1, ne00, max);
sum = sum > 0.0 ? (1.0 / sum) : 1;
@ -326,54 +323,47 @@ static void softmax_job_f32(unsigned int nth, unsigned int ith, void * data) {
}
}
t2 = HAP_perf_get_qtimer_count();
FARF(HIGH, "softmax-f32 %d/%d/%d/%d: %ux%ux%ux%u (%u:%u) x %ux%ux%ux%u -> %ux%ux%ux%u usec %u\n", ith, nth,
smctx->use_f16, opt_path, ne00, ne01, ne02, ne03, src0_start_row, src0_end_row, ne10, ne11, ne12, ne13,
ne0, ne1, ne2, ne3, (unsigned) HAP_perf_qtimer_count_to_us(t2 - t1));
qt = HAP_perf_qtimer_count_to_us(HAP_perf_get_qtimer_count() - qt);
FARF(HIGH, "softmax-f32 %d/%d: %ux%ux%ux%u (%u:%u) x %ux%ux%ux%u -> %ux%ux%ux%u : opt %u f16 %u usec %u\n", ith, nth,
ne00, ne01, ne02, ne03, src0_start_row, src0_end_row, ne10, ne11, ne12, ne13,
ne0, ne1, ne2, ne3, opt_path, smctx->use_f16, (unsigned) qt);
}
static int execute_op_softmax_f32(struct htp_ops_context * octx) {
int err = HTP_STATUS_OK;
const struct htp_tensor * src0 = &octx->src0;
const struct htp_tensor * src1 = &octx->src1;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * dst = octx->dst;
struct htp_softmax_context smctx;
const char * op_type = "softmax-f32";
switch (octx->op) {
case HTP_OP_SOFTMAX:
init_softmax_ctx(&smctx, octx);
break;
default:
FARF(ERROR, "Unsupported Op %u\n", octx->op);
return HTP_STATUS_NO_SUPPORT;
}
init_softmax_ctx(&smctx, octx);
const uint32_t src0_nrows = src0->ne[1] * src0->ne[2] * src0->ne[3];
const uint32_t n_threads = MIN(octx->n_threads, src0_nrows);
smctx.src0_nrows_per_thread = (src0_nrows + n_threads - 1) / n_threads;
const size_t src0_row_size = src0->nb[1];
const size_t src1_row_size = src0_row_size;
const size_t dst_row_size = dst->nb[1];
// VTCM scratchpads for all tensors
// N rows per thread, padded to HVX vector size
octx->dst_spad.size = hex_round_up(dst_row_size, 128) * n_threads;
octx->src0_spad.size = hex_round_up(src0_row_size, 128) * n_threads;
octx->src1_spad.size = hex_round_up(src1_row_size, 128) * n_threads;
// 4 rows per thread, padded to HVX vector size
octx->src0_spad.size_per_thread = hex_round_up(4 * src0_row_size, 128);
octx->src1_spad.size_per_thread = hex_round_up(4 * src1_row_size, 128);
octx->dst_spad.size_per_thread = hex_round_up(4 * dst_row_size, 128);
// Use stride for calculating offset
smctx.spad_stride = hex_round_up(src0_row_size, 128);
octx->src0_spad.size = octx->src0_spad.size_per_thread * n_threads;
octx->src1_spad.size = octx->src1_spad.size_per_thread * n_threads;
octx->dst_spad.size = octx->dst_spad.size_per_thread * n_threads;
size_t spad_size = octx->src0_spad.size + octx->src1_spad.size + octx->dst_spad.size;
if (src1->ne[0]) {
FARF(HIGH,
"%s: %ux%ux%ux%u x %ux%ux%ux%u -> %ux%ux%ux%u : src0-spad-size %u src1-spad-size %u dst-spad-size %u\n",
if (src1) {
FARF(HIGH, "%s: %ux%ux%ux%u x %ux%ux%ux%u -> %ux%ux%ux%u : src0-spad-size %u src1-spad-size %u dst-spad-size %u\n",
op_type, src0->ne[0], src0->ne[1], src0->ne[2], src0->ne[3], src1->ne[0], src1->ne[1], src1->ne[2],
src1->ne[3], dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3], octx->src0_spad.size, octx->src1_spad.size,
octx->dst_spad.size);
@ -385,19 +375,17 @@ static int execute_op_softmax_f32(struct htp_ops_context * octx) {
// Make sure the reserved vtcm size is sufficient
if (octx->ctx->vtcm_size < spad_size) {
FARF(ERROR, "%s : current VTCM reservation %zu is too small, needed %zu\n", op_type, octx->ctx->vtcm_size,
spad_size);
FARF(ERROR, "%s : current VTCM reservation %zu is too small, needed %zu\n", op_type, octx->ctx->vtcm_size, spad_size);
return HTP_STATUS_VTCM_TOO_SMALL;
}
octx->src0_spad.data = octx->ctx->vtcm_base;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->dst_spad.data = octx->src1_spad.data + octx->src1_spad.size;
octx->src0_spad.data = octx->ctx->vtcm_base; octx->src0_spad.src = NULL;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size; octx->src1_spad.src = NULL;
octx->dst_spad.data = octx->src1_spad.data + octx->src1_spad.size; octx->dst_spad.src = NULL;
if (!(octx->flags & HTP_OPFLAGS_SKIP_COMPUTE)) {
smctx.src0_nrows_per_thread = (src0_nrows + n_threads - 1) / n_threads;
worker_pool_run_func(octx->ctx->worker_pool, softmax_job_f32, &smctx, n_threads);
}
if (octx->flags & HTP_OPFLAGS_SKIP_COMPUTE) return err;
worker_pool_run_func(octx->ctx->worker_pool, softmax_job_f32, &smctx, n_threads);
return err;
}
@ -405,7 +393,7 @@ static int execute_op_softmax_f32(struct htp_ops_context * octx) {
int op_softmax(struct htp_ops_context * octx) {
int err = HTP_STATUS_OK;
switch (octx->src0.type) {
switch (octx->src[0]->type) {
case HTP_TYPE_F32:
err = execute_op_softmax_f32(octx);
break;
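
The softmax job above takes the fast HVX path only when both tensor base addresses are vector aligned and the row stride `nb01` is a multiple of `VLEN` (128 bytes), because the fast kernel does not handle tail elements. A minimal sketch of that eligibility test, assuming `VLEN` is 128 as the patch comment states; `is_aligned_to` and `softmax_can_use_fast_path` are hypothetical names, not functions from the backend:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VLEN 128  // HVX vector width in bytes, per the comment in the patch

// True when `p` is a multiple of `align`; `align` must be a power of two.
static bool is_aligned_to(const void * p, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t) p & (align - 1)) == 0;
}

// Fast path needs aligned src/dst and a row stride with no partial vector at the end.
static bool softmax_can_use_fast_path(const void * src, const void * dst, size_t row_stride_bytes) {
    return is_aligned_to(src, VLEN) &&
           is_aligned_to(dst, VLEN) &&
           (row_stride_bytes & (VLEN - 1)) == 0;
}
```

Rows that fail the test fall back to the helper-based path, which handles arbitrary sizes.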


@ -16,14 +16,14 @@
#include "ggml-common.h"
#include "htp-ctx.h"
#include "hex-dma.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
#include "hvx-utils.h"
#define htp_ssm_conv_tensors_preamble \
struct htp_tensor * restrict src0 = &octx->src0; \
struct htp_tensor * restrict src1 = &octx->src1; \
struct htp_tensor * restrict dst = &octx->dst; \
#define htp_ssm_conv_tensors_preamble \
const struct htp_tensor * restrict src0 = octx->src[0]; \
const struct htp_tensor * restrict src1 = octx->src[1]; \
const struct htp_tensor * restrict dst = octx->dst; \
struct htp_spad * restrict src0_spad = &octx->src0_spad; \
struct htp_spad * restrict src1_spad = &octx->src1_spad; \
struct htp_spad * restrict dst_spad = &octx->dst_spad; \
@ -289,9 +289,9 @@ int op_ssm_conv_f32(struct htp_ops_context * octx) {
// Compute gather scratchpad size for src0 and src1
const size_t gather_spad_size = n_threads * VLEN * 2;
octx->src0_spad.data = octx->ctx->vtcm_base + gather_spad_size;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size;
octx->dst_spad.data = octx->src1_spad.data + octx->src1_spad.size;
octx->src0_spad.data = octx->ctx->vtcm_base + gather_spad_size; octx->src0_spad.src = NULL;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size; octx->src1_spad.src = NULL;
octx->dst_spad.data = octx->src1_spad.data + octx->src1_spad.size; octx->dst_spad.src = NULL;
FARF(HIGH, "ssm_conv-f32: gather-spad:%zu spad-per-thread:(%u:%u:%u) spad-sizes:(%u:%u:%u) spad-data:(%p:%p:%p)\n",
gather_spad_size, octx->src0_spad.size_per_thread, octx->src1_spad.size_per_thread,
@ -323,8 +323,9 @@ int op_ssm_conv_f32(struct htp_ops_context * octx) {
}
int op_ssm_conv(struct htp_ops_context * octx) {
int err = HTP_STATUS_OK;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * dst = octx->dst;
int err = HTP_STATUS_OK;
switch (dst->type) {
case HTP_TYPE_F32:
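
The ops in this patch carve their VTCM scratchpads out of one contiguous region: each scratchpad gets a per-thread size padded to the 128-byte vector size, a total size of `size_per_thread * n_threads`, and a data pointer that starts where the previous scratchpad ends (after an optional leading region such as the ssm-conv gather scratchpad). The sketch below illustrates that layout under simplified assumptions; `struct spad`, `round_up` and `layout_spads` are hypothetical and do not mirror the real `htp_spad` or `hex_round_up` definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_SPADS 3  // src0, src1, dst

// Hypothetical, simplified scratchpad descriptor (assumption).
struct spad {
    uint8_t * data;
    size_t    size_per_thread;
    size_t    size;
};

static size_t round_up(size_t v, size_t align) {
    assert(align > 0);
    return (v + align - 1) / align * align;
}

// Lays out N_SPADS scratchpads back to back, starting `lead` bytes past `base`.
// Returns the total bytes consumed (including `lead`), or 0 if `avail` is too small.
static size_t layout_spads(uint8_t * base, size_t avail, size_t lead,
                           struct spad spads[N_SPADS],
                           const size_t bytes_per_thread[N_SPADS],
                           size_t n_threads) {
    size_t total = lead;
    for (size_t i = 0; i < N_SPADS; ++i) {
        spads[i].size_per_thread = round_up(bytes_per_thread[i], 128);
        spads[i].size            = spads[i].size_per_thread * n_threads;
        total += spads[i].size;
    }
    if (total > avail) {
        return 0;  // mirrors the "VTCM reservation too small" check
    }

    uint8_t * p = base + lead;
    for (size_t i = 0; i < N_SPADS; ++i) {
        spads[i].data = p;
        p += spads[i].size;
    }
    return total;
}
```

Storing `size_per_thread` alongside `size` lets each worker find its slice as `data + ith * size_per_thread`, which is how the softmax job indexes its scratchpads after this change.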


@ -14,13 +14,13 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
#define sum_rows_preamble \
struct htp_tensor *src0 = &octx->src0;\
struct htp_tensor *dst = &octx->dst; \
\
#define sum_rows_preamble \
const struct htp_tensor *src0 = octx->src[0]; \
const struct htp_tensor *dst = octx->dst; \
\
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
const uint32_t ne02 = src0->ne[2]; \
@ -94,7 +94,7 @@ static void sum_rows_thread_f32(unsigned int nth, unsigned int ith, void *data)
int op_sum_rows(struct htp_ops_context * octx) {
sum_rows_preamble;
if (octx->src0.type != HTP_TYPE_F32) {
if (octx->src[0]->type != HTP_TYPE_F32) {
return HTP_STATUS_NO_SUPPORT;
}


@ -16,7 +16,7 @@
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
#include "htp-ops.h"
struct htp_unary_context {
@ -267,8 +267,8 @@ static void softplus_f32(const float * restrict src,
static void unary_job_f32_per_thread(unsigned int nth, unsigned int ith, void * data) {
const struct htp_unary_context * uctx = (const struct htp_unary_context *) data;
struct htp_ops_context * octx = uctx->octx;
const struct htp_tensor * src = &octx->src0;
const struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src = octx->src[0];
const struct htp_tensor * dst = octx->dst;
htp_unary_preamble;
@ -387,8 +387,8 @@ static void unary_job_f32_per_thread(unsigned int nth, unsigned int ith, void *
static int execute_op_unary_f32(struct htp_ops_context * octx) {
int err = HTP_STATUS_OK;
const struct htp_tensor * src0 = &octx->src0;
struct htp_tensor * dst = &octx->dst;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * dst = octx->dst;
const char * op_type = NULL;
@ -490,7 +490,7 @@ static int execute_op_unary_f32(struct htp_ops_context * octx) {
int op_unary(struct htp_ops_context * octx) {
int err = HTP_STATUS_OK;
switch (octx->src0.type) {
switch (octx->src[0]->type) {
case HTP_TYPE_F32:
err = execute_op_unary_f32(octx);
break;