* Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW
Data collected on a B4500:
Before
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8
code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8
explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4
summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6
qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1
translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5
creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2
stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2
long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9
```
After
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9
code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6
explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8
summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2
qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5
translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4
creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8
stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7
long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7
```
Server launched with:
```
➜ llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \
-m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
-ngl all \
-fa on \
--host 0.0.0.0 \
--port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```
* LC to overlap with following kernels
|
||
|---|---|---|
| .. | ||
| cmake | ||
| include | ||
| src | ||
| .gitignore | ||
| CMakeLists.txt | ||