4699 Commits

Author SHA1 Message Date
Kitaiti Makoto 2cfefa926b Add Parakeet::Segment 2026-06-17 10:30:56 +09:00
Kitaiti Makoto 46a3a2cb93 Add TestParakeetContext 2026-06-17 10:30:56 +09:00
Kitaiti Makoto fdaf031858 Implement Parakeet::Context#initialize 2026-06-17 10:30:56 +09:00
Kitaiti Makoto 8615ac87ec Free parakeet_full_params 2026-06-17 10:30:56 +09:00
Kitaiti Makoto 3bae1e2f1b Reduce if 2026-06-17 10:30:56 +09:00
Kitaiti Makoto f55f3f347c Check callback container in GetParakeetParams 2026-06-17 10:30:56 +09:00
Kitaiti Makoto a3515ac9fc Fix typo 2026-06-17 10:30:56 +09:00
Kitaiti Makoto d051ab6261 Add hook methods to Parakeet::Params 2026-06-17 10:30:56 +09:00
Kitaiti Makoto 105f7a86b9 Define Parakeet 2026-06-17 10:30:56 +09:00
Kitaiti Makoto c5894984b3 Simplify params registration 2026-06-17 10:30:56 +09:00
Kitaiti Makoto 17bd819585 Remove unnecessary macros 2026-06-17 10:30:56 +09:00
Kitaiti Makoto 1e7c734a1d Fix memsize 2026-06-17 10:30:56 +09:00
Kitaiti Makoto 09eff4d1ba Use ITERATE_CALLBACK_PARAMS instead of ITERATE_USER_DATA_PARAMS 2026-06-17 10:30:56 +09:00
Kitaiti Makoto d051c08841 Use ITERATE_CALLBACK_PARAMS 2026-06-17 10:30:56 +09:00
Kitaiti Makoto cd0e91175a Remove unused variable 2026-06-17 10:30:56 +09:00
Kitaiti Makoto b1dbf7452d Define GetParakeetParams 2026-06-17 10:30:56 +09:00
Kitaiti Makoto f412e289ea Undefine local macros 2026-06-17 10:30:56 +09:00
Kitaiti Makoto f39b100bb0 Group callback and user_data params 2026-06-17 10:30:56 +09:00
Kitaiti Makoto 555569481c Add callbacks to Parakeet::Params 2026-06-17 10:30:56 +09:00
Kitaiti Makoto 703fe18e60 Remove unused variable 2026-06-17 10:30:56 +09:00
Kitaiti Makoto 30abb35db8 Add tests for Parakeet::Params 2026-06-17 10:30:56 +09:00
Kitaiti Makoto f3b2ed68e5 Add Whisper::Parakeet::Params 2026-06-17 10:30:56 +09:00
Daniel Bevenius 9efddafb91
parakeet : add support for NVIDIA Parakeet (#3735)
* parakeet : add support for NVIDIA Parakeet


Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-16 20:44:10 +02:00
Daniel Bevenius 3805e602d3
ci : only trigger release jobs for tags (#3883)
* ci : only trigger release jobs for tags

This commit removes the building of the release jobs on pushes to
master.

The motivation for this is that it can be confusing at the moment when
releasing that the push to master also triggers the release jobs but
the actual release will be skipped. With this change the release job is
only run when a tag is pushed which should result in a single Release
github actions job and make it easier to follow.

* ci : add GGML_NATIVE=OFF for ubuntu-22-gcc
2026-06-16 14:33:42 +02:00
Daniel Bevenius 48f628a848
release : v1.8.7 (#3881) 2026-06-16 12:28:23 +02:00
Rum Nguyen db5a84bd79
cli : add --version flag (#3878)
Adds a `--version` option to whisper-cli that prints the library version
via `whisper_version()` and exits, plus a corresponding entry in the help
output. Mirrors the existing `-h`/`--help` handling.

Closes #608
2026-06-16 08:58:09 +02:00
Georgi Gerganov 0ec0845110 talk-llama : sync llama.cpp 2026-06-15 10:33:53 +03:00
Georgi Gerganov 0a3fa9ca17 sync : ggml 2026-06-15 10:33:53 +03:00
Georgi Gerganov f35f47b5d2 ggml : bump version to 0.15.1 (ggml/1541) 2026-06-15 10:33:53 +03:00
ZihaoMu 882736f886 ggml: support concat for scalar types at cuda backend (llama/24011)
* cuda: support concat for scalar types

* Update concat.cu

* fix metal ci issue
2026-06-15 10:33:53 +03:00
shaofeiqi 2dcfd49d59 opencl: add q5_0/q5_1 gemm and gemv kernels for Adreno (llama/24319)
* opencl: add q5_0 adreno support

* opencl: add q5_1 adreno support

* opencl: cosmetic fix

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-06-15 10:33:53 +03:00
Jeff Bolz afd559279c vulkan: ifdef eMesaHoneykrisp (build fix) (llama/24479)
Fixes build/CI after #24306.
2026-06-15 10:33:53 +03:00
Georgi Gerganov b04008fcec ggml : bump version to 0.15.0 (ggml/1539) 2026-06-15 10:33:53 +03:00
Winston Ma 6870cfd616 vulkan: add fast path for contiguous buffer transfers (llama/23973) 2026-06-15 10:33:53 +03:00
Kevin Liu a512e4c5c3 vulkan: use medium matmul tile on Asahi Linux (llama/24306)
* vulkan: use medium matmul tile on Asahi Linux

* vulkan: switch Apple detection to Honeykrisp driver id
2026-06-15 10:33:53 +03:00
Gaurav Garg 1a1900f90c Remove padding and multiple D2D copies for MTP (llama/24086)
* Make ggml_gated_delta_net take only the initial recurrent state (D, 1, n_seqs) and pass the snapshot count K as an op parameter instead of inferring it from state->ne[1].

Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy

* Make GDN changes in all backends. Address review comments.

* Fix CI build errors
2026-06-15 10:33:53 +03:00
Oliver Simons ef85b26d9f CUDA: Fix ssm_scan_f32 data-races (llama/24360)
* Add missing syncthreads before reusing cub_temp_storage

__syncthreads() is required before being allowed to reuse TempStorage
smem:
https://nvidia.github.io/cccl/unstable/cub/api/classcub_1_1BlockLoad.html#_CPPv4I0EN3cub9BlockLoad4LoadEv20RandomAccessIteratorRA14ItemsPerThread_1Ti

* Add one more missing __syncthreads

Could also double-buffer, but alternative is to simply ensure all
threads have read smem* before writing to it again in the next loop
iteration

* Remove unused smem from ssm_scan_f32
2026-06-15 10:33:53 +03:00
Jeff Bolz dc794303d8 vulkan: reduce iq1 shared memory usage for mul_mm (llama/24287) 2026-06-15 10:33:53 +03:00
Ruben Ortlam 686bc802d1 vulkan: add `v_dot2_f32_f16` support in matrix-matrix multiplication and Flash Attention (llama/24123)
* vulkan: add support for valve fp16 dot2 extension

* use macro for dot2 path choice

* properly check for the feature

* add dot_product abstraction to reduce preprocessor branching
2026-06-15 10:33:53 +03:00
Pascal 28c7ed3db7 ggml : add GGML_OP_COL2IM_1D (llama/24206)
* cpu: add GGML_OP_COL2IM_1D

Add the overlap-add (scatter-add) step of a 1D transposed convolution.
A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight
pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input
with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d
scatters those columns back into the [T_out, OC] signal, with
T_out = (T_in - 1)*s0 + K - 2*p0.

Keeping the contraction as a plain mul_mat leaves the heavy work on the
optimized (and quantizable) matmul kernels, so col2im_1d only does the
cheap overlap-add.

CPU uses a gather formulation parallelized over output channels,
supporting F32, F16 and BF16 with an F32 accumulator.

* tests: add backend coverage for GGML_OP_COL2IM_1D

Add test_col2im_1d next to the conv_transpose_1d cases, covering F32,
F16 and BF16 across eight geometries: the canonical kernel = 2*stride
DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and
p0 = stride/2), kernel < stride with zeroed gaps, kernel not a
multiple of stride, and a single column unfold.

Perf mode gets three real vocoder stage shapes reporting memory
bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16.

* cpu: harden GGML_OP_COL2IM_1D

ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph
build time, before the oc division, protecting every backend at once.
The kernel asserts the contiguity its flat indexing assumes and its
doc states the full output length including the crop term.

The kernel parallelizes over the time axis: the split stays balanced
down to OC = 1, where the previous channel split was single threaded.
Values are bit identical on the three real vocoder chains, two out of
three improve.

* tests: extend the GGML_OP_COL2IM_1D grid

The eval grid grows to eleven geometries: OC = 1 (mono output stage),
K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and
a crop down to T_out = 2 where all the gather bounds act at once.

* tests: add col2im_1d equivalence test

tests/test-col2im-1d.cpp proves mul_mat + col2im_1d matches the
native ggml_conv_transpose_1d on the CPU backend, F32 bit exact, F16
and BF16 through casts of the column matrix. test-backend-ops cannot
cover this for a CPU only op since the CPU backend is its own
reference there.

* rpc: bump protocol patch version for GGML_OP_COL2IM_1D

GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the
static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the
op is appended and no existing op code shifts.
2026-06-15 10:33:53 +03:00
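
The factorization described in the GGML_OP_COL2IM_1D commit above (a GEMM producing a column matrix, followed by an overlap-add) can be illustrated with a small pure-Python sketch. This is not the ggml kernel: the function names, the nested-list layouts (`x` as `[IC][T_in]`, `w` as `[IC][OC][K]`, output as `[OC][T_out]`) and the `k * OC + oc` row ordering of the column matrix are assumptions chosen for readability. Only the output length formula `T_out = (T_in - 1)*s0 + K - 2*p0` is taken from the commit message.

```python
def conv_transpose_1d(x, w, s0, p0):
    """Direct reference. x is [IC][T_in], w is [IC][OC][K] (assumed layouts)."""
    ic_n, t_in = len(x), len(x[0])
    oc_n, k_n = len(w[0]), len(w[0][0])
    t_out = (t_in - 1) * s0 + k_n - 2 * p0
    y = [[0.0] * t_out for _ in range(oc_n)]
    for ic in range(ic_n):
        for oc in range(oc_n):
            for t in range(t_in):
                for k in range(k_n):
                    pos = t * s0 + k - p0
                    if 0 <= pos < t_out:  # positions cropped by p0 are dropped
                        y[oc][pos] += x[ic][t] * w[ic][oc][k]
    return y


def gemm_columns(x, w):
    """GEMM step: col[k*OC + oc][t] = sum over ic of w[ic][oc][k] * x[ic][t]."""
    ic_n, t_in = len(x), len(x[0])
    oc_n, k_n = len(w[0]), len(w[0][0])
    col = [[0.0] * t_in for _ in range(k_n * oc_n)]
    for k in range(k_n):
        for oc in range(oc_n):
            for t in range(t_in):
                col[k * oc_n + oc][t] = sum(
                    w[ic][oc][k] * x[ic][t] for ic in range(ic_n)
                )
    return col


def col2im_1d(col, oc_n, k_n, s0, p0):
    """Overlap-add step: scatter each column entry to output position t*s0 + k - p0."""
    t_in = len(col[0])
    t_out = (t_in - 1) * s0 + k_n - 2 * p0
    y = [[0.0] * t_out for _ in range(oc_n)]
    for k in range(k_n):
        for oc in range(oc_n):
            for t in range(t_in):
                pos = t * s0 + k - p0
                if 0 <= pos < t_out:
                    y[oc][pos] += col[k * oc_n + oc][t]
    return y


# Small integer-valued example: IC = 2, OC = 2, K = 4, stride 2, padding 1.
x = [[1, 2, 3], [4, 5, 6]]
w = [[[1, 0, 2, 1], [0, 1, 1, 3]],
     [[2, 1, 0, 1], [1, 1, 2, 0]]]
direct = conv_transpose_1d(x, w, s0=2, p0=1)
factored = col2im_1d(gemm_columns(x, w), oc_n=2, k_n=4, s0=2, p0=1)
print(direct == factored)
```

With integer-valued inputs both paths agree exactly. The sketch uses a scatter loop for clarity, whereas the commit describes the CPU kernel as a gather formulation.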
Yash Raj Pandey 2d68a3066f ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (llama/24305)
* ggml-cpu : fix rms_norm_back wrong output under in-place aliasing

* cont : clean-up comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-06-15 10:33:53 +03:00
ravel7524 72894aa250 Remove case for GGML_TYPE_Q4_K in mvvq.cu (llama/23528) 2026-06-15 10:33:53 +03:00
Reese Levine e69e5138fe ggml-webgpu: Add clang-format job (llama/24308)
* Add clang-format job

* try local formatting
2026-06-15 10:33:53 +03:00
Masashi Yoshimura aa42b48312 ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants (llama/24225)
* ggml-webgpu: Improve prefill speeds + refactor matmul for quants

* Fixes for editorconfig checker
2026-06-15 10:33:53 +03:00
Nikhil Jain 15e5d401d1 Handle buffer overlap / buffer aliasing for concat operator (llama/24000)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* handle buffer overlap case for concat operator

* restore build-webgpu.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Run clang-format

* Update ggml/src/ggml-webgpu/wgsl-shaders/concat.wgsl

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-06-15 10:33:53 +03:00
Nikhil Jain 490e50056c Implement 2D workgroups for scale, binary, and unary ops (llama/24044)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* Implement 2d workgroups for more operations

* fix

* Fix type

* Move back to global_invocation_id
2026-06-15 10:33:53 +03:00
Jeff Bolz fbf720dc9f vulkan: Use cm2 decode_vector for mul_mat_id B matrix loads (llama/23991)
This allows vec4 loads of the B elements. Also increase BK to 64 when this is
enabled. Neither of these alone is consistently faster, but together these give
a nice speedup.

In ggml-vulkan.cpp, we need to make sure the B matrix alignment and stride are
multiples of 4.
2026-06-15 10:33:53 +03:00
Ruben Ortlam 782f1226c8 cuda: reset cuda context after reading memory size (llama/23935)
* cuda: reset device in get_memory function if no backend is active

* also count device and host buffers

* exclude hip and musa from counting and device reset

* use device mutex instead of atomic

* undo backend_free function move
2026-06-15 10:33:53 +03:00
Daniel Bevenius df7638d822
ci : pin github actions to commit sha's (#3865) 2026-06-09 12:51:00 +02:00
Christopher Albert ba573929cd
coreml : fix --quantize crash for mlprogram format; fix --optimize-ane label (#3868)
commit 8b92060 switched ct.convert() to mlprogram, but did not update
the --quantize path.  quantize_weights() from
neural_network.quantization_utils only works with the legacy
neuralnetwork format.  Running with --quantize crashed with:

  Exception: MLModel of type mlProgram cannot be loaded just from the
  model spec object. It also needs the path to the weights file.

Fix: pass compute_precision=ct.precision.FLOAT16 into ct.convert() when
--quantize is set.  This matches the original intent of nbits=16 (F16
storage) without changing the quantization scheme or model accuracy.

Also fix the three boolean CLI flags (--encoder-only, --quantize,
--optimize-ane) to use a _str_to_bool helper so that both
  --flag True
and
  --flag False
parse correctly.  The type=bool form accepted "False" as True because
bool("False") == True.

Remove the "currently broken" label from --optimize-ane: the ANE path
(WhisperANE with Conv2d attention and LayerNormANE) converts and loads
correctly with both PyTorch 2.x and coremltools 9.x.
2026-06-09 08:34:31 +02:00
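
The boolean-flag pitfall mentioned in the coreml commit above can be reproduced with a few lines of `argparse`. This is a minimal sketch, not the code from the commit: the helper name `str_to_bool`, the accepted spellings and the `build_parser` function are hypothetical, and only the `--quantize` flag name and the `bool("False") == True` behaviour come from the commit message.

```python
import argparse


def str_to_bool(value):
    """Hypothetical helper: parse common textual spellings of a boolean."""
    text = value.strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser(flag_type):
    """Build a parser whose --quantize flag is converted with flag_type."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--quantize", type=flag_type, default=False)
    return parser


# type=bool calls bool("False"), and any non-empty string is truthy.
naive = build_parser(bool).parse_args(["--quantize", "False"])
# The helper inspects the text instead of its truthiness.
fixed = build_parser(str_to_bool).parse_args(["--quantize", "False"])
print(naive.quantize, fixed.quantize)  # prints: True False
```

Raising `argparse.ArgumentTypeError` for unrecognised input lets `argparse` report a usage error rather than silently accepting a misspelled value.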