whisper.cpp/examples
texasich 27101c01dc
cli : merge tokens split across UTF-8 boundaries in JSON output (#3751)
* cli : merge tokens split across UTF-8 boundaries in JSON output

When a multi-byte UTF-8 codepoint (most commonly a CJK character, 3 bytes)
is split across multiple whisper tokens, the -ojf/--output-json-full
writer emitted each token's partial bytes as its own JSON string, producing
invalid UTF-8 that chokes downstream parsers.

Merge adjacent tokens in output_json whenever the accumulated text still
ends on an incomplete UTF-8 sequence. The merged entry keeps the first
token's id/p/t_dtw and extends t1 to the last absorbed token, which
matches how segment text is assembled elsewhere.
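
The merge rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `TokenEntry`, `ends_with_incomplete_utf8` and `merge_split_tokens` are hypothetical names, and the struct fields only mirror the JSON keys named in this message (id, p, t_dtw, t1) plus an assumed `t0`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-token record (assumption: not the struct used in cli.cpp).
struct TokenEntry {
    std::string text;   // raw token bytes, possibly a partial UTF-8 sequence
    int         id;
    float       p;
    int64_t     t0;
    int64_t     t1;
    int64_t     t_dtw;
};

// Returns true when `s` ends in the middle of a multi-byte UTF-8 sequence,
// i.e. the last lead byte announces more continuation bytes than are present.
static bool ends_with_incomplete_utf8(const std::string & s) {
    size_t i    = s.size();
    size_t cont = 0;
    // Step back over at most 3 trailing continuation bytes (10xxxxxx).
    while (i > 0 && cont < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++cont;
    }
    if (i == 0) {
        return false; // empty, or no lead byte to complete
    }
    const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    size_t need = 0;
    if      ((lead & 0x80) == 0x00) { need = 0; } // ASCII
    else if ((lead & 0xE0) == 0xC0) { need = 1; } // 2-byte sequence
    else if ((lead & 0xF0) == 0xE0) { need = 2; } // 3-byte sequence (most CJK)
    else if ((lead & 0xF8) == 0xF0) { need = 3; } // 4-byte sequence
    else                            { return false; } // malformed: do not merge
    return cont < need;
}

// Absorbs following tokens while the accumulated text is still incomplete.
// The merged entry keeps the first token's id/p/t_dtw (and t0) and takes t1
// from the last absorbed token.
static std::vector<TokenEntry> merge_split_tokens(const std::vector<TokenEntry> & tokens) {
    std::vector<TokenEntry> out;
    size_t i = 0;
    while (i < tokens.size()) {
        TokenEntry merged = tokens[i++];
        while (i < tokens.size() && ends_with_incomplete_utf8(merged.text)) {
            merged.text += tokens[i].text;
            merged.t1    = tokens[i].t1;
            ++i;
        }
        out.push_back(merged);
    }
    return out;
}
```

For example, if U+4F60 (bytes E4 BD A0) arrives as one token holding `E4 BD` and a second holding `A0`, the sketch emits a single entry with all three bytes, the first token's id, and the second token's t1. If the input ends mid-sequence, the last entry is emitted as-is; how the real writer treats that case is not stated here.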

Refs #1798

* fix: address review — add braces for consistency, use full issue URL

- Add braces to if/else chain for codebase consistency
- Use full URL for issue #1798 reference

Review: @danbev

---------

Co-authored-by: texasich <texasich@users.noreply.github.com>
Co-authored-by: texasich <texasich@gmail.com>
2026-05-26 06:23:41 +02:00
addon.node vad : Silero VAD v6.2.0 (#3524) 2025-11-17 22:26:17 +09:00
bench bench : sync submit-results URL to ggml-org (#3769) 2026-04-20 07:12:57 +02:00
bench.wasm bench : sync submit-results URL to ggml-org (#3769) 2026-04-20 07:12:57 +02:00
cli cli : merge tokens split across UTF-8 boundaries in JSON output (#3751) 2026-05-26 06:23:41 +02:00
command whisper : enable flash attention by default (#3441) 2025-09-30 15:47:20 +03:00
command.wasm examples : add wchess.wasm to wasm examples build (#3443) 2025-09-30 16:23:01 +02:00
deprecation-warning
lsp examples : fix executable example targets (#3600) 2026-01-13 08:08:18 +01:00
python
quantize examples : fix executable example targets (#3600) 2026-01-13 08:08:18 +01:00
server common : fix server /inference fails to decode in-memory audio (regression) (#3818) 2026-05-22 08:27:35 +02:00
stream whisper : enable flash attention by default (#3441) 2025-09-30 15:47:20 +03:00
stream.wasm examples : add wchess.wasm to wasm examples build (#3443) 2025-09-30 16:23:01 +02:00
sycl
talk-llama talk-llama : sync llama.cpp 2026-05-25 12:26:07 +03:00
vad-speech-segments examples : fix executable example targets (#3600) 2026-01-13 08:08:18 +01:00
wchess wchess : fix link [no ci] 2025-09-30 21:28:03 +03:00
whisper.android whisper : add version function (#3289) 2025-06-26 18:09:42 +02:00
whisper.android.java whisper : add version function (#3289) 2025-06-26 18:09:42 +02:00
whisper.nvim rename : ggerganov -> ggml-org (#3005) 2025-04-04 16:11:52 +03:00
whisper.objc
whisper.swiftui
whisper.wasm wasm : fix Hebrew ID (#3487) 2025-10-27 08:49:32 +02:00
CMakeLists.txt examples : add wchess.wasm to wasm examples build (#3443) 2025-09-30 16:23:01 +02:00
coi-serviceworker.js
common-ggml.cpp examples : update to Q1_0 2026-05-01 13:07:33 +03:00
common-ggml.h
common-sdl.cpp
common-sdl.h
common-whisper.cpp common : fix server /inference fails to decode in-memory audio (regression) (#3818) 2026-05-22 08:27:35 +02:00
common-whisper.h common : fix server /inference fails to decode in-memory audio (regression) (#3818) 2026-05-22 08:27:35 +02:00
common.cpp
common.h
ffmpeg-transcode.cpp
generate-karaoke.sh
grammar-parser.cpp
grammar-parser.h
helpers.js
json.hpp
livestream.sh
miniaudio.h examples : update miniaudio library to 0.11.24 (#3672) 2026-02-27 11:15:15 +01:00
server.py examples : add wchess.wasm to wasm examples build (#3443) 2025-09-30 16:23:01 +02:00
stb_vorbis.c
twitch.sh
yt-wsp.sh