whisper.cpp/examples
texasich 27101c01dc
cli : merge tokens split across UTF-8 boundaries in JSON output (#3751)
* cli : merge tokens split across UTF-8 boundaries in JSON output

When a multi-byte UTF-8 codepoint (most commonly a CJK character, 3 bytes)
is split across multiple whisper tokens, the -ojf/--output-json-full
writer emitted each token's partial bytes as its own JSON string, producing
invalid UTF-8 that chokes downstream parsers.

Merge adjacent tokens in output_json whenever the accumulated text still
ends on an incomplete UTF-8 sequence. The merged entry keeps the first
token's id/p/t_dtw and extends t1 to the last absorbed token, which
matches how segment text is assembled elsewhere.
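
The merge rule described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the PR: the `token_entry` struct and the helpers `ends_with_incomplete_utf8` and `merge_split_tokens` are hypothetical names, and the fields simply mirror the ones named in the commit message (id, p, t_dtw, t1), with `t0` assumed as the start time.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical token record (assumption): fields mirror those named in the
// commit message; this is not the actual whisper.cpp token type.
struct token_entry {
    std::string text;
    int         id;
    float       p;
    int64_t     t0;
    int64_t     t1;
    int64_t     t_dtw;
};

// Returns true if s ends in the middle of a multi-byte UTF-8 sequence,
// i.e. the last lead byte announces more bytes than are present.
static bool ends_with_incomplete_utf8(const std::string & s) {
    const size_t n = s.size();
    // A UTF-8 sequence is at most 4 bytes, so look back at most 4 bytes.
    for (size_t back = 1; back <= 4 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0xC0) == 0x80) {
            continue; // continuation byte: keep searching for the lead byte
        }
        size_t need = 1;                        // ASCII or invalid lead byte
        if      ((c & 0xE0) == 0xC0) need = 2;  // 110xxxxx
        else if ((c & 0xF0) == 0xE0) need = 3;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) need = 4;  // 11110xxx
        return back < need;
    }
    // Empty input or only continuation bytes: nothing to wait for.
    return false;
}

// Merges adjacent tokens while the accumulated text is still incomplete.
// The merged entry keeps the first token's id/p/t_dtw and takes t1 from
// the last absorbed token.
static std::vector<token_entry> merge_split_tokens(const std::vector<token_entry> & tokens) {
    std::vector<token_entry> out;
    bool pending = false;
    for (const auto & tok : tokens) {
        if (pending) {
            out.back().text += tok.text;
            out.back().t1    = tok.t1;
        } else {
            out.push_back(tok);
        }
        pending = ends_with_incomplete_utf8(out.back().text);
    }
    return out;
}
```

In this sketch, a 3-byte character split as `"\xE6\x97"` followed by `"\xA5"` comes out as a single entry whose text is the complete character and whose t1 is taken from the second token. If the input ends while a sequence is still incomplete, the last entry is emitted as-is.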

Refs #1798

* fix: address review — add braces for consistency, use full issue URL

- Add braces to if/else chain for codebase consistency
- Use full URL for issue #1798 reference

Review: @danbev

---------

Co-authored-by: texasich <texasich@users.noreply.github.com>
Co-authored-by: texasich <texasich@gmail.com>
2026-05-26 06:23:41 +02:00
addon.node vad : Silero VAD v6.2.0 (#3524) 2025-11-17 22:26:17 +09:00
bench bench : sync submit-results URL to ggml-org (#3769) 2026-04-20 07:12:57 +02:00
bench.wasm bench : sync submit-results URL to ggml-org (#3769) 2026-04-20 07:12:57 +02:00
cli cli : merge tokens split across UTF-8 boundaries in JSON output (#3751) 2026-05-26 06:23:41 +02:00
command whisper : enable flash attention by default (#3441) 2025-09-30 15:47:20 +03:00
command.wasm examples : add wchess.wasm to wasm examples build (#3443) 2025-09-30 16:23:01 +02:00
deprecation-warning examples : add WHISPER_SDL2 check to deprecation executables (#2911) 2025-03-20 18:36:02 +01:00
lsp examples : fix executable example targets (#3600) 2026-01-13 08:08:18 +01:00
python readme : remove invalid flag from Python example (#2396) 2024-08-30 14:00:38 +03:00
quantize examples : fix executable example targets (#3600) 2026-01-13 08:08:18 +01:00
server common : fix server /inference fails to decode in-memory audio (regression) (#3818) 2026-05-22 08:27:35 +02:00
stream whisper : enable flash attention by default (#3441) 2025-09-30 15:47:20 +03:00
stream.wasm examples : add wchess.wasm to wasm examples build (#3443) 2025-09-30 16:23:01 +02:00
sycl sycl: fix example build (#2570) 2024-11-18 14:57:23 +02:00
talk-llama talk-llama : sync llama.cpp 2026-05-25 12:26:07 +03:00
vad-speech-segments examples : fix executable example targets (#3600) 2026-01-13 08:08:18 +01:00
wchess wchess : fix link [no ci] 2025-09-30 21:28:03 +03:00
whisper.android whisper : add version function (#3289) 2025-06-26 18:09:42 +02:00
whisper.android.java whisper : add version function (#3289) 2025-06-26 18:09:42 +02:00
whisper.nvim rename : ggerganov -> ggml-org (#3005) 2025-04-04 16:11:52 +03:00
whisper.objc docs : update README.md for whisper.objc app (#2569) 2025-05-13 06:03:50 +02:00
whisper.swiftui examples : clarify Core ML encoder model usage [no ci] (#2987) 2025-04-02 08:32:14 +02:00
whisper.wasm wasm : fix Hebrew ID (#3487) 2025-10-27 08:49:32 +02:00
CMakeLists.txt examples : add wchess.wasm to wasm examples build (#3443) 2025-09-30 16:23:01 +02:00
coi-serviceworker.js ci : add github pages workflow for wasm examples (#2969) 2025-03-31 11:34:40 +02:00
common-ggml.cpp examples : update to Q1_0 2026-05-01 13:07:33 +03:00
common-ggml.h
common-sdl.cpp common : more general m_audio_len update logic (#2855) 2025-03-07 10:10:03 +02:00
common-sdl.h sdl : fix audio callback (#1523) 2023-11-20 13:16:38 +02:00
common-whisper.cpp common : fix server /inference fails to decode in-memory audio (regression) (#3818) 2026-05-22 08:27:35 +02:00
common-whisper.h common : fix server /inference fails to decode in-memory audio (regression) (#3818) 2026-05-22 08:27:35 +02:00
common.cpp whisper: remove MSVC warnings pragmas (#3090) 2025-05-05 13:09:35 +02:00
common.h examples : add --print-confidence option to cli (#3150) 2025-05-14 19:21:48 +02:00
ffmpeg-transcode.cpp examples : fix deprecated FFmpeg functions (#3073) 2025-04-28 06:16:50 +02:00
generate-karaoke.sh examples : use miniaudio for direct decoding flac, mp3, ogg and wav (#2759) 2025-02-27 09:06:54 +02:00
grammar-parser.cpp whisper : reorganize source code + improve CMake (#2256) 2024-06-26 19:34:09 +03:00
grammar-parser.h whisper : add grammar-based sampling (#1229) 2023-11-13 10:51:34 +02:00
helpers.js js : remove un-needed request header from fetchRemote (#2119) 2024-05-13 15:13:19 +03:00
json.hpp examples : clean up common code (#1871) 2024-02-19 10:50:15 +02:00
livestream.sh rename : ggerganov -> ggml-org (#3005) 2025-04-04 16:11:52 +03:00
miniaudio.h examples : update miniaudio library to 0.11.24 (#3672) 2026-02-27 11:15:15 +01:00
server.py examples : add wchess.wasm to wasm examples build (#3443) 2025-09-30 16:23:01 +02:00
stb_vorbis.c examples : use miniaudio for direct decoding flac, mp3, ogg and wav (#2759) 2025-02-27 09:06:54 +02:00
twitch.sh rename : ggerganov -> ggml-org (#3005) 2025-04-04 16:11:52 +03:00
yt-wsp.sh examples : update usage/help in yt-wsp.sh (#3251) 2025-06-16 12:21:16 +02:00