When processing long audio, whisper tends to produce progressively
shorter segments because timestamp tokens in the decoder prompt context
condition the model to insert more frequent segment breaks.
Add a seg_len_hint parameter (in ms) that thins timestamp tokens in
the rolling prompt context, keeping at most one per seg_len_hint
interval. This breaks the feedback loop while preserving text tokens
for continuity. The model can still break on natural boundaries
(speaker turns, pauses) — the hint only affects context conditioning,
not the actual segment creation.
Usage: --seg-len-hint 2000 (for ~2 second target segments)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>