Benchmark Harness
This harness is the single source of truth for baseline and optimization performance measurements.
Layout
benchmark/
  bench.sh
  audio/
    short.wav
    medium.wav
    long.wav
    lock.json
  references/
    short.txt
    medium.txt
    long.txt
  parse_results.py
  results/
Requirements
- build/bin/whisper-cli exists (build first with CMake).
- Audio files exist at:
  - benchmark/audio/short.wav (~30s)
  - benchmark/audio/medium.wav (~5m)
  - benchmark/audio/long.wav (~30m)
- Audio files must be 16 kHz, mono, 16-bit WAV.
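The audio format requirement can be checked before a run. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_format` is hypothetical and is not part of `bench.sh` or `parse_results.py`.

```python
import wave


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file qualifies.

    Hypothetical helper: checks for 16 kHz, mono, 16-bit PCM WAV.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
        if wav.getsampwidth() != 2:
            # getsampwidth() reports bytes per sample, so 16-bit audio is 2
            problems.append(f"{wav.getsampwidth() * 8}-bit samples, expected 16-bit")
    return problems
```

For example, `check_wav_format("benchmark/audio/short.wav")` should return an empty list for a conforming file.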
Fixed Benchmark Policy
- Warm-up runs: 1
- Measured runs: 5
- Sequential execution only
- Fixed model: models/ggml-small.en.bin
- Fixed decode flags: -l en -tp 0 -tpi 0 -nf -bs 1 -bo 1 -fa
- Fixed threading config: -t 8 -p 1
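To make the policy concrete, here is a rough Python sketch of how one warm-up run and five measured runs could be driven sequentially with the fixed flags. It is not the contents of `bench.sh`. The `-m` (model) and `-f` (input file) options are assumed from the whisper-cli command line, and the names `build_command` and `run_policy` are hypothetical.

```python
import subprocess
import time

WHISPER_CLI = "build/bin/whisper-cli"
MODEL = "models/ggml-small.en.bin"
DECODE_FLAGS = ["-l", "en", "-tp", "0", "-tpi", "0", "-nf", "-bs", "1", "-bo", "1", "-fa"]
THREAD_FLAGS = ["-t", "8", "-p", "1"]
WARMUP_RUNS = 1
MEASURED_RUNS = 5


def build_command(audio_path):
    """Assemble the fixed command line (-m and -f are assumptions, see lead-in)."""
    return [WHISPER_CLI, "-m", MODEL, "-f", audio_path, *DECODE_FLAGS, *THREAD_FLAGS]


def run_policy(audio_path, runner=subprocess.run):
    """Run warm-ups (untimed), then measured runs, strictly one after another."""
    cmd = build_command(audio_path)
    for _ in range(WARMUP_RUNS):
        runner(cmd, check=True, capture_output=True)
    timings = []
    for _ in range(MEASURED_RUNS):
        start = time.perf_counter()
        runner(cmd, check=True, capture_output=True)
        timings.append(time.perf_counter() - start)
    return timings  # wall-clock seconds, one entry per measured run
```

The `runner` parameter exists only so the loop can be exercised without the binary present; a real run would leave it at its default.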
Usage
- Create the lock file (checksums + durations):
./benchmark/bench.sh --create-lock
- Run benchmark (after lock exists):
./benchmark/bench.sh --variant metal-baseline
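The idea behind the lock file (checksums plus durations) can be illustrated with a short Python sketch. The actual `lock.json` schema and hash algorithm are not documented here, so SHA-256, the field names `sha256` and `duration_sec`, and both function names below are assumptions made for illustration.

```python
import hashlib
import wave


def lock_entry(path):
    """Compute a checksum and duration for one WAV file (hypothetical schema)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so long recordings are not loaded into memory at once
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return {"sha256": digest.hexdigest(), "duration_sec": round(duration, 3)}


def validate_against_lock(path, locked, tolerance_sec=0.01):
    """Return True if the file still matches its locked checksum and duration."""
    current = lock_entry(path)
    return (
        current["sha256"] == locked["sha256"]
        and abs(current["duration_sec"] - locked["duration_sec"]) <= tolerance_sec
    )
```

Pinning inputs this way means a benchmark result can only be compared with earlier ones when the audio is byte-for-byte identical.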
Outputs
Each run writes to benchmark/results/<timestamp>_<variant>/:
- config.json: exact benchmark config + environment metadata
- validated_inputs.json: lock validation snapshot
- raw/...: per-run logs and metadata (warmup_*.log, run_*.log, *.meta.json)
- runs.csv: per-run metrics
- summary.csv: aggregated metrics
- summary.json: detailed aggregates
- summary.md: required table format
- correctness.json: WER/CER gate report against references
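As an illustration of how per-run metrics might be rolled up into aggregates, the sketch below groups rows by audio file and computes basic statistics. The real columns of `runs.csv` are not listed in this README, so the column names `audio` and `wall_time_sec` and the function `summarize` are hypothetical.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize(runs_csv_text):
    """Aggregate per-run timings by audio file (column names are assumptions)."""
    by_audio = defaultdict(list)
    for row in csv.DictReader(io.StringIO(runs_csv_text)):
        by_audio[row["audio"]].append(float(row["wall_time_sec"]))

    summary = {}
    for audio, times in by_audio.items():
        summary[audio] = {
            "runs": len(times),
            "mean": statistics.mean(times),
            "median": statistics.median(times),
            # Sample standard deviation needs at least two data points
            "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
            "min": min(times),
            "max": max(times),
        }
    return summary
```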
Correctness Gate
- References are read from benchmark/references/{short,medium,long}.txt.
- The parser extracts transcript text from each measured run log and computes:
  - WER (word error rate)
  - CER (character error rate)
- Default enforcement thresholds from bench.sh:
  - MAX_WER=0.02
  - MAX_CER=0.02
- If enforcement is enabled and references are missing or thresholds are exceeded, the run exits non-zero.
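For readers unfamiliar with the metrics, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below shows one common way to compute them and apply the default thresholds. It is not the code in `parse_results.py`; the function names are hypothetical, and any text normalization the real parser applies (case, punctuation) is not reproduced here.

```python
MAX_WER = 0.02
MAX_CER = 0.02


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance divided by reference length."""
    if not ref_units:
        return 0.0 if not hyp_units else 1.0
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    return error_rate(list(reference), list(hypothesis))


def passes_gate(reference, hypothesis):
    """True when both rates are within the default thresholds."""
    return wer(reference, hypothesis) <= MAX_WER and cer(reference, hypothesis) <= MAX_CER
```

With thresholds of 0.02, a 100-word reference tolerates at most two word-level errors before the gate fails.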