# Benchmark Harness

This harness is the single source of truth for baseline and optimization performance measurements.

## Layout

```text
benchmark/
  bench.sh
  audio/
    short.wav
    medium.wav
    long.wav
    lock.json
  references/
    short.txt
    medium.txt
    long.txt
  parse_results.py
  results/
```

## Requirements

- `build/bin/whisper-cli` exists (build first with CMake).
- Audio files exist at:
  - `benchmark/audio/short.wav` (~30s)
  - `benchmark/audio/medium.wav` (~5m)
  - `benchmark/audio/long.wav` (~30m)
- Audio files must be 16 kHz, mono, 16-bit WAV.

## Fixed Benchmark Policy

- Warm-up runs: 1
- Measured runs: 5
- Sequential execution only
- Fixed model: `models/ggml-small.en.bin`
- Fixed decode flags:
  - `-l en -tp 0 -tpi 0 -nf -bs 1 -bo 1 -fa`
- Fixed threading config:
  - `-t 8 -p 1`

## Usage

1. Create the lock file (checksums + durations):

   ```bash
   ./benchmark/bench.sh --create-lock
   ```

2. Run benchmark (after lock exists):

   ```bash
   ./benchmark/bench.sh --variant metal-baseline
   ```

## Outputs

Each run writes to `benchmark/results/_/`:

- `config.json`: exact benchmark config + environment metadata
- `validated_inputs.json`: lock validation snapshot
- `raw/...`: per-run logs and metadata (`warmup_*.log`, `run_*.log`, `*.meta.json`)
- `runs.csv`: per-run metrics
- `summary.csv`: aggregated metrics
- `summary.json`: detailed aggregates
- `summary.md`: required table format
- `correctness.json`: WER/CER gate report against references

## Correctness Gate

- References are read from `benchmark/references/{short,medium,long}.txt`.
- The parser extracts transcript text from each measured run log and computes:
  - WER (word error rate)
  - CER (character error rate)
- Default enforcement thresholds from `bench.sh`:
  - `MAX_WER=0.02`
  - `MAX_CER=0.02`
- If enforcement is enabled and references are missing or thresholds are exceeded, the run exits non-zero.
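
For readers unfamiliar with the metrics, the gate above can be sketched as follows. This is a minimal illustration, not the code in `parse_results.py`: the function names (`edit_distance`, `wer`, `cer`, `passes_gate`) are hypothetical, the sketch does no text normalization (case, punctuation), and it assumes the thresholds are inclusive upper bounds. Both rates are taken as the Levenshtein edit distance divided by the reference length, over words for WER and over characters for CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion from the reference
                curr[j - 1] + 1,         # insertion into the reference
                prev[j - 1] + (r != h),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by reference length."""
    if not ref_units:
        # Assumption: an empty reference is only matched by an empty hypothesis.
        return 0.0 if not hyp_units else 1.0
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate over raw characters."""
    return error_rate(list(reference), list(hypothesis))


def passes_gate(reference, hypothesis, max_wer=0.02, max_cer=0.02):
    """True when both rates are within the thresholds (assumed inclusive)."""
    return (wer(reference, hypothesis) <= max_wer
            and cer(reference, hypothesis) <= max_cer)
```

With the default `MAX_WER=0.02`, a transcript may differ from its reference by at most one word in fifty, so a single wrong word in a short sentence is enough to fail the gate in this sketch.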