History

shaihi 40a46f9ed1 benchmark: add harness artifacts and handoff record		2026-03-09 10:13:34 +02:00
..
__pycache__	benchmark: add harness artifacts and handoff record	2026-03-09 10:13:34 +02:00
audio	benchmark: add harness artifacts and handoff record	2026-03-09 10:13:34 +02:00
references	benchmark: add harness artifacts and handoff record	2026-03-09 10:13:34 +02:00
results	benchmark: add harness artifacts and handoff record	2026-03-09 10:13:34 +02:00
README.md	benchmark: add harness artifacts and handoff record	2026-03-09 10:13:34 +02:00
bench.sh	benchmark: add harness artifacts and handoff record	2026-03-09 10:13:34 +02:00
parse_results.py	benchmark: add harness artifacts and handoff record	2026-03-09 10:13:34 +02:00

README.md

Benchmark Harness

This harness is the single source of truth for baseline and optimization performance measurements.

Layout

benchmark/
  bench.sh
  audio/
    short.wav
    medium.wav
    long.wav
    lock.json
  references/
    short.txt
    medium.txt
    long.txt
  parse_results.py
  results/

Requirements

build/bin/whisper-cli exists (build first with CMake).
Audio files exist at:
- benchmark/audio/short.wav (~30s)
- benchmark/audio/medium.wav (~5m)
- benchmark/audio/long.wav (~30m)
Audio files must be 16 kHz, mono, 16-bit WAV.

Fixed Benchmark Policy

Warm-up runs: 1
Measured runs: 5
Sequential execution only
Fixed model: models/ggml-small.en.bin
Fixed decode flags:
- -l en -tp 0 -tpi 0 -nf -bs 1 -bo 1 -fa
Fixed threading config:
- -t 8 -p 1

Usage

Create the lock file (checksums + durations):

./benchmark/bench.sh --create-lock

Run benchmark (after lock exists):

./benchmark/bench.sh --variant metal-baseline

Outputs

Each run writes to benchmark/results/<timestamp>_<variant>/:

config.json: exact benchmark config + environment metadata
validated_inputs.json: lock validation snapshot
raw/...: per-run logs and metadata (warmup_*.log, run_*.log, *.meta.json)
runs.csv: per-run metrics
summary.csv: aggregated metrics
summary.json: detailed aggregates
summary.md: required table format
correctness.json: WER/CER gate report against references

Correctness Gate

References are read from benchmark/references/{short,medium,long}.txt.
The parser extracts transcript text from each measured run log and computes:
- WER (word error rate)
- CER (character error rate)
Default enforcement thresholds from bench.sh:
- MAX_WER=0.02
- MAX_CER=0.02
If enforcement is enabled and references are missing or thresholds are exceeded, the run exits non-zero.