whisper.cpp/benchmark
shaihi 40a46f9ed1 benchmark: add harness artifacts and handoff record 2026-03-09 10:13:34 +02:00

README.md

Benchmark Harness

This harness is the single source of truth for baseline and optimization performance measurements.

Layout

benchmark/
  bench.sh
  audio/
    short.wav
    medium.wav
    long.wav
    lock.json
  references/
    short.txt
    medium.txt
    long.txt
  parse_results.py
  results/

Requirements

  • build/bin/whisper-cli exists (build first with CMake).
  • Audio files exist at:
    • benchmark/audio/short.wav (~30s)
    • benchmark/audio/medium.wav (~5m)
    • benchmark/audio/long.wav (~30m)
  • Audio files must be 16 kHz, mono, 16-bit WAV.
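The audio format requirement can be checked before a run. The following is a minimal sketch using Python's standard `wave` module; the helper name `check_wav` is hypothetical and is not part of `bench.sh` or `parse_results.py`.

```python
import wave


def check_wav(path):
    """Return a list of format problems; an empty list means the file is
    16 kHz, mono, 16-bit PCM as the requirements above state."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()  # bytes per sample; 2 bytes == 16-bit
    if rate != 16000:
        problems.append(f"sample rate is {rate} Hz, expected 16000 Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected 1 (mono)")
    if width != 2:
        problems.append(f"{width * 8}-bit samples, expected 16-bit")
    return problems
```

Calling `check_wav("benchmark/audio/short.wav")` and the equivalent for `medium.wav` and `long.wav` should return an empty list for each file if the inputs are usable.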

Fixed Benchmark Policy

  • Warm-up runs: 1
  • Measured runs: 5
  • Sequential execution only
  • Fixed model: models/ggml-small.en.bin
  • Fixed decode flags:
    • -l en -tp 0 -tpi 0 -nf -bs 1 -bo 1 -fa
  • Fixed threading config:
    • -t 8 -p 1
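To make the policy concrete, here is a sketch of how the fixed flags might be assembled into a `whisper-cli` invocation and scheduled. It assumes the model and input are passed with `-m` and `-f`, and that each file gets its warm-up run immediately before its measured runs; the function names and the run ordering are illustrative assumptions, not taken from `bench.sh`.

```python
import subprocess
import time

MODEL = "models/ggml-small.en.bin"
DECODE_FLAGS = ["-l", "en", "-tp", "0", "-tpi", "0", "-nf", "-bs", "1", "-bo", "1", "-fa"]
THREAD_FLAGS = ["-t", "8", "-p", "1"]
WARMUP_RUNS = 1
MEASURED_RUNS = 5


def build_command(audio_path, cli="build/bin/whisper-cli"):
    """Assemble argv from the fixed policy (assumes -m for model, -f for input)."""
    return [cli, "-m", MODEL, "-f", audio_path, *DECODE_FLAGS, *THREAD_FLAGS]


def run_plan(audio_paths):
    """List (audio, label, argv) tuples in strictly sequential order.
    The warmup_N / run_N labels are an assumed naming scheme."""
    plan = []
    for audio in audio_paths:
        for i in range(1, WARMUP_RUNS + 1):
            plan.append((audio, f"warmup_{i}", build_command(audio)))
        for i in range(1, MEASURED_RUNS + 1):
            plan.append((audio, f"run_{i}", build_command(audio)))
    return plan


def execute(plan):
    """Run each command to completion before starting the next one."""
    timings = []
    for audio, label, argv in plan:
        start = time.perf_counter()
        subprocess.run(argv, check=True, capture_output=True, text=True)
        timings.append((audio, label, time.perf_counter() - start))
    return timings
```

With the three benchmark inputs this plan contains 18 invocations: one warm-up and five measured runs per file.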

Usage

  1. Create the lock file (checksums + durations):
./benchmark/bench.sh --create-lock
  2. Run the benchmark (requires an existing lock file):
./benchmark/bench.sh --variant metal-baseline
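The lock step records checksums and durations so that later runs can detect changed inputs. The sketch below shows one way such a record could be built and re-checked. SHA-256 as the hash, the field names `sha256` and `duration_seconds`, and the function names are all assumptions; the actual `lock.json` schema is not documented here.

```python
import hashlib
import wave


def describe_audio(path):
    """Compute a checksum and duration for one WAV file."""
    sha = hashlib.sha256()  # assumed hash; the real lock may use another
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return {"sha256": sha.hexdigest(), "duration_seconds": round(duration, 3)}


def build_lock(paths):
    """Map each path to its checksum and duration (hypothetical schema)."""
    return {path: describe_audio(path) for path in paths}


def validate_lock(lock):
    """Return the paths whose current checksum no longer matches the lock."""
    return [
        path
        for path, entry in lock.items()
        if describe_audio(path)["sha256"] != entry["sha256"]
    ]
```

The dictionary returned by `build_lock` can be written with `json.dump`; an empty list from `validate_lock` means the inputs are unchanged since the lock was created.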

Outputs

Each run writes to benchmark/results/<timestamp>_<variant>/:

  • config.json: exact benchmark config + environment metadata
  • validated_inputs.json: lock validation snapshot
  • raw/...: per-run logs and metadata (warmup_*.log, run_*.log, *.meta.json)
  • runs.csv: per-run metrics
  • summary.csv: aggregated metrics
  • summary.json: detailed aggregates
  • summary.md: aggregated metrics in the required table format
  • correctness.json: WER/CER gate report against references
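As an illustration of how per-run metrics could be reduced to aggregates, here is a small sketch using the standard `statistics` module. The column names `file` and `wall_seconds` and the chosen statistics are hypothetical; the real columns of `runs.csv` and `summary.csv` are defined by `parse_results.py`.

```python
import statistics


def summarize(values):
    """Aggregate one metric across the measured runs of a single input."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def summarize_runs(rows, key="file", metric="wall_seconds"):
    """Group rows (e.g. from csv.DictReader) by input and aggregate a metric.
    The key and metric column names are assumptions."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(float(row[metric]))
    return {name: summarize(values) for name, values in grouped.items()}
```

Reporting the median alongside the mean keeps a single slow run from dominating a five-run sample.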

Correctness Gate

  • References are read from benchmark/references/{short,medium,long}.txt.
  • The parser extracts transcript text from each measured run log and computes:
    • WER (word error rate)
    • CER (character error rate)
  • Default enforcement thresholds from bench.sh:
    • MAX_WER=0.02
    • MAX_CER=0.02
  • When enforcement is enabled, the run exits non-zero if any reference is missing or any threshold is exceeded.
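For reference, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below uses plain Levenshtein distance with whitespace tokenization and no text normalization; `parse_results.py` may normalize case or punctuation differently, and the function names here are hypothetical.

```python
MAX_WER = 0.02  # default thresholds stated above
MAX_CER = 0.02


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance divided by reference length."""
    if not ref_units:
        return 0.0 if not hyp_units else 1.0
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    return error_rate(list(reference), list(hypothesis))


def passes_gate(reference, hypothesis):
    """True when both rates are within the default thresholds."""
    return wer(reference, hypothesis) <= MAX_WER and cer(reference, hypothesis) <= MAX_CER
```

At a threshold of 0.02, a 100-word reference tolerates at most two word-level errors, so one substituted word in a three-word transcript (WER of 1/3) fails the gate.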