Feature: combine whisper.cpp (speech recognition) with pyannote-audio (speaker diarization)

zoe-simpson 2026-04-26 18:31:38 +09:00
parent fc674574ca
commit 147a83122f
4 changed files with 604 additions and 0 deletions

CLAUDE.md Normal file

@@ -0,0 +1,143 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Build Commands
```bash
# Standard build (Release by default on non-MSVC)
cmake -B build
cmake --build build -j --config Release
# GPU backends
cmake -B build -DGGML_CUDA=1 # NVIDIA CUDA
cmake -B build -DGGML_VULKAN=1 # Vulkan (cross-vendor)
cmake -B build -DGGML_METAL=1 # Apple Metal
cmake -B build -DGGML_BLAS=1 # CPU via OpenBLAS
# Optional features
cmake -B build -DWHISPER_SDL2=ON # Enable SDL2 for real-time audio (stream example)
cmake -B build -DWHISPER_CURL=ON # Enable libcurl for model download
cmake -B build -DWHISPER_COREML=ON # Apple Core ML encoder (Apple Silicon only)
# Sanitizers
cmake -B build -DWHISPER_SANITIZE_ADDRESS=ON
cmake -B build -DWHISPER_SANITIZE_THREAD=ON
cmake -B build -DWHISPER_SANITIZE_UNDEFINED=ON
```
## Running Tests
Tests require model files (downloaded separately) and use CTest:
```bash
# Build with tests enabled (on by default when building standalone)
cmake -B build -DWHISPER_BUILD_TESTS=ON
cmake --build build -j --config Release
# Run all tests
cd build && ctest
# Run a specific test by label
cd build && ctest -L tiny
# Run single integration test manually (requires model at models/for-tests-ggml-tiny.en.bin)
./build/bin/whisper-cli -m models/for-tests-ggml-tiny.en.bin -f samples/jfk.wav
# Run the unit VAD test binary
./build/bin/test-vad
```
## Downloading Models
```bash
# Download a pre-converted ggml model
bash ./models/download-ggml-model.sh base.en # or: tiny, small, medium, large-v3, etc.
# Download VAD model (for --vad flag)
bash ./models/download-vad-model.sh silero-v6.2.0
# Convenience Makefile targets (downloads model + builds + runs on samples/)
make base.en
make tiny
```
## Transcribing Audio
Audio must be 16-bit WAV at 16 kHz. Convert with ffmpeg:
```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
./build/bin/whisper-cli -m models/ggml-base.en.bin -f output.wav
```
## Quantization
```bash
./build/bin/quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0
./build/bin/whisper-cli -m models/ggml-base.en-q5_0.bin -f samples/jfk.wav
```
---
## Architecture Overview
### Two-layer design
```
include/whisper.h ← Public C API (cross-language compatible)
src/whisper.cpp ← Whisper model: audio preprocessing, encoder, decoder, beam search
src/whisper-arch.h ← Tensor name map (encoder/decoder/cross-attention weight paths in ggml format)
ggml/ ← Tensor math library (git subtree from ggml-org/ggml)
```
`whisper_context` holds the loaded model weights (shared, read-only across threads). `whisper_state` holds per-inference mutable state (KV cache, mel buffers). You can create multiple states from one context for parallel inference.
### ggml subdirectory
`ggml/` is a git subtree (synced via `sync` commits). Do not edit it directly unless you are making changes intended to be upstreamed. Hardware backends live in `ggml/src/`:
- `ggml-cpu/` — generic CPU with NEON/AVX/VSX intrinsics
- `ggml-cuda/` — CUDA kernels
- `ggml-metal/` — Metal shaders (Apple)
- `ggml-vulkan/` — Vulkan compute shaders
- `ggml-sycl/` — SYCL (Intel)
### Whisper pipeline (inside `src/whisper.cpp`)
1. **Audio preprocessing** — raw PCM → log-Mel spectrogram (80 mel bins, 30-second chunks at 16 kHz); see the Python sketch after this list
2. **Encoder** — convolutional feature extraction + transformer encoder; optional Core ML / OpenVINO offload
3. **Decoder** — autoregressive transformer decoder with optional beam search, temperature fallback, and cross-attention timestamps
4. **VAD** — optional pre-pass using Silero-VAD to skip silence before encoding
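For intuition, here is a rough Python approximation of the preprocessing step. This is an illustrative sketch, not the C++ code path: the constants `n_fft=400` and `hop_length=160` come from OpenAI's reference Whisper implementation and are assumed (not verified here) to match whisper.cpp's internal constants.
```python
# Illustrative only: approximates whisper.cpp's log-mel front end with librosa.
# n_fft=400 (25 ms) and hop_length=160 (10 ms) follow OpenAI's reference
# implementation; whisper.cpp computes the same transform in C++.
import librosa
import numpy as np

pcm, sr = librosa.load("samples/jfk.wav", sr=16000, mono=True)
mel = librosa.feature.melspectrogram(
    y=pcm, sr=sr, n_fft=400, hop_length=160, n_mels=80, power=2.0
)
log_mel = np.log10(np.maximum(mel, 1e-10))          # log compression
log_mel = np.maximum(log_mel, log_mel.max() - 8.0)  # clamp dynamic range to 80 dB
log_mel = (log_mel + 4.0) / 4.0                     # normalize, as in Whisper
print(log_mel.shape)  # (80, n_frames) — the encoder consumes 30 s windows
```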
### Examples (`examples/`)
Shared utilities used by all examples live at the top level of `examples/`:
- `common.h / common.cpp` — CLI arg parsing, vocab helpers
- `common-whisper.h / common-whisper.cpp` — WAV reading, timestamp formatting
- `common-sdl.h / common-sdl.cpp` — SDL2 audio capture (stream example only)
- `grammar-parser.h / grammar-parser.cpp` — GBNF grammar parsing for constrained decoding
Key example binaries:
| Binary | Source | Purpose |
|--------|--------|---------|
| `whisper-cli` | `examples/cli/` | Primary file transcription tool |
| `whisper-stream` | `examples/stream/` | Real-time mic input (needs SDL2) |
| `whisper-server` | `examples/server/` | HTTP API server |
| `whisper-bench` | `examples/bench/` | Inference benchmarking |
| `quantize` | `examples/quantize/` | Model quantization |
| `vad-speech-segments` | `examples/vad-speech-segments/` | VAD-only segment extraction |
### Bindings (`bindings/`)
Language bindings wrap the C API in `include/whisper.h`:
- `bindings/go/` — Go
- `bindings/java/` — JNI (used by the Android example)
- `bindings/javascript/` — WASM/Node.js (built via Emscripten)
- `bindings/ruby/` — Ruby
### Model format
Models are stored in custom `ggml` binary format (not GGUF). The original OpenAI PyTorch weights are converted with `models/convert-pt-to-ggml.py`. Pre-converted models are available from HuggingFace (`ggerganov/whisper.cpp`). Tensor names follow the pattern defined in `src/whisper-arch.h`.
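As a quick way to tell the formats apart: the first four bytes of a ggml model are the magic `0x67676d6c` ("ggml" in little-endian), whereas GGUF files start with the ASCII string "GGUF". A minimal sketch, assuming only the magic's position; `models/convert-pt-to-ggml.py` is the authoritative writer for the rest of the layout:
```python
# Hedged sketch: peek at a model file's magic to confirm it is ggml, not GGUF.
import struct

with open("models/ggml-base.en.bin", "rb") as f:
    (magic,) = struct.unpack("<i", f.read(4))
print(hex(magic))  # expect 0x67676d6c for a ggml whisper model
```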
## Windows-specific notes
The project builds with MSVC. The CMakeLists.txt defines `_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR` on Windows to work around an MSVC STL issue that causes crashes in the Java bindings. Several MSVC warnings are suppressed project-wide (see the `MSVC_WARNING_FLAGS` block at the bottom of `CMakeLists.txt`).

examples/python/diarize.py Normal file

@@ -0,0 +1,211 @@
#!/usr/bin/env python3
"""
Meeting transcription with per-speaker labels.
Pipeline:
1. whisper.cpp (whisper-cli) -> timestamped transcript JSON
2. pyannote-audio -> speaker segments
3. merge by timestamp overlap -> labelled output
Usage:
python diarize.py -f meeting.wav -m large-v3 --hf-token hf_xxx
Requirements:
pip install -r requirements-diarize.txt
(HuggingFace token with pyannote/speaker-diarization-3.1 terms accepted)
"""
import argparse
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path
import soundfile as sf
import torch
from pyannote.audio import Pipeline
# ---------------------------------------------------------------------------
# Step 1: whisper.cpp transcription
# ---------------------------------------------------------------------------
def find_whisper_cli() -> str:
script_dir = Path(__file__).parent
repo_root = script_dir.parent.parent
candidates = [
repo_root / "build" / "bin" / "Release" / "whisper-cli.exe", # Windows MSVC
repo_root / "build" / "bin" / "whisper-cli.exe", # Windows MinGW
repo_root / "build" / "bin" / "whisper-cli", # Linux/Mac
]
for p in candidates:
if p.exists():
return str(p)
raise FileNotFoundError(
"whisper-cli not found. Build the project first:\n"
" cmake -B build && cmake --build build -j --config Release"
)
def run_whisper(audio_path: str, model: str, language: str, threads: int) -> list:
cli = find_whisper_cli()
repo_root = Path(__file__).parent.parent.parent
model_path = repo_root / "models" / f"ggml-{model}.bin"
if not model_path.exists():
raise FileNotFoundError(
f"Model not found: {model_path}\n"
f"Download with: bash models/download-ggml-model.sh {model}"
)
with tempfile.TemporaryDirectory() as tmpdir:
out_base = os.path.join(tmpdir, "out")
cmd = [
cli,
"-m", str(model_path),
"-f", audio_path,
"-l", language,
"-t", str(threads),
"--output-json",
"--output-file", out_base,
]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
raise RuntimeError(f"whisper-cli failed:\n{result.stderr}")
json_path = out_base + ".json"
if not os.path.exists(json_path):
raise RuntimeError(
"whisper-cli did not produce a JSON file. "
"stderr:\n" + result.stderr
)
with open(json_path, encoding="utf-8") as f:
data = json.load(f)
return data.get("transcription", [])
# ---------------------------------------------------------------------------
# Step 2: pyannote-audio speaker diarization
# ---------------------------------------------------------------------------
def run_diarization(audio_path: str, hf_token: str, num_speakers: int | None) -> list:
print("Loading pyannote speaker-diarization-3.1 (CPU) ...", file=sys.stderr)
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
token=hf_token,
)
pipeline.to(torch.device("cpu"))
# Use soundfile to avoid the torchcodec/FFmpeg dependency on Windows.
# pyannote accepts a pre-loaded {'waveform': Tensor, 'sample_rate': int} dict.
waveform, sample_rate = sf.read(audio_path, dtype="float32", always_2d=True)
waveform_tensor = torch.from_numpy(waveform.T) # (channels, time)
audio_input = {"waveform": waveform_tensor, "sample_rate": sample_rate}
kwargs = {}
if num_speakers is not None:
kwargs["num_speakers"] = num_speakers
print("Running diarization ...", file=sys.stderr)
result = pipeline(audio_input, **kwargs)
# pyannote.audio 4.x returns DiarizeOutput; the Annotation is in .speaker_diarization
annotation = result.speaker_diarization
segments = []
for turn, _, speaker in annotation.itertracks(yield_label=True):
segments.append({"start": turn.start, "end": turn.end, "speaker": speaker})
return segments
# ---------------------------------------------------------------------------
# Step 3: merge by timestamp overlap
# ---------------------------------------------------------------------------
def assign_speakers(transcription: list, diarization: list) -> list:
results = []
for seg in transcription:
# whisper offsets are in milliseconds
t0 = seg["offsets"]["from"] / 1000.0
t1 = seg["offsets"]["to"] / 1000.0
text = seg.get("text", "").strip()
if not text:
continue
best_speaker = "UNKNOWN"
best_overlap = 0.0
for d in diarization:
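# NOTE: when a transcript segment and a diarization turn are disjoint, the
# overlap below is negative, so best_overlap (0.0) only updates on a true overlap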
overlap = min(t1, d["end"]) - max(t0, d["start"])
if overlap > best_overlap:
best_overlap = overlap
best_speaker = d["speaker"]
results.append({"start": t0, "end": t1, "speaker": best_speaker, "text": text})
return results
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
def _fmt_time(seconds: float) -> str:
m, s = divmod(int(seconds), 60)
h, m = divmod(m, 60)
return f"{h:02d}:{m:02d}:{s:02d}"
def format_output(segments: list) -> str:
lines = []
for seg in segments:
ts = f"[{_fmt_time(seg['start'])} --> {_fmt_time(seg['end'])}]"
lines.append(f"{ts} {seg['speaker']}: {seg['text']}")
return "\n".join(lines)
def format_json(segments: list) -> str:
return json.dumps(segments, ensure_ascii=False, indent=2)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def parse_args():
p = argparse.ArgumentParser(description="Whisper.cpp + pyannote diarization pipeline")
p.add_argument("-f", "--file", required=True, help="Input WAV file (16 kHz, 16-bit)")
p.add_argument("-m", "--model", default="large-v3", help="ggml model name (default: large-v3)")
p.add_argument("-l", "--language", default="ja", help="Language code (default: ja)")
p.add_argument("-t", "--threads", type=int, default=4, help="whisper-cli thread count")
p.add_argument("--hf-token", required=True, help="HuggingFace access token")
p.add_argument("--num-speakers", type=int, default=None, help="Known speaker count (optional)")
p.add_argument("--output-json", action="store_true", help="Output JSON instead of plain text")
return p.parse_args()
def main():
args = parse_args()
if not os.path.exists(args.file):
sys.exit(f"Audio file not found: {args.file}")
print("Step 1/3: Transcribing with whisper.cpp ...", file=sys.stderr)
transcription = run_whisper(args.file, args.model, args.language, args.threads)
print("Step 2/3: Diarizing speakers with pyannote ...", file=sys.stderr)
diarization = run_diarization(args.file, args.hf_token, args.num_speakers)
print("Step 3/3: Merging results ...", file=sys.stderr)
segments = assign_speakers(transcription, diarization)
if args.output_json:
print(format_json(segments))
else:
print(format_output(segments))
if __name__ == "__main__":
main()

examples/python/requirements-diarize.txt Normal file

@@ -0,0 +1,14 @@
# Install in two steps:
# Step A (PyTorch CPU wheel):
# pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
# Step B (remaining packages from PyPI):
# pip install -r requirements-diarize.txt
# Speaker diarization
pyannote.audio>=3.1.0
# Hugging Face model hub access
huggingface_hub>=0.20.0
# Audio reading (avoids torchcodec/FFmpeg dependency on Windows)
soundfile>=0.12.0


@@ -0,0 +1,236 @@
# pyannote-audio setup guide
Step-by-step instructions for building a **per-speaker transcription pipeline for Japanese meeting recordings** that combines whisper.cpp (speech recognition) with pyannote-audio (speaker diarization).
## Prerequisites
- OS: Windows 11
- Anaconda / Miniconda installed
- Git for Windows installed
- Repository: `c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp`
---
## Step 1: Build whisper.cpp
```bash
cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp
cmake -B build
cmake --build build -j --config Release
```
After the build completes, the following binary is produced:
```
build/bin/Release/whisper-cli.exe
```
---
## Step 2: Create the Python virtual environment
> **Note:** pyannote.audio does not support Python 3.13, so **use Python 3.11**.
```bash
conda create -n whisper-diarize python=3.11 -y
```
---
## Step 3: Install packages
Because `--index-url` applies to the entire pip invocation, the packages must be installed **in two steps**.
### Step A: PyTorch (CPU-only build)
```bash
conda run -n whisper-diarize pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
```
### Step B: pyannote.audio and related packages
```bash
conda run -n whisper-diarize pip install -r examples/python/requirements-diarize.txt
```
Contents of `requirements-diarize.txt` (`examples/python/requirements-diarize.txt`):
```
# Speaker diarization
pyannote.audio>=3.1.0
# Hugging Face model hub access
huggingface_hub>=0.20.0
# Audio reading (avoids torchcodec/FFmpeg dependency on Windows)
soundfile>=0.12.0
```
> **Windows note:**
> pyannote.audio 4.x uses `torchcodec` (which requires FFmpeg) for audio loading,
> but the conda-forge FFmpeg build can fail to install on Japanese-locale Windows environments.
> This pipeline instead reads the WAV with `soundfile` and passes the tensor directly to pyannote.
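For reference, this is the core of the workaround as implemented in `diarize.py` (the `meeting.wav` filename is only a placeholder):
```python
# Load the WAV ourselves and hand pyannote a pre-loaded tensor dict
# instead of a file path, so torchcodec/FFmpeg is never needed.
import soundfile as sf
import torch

waveform, sample_rate = sf.read("meeting.wav", dtype="float32", always_2d=True)
assert sample_rate == 16000, "whisper.cpp expects 16 kHz input"
audio_input = {
    "waveform": torch.from_numpy(waveform.T),  # (channels, time)
    "sample_rate": sample_rate,
}
# pipeline(audio_input) then behaves exactly as with a file path.
```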
### Verifying the installation
```bash
C:/work/60.Tools/Anaconda/miniconda3/envs/whisper-diarize/python.exe -c ^
"import torch, pyannote.audio, soundfile; ^
print('torch:', torch.__version__); ^
print('pyannote:', pyannote.audio.__version__); ^
print('soundfile:', soundfile.__version__); ^
print('CPU only:', not torch.cuda.is_available())"
```
Expected output:
```
torch: 2.11.0+cpu
pyannote.audio: 4.0.4
soundfile: 0.13.1
CPU only: True
```
---
## Step 4: Get a HuggingFace token and accept the usage terms
The pyannote models live in gated repositories, so **every** step below is required.
### 4-1. Create a HuggingFace account
Sign up at https://huggingface.co (skip if you already have an account).
### 4-2. Accept the usage terms (all three are required)
Visit each of the following pages and click **"Agree and access repository"**.
| Model | URL |
|--------|-----|
| speaker-diarization-3.1 | https://huggingface.co/pyannote/speaker-diarization-3.1 |
| segmentation-3.0 | https://huggingface.co/pyannote/segmentation-3.0 |
| speaker-diarization-community-1 | https://huggingface.co/pyannote/speaker-diarization-community-1 |
> **Note:** the third repository, `speaker-diarization-community-1`, is a dependency added in pyannote.audio 4.x.
> If you skip it, you will get a `403 Forbidden` error at runtime.
### 4-3. Create an access token
1. Go to https://huggingface.co/settings/tokens
2. Click "New token"
3. Select the `read` permission and create the token
4. Save the token string beginning with `hf_`; the snippet below can confirm it works
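To check the token and gated-repo access before running the full pipeline, a pre-flight check along these lines should work. This snippet is not part of `diarize.py`, and `auth_check` requires huggingface_hub >= 0.24 (newer than the `>=0.20.0` pin in the requirements file):
```python
# Optional pre-flight check: verifies the token is valid and that the
# gated pyannote repositories are accessible with it.
from huggingface_hub import auth_check, whoami

token = "hf_xxxxxxxxxxxx"  # your token here
print("Logged in as:", whoami(token=token)["name"])
for repo in (
    "pyannote/speaker-diarization-3.1",
    "pyannote/segmentation-3.0",
    "pyannote/speaker-diarization-community-1",
):
    auth_check(repo, token=token)  # raises GatedRepoError if terms not accepted
    print("OK:", repo)
```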
---
## Step 5: Download Japanese-capable models
```bash
cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp
# Accuracy-focused (recommended, 2.9 GB)
bash models/download-ggml-model.sh large-v3
# Speed-focused (for smoke tests, 142 MB)
bash models/download-ggml-model.sh base
```
| Model | Size | Japanese accuracy |
|--------|--------|------------|
| large-v3 | 2.9 GB | Best |
| medium | 1.5 GB | High |
| small | 466 MB | Medium |
| base | 142 MB | Low (smoke tests only) |
---
## Step 6: Prepare the audio file
whisper.cpp accepts **16 kHz / 16-bit / mono WAV** only.
Convert other formats with ffmpeg:
```bash
ffmpeg -i meeting.mp4 -ar 16000 -ac 1 -c:a pcm_s16le meeting.wav
```
---
## Step 7: Run
Script: `examples/python/diarize.py`
```bash
cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp
C:/work/60.Tools/Anaconda/miniconda3/envs/whisper-diarize/python.exe \
examples/python/diarize.py \
-f meeting.wav \
-m large-v3 \
--hf-token hf_xxxxxxxxxxxx \
--language ja
```
### Options
| Option | Short | Default | Description |
|---|---|---|---|
| `--file` | `-f` | (required) | Input WAV file |
| `--model` | `-m` | `large-v3` | ggml model name |
| `--language` | `-l` | `ja` | Language code |
| `--hf-token` | | (required) | HuggingFace access token |
| `--num-speakers` | | auto-detect | Number of speakers (specifying a known count improves accuracy) |
| `--threads` | `-t` | `4` | whisper-cli thread count |
| `--output-json` | | off | Emit JSON instead of plain text |
### Example output
```
[00:00:00 --> 00:00:03] SPEAKER_00: 本日はお集まりいただきありがとうございます。
[00:00:03 --> 00:00:07] SPEAKER_01: よろしくお願いします。
[00:00:07 --> 00:00:12] SPEAKER_00: では、議題に入りましょう。
[00:00:12 --> 00:00:18] SPEAKER_02: 先週の進捗を報告します。
```
---
## Troubleshooting
### A flood of `torchcodec` warnings
```
UserWarning: torchcodec is not installed correctly so built-in audio decoding will fail.
```
**→ Safe to ignore.**
The `soundfile` workaround is active, so the warning has no effect on actual operation.
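If the noise bothers you, a standard Python warning filter (an optional addition, not something `diarize.py` does) silences it:
```python
# Optional: silence the cosmetic torchcodec warning.
# Place this before "from pyannote.audio import Pipeline".
import warnings

warnings.filterwarnings(
    "ignore",
    message="torchcodec is not installed correctly.*",
    category=UserWarning,
)
```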
### `403 Forbidden` / `GatedRepoError`
You do not have access to the pyannote models.
**→ Confirm that you accepted the terms for all three repositories in step 4-2.**
### `TypeError: Pipeline.from_pretrained() got an unexpected keyword argument 'use_auth_token'`
The `use_auth_token` argument was removed in pyannote.audio 4.x.
**→ Pass `token=` instead (`diarize.py` already does).**
### `AttributeError: 'DiarizeOutput' object has no attribute 'itertracks'`
In pyannote.audio 4.x the pipeline returns a `DiarizeOutput` instead of an `Annotation`.
**→ Use `result.speaker_diarization.itertracks()` (`diarize.py` already does).**
---
## Environment (verified working)
| Item | Version |
|---|---|
| OS | Windows 11 Pro 10.0.26200 |
| Python | 3.11 (conda) |
| torch | 2.11.0+cpu |
| pyannote.audio | 4.0.4 |
| soundfile | 0.13.1 |
| whisper.cpp | v1.8.4 (master) |
| whisper-cli models | ggml-base.bin / ggml-large-v3.bin |