Feature: combine whisper.cpp (speech recognition) with pyannote-audio (speaker diarization)

zoe-simpson 2026-04-26 18:31:38 +09:00
parent fc674574ca
commit 147a83122f
4 changed files with 604 additions and 0 deletions

CLAUDE.md Normal file

@@ -0,0 +1,143 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Build Commands
```bash
# Standard build (Release by default on non-MSVC)
cmake -B build
cmake --build build -j --config Release
# GPU backends
cmake -B build -DGGML_CUDA=1 # NVIDIA CUDA
cmake -B build -DGGML_VULKAN=1 # Vulkan (cross-vendor)
cmake -B build -DGGML_METAL=1 # Apple Metal
cmake -B build -DGGML_BLAS=1 # CPU via OpenBLAS
# Optional features
cmake -B build -DWHISPER_SDL2=ON # Enable SDL2 for real-time audio (stream example)
cmake -B build -DWHISPER_CURL=ON # Enable libcurl for model download
cmake -B build -DWHISPER_COREML=ON # Apple Core ML encoder (Apple Silicon only)
# Sanitizers
cmake -B build -DWHISPER_SANITIZE_ADDRESS=ON
cmake -B build -DWHISPER_SANITIZE_THREAD=ON
cmake -B build -DWHISPER_SANITIZE_UNDEFINED=ON
```
## Running Tests
Tests require model files (downloaded separately) and use CTest:
```bash
# Build with tests enabled (on by default when building standalone)
cmake -B build -DWHISPER_BUILD_TESTS=ON
cmake --build build -j --config Release
# Run all tests
cd build && ctest
# Run a specific test by label
cd build && ctest -L tiny
# Run single integration test manually (requires model at models/for-tests-ggml-tiny.en.bin)
./build/bin/whisper-cli -m models/for-tests-ggml-tiny.en.bin -f samples/jfk.wav
# Run the unit VAD test binary
./build/bin/test-vad
```
## Downloading Models
```bash
# Download a pre-converted ggml model
bash ./models/download-ggml-model.sh base.en # or: tiny, small, medium, large-v3, etc.
# Download VAD model (for --vad flag)
bash ./models/download-vad-model.sh silero-v6.2.0
# Convenience Makefile targets (downloads model + builds + runs on samples/)
make base.en
make tiny
```
## Transcribing Audio
Audio must be 16-bit WAV at 16 kHz. Convert with ffmpeg:
```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
./build/bin/whisper-cli -m models/ggml-base.en.bin -f output.wav
```
## Quantization
```bash
./build/bin/quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0
./build/bin/whisper-cli -m models/ggml-base.en-q5_0.bin -f samples/jfk.wav
```
---
## Architecture Overview
### Two-layer design
```
include/whisper.h ← Public C API (cross-language compatible)
src/whisper.cpp ← Whisper model: audio preprocessing, encoder, decoder, beam search
src/whisper-arch.h ← Tensor name map (encoder/decoder/cross-attention weight paths in ggml format)
ggml/ ← Tensor math library (git subtree from ggml-org/ggml)
```
`whisper_context` holds the loaded model weights (shared, read-only across threads). `whisper_state` holds per-inference mutable state (KV cache, mel buffers). You can create multiple states from one context for parallel inference.
### ggml subdirectory
`ggml/` is a git subtree (synced via `sync` commits). Do not edit it directly unless you are making changes intended to be upstreamed. Hardware backends live in `ggml/src/`:
- `ggml-cpu/` — generic CPU with NEON/AVX/VSX intrinsics
- `ggml-cuda/` — CUDA kernels
- `ggml-metal/` — Metal shaders (Apple)
- `ggml-vulkan/` — Vulkan compute shaders
- `ggml-sycl/` — SYCL (Intel)
### Whisper pipeline (inside `src/whisper.cpp`)
1. **Audio preprocessing** — raw PCM → log-Mel spectrogram (80 mel bins, 30-second chunks at 16 kHz); see the Python sketch after this list
2. **Encoder** — convolutional feature extraction + transformer encoder; optional Core ML / OpenVINO offload
3. **Decoder** — autoregressive transformer decoder with optional beam search, temperature fallback, and cross-attention timestamps
4. **VAD** — optional pre-pass using Silero-VAD to skip silence before encoding
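For intuition, here is a rough Python approximation of the preprocessing step. This is an illustrative sketch, not the C++ code path: the constants `n_fft=400` and `hop_length=160` come from OpenAI's reference Whisper implementation and are assumed (not verified here) to match whisper.cpp's internal constants.
```python
# Illustrative only: approximates whisper.cpp's log-mel front end with librosa.
# n_fft=400 (25 ms) and hop_length=160 (10 ms) follow OpenAI's reference
# implementation; whisper.cpp computes the same transform in C++.
import librosa
import numpy as np

pcm, sr = librosa.load("samples/jfk.wav", sr=16000, mono=True)
mel = librosa.feature.melspectrogram(
    y=pcm, sr=sr, n_fft=400, hop_length=160, n_mels=80, power=2.0
)
log_mel = np.log10(np.maximum(mel, 1e-10))          # log compression
log_mel = np.maximum(log_mel, log_mel.max() - 8.0)  # clamp dynamic range to 80 dB
log_mel = (log_mel + 4.0) / 4.0                     # normalize, as in Whisper
print(log_mel.shape)  # (80, n_frames) — the encoder consumes 30 s windows
```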
### Examples (`examples/`)
Shared utilities used by all examples live at the top level of `examples/`:
- `common.h / common.cpp` — CLI arg parsing, vocab helpers
- `common-whisper.h / common-whisper.cpp` — WAV reading, timestamp formatting
- `common-sdl.h / common-sdl.cpp` — SDL2 audio capture (stream example only)
- `grammar-parser.h / grammar-parser.cpp` — GBNF grammar parsing for constrained decoding
Key example binaries:
| Binary | Source | Purpose |
|--------|--------|---------|
| `whisper-cli` | `examples/cli/` | Primary file transcription tool |
| `whisper-stream` | `examples/stream/` | Real-time mic input (needs SDL2) |
| `whisper-server` | `examples/server/` | HTTP API server |
| `whisper-bench` | `examples/bench/` | Inference benchmarking |
| `quantize` | `examples/quantize/` | Model quantization |
| `vad-speech-segments` | `examples/vad-speech-segments/` | VAD-only segment extraction |
### Bindings (`bindings/`)
Language bindings wrap the C API in `include/whisper.h`:
- `bindings/go/` — Go
- `bindings/java/` — JNI (used by the Android example)
- `bindings/javascript/` — WASM/Node.js (built via Emscripten)
- `bindings/ruby/` — Ruby
### Model format
Models are stored in custom `ggml` binary format (not GGUF). The original OpenAI PyTorch weights are converted with `models/convert-pt-to-ggml.py`. Pre-converted models are available from HuggingFace (`ggerganov/whisper.cpp`). Tensor names follow the pattern defined in `src/whisper-arch.h`.
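As a quick way to tell the formats apart: the first four bytes of a ggml model are the magic `0x67676d6c` ("ggml" in little-endian), whereas GGUF files start with the ASCII string "GGUF". A minimal sketch, assuming only the magic's position; `models/convert-pt-to-ggml.py` is the authoritative writer for the rest of the layout:
```python
# Hedged sketch: peek at a model file's magic to confirm it is ggml, not GGUF.
import struct

with open("models/ggml-base.en.bin", "rb") as f:
    (magic,) = struct.unpack("<i", f.read(4))
print(hex(magic))  # expect 0x67676d6c for a ggml whisper model
```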
## Windows-specific notes
The project builds with MSVC. The CMakeLists.txt defines `_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR` on Windows to work around an MSVC STL issue that causes crashes in the Java bindings. Several MSVC warnings are suppressed project-wide (see the `MSVC_WARNING_FLAGS` block at the bottom of `CMakeLists.txt`).

examples/python/diarize.py Normal file

@@ -0,0 +1,211 @@
#!/usr/bin/env python3
"""
Meeting transcription with per-speaker labels.
Pipeline:
1. whisper.cpp (whisper-cli) -> timestamped transcript JSON
2. pyannote-audio -> speaker segments
3. merge by timestamp overlap -> labelled output
Usage:
python diarize.py -f meeting.wav -m large-v3 --hf-token hf_xxx
Requirements:
pip install -r requirements-diarize.txt
(HuggingFace token with pyannote/speaker-diarization-3.1 terms accepted)
"""
import argparse
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path
import soundfile as sf
import torch
from pyannote.audio import Pipeline
# ---------------------------------------------------------------------------
# Step 1: whisper.cpp transcription
# ---------------------------------------------------------------------------
def find_whisper_cli() -> str:
script_dir = Path(__file__).parent
repo_root = script_dir.parent.parent
candidates = [
repo_root / "build" / "bin" / "Release" / "whisper-cli.exe", # Windows MSVC
repo_root / "build" / "bin" / "whisper-cli.exe", # Windows MinGW
repo_root / "build" / "bin" / "whisper-cli", # Linux/Mac
]
for p in candidates:
if p.exists():
return str(p)
raise FileNotFoundError(
"whisper-cli not found. Build the project first:\n"
" cmake -B build && cmake --build build -j --config Release"
)
def run_whisper(audio_path: str, model: str, language: str, threads: int) -> list:
cli = find_whisper_cli()
repo_root = Path(__file__).parent.parent.parent
model_path = repo_root / "models" / f"ggml-{model}.bin"
if not model_path.exists():
raise FileNotFoundError(
f"Model not found: {model_path}\n"
f"Download with: bash models/download-ggml-model.sh {model}"
)
with tempfile.TemporaryDirectory() as tmpdir:
out_base = os.path.join(tmpdir, "out")
cmd = [
cli,
"-m", str(model_path),
"-f", audio_path,
"-l", language,
"-t", str(threads),
"--output-json",
"--output-file", out_base,
]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
raise RuntimeError(f"whisper-cli failed:\n{result.stderr}")
json_path = out_base + ".json"
if not os.path.exists(json_path):
raise RuntimeError(
"whisper-cli did not produce a JSON file. "
"stderr:\n" + result.stderr
)
with open(json_path, encoding="utf-8") as f:
data = json.load(f)
return data.get("transcription", [])
# ---------------------------------------------------------------------------
# Step 2: pyannote-audio speaker diarization
# ---------------------------------------------------------------------------
def run_diarization(audio_path: str, hf_token: str, num_speakers: int | None) -> list:
print("Loading pyannote speaker-diarization-3.1 (CPU) ...", file=sys.stderr)
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
token=hf_token,
)
pipeline.to(torch.device("cpu"))
# Use soundfile to avoid the torchcodec/FFmpeg dependency on Windows.
# pyannote accepts a pre-loaded {'waveform': Tensor, 'sample_rate': int} dict.
waveform, sample_rate = sf.read(audio_path, dtype="float32", always_2d=True)
waveform_tensor = torch.from_numpy(waveform.T) # (channels, time)
audio_input = {"waveform": waveform_tensor, "sample_rate": sample_rate}
kwargs = {}
if num_speakers is not None:
kwargs["num_speakers"] = num_speakers
print("Running diarization ...", file=sys.stderr)
result = pipeline(audio_input, **kwargs)
# pyannote.audio 4.x returns DiarizeOutput; the Annotation is in .speaker_diarization
annotation = result.speaker_diarization
segments = []
for turn, _, speaker in annotation.itertracks(yield_label=True):
segments.append({"start": turn.start, "end": turn.end, "speaker": speaker})
return segments
# ---------------------------------------------------------------------------
# Step 3: merge by timestamp overlap
# ---------------------------------------------------------------------------
def assign_speakers(transcription: list, diarization: list) -> list:
results = []
for seg in transcription:
# whisper offsets are in milliseconds
t0 = seg["offsets"]["from"] / 1000.0
t1 = seg["offsets"]["to"] / 1000.0
text = seg.get("text", "").strip()
if not text:
continue
best_speaker = "UNKNOWN"
best_overlap = 0.0
for d in diarization:
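# NOTE: when a transcript segment and a diarization turn are disjoint, the
# overlap below is negative, so best_overlap (0.0) only updates on a true overlap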
overlap = min(t1, d["end"]) - max(t0, d["start"])
if overlap > best_overlap:
best_overlap = overlap
best_speaker = d["speaker"]
results.append({"start": t0, "end": t1, "speaker": best_speaker, "text": text})
return results
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
def _fmt_time(seconds: float) -> str:
m, s = divmod(int(seconds), 60)
h, m = divmod(m, 60)
return f"{h:02d}:{m:02d}:{s:02d}"
def format_output(segments: list) -> str:
lines = []
for seg in segments:
ts = f"[{_fmt_time(seg['start'])} --> {_fmt_time(seg['end'])}]"
lines.append(f"{ts} {seg['speaker']}: {seg['text']}")
return "\n".join(lines)
def format_json(segments: list) -> str:
return json.dumps(segments, ensure_ascii=False, indent=2)
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def parse_args():
p = argparse.ArgumentParser(description="Whisper.cpp + pyannote diarization pipeline")
p.add_argument("-f", "--file", required=True, help="Input WAV file (16 kHz, 16-bit)")
p.add_argument("-m", "--model", default="large-v3", help="ggml model name (default: large-v3)")
p.add_argument("-l", "--language", default="ja", help="Language code (default: ja)")
p.add_argument("-t", "--threads", type=int, default=4, help="whisper-cli thread count")
p.add_argument("--hf-token", required=True, help="HuggingFace access token")
p.add_argument("--num-speakers", type=int, default=None, help="Known speaker count (optional)")
p.add_argument("--output-json", action="store_true", help="Output JSON instead of plain text")
return p.parse_args()
def main():
args = parse_args()
if not os.path.exists(args.file):
sys.exit(f"Audio file not found: {args.file}")
print("Step 1/3: Transcribing with whisper.cpp ...", file=sys.stderr)
transcription = run_whisper(args.file, args.model, args.language, args.threads)
print("Step 2/3: Diarizing speakers with pyannote ...", file=sys.stderr)
diarization = run_diarization(args.file, args.hf_token, args.num_speakers)
print("Step 3/3: Merging results ...", file=sys.stderr)
segments = assign_speakers(transcription, diarization)
if args.output_json:
print(format_json(segments))
else:
print(format_output(segments))
if __name__ == "__main__":
main()

examples/python/requirements-diarize.txt Normal file

@@ -0,0 +1,14 @@
# Install in two steps:
# Step A (PyTorch CPU wheel):
# pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
# Step B (remaining packages from PyPI):
# pip install -r requirements-diarize.txt
# Speaker diarization
pyannote.audio>=3.1.0
# Hugging Face model hub access
huggingface_hub>=0.20.0
# Audio reading (avoids torchcodec/FFmpeg dependency on Windows)
soundfile>=0.12.0


@@ -0,0 +1,236 @@
# pyannote-audio setup guide
Step-by-step instructions for building a **per-speaker transcription pipeline for Japanese meeting recordings** that combines whisper.cpp (speech recognition) with pyannote-audio (speaker diarization).
## Prerequisites
- OS: Windows 11
- Anaconda / Miniconda installed
- Git for Windows installed
- Repository: `c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp`
---
## Step 1: Build whisper.cpp
```bash
cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp
cmake -B build
cmake --build build -j --config Release
```
After the build completes, the following binary is produced:
```
build/bin/Release/whisper-cli.exe
```
---
## Step 2: Create the Python virtual environment
> **Note:** pyannote.audio does not support Python 3.13, so **use Python 3.11**.
```bash
conda create -n whisper-diarize python=3.11 -y
```
---
## Step 3: Install packages
Because `--index-url` applies to the entire pip invocation, the packages must be installed **in two steps**.
### Step A: PyTorch (CPU-only build)
```bash
conda run -n whisper-diarize pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
```
### Step B: pyannote.audio and related packages
```bash
conda run -n whisper-diarize pip install -r examples/python/requirements-diarize.txt
```
Contents of `requirements-diarize.txt` (`examples/python/requirements-diarize.txt`):
```
# Speaker diarization
pyannote.audio>=3.1.0
# Hugging Face model hub access
huggingface_hub>=0.20.0
# Audio reading (avoids torchcodec/FFmpeg dependency on Windows)
soundfile>=0.12.0
```
> **Windows note:**
> pyannote.audio 4.x uses `torchcodec` (which requires FFmpeg) for audio loading,
> but the conda-forge FFmpeg build can fail to install on Japanese-locale Windows environments.
> This pipeline instead reads the WAV with `soundfile` and passes the tensor directly to pyannote.
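For reference, this is the core of the workaround as implemented in `diarize.py` (the `meeting.wav` filename is only a placeholder):
```python
# Load the WAV ourselves and hand pyannote a pre-loaded tensor dict
# instead of a file path, so torchcodec/FFmpeg is never needed.
import soundfile as sf
import torch

waveform, sample_rate = sf.read("meeting.wav", dtype="float32", always_2d=True)
assert sample_rate == 16000, "whisper.cpp expects 16 kHz input"
audio_input = {
    "waveform": torch.from_numpy(waveform.T),  # (channels, time)
    "sample_rate": sample_rate,
}
# pipeline(audio_input) then behaves exactly as with a file path.
```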
### Verifying the installation
```bash
C:/work/60.Tools/Anaconda/miniconda3/envs/whisper-diarize/python.exe -c ^
"import torch, pyannote.audio, soundfile; ^
print('torch:', torch.__version__); ^
print('pyannote:', pyannote.audio.__version__); ^
print('soundfile:', soundfile.__version__); ^
print('CPU only:', not torch.cuda.is_available())"
```
Expected output:
```
torch: 2.11.0+cpu
pyannote.audio: 4.0.4
soundfile: 0.13.1
CPU only: True
```
---
## Step 4: Get a HuggingFace token and accept the usage terms
The pyannote models live in gated repositories, so **every** step below is required.
### 4-1. Create a HuggingFace account
Sign up at https://huggingface.co (skip if you already have an account).
### 4-2. Accept the usage terms (all three are required)
Visit each of the following pages and click **"Agree and access repository"**.
| Model | URL |
|--------|-----|
| speaker-diarization-3.1 | https://huggingface.co/pyannote/speaker-diarization-3.1 |
| segmentation-3.0 | https://huggingface.co/pyannote/segmentation-3.0 |
| speaker-diarization-community-1 | https://huggingface.co/pyannote/speaker-diarization-community-1 |
> **Note:** the third repository, `speaker-diarization-community-1`, is a dependency added in pyannote.audio 4.x.
> If you skip it, you will get a `403 Forbidden` error at runtime.
### 4-3. Create an access token
1. Go to https://huggingface.co/settings/tokens
2. Click "New token"
3. Select the `read` permission and create the token
4. Save the token string beginning with `hf_`; the snippet below can confirm it works
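To check the token and gated-repo access before running the full pipeline, a pre-flight check along these lines should work. This snippet is not part of `diarize.py`, and `auth_check` requires huggingface_hub >= 0.24 (newer than the `>=0.20.0` pin in the requirements file):
```python
# Optional pre-flight check: verifies the token is valid and that the
# gated pyannote repositories are accessible with it.
from huggingface_hub import auth_check, whoami

token = "hf_xxxxxxxxxxxx"  # your token here
print("Logged in as:", whoami(token=token)["name"])
for repo in (
    "pyannote/speaker-diarization-3.1",
    "pyannote/segmentation-3.0",
    "pyannote/speaker-diarization-community-1",
):
    auth_check(repo, token=token)  # raises GatedRepoError if terms not accepted
    print("OK:", repo)
```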
---
## Step 5: Download Japanese-capable models
```bash
cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp
# Accuracy-focused (recommended, 2.9 GB)
bash models/download-ggml-model.sh large-v3
# Speed-focused (for smoke tests, 142 MB)
bash models/download-ggml-model.sh base
```
| Model | Size | Japanese accuracy |
|--------|--------|------------|
| large-v3 | 2.9 GB | Best |
| medium | 1.5 GB | High |
| small | 466 MB | Medium |
| base | 142 MB | Low (smoke tests only) |
---
## Step 6: Prepare the audio file
whisper.cpp accepts **16 kHz / 16-bit / mono WAV** only.
Convert other formats with ffmpeg:
```bash
ffmpeg -i meeting.mp4 -ar 16000 -ac 1 -c:a pcm_s16le meeting.wav
```
---
## Step 7: Run
Script: `examples/python/diarize.py`
```bash
cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp
C:/work/60.Tools/Anaconda/miniconda3/envs/whisper-diarize/python.exe \
examples/python/diarize.py \
-f meeting.wav \
-m large-v3 \
--hf-token hf_xxxxxxxxxxxx \
--language ja
```
### Options
| Option | Short | Default | Description |
|---|---|---|---|
| `--file` | `-f` | (required) | Input WAV file |
| `--model` | `-m` | `large-v3` | ggml model name |
| `--language` | `-l` | `ja` | Language code |
| `--hf-token` | | (required) | HuggingFace access token |
| `--num-speakers` | | auto-detect | Number of speakers (specifying a known count improves accuracy) |
| `--threads` | `-t` | `4` | whisper-cli thread count |
| `--output-json` | | off | Emit JSON instead of plain text |
### Example output
```
[00:00:00 --> 00:00:03] SPEAKER_00: 本日はお集まりいただきありがとうございます。
[00:00:03 --> 00:00:07] SPEAKER_01: よろしくお願いします。
[00:00:07 --> 00:00:12] SPEAKER_00: では、議題に入りましょう。
[00:00:12 --> 00:00:18] SPEAKER_02: 先週の進捗を報告します。
```
---
## Troubleshooting
### A flood of `torchcodec` warnings
```
UserWarning: torchcodec is not installed correctly so built-in audio decoding will fail.
```
**→ Safe to ignore.**
The `soundfile` workaround is active, so the warning has no effect on actual operation.
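If the noise bothers you, a standard Python warning filter (an optional addition, not something `diarize.py` does) silences it:
```python
# Optional: silence the cosmetic torchcodec warning.
# Place this before "from pyannote.audio import Pipeline".
import warnings

warnings.filterwarnings(
    "ignore",
    message="torchcodec is not installed correctly.*",
    category=UserWarning,
)
```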
### `403 Forbidden` / `GatedRepoError`
You do not have access to the pyannote models.
**→ Confirm that you accepted the terms for all three repositories in step 4-2.**
### `TypeError: Pipeline.from_pretrained() got an unexpected keyword argument 'use_auth_token'`
The `use_auth_token` argument was removed in pyannote.audio 4.x.
**→ Pass `token=` instead (`diarize.py` already does).**
### `AttributeError: 'DiarizeOutput' object has no attribute 'itertracks'`
In pyannote.audio 4.x the pipeline returns a `DiarizeOutput` instead of an `Annotation`.
**→ Use `result.speaker_diarization.itertracks()` (`diarize.py` already does).**
---
## Environment (verified working)
| Item | Version |
|---|---|
| OS | Windows 11 Pro 10.0.26200 |
| Python | 3.11 (conda) |
| torch | 2.11.0+cpu |
| pyannote.audio | 4.0.4 |
| soundfile | 0.13.1 |
| whisper.cpp | v1.8.4 (master) |
| whisper-cli models | ggml-base.bin / ggml-large-v3.bin |