diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 00000000..0d20d1e8 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,143 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Build Commands + +```bash +# Standard build (Release by default on non-MSVC) +cmake -B build +cmake --build build -j --config Release + +# GPU backends +cmake -B build -DGGML_CUDA=1 # NVIDIA CUDA +cmake -B build -DGGML_VULKAN=1 # Vulkan (cross-vendor) +cmake -B build -DGGML_METAL=1 # Apple Metal +cmake -B build -DGGML_BLAS=1 # CPU via OpenBLAS + +# Optional features +cmake -B build -DWHISPER_SDL2=ON # Enable SDL2 for real-time audio (stream example) +cmake -B build -DWHISPER_CURL=ON # Enable libcurl for model download +cmake -B build -DWHISPER_COREML=ON # Apple Core ML encoder (Apple Silicon only) + +# Sanitizers +cmake -B build -DWHISPER_SANITIZE_ADDRESS=ON +cmake -B build -DWHISPER_SANITIZE_THREAD=ON +cmake -B build -DWHISPER_SANITIZE_UNDEFINED=ON +``` + +## Running Tests + +Tests require model files (downloaded separately) and use CTest: + +```bash +# Build with tests enabled (on by default when building standalone) +cmake -B build -DWHISPER_BUILD_TESTS=ON +cmake --build build -j --config Release + +# Run all tests +cd build && ctest + +# Run a specific test by label +cd build && ctest -L tiny + +# Run single integration test manually (requires model at models/for-tests-ggml-tiny.en.bin) +./build/bin/whisper-cli -m models/for-tests-ggml-tiny.en.bin -f samples/jfk.wav + +# Run the unit VAD test binary +./build/bin/test-vad +``` + +## Downloading Models + +```bash +# Download a pre-converted ggml model +bash ./models/download-ggml-model.sh base.en # or: tiny, small, medium, large-v3, etc. + +# Download VAD model (for --vad flag) +bash ./models/download-vad-model.sh silero-v6.2.0 + +# Convenience Makefile targets (downloads model + builds + runs on samples/) +make base.en +make tiny +``` + +## Transcribing Audio + +Audio must be 16-bit WAV at 16 kHz. Convert with ffmpeg: +```bash +ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav +./build/bin/whisper-cli -m models/ggml-base.en.bin -f output.wav +``` + +## Quantization + +```bash +./build/bin/quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0 +./build/bin/whisper-cli -m models/ggml-base.en-q5_0.bin -f samples/jfk.wav +``` + +--- + +## Architecture Overview + +### Two-layer design + +``` +include/whisper.h ← Public C API (cross-language compatible) +src/whisper.cpp ← Whisper model: audio preprocessing, encoder, decoder, beam search +src/whisper-arch.h ← Tensor name map (encoder/decoder/cross-attention weight paths in ggml format) +ggml/ ← Tensor math library (git subtree from ggml-org/ggml) +``` + +`whisper_context` holds the loaded model weights (shared, read-only across threads). `whisper_state` holds per-inference mutable state (KV cache, mel buffers). You can create multiple states from one context for parallel inference. + +### ggml subdirectory + +`ggml/` is a git subtree (synced via `sync` commits). Do not edit it directly unless you are making changes intended to be upstreamed. Hardware backends live in `ggml/src/`: +- `ggml-cpu/` — generic CPU with NEON/AVX/VSX intrinsics +- `ggml-cuda/` — CUDA kernels +- `ggml-metal/` — Metal shaders (Apple) +- `ggml-vulkan/` — Vulkan compute shaders +- `ggml-sycl/` — SYCL (Intel) + +### Whisper pipeline (inside `src/whisper.cpp`) + +1. 
**Audio preprocessing** — raw PCM → log-Mel spectrogram (80 mel bins, 30-second chunks at 16 kHz) +2. **Encoder** — convolutional feature extraction + transformer encoder; optional Core ML / OpenVINO offload +3. **Decoder** — autoregressive transformer decoder with optional beam search, temperature fallback, and cross-attention timestamps +4. **VAD** — optional pre-pass using Silero-VAD to skip silence before encoding + +### Examples (`examples/`) + +Shared utilities used by all examples live at the top level of `examples/`: +- `common.h / common.cpp` — CLI arg parsing, vocab helpers +- `common-whisper.h / common-whisper.cpp` — WAV reading, timestamp formatting +- `common-sdl.h / common-sdl.cpp` — SDL2 audio capture (stream example only) +- `grammar-parser.h / grammar-parser.cpp` — GBNF grammar parsing for constrained decoding + +Key example binaries: +| Binary | Source | Purpose | +|--------|--------|---------| +| `whisper-cli` | `examples/cli/` | Primary file transcription tool | +| `whisper-stream` | `examples/stream/` | Real-time mic input (needs SDL2) | +| `whisper-server` | `examples/server/` | HTTP API server | +| `whisper-bench` | `examples/bench/` | Inference benchmarking | +| `quantize` | `examples/quantize/` | Model quantization | +| `vad-speech-segments` | `examples/vad-speech-segments/` | VAD-only segment extraction | + +### Bindings (`bindings/`) + +Language bindings wrap the C API in `include/whisper.h`: +- `bindings/go/` — Go +- `bindings/java/` — JNI (used by the Android example) +- `bindings/javascript/` — WASM/Node.js (built via Emscripten) +- `bindings/ruby/` — Ruby + +### Model format + +Models are stored in custom `ggml` binary format (not GGUF). The original OpenAI PyTorch weights are converted with `models/convert-pt-to-ggml.py`. Pre-converted models are available from HuggingFace (`ggerganov/whisper.cpp`). Tensor names follow the pattern defined in `src/whisper-arch.h`. + +## Windows-specific notes + +The project builds with MSVC. The CMakeLists.txt defines `_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR` on Windows to work around an MSVC STL issue that causes crashes in the Java bindings. Several MSVC warnings are suppressed project-wide (see the `MSVC_WARNING_FLAGS` block at the bottom of `CMakeLists.txt`). diff --git a/examples/python/diarize.py b/examples/python/diarize.py new file mode 100644 index 00000000..d5819754 --- /dev/null +++ b/examples/python/diarize.py @@ -0,0 +1,211 @@ +#!/usr/bin/env python3 +""" +Meeting transcription with per-speaker labels. + +Pipeline: + 1. whisper.cpp (whisper-cli) -> timestamped transcript JSON + 2. pyannote-audio -> speaker segments + 3. 
merge by timestamp overlap -> labelled output + +Usage: + python diarize.py -f meeting.wav -m large-v3 --hf-token hf_xxx + +Requirements: + pip install -r requirements-diarize.txt + (HuggingFace token with pyannote/speaker-diarization-3.1 terms accepted) +""" + +import argparse +import json +import os +import subprocess +import sys +import tempfile +from pathlib import Path + +import soundfile as sf +import torch +from pyannote.audio import Pipeline + + +# --------------------------------------------------------------------------- +# Step 1: whisper.cpp transcription +# --------------------------------------------------------------------------- + +def find_whisper_cli() -> str: + script_dir = Path(__file__).parent + repo_root = script_dir.parent.parent + candidates = [ + repo_root / "build" / "bin" / "Release" / "whisper-cli.exe", # Windows MSVC + repo_root / "build" / "bin" / "whisper-cli.exe", # Windows MinGW + repo_root / "build" / "bin" / "whisper-cli", # Linux/Mac + ] + for p in candidates: + if p.exists(): + return str(p) + raise FileNotFoundError( + "whisper-cli not found. Build the project first:\n" + " cmake -B build && cmake --build build -j --config Release" + ) + + +def run_whisper(audio_path: str, model: str, language: str, threads: int) -> list: + cli = find_whisper_cli() + + repo_root = Path(__file__).parent.parent.parent + model_path = repo_root / "models" / f"ggml-{model}.bin" + if not model_path.exists(): + raise FileNotFoundError( + f"Model not found: {model_path}\n" + f"Download with: bash models/download-ggml-model.sh {model}" + ) + + with tempfile.TemporaryDirectory() as tmpdir: + out_base = os.path.join(tmpdir, "out") + cmd = [ + cli, + "-m", str(model_path), + "-f", audio_path, + "-l", language, + "-t", str(threads), + "--output-json", + "--output-file", out_base, + ] + result = subprocess.run(cmd, capture_output=True, text=True) + if result.returncode != 0: + raise RuntimeError(f"whisper-cli failed:\n{result.stderr}") + + json_path = out_base + ".json" + if not os.path.exists(json_path): + raise RuntimeError( + "whisper-cli did not produce a JSON file. " + "stderr:\n" + result.stderr + ) + + with open(json_path, encoding="utf-8") as f: + data = json.load(f) + + return data.get("transcription", []) + + +# --------------------------------------------------------------------------- +# Step 2: pyannote-audio speaker diarization +# --------------------------------------------------------------------------- + +def run_diarization(audio_path: str, hf_token: str, num_speakers: int | None) -> list: + print("Loading pyannote speaker-diarization-3.1 (CPU) ...", file=sys.stderr) + pipeline = Pipeline.from_pretrained( + "pyannote/speaker-diarization-3.1", + token=hf_token, + ) + pipeline.to(torch.device("cpu")) + + # Use soundfile to avoid the torchcodec/FFmpeg dependency on Windows. + # pyannote accepts a pre-loaded {'waveform': Tensor, 'sample_rate': int} dict. 
+ waveform, sample_rate = sf.read(audio_path, dtype="float32", always_2d=True) + waveform_tensor = torch.from_numpy(waveform.T) # (channels, time) + audio_input = {"waveform": waveform_tensor, "sample_rate": sample_rate} + + kwargs = {} + if num_speakers is not None: + kwargs["num_speakers"] = num_speakers + + print("Running diarization ...", file=sys.stderr) + result = pipeline(audio_input, **kwargs) + + # pyannote.audio 4.x returns DiarizeOutput; the Annotation is in .speaker_diarization + annotation = result.speaker_diarization + + segments = [] + for turn, _, speaker in annotation.itertracks(yield_label=True): + segments.append({"start": turn.start, "end": turn.end, "speaker": speaker}) + return segments + + +# --------------------------------------------------------------------------- +# Step 3: merge by timestamp overlap +# --------------------------------------------------------------------------- + +def assign_speakers(transcription: list, diarization: list) -> list: + results = [] + for seg in transcription: + # whisper offsets are in milliseconds + t0 = seg["offsets"]["from"] / 1000.0 + t1 = seg["offsets"]["to"] / 1000.0 + text = seg.get("text", "").strip() + if not text: + continue + + best_speaker = "UNKNOWN" + best_overlap = 0.0 + for d in diarization: + overlap = min(t1, d["end"]) - max(t0, d["start"]) + if overlap > best_overlap: + best_overlap = overlap + best_speaker = d["speaker"] + + results.append({"start": t0, "end": t1, "speaker": best_speaker, "text": text}) + return results + + +# --------------------------------------------------------------------------- +# Output formatting +# --------------------------------------------------------------------------- + +def _fmt_time(seconds: float) -> str: + m, s = divmod(int(seconds), 60) + h, m = divmod(m, 60) + return f"{h:02d}:{m:02d}:{s:02d}" + + +def format_output(segments: list) -> str: + lines = [] + for seg in segments: + ts = f"[{_fmt_time(seg['start'])} --> {_fmt_time(seg['end'])}]" + lines.append(f"{ts} {seg['speaker']}: {seg['text']}") + return "\n".join(lines) + + +def format_json(segments: list) -> str: + return json.dumps(segments, ensure_ascii=False, indent=2) + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + +def parse_args(): + p = argparse.ArgumentParser(description="Whisper.cpp + pyannote diarization pipeline") + p.add_argument("-f", "--file", required=True, help="Input WAV file (16 kHz, 16-bit)") + p.add_argument("-m", "--model", default="large-v3", help="ggml model name (default: large-v3)") + p.add_argument("-l", "--language", default="ja", help="Language code (default: ja)") + p.add_argument("-t", "--threads", type=int, default=4, help="whisper-cli thread count") + p.add_argument("--hf-token", required=True, help="HuggingFace access token") + p.add_argument("--num-speakers", type=int, default=None, help="Known speaker count (optional)") + p.add_argument("--output-json", action="store_true", help="Output JSON instead of plain text") + return p.parse_args() + + +def main(): + args = parse_args() + + if not os.path.exists(args.file): + sys.exit(f"Audio file not found: {args.file}") + + print("Step 1/3: Transcribing with whisper.cpp ...", file=sys.stderr) + transcription = run_whisper(args.file, args.model, args.language, args.threads) + + print("Step 2/3: Diarizing speakers with pyannote ...", file=sys.stderr) + diarization = run_diarization(args.file, args.hf_token, args.num_speakers) + 
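+    # assign_speakers() gives each transcript segment the speaker whose diarization
+    # turn overlaps it the most (see the Step 3 helpers above); segments with no
+    # overlapping turn keep the "UNKNOWN" label.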
+    print("Step 3/3: Merging results ...", file=sys.stderr)
+    segments = assign_speakers(transcription, diarization)
+
+    if args.output_json:
+        print(format_json(segments))
+    else:
+        print(format_output(segments))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/python/requirements-diarize.txt b/examples/python/requirements-diarize.txt
new file mode 100644
index 00000000..c89bdf3f
--- /dev/null
+++ b/examples/python/requirements-diarize.txt
@@ -0,0 +1,14 @@
+# Install in two steps:
+#   Step A (PyTorch CPU wheel):
+#     pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
+#   Step B (remaining packages from PyPI):
+#     pip install -r requirements-diarize.txt
+
+# Speaker diarization (diarize.py relies on the pyannote.audio 4.x API)
+pyannote.audio>=4.0.0
+
+# Hugging Face model hub access
+huggingface_hub>=0.20.0
+
+# Audio reading (avoids torchcodec/FFmpeg dependency on Windows)
+soundfile>=0.12.0
diff --git a/work/pyannote-audio/install.md b/work/pyannote-audio/install.md
new file mode 100644
index 00000000..62c28519
--- /dev/null
+++ b/work/pyannote-audio/install.md
@@ -0,0 +1,236 @@
+# pyannote-audio setup guide
+
+Step-by-step instructions for building a **speaker-labelled transcription pipeline for Japanese meeting recordings**, combining whisper.cpp (speech recognition) with pyannote-audio (speaker diarization).
+
+## Prerequisites
+
+- OS: Windows 11
+- Anaconda / Miniconda installed
+- Git for Windows installed
+- Repository: `c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp`
+
+---
+
+## Step 1: Build whisper.cpp
+
+```bash
+cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp
+
+cmake -B build
+cmake --build build -j --config Release
+```
+
+After the build completes, the following binary is produced:
+
+```
+build/bin/Release/whisper-cli.exe
+```
+
+---
+
+## Step 2: Create the Python environment
+
+> **Note:** pyannote.audio does not support Python 3.13, so **use Python 3.11**.
+
+```bash
+conda create -n whisper-diarize python=3.11 -y
+```
+
+---
+
+## Step 3: Install the packages
+
+Because `--index-url` would apply to every package in a single install, the packages must be installed **in two steps**.
+
+### Step A: PyTorch (CPU-only build)
+
+```bash
+conda run -n whisper-diarize pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
+```
+
+### Step B: pyannote.audio and related packages
+
+```bash
+conda run -n whisper-diarize pip install -r examples/python/requirements-diarize.txt
+```
+
+Contents of `requirements-diarize.txt` (`examples/python/requirements-diarize.txt`):
+
+```
+# Speaker diarization (diarize.py relies on the pyannote.audio 4.x API)
+pyannote.audio>=4.0.0
+
+# Hugging Face model hub access
+huggingface_hub>=0.20.0
+
+# Audio reading (avoids the torchcodec/FFmpeg dependency on Windows)
+soundfile>=0.12.0
+```
+
+> **Windows note:**
+> pyannote.audio 4.x uses `torchcodec` (which requires FFmpeg) for audio loading,
+> but the conda-forge FFmpeg can fail to install on Japanese-locale Windows.
+> Instead, the pipeline reads the WAV with `soundfile` and passes the tensor directly to pyannote.
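+
+A minimal sketch of this workaround (it mirrors what `examples/python/diarize.py` in this PR does; the token and the `meeting.wav` filename are placeholders):
+
+```python
+import soundfile as sf
+import torch
+from pyannote.audio import Pipeline
+
+# Decode the WAV ourselves instead of letting pyannote call torchcodec/FFmpeg.
+waveform, sample_rate = sf.read("meeting.wav", dtype="float32", always_2d=True)
+
+pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", token="hf_xxxxxxxxxxxx")
+
+# pyannote accepts a pre-loaded {"waveform": Tensor, "sample_rate": int} mapping;
+# the tensor must be shaped (channels, time), hence the transpose.
+result = pipeline({"waveform": torch.from_numpy(waveform.T), "sample_rate": sample_rate})
+```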
+
+### Verify the installation
+
+```bash
+C:/work/60.Tools/Anaconda/miniconda3/envs/whisper-diarize/python.exe -c "import torch, pyannote.audio, soundfile; print('torch:', torch.__version__); print('pyannote.audio:', pyannote.audio.__version__); print('soundfile:', soundfile.__version__); print('CPU only:', not torch.cuda.is_available())"
+```
+
+Expected output:
+
+```
+torch: 2.11.0+cpu
+pyannote.audio: 4.0.4
+soundfile: 0.13.1
+CPU only: True
+```
+
+---
+
+## Step 4: Get a HuggingFace token and accept the model terms
+
+The pyannote models live in gated repositories, so **all** of the following steps are required.
+
+### 4-1. Create a HuggingFace account
+
+Sign up at https://huggingface.co (skip if you already have an account).
+
+### 4-2. Accept the terms of use (all three are required)
+
+Visit each of the following pages and click **"Agree and access repository"**.
+
+| Model | URL |
+|--------|-----|
+| speaker-diarization-3.1 | https://huggingface.co/pyannote/speaker-diarization-3.1 |
+| segmentation-3.0 | https://huggingface.co/pyannote/segmentation-3.0 |
+| speaker-diarization-community-1 | https://huggingface.co/pyannote/speaker-diarization-community-1 |
+
+> **Note:** the third repository, `speaker-diarization-community-1`, is a dependency added in pyannote.audio 4.x.
+> If you do not accept its terms, you will get a `403 Forbidden` error at runtime.
+
+### 4-3. Issue an access token
+
+1. Go to https://huggingface.co/settings/tokens
+2. Click "New token"
+3. Select the `read` permission and create the token
+4. Note down the token string beginning with `hf_`
+
+---
+
+## Step 5: Download a model for Japanese
+
+```bash
+cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp
+
+# Accuracy-focused (recommended, 2.9 GB)
+bash models/download-ggml-model.sh large-v3
+
+# Speed-focused (for smoke testing, 142 MB)
+bash models/download-ggml-model.sh base
+```
+
+| Model | Size | Japanese accuracy |
+|--------|--------|------------|
+| large-v3 | 2.9 GB | Best |
+| medium | 1.5 GB | High |
+| small | 466 MB | Medium |
+| base | 142 MB | Low (smoke testing only) |
+
+---
+
+## Step 6: Prepare the audio file
+
+whisper.cpp only accepts **16 kHz / 16-bit / mono WAV**.
+Convert other formats with ffmpeg.
+
+```bash
+ffmpeg -i 会議録音.mp4 -ar 16000 -ac 1 -c:a pcm_s16le 会議録音.wav
+```
+
+---
+
+## Step 7: Run the pipeline
+
+Script: `examples/python/diarize.py`
+
+```bash
+cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp
+
+C:/work/60.Tools/Anaconda/miniconda3/envs/whisper-diarize/python.exe \
+  examples/python/diarize.py \
+  -f 会議録音.wav \
+  -m large-v3 \
+  --hf-token hf_xxxxxxxxxxxx \
+  --language ja
+```
+
+### Options
+
+| Option | Short | Default | Description |
+|---|---|---|---|
+| `--file` | `-f` | (required) | Input WAV file |
+| `--model` | `-m` | `large-v3` | ggml model name |
+| `--language` | `-l` | `ja` | Language code |
+| `--hf-token` | | (required) | HuggingFace access token |
+| `--num-speakers` | | auto-detect | Number of speakers (specifying a known count improves accuracy) |
+| `--threads` | `-t` | `4` | whisper-cli thread count |
+| `--output-json` | | off | Output JSON instead of plain text |
+
+### Example output
+
+```
+[00:00:00 --> 00:00:03] SPEAKER_00: 本日はお集まりいただきありがとうございます。
+[00:00:03 --> 00:00:07] SPEAKER_01: よろしくお願いします。
+[00:00:07 --> 00:00:12] SPEAKER_00: では、議題に入りましょう。
+[00:00:12 --> 00:00:18] SPEAKER_02: 先週の進捗を報告します。
+```
+
+---
+
+## Troubleshooting
+
+### Large numbers of `torchcodec` warnings
+
+```
+UserWarning: torchcodec is not installed correctly so built-in audio decoding will fail.
+```
+
+**→ Safe to ignore.**
+The `soundfile` workaround is in effect, so these warnings do not affect actual operation.
+
+### `403 Forbidden` / `GatedRepoError`
+
+You do not have access to the pyannote models.
+**→ Confirm that you have accepted the terms for all three repositories listed in step 4-2.**
+
+### `TypeError: Pipeline.from_pretrained() got an unexpected keyword argument 'use_auth_token'`
+
+The `use_auth_token` argument was removed in pyannote.audio 4.x.
+**→ Pass `token=` instead (`diarize.py` already does this).**
+
+### `AttributeError: 'DiarizeOutput' object has no attribute 'itertracks'`
+
+pyannote.audio 4.x changed the pipeline output type to `DiarizeOutput`.
+**→ Use `result.speaker_diarization.itertracks()` (`diarize.py` already does this).**
+
+---
+
+## Environment (verified working)
+
+| Item | Version |
+|---|---|
+| OS | Windows 11 Pro 10.0.26200 |
+| Python | 3.11 (conda) |
+| torch | 2.11.0+cpu |
+| pyannote.audio | 4.0.4 |
+| soundfile | 0.13.1 |
+| whisper.cpp | v1.8.4 (master) |
+| whisper-cli model | ggml-base.bin / ggml-large-v3.bin |