Feature: combining whisper.cpp (speech recognition) and pyannote-audio (speaker diarization)
parent fc674574ca
commit 147a83122f

@@ -0,0 +1,143 @@ CLAUDE.md
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Build Commands

```bash
# Standard build (Release by default on non-MSVC)
cmake -B build
cmake --build build -j --config Release

# GPU backends
cmake -B build -DGGML_CUDA=1    # NVIDIA CUDA
cmake -B build -DGGML_VULKAN=1  # Vulkan (cross-vendor)
cmake -B build -DGGML_METAL=1   # Apple Metal
cmake -B build -DGGML_BLAS=1    # CPU via OpenBLAS

# Optional features
cmake -B build -DWHISPER_SDL2=ON    # Enable SDL2 for real-time audio (stream example)
cmake -B build -DWHISPER_CURL=ON    # Enable libcurl for model download
cmake -B build -DWHISPER_COREML=ON  # Apple Core ML encoder (Apple Silicon only)

# Sanitizers
cmake -B build -DWHISPER_SANITIZE_ADDRESS=ON
cmake -B build -DWHISPER_SANITIZE_THREAD=ON
cmake -B build -DWHISPER_SANITIZE_UNDEFINED=ON
```

## Running Tests

Tests require model files (downloaded separately) and use CTest:

```bash
# Build with tests enabled (on by default when building standalone)
cmake -B build -DWHISPER_BUILD_TESTS=ON
cmake --build build -j --config Release

# Run all tests
cd build && ctest

# Run a specific test by label
cd build && ctest -L tiny

# Run a single integration test manually (requires model at models/for-tests-ggml-tiny.en.bin)
./build/bin/whisper-cli -m models/for-tests-ggml-tiny.en.bin -f samples/jfk.wav

# Run the VAD unit-test binary
./build/bin/test-vad
```

## Downloading Models

```bash
# Download a pre-converted ggml model
bash ./models/download-ggml-model.sh base.en   # or: tiny, small, medium, large-v3, etc.

# Download a VAD model (for the --vad flag)
bash ./models/download-vad-model.sh silero-v6.2.0

# Convenience Makefile targets (download model + build + run on samples/)
make base.en
make tiny
```

## Transcribing Audio

Audio must be 16-bit WAV at 16 kHz. Convert with ffmpeg:
```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
./build/bin/whisper-cli -m models/ggml-base.en.bin -f output.wav
```

## Quantization

```bash
./build/bin/quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0
./build/bin/whisper-cli -m models/ggml-base.en-q5_0.bin -f samples/jfk.wav
```

---

## Architecture Overview

### Two-layer design

```
include/whisper.h   ← Public C API (cross-language compatible)
src/whisper.cpp     ← Whisper model: audio preprocessing, encoder, decoder, beam search
src/whisper-arch.h  ← Tensor name map (encoder/decoder/cross-attention weight paths in ggml format)
ggml/               ← Tensor math library (git subtree from ggml-org/ggml)
```

`whisper_context` holds the loaded model weights (shared, read-only across threads). `whisper_state` holds per-inference mutable state (KV cache, mel buffers). You can create multiple states from one context for parallel inference.

### ggml subdirectory

`ggml/` is a git subtree (synced via `sync` commits). Do not edit it directly unless you are making changes intended to be upstreamed. Hardware backends live in `ggml/src/`:

- `ggml-cpu/` — generic CPU with NEON/AVX/VSX intrinsics
- `ggml-cuda/` — CUDA kernels
- `ggml-metal/` — Metal shaders (Apple)
- `ggml-vulkan/` — Vulkan compute shaders
- `ggml-sycl/` — SYCL (Intel)

### Whisper pipeline (inside `src/whisper.cpp`)

1. **Audio preprocessing** — raw PCM → log-Mel spectrogram (80 mel bins, 30-second chunks at 16 kHz; frame arithmetic sketched below)
2. **Encoder** — convolutional feature extraction + transformer encoder; optional Core ML / OpenVINO offload
3. **Decoder** — autoregressive transformer decoder with optional beam search, temperature fallback, and cross-attention timestamps
4. **VAD** — optional pre-pass using Silero-VAD to skip silence before encoding
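
The 30-second chunk length fixes the encoder's input shape. As a rough sketch of the frame arithmetic, assuming Whisper's standard 160-sample hop (10 ms at 16 kHz), a constant taken from the reference Whisper implementation rather than stated in this file:

```python
SAMPLE_RATE = 16_000   # required input rate (Hz)
CHUNK_SECONDS = 30     # fixed encoder window
HOP_LENGTH = 160       # samples between mel frames (assumed; 10 ms)
N_MEL = 80             # mel bins per frame

frames = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH
print(f"{frames} frames x {N_MEL} mel bins per 30 s chunk")  # 3000 x 80
```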

### Examples (`examples/`)

Shared utilities used by all examples live at the top level of `examples/`:

- `common.h` / `common.cpp` — CLI arg parsing, vocab helpers
- `common-whisper.h` / `common-whisper.cpp` — WAV reading, timestamp formatting
- `common-sdl.h` / `common-sdl.cpp` — SDL2 audio capture (stream example only)
- `grammar-parser.h` / `grammar-parser.cpp` — GBNF grammar parsing for constrained decoding

Key example binaries:

| Binary | Source | Purpose |
|--------|--------|---------|
| `whisper-cli` | `examples/cli/` | Primary file transcription tool |
| `whisper-stream` | `examples/stream/` | Real-time mic input (needs SDL2) |
| `whisper-server` | `examples/server/` | HTTP API server |
| `whisper-bench` | `examples/bench/` | Inference benchmarking |
| `quantize` | `examples/quantize/` | Model quantization |
| `vad-speech-segments` | `examples/vad-speech-segments/` | VAD-only segment extraction |

### Bindings (`bindings/`)

Language bindings wrap the C API in `include/whisper.h`:

- `bindings/go/` — Go
- `bindings/java/` — JNI (used by the Android example)
- `bindings/javascript/` — WASM/Node.js (built via Emscripten)
- `bindings/ruby/` — Ruby

### Model format

Models are stored in a custom `ggml` binary format (not GGUF). The original OpenAI PyTorch weights are converted with `models/convert-pt-to-ggml.py`. Pre-converted models are available from HuggingFace (`ggerganov/whisper.cpp`). Tensor names follow the pattern defined in `src/whisper-arch.h`.
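
When scripting around model files, a cheap sanity check is to read the magic number at the start of the file. A minimal sketch, assuming the little-endian `0x67676d6c` magic written by `models/convert-pt-to-ggml.py` (verify against the converter in your checkout):

```python
import struct

GGML_MAGIC = 0x67676D6C  # "ggml" magic; assumed from the converter script

def looks_like_ggml(path: str) -> bool:
    """Check only the 4-byte magic, not the full header."""
    with open(path, "rb") as f:
        head = f.read(4)
    return len(head) == 4 and struct.unpack("<I", head)[0] == GGML_MAGIC

print(looks_like_ggml("models/ggml-base.en.bin"))
```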

## Windows-specific notes

The project builds with MSVC. The CMakeLists.txt defines `_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR` on Windows to work around an MSVC STL issue that causes crashes in the Java bindings. Several MSVC warnings are suppressed project-wide (see the `MSVC_WARNING_FLAGS` block at the bottom of `CMakeLists.txt`).
@@ -0,0 +1,211 @@ examples/python/diarize.py

#!/usr/bin/env python3
"""
Meeting transcription with per-speaker labels.

Pipeline:
  1. whisper.cpp (whisper-cli) -> timestamped transcript JSON
  2. pyannote-audio -> speaker segments
  3. merge by timestamp overlap -> labelled output

Usage:
  python diarize.py -f meeting.wav -m large-v3 --hf-token hf_xxx

Requirements:
  pip install -r requirements-diarize.txt
  (HuggingFace token with pyannote/speaker-diarization-3.1 terms accepted)
"""

import argparse
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path

import soundfile as sf
import torch
from pyannote.audio import Pipeline


# ---------------------------------------------------------------------------
# Step 1: whisper.cpp transcription
# ---------------------------------------------------------------------------

def find_whisper_cli() -> str:
    script_dir = Path(__file__).parent
    repo_root = script_dir.parent.parent
    candidates = [
        repo_root / "build" / "bin" / "Release" / "whisper-cli.exe",  # Windows MSVC
        repo_root / "build" / "bin" / "whisper-cli.exe",              # Windows MinGW
        repo_root / "build" / "bin" / "whisper-cli",                  # Linux/Mac
    ]
    for p in candidates:
        if p.exists():
            return str(p)
    raise FileNotFoundError(
        "whisper-cli not found. Build the project first:\n"
        "  cmake -B build && cmake --build build -j --config Release"
    )


def run_whisper(audio_path: str, model: str, language: str, threads: int) -> list:
    cli = find_whisper_cli()

    repo_root = Path(__file__).parent.parent.parent
    model_path = repo_root / "models" / f"ggml-{model}.bin"
    if not model_path.exists():
        raise FileNotFoundError(
            f"Model not found: {model_path}\n"
            f"Download with: bash models/download-ggml-model.sh {model}"
        )

    with tempfile.TemporaryDirectory() as tmpdir:
        out_base = os.path.join(tmpdir, "out")
        cmd = [
            cli,
            "-m", str(model_path),
            "-f", audio_path,
            "-l", language,
            "-t", str(threads),
            "--output-json",
            "--output-file", out_base,
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"whisper-cli failed:\n{result.stderr}")

        json_path = out_base + ".json"
        if not os.path.exists(json_path):
            raise RuntimeError(
                "whisper-cli did not produce a JSON file. "
                "stderr:\n" + result.stderr
            )

        with open(json_path, encoding="utf-8") as f:
            data = json.load(f)

    return data.get("transcription", [])
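
# JSON shape this relies on (inferred from assign_speakers() below):
#   {"transcription": [{"offsets": {"from": 0, "to": 3000}, "text": " ..."}, ...]}
# where "from"/"to" are offsets in milliseconds.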


# ---------------------------------------------------------------------------
# Step 2: pyannote-audio speaker diarization
# ---------------------------------------------------------------------------

def run_diarization(audio_path: str, hf_token: str, num_speakers: int | None) -> list:
    print("Loading pyannote speaker-diarization-3.1 (CPU) ...", file=sys.stderr)
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        token=hf_token,
    )
    pipeline.to(torch.device("cpu"))

    # Use soundfile to avoid the torchcodec/FFmpeg dependency on Windows.
    # pyannote accepts a pre-loaded {'waveform': Tensor, 'sample_rate': int} dict.
    waveform, sample_rate = sf.read(audio_path, dtype="float32", always_2d=True)
    waveform_tensor = torch.from_numpy(waveform.T)  # (channels, time)
    audio_input = {"waveform": waveform_tensor, "sample_rate": sample_rate}

    kwargs = {}
    if num_speakers is not None:
        kwargs["num_speakers"] = num_speakers

    print("Running diarization ...", file=sys.stderr)
    result = pipeline(audio_input, **kwargs)

    # pyannote.audio 4.x returns DiarizeOutput; the Annotation is in .speaker_diarization
    annotation = result.speaker_diarization

    segments = []
    for turn, _, speaker in annotation.itertracks(yield_label=True):
        segments.append({"start": turn.start, "end": turn.end, "speaker": speaker})
    return segments


# ---------------------------------------------------------------------------
# Step 3: merge by timestamp overlap
# ---------------------------------------------------------------------------

def assign_speakers(transcription: list, diarization: list) -> list:
    results = []
    for seg in transcription:
        # whisper offsets are in milliseconds
        t0 = seg["offsets"]["from"] / 1000.0
        t1 = seg["offsets"]["to"] / 1000.0
        text = seg.get("text", "").strip()
        if not text:
            continue

        best_speaker = "UNKNOWN"
        best_overlap = 0.0
        for d in diarization:
            overlap = min(t1, d["end"]) - max(t0, d["start"])
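            # A negative value means the segments are disjoint; it can never
            # beat best_overlap (initialised to 0.0), so no clamping is needed.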
            if overlap > best_overlap:
                best_overlap = overlap
                best_speaker = d["speaker"]

        results.append({"start": t0, "end": t1, "speaker": best_speaker, "text": text})
    return results


# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------

def _fmt_time(seconds: float) -> str:
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def format_output(segments: list) -> str:
    lines = []
    for seg in segments:
        ts = f"[{_fmt_time(seg['start'])} --> {_fmt_time(seg['end'])}]"
        lines.append(f"{ts} {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)


def format_json(segments: list) -> str:
    return json.dumps(segments, ensure_ascii=False, indent=2)


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def parse_args():
    p = argparse.ArgumentParser(description="Whisper.cpp + pyannote diarization pipeline")
    p.add_argument("-f", "--file", required=True, help="Input WAV file (16 kHz, 16-bit)")
    p.add_argument("-m", "--model", default="large-v3", help="ggml model name (default: large-v3)")
    p.add_argument("-l", "--language", default="ja", help="Language code (default: ja)")
    p.add_argument("-t", "--threads", type=int, default=4, help="whisper-cli thread count")
    p.add_argument("--hf-token", required=True, help="HuggingFace access token")
    p.add_argument("--num-speakers", type=int, default=None, help="Known speaker count (optional)")
    p.add_argument("--output-json", action="store_true", help="Output JSON instead of plain text")
    return p.parse_args()


def main():
    args = parse_args()

    if not os.path.exists(args.file):
        sys.exit(f"Audio file not found: {args.file}")

    print("Step 1/3: Transcribing with whisper.cpp ...", file=sys.stderr)
    transcription = run_whisper(args.file, args.model, args.language, args.threads)

    print("Step 2/3: Diarizing speakers with pyannote ...", file=sys.stderr)
    diarization = run_diarization(args.file, args.hf_token, args.num_speakers)

    print("Step 3/3: Merging results ...", file=sys.stderr)
    segments = assign_speakers(transcription, diarization)

    if args.output_json:
        print(format_json(segments))
    else:
        print(format_output(segments))


if __name__ == "__main__":
    main()

@@ -0,0 +1,14 @@ examples/python/requirements-diarize.txt

# Install in two steps:
#   Step A (PyTorch CPU wheel):
#     pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
#   Step B (remaining packages from PyPI):
#     pip install -r requirements-diarize.txt

# Speaker diarization
pyannote.audio>=3.1.0

# Hugging Face model hub access
huggingface_hub>=0.20.0

# Audio reading (avoids the torchcodec/FFmpeg dependency on Windows)
soundfile>=0.12.0
@ -0,0 +1,236 @@
|
|||
# pyannote-audio 導入手順
|
||||
|
||||
whisper.cpp(音声認識)と pyannote-audio(話者識別)を組み合わせた、
|
||||
**日本語会議録音の話者別文字起こしパイプライン**の構築手順です。
|
||||
|
||||
## 前提条件
|
||||
|
||||
- OS: Windows 11
|
||||
- Anaconda / Miniconda がインストール済みであること
|
||||
- Git for Windows がインストール済みであること
|
||||
- リポジトリ: `c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp`
|
||||
|
||||
---
|
||||
|
||||
## ステップ 1: whisper.cpp のビルド
|
||||
|
||||
```bash
|
||||
cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp
|
||||
|
||||
cmake -B build
|
||||
cmake --build build -j --config Release
|
||||
```
|
||||
|
||||
ビルド完了後、以下のバイナリが生成されます:
|
||||
|
||||
```
|
||||
build/bin/Release/whisper-cli.exe
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## ステップ 2: Python 仮想環境の作成
|
||||
|
||||
> **注意:** Python 3.13 は pyannote.audio 非対応のため、**3.11 を使用すること**。
|
||||
|
||||
```bash
|
||||
conda create -n whisper-diarize python=3.11 -y
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## ステップ 3: パッケージのインストール
|
||||
|
||||
`--index-url` の適用範囲の問題により、**2段階でインストール**する必要があります。
|
||||
|
||||
### Step A: PyTorch(CPU専用ビルド)
|
||||
|
||||
```bash
|
||||
conda run -n whisper-diarize pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
|
||||
```
|
||||
|
||||
### Step B: pyannote.audio および関連パッケージ
|
||||
|
||||
```bash
|
||||
conda run -n whisper-diarize pip install -r examples/python/requirements-diarize.txt
|
||||
```
|
||||
|
||||
`requirements-diarize.txt` の内容(`examples/python/requirements-diarize.txt`):
|
||||
|
||||
```
|
||||
# Speaker diarization
|
||||
pyannote.audio>=3.1.0
|
||||
|
||||
# Hugging Face model hub access
|
||||
huggingface_hub>=0.20.0
|
||||
|
||||
# Audio reading(Windows で torchcodec/FFmpeg 不要にするための回避策)
|
||||
soundfile>=0.12.0
|
||||
```
|
||||
|
||||
> **Windows の注意事項:**
|
||||
> pyannote.audio 4.x は音声読み込みに `torchcodec`(FFmpeg 必須)を使用しますが、
|
||||
> conda-forge の FFmpeg は Windows 日本語環境でインストールエラーになる場合があります。
|
||||
> 代わりに `soundfile` でWAVを読み込み、テンソルとして直接 pyannote に渡す方式を採用しています。
|
||||
|
||||
### インストール確認
|
||||
|
||||
```bash
|
||||
C:/work/60.Tools/Anaconda/miniconda3/envs/whisper-diarize/python.exe -c ^
|
||||
"import torch, pyannote.audio, soundfile; ^
|
||||
print('torch:', torch.__version__); ^
|
||||
print('pyannote:', pyannote.audio.__version__); ^
|
||||
print('soundfile:', soundfile.__version__); ^
|
||||
print('CPU only:', not torch.cuda.is_available())"
|
||||
```
|
||||
|
||||
期待される出力:
|
||||
|
||||
```
|
||||
torch: 2.11.0+cpu
|
||||
pyannote.audio: 4.0.4
|
||||
soundfile: 0.13.1
|
||||
CPU only: True
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## ステップ 4: HuggingFace トークンの取得と利用規約への同意
|
||||
|
||||
pyannote のモデルはゲート付きリポジトリのため、以下の手順が**全て必要**です。
|
||||
|
||||
### 4-1. HuggingFace アカウント作成
|
||||
|
||||
https://huggingface.co でサインアップ(既存アカウントがあればスキップ)。
|
||||
|
||||
### 4-2. 利用規約への同意(3つ全て必要)
|
||||
|
||||
以下の各ページにアクセスし、**"Agree and access repository"** をクリックする。
|
||||
|
||||
| モデル | URL |
|
||||
|--------|-----|
|
||||
| speaker-diarization-3.1 | https://huggingface.co/pyannote/speaker-diarization-3.1 |
|
||||
| segmentation-3.0 | https://huggingface.co/pyannote/segmentation-3.0 |
|
||||
| speaker-diarization-community-1 | https://huggingface.co/pyannote/speaker-diarization-community-1 |
|
||||
|
||||
> **注意:** 3つ目の `speaker-diarization-community-1` は pyannote.audio 4.x から追加された依存リポジトリです。
|
||||
> 同意しないと実行時に `403 Forbidden` エラーが発生します。
|
||||
|
||||
### 4-3. アクセストークンの発行
|
||||
|
||||
1. https://huggingface.co/settings/tokens にアクセス
|
||||
2. "New token" をクリック
|
||||
3. 権限: `read` を選択して作成
|
||||
4. `hf_` で始まるトークン文字列を控えておく
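
The token and the three gated repositories can be verified up front from Python. A minimal sketch (assumes a recent `huggingface_hub`; `auth_check` is not available in older releases):

```python
from huggingface_hub import HfApi, auth_check

TOKEN = "hf_xxxxxxxxxxxx"  # your token

print(HfApi().whoami(token=TOKEN)["name"])  # fails fast on an invalid token

for repo in (
    "pyannote/speaker-diarization-3.1",
    "pyannote/segmentation-3.0",
    "pyannote/speaker-diarization-community-1",
):
    auth_check(repo, token=TOKEN)  # raises GatedRepoError if terms are not accepted
    print("OK:", repo)
```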

---

## Step 5: Download a model for Japanese

```bash
cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp

# Accuracy first (recommended, 2.9 GB)
bash models/download-ggml-model.sh large-v3

# Speed first (for smoke testing, 142 MB)
bash models/download-ggml-model.sh base
```

| Model | Size | Japanese accuracy |
|--------|--------|------------|
| large-v3 | 2.9 GB | Best |
| medium | 1.5 GB | High |
| small | 466 MB | Medium |
| base | 142 MB | Low (smoke testing only) |

---

## Step 6: Prepare the audio file

whisper.cpp accepts **16 kHz / 16-bit / mono WAV** only.
Convert other formats with ffmpeg.

```bash
ffmpeg -i 会議録音.mp4 -ar 16000 -ac 1 -c:a pcm_s16le 会議録音.wav
```
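
To confirm the converted file really meets these constraints before a long run, a quick check with `soundfile` (installed in step 3):

```python
import soundfile as sf

info = sf.info("会議録音.wav")
assert info.samplerate == 16000, f"expected 16 kHz, got {info.samplerate}"
assert info.channels == 1, f"expected mono, got {info.channels} channels"
assert info.subtype == "PCM_16", f"expected 16-bit PCM, got {info.subtype}"
print(info)
```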

---

## Step 7: Run

Script: `examples/python/diarize.py`

```bash
cd c:\work\30.Projects\102.AI_Projects\whisper.cpp\whisper.cpp

C:/work/60.Tools/Anaconda/miniconda3/envs/whisper-diarize/python.exe \
    examples/python/diarize.py \
    -f 会議録音.wav \
    -m large-v3 \
    --hf-token hf_xxxxxxxxxxxx \
    --language ja
```

### Options

| Option | Short | Default | Description |
|---|---|---|---|
| `--file` | `-f` | (required) | Input WAV file |
| `--model` | `-m` | `large-v3` | ggml model name |
| `--language` | `-l` | `ja` | Language code |
| `--hf-token` | | (required) | HuggingFace token |
| `--num-speakers` | | auto-detect | Number of speakers (setting it when known improves accuracy) |
| `--threads` | `-t` | `4` | whisper-cli thread count |
| `--output-json` | | off | Emit JSON instead of plain text |

### Example output

```
[00:00:00 --> 00:00:03] SPEAKER_00: 本日はお集まりいただきありがとうございます。
[00:00:03 --> 00:00:07] SPEAKER_01: よろしくお願いします。
[00:00:07 --> 00:00:12] SPEAKER_00: では、議題に入りましょう。
[00:00:12 --> 00:00:18] SPEAKER_02: 先週の進捗を報告します。
```

---

## Troubleshooting

### Large numbers of `torchcodec` warnings

```
UserWarning: torchcodec is not installed correctly so built-in audio decoding will fail.
```

**→ Safe to ignore.**
The `soundfile` workaround is in effect, so the warnings do not affect actual operation.
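
If the repeated warnings clutter the log, they can be silenced with the standard library before pyannote is imported; a minimal sketch:

```python
import warnings

# Suppress the harmless torchcodec UserWarning described above.
warnings.filterwarnings("ignore", message=".*torchcodec.*", category=UserWarning)
```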

### `403 Forbidden` / `GatedRepoError`

You do not have access to the pyannote models.
**→ Check that the terms of all three repositories from step 4-2 have been accepted.**

### `TypeError: Pipeline.from_pretrained() got an unexpected keyword argument 'use_auth_token'`

The `use_auth_token` argument was removed in pyannote.audio 4.x.
**→ Switch to `token=` (already fixed in `diarize.py`).**
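
For reference, the 4.x-compatible call as used in `diarize.py`:

```python
from pyannote.audio import Pipeline

# pyannote.audio 4.x: `token=` replaces the removed `use_auth_token=`.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    token="hf_xxxxxxxxxxxx",
)
```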

### `AttributeError: 'DiarizeOutput' object has no attribute 'itertracks'`

pyannote.audio 4.x changed the pipeline's return type to `DiarizeOutput`.
**→ Use `result.speaker_diarization.itertracks()` (already fixed in `diarize.py`).**
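
If one script has to support both major versions, a hedged fallback is to unwrap the attribute only when it exists (pyannote.audio 3.x returns the `Annotation` directly; treat that as an assumption and test against your installed version):

```python
result = pipeline(audio_input)
# 4.x: DiarizeOutput exposing .speaker_diarization; 3.x: the Annotation itself.
annotation = getattr(result, "speaker_diarization", result)
for turn, _, speaker in annotation.itertracks(yield_label=True):
    print(turn.start, turn.end, speaker)
```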

---

## Environment (verified working)

| Item | Version |
|---|---|
| OS | Windows 11 Pro 10.0.26200 |
| Python | 3.11 (conda) |
| torch | 2.11.0+cpu |
| pyannote.audio | 4.0.4 |
| soundfile | 0.13.1 |
| whisper.cpp | v1.8.4 (master) |
| whisper-cli models | ggml-base.bin / ggml-large-v3.bin |