Voice Activity Detection
Detect voice activity in audio buffers
Voice Activity Detection (VAD) localizes parts of an audio buffer where human voice is present. Use cases include:
- Silence removal in audio and video editing applications
- Transcription optimization: chunking long audio files into speech-aligned segments enables fast parallel processing, increasing both transcription accuracy and speed
- Word-level timestamp refinement: higher-resolution VAD results can clamp the lower-resolution word-level timestamps returned from the transcription request, improving the synchronization of video captions (see the sketch after this list)
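To make the last use case concrete, here is a minimal sketch of timestamp clamping. It is plain Swift, independent of the SDK; the WordTiming type and the sample values are hypothetical:

// Hypothetical types and values, for illustration only
struct WordTiming {
    var word: String
    var start: Double // seconds
    var end: Double   // seconds
}

// Coarse word timing from transcription, finer voice segment from VAD
var word = WordTiming(word: "hello", start: 0.20, end: 0.95)
let voiceSegment: (start: Double, end: Double) = (0.32, 0.88)

// Clamp the word boundaries to the enclosing voice segment
word.start = max(word.start, voiceSegment.start) // 0.32
word.end = min(word.end, voiceSegment.end)       // 0.88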
Argmax Pro SDK offers high-accuracy VAD through an optimized deep learning model (Pyannote v3). For accuracy benchmarks, please refer to Table 1 of this paper.
Basic Example
Pro SDK
import Argmax
// Initialize Argmax SDK to enable Pro access
await ArgmaxSDK.with(ArgmaxConfig(apiKey: "ax_*****"))
// Load and process audio
let audioBuffer = try AudioProcessor.loadAudio(fromPath: "path/to/audio.m4a")
let audioArray = AudioProcessor.convertBufferToArray(buffer: audioBuffer)
// Initialize VAD with model
let vad = try await VoiceActivityDetector.modelVAD()
let voiceSegments = vad.voiceActivity(in: audioArray)
// Process results
for (index, isVoice) in voiceSegments.enumerated() {
    let timestamp = vad.voiceActivityIndexToSeconds(index)
    print("\(timestamp)s: \(isVoice ? "Voice" : "Silence")")
}
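Beyond per-frame printing, the boolean frames can be merged into contiguous time ranges, which is what use cases like silence removal need. A minimal sketch (the merging logic below is illustrative, not an SDK API):

// Merge per-frame VAD flags into contiguous (start, end) voice ranges
var ranges: [(start: Double, end: Double)] = []
var rangeStart: Double? = nil
for (index, isVoice) in voiceSegments.enumerated() {
    let time = Double(vad.voiceActivityIndexToSeconds(index))
    if isVoice, rangeStart == nil {
        rangeStart = time                        // voice begins
    } else if !isVoice, let start = rangeStart {
        ranges.append((start: start, end: time)) // voice ends
        rangeStart = nil
    }
}
if let start = rangeStart { // audio ends while voice is still active
    let end = Double(vad.voiceActivityIndexToSeconds(voiceSegments.count))
    ranges.append((start: start, end: end))
}
print(ranges)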
Open-source SDK
Voice Activity Detection is not yet supported in the Open-source SDK as a standalone API. Please see Open-source vs Pro SDK.
Advanced Features
Model Configuration
VoiceActivityDetector.modelVAD uses a 1.5 MB SpeakerSegmenter model that provides high-accuracy voice detection at ~16 ms resolution.
You can customize the model configuration like so:
let vad = try await VoiceActivityDetector.modelVAD(
    modelInfo: ModelInfo.segmenter(version: "segmenter version", variant: "segmenter variant"),
    modelFolder: customModelPath, // Optional: custom model location
    concurrentInferences: 4       // Optional: parallel processing
)
The VAD model is the same as the speaker segmenter model used in SpeakerKitPro and is available on Hugging Face. The default modelInfo configuration will automatically download and use the latest model, which is recommended for most use cases.
Integration with File Transcription
VAD is also integrated with WhisperKitPro transcription to improve accuracy and speed. Please see VAD-based Chunking for details.
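As a rough sketch of what this looks like in practice, the open-source WhisperKit API exposes a chunkingStrategy option on DecodingOptions; this sketch assumes WhisperKitPro mirrors that API, and the exact Pro configuration is covered in the VAD-based Chunking guide:

import WhisperKit

// Sketch using open-source WhisperKit names; see the VAD-based
// Chunking guide for the WhisperKitPro equivalent
let whisperKit = try await WhisperKit()
let results = try await whisperKit.transcribe(
    audioPath: "path/to/audio.m4a",
    decodeOptions: DecodingOptions(chunkingStrategy: .vad)
)
print(results.map(\.text).joined(separator: " "))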