
Voice Activity Detection

Detect voice activity in audio buffers

Voice Activity Detection (VAD) localizes parts of an audio buffer where human voice is present. Use cases include:

  1. Silence removal in audio and video editing applications
  2. Transcription optimization: chunking long audio files into speech-aligned segments so they can be transcribed in fast parallel requests with higher accuracy
  3. Word-level timestamp refinement: clamping the lower-resolution word-level timestamps returned from transcription to the higher-resolution VAD results, improving the synchronization of video captions (see the sketch below)
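
As an illustration of the third use case, the sketch below clamps a word's timing into the overlapping voice range. The WordTiming type and the voiceRanges input are hypothetical stand-ins rather than Argmax SDK types; only the clamping logic is the point:

// Hypothetical word timing as returned by a transcription pass
struct WordTiming {
    let word: String
    var start: Double
    var end: Double
}

// Clamp a word's start/end into the overlapping voice range so captions
// never begin or end inside silence. `voiceRanges` would be derived from
// VAD output (see the segment-building sketch later on this page).
func clamp(_ word: WordTiming, to voiceRanges: [(start: Double, end: Double)]) -> WordTiming {
    guard let range = voiceRanges.first(where: { $0.end > word.start && $0.start < word.end }) else {
        return word  // no overlapping voice range; leave the timing as-is
    }
    var refined = word
    refined.start = max(word.start, range.start)
    refined.end = min(word.end, range.end)
    return refined
}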

Argmax Pro SDK offers high-accuracy VAD through an optimized deep learning model (Pyannote v3). For accuracy benchmarks, please refer to Table 1 of this paper.

Basic Example

Pro SDK

import Argmax
 
// Initialize Argmax SDK to enable Pro access
await ArgmaxSDK.with(ArgmaxConfig(apiKey: "ax_*****"))
 
// Load and process audio
let audioBuffer = try AudioProcessor.loadAudio(fromPath: "path/to/audio.m4a")
let audioArray = AudioProcessor.convertBufferToArray(buffer: audioBuffer)
 
// Initialize VAD with model
let vad = try await VoiceActivityDetector.modelVAD()
let voiceSegments = vad.voiceActivity(in: audioArray)
 
// Process results
for (index, isVoice) in voiceSegments.enumerated() {
    let timestamp = vad.voiceActivityIndexToSeconds(index)
    print("\(timestamp)s: \(isVoice ? "Voice" : "Silence")")
}
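
For use cases like silence removal, contiguous time ranges are often more useful than per-frame labels. Here is a minimal sketch that collapses the frame-level results above into (start, end) voice ranges; it reuses the vad and voiceSegments values from the example and wraps timestamps in Double, since the exact return type of voiceActivityIndexToSeconds may differ:

// Collapse per-frame voice flags into contiguous (start, end) ranges in seconds
var ranges: [(start: Double, end: Double)] = []
var rangeStart: Double? = nil

for (index, isVoice) in voiceSegments.enumerated() {
    let time = Double(vad.voiceActivityIndexToSeconds(index))
    if isVoice, rangeStart == nil {
        rangeStart = time                         // voice begins
    } else if !isVoice, let start = rangeStart {
        ranges.append((start: start, end: time))  // voice ends
        rangeStart = nil
    }
}

// Close a final range that runs to the end of the buffer
if let start = rangeStart {
    ranges.append((start: start, end: Double(vad.voiceActivityIndexToSeconds(voiceSegments.count))))
}

for range in ranges {
    print("Voice from \(range.start)s to \(range.end)s")
}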

Open-source SDK

Voice Activity Detection is not yet supported in the Open-source SDK as a standalone API. Please see Open-source vs Pro SDK for details.

Advanced Features

Model Configuration

VoiceActivityDetector.modelVAD uses a 1.5 MB SpeakerSegmenter model that provides high-accuracy voice detection at ~16 ms resolution.
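
As a rough sanity check on what that resolution means in practice, treating ~16 ms as a uniform frame hop (an assumption, not a documented constant):

// Back-of-envelope frame math, assuming a uniform ~16 ms hop per VAD frame
let frameDuration = 0.016                        // seconds per frame (assumed)
let framesPerMinute = Int(60.0 / frameDuration)  // ≈ 3,750 frames per minute of audio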

You can customize the model configuration like so:

let vad = try await VoiceActivityDetector.modelVAD(
    modelInfo: ModelInfo.segmenter(version: "segmenter version", variant: "segmenter variant"),
    modelFolder: customModelPath,  // Optional: custom model location
    concurrentInferences: 4  // Optional: parallel processing
)

The VAD model is the same speaker segmenter model used in SpeakerKitPro and is available on Hugging Face. The default modelInfo configuration will automatically download and use the latest model, which is recommended for most use cases.

Integration with File Transcription

VAD is also integrated with WhisperKitPro transcription to improve accuracy and speed. Please see VAD-based Chunking for details.
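
The linked page documents the built-in integration. As a rough illustration of the underlying idea, and not the WhisperKitPro API, long audio can be split at silence boundaries so each chunk begins and ends on speech. The sketch below assumes 16 kHz samples and reuses the vad, audioArray, and voiceSegments values from the basic example:

// Illustrative only: split samples at silent frames so each chunk aligns with speech
func chunkAudio(
    _ samples: [Float],
    voiceSegments: [Bool],
    vad: VoiceActivityDetector,
    sampleRate: Double = 16_000  // assumed sample rate
) -> [[Float]] {
    var chunks: [[Float]] = []
    var chunkStart: Int? = nil

    for (index, isVoice) in voiceSegments.enumerated() {
        let sample = min(Int(Double(vad.voiceActivityIndexToSeconds(index)) * sampleRate), samples.count)
        if isVoice, chunkStart == nil {
            chunkStart = sample                       // speech begins: open a chunk
        } else if !isVoice, let start = chunkStart {
            chunks.append(Array(samples[start..<sample]))  // speech ends: close the chunk
            chunkStart = nil
        }
    }
    if let start = chunkStart, start < samples.count {
        chunks.append(Array(samples[start...]))       // final chunk runs to end of buffer
    }
    return chunks
}

let chunks = chunkAudio(audioArray, voiceSegments: voiceSegments, vad: vad)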