File Transcription
Implementing file-based speech-to-text in your applications
File transcription processes a complete audio file after recording finishes, unlike real-time transcription, which processes audio in a streaming fashion while it is being captured.
If this is your first time, start with the Open-source SDK. You can always upgrade to the Pro SDK later for more features and better performance.
Basic Example
Pro SDK
Argmax Pro SDK includes the WhisperKitPro framework that implements file transcription:
import Argmax
// Initialize Argmax SDK to enable Pro access
await ArgmaxSDK.with(ArgmaxConfig(apiKey: "ax_*****"))
let config = WhisperKitProConfig(model: "large-v3-v20240930")
let whisperKitPro = try await WhisperKitPro(config)
let transcript = try? await whisperKitPro.transcribe(audioPath: "path/to/audio.m4a").text
Open-source SDK
Argmax Open-source SDK includes the WhisperKit framework that implements file transcription:
import WhisperKit
let config = WhisperKitConfig(model: "large-v3-v20240930")
let whisperKit = try await WhisperKit(config)
let transcript = try? await whisperKit.transcribe(audioPath: "path/to/audio.m4a").text
If your application requires live results while recording or low-latency results at the end of a recording session, check out Real-time Transcription.
Advanced Features
Pro Models
The Pro SDK offers additional models with significantly higher speed, accuracy, and energy efficiency.
Nvidia Parakeet Models
The Parakeet model is faster than Whisper Tiny and more accurate than Whisper Large v3.
We recommend using this model for all English-only applications. For multilingual, please refer to the available Pro SDK Whisper models in the next section.
In order to use Parakeet models, simply apply this diff to your initial configuration code:
- let config = WhisperKitConfig(model: "large-v3-v20240930")
+ let config = WhisperKitProConfig(
+ model: "parakeet-v2",
+ modelRepo: "argmaxinc/parakeetkit-pro",
+ modelToken: "hf_*****" // Request access at https://huggingface.co/argmaxinc/parakeetkit-pro
+ )
Pro Model Access Credentials. Please request access here. We are working on removing this extra credential requirement in the near term.
iOS apps must use compressed models: please use parakeet-v2_478MB instead of parakeet-v2. This compressed model is benchmarked and verified to achieve accuracy within 0.5% of the original model.
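Putting these pieces together, an iOS configuration using the compressed Parakeet model could look like the sketch below. It simply combines the Basic Example with the diff above; the API key and Hugging Face token placeholders are yours to fill in.
import Argmax
// Initialize Argmax SDK to enable Pro access
await ArgmaxSDK.with(ArgmaxConfig(apiKey: "ax_*****"))
// Compressed Parakeet model for iOS (within 0.5% of parakeet-v2 accuracy)
let config = WhisperKitProConfig(
    model: "parakeet-v2_478MB",
    modelRepo: "argmaxinc/parakeetkit-pro",
    modelToken: "hf_*****" // Request access at https://huggingface.co/argmaxinc/parakeetkit-pro
)
let whisperKitPro = try await WhisperKitPro(config)
let transcript = try? await whisperKitPro.transcribe(audioPath: "path/to/audio.m4a").text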
Whisper Models
This second set of Whisper models is further optimized for speed and energy efficiency on top of their open-source counterparts: accuracy remains identical while speed improves.
In order to use upgraded Whisper models, simply apply this diff to your initial configuration code:
- let config = WhisperKitConfig(model: "large-v3-v20240930")
+ let config = WhisperKitProConfig(
+ model: "large-v3-v20240930",
+ modelRepo: "argmaxinc/whisperkit-pro"
+ )
OS Compatibility. Note that argmaxinc/whisperkit-pro models support iOS 18/macOS 15 and newer. For users still on iOS 17/macOS 14, please keep using argmaxinc/whisperkit-coreml.
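If an app needs to support both OS generations, one option is to pick the model repository at runtime with an availability check. The sketch below is only an illustration: the preferredWhisperModelRepo helper is hypothetical, and it assumes the configuration only needs its modelRepo swapped.
import Argmax
// Hypothetical helper: pick the model repository that matches the running OS.
// argmaxinc/whisperkit-pro models require iOS 18 / macOS 15 or newer.
func preferredWhisperModelRepo() -> String {
    if #available(iOS 18, macOS 15, *) {
        return "argmaxinc/whisperkit-pro"
    } else {
        return "argmaxinc/whisperkit-coreml"
    }
}
let config = WhisperKitProConfig(
    model: "large-v3-v20240930",
    modelRepo: preferredWhisperModelRepo()
)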
VAD-based Audio Chunking
Audio files longer than 30 seconds are processed in chunks. Naive chunking at fixed 30-second intervals (.chunkingStrategy = .none) may cut speech mid-utterance and leave extended silence at the beginning of a chunk, both of which are known to lead to lower quality transcriptions.
Voice Activity Detection (VAD) is built into the Pro SDK to help find precise seek points for chunking, improving transcription accuracy even further. This feature downloads the 1.5 MB SpeakerSegmenter model, which accurately separates voice from non-voice audio segments at ~16 ms resolution.
Pro SDK
Set voiceActivityDetector when initializing WhisperKitPro and then set .chunkingStrategy = .vad in the transcription options to activate this Pro SDK feature:
import Argmax
...
// Initialize VAD model (Downloads and loads the 1.5 MB model)
let vad = try await VoiceActivityDetector.modelVAD()
// Pass VAD model in WhisperKitPro config
let config = WhisperKitProConfig(voiceActivityDetector: vad)
let whisperKitPro = try await WhisperKitPro(config)
// Activate (Model)VAD-based chunking in DecodingOptions
let options = DecodingOptions(chunkingStrategy: .vad)
let result = try await whisperKitPro.transcribe(audioArray: audioArray, decodeOptions: options)
Open-source SDK
The Open-source SDK implements the same VAD feature based on an "audio energy" function that does not rely on a deep learning model. You may set .chunkingStrategy = .vad in the transcription options to activate this Open-source SDK feature:
import WhisperKit
...
// No change to WhisperKit initialization
let config = WhisperKitConfig()
let whisperKit = try await WhisperKit(config)
// Activate (Energy)VAD-based chunking in DecodingOptions
let options = DecodingOptions(chunkingStrategy: .vad)
let result = try await whisperKit.transcribe(audioArray: audioArray, decodeOptions: options)
Multi-Channel Audio
Both WhisperKit
and WhisperKitPro
support multi-channel audio processing, which can be useful when working with audio files containing multiple speakers or audio sources.
The SDK allows you to specify how to handle multi-channel audio:
- Default Behavior: Merges all channels into a mono track for processing
- Channel Selection: Allows selecting specific channels for transcription
- Channel Summing: Combines selected channels with normalization
Here's how to configure WhisperKit to use specific audio channels:
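// Sum channels 1, 3, and 5 into one mono track, with peak normalization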
let config = WhisperKitConfig(
audioInputConfig: AudioInputConfig(channelMode: .sumChannels([1, 3, 5]))
)
The audio merging algorithm works as follows:
- Finds the peak amplitude across all channels
- Checks if the peak of the mono (summed) version is higher than any of the peaks of the individual channels
- Normalizes the combined track so that the peak of the mono channel matches the peak of the loudest channel
This approach ensures the merged audio maintains appropriate volume levels while combining information from multiple channels.
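For illustration only, the sketch below implements the same peak-matching normalization in plain Swift. It is not the SDK's internal code; the mergeChannels function and its types are hypothetical, and it assumes all channels have the same length.
// Sketch of the merging algorithm described above (hypothetical helper, not SDK code).
// Assumes every channel buffer has the same number of samples.
func mergeChannels(_ channels: [[Float]]) -> [Float] {
    guard let frameCount = channels.first?.count, frameCount > 0 else { return [] }

    // 1. Peak amplitude of each individual channel
    let channelPeaks = channels.map { $0.map(abs).max() ?? 0 }
    let loudestChannelPeak = channelPeaks.max() ?? 0

    // 2. Sum all channels into a mono track and find its peak
    var mono = [Float](repeating: 0, count: frameCount)
    for channel in channels {
        for i in 0..<frameCount { mono[i] += channel[i] }
    }
    let monoPeak = mono.map(abs).max() ?? 0

    // 3. If summing raised the peak, scale the mono track back down so its
    //    peak matches the loudest individual channel
    if monoPeak > loudestChannelPeak && monoPeak > 0 {
        let gain = loudestChannelPeak / monoPeak
        for i in 0..<frameCount { mono[i] *= gain }
    }
    return mono
}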