File Transcription

Implementing file-based speech-to-text in your applications

File transcription processes a complete audio file offline as a single batch, unlike real-time transcription, which processes audio in a streaming fashion as it arrives.

Basic Example

Pro SDK

Argmax Pro SDK includes the WhisperKitPro framework that implements file transcription:

import Argmax
 
// Initialize Argmax SDK to enable Pro access
await ArgmaxSDK.with(ArgmaxConfig(apiKey: "ax_*****"))
 
let config = WhisperKitProConfig(model: "large-v3-v20240930")
let whisperKitPro = try await WhisperKitPro(config)
let transcript = try? await whisperKitPro.transcribe(audioPath: "path/to/audio.m4a").text

Open-source SDK

Argmax Open-source SDK includes the WhisperKit framework that implements file transcription:

import WhisperKit
 
let config = WhisperKitConfig(model: "large-v3-v20240930")
let whisperKit = try await WhisperKit(config)
let transcript = try? await whisperKit.transcribe(audioPath: "path/to/audio.m4a").text

Advanced Examples

Record-then-transcribe

// Requires: import AVFoundation (for AVAudioFormat, AVAudioPCMBuffer, AVAudioFile)
// and import WhisperKit or import Argmax (for AudioProcessor).

/// Record audio to a temporary file for batch processing
func startFileRecording() {
    guard !isProcessing && !isStreaming else { return }
 
    if audioProcessor == nil {
        audioProcessor = AudioProcessor()
    }
 
    guard let audioProcessor = self.audioProcessor else {
        processingError = "AudioProcessor is not initialized"
        return
    }
 
    stopRequestedAt = nil
 
    Task {
        await MainActor.run {
            isProcessing = true
            isStreaming = true
            processingError = nil
            transcriptionResult = ""
            hypothesisText = ""
            startAudioLevelTimer()
        }
 
        do {
            try audioProcessor.startRecordingLive()
        } catch {
            await MainActor.run {
                isProcessing = false
                isStreaming = false
                processingError = "Failed to start recording: \(error.localizedDescription)"
                stopAudioLevelTimer()
            }
        }
    }
}
 
/// Stops recording to file and transcribes the recorded audio
@discardableResult
func stopFileRecording(useProVersion: Bool) async -> URL? {
    guard isStreaming, let audioProcessor = self.audioProcessor else { return nil }
 
    await MainActor.run {
        isStreaming = false
        stopAudioLevelTimer()
    }
 
    let audioSamples = Array(audioProcessor.audioSamples)
    audioProcessor.stopRecording()
 
    let tempURL = FileManager.default.temporaryDirectory.appendingPathComponent("recording-\(Date().timeIntervalSince1970).wav")
 
    let success = saveAudioSamplesToFile(audioSamples, url: tempURL)
 
    if success {
        await transcribeAudio(url: tempURL, useProVersion: useProVersion, useStreaming: false)
        return tempURL
    } else {
        await MainActor.run {
            processingError = "Failed to save audio file"
            isProcessing = false
        }
        return nil
    }
}
 
/// Saves audio samples to a WAV file
private func saveAudioSamplesToFile(_ samples: [Float], url: URL) -> Bool {
    let sampleRate = 16000
    let channelCount = 1
 
    guard let format = AVAudioFormat(
        commonFormat: .pcmFormatFloat32,
        sampleRate: Double(sampleRate),
        channels: AVAudioChannelCount(channelCount),
        interleaved: false
    ) else {
        print("Failed to create audio format")
        return false
    }
 
    guard let buffer = AVAudioPCMBuffer(
        pcmFormat: format,
        frameCapacity: AVAudioFrameCount(samples.count)
    ) else {
        print("Failed to create audio buffer")
        return false
    }
 
    for i in 0..<min(samples.count, Int(buffer.frameCapacity)) {
        buffer.floatChannelData?[0][i] = samples[i]
    }
    buffer.frameLength = AVAudioFrameCount(min(samples.count, Int(buffer.frameCapacity)))
 
    do {
        let audioFile = try AVAudioFile(
            forWriting: url,
            settings: format.settings,
            commonFormat: .pcmFormatFloat32,
            interleaved: false
        )
 
        try audioFile.write(from: buffer)
        return true
    } catch {
        print("Failed to save audio file: \(error)")
        return false
    }
}
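
The transcribeAudio(url:useProVersion:useStreaming:) helper called from stopFileRecording is not an SDK API. Below is a minimal sketch of one possible implementation, assuming the manager already holds initialized whisperKit and whisperKitPro instances and publishes transcriptionResult, processingError, and isProcessing:

/// Minimal sketch (not an SDK API): transcribes the saved file with either the
/// Pro or Open-source pipeline and publishes the result.
func transcribeAudio(url: URL, useProVersion: Bool, useStreaming: Bool) async {
    // useStreaming is ignored here; this sketch always runs batch file transcription.
    do {
        let text: String?
        if useProVersion, let whisperKitPro = self.whisperKitPro {
            text = try await whisperKitPro.transcribe(audioPath: url.path).text
        } else if let whisperKit = self.whisperKit {
            text = try await whisperKit.transcribe(audioPath: url.path).text
        } else {
            text = nil
        }

        await MainActor.run {
            transcriptionResult = text ?? ""
            isProcessing = false
        }
    } catch {
        await MainActor.run {
            processingError = "Transcription failed: \(error.localizedDescription)"
            isProcessing = false
        }
    }
}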

Advanced Features

Pro Models

The Pro SDK offers significantly faster and more energy-efficient models. These models also produce more accurate word-level timestamps.

To upgrade, simply apply this diff to your initial configuration code:

- let config = WhisperKitConfig(model: "large-v3-v20240930")
+ let config = WhisperKitProConfig(
+     model: "large-v3-v20240930",
+     modelRepo: "argmaxinc/whisperkit-pro",
+     modelToken: "hf_*****" // Request access at https://huggingface.co/argmaxinc/whisperkit-pro
+ )

For now, you need to request model access at https://huggingface.co/argmaxinc/whisperkit-pro. We are working on removing this extra credential requirement.

VAD-based Audio Chunking

Audio files longer than 30 seconds are processed in chunks. Naive chunking at fixed 30-second intervals (.chunkingStrategy = .none) may cut speech mid-utterance and leave extended silence at the beginning of a chunk, both of which are known to lead to lower-quality transcriptions.

Voice Activity Detection (VAD) is built into the Pro SDK to help find precise seek points for chunking, improving transcription accuracy even further. This feature downloads the 1.5 MB SpeakerSegmenter model, which accurately separates voice from non-voice audio segments at ~16 ms resolution.

Pro SDK

Set voiceActivityDetector when initializing WhisperKitPro, then set .chunkingStrategy = .vad in the transcription options to activate this Pro SDK feature:

    import Argmax
    ...
    // Initialize VAD model (Downloads and loads the 1.5 MB model)
    let vad = try await VoiceActivityDetector.modelVAD()
 
    // Pass VAD model in WhisperKitPro config
    let config = WhisperKitProConfig(voiceActivityDetector: vad)
    let whisperKitPro = try await WhisperKitPro(config)
 
    // Activate (Model)VAD-based chunking in DecodingOptions
    let options = DecodingOptions(chunkingStrategy: .vad)
    let result = try await whisperKitPro.transcribe(audioArray: audioArray, decodeOptions: options)

Open-source SDK

The Open-source SDK implements the same VAD feature using an "audio energy" function that does not rely on a deep learning model. Set .chunkingStrategy = .vad in the transcription options to activate this Open-source SDK feature:

    import WhisperKit
    ...
    // No change to WhisperKit initialization
    let config = WhisperKitConfig()
    let whisperKit = try await WhisperKit(config)
 
    // Activate (Energy)VAD-based chunking in DecodingOptions
    let options = DecodingOptions(chunkingStrategy: .vad)
    let result = try await whisperKit.transcribe(audioArray: audioArray, decodeOptions: options)

Multi-Channel Audio

Both WhisperKit and WhisperKitPro support multi-channel audio processing, which can be useful when working with audio files containing multiple speakers or audio sources.

The SDK allows you to specify how to handle multi-channel audio:

  1. Default Behavior: Merges all channels into a mono track for processing
  2. Channel Selection: Allows selecting specific channels for transcription
  3. Channel Summing: Combines selected channels with normalization

Here's how to configure WhisperKit to use specific audio channels:

let config = WhisperKitConfig(
    audioInputConfig: AudioInputConfig(channelMode: .sumChannels([1, 3, 5]))
)

The audio merging algorithm works as follows:

  1. Finds the peak amplitude across all channels
  2. Checks if the peak of the mono (summed) version is higher than any of the peaks of the individual channels
  3. Normalizes the combined track so that the peak of the mono channel matches the peak of the loudest channel

This approach ensures the merged audio maintains appropriate volume levels while combining information from multiple channels.
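
As a rough illustration of the steps above (not the SDK's internal implementation), the peak-matched channel summing can be sketched as follows:

/// Illustrative sketch of the channel-merging steps described above;
/// the SDK's internal implementation may differ.
func mergeChannels(_ channels: [[Float]]) -> [Float] {
    guard let frameCount = channels.first?.count, frameCount > 0 else { return [] }

    // 1. Peak amplitude of each channel, and the loudest of them
    let channelPeaks = channels.map { $0.map(abs).max() ?? 0 }
    let loudestChannelPeak = channelPeaks.max() ?? 0

    // Sum all channels sample-by-sample into a mono track
    var mono = [Float](repeating: 0, count: frameCount)
    for channel in channels {
        for i in 0..<min(frameCount, channel.count) {
            mono[i] += channel[i]
        }
    }

    // 2. Check whether the summed track peaks higher than any individual channel
    let monoPeak = mono.map(abs).max() ?? 0

    // 3. Normalize so the mono peak matches the loudest individual channel's peak
    if monoPeak > loudestChannelPeak, monoPeak > 0 {
        let scale = loudestChannelPeak / monoPeak
        for i in 0..<frameCount {
            mono[i] *= scale
        }
    }
    return mono
}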

UI Considerations

When implementing file transcription in your application, consider these UI elements:

Recording Indicator

Here's a quick snippet showing how to build a recording level visualizer.

private func startAudioLevelTimer() {
    stopAudioLevelTimer()
 
    levelUpdateTimer = Timer.scheduledTimer(withTimeInterval: 0.1, repeats: true) { [weak self] _ in
        guard let self = self, let audioProcessor = self.audioProcessor else { return }
 
        let energy = audioProcessor.relativeEnergy
 
        if !energy.isEmpty {
            var levels = self.audioLevels
            let newLevel = energy.last ?? 0.0
            levels.removeFirst()
            levels.append(newLevel)
            self.audioLevels = levels
            self.objectWillChange.send()
        }
    }
}
 
private var audioVisualizerView: some View {
    HStack(alignment: .bottom, spacing: 4) {
        ForEach(0..<transcriptionManager.audioLevels.count, id: \.self) { index in
            AudioBar(level: transcriptionManager.audioLevels[index])
        }
    }
    .frame(height: isMacOS ? 40 : 50)
    .padding()
    .background(Color(hex: "0F3460").opacity(0.3))
    .cornerRadius(12)
}
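
The AudioBar view used above is not an SDK type. A minimal sketch, assuming level is a normalized value in the 0...1 range and Color(hex:) is the custom helper sketched at the end of the Progress Indicator section below:

import SwiftUI

/// Minimal sketch of the AudioBar view referenced above (not an SDK type):
/// renders a single bar whose height tracks the current audio level (0...1).
struct AudioBar: View {
    let level: Float

    var body: some View {
        RoundedRectangle(cornerRadius: 2)
            .fill(Color(hex: "E94560"))
            .frame(width: 4, height: CGFloat(max(0.05, min(level, 1))) * 40)
            .animation(.easeOut(duration: 0.1), value: level)
    }
}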

Progress Indicator

Here's a simple way of displaying the status and progress of the SDK.

private var initializationView: some View {
    VStack(spacing: 20) {
        ProgressView()
            .scaleEffect(isMacOS ? 1.2 : 1.5)
            .progressViewStyle(CircularProgressViewStyle(tint: Color(hex: "E94560")))
            .padding()
 
        Text(transcriptionManager.initializationStatus)
            .font(.system(size: 16))
            .multilineTextAlignment(.center)
            .foregroundColor(.white)
            .padding()
            .frame(maxWidth: .infinity)
            .background(Color(hex: "0F3460").opacity(0.3))
            .cornerRadius(12)
    }
    .padding(.vertical, isMacOS ? 20 : 30)
    .transition(.opacity)
    .animation(.easeInOut, value: transcriptionManager.isWhisperKitReady)
}
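
These snippets also rely on a Color(hex:) convenience initializer that is not part of SwiftUI. A minimal sketch of such an extension, assuming 6-digit RGB hex strings like "0F3460":

import Foundation
import SwiftUI

/// Minimal sketch of the Color(hex:) helper used throughout these snippets
/// (not part of SwiftUI): parses a 6-digit RGB hex string such as "0F3460".
extension Color {
    init(hex: String) {
        var value: UInt64 = 0
        Scanner(string: hex).scanHexInt64(&value)
        let red = Double((value >> 16) & 0xFF) / 255.0
        let green = Double((value >> 8) & 0xFF) / 255.0
        let blue = Double(value & 0xFF) / 255.0
        self.init(red: red, green: green, blue: blue)
    }
}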

Results Preview

Here's a simple way to display hypothesis and confirmed text to the user.

private var transcriptionResultView: some View {
    ScrollView {
        VStack(alignment: .leading, spacing: 0) {
            if transcriptionManager.transcriptionResult.isEmpty && transcriptionManager.hypothesisText.isEmpty {
                Text("Your transcription will appear here")
                    .font(.system(size: isMacOS ? 16 : 18))
                    .foregroundColor(.white)
                    .padding()
                    .frame(maxWidth: .infinity, alignment: .leading)
            } else {
                if !transcriptionManager.transcriptionResult.isEmpty {
                    Text(transcriptionManager.transcriptionResult)
                        .font(.system(size: isMacOS ? 16 : 18))
                        .foregroundColor(.white)
                        .padding()
                        .frame(maxWidth: .infinity, alignment: .leading)
                }
 
                if !transcriptionManager.hypothesisText.isEmpty {
                    Text(transcriptionManager.hypothesisText)
                        .font(.system(size: isMacOS ? 16 : 18))
                        .foregroundColor(.white.opacity(0.6))
                        .padding()
                        .frame(maxWidth: .infinity, alignment: .leading)
                }
            }
        }
        .frame(maxWidth: .infinity)
        .background(Color(hex: "0F3460").opacity(0.3))
        .cornerRadius(16)
        .animation(.easeInOut(duration: 0.3), value: transcriptionManager.transcriptionResult)
        .animation(.easeInOut(duration: 0.3), value: transcriptionManager.hypothesisText)
    }
    .frame(maxHeight: .infinity)
    .padding(.vertical, isMacOS ? 16 : 20)
}

Editor

Allow users to review and correct transcription results
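
A minimal sketch of such an editor, assuming the manager is an ObservableObject (here called TranscriptionManager, a name used only for illustration) that publishes transcriptionResult as a String the user can edit in place:

import SwiftUI

/// Minimal sketch of a transcript editor (not an SDK component): lets the user
/// review and correct the final transcription result.
struct TranscriptEditorView: View {
    @ObservedObject var transcriptionManager: TranscriptionManager

    var body: some View {
        TextEditor(text: $transcriptionManager.transcriptionResult)
            .font(.system(size: 16))
            .padding()
            .cornerRadius(12)
    }
}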