Real-Time Transcription
Context
Argmax Pro SDK includes the WhisperKitPro framework, which implements an advanced streaming inference algorithm described in our paper. Key features include:
- Accuracy is identical to offline file transcription
- Dual output text streams can be leveraged in the user experience to build trust in stable and accurate results (Confirmed) while maximizing responsiveness (Hypothesis).
- Streaming API design that exposes event-based callbacks, minimizing the burden on the caller
Real-time transcription streams input audio and the corresponding output text continuously during a live recording session:
- Input audio stream: Capturing audio in small user-defined intervals
- Inference: Incremental speech-to-text inference on the input stream
- Output text streams:
  - Confirmed: Finalized portion of the transcript that gets longer over time.
  - Hypothesis: Preliminary transcript that may still be refined as more audio context arrives.
This approach creates an ultra-low-latency user experience where words appear on screen almost as they are spoken, with occasional refinements as the model gathers more context.
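For example, a caller might append each Confirmed increment to a stable transcript and display the Hypothesis separately (e.g., dimmed) so users can tell which words are final. The sketch below assumes the result type delivered by the Pro SDK's result callback exposes text for the newly confirmed increment and an optional hypothesisText, as in the Basic Example later in this guide:
var confirmedTranscript = ""

// Minimal sketch of surfacing the two output streams in a UI or console.
// `confirmedIncrement` and `hypothesis` mirror `result.text` and
// `result.hypothesisText` from the result callback shown in the Basic Example.
func render(confirmedIncrement: String, hypothesis: String?) {
    // Confirmed text only grows, so it is safe to append and persist.
    confirmedTranscript += confirmedIncrement
    // Hypothesis text may be revised on the next callback; show it as tentative.
    print("\(confirmedTranscript) [tentative: \(hypothesis ?? "")]")
}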
Migration from Cloud APIs. The Confirmed output text stream is also referred to as immutable or final in other products. The Hypothesis output text stream is also referred to as mutable or interim in other products.
Real-time with Open-source SDK. The Argmax Open-source SDK includes the WhisperKit framework, which provides basic building blocks to approximate the real-time behavior of the Pro SDK. An open-source example app that demonstrates real-time transcription with the Open-source SDK is available here.
Basic Example
This is a complete and self-contained CLI example project that demonstrates the usage of Argmax Pro SDK for real-time transcription from a microphone input stream.
Step 0: Verify Pro SDK setup
Argmax Pro SDK access must be set up with SwiftPM before going through this example. If unsure, please see Upgrading to Pro SDK (Step 1 only).
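If the registry has not been configured yet, the SwiftPM side of that setup amounts to registering the argmaxinc scope with a package registry, along the lines of the command below. The registry URL and authentication steps are specific to your Pro SDK onboarding and are covered in Upgrading to Pro SDK; <REGISTRY_URL> is a placeholder, not an actual endpoint:
swift package-registry set --scope argmaxinc <REGISTRY_URL>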
Step 1: Create project directory
Create a project directory with the layout shown below, then add the code listings that follow to Package.swift and ArgmaxTestCommand.swift.
ArgmaxSDKRealTimeTranscriptionBasicExample
├── Package.swift
└── Sources
└── ArgmaxTestCLI
└── ArgmaxTestCommand.swift
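If helpful, this layout can be created from the Terminal as follows (any equivalent approach works):
mkdir -p ArgmaxSDKRealTimeTranscriptionBasicExample/Sources/ArgmaxTestCLI
cd ArgmaxSDKRealTimeTranscriptionBasicExample
touch Package.swift Sources/ArgmaxTestCLI/ArgmaxTestCommand.swift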
Package.swift:
// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.

import PackageDescription

let package = Package(
    name: "Argmax Test CLI",
    platforms: [
        .macOS(.v14)
    ],
    products: [
        .executable(
            name: "argmax-test-cli",
            targets: ["ArgmaxTestCLI"]
        )
    ],
    dependencies: [
        .package(id: "argmaxinc.argmax-sdk-swift", .upToNextMinor(from: "1.3.3")),
        .package(url: "https://github.com/apple/swift-argument-parser.git", exact: "1.3.0")
    ],
    targets: [
        .executableTarget(
            name: "ArgmaxTestCLI",
            dependencies: [
                .product(name: "Argmax", package: "argmaxinc.argmax-sdk-swift"),
                .product(name: "ArgumentParser", package: "swift-argument-parser")
            ]
        ),
    ]
)
ArgmaxTestCommand.swift:
import Foundation
import ArgumentParser
import Argmax

@main
struct ArgmaxTestCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        abstract: "An example CLI tool for Argmax Pro SDK",
        subcommands: [Transcribe.self]
    )

    struct Transcribe: AsyncParsableCommand {
        static let configuration = CommandConfiguration(
            abstract: "Real-time transcription using system microphone"
        )

        @Option(help: "Argmax Pro SDK API key")
        var apiKey: String

        @Option(help: "Model name: e.g. `tiny.en` or `large-v3-v20240930_626MB`. Default: `tiny.en`")
        var modelName: String = "tiny.en"

        @Option(help: "HuggingFace token if accessing Pro models")
        var modelToken: String?

        func run() async throws {
            print("Initializing Argmax Pro SDK...")
            let sdkConfig = ArgmaxConfig(apiKey: apiKey)
            await ArgmaxSDK.with(sdkConfig)

            let modelRepo = "argmaxinc/whisperkit-coreml"
            // Uncomment to access Pro models (requires `modelToken`)
            // let modelRepo = "argmaxinc/whisperkit-pro"

            print("Downloading \(modelName) model ...")
            let downloadURL = try await WhisperKitPro.download(
                variant: modelName,
                from: modelRepo,
                token: modelToken) { progress in
                if let progressString = progress.localizedDescription {
                    // Overwrite the same terminal line with the latest progress
                    print("\rDownload progress: \(progressString)", terminator: "")
                    fflush(stdout)
                }
            }
            let modelFolder = downloadURL.path(percentEncoded: false)
            print("\nDownload completed: \(modelFolder)")

            let whisperKitPro = try await setupWhisperKitPro(modelFolder: modelFolder)
            try await transcribeStream(whisperKitPro: whisperKitPro)
        }

        private func setupWhisperKitPro(modelFolder: String) async throws -> WhisperKitPro {
            print("Initializing WhisperKit Pro...")
            let whisperConfig = WhisperKitProConfig(
                modelFolder: modelFolder,
                verbose: true,
                logLevel: .debug
            )
            let whisperKitPro = try await WhisperKitPro(whisperConfig)

            print("Loading WhisperKit Pro models...")
            try await whisperKitPro.loadModels()
            return whisperKitPro
        }

        private func transcribeStream(whisperKitPro: WhisperKitPro) async throws {
            print("Transcribing while streaming audio from microphone...")
            let baseOptions = DecodingOptions(
                verbose: true,
                task: .transcribe,
                wordTimestamps: true,
                chunkingStrategy: .vad
            )
            let options = DecodingOptionsPro(
                base: baseOptions,
                transcribeInterval: 0.1
            )

            // Start recording
            var audioBuffer: [Float] = []
            let lock = NSLock()
            try whisperKitPro.audioProcessor.startRecordingLive { samples in
                lock.withLock {
                    audioBuffer.append(contentsOf: samples)
                }
            }
            print("Started audio capture... press Ctrl+C to stop...")

            // Process the stream
            let dateFormatter = DateFormatter()
            dateFormatter.dateFormat = "HH:mm:ss.SSS"
            var accumulatedConfirmedText = ""
            let recordingTask = whisperKitPro.transcribeWhileRecording(
                options: options,
                audioCallback: {
                    let samples = lock.withLock {
                        let samples = audioBuffer
                        audioBuffer.removeAll()
                        return samples
                    }
                    return AudioSamples(samples: samples)
                },
                resultCallback: { result in
                    let timestamp = dateFormatter.string(from: Date())
                    accumulatedConfirmedText += result.text
                    let hypothesisText = result.hypothesisText ?? ""
                    print("[\(timestamp)] \(accumulatedConfirmedText)\u{001B}[34m\(hypothesisText)\u{001B}[0m")
                    return true
                }
            )

            var signalHandled = false
            defer {
                if !signalHandled {
                    print("Stop recording...")
                    recordingTask.stop()
                }
            }

            signal(SIGINT, SIG_IGN)
            let signalSource = DispatchSource.makeSignalSource(signal: SIGINT, queue: DispatchQueue.main)
            signalSource.setEventHandler(handler: DispatchWorkItem(block: {
                print("Stop recording...")
                signalHandled = true
                whisperKitPro.audioProcessor.stopRecording()

                print("Finalizing transcription...")
                let group = DispatchGroup()
                group.enter()
                Task {
                    do {
                        let results = try await recordingTask.finalize()
                        let mergedResult = WhisperKitProUtils.mergeTranscriptionResults(results)
                        print("\n\nTranscription: \n\n\(mergedResult.text)\n")
                    } catch {
                        print("Error finalizing recording: \(error)")
                    }
                    group.leave()
                }
                group.wait()
                Foundation.exit(0)
            }))
            signalSource.resume()

            try await recordingTask.start()
        }
    }
}
Step 2: Build and run in Terminal
Run the following command in your Terminal from within the top-level project directory:
swift run argmax-test-cli transcribe --api-key <API_KEY>
If you observe error: no registry configured for 'argmaxinc' scope, go back to Step 0.
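The other options declared on the Transcribe subcommand can be passed the same way. With swift-argument-parser's default naming, modelName and modelToken map to --model-name and --model-token, so, for example, a larger Whisper variant can be selected with:
swift run argmax-test-cli transcribe --api-key <API_KEY> --model-name large-v3-v20240930_626MB
Note that --model-token is only needed for Pro models, which also require uncommenting the argmaxinc/whisperkit-pro repository line in run().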
Here is an example output upon successful build and launch:

Advanced Features
Pro Models
Pro SDK offers additional models with significantly higher speed, accuracy, and energy-efficiency.
Nvidia Parakeet Models
These models are not yet supported for real-time transcription. Coming soon.
Whisper Models
This second set of Whisper models is further optimized for speed and energy efficiency on top of their open-source counterparts. Accuracy remains identical while speed improves.
In order to use upgraded Whisper models, simply apply this diff to your initial configuration code:
- let config = WhisperKitConfig(model: "large-v3-v20240930")
+ let config = WhisperKitProConfig(
+     model: "large-v3-v20240930",
+     modelRepo: "argmaxinc/whisperkit-pro"
+ )
OS Compatibility. Note that argmaxinc/whisperkit-pro models support iOS 18/macOS 15 and newer. For users still on iOS 17/macOS 14, please keep using argmaxinc/whisperkit-coreml.
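For apps that target both OS generations, one option is to pick the repository at runtime with an availability check. The sketch below only illustrates that branching; the repository names are the ones used in this guide:
// Select the model repository based on the OS floor noted above.
func preferredModelRepo() -> String {
    if #available(iOS 18, macOS 15, *) {
        return "argmaxinc/whisperkit-pro"      // Pro models (iOS 18 / macOS 15 and newer)
    } else {
        return "argmaxinc/whisperkit-coreml"   // open-source models for iOS 17 / macOS 14
    }
}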
Multiple Audio Streams
This feature allows multiple input audio streams to be transcribed in real time by the same WhisperKitPro object. An example use case is concurrent real-time transcription of system audio and microphone for meeting transcription.
Before implementing multi-stream transcription, ensure that the ArgmaxTestCommand from Step 1 works correctly, particularly its transcribeStream function, which demonstrates the basic single-stream implementation.
Multi-Stream Architecture
The same WhisperKitPro instance can efficiently handle multiple audio streams simultaneously. Each stream gets its own recordingTask that shares the same WhisperKitPro instance but maintains independent processing context, allowing them to run concurrently without interference.
Implementation Overview
For multi-stream setup, you'll need to:
- Audio Stream Sources: Bring your own audio stream sources (e.g., system audio, network streams, file streams). Optionally, you can also include microphone audio using whisperKitPro.audioProcessor.startRecordingLive
- Audio Processing: Convert incoming audio data to the required [Float] format
- Independent Buffers: Maintain separate audio buffers and locks for each stream
- Concurrent Processing: Start all recording tasks concurrently using a task group
Example Implementation
private func transcribeMultipleStreams(whisperKitPro: WhisperKitPro) async throws {
    let baseOptions = DecodingOptions(
        verbose: true,
        task: .transcribe,
        wordTimestamps: true,
        chunkingStrategy: .vad
    )
    let options = DecodingOptionsPro(
        base: baseOptions,
        transcribeInterval: 0.1
    )

    // Stream 1: System audio buffer (custom stream)
    var systemAudioBuffer: [Float] = []
    let systemLock = NSLock()

    // Stream 2: Network/File audio buffer (custom stream)
    var networkAudioBuffer: [Float] = []
    let networkLock = NSLock()

    // OPTIONAL: Stream 3: Microphone audio buffer (using built-in recorder)
    var micAudioBuffer: [Float] = []
    let micLock = NSLock()

    // Start your custom audio streams
    // Custom Stream 1: System audio capture
    startSystemAudioCapture { avAudioPCMBuffer in
        // Convert AVAudioPCMBuffer to [Float]
        // See conversion example: https://github.com/argmaxinc/WhisperKit/blob/8c0acbd2fdff83f4081aaae8b3bb7c01823d79e1/Sources/WhisperKit/Core/Audio/AudioProcessor.swift#L988
        let samples = convertAVAudioPCMBufferToFloatArray(avAudioPCMBuffer)
        systemLock.withLock {
            systemAudioBuffer.append(contentsOf: samples)
        }
    }

    // Custom Stream 2: Network/File audio
    startNetworkAudioStream { avAudioPCMBuffer in
        // Convert AVAudioPCMBuffer to [Float]
        let samples = convertAVAudioPCMBufferToFloatArray(avAudioPCMBuffer)
        networkLock.withLock {
            networkAudioBuffer.append(contentsOf: samples)
        }
    }

    // OPTIONAL: Built-in microphone recording (you can skip this if not needed)
    try whisperKitPro.audioProcessor.startRecordingLive { samples in
        micLock.withLock {
            micAudioBuffer.append(contentsOf: samples)
        }
    }

    // Create recording tasks for each stream
    let systemRecordingTask = whisperKitPro.transcribeWhileRecording(
        options: options,
        audioCallback: {
            let samples = systemLock.withLock {
                let samples = systemAudioBuffer
                systemAudioBuffer.removeAll()
                return samples
            }
            return AudioSamples(samples: samples)
        },
        resultCallback: { result in
            print("[SYSTEM] \(result.text)")
            return true
        }
    )

    let networkRecordingTask = whisperKitPro.transcribeWhileRecording(
        options: options,
        audioCallback: {
            let samples = networkLock.withLock {
                let samples = networkAudioBuffer
                networkAudioBuffer.removeAll()
                return samples
            }
            return AudioSamples(samples: samples)
        },
        resultCallback: { result in
            print("[NETWORK] \(result.text)")
            return true
        }
    )

    // OPTIONAL: Microphone recording task (only if using microphone)
    let micRecordingTask = whisperKitPro.transcribeWhileRecording(
        options: options,
        audioCallback: {
            let samples = micLock.withLock {
                let samples = micAudioBuffer
                micAudioBuffer.removeAll()
                return samples
            }
            return AudioSamples(samples: samples)
        },
        resultCallback: { result in
            print("[MIC] \(result.text)")
            return true
        }
    )
    // Start all recording tasks concurrently
    // (a throwing task group is needed because `start()` can throw)
    try await withThrowingTaskGroup(of: Void.self) { group in
        group.addTask {
            try await systemRecordingTask.start()
        }
        group.addTask {
            try await networkRecordingTask.start()
        }
        // OPTIONAL: Add microphone task only if using microphone
        group.addTask {
            try await micRecordingTask.start()
        }
        // Wait for all tasks to complete
        try await group.waitForAll()
    }
}
// Helper function to convert AVAudioPCMBuffer to [Float]
private func convertAVAudioPCMBufferToFloatArray(_ buffer: AVAudioPCMBuffer) -> [Float] {
    // Implementation depends on your audio format
    // See: https://github.com/argmaxinc/WhisperKit/blob/8c0acbd2fdff83f4081aaae8b3bb7c01823d79e1/Sources/WhisperKit/Core/Audio/AudioProcessor.swift#L728
    guard let channelData = buffer.floatChannelData else { return [] }
    let frameLength = Int(buffer.frameLength)
    return Array(UnsafeBufferPointer(start: channelData[0], count: frameLength))
}
Key Considerations
- Audio Format Conversion: When working with AVAudioPCMBuffer from system audio or other sources, you'll need to convert the audio data to [Float] format. Refer to the AudioProcessor conversion example for implementation details.
- Thread Safety: Each stream maintains its own audio buffer and lock to ensure thread-safe operations when multiple audio sources are writing simultaneously (see the sketch after this list).
- Concurrent Processing: Use Swift's structured concurrency (TaskGroup) to start all recording tasks simultaneously, enabling true multi-stream processing.
- Resource Management: The shared WhisperKitPro instance efficiently manages computational resources across all streams while maintaining independent processing contexts.
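As an illustration of the thread-safety point above, the per-stream buffer and lock from the example can be wrapped in a small helper type so each audio source only needs a single handle. This is a sketch built around the pattern already used in this guide (an NSLock guarding a [Float] buffer), not part of the SDK API:
import Foundation

// Thread-safe accumulation buffer for one audio stream.
final class StreamBuffer {
    private var samples: [Float] = []
    private let lock = NSLock()

    // Called from the audio source's callback thread.
    func append(_ newSamples: [Float]) {
        lock.withLock { samples.append(contentsOf: newSamples) }
    }

    // Called from the transcription task's audioCallback; returns and clears
    // everything accumulated since the last drain.
    func drain() -> [Float] {
        lock.withLock {
            let drained = samples
            samples.removeAll()
            return drained
        }
    }
}
With this in place, each audioCallback reduces to returning AudioSamples(samples: buffer.drain()).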