Real-Time Transcription
Context
Argmax Pro SDK includes the WhisperKitPro framework, which implements an advanced streaming inference algorithm described in our paper. Key features include:
- Accuracy is identical to offline file transcription
- Dual output text streams can be leveraged in the user experience to build trust in stable and accurate results (Confirmed) while maximizing responsiveness (Hypothesis).
- Streaming API design that exposes event-based callbacks, minimizing the burden on the caller
Real-time transcription streams input audio and the corresponding output text continuously during a live recording session:
- Input audio stream: Capturing audio in small user-defined intervals
- Inference: Incremental speech-to-text inference on the input stream
- Output text streams:
  - Confirmed: Finalized portion of the transcript that gets longer over time.
  - Hypothesis: Preliminary transcript that may still be refined as more audio context arrives.
This approach creates an ultra-low-latency user experience where words appear on the screen almost as they're spoken, with occasional refinements as the model gathers more context.
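For instance, a UI can render the two streams together, appending each Confirmed update to a stable transcript and showing the latest Hypothesis after it in a lighter style. The sketch below illustrates this with SwiftUI; the TranscriptView type and its properties are hypothetical and not part of the SDK.

import SwiftUI

// Hypothetical view: `confirmed` only grows over time, `hypothesis` is replaced on every update.
struct TranscriptView: View {
    let confirmed: String   // finalized text, safe to persist
    let hypothesis: String  // preliminary text, may still be refined

    var body: some View {
        // Confirmed text in the primary style, hypothesis in a secondary color to signal it may change
        (Text(confirmed) + Text(hypothesis).foregroundColor(.secondary))
            .frame(maxWidth: .infinity, alignment: .leading)
    }
}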
Migration from Cloud APIs. The Confirmed output text stream is also referred to as immutable or final in other products. The Hypothesis output text stream is also referred to as mutable or interim in other products.
Real-time with Open-source SDK. The Argmax Open-source SDK includes the WhisperKit framework, which provides basic building blocks to approximate the real-time behavior of the Pro SDK. An open-source example app that demonstrates real-time transcription with the Open-source SDK is available here.
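A minimal sketch of that approximation is shown below, assuming the open-source WhisperKit APIs (WhisperKitConfig, AudioProcessor.startRecordingLive, and transcribe(audioArray:)). Unlike the Pro SDK, it simply re-transcribes the growing audio buffer on a timer, so there is no Confirmed/Hypothesis split and per-pass latency grows with recording length.

import Foundation
import WhisperKit

// Rough approximation of real-time transcription with open-source WhisperKit:
// capture microphone samples into a buffer and re-transcribe the whole buffer once per second.
func approximateRealTimeTranscription() async throws {
    let whisperKit = try await WhisperKit(WhisperKitConfig(model: "tiny.en"))

    var buffer: [Float] = []
    let lock = NSLock()
    try whisperKit.audioProcessor.startRecordingLive { samples in
        lock.withLock { buffer.append(contentsOf: samples) }
    }

    while !Task.isCancelled {
        try await Task.sleep(nanoseconds: 1_000_000_000) // poll once per second
        let snapshot = lock.withLock { buffer }
        guard !snapshot.isEmpty else { continue }
        // Re-transcribe everything captured so far; the tail of the text is effectively a hypothesis.
        let results = try await whisperKit.transcribe(audioArray: snapshot)
        print(results.map(\.text).joined())
    }
}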
Basic Example
This is a complete and self-contained CLI example project that demonstrates the usage of Argmax Pro SDK for real-time transcription from a microphone input stream. This basic example uses higher-level APIs such as WhisperKitCoordinator and LiveTranscriber to minimize the lines of code you have to write. If you would like more advanced configurability with lower-level APIs, please see the Advanced Example section.
Step 0: Verify Pro SDK setup
Argmax Pro SDK access must be set up with SwiftPM before going through this example. If unsure, please see Upgrading to Pro SDK (Step 1 only).
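If the registry has not been configured yet, the setup in that guide amounts to registering the Argmax package registry for the argmaxinc scope and authenticating against it, along the lines of the commands below. The registry URL and token here are placeholders; use the exact values and steps from the Upgrading to Pro SDK guide.

# Placeholder values — substitute the registry URL and token from the "Upgrading to Pro SDK" guide
swift package-registry set --scope argmaxinc https://<ARGMAX_REGISTRY_URL>
swift package-registry login https://<ARGMAX_REGISTRY_URL> --token <REGISTRY_TOKEN>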
Step 1: Create project directory
Create a project directory as shown below and insert the code shared below into ArgmaxTestCommand.swift and Package.swift.
ArgmaxSDKRealTimeTranscriptionBasicExample
├── Package.swift
└── Sources
└── ArgmaxTestCLI
└── ArgmaxTestCommand.swift
Package.swift:
// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.
import PackageDescription

let package = Package(
    name: "Argmax Test CLI",
    platforms: [
        .macOS(.v14)
    ],
    products: [
        .executable(
            name: "argmax-test-cli",
            targets: ["ArgmaxTestCLI"]
        )
    ],
    dependencies: [
        .package(id: "argmaxinc.argmax-sdk-swift", .upToNextMinor(from: "1.3.3")),
        .package(url: "https://github.com/apple/swift-argument-parser.git", exact: "1.3.0")
    ],
    targets: [
        .executableTarget(
            name: "ArgmaxTestCLI",
            dependencies: [
                .product(name: "Argmax", package: "argmaxinc.argmax-sdk-swift"),
                .product(name: "ArgumentParser", package: "swift-argument-parser")
            ]
        ),
    ]
)
ArgmaxTestCommand.swift:
import Foundation
import ArgumentParser
import Argmax

@main
struct ArgmaxTestCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        abstract: "An example CLI tool for Argmax Pro SDK",
        subcommands: [Transcribe.self]
    )

    struct Transcribe: AsyncParsableCommand {
        static let configuration = CommandConfiguration(
            abstract: "Real-time transcription using system microphone"
        )

        @Option(help: "Argmax Pro SDK API key")
        var apiKey: String

        @Option(help: "Model name: e.g. `tiny.en` or `large-v3-v20240930_626MB`. Default: `tiny.en`")
        var modelName: String = "tiny.en"

        @Option(help: "HuggingFace token if accessing Pro models")
        var modelToken: String?

        func run() async throws {
            let coordinator = WhisperKitCoordinator(argmaxKey: apiKey, huggingFaceToken: modelToken)
            let modelRepo = "argmaxinc/whisperkit-coreml"
            // Uncomment to access Pro models (requires `modelToken`)
            // let modelRepo = "argmaxinc/whisperkit-pro"
            try await coordinator.prepare(modelName: modelName, repo: modelRepo)
            guard let whisperKitPro = coordinator.engine as? WhisperKitPro else {
                print("WhisperKitPro unavailable")
                return
            }
            let transcriber = LiveTranscriber(whisperKit: whisperKitPro)
            try await transcribeStream(transcriber: transcriber)
        }

        private func transcribeStream(transcriber: LiveTranscriber) async throws {
            print("Transcribing while streaming audio from microphone…")
            let stamp = DateFormatter()
            stamp.dateFormat = "HH:mm:ss.SSS"
            var confirmed = ""
            // Listener task: consumes LiveTranscriber events
            let listener: Task<Void, Never> = Task {
                do {
                    var started = false
                    for try await event in transcriber.events() {
                        if !started {
                            print("Started audio capture... press Ctrl+C to stop...")
                            started = true
                        }
                        let ts = stamp.string(from: Date())
                        switch event {
                        case .confirm(let txt, _):
                            if !confirmed.isEmpty && !confirmed.hasSuffix(" ") {
                                confirmed += " "
                            }
                            confirmed += txt
                        case .hypothesis(let hypo, _):
                            print("[\(ts)] \(confirmed)\u{001B}[34m\(hypo)\u{001B}[0m")
                        }
                    }
                    if !started { print("No transcription events received.") }
                } catch {
                    if !(error is CancellationError) {
                        print("Transcriber stream error:", error)
                    }
                }
            }

            // (Ctrl-C) handler graceful shutdown
            signal(SIGINT, SIG_IGN)
            let sigSrc = DispatchSource.makeSignalSource(signal: SIGINT, queue: .main)
            sigSrc.setEventHandler { [weak transcriber] in
                Task {
                    do {
                        if let result = try await transcriber?.stop() {
                            print("\n\nTranscription:\n\n\(result.text)\n")
                        }
                    } catch {
                        print("Finalization error:", error)
                    }
                    CFRunLoopStop(CFRunLoopGetMain())
                }
            }
            sigSrc.resume()

            // Suspend until the listener finishes (cancelled by SIGINT)
            _ = await listener.value
        }
    }
}
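If you want to keep the final transcript rather than only printing it, the result returned by transcriber.stop() in the SIGINT handler can be written to disk. A minimal sketch (the output path is arbitrary):

// Inside the SIGINT handler, after stopping the transcriber:
if let result = try await transcriber?.stop() {
    let outputURL = URL(fileURLWithPath: "transcript.txt") // arbitrary output location
    try result.text.write(to: outputURL, atomically: true, encoding: .utf8)
    print("\n\nTranscription saved to \(outputURL.path)\n")
}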
Step 2: Build and run in Terminal
Run the following command in your Terminal from within the top-level project directory:
Example usage:
swift run argmax-test-cli transcribe --api-key <API_KEY>
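The transcribe subcommand also accepts the other options declared in ArgmaxTestCommand.swift, which ArgumentParser exposes as --model-name and --model-token. For example, to use the larger model mentioned in the help text (add --model-token only if you switched modelRepo to argmaxinc/whisperkit-pro):

swift run argmax-test-cli transcribe --api-key <API_KEY> --model-name large-v3-v20240930_626MB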
If you observe the error "no registry configured for 'argmaxinc' scope", go back to Step 0.
Here is an example output upon successful build and launch:
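(The lines below are illustrative; timestamps and transcript text will differ on your machine, and hypothesis text is printed in blue in an actual terminal.)

Transcribing while streaming audio from microphone…
Started audio capture... press Ctrl+C to stop...
[10:42:01.341] Hello
[10:42:02.007] Hello everyone, welcome to
[10:42:03.562] Hello everyone, welcome to the demo.
^C

Transcription:

Hello everyone, welcome to the demo.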

Advanced Example
The functionality of this advanced example is identical to that of the Basic Example. The only difference is the usage of lower-level APIs to allow advanced configurability.
Step 1: Modify Basic Example
After setting up the basic example, simply change the source code of ArgmaxTestCommand.swift to the following:
import Foundation
import ArgumentParser
import Argmax

@main
struct ArgmaxTestCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        abstract: "An example CLI tool for Argmax Pro SDK",
        subcommands: [Transcribe.self]
    )

    struct Transcribe: AsyncParsableCommand {
        static let configuration = CommandConfiguration(
            abstract: "Real-time transcription using system microphone"
        )

        @Option(help: "Argmax Pro SDK API key")
        var apiKey: String

        @Option(help: "Model name: e.g. `tiny.en` or `large-v3-v20240930_626MB`. Default: `tiny.en`")
        var modelName: String = "tiny.en"

        @Option(help: "HuggingFace token if accessing Pro models")
        var modelToken: String?

        func run() async throws {
            print("Initializing Argmax Pro SDK...")
            let sdkConfig = ArgmaxConfig(apiKey: apiKey)
            await ArgmaxSDK.with(sdkConfig)
            let modelRepo = "argmaxinc/whisperkit-coreml"
            // Uncomment to access Pro models (requires `modelToken`)
            // let modelRepo = "argmaxinc/whisperkit-pro"
            print("Downloading \(modelName) model ...")
            let downloadURL = try await WhisperKitPro.download(
                variant: modelName,
                from: modelRepo,
                token: modelToken) { progress in
                if let progressString = progress.localizedDescription {
                    print("\rDownload progress: \(progressString)", terminator: "")
                    fflush(stdout)
                }
            }
            let modelFolder = downloadURL.path(percentEncoded: false)
            print("\nDownload completed: \(modelFolder)")
            let whisperKitPro = try await setupWhisperKitPro(modelFolder: modelFolder)
            try await transcribeStream(whisperKitPro: whisperKitPro)
        }

        private func setupWhisperKitPro(modelFolder: String) async throws -> WhisperKitPro {
            print("Initializing WhisperKit Pro...")
            let whisperConfig = WhisperKitProConfig(
                modelFolder: modelFolder,
                verbose: true,
                logLevel: .debug
            )
            let whisperKitPro = try await WhisperKitPro(whisperConfig)
            print("Loading WhisperKit Pro models...")
            try await whisperKitPro.loadModels()
            return whisperKitPro
        }

        private func transcribeStream(whisperKitPro: WhisperKitPro) async throws {
            print("Transcribing while streaming audio from microphone...")
            let baseOptions = DecodingOptions(
                verbose: true,
                task: .transcribe,
                wordTimestamps: true,
                chunkingStrategy: .vad
            )
            let options = DecodingOptionsPro(
                base: baseOptions,
                transcribeInterval: 0.1
            )

            // Start recording
            var audioBuffer: [Float] = []
            let lock = NSLock()
            try whisperKitPro.audioProcessor.startRecordingLive { samples in
                lock.withLock {
                    audioBuffer.append(contentsOf: samples)
                }
            }
            print("Started audio capture... press Ctrl+C to stop...")

            // Process the stream
            let dateFormatter = DateFormatter()
            dateFormatter.dateFormat = "HH:mm:ss.SSS"
            var accumulatedConfirmedText = ""
            let recordingTask = whisperKitPro.transcribeWhileRecording(
                options: options,
                audioCallback: {
                    let samples = lock.withLock {
                        let samples = audioBuffer
                        audioBuffer.removeAll()
                        return samples
                    }
                    return AudioSamples(samples: samples)
                },
                resultCallback: { result in
                    let timestamp = dateFormatter.string(from: Date())
                    accumulatedConfirmedText += result.text
                    let hypothesisText = result.hypothesisText ?? ""
                    print("[\(timestamp)] \(accumulatedConfirmedText)\u{001B}[34m\(hypothesisText)\u{001B}[0m")
                    return true
                }
            )

            var signalHandled = false
            defer {
                if !signalHandled {
                    print("Stop recording...")
                    recordingTask.stop()
                }
            }

            signal(SIGINT, SIG_IGN)
            let signalSource = DispatchSource.makeSignalSource(signal: SIGINT, queue: DispatchQueue.main)
            signalSource.setEventHandler(handler: DispatchWorkItem(block: {
                print("Stop recording...")
                signalHandled = true
                whisperKitPro.audioProcessor.stopRecording()
                print("Finalizing transcription...")
                let group = DispatchGroup()
                group.enter()
                Task {
                    do {
                        let results = try await recordingTask.finalize()
                        let mergedResult = WhisperKitProUtils.mergeTranscriptionResults(results)
                        print("\n\nTranscription: \n\n\(mergedResult.text)\n")
                    } catch {
                        print("Error finalizing recording: \(error)")
                    }
                    group.leave()
                }
                group.wait()
                Foundation.exit(0)
            }))
            signalSource.resume()
            try await recordingTask.start()
        }
    }
}
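The lower-level path also exposes the full decoding configuration. As a sketch of how you might adjust it, the snippet below pins the language and disables word timestamps; the language and wordTimestamps fields come from the standard WhisperKit DecodingOptions, while transcribeInterval is assumed to be the interval in seconds between incremental transcription passes (0.1 in the example above).

let baseOptions = DecodingOptions(
    verbose: false,
    task: .transcribe,
    language: "en",        // skip language detection for English-only input
    wordTimestamps: false, // disable word-level timestamps if they are not needed
    chunkingStrategy: .vad
)
let options = DecodingOptionsPro(
    base: baseOptions,
    transcribeInterval: 0.25 // illustrative value; a larger interval trades responsiveness for less compute
)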
Advanced Features
Pro Models
Pro SDK offers significantly faster and more energy-efficient models. These models also produce more accurate word-level timestamps.
Upgrade patch for Advanced Example:
- let config = WhisperKitConfig(model: "large-v3-v20240930")
+ let config = WhisperKitProConfig(
+ model: "large-v3-v20240930",
+ modelRepo: "argmaxinc/whisperkit-pro",
+ modelToken: "hf_*****" // Request access at https://huggingface.co/argmaxinc/whisperkit-pro
+ )
- // let modelRepo = "argmaxinc/whisperkit-pro"
+ let modelRepo = "argmaxinc/whisperkit-pro"
Pro Model Access Credentials. Please request access here. We are working on removing this extra credential requirement in the near term.
Multiple Audio Streams
This feature allows multiple input audio streams to be transcribed in real time by the same WhisperKitPro object. An example use case is concurrent real-time transcription of system audio and the microphone for meeting transcription.
Documentation coming soon.