Real Time Transcription
Implementing real-time speech-to-text in your applications
Context
Real-time transcription streams input audio and output text continuously during a live session:
- Input audio stream: Capturing audio in small user-defined intervals
- Inference: Incremental speech-to-text model inference on the input stream
- Output text streams:
  - Confirmed Text: Finalized historical transcription that will not change
  - Hypothesis Text: Preliminary text that may be refined as more audio context arrives
This approach creates an ultra-low-latency user experience in which words appear on screen almost as soon as they are spoken, with occasional refinements as the model gathers more context.
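For illustration, here is a minimal sketch (not part of the SDK, with hypothetical type and property names) of how an app might model the two output streams described above:

// Hypothetical app-side model of the two output streams
struct LiveTranscript {
    var confirmedText = ""   // finalized text that will not change
    var hypothesisText = ""  // provisional text that may be replaced

    // Text to display: the stable prefix followed by the latest hypothesis
    var displayText: String { confirmedText + hypothesisText }

    mutating func update(confirmed: String, hypothesis: String?) {
        confirmedText += confirmed
        hypothesisText = hypothesis ?? ""
    }
}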
If this is your first time, start with the Open-source SDK. You can always upgrade to the Pro SDK later for more features and better performance.
Basic Example
Pro SDK
Argmax Pro SDK includes the WhisperKitPro framework, which implements an advanced streaming inference algorithm described here.
Key features:
- Accuracy is identical to offline file-based transcription
- Dual text streams can be leveraged in the user experience to build trust in stable and accurate results (Confirmed Text) while preserving responsiveness (Hypothesis Text).
- Streaming API design that exposes event-based callbacks, minimizing the burden on the caller
import Argmax

// Initialize Argmax SDK to enable Pro access
await ArgmaxSDK.with(ArgmaxConfig(apiKey: "ax_*****"))

let config = WhisperKitProConfig(model: "large-v3-v20240930")
let whisperKitPro = try await WhisperKitPro(config)

var transcription = ""          // Confirmed transcription text
var hypothesisText = ""         // Hypothesis text from the most recent transcription
var latestAudioSampleIndex = 0  // Index of the latest audio sample already sent to the transcribe task

/// Capture audio as a float array into `yourRecordingAudio`
var yourRecordingAudio: [Float] = []
...

let transcribeTask = whisperKitPro.transcribeWhileRecording(
    audioCallback: {
        // Get the audio samples captured since the last callback
        let newAudioToTranscribe = yourRecordingAudio[latestAudioSampleIndex...]
        latestAudioSampleIndex = yourRecordingAudio.count

        // Send the new audio samples to the transcribe task
        return AudioSamples(samples: Array(newAudioToTranscribe))
    },
    resultCallback: { result in
        transcription += result.text
        hypothesisText = result.hypothesisText ?? ""

        // Let the transcribe task know it should continue
        return true
    }
)
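When the user stops recording, the returned task can be finalized to retrieve the complete transcription. A minimal sketch, reusing the finalize() and WhisperKitProUtils.mergeTranscriptionResults calls shown in the advanced example below:

// After the user stops recording: finalize the task and merge the
// per-segment results into the final transcription text.
let results = try await transcribeTask.finalize()
let mergedResult = WhisperKitProUtils.mergeTranscriptionResults(results)
print(mergedResult.text)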
Open-source SDK
Argmax Open-source SDK includes the WhisperKit framework, which provides the building blocks for implementing a simple chunked streaming algorithm that approximates the real-time behavior of the Pro SDK.
import AVFoundation
import WhisperKit

// Audio processor to capture samples
class AudioChunkProcessor {
    private var audioEngine = AVAudioEngine()
    private var audioBuffer: [Float] = []
    private var chunkTimer: Timer?
    private var chunkDuration: TimeInterval = 2.0

    // Start capturing audio in chunks.
    // Note: installing an input tap on `audioEngine` that appends captured
    // samples to `audioBuffer` is omitted here for brevity.
    func startChunkedCapture(onChunkReady: @escaping ([Float]) -> Void) {
        chunkTimer = Timer.scheduledTimer(withTimeInterval: chunkDuration, repeats: true) { [weak self] _ in
            guard let self = self else { return }

            // Get the current chunk of audio and reset the buffer
            let currentChunk = Array(self.audioBuffer)
            self.audioBuffer.removeAll()

            // Process this chunk
            onChunkReady(currentChunk)
        }
    }
}

// Pseudo-real-time transcription manager
class ChunkedTranscriptionManager {
    private let whisperKit: WhisperKit
    private let audioProcessor = AudioChunkProcessor()
    private var fullTranscription = ""

    init(whisperKit: WhisperKit) {
        self.whisperKit = whisperKit
    }

    func startChunkedTranscription() {
        // Start capturing audio in chunks
        audioProcessor.startChunkedCapture { [weak self] audioChunk in
            guard let self = self else { return }

            Task {
                // Save the chunk to a temporary file (helper not shown)
                let tempURL = try self.saveSamplesToTempFile(samples: audioChunk)

                // Process with the Open-source SDK (WhisperKit)
                let result = try await self.whisperKit.transcribe(audioPath: tempURL.path)

                // Update UI with the incremental result
                await MainActor.run {
                    self.fullTranscription += result.text + " "
                    // Update UI
                }
            }
        }
    }
}
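A sketch of how the pieces above might be wired together, assuming WhisperKit is initialized with a WhisperKitConfig as in the standard WhisperKit setup (the model name is illustrative):

// Hypothetical usage of the chunked transcription manager above
let whisperKit = try await WhisperKit(WhisperKitConfig(model: "tiny.en"))
let manager = ChunkedTranscriptionManager(whisperKit: whisperKit)
manager.startChunkedTranscription()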
Advanced Example
Transcribe from System Microphone
Prerequisite: Argmax Pro SDK access must be set up with SwiftPM before going through this example. If unsure, please see Upgrading to Pro SDK (Step 1 only).
This is a complete and self-contained CLI example project that demonstrates the usage of Argmax Pro SDK for real-time transcription from a microphone input stream. Your project directory should look like this:
ArgmaxRealTimeTranscriptionAdvancedExample
├── Package.swift
└── Sources
└── ArgmaxTestCLI
└── ArgmaxTestCommand.swift
Package.swift:
// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.
import PackageDescription

let package = Package(
    name: "Argmax Test CLI",
    platforms: [
        .macOS(.v14)
    ],
    products: [
        .executable(
            name: "argmax-test-cli",
            targets: ["ArgmaxTestCLI"]
        )
    ],
    dependencies: [
        .package(id: "argmaxinc.argmax-sdk-swift", from: "1.2.0"),
        .package(url: "https://github.com/apple/swift-argument-parser.git", exact: "1.3.0")
    ],
    targets: [
        .executableTarget(
            name: "ArgmaxTestCLI",
            dependencies: [
                .product(name: "Argmax", package: "argmaxinc.argmax-sdk-swift"),
                .product(name: "ArgumentParser", package: "swift-argument-parser")
            ]
        ),
    ]
)
ArgmaxTestCommand.swift:
import Foundation
import ArgumentParser
import Argmax

@main
struct ArgmaxTestCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        abstract: "An example CLI tool for Argmax Pro SDK",
        subcommands: [Transcribe.self]
    )

    struct Transcribe: AsyncParsableCommand {
        static let configuration = CommandConfiguration(
            abstract: "Real-time transcription using system microphone"
        )

        @Option(help: "Argmax Pro SDK API key")
        var apiKey: String

        @Option(help: "Model name: e.g. `tiny.en` or `large-v3-v20240930_626MB`. Default: `tiny.en`")
        var modelName: String = "tiny.en"

        @Option(help: "HuggingFace token if accessing Pro models")
        var modelToken: String?

        func run() async throws {
            print("Initializing Argmax Pro SDK...")
            let sdkConfig = ArgmaxConfig(apiKey: apiKey)
            await ArgmaxSDK.with(sdkConfig)

            let modelRepo = "argmaxinc/whisperkit-coreml"
            // Uncomment to access Pro models (requires `modelToken`)
            // let modelRepo = "argmaxinc/whisperkit-pro"

            print("Downloading \(modelName) model ...")
            let downloadURL = try await WhisperKitPro.download(
                variant: modelName,
                from: modelRepo,
                token: modelToken) { progress in
                    if let progressString = progress.localizedDescription {
                        print("\rDownload progress: \(progressString)", terminator: "")
                        fflush(stdout)
                    }
                }
            let modelFolder = downloadURL.path(percentEncoded: false)
            print("\nDownload completed: \(modelFolder)")

            let whisperKitPro = try await setupWhisperKitPro(modelFolder: modelFolder)
            try await transcribeStream(whisperKitPro: whisperKitPro)
        }

        private func setupWhisperKitPro(modelFolder: String) async throws -> WhisperKitPro {
            print("Initializing WhisperKit Pro...")
            let whisperConfig = WhisperKitProConfig(
                modelFolder: modelFolder,
                verbose: true,
                logLevel: .debug
            )
            let whisperKitPro = try await WhisperKitPro(whisperConfig)

            print("Loading WhisperKit Pro models...")
            try await whisperKitPro.loadModels()
            return whisperKitPro
        }

        private func transcribeStream(whisperKitPro: WhisperKitPro) async throws {
            print("Transcribing while streaming audio from microphone...")
            let baseOptions = DecodingOptions(
                verbose: true,
                task: .transcribe,
                wordTimestamps: true,
                chunkingStrategy: .vad
            )
            let options = DecodingOptionsPro(
                base: baseOptions,
                transcribeInterval: 0.1
            )

            // Start recording
            var audioBuffer: [Float] = []
            let lock = NSLock()
            try whisperKitPro.audioProcessor.startRecordingLive { samples in
                lock.withLock {
                    audioBuffer.append(contentsOf: samples)
                }
            }
            print("Started audio capture... press Ctrl+C to stop...")

            // Process the stream
            let dateFormatter = DateFormatter()
            dateFormatter.dateFormat = "HH:mm:ss.SSS"
            var accumulatedConfirmedText = ""
            let recordingTask = whisperKitPro.transcribeWhileRecording(
                options: options,
                audioCallback: {
                    // Hand off all samples captured since the last callback
                    let samples = lock.withLock {
                        let samples = audioBuffer
                        audioBuffer.removeAll()
                        return samples
                    }
                    return AudioSamples(samples: samples)
                },
                resultCallback: { result in
                    let timestamp = dateFormatter.string(from: Date())
                    accumulatedConfirmedText += result.text
                    let hypothesisText = result.hypothesisText ?? ""
                    // Print confirmed text in the default color and hypothesis text in blue
                    print("[\(timestamp)] \(accumulatedConfirmedText)\u{001B}[34m\(hypothesisText)\u{001B}[0m")
                    return true
                }
            )

            var signalHandled = false
            defer {
                if !signalHandled {
                    print("Stop recording...")
                    recordingTask.stop()
                }
            }

            // Handle Ctrl+C: stop recording, finalize, and print the merged transcription
            signal(SIGINT, SIG_IGN)
            let signalSource = DispatchSource.makeSignalSource(signal: SIGINT, queue: DispatchQueue.main)
            signalSource.setEventHandler(handler: DispatchWorkItem(block: {
                print("Stop recording...")
                signalHandled = true
                whisperKitPro.audioProcessor.stopRecording()

                print("Finalizing transcription...")
                let group = DispatchGroup()
                group.enter()
                Task {
                    do {
                        let results = try await recordingTask.finalize()
                        let mergedResult = WhisperKitProUtils.mergeTranscriptionResults(results)
                        print("\n\nTranscription: \n\n\(mergedResult.text)\n")
                    } catch {
                        print("Error finalizing recording: \(error)")
                    }
                    group.leave()
                }
                group.wait()
                Foundation.exit(0)
            }))
            signalSource.resume()

            try await recordingTask.start()
        }
    }
}
Once the ArgmaxRealTimeTranscriptionAdvancedExample directory is set up as shown above, you may run swift build in your terminal from within the top-level project directory to build the CLI.
Example usage:
.build/debug/argmax-test-cli transcribe --api-key <API_KEY>
If you observe error: no registry configured for 'argmaxinc' scope, you should set up Pro SDK access by following Upgrading to Pro SDK (Step 1 only).
Example output upon successful build and launch:

Advanced Features
Pro Models
Pro SDK offers significantly faster and more energy-efficient models. These models also produce more accurate word-level timestamps.
To upgrade, simply apply this diff to your initial configuration code:
- let config = WhisperKitConfig(model: "large-v3-v20240930")
+ let config = WhisperKitProConfig(
+     model: "large-v3-v20240930",
+     modelRepo: "argmaxinc/whisperkit-pro",
+     modelToken: "hf_*****" // Request access at https://huggingface.co/argmaxinc/whisperkit-pro
+ )
For now, you need to request model access here. We are working on removing this extra credential requirement.
UI Considerations
Differentiate Confirmed and Hypothesis
Style the two output streams differently to communicate that Confirmed Text is permanent while Hypothesis Text is temporary and may still change.
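For example, a minimal SwiftUI sketch (view and property names are hypothetical) that renders Confirmed Text in the primary color and Hypothesis Text in a dimmed secondary color:

import SwiftUI

struct TranscriptView: View {
    let confirmedText: String
    let hypothesisText: String

    var body: some View {
        // Concatenated Text views keep both streams flowing as one paragraph
        (Text(confirmedText).foregroundColor(.primary)
            + Text(hypothesisText).foregroundColor(.secondary))
            .font(.body)
    }
}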
Audio Level Visualization
Show users a history of input audio levels so they can confirm that the microphone is capturing their speech.
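One way to drive such a visualization is to keep a rolling history of per-buffer energy values; the following is a minimal sketch (types and constants are illustrative, not part of the SDK):

import Accelerate

// A rolling history of input audio levels for driving a bar or waveform view
struct AudioLevelHistory {
    private(set) var levels: [Float] = []
    let maxSamples = 300  // e.g. ~30 seconds of history at 10 updates per second

    mutating func append(buffer: [Float]) {
        guard !buffer.isEmpty else { return }
        // Root-mean-square energy of the buffer as a simple loudness proxy
        var rms: Float = 0
        vDSP_rmsqv(buffer, 1, &rms, vDSP_Length(buffer.count))
        levels.append(rms)
        // Keep only the most recent samples
        if levels.count > maxSamples {
            levels.removeFirst(levels.count - maxSamples)
        }
    }
}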