Real Time Transcription

Context

Argmax Pro SDK includes the WhisperKitPro framework which implements an advanced streaming inference algorithm described in our paper. Key features include:

  • Accuracy is identical to offline file transcription
  • Dual output text streams can be leveraged in the user experience to build trust in stable and accurate results (Confirmed) while maximizing responsiveness (Hypothesis).
  • Streaming API design that exposes event-based callbacks, minimizing the burden on the caller

Real-time transcription streams input audio and the corresponding output text continuously during a live recording session:

  1. Input audio stream: Capturing audio in small user-defined intervals
  2. Inference: Incremental speech-to-text inference on the input stream
  3. Output text streams:
    • Confirmed: Finalized portion of the transcript that gets longer over time.
    • Hypothesis: Preliminary transcript that may still be refined as more audio context arrives.

This approach creates an ultra-low-latency user experience where words appear on screen almost as they are spoken, with occasional refinements as the model gathers more context.
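
Concretely, the sketch below shows one way to combine the two output streams for display. It mirrors the result callback in the Basic Example later in this guide, where confirmed text accumulates across callbacks while hypothesis text is replaced on every update; the function name is only illustrative.

// Minimal sketch: combining the two output streams for display.
func renderTranscript(confirmedText: String, hypothesisText: String) -> String {
    // Confirmed text accumulates and never changes; hypothesis text is the
    // latest preliminary guess and is replaced on each update.
    // Render the hypothesis in blue using an ANSI escape, as the CLI example does.
    return confirmedText + "\u{001B}[34m" + hypothesisText + "\u{001B}[0m"
}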

Basic Example

This is a complete, self-contained CLI example project that demonstrates using the Argmax Pro SDK for real-time transcription from a microphone input stream.

Step 0: Verify Pro SDK setup

Argmax Pro SDK access must be set up with SwiftPM before going through this example. If unsure, please see Upgrading to Pro SDK (Step 1 only).
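
For reference, registry-based access is configured with SwiftPM's package-registry commands. The commands below are only a generic sketch with placeholder values; use the exact registry URL and credentials given in the Upgrading to Pro SDK guide:

swift package-registry set --scope argmaxinc <REGISTRY_URL>
swift package-registry login <REGISTRY_URL> --token <REGISTRY_TOKEN>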

Step 1: Create project directory

Create a project directory with the structure shown below, then add the code that follows to Package.swift and ArgmaxTestCommand.swift:

ArgmaxSDKRealTimeTranscriptionBasicExample
├── Package.swift
└── Sources
    └── ArgmaxTestCLI
        └── ArgmaxTestCommand.swift

Package.swift:

// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.
 
import PackageDescription
 
let package = Package(
    name: "Argmax Test CLI",
    platforms: [
        .macOS(.v14)
    ],
    products: [
        .executable(
            name: "argmax-test-cli",
            targets: ["ArgmaxTestCLI"]
        )
    ],
    dependencies: [
        .package(id: "argmaxinc.argmax-sdk-swift", .upToNextMinor(from: "1.3.3")),
        .package(url: "https://github.com/apple/swift-argument-parser.git", exact: "1.3.0")
    ],
    targets: [
        .executableTarget(
            name: "ArgmaxTestCLI",
            dependencies: [
                .product(name: "Argmax", package: "argmaxinc.argmax-sdk-swift"),
                .product(name: "ArgumentParser", package: "swift-argument-parser")
            ]
        ),
    ]
)
 

ArgmaxTestCommand.swift:

import Foundation
import ArgumentParser
import Argmax
 
@main
struct ArgmaxTestCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        abstract: "An example CLI tool for Argmax Pro SDK",
        subcommands: [Transcribe.self]
    )
 
    struct Transcribe: AsyncParsableCommand {
        static let configuration = CommandConfiguration(
            abstract: "Real-time transcription using system microphone"
        )
 
        @Option(help: "Argmax Pro SDK API key")
        var apiKey: String
 
        @Option(help: "Model name: e.g. `tiny.en` or `large-v3-v20240930_626MB`. Default: `tiny.en`")
        var modelName: String = "tiny.en"
 
        @Option(help: "HuggingFace token if accessing Pro models")
        var modelToken: String?
 
        func run() async throws {
 
            print("Initializing Argmax Pro SDK...")
 
            let sdkConfig = ArgmaxConfig(apiKey: apiKey)
            await ArgmaxSDK.with(sdkConfig)
 
            let modelRepo = "argmaxinc/whisperkit-coreml"
            // Uncomment to access Pro models (requires `modelToken`)
            // let modelRepo = "argmaxinc/whisperkit-pro"
 
            print("Downloading \(modelName) model ...")
            let downloadURL = try await WhisperKitPro.download(
                variant: modelName,
                from: modelRepo,
                token: modelToken) { progress in
                    if let progressString = progress.localizedDescription {
                        print("\rDownload progress: \(progressString)", terminator: "")
                        fflush(stdout)
                        print("Calling cancel!")
                        progress.cancel()
                    }
                }
            let modelFolder = downloadURL.path(percentEncoded: false)
            print("\nDownload completed: \(modelFolder)")
 
            let whisperKitPro = try await setupWhisperKitPro(modelFolder: modelFolder)
            try await transcribeStream(whisperKitPro: whisperKitPro)
        }
 
        private func setupWhisperKitPro(modelFolder: String) async throws -> WhisperKitPro {
            print("Initializing WhisperKit Pro...")
            let whisperConfig = WhisperKitProConfig(
                modelFolder: modelFolder,
                verbose: true,
                logLevel: .debug
            )
            let whisperKitPro = try await WhisperKitPro(whisperConfig)
 
            print("Loading WhisperKit Pro models...")
            try await whisperKitPro.loadModels()
 
            return whisperKitPro
        }
 
        private func transcribeStream(whisperKitPro: WhisperKitPro) async throws {
            print("Transcribing while streaming audio from microphone...")
 
            let baseOptions = DecodingOptions(
                verbose: true,
                task: .transcribe,
                wordTimestamps: true,
                chunkingStrategy: .vad
            )
 
            let options = DecodingOptionsPro(
                base: baseOptions,
                transcribeInterval: 0.1
            )
 
            // Start recording
            var audioBuffer: [Float] = []
            let lock = NSLock()
            try whisperKitPro.audioProcessor.startRecordingLive { samples in
                lock.withLock {
                    audioBuffer.append(contentsOf: samples)
                }
            }
            print("Started audio capture... press Ctrl+C to stop...")
 
            // Process the stream
            let dateFormatter = DateFormatter()
            dateFormatter.dateFormat = "HH:mm:ss.SSS"
            var accumulatedConfirmedText = ""
            let recordingTask = whisperKitPro.transcribeWhileRecording(
                options: options,
                audioCallback: {
                    let samples = lock.withLock {
                        let samples = audioBuffer
                        audioBuffer.removeAll()
                        return samples
                    }
                    return AudioSamples(samples: samples)
                },
                resultCallback: { result in
                    let timestamp = dateFormatter.string(from: Date())
                    accumulatedConfirmedText += result.text
                    let hypothesisText = result.hypothesisText ?? ""
                    print("[\(timestamp)] \(accumulatedConfirmedText)\u{001B}[34m\(hypothesisText)\u{001B}[0m")
                    return true
                }
            )
 
            var signalHandled = false
            defer {
                if !signalHandled {
                    print("Stop recording...")
                    recordingTask.stop()
                }
            }
 
            signal(SIGINT, SIG_IGN)
 
            let signalSource = DispatchSource.makeSignalSource(signal: SIGINT, queue: DispatchQueue.main)
            signalSource.setEventHandler(handler: DispatchWorkItem(block: {
                print("Stop recording...")
                signalHandled = true
 
                whisperKitPro.audioProcessor.stopRecording()
 
                print("Finalizing transcription...")
                let group = DispatchGroup()
                group.enter()
 
                Task {
                    do {
                        let results = try await recordingTask.finalize()
                        let mergedResult = WhisperKitProUtils.mergeTranscriptionResults(results)
                        print("\n\nTranscription: \n\n\(mergedResult.text)\n")
                    } catch {
                        print("Error finalizing recording: \(error)")
                    }
                    group.leave()
                }
                group.wait()
                Foundation.exit(0)
            }))
            signalSource.resume()
            try await recordingTask.start()
        }
    }
}

Step 2: Build and run in Terminal

Run the following command in Terminal from the top-level project directory:

swift run argmax-test-cli transcribe --api-key <API_KEY>
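
The remaining flags map to the @Option properties in ArgmaxTestCommand. For example, to use a larger model (pass --model-token only when downloading from the Pro repository):

swift run argmax-test-cli transcribe --api-key <API_KEY> --model-name large-v3-v20240930_626MB --model-token <HF_TOKEN>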

If you observe "error: no registry configured for 'argmaxinc' scope", go back to Step 0.

Upon a successful build and launch, you should see initialization and download logs followed by timestamped transcription lines, with confirmed text in the default color and hypothesis text in blue.

Advanced Features

Pro Models

Pro SDK offers additional models with significantly higher speed, accuracy, and energy efficiency.

Nvidia Parakeet Models

These models are not yet supported for real-time transcription. Coming soon.

Whisper Models

These Whisper models are further optimized for speed and energy efficiency on top of their open-source counterparts. During this upgrade, accuracy remains identical while speed improves.

In order to use upgraded Whisper models, simply apply this diff to your initial configuration code:

- let config = WhisperKitConfig(model: "large-v3-v20240930")
+ let config = WhisperKitProConfig(
+     model: "large-v3-v20240930",
+     modelRepo: "argmaxinc/whisperkit-pro"
+ )
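
Applied to the CLI example from Step 1, which downloads a model explicitly and initializes from a model folder, the equivalent change is sketched below. It reuses only calls shown earlier in this guide; modelToken is the HuggingFace token option from ArgmaxTestCommand:

// Sketch: downloading and loading an upgraded Pro Whisper model,
// mirroring the Step 1 example.
let downloadURL = try await WhisperKitPro.download(
    variant: "large-v3-v20240930",
    from: "argmaxinc/whisperkit-pro",
    token: modelToken
)
let whisperConfig = WhisperKitProConfig(
    modelFolder: downloadURL.path(percentEncoded: false),
    verbose: true,
    logLevel: .debug
)
let whisperKitPro = try await WhisperKitPro(whisperConfig)
try await whisperKitPro.loadModels()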

Multiple Audio Streams

This feature allows multiple input audio streams to be transcribed in real time by the same WhisperKitPro object. An example use case is concurrent real-time transcription of system audio and microphone input for meeting transcription.

Before implementing multi-stream transcription, ensure that the ArgmaxTestCommand from Step 1 works correctly, particularly its transcribeStream function, which demonstrates the basic single-stream implementation.

Multi-Stream Architecture

The same WhisperKitPro instance can efficiently handle multiple audio streams simultaneously. Each stream gets its own recordingTask that shares the same WhisperKitPro instance but maintains independent processing context, allowing them to run concurrently without interference.

Implementation Overview

For a multi-stream setup, you will need the following:

  1. Audio Stream Sources: Bring your own audio stream sources (e.g., system audio, network streams, file streams). Optionally, you can also include microphone audio using whisperKitPro.audioProcessor.startRecordingLive
  2. Audio Processing: Convert incoming audio data to the required [Float] format
  3. Independent Buffers: Maintain separate audio buffers and locks for each stream
  4. Concurrent Processing: Start all recording tasks concurrently using a task group

Example Implementation

private func transcribeMultipleStreams(whisperKitPro: WhisperKitPro) async throws {
    let baseOptions = DecodingOptions(
        verbose: true,
        task: .transcribe,
        wordTimestamps: true,
        chunkingStrategy: .vad
    )
    
    let options = DecodingOptionsPro(
        base: baseOptions,
        transcribeInterval: 0.1
    )
    
    // Stream 1: System audio buffer (custom stream)
    var systemAudioBuffer: [Float] = []
    let systemLock = NSLock()
    
    // Stream 2: Network/File audio buffer (custom stream)
    var networkAudioBuffer: [Float] = []
    let networkLock = NSLock()
    
    // OPTIONAL: Stream 3: Microphone audio buffer (using built-in recorder)
    var micAudioBuffer: [Float] = []
    let micLock = NSLock()
    
    // Start your custom audio streams
    // Custom Stream 1: System audio capture
    startSystemAudioCapture { avAudioPCMBuffer in
        // Convert AVAudioPCMBuffer to [Float]
        // See conversion example: https://github.com/argmaxinc/WhisperKit/blob/8c0acbd2fdff83f4081aaae8b3bb7c01823d79e1/Sources/WhisperKit/Core/Audio/AudioProcessor.swift#L988
        let samples = convertAVAudioPCMBufferToFloatArray(avAudioPCMBuffer)
        
        systemLock.withLock {
            systemAudioBuffer.append(contentsOf: samples)
        }
    }
    
    // Custom Stream 2: Network/File audio
    startNetworkAudioStream { avAudioPCMBuffer in
        // Convert AVAudioPCMBuffer to [Float]
        let samples = convertAVAudioPCMBufferToFloatArray(avAudioPCMBuffer)
        
        networkLock.withLock {
            networkAudioBuffer.append(contentsOf: samples)
        }
    }
    
    // OPTIONAL: Built-in microphone recording (you can skip this if not needed)
    try whisperKitPro.audioProcessor.startRecordingLive { samples in
        micLock.withLock {
            micAudioBuffer.append(contentsOf: samples)
        }
    }
    
    // Create recording tasks for each stream
    let systemRecordingTask = whisperKitPro.transcribeWhileRecording(
        options: options,
        audioCallback: {
            let samples = systemLock.withLock {
                let samples = systemAudioBuffer
                systemAudioBuffer.removeAll()
                return samples
            }
            return AudioSamples(samples: samples)
        },
        resultCallback: { result in
            print("[SYSTEM] \(result.text)")
            return true
        }
    )
    
    let networkRecordingTask = whisperKitPro.transcribeWhileRecording(
        options: options,
        audioCallback: {
            let samples = networkLock.withLock {
                let samples = networkAudioBuffer
                networkAudioBuffer.removeAll()
                return samples
            }
            return AudioSamples(samples: samples)
        },
        resultCallback: { result in
            print("[NETWORK] \(result.text)")
            return true
        }
    )
    
    // OPTIONAL: Microphone recording task (only if using microphone)
    let micRecordingTask = whisperKitPro.transcribeWhileRecording(
        options: options,
        audioCallback: {
            let samples = micLock.withLock {
                let samples = micAudioBuffer
                micAudioBuffer.removeAll()
                return samples
            }
            return AudioSamples(samples: samples)
        },
        resultCallback: { result in
            print("[MIC] \(result.text)")
            return true
        }
    )
    
    // Start all recording tasks concurrently
    try await withThrowingTaskGroup(of: Void.self) { group in
        group.addTask {
            try await systemRecordingTask.start()
        }
        
        group.addTask {
            try await networkRecordingTask.start()
        }
        
        // OPTIONAL: Add microphone task only if using microphone
        group.addTask {
            try await micRecordingTask.start()
        }
        
        // Wait for all tasks to complete
        try await group.waitForAll()
    }
}
 
// Helper function to convert AVAudioPCMBuffer to [Float]
private func convertAVAudioPCMBufferToFloatArray(_ buffer: AVAudioPCMBuffer) -> [Float] {
    // Implementation depends on your audio format
    // See: https://github.com/argmaxinc/WhisperKit/blob/8c0acbd2fdff83f4081aaae8b3bb7c01823d79e1/Sources/WhisperKit/Core/Audio/AudioProcessor.swift#L728
    guard let channelData = buffer.floatChannelData else { return [] }
    let frameLength = Int(buffer.frameLength)
    return Array(UnsafeBufferPointer(start: channelData[0], count: frameLength))
}

Key Considerations

  • Audio Format Conversion: When working with AVAudioPCMBuffer from system audio or other sources, you'll need to convert the audio data to [Float] format. Refer to the AudioProcessor conversion example for implementation details.

  • Thread Safety: Each stream maintains its own audio buffer and lock to ensure thread-safe operations when multiple audio sources are writing simultaneously.

  • Concurrent Processing: Use Swift's structured concurrency (TaskGroup) to start all recording tasks simultaneously, enabling true multi-stream processing.

  • Resource Management: The shared WhisperKitPro instance efficiently manages computational resources across all streams while maintaining independent processing contexts.
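
The example above starts the recording tasks but does not show teardown. Below is a minimal sketch of stopping and finalizing the streams from your own stop path (for example, a signal handler as in the basic example). It reuses stopRecording(), finalize(), and mergeTranscriptionResults from the basic example and the task names from the example implementation above; the stopSystemAudioCapture and stopNetworkAudioStream helpers are hypothetical counterparts to the capture functions shown earlier:

// Stop feeding audio: your custom streams plus the built-in recorder.
stopSystemAudioCapture()     // hypothetical: stops your system audio capture
stopNetworkAudioStream()     // hypothetical: stops your network/file stream
whisperKitPro.audioProcessor.stopRecording()

// Finalize each stream independently; each returns its own results.
let systemResults = try await systemRecordingTask.finalize()
let networkResults = try await networkRecordingTask.finalize()
let micResults = try await micRecordingTask.finalize()

// Merge each stream's results into a per-stream transcript.
print("[SYSTEM]\n\(WhisperKitProUtils.mergeTranscriptionResults(systemResults).text)")
print("[NETWORK]\n\(WhisperKitProUtils.mergeTranscriptionResults(networkResults).text)")
print("[MIC]\n\(WhisperKitProUtils.mergeTranscriptionResults(micResults).text)")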