
Real Time Transcription

Context

Argmax Pro SDK includes the WhisperKitPro framework, which implements an advanced streaming inference algorithm described in our paper. Key features include:

  • Accuracy identical to offline file transcription
  • Dual output text streams: stable, finalized results (Confirmed) build user trust while preliminary results (Hypothesis) maximize responsiveness
  • Streaming API design that exposes event-based callbacks, minimizing the burden on the caller

Real-time transcription streams input audio and the corresponding output text continuously during a live recording session:

  1. Input audio stream: Capturing audio in small user-defined intervals
  2. Inference: Incremental speech-to-text inference on the input stream
  3. Output text streams:
    • Confirmed: Finalized portion of the transcript that gets longer over time.
    • Hypothesis: Preliminary transcript that may still be refined as more audio context arrives.

This approach creates an ultra-low-latency user experience where words appear on screen almost as they are spoken, with occasional refinements as the model gathers more context.
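
As a minimal illustration of how these two streams compose on screen, here is a plain-Swift sketch (no SDK types; TranscriptView is a hypothetical helper): Confirmed text is appended permanently, while Hypothesis text is replaced wholesale on each update.

import Foundation
 
// Minimal sketch: composing the Confirmed and Hypothesis streams for display.
// `confirmed` only ever grows; `hypothesis` is replaced on every update and
// may still be revised until the corresponding audio is confirmed.
struct TranscriptView {
    private(set) var confirmed = ""
    private(set) var hypothesis = ""
 
    mutating func confirm(_ text: String) {
        if !confirmed.isEmpty && !confirmed.hasSuffix(" ") { confirmed += " " }
        confirmed += text
    }
 
    mutating func updateHypothesis(_ text: String) {
        hypothesis = text
    }
 
    // Render Hypothesis in ANSI blue, as the CLI examples below do.
    var display: String { confirmed + "\u{001B}[34m" + hypothesis + "\u{001B}[0m" }
}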





Basic Example

This is a complete and self-contained CLI example project that demonstrates the usage of Argmax Pro SDK for real-time transcription from a microphone input stream. This basic example uses higher-level APIs such as WhisperKitCoordinator and LiveTranscriber to minimize the lines of code you have to write. If you would like more advanced configurability with lower-level APIs, please see the Advanced Example section.

Step 0: Verify Pro SDK setup

Argmax Pro SDK access must be set up with SwiftPM before going through this example. If unsure, please see Upgrading to Pro SDK (Step 1 only).
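
For reference, SwiftPM registry access for the `argmaxinc` scope is configured with the swift package-registry command. A sketch of the shape of that command, where <REGISTRY_URL> is a placeholder for the registry URL provided during Pro SDK onboarding:

swift package-registry set --scope argmaxinc <REGISTRY_URL>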

Step 1: Create project directory

Create the project directory structure shown below, then insert the code from the following sections into ArgmaxTestCommand.swift and Package.swift.

ArgmaxSDKRealTimeTranscriptionBasicExample
├── Package.swift
└── Sources
    └── ArgmaxTestCLI
        └── ArgmaxTestCommand.swift

Package.swift:

// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.
 
import PackageDescription
 
let package = Package(
    name: "Argmax Test CLI",
    platforms: [
        .macOS(.v14)
    ],
    products: [
        .executable(
            name: "argmax-test-cli",
            targets: ["ArgmaxTestCLI"]
        )
    ],
    dependencies: [
        .package(id: "argmaxinc.argmax-sdk-swift", .upToNextMinor(from: "1.3.3")),
        .package(url: "https://github.com/apple/swift-argument-parser.git", exact: "1.3.0")
    ],
    targets: [
        .executableTarget(
            name: "ArgmaxTestCLI",
            dependencies: [
                .product(name: "Argmax", package: "argmaxinc.argmax-sdk-swift"),
                .product(name: "ArgumentParser", package: "swift-argument-parser")
            ]
        ),
    ]
)
 

ArgmaxTestCommand.swift:

import Foundation
import ArgumentParser
import Argmax
 
@main
struct ArgmaxTestCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        abstract: "An example CLI tool for Argmax Pro SDK",
        subcommands: [Transcribe.self]
    )
 
    struct Transcribe: AsyncParsableCommand {
        static let configuration = CommandConfiguration(
            abstract: "Real-time transcription using system microphone"
        )
 
        @Option(help: "Argmax Pro SDK API key")
        var apiKey: String
 
        @Option(help: "Model name: e.g. `tiny.en` or `large-v3-v20240930_626MB`. Default: `tiny.en`")
        var modelName: String = "tiny.en"
 
        @Option(help: "HuggingFace token if accessing Pro models")
        var modelToken: String?
 
        func run() async throws {
            let coordinator = WhisperKitCoordinator(argmaxKey: apiKey, huggingFaceToken: modelToken)
            let modelRepo = "argmaxinc/whisperkit-coreml"
 
            // Uncomment to access Pro models (requires `modelToken`)
            // let modelRepo = "argmaxinc/whisperkit-pro"
            try await coordinator.prepare(modelName: modelName, repo: modelRepo)
 
            guard let whisperKitPro = coordinator.engine as? WhisperKitPro else {
                print("WhisperKitPro unavailable")
                return
            }
 
            let transcriber = LiveTranscriber(whisperKit: whisperKitPro)
            try await transcribeStream(transcriber: transcriber)
        }
 
        private func transcribeStream(transcriber: LiveTranscriber) async throws {
            print("Transcribing while streaming audio from microphone…")
            let stamp = DateFormatter()
            stamp.dateFormat = "HH:mm:ss.SSS"
            var confirmed = ""
            // Listener task: consumes LiveTranscriber events
            let listener: Task<Void, Never> = Task {
                do {
                    var started = false
                    for try await event in transcriber.events() {
                        if !started {
                            print("Started audio capture... press Ctrl+C to stop...")
                            started = true
                        }
                        let ts = stamp.string(from: Date())
                        switch event {
                        case .confirm(let txt, _):
                            if !confirmed.isEmpty && !confirmed.hasSuffix(" ") {
                                confirmed += " "
                            }
                            confirmed += txt
                        case .hypothesis(let hypo, _):
                            print("[\(ts)] \(confirmed)\u{001B}[34m\(hypo)\u{001B}[0m")
                        }
                    }
                    if !started { print("No transcription events received.") }
                } catch {
                    if !(error is CancellationError) {
                        print("Transcriber stream error:", error)
                    }
                }
            }
            // Ctrl+C (SIGINT) handler for graceful shutdown
            signal(SIGINT, SIG_IGN)
            let sigSrc = DispatchSource.makeSignalSource(signal: SIGINT, queue: .main)
            sigSrc.setEventHandler { [weak transcriber] in
                Task {
                    do {
                        if let result = try await transcriber?.stop() {
                            print("\n\nTranscription:\n\n\(result.text)\n")
                        }
                    } catch {
                        print("Finalization error:", error)
                    }
                    CFRunLoopStop(CFRunLoopGetMain())
                }
            }
            sigSrc.resume()
            // Suspend until the listener finishes (cancelled by SIGINT)
            _ = await listener.value
        }
    }
}
 

Step 2: Build and run in Terminal

Run the following command in your Terminal from within the top-level project directory:

swift run argmax-test-cli transcribe --api-key <API_KEY>
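
The flag names map to the @Option properties in ArgmaxTestCommand.swift, so you can also select a model and, for Pro models, pass a HuggingFace token:

swift run argmax-test-cli transcribe --api-key <API_KEY> --model-name large-v3-v20240930_626MB --model-token <HF_TOKEN>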

If you observe `error: no registry configured for 'argmaxinc' scope`, go back to Step 0.

Here is an example output upon successful build and launch:


[Screenshot: example CLI output]

Advanced Example

The functionality of this advanced example is identical to that of the Basic Example. The only difference is the use of lower-level APIs that allow advanced configurability.

Step 1: Modify Basic Example

After setting up the basic example, simply change the source code of ArgmaxTestCommand.swift to the following:

import Foundation
import ArgumentParser
import Argmax
 
@main
struct ArgmaxTestCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        abstract: "An example CLI tool for Argmax Pro SDK",
        subcommands: [Transcribe.self]
    )
 
    struct Transcribe: AsyncParsableCommand {
        static let configuration = CommandConfiguration(
            abstract: "Real-time transcription using system microphone"
        )
 
        @Option(help: "Argmax Pro SDK API key")
        var apiKey: String
 
        @Option(help: "Model name: e.g. `tiny.en` or `large-v3-v20240930_626MB`. Default: `tiny.en`")
        var modelName: String = "tiny.en"
 
        @Option(help: "HuggingFace token if accessing Pro models")
        var modelToken: String?
 
        func run() async throws {
 
            print("Initializing Argmax Pro SDK...")
 
            let sdkConfig = ArgmaxConfig(apiKey: apiKey)
            await ArgmaxSDK.with(sdkConfig)
 
            let modelRepo = "argmaxinc/whisperkit-coreml"
            // Uncomment to access Pro models (requires `modelToken`)
            // let modelRepo = "argmaxinc/whisperkit-pro"
 
            print("Downloading \(modelName) model ...")
            let downloadURL = try await WhisperKitPro.download(
                variant: modelName,
                from: modelRepo,
                token: modelToken) { progress in
                    if let progressString = progress.localizedDescription {
                        print("\rDownload progress: \(progressString)", terminator: "")
                        fflush(stdout)
                        print("Calling cancel!")
                        progress.cancel()
                    }
                }
            let modelFolder = downloadURL.path(percentEncoded: false)
            print("\nDownload completed: \(modelFolder)")
 
            let whisperKitPro = try await setupWhisperKitPro(modelFolder: modelFolder)
            try await transcribeStream(whisperKitPro: whisperKitPro)
        }
 
        private func setupWhisperKitPro(modelFolder: String) async throws -> WhisperKitPro {
            print("Initializing WhisperKit Pro...")
            let whisperConfig = WhisperKitProConfig(
                modelFolder: modelFolder,
                verbose: true,
                logLevel: .debug
            )
            let whisperKitPro = try await WhisperKitPro(whisperConfig)
 
            print("Loading WhisperKit Pro models...")
            try await whisperKitPro.loadModels()
 
            return whisperKitPro
        }
 
        private func transcribeStream(whisperKitPro: WhisperKitPro) async throws {
            print("Transcribing while streaming audio from microphone...")
 
            let baseOptions = DecodingOptions(
                verbose: true,
                task: .transcribe,
                wordTimestamps: true,
                chunkingStrategy: .vad
            )
 
            let options = DecodingOptionsPro(
                base: baseOptions,
                transcribeInterval: 0.1
            )
 
            // Start recording
            var audioBuffer: [Float] = []
            let lock = NSLock()
            try whisperKitPro.audioProcessor.startRecordingLive { samples in
                lock.withLock {
                    audioBuffer.append(contentsOf: samples)
                }
            }
            print("Started audio capture... press Ctrl+C to stop...")
 
            // Process the stream
            let dateFormatter = DateFormatter()
            dateFormatter.dateFormat = "HH:mm:ss.SSS"
            var accumulatedConfirmedText = ""
            let recordingTask = whisperKitPro.transcribeWhileRecording(
                options: options,
                audioCallback: {
                    let samples = lock.withLock {
                        let samples = audioBuffer
                        audioBuffer.removeAll()
                        return samples
                    }
                    return AudioSamples(samples: samples)
                },
                resultCallback: { result in
                    let timestamp = dateFormatter.string(from: Date())
                    accumulatedConfirmedText += result.text
                    let hypothesisText = result.hypothesisText ?? ""
                    print("[\(timestamp)] \(accumulatedConfirmedText)\u{001B}[34m\(hypothesisText)\u{001B}[0m")
                    return true
                }
            )
 
            var signalHandled = false
            defer {
                if !signalHandled {
                    print("Stop recording...")
                    recordingTask.stop()
                }
            }
 
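            // Ignore default SIGINT handling so the DispatchSource below can
            // intercept Ctrl+C and finalize the transcription before exiting.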
            signal(SIGINT, SIG_IGN)
 
            let signalSource = DispatchSource.makeSignalSource(signal: SIGINT, queue: DispatchQueue.main)
            signalSource.setEventHandler(handler: DispatchWorkItem(block: {
                print("Stop recording...")
                signalHandled = true
 
                whisperKitPro.audioProcessor.stopRecording()
 
                print("Finalizing transcription...")
                let group = DispatchGroup()
                group.enter()
 
                Task {
                    do {
                        let results = try await recordingTask.finalize()
                        let mergedResult = WhisperKitProUtils.mergeTranscriptionResults(results)
                        print("\n\nTranscription: \n\n\(mergedResult.text)\n")
                    } catch {
                        print("Error finalizing recording: \(error)")
                    }
                    group.leave()
                }
                group.wait()
                Foundation.exit(0)
            }))
            signalSource.resume()
            try await recordingTask.start()
        }
    }
}

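The transcribeInterval passed to DecodingOptionsPro controls how often incremental inference runs on newly buffered audio. A minimal tuning sketch, assuming the value is in seconds (inferred from the 0.1 used above): shorter intervals refresh the Hypothesis stream more often at the cost of more inference passes.

import Argmax
 
// Hypothetical tuning sketch: trade Hypothesis refresh rate against compute.
let base = DecodingOptions(task: .transcribe, wordTimestamps: true, chunkingStrategy: .vad)
let responsive = DecodingOptionsPro(base: base, transcribeInterval: 0.1) // frequent updates
let economical = DecodingOptionsPro(base: base, transcribeInterval: 0.5) // fewer passes
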
Advanced Features

Pro Models

Pro SDK offers significantly faster and more energy-efficient models. These models also produce more accurate word-level timestamps.

Upgrade patch for Advanced Example:

- let config = WhisperKitConfig(model: "large-v3-v20240930")
+ let config = WhisperKitProConfig(
+     model: "large-v3-v20240930",
+     modelRepo: "argmaxinc/whisperkit-pro",
+     modelToken: "hf_*****" // Request access at https://huggingface.co/argmaxinc/whisperkit-pro
+ )

Upgrade patch for Basic Example:

- // let modelRepo = "argmaxinc/whisperkit-pro"
+ let modelRepo = "argmaxinc/whisperkit-pro"

Multiple Audio Streams

This feature allows multiple input audio streams to be transcribed in real time by the same WhisperKitPro object. An example use case is concurrent real-time transcription of system audio and microphone input for meeting transcription.

Documentation coming soon.