Real Time Transcription

Implementing real-time speech-to-text in your applications

Context

Real-time transcription streams input audio and output text continuously during a live session:

  1. Input audio stream: Capturing audio at small, user-defined intervals
  2. Inference: Incremental speech-to-text model inference on the input stream
  3. Output text streams:
    • Confirmed Text: Finalized historical transcription that will not change
    • Hypothesis Text: Preliminary text that may be refined as more audio context arrives

This approach creates an ultra-low-latency user experience where words appear on screen almost as they are spoken, with occasional refinements as the model gathers more context.
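
To make the two output streams concrete, here is a minimal illustrative sketch; the type and its members are hypothetical, not part of any SDK:

// Illustrative only: models the two output streams described above.
// `confirmed` only ever grows; `hypothesis` is replaced on every update.
struct StreamingTranscript {
    private(set) var confirmed = ""  // finalized text, never rewritten
    private(set) var hypothesis = "" // preliminary text, may be refined

    // Append newly finalized text and replace the current hypothesis
    mutating func update(confirmedDelta: String, newHypothesis: String) {
        confirmed += confirmedDelta
        hypothesis = newHypothesis
    }

    // What the UI should display at this moment
    var displayText: String { confirmed + hypothesis }
}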

Basic Example

Pro SDK

Argmax Pro SDK includes the WhisperKitPro framework which implements an advanced streaming inference algorithm described here.

Key features:

  • Accuracy is identical to offline file-based transcription
  • Dual text streams that can be leveraged in the user experience to build trust in stable, accurate results (Confirmed Text) while preserving responsiveness (Hypothesis Text)
  • Streaming API design that exposes event-based callbacks, minimizing the burden on the caller

import Argmax
 
// Initialize Argmax SDK to enable Pro access
await ArgmaxSDK.with(ArgmaxConfig(apiKey: "ax_*****"))
 
let config = WhisperKitProConfig(model: "large-v3-v20240930")
let whisperKitPro = try await WhisperKitPro(config)
 
var transcription = "" // Confirmed transcription text
var hypothesisText = "" // Hypothesis text from most recent transcription
var latestAudioSampleIndex = 0 // Number of audio samples already sent to the transcribe task
 
/// Capture audio as a float array into `yourRecordingAudio`
var yourRecordingAudio: [Float] = []
...
 
let transcribeTask = whisperKitPro.transcribeWhileRecording(
    audioCallback: {
        // Get latest audio samples
        let newAudioToTranscribe = Array(yourRecordingAudio[latestAudioSampleIndex...])
        latestAudioSampleIndex = yourRecordingAudio.count
 
        // Send the new audio samples to the transcribe task
        return AudioSamples(samples: newAudioToTranscribe)
    },
    resultCallback: { result in
        transcription += result.text
        hypothesisText = result.hypothesisText ?? ""
 
        // Let the transcribe task know it should continue
        return true
    }
)
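
The returned task drives the streaming session. As the advanced example below shows, processing begins with start(), and finalize() flushes any remaining audio when recording ends:

// Begin processing audio and receiving results
try await transcribeTask.start()

// ... when recording ends, flush remaining audio and merge the session results
let results = try await transcribeTask.finalize()
let finalText = WhisperKitProUtils.mergeTranscriptionResults(results).text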

Open-source SDK

Argmax Open-source SDK includes the WhisperKit framework, which provides the building blocks for implementing a simple chunked streaming loop that approximates the real-time behavior of the Pro SDK.

import AVFoundation
import WhisperKit

// Audio processor that captures microphone samples in fixed-duration chunks
class AudioChunkProcessor {
    private let audioEngine = AVAudioEngine()
    private var audioBuffer: [Float] = []
    private let chunkDuration: TimeInterval = 2.0

    // Start the microphone and deliver a chunk every `chunkDuration` seconds
    func startChunkedCapture(onChunkReady: @escaping ([Float]) -> Void) throws {
        // Tap the input node and accumulate samples from the first channel
        // (production code should synchronize access to `audioBuffer`)
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { [weak self] buffer, _ in
            guard let self = self, let channelData = buffer.floatChannelData else { return }
            let samples = UnsafeBufferPointer(start: channelData[0], count: Int(buffer.frameLength))
            self.audioBuffer.append(contentsOf: samples)
        }
        try audioEngine.start()

        // Emit the accumulated samples on a fixed interval
        Timer.scheduledTimer(withTimeInterval: chunkDuration, repeats: true) { [weak self] _ in
            guard let self = self else { return }

            // Hand off the current chunk and reset the buffer
            let currentChunk = Array(self.audioBuffer)
            self.audioBuffer.removeAll()
            onChunkReady(currentChunk)
        }
    }
}
 
// Pseudo-real-time transcription manager
class ChunkedTranscriptionManager {
    private let whisperKit: WhisperKit
    private let audioProcessor = AudioChunkProcessor()
    private var fullTranscription = ""

    init(whisperKit: WhisperKit) {
        self.whisperKit = whisperKit
    }

    func startChunkedTranscription() throws {
        // Start capturing audio in chunks
        try audioProcessor.startChunkedCapture { [weak self] audioChunk in
            guard let self = self else { return }

            Task {
                // Save the chunk to a temporary audio file (helper not shown)
                let tempURL = try self.saveSamplesToTempFile(samples: audioChunk)

                // Process with Open-Source SDK (WhisperKit)
                let result = try await self.whisperKit.transcribe(audioPath: tempURL.path)

                // Update UI with incremental result
                await MainActor.run {
                    self.fullTranscription += result.text + " "
                    // Update UI
                }
            }
        }
    }
}
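
A minimal usage sketch, assuming the WhisperKit instance is configured elsewhere (the model name below is illustrative):

// Illustrative usage of the classes above
let whisperKit = try await WhisperKit(WhisperKitConfig(model: "tiny.en"))
let manager = ChunkedTranscriptionManager(whisperKit: whisperKit)
try manager.startChunkedTranscription()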

Advanced Example

Transcribe from System Microphone

This is a complete and self-contained CLI example project that demonstrates the usage of Argmax Pro SDK for real-time transcription from a microphone input stream. Your project directory should look like this:

ArgmaxRealTimeTranscriptionAdvancedExample
├── Package.swift
└── Sources
    └── ArgmaxTestCLI
        └── ArgmaxTestCommand.swift

Package.swift:

// swift-tools-version: 5.10
// The swift-tools-version declares the minimum version of Swift required to build this package.
 
import PackageDescription
 
let package = Package(
    name: "Argmax Test CLI",
    platforms: [
        .macOS(.v14)
    ],
    products: [
        .executable(
            name: "argmax-test-cli",
            targets: ["ArgmaxTestCLI"]
        )
    ],
    dependencies: [
        .package(id: "argmaxinc.argmax-sdk-swift", from: "1.2.0"),
        .package(url: "https://github.com/apple/swift-argument-parser.git", exact: "1.3.0")
    ],
    targets: [
        .executableTarget(
            name: "ArgmaxTestCLI",
            dependencies: [
                .product(name: "Argmax", package: "argmaxinc.argmax-sdk-swift"),
                .product(name: "ArgumentParser", package: "swift-argument-parser")
            ]
        ),
    ]
)
 

ArgmaxTestCommand.swift:

import Foundation
import ArgumentParser
import Argmax
 
@main
struct ArgmaxTestCommand: AsyncParsableCommand {
    static let configuration = CommandConfiguration(
        abstract: "An example CLI tool for Argmax Pro SDK",
        subcommands: [Transcribe.self]
    )
 
    struct Transcribe: AsyncParsableCommand {
        static let configuration = CommandConfiguration(
            abstract: "Real-time transcription using system microphone"
        )
 
        @Option(help: "Argmax Pro SDK API key")
        var apiKey: String
 
        @Option(help: "Model name: e.g. `tiny.en` or `large-v3-v20240930_626MB`. Default: `tiny.en`")
        var modelName: String = "tiny.en"
 
        @Option(help: "HuggingFace token if accessing Pro models")
        var modelToken: String?
 
        func run() async throws {
 
            print("Initializing Argmax Pro SDK...")
 
            let sdkConfig = ArgmaxConfig(apiKey: apiKey)
            await ArgmaxSDK.with(sdkConfig)
 
            let modelRepo = "argmaxinc/whisperkit-coreml"
            // Uncomment to access Pro models (requires `modelToken`)
            // let modelRepo = "argmaxinc/whisperkit-pro"
 
            print("Downloading \(modelName) model ...")
            let downloadURL = try await WhisperKitPro.download(
                variant: modelName,
                from: modelRepo,
                token: modelToken) { progress in
                    if let progressString = progress.localizedDescription {
                        print("\rDownload progress: \(progressString)", terminator: "")
                        fflush(stdout)
                        print("Calling cancel!")
                        progress.cancel()
                    }
                }
            let modelFolder = downloadURL.path(percentEncoded: false)
            print("\nDownload completed: \(modelFolder)")
 
            let whisperKitPro = try await setupWhisperKitPro(modelFolder: modelFolder)
            try await transcribeStream(whisperKitPro: whisperKitPro)
        }
 
        private func setupWhisperKitPro(modelFolder: String) async throws -> WhisperKitPro {
            print("Initializing WhisperKit Pro...")
            let whisperConfig = WhisperKitProConfig(
                modelFolder: modelFolder,
                verbose: true,
                logLevel: .debug
            )
            let whisperKitPro = try await WhisperKitPro(whisperConfig)
 
            print("Loading WhisperKit Pro models...")
            try await whisperKitPro.loadModels()
 
            return whisperKitPro
        }
 
        private func transcribeStream(whisperKitPro: WhisperKitPro) async throws {
            print("Transcribing while streaming audio from microphone...")
 
            let baseOptions = DecodingOptions(
                verbose: true,
                task: .transcribe,
                wordTimestamps: true,
                chunkingStrategy: .vad
            )
 
            let options = DecodingOptionsPro(
                base: baseOptions,
                transcribeInterval: 0.1
            )
 
            // Start recording
            var audioBuffer: [Float] = []
            let lock = NSLock()
            try whisperKitPro.audioProcessor.startRecordingLive { samples in
                lock.withLock {
                    audioBuffer.append(contentsOf: samples)
                }
            }
            print("Started audio capture... press Ctrl+C to stop...")
 
            // Process the stream
            let dateFormatter = DateFormatter()
            dateFormatter.dateFormat = "HH:mm:ss.SSS"
            var accumulatedConfirmedText = ""
            let recordingTask = whisperKitPro.transcribeWhileRecording(
                options: options,
                audioCallback: {
                    let samples = lock.withLock {
                        let samples = audioBuffer
                        audioBuffer.removeAll()
                        return samples
                    }
                    return AudioSamples(samples: samples)
                },
                resultCallback: { result in
                    let timestamp = dateFormatter.string(from: Date())
                    accumulatedConfirmedText += result.text
                    let hypothesisText = result.hypothesisText ?? ""
                    print("[\(timestamp)] \(accumulatedConfirmedText)\u{001B}[34m\(hypothesisText)\u{001B}[0m")
                    return true
                }
            )
 
            var signalHandled = false
            defer {
                if !signalHandled {
                    print("Stop recording...")
                    recordingTask.stop()
                }
            }
 
            signal(SIGINT, SIG_IGN)
 
            let signalSource = DispatchSource.makeSignalSource(signal: SIGINT, queue: DispatchQueue.main)
            signalSource.setEventHandler(handler: DispatchWorkItem(block: {
                print("Stop recording...")
                signalHandled = true
 
                whisperKitPro.audioProcessor.stopRecording()
 
                print("Finalizing transcription...")
                let group = DispatchGroup()
                group.enter()
 
                Task {
                    do {
                        let results = try await recordingTask.finalize()
                        let mergedResult = WhisperKitProUtils.mergeTranscriptionResults(results)
                        print("\n\nTranscription: \n\n\(mergedResult.text)\n")
                    } catch {
                        print("Error finalizing recording: \(error)")
                    }
                    group.leave()
                }
                group.wait()
                Foundation.exit(0)
            }))
            signalSource.resume()
            try await recordingTask.start()
        }
    }
}
 

Once the ArgmaxRealTimeTranscriptionAdvancedExample directory is set up as shown above, run swift build from the top-level project directory to build the CLI.

Example usage:

.build/debug/argmax-test-cli transcribe --api-key <API_KEY>
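
You can also select a different model via the --model-name option declared in ArgmaxTestCommand.swift, for example:

.build/debug/argmax-test-cli transcribe --api-key <API_KEY> --model-name large-v3-v20240930_626MB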

If you observe error: no registry configured for 'argmaxinc' scope, you should set up Pro SDK access by following Upgrading to Pro SDK (Step 1 only).

Upon successful build and launch, the CLI prints initialization and download progress, followed by timestamped lines of Confirmed Text with the current Hypothesis Text highlighted in blue.

Advanced Features

Pro Models

Pro SDK offers significantly faster and more energy-efficient models. These models also produce more accurate word-level timestamps.

To upgrade, simply apply this diff to your initial configuration code:

- let config = WhisperKitConfig(model: "large-v3-v20240930")
+ let config = WhisperKitProConfig(
+     model: "large-v3-v20240930",
+     modelRepo: "argmaxinc/whisperkit-pro",
+     modelToken: "hf_*****" // Request access at https://huggingface.co/argmaxinc/whisperkit-pro
+ )

For now, you need to request model access at https://huggingface.co/argmaxinc/whisperkit-pro. We are working on removing this extra credential requirement.

UI Considerations

Differentiate Confirmed and Hypothesis

Style the two output streams differently to set user expectations: Confirmed Text is permanent, while Hypothesis Text is temporary and may be revised as more audio arrives. A minimal sketch follows below.
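
A minimal SwiftUI sketch of this idea, assuming the two strings come from the resultCallback shown earlier:

import SwiftUI

// Render Confirmed Text in the primary style and Hypothesis Text in a
// muted, italic style so users can tell the two streams apart.
struct TranscriptView: View {
    let confirmedText: String   // finalized, will not change
    let hypothesisText: String  // preliminary, may be refined

    var body: some View {
        (Text(confirmedText)
            + Text(hypothesisText)
                .foregroundColor(.secondary)
                .italic())
            .font(.body)
    }
}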

Audio Level Visualization

Show users live and historical input audio levels (for example, a waveform or level meter) so they can confirm the microphone is capturing speech.
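
For example, a simple level value can be derived from the latest audio samples with an RMS computation (a sketch, assuming mono Float samples like those captured above):

import Accelerate

// Compute the RMS level of a buffer of audio samples to drive a level meter
func audioLevel(of samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    var rms: Float = 0
    vDSP_rmsqv(samples, 1, &rms, vDSP_Length(samples.count))
    return rms
}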