Real Time Transcription
Implementing real-time speech recognition in your applications
Context
Real-time transcription streams input audio and output text continuously during a live session:
- Input audio stream: Capturing audio in small user-defined intervals
- Inference: Incremental speech-to-text model inference on the input stream
- Output text streams:
- Confirmed Text: Finalized historical transcription that will not change
- Hypothesis Text: Preliminary text that may be refined as more audio context arrives
This approach creates an ultra-low-latency user experience where words appear on screen almost as they are spoken, with occasional refinements as the model gathers more context.
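The two output streams can be modeled as a small piece of state that the caller updates on each incremental result. The sketch below is illustrative only (the type and method names are not SDK API): Confirmed Text is append-only, while Hypothesis Text is fully replaced on every update.

```swift
// Illustrative model of the two output text streams (not an SDK API)
struct StreamingTranscript {
    var confirmedText = ""   // Finalized text; only ever appended to
    var hypothesisText = ""  // Tentative text; replaced on every update

    // Apply one incremental result from the model
    mutating func apply(newlyConfirmed: String, hypothesis: String) {
        confirmedText += newlyConfirmed  // Confirmed text never changes once added
        hypothesisText = hypothesis      // Hypothesis is fully rewritten each time
    }

    // What the user sees on screen at any moment
    var displayText: String { confirmedText + hypothesisText }
}
```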
Basic Example
If this is your first time, start with the Open-source SDK. You can always upgrade to the Pro SDK later for more features and better performance.
Pro SDK
Argmax Pro SDK includes the WhisperKitPro framework, which implements an advanced streaming inference algorithm described here.
Key features:
- Accuracy is identical to offline file-based transcription
- Dual text streams can be leveraged in the user experience to build trust in stable and accurate results (Confirmed Text) while preserving responsiveness (Hypothesis Text).
- Streaming API design that exposes event-based callbacks, minimizing the burden on the caller
import Argmax

// Initialize Argmax SDK to enable Pro access
await ArgmaxSDK.with(ArgmaxConfig(apiKey: "ax_*****"))

let config = WhisperKitProConfig(model: "large-v3-v20240930")
let whisperKitPro = try await WhisperKitPro(config)

var transcription = ""          // Confirmed transcription text
var hypothesisText = ""         // Hypothesis text from the most recent transcription
var latestAudioSampleIndex = 0  // Index of the next audio sample to send to the transcribe task

/// Capture audio as a float array into `yourRecordingAudio`
var yourRecordingAudio: [Float] = []
...

let transcribeTask = whisperKitPro.transcribeWhileRecording(
    audioCallback: {
        // Get the audio samples that have not been sent yet
        let newAudioToTranscribe = Array(yourRecordingAudio[latestAudioSampleIndex...])
        latestAudioSampleIndex = yourRecordingAudio.count
        // Send the new audio samples to the transcribe task
        return AudioSamples(samples: newAudioToTranscribe)
    },
    resultCallback: { result in
        // Append newly confirmed text; replace the hypothesis
        transcription += result.text
        hypothesisText = result.hypothesisText
        // Return true to let the transcribe task know it should continue
        return true
    }
)
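The example above leaves audio capture to the caller. One way to fill `yourRecordingAudio` with microphone samples is an `AVAudioEngine` input tap, sketched below. This is an assumption about your capture setup, not SDK code; note that the tap delivers audio in the hardware format, so resampling to the model's expected sample rate may be needed and is omitted here.

```swift
import AVFoundation

let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode
let format = inputNode.outputFormat(forBus: 0)

// Append each captured buffer's first channel to `yourRecordingAudio`
inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
    guard let channelData = buffer.floatChannelData else { return }
    yourRecordingAudio.append(contentsOf:
        UnsafeBufferPointer(start: channelData[0], count: Int(buffer.frameLength)))
}
try audioEngine.start()
```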
Open-source SDK
Argmax Open-source SDK includes the WhisperKit framework, which offers basic building blocks for implementing a simple chunked streaming algorithm that approximates the real-time behavior of the Pro SDK.
import AVFoundation

// Audio processor to capture samples in fixed-duration chunks
class AudioChunkProcessor {
    private let audioEngine = AVAudioEngine()
    private var audioBuffer: [Float] = []
    private var chunkTimer: Timer?
    private let chunkDuration: TimeInterval = 2.0

    // Start capturing audio in chunks
    func startChunkedCapture(onChunkReady: @escaping ([Float]) -> Void) {
        // Append incoming microphone samples to the buffer
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { [weak self] buffer, _ in
            guard let self = self, let channelData = buffer.floatChannelData else { return }
            self.audioBuffer.append(contentsOf:
                UnsafeBufferPointer(start: channelData[0], count: Int(buffer.frameLength)))
        }
        try? audioEngine.start()

        // Retain the timer so chunk delivery keeps firing
        chunkTimer = Timer.scheduledTimer(withTimeInterval: chunkDuration, repeats: true) { [weak self] _ in
            guard let self = self else { return }
            // Hand off the current chunk of audio and reset the buffer
            let currentChunk = Array(self.audioBuffer)
            self.audioBuffer.removeAll()
            onChunkReady(currentChunk)
        }
    }
}
// Pseudo-real-time transcription manager
class ChunkedTranscriptionManager {
    private let whisperKit: WhisperKit
    private let audioProcessor = AudioChunkProcessor()
    private var fullTranscription = ""

    init(whisperKit: WhisperKit) {
        self.whisperKit = whisperKit
    }

    func startChunkedTranscription() {
        // Start capturing audio in chunks
        audioProcessor.startChunkedCapture { [weak self] audioChunk in
            guard let self = self else { return }
            Task {
                // Save chunk to a temporary file (`saveSamplesToTempFile` is a user-provided helper)
                let tempURL = try self.saveSamplesToTempFile(samples: audioChunk)
                // Process with the Open-source SDK (WhisperKit)
                let result = try await self.whisperKit.transcribe(audioPath: tempURL.path)
                // Update UI with the incremental result
                await MainActor.run {
                    self.fullTranscription += result.text + " "
                    // Update UI
                }
            }
        }
    }
}
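The `saveSamplesToTempFile` helper above is not part of either SDK; it must be supplied by the caller. One way to implement it, assuming the captured samples are 16 kHz mono Float32 (the input format Whisper models expect), is sketched below.

```swift
import AVFoundation

// Hypothetical helper assumed by the example above: writes 16 kHz mono
// Float32 samples to a temporary WAV file that WhisperKit can read
func saveSamplesToTempFile(samples: [Float]) throws -> URL {
    let url = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString)
        .appendingPathExtension("wav")
    // Whisper models expect 16 kHz mono input; adjust if your capture format differs
    let format = AVAudioFormat(standardFormatWithSampleRate: 16_000, channels: 1)!
    let file = try AVAudioFile(forWriting: url, settings: format.settings)
    let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                  frameCapacity: AVAudioFrameCount(samples.count))!
    buffer.frameLength = AVAudioFrameCount(samples.count)
    // Copy the samples into the buffer's single channel
    samples.withUnsafeBufferPointer { src in
        buffer.floatChannelData![0].update(from: src.baseAddress!, count: samples.count)
    }
    try file.write(from: buffer)
    return url
}
```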
Advanced Features
Pro Models
Pro SDK offers significantly faster and more energy-efficient models. These models also lead to higher accuracy word-level timestamps.
To upgrade, simply apply this diff to your initial configuration code:
- let config = WhisperKitConfig(model: "large-v3-v20240930")
+ let config = WhisperKitProConfig(
+ model: "large-v3-v20240930",
+ modelRepo: "argmaxinc/whisperkit-pro",
+ modelToken: "hf_*****" // Request access at https://huggingface.co/argmaxinc/whisperkit-pro
+ )
For now, you need to request model access here. We are working on removing this extra credential requirement.
UI Considerations
Differentiate Confirmed and Hypothesis
Style the two output streams differently to set expectations: Confirmed Text is permanent, while Hypothesis Text is temporary and may still change.
Audio Level Visualization
Show a running visualization of input audio levels so users can confirm that their microphone is capturing sound.
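One simple way to differentiate the two streams, sketched as an illustrative SwiftUI view (not part of the SDK): render Confirmed Text in the primary color and Hypothesis Text dimmed, so its tentative nature is visually obvious.

```swift
import SwiftUI

// Illustrative view: confirmed text rendered solid, hypothesis dimmed
struct TranscriptView: View {
    let confirmedText: String
    let hypothesisText: String

    var body: some View {
        // Concatenated Text views flow as one paragraph with mixed styling
        Text(confirmedText).foregroundColor(.primary)
            + Text(hypothesisText).foregroundColor(.secondary)
    }
}
```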