Introduction
Argmax SDK is a collection of turn-key on-device inference frameworks:
- WhisperKit Pro
- SpeakerKit Pro
  - Voice Activity Detection
  - Speaker Diarization
  - RTTM Output Format
  - Diarized Transcription
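For orientation, here is a minimal transcription sketch using the open-source WhisperKit API from its public README; the Pro frameworks extend the same surface, and exact signatures (for example, whether `transcribe` returns one result or an array) vary by release.

```swift
import WhisperKit

// Minimal file transcription with the open-source WhisperKit API.
// Recent releases return an array of TranscriptionResult from transcribe(audioPath:);
// older releases return a single optional result, so match your installed version.
Task {
    do {
        // Downloads and loads a default model suitable for the current device.
        let pipe = try await WhisperKit()
        let results = try await pipe.transcribe(audioPath: "path/to/audio.m4a")
        print(results.map(\.text).joined(separator: " "))
    } catch {
        print("Transcription failed: \(error)")
    }
}
```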
Architecture
Argmax SDK follows an open-core architecture where the Pro SDK extends the Open-source SDK:
- Argmax Open-source SDK: WhisperKit
- Argmax Pro SDK: WhisperKit Pro, SpeakerKit Pro
This architecture was explicitly designed to facilitate seamless upgrades and downgrades between the free tier (Open-source SDK) and the paid tier (Pro SDK).
Please see Open-source vs Pro SDK for a detailed feature set comparison.
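To illustrate the upgrade path, the intent of the open-core design is that application code written against the open-source module keeps compiling when the Pro package is swapped in. The `WhisperKitPro` module name below is an assumption for illustration; see the Pro onboarding docs for the exact product name.

```swift
// Free tier (Open-source SDK):
import WhisperKit

// Paid tier (Pro SDK): swap the package dependency and the import.
// The module name below is an assumption for illustration only.
// import WhisperKitPro

// Code written against the shared API surface is meant to run on either tier.
func makeTranscriber() async throws -> WhisperKit {
    try await WhisperKit()
}
```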
Integration
Argmax SDK may be integrated as a:
- Swift Package via SwiftPM for native apps
- Node package via npm for Electron and React Native apps
- Local server that is API compatible with popular cloud-based inference providers (macOS only)
Please see Upgrading to Pro SDK for more details.
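For the SwiftPM route, a manifest sketch is shown below. The repository URL is the public open-source WhisperKit repository; the Pro package is distributed separately (see Upgrading to Pro SDK), and the platform and version requirements here are illustrative placeholders.

```swift
// swift-tools-version:5.9
// Package.swift: adding the open-source SDK via SwiftPM.
import PackageDescription

let package = Package(
    name: "MyTranscriptionApp",
    // Placeholder platform requirements; check the SDK docs for supported minimums.
    platforms: [.iOS(.v16), .macOS(.v13)],
    dependencies: [
        // Open-source SDK; the version requirement below is a placeholder.
        // The Pro SDK package is distributed separately; see "Upgrading to Pro SDK".
        .package(url: "https://github.com/argmaxinc/WhisperKit", from: "0.9.0")
    ],
    targets: [
        .executableTarget(
            name: "MyTranscriptionApp",
            dependencies: [
                .product(name: "WhisperKit", package: "WhisperKit")
            ]
        )
    ]
)
```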
Use Cases
Video content creation
- Offline captioning (Word timestamps, SRT and VTT output formats; see the sketch after this list)
- Live captioning (Real-time transcription)
- Silence removal (Voice Activity Detection)
- Text-based video editing (Word timestamps)
- Example product built with Argmax SDK: Detail
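As a sketch of the offline-captioning bullet above, the snippet below formats WhisperKit transcription segments as an SRT document. It assumes the open-source `TranscriptionSegment` type, whose `start`/`end` are in seconds and whose `text` holds the segment transcript; VTT output would follow the same pattern with a different header and timestamp separator.

```swift
import Foundation
import WhisperKit

// Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
func srtTimestamp(_ seconds: Float) -> String {
    let totalMillis = Int((seconds * 1000).rounded())
    let hours = totalMillis / 3_600_000
    let minutes = (totalMillis / 60_000) % 60
    let secs = (totalMillis / 1000) % 60
    let millis = totalMillis % 1000
    return String(format: "%02d:%02d:%02d,%03d", hours, minutes, secs, millis)
}

// Convert transcription segments into an SRT caption document.
// Assumes the open-source WhisperKit TranscriptionSegment fields
// `start`, `end` (seconds), and `text`.
func srtDocument(from segments: [TranscriptionSegment]) -> String {
    segments.enumerated().map { index, segment in
        """
        \(index + 1)
        \(srtTimestamp(segment.start)) --> \(srtTimestamp(segment.end))
        \(segment.text.trimmingCharacters(in: .whitespaces))
        """
    }.joined(separator: "\n\n")
}
```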
Ambient AI for Healthcare
- Real-time streaming transcription of doctor-patient conversations
- Medically-tuned custom model support
- Speaker diarization to attribute statements to doctor and patient
- Example product built with Argmax SDK: ModMed Scribe
Meeting Notes AI
- Real-time streaming transcription of work meetings
- Custom keywords for accurate person and company names
- Speaker diarization to attribute statements to each meeting attendee
- Example product built with Argmax SDK: Macwhisper
Why on-device?
Accuracy
On-device inference does not imply the use of smaller, less accurate models. Argmax builds systems that match or exceed the accuracy of cloud-based APIs:
- WhisperKit Pro supports the largest and most accurate open-source speech-to-text models (Whisper Large V3) on all iOS and macOS devices released since 2020 (iPhone 12 or newer, M1 Mac or newer).
- SpeakerKit Pro supports the state-of-the-art Pyannote-v3 system on an even wider range of devices.
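To make the model claim concrete, requesting Large V3 explicitly looks roughly like the sketch below, using the open-source WhisperKit configuration API; initializer shapes have changed across releases (older versions accepted `WhisperKit(model:)` directly), so treat this as illustrative.

```swift
import WhisperKit

// Explicitly request the largest open-source Whisper model (Large V3).
// Recent WhisperKit releases configure the pipeline via WhisperKitConfig;
// older releases accepted WhisperKit(model:) directly.
Task {
    let pipe = try await WhisperKit(WhisperKitConfig(model: "large-v3"))
    let results = try await pipe.transcribe(audioPath: "interview.wav")
    print(results.map(\.text).joined(separator: " "))
}
```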
For the ever-shrinking fraction of users on even older devices, Argmax offers hybrid deployment that falls back to server-side inference, preserving a user experience with uniform accuracy.
Upholding accuracy is our top priority (even more so than speed). We continuously benchmark our products on industry-standard test sets:
- WhisperKit is regression tested on Common Voice 17, LibriSpeech, and Earnings-22. Results are hosted here.
- SpeakerKit is regression tested on 13+ datasets. Code and paper are published. Results will be hosted on Hugging Face soon.
Low Latency
On-device inference achieves lower latency than cloud-based inference because it:
- Does not incur internet round-trip latency
- Is not affected by internet connection strength
- Is optimized for minimum latency for a single user, as opposed to maximum throughput (at the cost of higher latency) for many concurrent users
- Is not exposed to global traffic spikes, which occasionally make cloud services unavailable or unexpectedly slow
Everything Else
| Concern | On-device (with Argmax) | Cloud-based |
|---|---|---|
| Availability | 100% by definition | < 100% uptime |
| Scalability (Usage) | Unlimited | Rate-limited & concurrency-limited |
| Scalability (Cost) | Fixed | Unlimited (usage-based) |
| Transparency | Open-core, transparent versioning | Proprietary, silent versioning |
| Data Privacy | Processed locally | Upload required |