Introduction
Argmax SDK is a collection of turn-key on-device inference frameworks:
WhisperKit Pro
SpeakerKit Pro
- Voice Activity Detection
- Speaker Diarization
- RTTM Output Format
- Diarized Transcription
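For readers unfamiliar with RTTM, the sketch below shows how diarization segments are conventionally serialized as RTTM `SPEAKER` records. It is an illustration of the file format only, not the SpeakerKit Pro API: the `Segment` type, the `to_rttm` function, and the speaker labels are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    speaker: str  # hypothetical speaker label, e.g. "A" or "SPEAKER_00"
    start: float  # segment onset in seconds
    end: float    # segment end in seconds


def to_rttm(file_id: str, segments: List[Segment]) -> str:
    """Render diarization segments as RTTM SPEAKER records, one per line."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        duration = seg.end - seg.start
        # RTTM fields: type, file ID, channel, onset, duration, orthography,
        # speaker type, speaker name, confidence, lookahead. Unused fields
        # are written as <NA>.
        lines.append(
            f"SPEAKER {file_id} 1 {seg.start:.3f} {duration:.3f} "
            f"<NA> <NA> {seg.speaker} <NA> <NA>"
        )
    return "\n".join(lines)
```

Each output line describes one speaker turn, so the format can be consumed by standard diarization scoring tools.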
Architecture
Argmax SDK follows an open-core architecture where the Pro SDK extends the Open-source SDK:
- Argmax Open-source SDK: WhisperKit
- Argmax Pro SDK: WhisperKit Pro, SpeakerKit Pro
This architecture was explicitly designed to facilitate seamless upgrades and downgrades between the free tier (Open-source SDK) and the paid tier (Pro SDK).
Please see Open-source vs Pro SDK for a detailed feature set comparison.
Integration
Native Apps
Argmax Pro SDK can be integrated into native apps as a Swift Package via SwiftPM.
Please see Upgrading to Pro SDK for more details.
Other Apps
Argmax Local Server is built using Argmax Pro SDK and currently offers Real-time Transcription.
Key features include:
- Node and Python client packages
- API compatible with Deepgram
- macOS only
Please see Using Local Server for more details.
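Since the server is described as API compatible with Deepgram, a client can presumably handle streaming results the same way it would handle Deepgram's. The sketch below assumes the messages follow the shape of Deepgram's live-transcription `Results` payload (`is_final`, `channel.alternatives[].transcript`). The function name is hypothetical, and the exact fields Argmax Local Server emits should be confirmed in its own documentation.

```python
import json
from typing import Optional


def final_transcript(raw_message: str) -> Optional[str]:
    """Return the transcript of a finalized result message, else None.

    Assumes a Deepgram-style JSON message; interim results and
    non-result messages (e.g. metadata) are ignored.
    """
    msg = json.loads(raw_message)
    if msg.get("type") != "Results" or not msg.get("is_final"):
        return None
    alternatives = msg.get("channel", {}).get("alternatives", [])
    if not alternatives:
        return None
    # The first alternative is treated as the best hypothesis.
    text = alternatives[0].get("transcript", "").strip()
    return text or None
```

A client would call this on every message received over the streaming connection and append the non-`None` results to the running transcript.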
Use Cases
Ambient AI for Healthcare
- Real-time streaming transcription of doctor-patient conversations
- Medically-tuned custom model support
- Speaker diarization to attribute statements to doctor and patient
- Example product built with Argmax SDK: ModMed Scribe
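To make the attribution step concrete, the sketch below assigns each transcribed word to the speaker turn it overlaps most in time and then merges consecutive words from the same speaker into statements. This is one common approach, not necessarily how Argmax SDK implements it. The `Word` and `Turn` types, the function names, and the `"unknown"` label are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap(word: Word, turn: Turn) -> float:
    """Length in seconds of the time interval shared by a word and a turn."""
    return max(0.0, min(word.end, turn.end) - max(word.start, turn.start))


def attribute_words(words: List[Word], turns: List[Turn]) -> List[Tuple[str, str]]:
    """Return (speaker, statement) pairs in spoken order."""
    statements: List[Tuple[str, str]] = []
    for word in words:
        best = max(turns, key=lambda t: overlap(word, t), default=None)
        # Words that overlap no turn at all are labeled "unknown".
        if best is not None and overlap(word, best) > 0:
            speaker = best.speaker
        else:
            speaker = "unknown"
        if statements and statements[-1][0] == speaker:
            # Same speaker as the previous word: extend the current statement.
            statements[-1] = (speaker, statements[-1][1] + " " + word.text)
        else:
            statements.append((speaker, word.text))
    return statements
```

Choosing the turn with maximum overlap, rather than the turn containing the word's start time, keeps words that straddle a speaker change with the speaker who said most of them.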
AI Meeting Notes
- Real-time streaming transcription of work meetings
- Custom keywords for accurate person and company names
- Speaker diarization to attribute statements to each meeting attendee
- Example product built with Argmax SDK: MacWhisper
Personal Dictation
- Ultra low-latency dictation
- Custom keywords for accurate person and company names
- Example product built with Argmax SDK: superwhisper
Video Content Creation
- Offline captioning (Word timestamps, SRT and VTT output formats)
- Live captioning (Real-time transcription)
- Silence removal (Voice Activity Detection)
- Text-based video editing (Word timestamps)
- Example product built with Argmax SDK: Detail
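As an illustration of how word timestamps turn into captions, the sketch below groups timestamped words into fixed-size cues and renders them as SRT. The input shape `(text, start, end)`, the function names, and the `max_words` grouping rule are assumptions made for this example. Argmax SDK provides SRT and VTT output directly, so this only shows what the format looks like.

```python
from typing import List, Tuple

TimedWord = Tuple[str, float, float]  # (text, start seconds, end seconds)


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[TimedWord], max_words: int = 7) -> str:
    """Group words into cues of at most max_words and render them as SRT."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w[0] for w in chunk)
        # Each cue: 1-based index, time range, caption text.
        cues.append(
            f"{len(cues) + 1}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    return "\n\n".join(cues) + "\n" if cues else ""
```

WebVTT output differs mainly in its `WEBVTT` header line and its use of `.` instead of `,` as the millisecond separator.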
Why on-device?
Accuracy
On-device inference does not imply smaller, less accurate models. Argmax builds systems that match or exceed the accuracy of cloud-based APIs:
- WhisperKit Pro supports the largest and most accurate open-source speech-to-text models (Whisper Large V3) on ALL iOS and macOS devices released since 2020 (iPhone 12 or newer, M1 Mac or newer).
- SpeakerKit Pro supports the state-of-the-art Pyannote-v3 system on an even wider range of devices.
For the ever-shrinking fraction of users with even older devices, Argmax offers hybrid deployment to fall back to the server-side and retain a user experience with uniform accuracy.
Upholding accuracy is our top priority (even more so than speed). We continuously benchmark our products on industry-standard test sets:
- WhisperKit is regression tested on Common Voice 17, LibriSpeech, and Earnings22. Results are hosted here.
- SpeakerKit is regression tested on 13+ datasets. Code and paper are published. Results will be hosted on Hugging Face soon.
Low Latency
Applications built with real-time inference enjoy lower latency when deployed on device instead of the cloud because on-device is:
- Optimized for minimum latency for a single user instead of maximum throughput (at the cost of higher latency) for many concurrent users
- Not exposed to global traffic, which occasionally causes cloud services to be unavailable or unexpectedly slow
- Not subject to internet roundtrip latency
Everything Else
| Concern | On-device (with Argmax) | Cloud-based |
|---|---|---|
| Availability | 100% by definition | < 100% Uptime |
| Scalability (Usage) | Unlimited | Rate-limited & concurrency-limited |
| Scalability (Cost) | Fixed | Unlimited (Usage-based) |
| Transparency | Open-core, transparent versioning | Proprietary, silent versioning |
| Data Privacy | Processed locally | Upload required |