Introduction

Argmax SDK is a collection of turn-key on-device inference frameworks:

  • WhisperKit Pro
    • File Transcription
    • Real-time Transcription
    • Language Detection
    • Word Timestamps
    • Custom Keywords
    • SRT & VTT Output Format
  • SpeakerKit Pro
    • Voice Activity Detection
    • Speaker Diarization
    • RTTM Output Format
    • Diarized Transcription
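
For example, file transcription with the open-source WhisperKit API takes only a few lines, and WhisperKit Pro exposes a compatible interface. A minimal sketch, assuming a recent WhisperKit release (result types and signatures vary slightly across versions, and error handling is omitted):

```swift
import WhisperKit

Task {
    // Downloads and loads a default Whisper model on first use.
    let pipe = try await WhisperKit()

    // Transcribe a local audio file; results include segment text and,
    // with the Pro SDK, word timestamps and SRT/VTT export.
    let results = try await pipe.transcribe(audioPath: "path/to/audio.m4a")
    print(results.map(\.text).joined(separator: " "))
}
```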

Architecture

Argmax SDK follows an open-core architecture where the Pro SDK extends the Open-source SDK:

  • Argmax Open-source SDK: WhisperKit
  • Argmax Pro SDK: WhisperKit Pro, SpeakerKit Pro

This architecture was explicitly designed to facilitate seamless upgrades and downgrades between the free tier (Open-source SDK) and the paid tier (Pro SDK).

Please see Open-source vs Pro SDK for a detailed feature set comparison.

Integration

Argmax SDK may be integrated as a:

  • Swift Package via SwiftPM for native apps
  • Node package via npm for Electron and React Native apps
  • Local server that is API compatible with popular cloud-based inference providers (macOS only)

Please see Upgrading to Pro SDK for more details.
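
For the Swift Package route, the dependency is declared with standard SwiftPM conventions. A minimal Package.swift sketch using the open-source WhisperKit package (the Pro package URL is provisioned with your license, so the commented entry below is only a placeholder; pin the version to the latest release):

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyTranscriptionApp",
    platforms: [.iOS(.v16), .macOS(.v13)],
    dependencies: [
        // Open-source SDK
        .package(url: "https://github.com/argmaxinc/WhisperKit.git", from: "0.9.0"),
        // Pro SDK: replace with the package URL provided with your Pro license (placeholder)
        // .package(url: "https://github.com/argmaxinc/WhisperKitPro.git", from: "1.0.0"),
    ],
    targets: [
        .executableTarget(
            name: "MyTranscriptionApp",
            dependencies: [
                .product(name: "WhisperKit", package: "WhisperKit")
            ]
        )
    ]
)
```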

Use Cases

Video content creation

  • Offline captioning (Word timestamps, SRT and VTT output formats)
  • Live captioning (Real-time transcription)
  • Silence removal (Voice Activity Detection)
  • Text-based video editing (Word timestamps)
  • Example product built with Argmax SDK: Detail
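
As an illustration of the offline captioning path above, the sketch below renders timed transcription segments as SRT cues. The `Segment` type is an illustrative stand-in for the SDK's transcription output; in practice, WhisperKit Pro's SRT & VTT output format produces these files directly:

```swift
import Foundation

// Illustrative stand-in for a transcription segment with times in seconds.
// WhisperKit Pro returns richer structures, including word-level timestamps.
struct Segment {
    let start: Double
    let end: Double
    let text: String
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
func srtTimestamp(_ seconds: Double) -> String {
    let ms = Int((seconds * 1000).rounded())
    return String(format: "%02d:%02d:%02d,%03d",
                  ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000)
}

// Render segments as an SRT document: index, time range, text, blank line.
func renderSRT(_ segments: [Segment]) -> String {
    segments.enumerated().map { index, segment in
        "\(index + 1)\n\(srtTimestamp(segment.start)) --> \(srtTimestamp(segment.end))\n\(segment.text)\n"
    }.joined(separator: "\n")
}
```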

Ambient AI for Healthcare

  • Real-time streaming transcription of doctor-patient conversations
  • Medically-tuned custom model support
  • Speaker diarization to attribute statements to doctor and patient
  • Example product built with Argmax SDK: ModMed Scribe

Meeting Notes AI

  • Real-time streaming transcription of work meetings
  • Custom keywords for accurate person and company names
  • Speaker diarization to attribute statements to each meeting attendee (see the sketch after this list)
  • Example product built with Argmax SDK: Macwhisper
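
To make diarized transcription concrete, the sketch below attributes transcribed words to speakers by overlapping word timestamps with diarization turns. The `Word` and `SpeakerTurn` types are illustrative stand-ins for WhisperKit Pro (word timestamps) and SpeakerKit Pro (diarization) outputs; the Pro SDK's Diarized Transcription feature performs this alignment for you:

```swift
import Foundation

// Illustrative stand-ins for SDK outputs: a transcribed word with timestamps,
// and a diarization turn assigning a speaker label to a time range.
struct Word { let text: String; let start: Double; let end: Double }
struct SpeakerTurn { let speaker: String; let start: Double; let end: Double }

// Attribute each word to the speaker turn that overlaps its midpoint.
func attribute(words: [Word], turns: [SpeakerTurn]) -> [(speaker: String, text: String)] {
    words.map { word in
        let midpoint = (word.start + word.end) / 2
        let speaker = turns.first { midpoint >= $0.start && midpoint < $0.end }?.speaker ?? "Unknown"
        return (speaker, word.text)
    }
}

// Example: "hello" lands in Speaker A's turn; "hi there" in Speaker B's.
let words = [Word(text: "hello", start: 0.0, end: 0.4),
             Word(text: "hi", start: 1.0, end: 1.2),
             Word(text: "there", start: 1.2, end: 1.5)]
let turns = [SpeakerTurn(speaker: "Speaker A", start: 0.0, end: 0.8),
             SpeakerTurn(speaker: "Speaker B", start: 0.8, end: 2.0)]
for (speaker, text) in attribute(words: words, turns: turns) {
    print("\(speaker): \(text)")
}
```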

Why on-device?

Accuracy

On-device inference does not imply using smaller, less accurate models. Argmax builds systems that match or exceed the accuracy of cloud-based APIs:

  • WhisperKit Pro supports the largest and most accurate open-source speech-to-text models (Whisper Large V3) on ALL iOS and macOS devices released since 2020 (iPhone 12 or newer, M1 Mac or newer).
  • SpeakerKit Pro supports the state-of-the-art Pyannote-v3 system on an even wider range of devices.

For the ever-shrinking fraction of users on even older devices, Argmax offers hybrid deployment that falls back to server-side inference, preserving a uniform-accuracy user experience.

Upholding accuracy is our top priority (even more so than speed). We continuously benchmark our products on industry-standard test sets:

  • WhisperKit is regression tested on Common Voice 17, LibriSpeech, and Earnings22. Results are hosted here.
  • SpeakerKit is regression tested on 13+ datasets. Code and paper are published. Results will be hosted on Hugging Face soon.

Low Latency

Real-time applications enjoy lower latency with on-device inference compared to the cloud because on-device inference:

  • Does not incur internet roundtrip latency
  • Is decoupled from internet connection strength
  • Is optimized for minimum latency for a single user as opposed to maximum throughput (at the cost of higher latency) for many concurrent users
  • Is not exposed to global traffic, which occasionally makes cloud services unavailable or unexpectedly slow

Everything Else

| Concern | On-device (with Argmax) | Cloud-based |
| --- | --- | --- |
| Availability | 100% by definition | < 100% Uptime |
| Scalability (Usage) | Unlimited | Rate-limited & concurrency-limited |
| Scalability (Cost) | Fixed | Unlimited (Usage-based) |
| Transparency | Open-core, transparent versioning | Proprietary, silent versioning |
| Data Privacy | Processed locally | Upload required |