Introduction

Argmax SDK is a collection of turn-key on-device inference frameworks:

  • WhisperKit Pro
    • File Transcription
    • Real-time Transcription
    • Language Detection
    • Word Timestamps
    • Custom Keywords
    • SRT & VTT Output Format
  • SpeakerKit Pro
    • Voice Activity Detection
    • Speaker Diarization
    • RTTM Output Format
    • Diarized Transcription

Architecture

Argmax SDK follows an open-core architecture where the Pro SDK extends the Open-source SDK:

  • Argmax Open-source SDK: WhisperKit
  • Argmax Pro SDK: WhisperKit Pro, SpeakerKit Pro

This architecture was explicitly designed to facilitate seamless upgrades and downgrades between the free tier (Open-source SDK) and the paid tier (Pro SDK).

Please see Open-source vs Pro SDK for a detailed feature set comparison.

Integration

Native Apps

For native apps, Argmax Pro SDK can be integrated as a Swift package via Swift Package Manager (SwiftPM).

Please see Upgrading to Pro SDK for more details.

Other Apps

Argmax Local Server is built using Argmax Pro SDK and currently offers Real-time Transcription.

Key features include:

  • Node and Python client packages
  • API compatible with Deepgram
  • macOS only

Please see Using Local Server for more details.

Use Cases

Ambient AI for Healthcare

  • Real-time streaming transcription of doctor-patient conversations
  • Medically-tuned custom model support
  • Speaker diarization to attribute statements to doctor and patient
  • Example product built with Argmax SDK: ModMed Scribe

AI Meeting Notes

  • Real-time streaming transcription of work meetings
  • Custom keywords for accurate person and company names
  • Speaker diarization to attribute statements to each meeting attendee
  • Example product built with Argmax SDK: MacWhisper
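
To make the diarized transcription idea concrete, here is a minimal Python sketch of attributing transcribed words to speakers. It is not the Argmax SDK API (SpeakerKit Pro offers Diarized Transcription itself). It assumes diarization turns are available as RTTM text and that words arrive as hypothetical `(text, start, end)` tuples in seconds. The maximum-overlap rule and the names `parse_rttm` and `attribute_words` are illustrative choices.

```python
from typing import List, Optional, Tuple

Turn = Tuple[str, float, float]  # (speaker, start_seconds, end_seconds)
Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds); hypothetical shape


def parse_rttm(text: str) -> List[Turn]:
    """Read SPEAKER records from RTTM text.

    In an RTTM record, field 4 is the turn onset, field 5 the turn
    duration, and field 8 the speaker name (all whitespace-separated).
    """
    turns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker records
        onset, duration = float(fields[3]), float(fields[4])
        turns.append((fields[7], onset, onset + duration))
    return turns


def attribute_words(
    words: List[Word], turns: List[Turn]
) -> List[Tuple[Optional[str], str]]:
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for text, start, end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            overlap = min(end, t_end) - max(start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        # Words that overlap no turn keep the speaker None.
        labeled.append((best_speaker, text))
    return labeled
```

A word that straddles a speaker change goes to whichever turn covers more of it. Words outside every turn are left unattributed rather than guessed.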

Personal Dictation

  • Ultra low-latency dictation
  • Custom keywords for accurate person and company names
  • Example product built with Argmax SDK: superwhisper

Video Content Creation

  • Offline captioning (Word timestamps, SRT and VTT output formats)
  • Live captioning (Real-time transcription)
  • Silence removal (Voice Activity Detection)
  • Text-based video editing (Word timestamps)
  • Example product built with Argmax SDK: Detail
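
As an illustration of the offline captioning use case, here is a minimal Python sketch that turns word-level timestamps into SRT cues. It is not the Argmax SDK API (WhisperKit Pro produces SRT and VTT output itself). The input shape (a list of `(text, start, end)` tuples in seconds), the fixed words-per-cue grouping, and the names `srt_timestamp` and `words_to_srt` are assumptions for illustration.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds); hypothetical shape


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group consecutive words into cues of at most max_words and render SRT."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]  # first word start, last word end
        text = " ".join(word[0] for word in chunk)
        index = len(cues) + 1  # SRT cue numbers start at 1
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)  # a blank line separates consecutive cues
```

WebVTT output would follow the same structure, with a `WEBVTT` header line and a period instead of a comma before the milliseconds. A production captioner would more likely break cues at pauses or punctuation than at a fixed word count.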

Why on-device?

Accuracy

On-device inference does not require smaller, less accurate models. Argmax builds systems that match or exceed the accuracy of cloud-based APIs:

  • WhisperKit Pro supports the largest and most accurate open-source speech-to-text models (Whisper Large V3) on ALL iOS and macOS devices released since 2020 (iPhone 12 or newer, M1 Mac or newer).
  • SpeakerKit Pro supports the state-of-the-art Pyannote-v3 system on an even wider range of devices.

For the ever-shrinking fraction of users with even older devices, Argmax offers hybrid deployment that falls back to server-side inference, so every user gets the same accuracy.

Upholding accuracy is our top priority (even more so than speed). We continuously benchmark our products on industry-standard test sets:

  • WhisperKit is regression tested on Common Voice 17, LibriSpeech, and Earnings22. Results are hosted here.
  • SpeakerKit is regression tested on 13+ datasets. Code and paper are published. Results will be hosted on Hugging Face soon.

Low Latency

Applications built on real-time inference see lower latency when deployed on-device instead of in the cloud because on-device inference is:

  • Optimized for minimum latency for a single user instead of maximum throughput (at the cost of higher latency) for many concurrent users
  • Not exposed to global traffic, which occasionally causes cloud services to become unavailable or unexpectedly slow
  • Not subject to internet roundtrip latency

Everything Else

Concern             | On-device (with Argmax)           | Cloud-based
Availability        | 100% by definition                | < 100% uptime
Scalability (Usage) | Unlimited                         | Rate-limited & concurrency-limited
Scalability (Cost)  | Fixed                             | Unlimited (usage-based)
Transparency        | Open-core, transparent versioning | Proprietary, silent versioning
Data Privacy        | Processed locally                 | Upload required