Introduction

Argmax SDK is a collection of turn-key on-device inference frameworks:

  • WhisperKit Pro
    • File Transcription
    • Real-time Transcription
    • Language Detection
    • Word Timestamps
    • Custom Keywords
    • SRT & VTT Output Format
  • SpeakerKit Pro
    • Voice Activity Detection
    • Speaker Diarization
    • RTTM Output Format
    • Diarized Transcription
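
For example, file transcription with the open-source WhisperKit API takes only a few lines, and WhisperKit Pro exposes a compatible interface. A minimal sketch, assuming a recent WhisperKit release (result types and signatures vary slightly across versions, and error handling is omitted):

```swift
import WhisperKit

Task {
    // Downloads and loads a default Whisper model on first use.
    let pipe = try await WhisperKit()

    // Transcribe a local audio file; results include segment text and,
    // with the Pro SDK, word timestamps and SRT/VTT export.
    let results = try await pipe.transcribe(audioPath: "path/to/audio.m4a")
    print(results.map(\.text).joined(separator: " "))
}
```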

Architecture

Argmax SDK follows an open-core architecture where the Pro SDK extends the Open-source SDK:

  • Argmax Open-source SDK: WhisperKit
  • Argmax Pro SDK: WhisperKit Pro, SpeakerKit Pro

This architecture was explicitly designed to facilitate seamless upgrades and downgrades between the free tier (Open-source SDK) and the paid tier (Pro SDK).

Please see Open-source vs Pro SDK for a detailed feature set comparison.

Integration

Argmax SDK may be integrated as a:

  • Swift Package via SwiftPM for native apps
  • Node package via npm for Electron and React Native apps
  • Local server that is API compatible with popular cloud-based inference providers (macOS only)

Please see Upgrading to Pro SDK for more details.
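
For the Swift Package route, the dependency is declared with standard SwiftPM conventions. A minimal Package.swift sketch using the open-source WhisperKit package (the Pro package URL is provisioned with your license, so the commented entry below is only a placeholder; pin the version to the latest release):

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyTranscriptionApp",
    platforms: [.iOS(.v16), .macOS(.v13)],
    dependencies: [
        // Open-source SDK
        .package(url: "https://github.com/argmaxinc/WhisperKit.git", from: "0.9.0"),
        // Pro SDK: replace with the package URL provided with your Pro license (placeholder)
        // .package(url: "https://github.com/argmaxinc/WhisperKitPro.git", from: "1.0.0"),
    ],
    targets: [
        .executableTarget(
            name: "MyTranscriptionApp",
            dependencies: [
                .product(name: "WhisperKit", package: "WhisperKit")
            ]
        )
    ]
)
```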

Use Cases

Video content creation

  • Offline captioning (Word timestamps, SRT and VTT output formats)
  • Live captioning (Real-time transcription)
  • Silence removal (Voice Activity Detection)
  • Text-based video editing (Word timestamps)
  • Example product built with Argmax SDK: Detail
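
As an illustration of the offline captioning path above, the sketch below renders timed transcription segments as SRT cues. The `Segment` type is an illustrative stand-in for the SDK's transcription output; in practice, WhisperKit Pro's SRT & VTT output format produces these files directly:

```swift
import Foundation

// Illustrative stand-in for a transcription segment with times in seconds.
// WhisperKit Pro returns richer structures, including word-level timestamps.
struct Segment {
    let start: Double
    let end: Double
    let text: String
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
func srtTimestamp(_ seconds: Double) -> String {
    let ms = Int((seconds * 1000).rounded())
    return String(format: "%02d:%02d:%02d,%03d",
                  ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000)
}

// Render segments as an SRT document: index, time range, text, blank line.
func renderSRT(_ segments: [Segment]) -> String {
    segments.enumerated().map { index, segment in
        "\(index + 1)\n\(srtTimestamp(segment.start)) --> \(srtTimestamp(segment.end))\n\(segment.text)\n"
    }.joined(separator: "\n")
}
```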

Ambient AI for Healthcare

  • Real-time streaming transcription of doctor-patient conversations
  • Medically-tuned custom model support
  • Speaker diarization to attribute statements to doctor and patient
  • Example product built with Argmax SDK: ModMed Scribe

Meeting Notes AI

  • Real-time streaming transcription of work meetings
  • Custom keywords for accurate person and company names
  • Speaker diarization to attribute statements to each meeting attendee (see the sketch after this list)
  • Example product built with Argmax SDK: Macwhisper
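
To make diarized transcription concrete, the sketch below attributes transcribed words to speakers by overlapping word timestamps with diarization turns. The `Word` and `SpeakerTurn` types are illustrative stand-ins for WhisperKit Pro (word timestamps) and SpeakerKit Pro (diarization) outputs; the Pro SDK's Diarized Transcription feature performs this alignment for you:

```swift
import Foundation

// Illustrative stand-ins for SDK outputs: a transcribed word with timestamps,
// and a diarization turn assigning a speaker label to a time range.
struct Word { let text: String; let start: Double; let end: Double }
struct SpeakerTurn { let speaker: String; let start: Double; let end: Double }

// Attribute each word to the speaker turn that overlaps its midpoint.
func attribute(words: [Word], turns: [SpeakerTurn]) -> [(speaker: String, text: String)] {
    words.map { word in
        let midpoint = (word.start + word.end) / 2
        let speaker = turns.first { midpoint >= $0.start && midpoint < $0.end }?.speaker ?? "Unknown"
        return (speaker, word.text)
    }
}

// Example: "hello" lands in Speaker A's turn; "hi there" in Speaker B's.
let words = [Word(text: "hello", start: 0.0, end: 0.4),
             Word(text: "hi", start: 1.0, end: 1.2),
             Word(text: "there", start: 1.2, end: 1.5)]
let turns = [SpeakerTurn(speaker: "Speaker A", start: 0.0, end: 0.8),
             SpeakerTurn(speaker: "Speaker B", start: 0.8, end: 2.0)]
for (speaker, text) in attribute(words: words, turns: turns) {
    print("\(speaker): \(text)")
}
```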

Why on-device?

Accuracy

On-device inference does not imply using smaller, less accurate models. Argmax builds systems that match or exceed the accuracy of cloud-based APIs:

  • WhisperKit Pro supports the largest and most accurate open-source speech-to-text models (Whisper Large V3) on ALL iOS and macOS devices released since 2020 (iPhone 12 or newer, M1 Mac or newer).
  • SpeakerKit Pro supports the state-of-the-art Pyannote-v3 system on an even wider range of devices.

For the ever-shrinking fraction of users on even older devices, Argmax offers hybrid deployment that falls back to server-side inference, preserving a uniform-accuracy user experience.

Upholding accuracy is our top priority (even more so than speed). We continuously benchmark our products on industry-standard test sets:

  • WhisperKit is regression tested on Common Voice 17, LibriSpeech, and Earnings22. Results are hosted here.
  • SpeakerKit is regression tested on 13+ datasets. Code and paper are published. Results will be hosted on Hugging Face soon.

Low Latency

Real-time applications enjoy lower latency with on-device inference compared to the cloud because on-device inference:

  • Does not incur internet roundtrip latency
  • Is decoupled from internet connection strength
  • Is optimized for minimum latency for a single user as opposed to maximum throughput (at the cost of higher latency) for many concurrent users
  • Is not exposed to global traffic, which occasionally makes cloud services unavailable or unexpectedly slow

Everything Else

| Concern | On-device (with Argmax) | Cloud-based |
| --- | --- | --- |
| Availability | 100% by definition | < 100% Uptime |
| Scalability (Usage) | Unlimited | Rate-limited & concurrency-limited |
| Scalability (Cost) | Fixed | Unlimited (Usage-based) |
| Transparency | Open-core, transparent versioning | Proprietary, silent versioning |
| Data Privacy | Processed locally | Upload required |