Open-source vs Pro SDK
Feature Set
| Feature | Open-source SDK | Pro SDK | Pro is |
|---|---|---|---|
| WhisperKit Features | | | |
| File Transcription | ✅ | ✅ | ~9x faster |
| Language Detection | ✅ | ✅ | |
| Word Timestamps | ✅ | ✅ | |
| Custom Vocabulary | ✅ | ✅ | 10x more keywords allowed |
| SRT & VTT Output Formats | ✅ | ✅ | |
| Real-time Transcription | ⚠️ | ✅ | ~9x faster |
| Fast Model Load | | ✅ | |
| SpeakerKit Features | | | |
| Voice Activity Detection | ⚠️ | ✅ | |
| Speaker Diarization | | ✅ | |
| RTTM Output Format | | ✅ | |
| Diarized Transcription | | ✅ | |
Rough Feature Matches
Some Pro SDK features have only rough counterparts in the Open-source SDK. These are marked with ⚠️ in the table above.
- Voice Activity Detection in the Open-source SDK is implemented as a simple audio energy thresholding algorithm (called EnergyVAD). This implementation works well for separating silence from non-silence in an audio stream or file. However, it cannot distinguish voice from non-voice sounds such as microphone noise or music. In the Pro SDK, the same feature is implemented as a high-accuracy deep learning model capable of separating voice from non-voice.
- Real-time Transcription is not included in the Open-source SDK. Instead, it must be implemented in the app code, and we share an example implementation in WhisperKit/Examples/WhisperAX. The Pro SDK, by contrast, implements real-time transcription in the `WhisperKitPro` framework as a unified streaming API called `transcribeWhileRecording`. This implementation has a more robust core algorithm that matches offline transcription accuracy and is battle-tested with Enterprise customers. The new algorithm also fixes several known error modes that the Open-source example app is susceptible to.
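To illustrate why energy thresholding separates silence from non-silence but cannot tell voice from other sounds, here is a minimal sketch of the general technique. It is not the EnergyVAD implementation from the Open-source SDK; the function names, the frame length, and the threshold value are all assumptions chosen for the example, and it is written in plain Python to stay independent of either SDK.

```python
import math

def frame_energies(samples, frame_length):
    """Return the RMS energy of each non-overlapping frame (hypothetical helper)."""
    energies = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        rms = math.sqrt(sum(x * x for x in frame) / frame_length)
        energies.append(rms)
    return energies

def energy_vad(samples, frame_length=1600, threshold=0.02):
    """Mark a frame as active (True) when its RMS energy exceeds the threshold.

    The default threshold is an arbitrary illustrative value, not an SDK default.
    """
    return [e > threshold for e in frame_energies(samples, frame_length)]

# Synthetic 16 kHz input, 0.1 s per segment:
# silence, a 440 Hz tone standing in for speech, and a 60 Hz hum standing in for noise.
sample_rate = 16000
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
hum = [0.5 * math.sin(2 * math.pi * 60 * n / sample_rate) for n in range(1600)]

decisions = energy_vad(silence + tone + hum)
print(decisions)  # [False, True, True]
```

The silent frame is rejected, but the tone and the hum are both flagged as active because they carry the same energy. Distinguishing them requires a model that looks at the content of the signal rather than its loudness, which is the gap the deep learning VAD in the Pro SDK is described as filling.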
We are continuously improving both SDKs, and we intend to smooth out the rough edges of our Open-source SDK over time. However, our current focus is on ensuring that the Argmax Pro SDK is best-in-class.