Custom Vocabulary
Improve transcription accuracy with contextual keywords
Context
Argmax Pro SDK includes the WhisperKitPro framework, which implements an advanced feature to improve the recognition of special terms (keywords) that are registered as a Custom Vocabulary.
The feature works by performing keyword detection for each keyword in the Custom Vocabulary and overwriting the original transcript whenever a keyword is detected. Use cases include:
- Person, company and product names
- Industry-specific jargon (e.g. financial, medical and engineering)
Usage
Please review File Transcription or Real-time Transcription to set up the baseline transcription implementation. After that is set up, you have two options for enabling the Custom Vocabulary feature.
Option 1: Eager Initialization
The Custom Vocabulary feature is disabled by default and remains disabled if you pass nil for the following WhisperKitProConfig argument:
let config = WhisperKitProConfig(..., customVocabularyConfig: nil)
To enable this feature, you may register your Custom Vocabulary as follows:
let config = WhisperKitProConfig(..., customVocabularyConfig: .init(words: ["Argmax", "WhisperKitPro", "SpeakerKitPro"]))
The WhisperKitPro object constructed using this config will download an auxiliary ~102MB model during first use. This model runs in parallel with your primary speech-to-text model. It is very fast and, because it runs in parallel, is not expected to introduce any additional latency.
Option 2: Deferred Initialization
If you would like to defer the Custom Vocabulary registration but eagerly start the model download and preparation, you may achieve that as follows:
let config = WhisperKitProConfig(..., customVocabularyConfig: .init())
Once you are ready to register your Custom Vocabulary (e.g. if the keywords in the vocabulary are not determined until after a certain runtime event occurs), you may set the Custom Vocabulary on the WhisperKitPro object as follows:
try whisperKitPro.setCustomVocabulary(["Argmax", "WhisperKitPro", "SpeakerKitPro"])
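As a minimal sketch of the deferred case, the keywords might come from names that only become known at runtime (for example, meeting participants). The helper `vocabularyWords(from:)` below is hypothetical and not part of WhisperKitPro. It also assumes that each vocabulary entry should be a single word, which this page does not confirm.

```swift
/// Hypothetical helper (not part of WhisperKitPro): turns free-form names
/// collected at runtime into a de-duplicated list of single words.
/// Assumption: each Custom Vocabulary entry should be a single word.
func vocabularyWords(from names: [String]) -> [String] {
    var seen = Set<String>()
    var words: [String] = []
    for name in names {
        // Split on whitespace so a full name yields one entry per word.
        for word in name.split(whereSeparator: { $0.isWhitespace }) {
            // Keep the first spelling of each word, ignoring case.
            if seen.insert(word.lowercased()).inserted {
                words.append(String(word))
            }
        }
    }
    return words
}

let words = vocabularyWords(from: ["Argmax", "argmax", "Kioxia Holdings"])
// With an existing `whisperKitPro` object, the registration call shown
// in this section would then be:
// try whisperKitPro.setCustomVocabulary(words)
```

The resulting array can be passed to the registration call once the runtime event has occurred; the model download will already have started because the config was created with an empty customVocabularyConfig.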
Limitations
This feature is currently in alpha and there are several limitations to note:
- This feature is currently only usable in conjunction with the WhisperKitPro.transcribe function. It will also be made available as an independent API so that the detected keywords can be merged with transcripts not originating from WhisperKitPro.
- The words in the Custom Vocabulary must consist of English letters. However, the words may be from any language, e.g. ["Kioxia", "Masayoshi", "HepsiBurada"] is legal but ["まさよし", "Çağdaş"] is not. Input checks are not yet implemented. Multilingual support will be implemented soon.
- This feature is a modular processing block and can, in principle, be combined with any speech-to-text model that WhisperKitPro offers. However, we have only run extensive testing on Nvidia Parakeet models. Hence, this feature is disabled when using Whisper models until we are able to test these combinations thoroughly.
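Because input checks are not yet implemented, it may be worth validating the vocabulary on the caller's side before registering it. The following is a minimal sketch, not part of WhisperKitPro: the function name `invalidVocabularyWords` is hypothetical, and it assumes that "English letters" means the ASCII letters A-Z and a-z only.

```swift
/// Hypothetical pre-check (not part of WhisperKitPro): returns the entries
/// that are empty or contain anything other than ASCII letters.
/// Assumption: "English letters" means A-Z and a-z only.
func invalidVocabularyWords(_ words: [String]) -> [String] {
    words.filter { word in
        // A word is rejected if it is empty or has any non-ASCII-letter character.
        word.isEmpty || !word.allSatisfy { $0.isASCII && $0.isLetter }
    }
}

let candidates = ["Kioxia", "Masayoshi", "HepsiBurada", "まさよし", "Çağdaş"]
let rejected = invalidVocabularyWords(candidates)
// Register only the entries that passed the check.
let accepted = candidates.filter { !rejected.contains($0) }
```

Returning the rejected entries, rather than a single Boolean, lets the caller report which words need to be transliterated before registration.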