# Custom Vocabulary

Improve transcription accuracy with contextual keywords.

## Context
Argmax Pro SDK includes the WhisperKitPro framework, which implements an advanced feature to improve recognition of contextual keywords registered as a Custom Vocabulary.
The feature works by performing audio-based keyword detection for each entry in the Custom Vocabulary and overwriting the original transcript whenever a keyword is detected. Use cases include:
- Person, company and product names
- Industry-specific jargon (e.g. financial, medical and engineering)
The demo above, captured in Argmax Playground, highlights the keywords detected and inserted by this feature in blue.
Accuracy benchmarks are published on OpenBench | Keyword Recognition.
## Features
- The list of keywords can be changed at runtime, enabling real-time customization and personalization of speech-to-text experiences
- This feature is designed to take any base transcription result and enrich it with keywords, unlike similar features from other vendors that tie keyword support to a specific model endpoint
- This feature allows for up to 1000 keywords without a significant slowdown. However, we recommend only registering rare keywords that are highly relevant to the context to maximize accuracy.
## Usage
Please review File Transcription or Real-time Transcription to set up the baseline transcription implementation. After that is set up, you have three options for enabling the Custom Vocabulary feature.
### Option 1: Eager Automatic
The Custom Vocabulary feature is disabled by default, and it remains disabled if you pass `nil` for the following `WhisperKitProConfig` argument:

```swift
let config = WhisperKitProConfig(..., customVocabularyConfig: nil)
```

To enable this feature, you may register your Custom Vocabulary as follows:

```swift
let config = WhisperKitProConfig(..., customVocabularyConfig: .init(words: ["Argmax", "WhisperKitPro", "SpeakerKitPro"]))
```

The `WhisperKitPro` object constructed with this config will download an auxiliary model during first use. This model runs in parallel with your primary speech-to-text model; it is very fast and, due to the parallel processing, is not expected to introduce additional latency.
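For context, a minimal end-to-end sketch of Option 1 might look like the following. The module import, the `transcribe(audioPath:)` call, and the result handling are assumptions based on a typical File Transcription setup, not a verbatim API reference; adapt them to your baseline integration.

```swift
import WhisperKit  // assumed module name; adjust to your SDK import

// Sketch: enable Custom Vocabulary at initialization time (Option 1).
func transcribeWithVocabulary(audioPath: String) async throws -> String {
    let config = WhisperKitProConfig(
        // ... your baseline configuration from File Transcription ...
        customVocabularyConfig: .init(words: ["Argmax", "WhisperKitPro", "SpeakerKitPro"])
    )
    // The auxiliary keyword model downloads on first use and runs in
    // parallel with the primary speech-to-text model.
    let whisperKitPro = try await WhisperKitPro(config)

    // `transcribe(audioPath:)` and the result shape are assumptions here.
    let results = try await whisperKitPro.transcribe(audioPath: audioPath)
    return results.map(\.text).joined(separator: " ")
}
```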
### Option 2: Deferred Automatic
If you would like to defer the Custom Vocabulary registration but eagerly start the model download and preparation, you may achieve that as follows:
```swift
let config = WhisperKitProConfig(..., customVocabularyConfig: .init())
```

Once you are ready to register your Custom Vocabulary (e.g. if the keywords in the vocabulary are only determined later during runtime), you may set the Custom Vocabulary on the `WhisperKitPro` object as follows:

```swift
try whisperKitPro.setCustomVocabulary(["Argmax", "WhisperKitPro", "SpeakerKitPro"])
```

### Option 3: Manual
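One scenario the deferred option enables is per-user personalization: the auxiliary model downloads at app launch, and keywords are registered once user-specific context becomes available. A sketch under that assumption (the contact-name source is hypothetical, and the baseline config parameters are elided):

```swift
// Sketch: defer registration but start the model download eagerly (Option 2).
let config = WhisperKitProConfig(
    // ... your baseline configuration ...
    customVocabularyConfig: .init()  // empty config triggers download/preparation
)
let whisperKitPro = try await WhisperKitPro(config)

// Later, once context-specific keywords are known (e.g. after a user signs in):
let contactNames = ["Anaïs", "Søren", "Xiomara"]  // hypothetical runtime data
try whisperKitPro.setCustomVocabulary(contactNames)
```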
The Custom Vocabulary feature relies on an auxiliary model to do keyword search in the input audio. The previous two options abstract away the underlying model selection and download. If you would like to pick a particular model and/or manage the model download explicitly, this option is for you.
Here is a convenient table with information on each model:
| Model Name | Size (MB) | Language Support | Accuracy |
|---|---|---|---|
| parakeet-tdt_ctc-110m | 102 | English-only | 0.88 |
| canary-1b-v2_474MB | 474 | 25 (Same as parakeet-v3*) | 0.92 |
Argmax SDK defaults to canary-1b-v2_474MB for maximum accuracy and language coverage. If you would like to optimize your memory usage and can constrain input to English-only, parakeet-tdt_ctc-110m is a viable option. Please note that parakeet-tdt_ctc-110m will lead to increased false positives if used with non-English input.
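If you want to encode this trade-off in code, a small helper like the following can select the model name. The names come from the table above; the selection logic itself is an illustration, not part of the SDK.

```swift
// Illustrative helper: choose the auxiliary model from the table above.
// parakeet-tdt_ctc-110m is smaller (102 MB) but English-only and prone to
// false positives on non-English audio; canary-1b-v2_474MB is the default
// for maximum accuracy and language coverage.
func customVocabularyModelName(englishOnlyInput: Bool) -> String {
    englishOnlyInput ? "parakeet-tdt_ctc-110m" : "canary-1b-v2_474MB"
}
```

The chosen name can then be passed to the explicit model download shown below.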
If you would like to explicitly manage the model download, you may use ModelStore as follows:
```swift
// Get the ModelRepo for the default custom vocabulary model
let customVocabularyModelRepo = CustomVocabularyConfig.customVocabularyModelRepo()

// Download the model to a local folder using Argmax ModelStore
let customVocabularyURL = try await modelStore.downloadModel(
    name: customVocabularyModelRepo.models.first,
    repo: customVocabularyModelRepo.repoId
)

// Alternatively, you can explicitly select one of the supported models:
// (Both models are hosted in the same repo: "argmaxinc/ctckit-pro")
//
// let customVocabularyURL = try await modelStore.downloadModel(
//     name: "canary-1b-v2_474MB",
//     repo: "argmaxinc/ctckit-pro"
// )
//
// let customVocabularyURL = try await modelStore.downloadModel(
//     name: "parakeet-tdt_ctc-110m",
//     repo: "argmaxinc/ctckit-pro"
// )
```

After the download, you may pass the downloaded model folder URL to `WhisperKitPro` as follows:
```swift
// Configure custom vocabulary with the pre-downloaded model path
let customVocabularyConfig = CustomVocabularyConfig(
    words: ["CustomKeyword1", "CustomKeywordN"],
    modelFolder: customVocabularyURL.path
)

// Initialize WhisperKitPro with the custom vocabulary config
let whisperKitPro = try await WhisperKitPro(
    WhisperKitProConfig(
        ...
        customVocabularyConfig: customVocabularyConfig
    )
)

// Keep track of the custom vocabulary model's loading state
var customVocabularyModelState = whisperKitPro.customVocabularyModelState
```

## Limitations
This feature has several limitations to note:
- This feature is tethered to the `WhisperKitPro.transcribe` function for the moment. In a future release, it will be made available as an independent capability so that the detected keywords can also be merged with transcripts not originating from `WhisperKitPro`.
- This feature can in principle be configured to work with any speech-to-text model that `WhisperKitPro` offers. However, we have only run extensive accuracy testing on Nvidia Parakeet models so far. Hence, this feature is disabled when using Whisper models until we are able to test these combinations thoroughly.
- The default configuration of this feature supports 25 languages, the same language list as `parakeet-v3`. Overriding the configuration to use a different model may change language support. Usage of this feature in unsupported languages will lead to increased false positive rates.
- Registering very short (fewer than 4 characters) or out-of-context keywords may lead to reduced accuracy.
- Keywords with invalid characters will be dropped during `setCustomVocabulary`.