Did you ever wonder what it takes to produce the Banafo offline speech-to-text transcripts?
This is how we do it behind the scenes:
While the results for English were pretty good, we were unable to keep up training the monkeys for other languages, so here is how we do it nowadays (please note that we omit some proprietary parts that we consider our trade secrets 😉).
Way too many parts are needed, and experience tells us that once you try it with enough voices, accents, and compression formats, something eventually goes wrong in every single one of them.
Audio Format Conversion
When a file gets submitted to our servers, the first thing we do is convert it to individual 16 kHz mono channels in a lossless format, so that we avoid additional quality loss from (re-)compression in the following preprocessing steps, which are done for each channel of the original audio sample:
- Noise Reduction
- Speech Extraction
- Silence Trimming
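The conversion step above can be done with a tool like ffmpeg. A minimal sketch (the file names, and the choice of FLAC as the lossless container, are illustrative, not necessarily what our servers actually run):

```python
import subprocess

def build_ffmpeg_cmd(src: str, dst: str, channel: int) -> list[str]:
    """Build an ffmpeg command that extracts one channel as 16 kHz mono FLAC.

    FLAC is lossless, so the later preprocessing steps (noise reduction,
    speech extraction, silence trimming) do not add re-compression artifacts.
    """
    return [
        "ffmpeg", "-i", src,
        "-map_channel", f"0.0.{channel}",  # pick one channel of audio stream 0
        "-ar", "16000",                    # resample to 16 kHz
        "-ac", "1",                        # mono
        dst,                               # e.g. "channel0.flac"
    ]

def convert(src: str, dst: str, channel: int = 0) -> None:
    subprocess.run(build_ffmpeg_cmd(src, dst, channel), check=True)
```

Note that newer ffmpeg versions prefer the `channelsplit` filter over `-map_channel`, but the idea is the same: one lossless 16 kHz mono file per channel.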
When all that is done, we send it to a Voice Activity Detector pipeline (and a diarization pipeline).
We need this step as the offline decoder works with small batches of up to 30 seconds at once, anything longer and the decoding becomes very slow or may even lead to a crash if the video card we use for inference runs out of memory.
Automatic speech recognition does not cope well with words cut in half, so we try to avoid splitting samples in the middle of a word. We need to find a good moment, for example when the speaker is taking a breath, to chop the recording into smaller, bite-sized pieces.
Just looking for silence works well in low noise conditions, but does not work as well when there is any sort of background noise.
Silence & Voice Activity Detection (VAD)
That’s where the voice activity detector comes into play. As the name indicates, we look for a place where no voice is detected (not just a place where there is silence) for a certain amount of time, and we split the recording into smaller pieces there.
Once we have smaller chunks of audio, we can send them to the acoustic model for further processing.
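The chunking logic above can be sketched as follows. The per-frame `voiced` flags are a hypothetical input (in practice they would come from a VAD model), and the gap and chunk lengths are illustrative, not our production settings:

```python
import numpy as np

def split_on_non_voice(voiced: np.ndarray, frame_ms: int = 30,
                       min_gap_ms: int = 300, max_chunk_s: int = 30):
    """Split a recording into chunks at non-voice gaps.

    `voiced` is a per-frame boolean array from a VAD.  Returns
    (start_frame, end_frame) pairs; a cut is made after `min_gap_ms` of
    non-voice, or forcibly once a chunk reaches `max_chunk_s`, since the
    offline decoder only handles batches of up to ~30 seconds.
    """
    min_gap = min_gap_ms // frame_ms
    max_len = max_chunk_s * 1000 // frame_ms
    chunks, start, gap = [], 0, 0
    for i, v in enumerate(voiced):
        gap = 0 if v else gap + 1
        long_gap = gap >= min_gap
        too_long = i - start + 1 >= max_len
        if (long_gap or too_long) and i > start:
            chunks.append((start, i + 1))
            start = i + 1
            gap = 0
    if start < len(voiced):
        chunks.append((start, len(voiced)))
    return chunks
```

A recording with a 300 ms non-voice gap in the middle comes back as two chunks, split at the end of the gap rather than in the middle of a word.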
If you want to read up on the theory behind VAD, or play with it yourself, several popular open-source projects are available.
Acoustic Model Emission ( the ASR / STT part )
The acoustic model looks for the most likely letters in the audio, without explicit knowledge of dictionary words or language grammar, and returns a list of the most likely candidates.
Traditionally this was done with phonemes; nowadays mostly graphemes or wordpieces are used.
Graphemes are the easiest to use; phonemes and wordpieces typically require either lookup dictionaries to convert to and from letter representations, or G2P / wordpiece models to estimate the conversion.
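A common way to read the most likely letters out of an acoustic model's per-frame output (not necessarily our exact method) is CTC best-path decoding: take the argmax in each frame, merge repeats, and drop the blank symbol. A minimal sketch with a toy grapheme inventory:

```python
import numpy as np

# Toy grapheme inventory for illustration; index 0 is the CTC "blank".
VOCAB = ["<blank>", " ", "a", "b", "c", "t"]

def greedy_ctc_decode(log_probs: np.ndarray) -> str:
    """Collapse per-frame argmax predictions into a grapheme string.

    log_probs has shape (frames, len(VOCAB)).  Repeated symbols are
    merged and blanks removed, the standard CTC best-path rule.
    """
    best = log_probs.argmax(axis=1)
    out, prev = [], -1
    for idx in best:
        if idx != prev and idx != 0:
            out.append(VOCAB[idx])
        prev = idx
    return "".join(out)
```

For example, the frame-wise best path `c c <blank> a t t` collapses to the word "cat".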
Beam Search Decoding
In the next step, we give each candidate a score, based on a sum of sub-scores for things like how many of its words appear in the lexicon, the confidence assigned by an external language model, a penalty for silences, and so on.
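The per-candidate score described above can be sketched as a weighted sum of those features. The feature names and all the weights here are illustrative, not our production values:

```python
def score_candidate(acoustic_logp: float, lm_logp: float,
                    oov_count: int, silence_frames: int,
                    lm_weight: float = 0.5, oov_penalty: float = 2.0,
                    silence_penalty: float = 0.01) -> float:
    """Combine decoder features into a single score (higher is better).

    oov_count counts candidate words missing from the lexicon; the
    weights would normally be tuned on held-out data.
    """
    return (acoustic_logp
            + lm_weight * lm_logp
            - oov_penalty * oov_count
            - silence_penalty * silence_frames)
```

A beam search keeps the few best-scoring hypotheses at each step and prunes the rest, so a candidate full of out-of-lexicon words loses to an otherwise equal candidate whose words the lexicon knows.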
Once we have found the best transcript, we force-align it to the audio to get accurate timings. This is how we can show timestamps per word (or even per letter).
Diarization & Speaker Recognition
When a recording is made with Zoiper or with the Banafo web extension, we know that the microphone channel carries the user’s voice. But on the other side of the conversation there may be multiple speakers.
We use speaker diarization to find out when the speaker changes, and speaker recognition to detect which speaker in the conversation is talking at what time.
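The speaker-recognition half of this step is often done by comparing fixed-size speaker embeddings (e.g. x-vectors) with cosine similarity. A sketch, assuming the embeddings have already been extracted by some model (the vectors here are hypothetical):

```python
import numpy as np

def identify_speaker(segment_emb: np.ndarray,
                     enrolled: dict[str, np.ndarray]) -> str:
    """Match a segment's speaker embedding to the closest enrolled speaker.

    `enrolled` maps speaker names to reference embeddings; the segment is
    assigned to the speaker with the highest cosine similarity.
    """
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(enrolled, key=lambda name: cos(segment_emb, enrolled[name]))
```

Diarization then only has to decide *where* the speaker changes; recognition decides *who* each resulting segment belongs to.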
Punctuation / Capitalization model
People are not used to reading continuous text without sentence boundaries or capital letters. Unfortunately these are not explicitly pronounced and typically are not in the output of speech recognition models.
There are several ways to reconstruct the punctuation and capitalization: some do it based on dictionaries and grammar, others based on intonation, but most commonly it is done purely on a text basis.
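A text-based system typically predicts a label per token (capitalize, append a period, comma, etc.) and then applies those labels to the raw ASR output. The labels below are hand-written for illustration; in a real pipeline they would come from a punctuation/capitalization model:

```python
def apply_punctuation(tokens: list[str], labels: list[str]) -> str:
    """Rebuild readable text from lowercase ASR tokens plus per-token labels.

    Labels: "O" = leave alone, "." / "?" = end the sentence here,
    "," = append a comma.  The first word of every sentence is capitalized.
    """
    out = []
    capitalize_next = True  # sentences start with a capital letter
    for tok, lab in zip(tokens, labels):
        word = tok.capitalize() if capitalize_next else tok
        capitalize_next = False
        if lab in (".", "?"):
            word += lab
            capitalize_next = True
        elif lab == ",":
            word += ","
        out.append(word)
    return " ".join(out)
```

Given the raw ASR stream `hello world how are you` and labels marking sentence ends after "world" and "you", this yields "Hello world. How are you?".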