An overview of the open-source speech-to-text frameworks and toolkits available in 2022: where to get them, how hard they are to use, and which languages they support out of the box.
We at Banafo are constantly trying newer and better approaches to automatic speech recognition, a crucial component in providing our users with the highest-quality transcripts on the market.
The frameworks in this article are all tools we use or have used; most of them build on top of PyTorch or ArrayFire. This article is only a short overview of the pros and cons. More detailed instructions on how to set up, configure, or use them will follow in future articles.
Of all the frameworks listed, only Vosk is straightforward to use in production. Unfortunately, because it is built on somewhat dated technology, its quality is not nearly as good as what newer frameworks can achieve with proper training.
Training a really good model takes many months on eight GPUs and tens of thousands of hours of meticulously cleaned datasets, per language. If you need something production-ready with reasonable cost, setup effort, and time to market, you are probably better off looking at the Banafo API or the Amazon, Google, or Microsoft Azure offerings.
The most popular ASR toolkit options are:
1. Flashlight ASR
Flashlight is focused on performance. It is written in C++ on top of ArrayFire and can be used on both CPU and GPU. It is the Banafo framework of choice, and we rely heavily on it for the speech-to-text part of Banafo.
Several recipes are provided as examples, as well as some pretrained example models for different languages.
- Maintained by a very capable ML team
- Many recipes to choose from (with links to the papers they implement)
- Very efficient use of GPU
- Preloading with on-the-fly augmentation (noise, reverb, stretching, compression, SpecAugment, pitch changes)
- Relatively small user base
- Difficult to setup
- Model compatibility is not guaranteed between releases (models trained on one version do not work on a newer one without manual conversion)
- Default models do not work all that well on real-life conversational audio
2. Kaldi STT
Kaldi ASR, written by Daniel Povey, is the old kid on the block: it was one of the first and remains one of the most widely used speech-to-text engines today, in university courses, speech research, and commercial deployments. Many other toolkits also use parts of Kaldi in the preprocessing stages of training. Recipes are called “egs”, and they include scripts to download and preprocess most common freely available datasets. The original Kaldi can produce high-quality transcripts from very few hours of data, unlike the more recent end-to-end toolkits in our list, which often require thousands of hours to achieve high quality.
We used Kaldi extensively in the early days of Banafo, but since it needs phonetic transcripts and we want to support many languages, we decided to move on to more recent frameworks that can work with graphemes or wordpieces instead of phonemes.
You can find the original Kaldi here:
And the project home here:
A new version of Kaldi is in the making; we have no hands-on experience with it yet. You can learn a bit more about it in this Interspeech 2021 presentation:
- Widely used
- Has a lot of ready to use recipes
- Does not require a lot of processing power for training
- Does not require a lot of cpu or gpu cycles for inference / decoding
- Requires tens or hundreds of hours of labelled data instead of tens of thousands of hours
- The pipelines are well structured in different stages
- Has quite a few supported languages through the VOSK project
- Supports streaming and offline decoding
- A bit hard to setup
- A bit messy, with lots of Python scripts calling Perl scripts
- Because of the many steps involved (it’s not end to end), you can’t simply stop in the middle of the process to check how good the model is
- Requires the use of phonemes; very few datasets come with verified phonetic transcripts, and linguistic training is needed to correct phonemes manually
3. Coqui STT
Mozilla used to work on DeepSpeech and DeepSpeech2. Both projects were stopped due to budget limitations, but the team moved on to form Coqui, which does both ASR and text-to-speech.
We used DeepSpeech before it was discontinued; it has since been outperformed by the newer kids on the block.
We toyed around with Coqui for ASR but never went very deep, as we had better results with the other projects. Nowadays we still use Coqui for text-to-speech research but no longer for ASR, and therefore cannot judge its recent quality or performance.
- End to end, works with graphemes
- Responsive and experienced team
- Supports both streaming and offline decoding
4. SpeechBrain
The SpeechBrain project is about more than just speech to text: it is a more general toolkit for everything speech- and AI-related.
It’s the most recent kid on the block, actively updated with contributions from many universities all over the world, and comes with plenty of ready-to-use models hosted on Hugging Face.
We have limited experience with the speech-to-text part of SpeechBrain, but it looks as though this project is going to be the research standard in the future.
- Cutting edge research paper / algorithms are implemented
- Building blocks to support more than just ASR
- Very modular
- Works with pytorch
- It’s not easy to set up or use for the novice user
5. ESPnet
ESPnet is, next to SpeechBrain, the other very popular end-to-end speech research toolkit.
ESPnet1 relied heavily on Kaldi and Kaldi-style scripting. ESPnet2 no longer depends as much on Kaldi, but still uses the familiar staged bash scripting, with Python instead of Perl.
- Easy to use for those migrating from Kaldi
- Easy to install with pip for inference only
- Has some Google Colab examples
- Has nice Tensorboard integration out of the box
- Generates training loss images out of the box
- We found it hard to set up for training purposes (especially if you still want the Kaldi support); a bit of a dependency nightmare
- Can’t be used out of the box for production; requires custom development
6. Vosk
Vosk is a speech-to-text inference framework based on Kaldi, developed by Alpha Cephei. They have done the hard work of training the different languages they support.
Because it is based on Kaldi it is very fast, with reasonable quality, though not the best word error rate possible; the newer frameworks deliver significantly better results on noisy or accented audio.
It is, however, the only framework that is easy to set up with reasonable results.
You can find the Vosk project homepage here: