A commonvoice-th recipe for training an ASR engine with Kaldi. It follows the commonvoice recipe with slight modifications.
The author uses Docker to run everything inside a container. A GPU is required to train tdnn_chain; without one, the script can only train up to tri3b.
Training the ASR engine requires a commonvoice corpus. This recipe uses the Thai portion of Commonvoice Corpus 7.0, which can be downloaded here. Once downloaded, unzip it; it will later be mounted into the Docker container as the dataset.
Before building the Docker image, the SRILM archive needs to be downloaded. You can download it from here. Once downloaded, remove the version from the file name (e.g. rename srilm-1.7.3.tar.gz to srilm.tar.gz) and place it inside the docker directory. Your docker directory should then contain 2 files: dockerfile and srilm.tar.gz.
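The renaming step can also be scripted. The following is a minimal sketch, assuming the versioned archive has already been copied into the directory you pass in; the helper name `prepare_srilm_archive` is hypothetical and not part of this repository.

```python
from pathlib import Path


def prepare_srilm_archive(docker_dir):
    """Rename a versioned SRILM archive in docker_dir to srilm.tar.gz.

    Hypothetical helper: assumes exactly one file matching
    srilm-*.tar.gz is present in docker_dir.
    """
    docker_dir = Path(docker_dir)
    target = docker_dir / "srilm.tar.gz"
    if target.exists():
        # Already renamed, nothing to do.
        return target
    candidates = sorted(docker_dir.glob("srilm-*.tar.gz"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one srilm-*.tar.gz in {docker_dir}, "
            f"found {len(candidates)}"
        )
    candidates[0].rename(target)
    return target
```

For example, `prepare_srilm_archive("docker")` run from the repository root would turn `docker/srilm-1.7.3.tar.gz` into `docker/srilm.tar.gz`.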
Once you have prepared the SRILM file, you are ready to build the Docker image for training this recipe. The image automatically installs the project's dependencies. To build it, run:
$ cd docker
$ docker build -t <docker-name> kaldi
Once the image has been built, attach interactively to its bash terminal with the following command:
$ docker run -it -v <path-to-repo>:/opt/kaldi/egs/commonvoice-th \
-v <path-to-repo>/labels:/mnt/labels \
-v <path-to-cv-corpus>:/mnt \
--gpus all --name <container-name> <built-docker-name> bash
After this step, you should be in the Docker container's bash terminal.
We also provide an example of how to run inference with a trained Kaldi model using Vosk. Before we begin, let's build the Vosk Docker image:
$ cd docker
$ docker build -t <docker-name> vosk-inference
$ cd .. # back to root directory
The first step is to download the model in Vosk format from this GitHub repository's releases and unzip it into the vosk-inference directory. Alternatively, just run the following commands:
$ cd vosk-inference
$ wget https://github.com/vistec-AI/commonvoice-th/releases/download/vosk-v1/model.zip
$ unzip model.zip
To prevent dependency problems, the Vosk inference Python script must be run inside a container started from the Docker image we just built. First, start a container:
$ docker run -it -v <path-to-repo>:/workspace \
--name <container-name> \
-p 8000:8000 \
<built-docker-name> bash
You will then be attached to a Linux terminal inside the container. To run inference on an audio file, run:
$ cd vosk-inference
$ python3.8 inference.py --wav-path <path-to-wav> # test it with test.wav
Note that the audio file must have a 16 kHz sampling rate and a single (mono) channel!
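To verify this requirement before running inference, the header of a WAV file can be inspected with Python's standard `wave` module. This is a minimal sketch; the function name `check_wav_format` is hypothetical and not part of the provided scripts, and it only handles uncompressed PCM WAV files.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) for a PCM WAV file.

    ok is True only when the file matches the expected sampling
    rate and channel count (16 kHz mono by default).
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    ok = rate == expected_rate and channels == expected_channels
    return ok, rate, channels
```

Calling `check_wav_format("test.wav")` returns a tuple whose first element tells you whether the file can be passed to the inference script as-is; if it is `False`, the file needs to be resampled or downmixed first.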
We also provide a fastapi server that lets users transcribe their own audio files via a RESTful API. To start the server, run this command inside the Docker shell:
$ cd vosk-inference
$ uvicorn server:app --host 0.0.0.0 --reload
The server will now be available at http://localhost:8000. To check that it started correctly, browse to http://localhost:8000/healthcheck. If the page loads, we are good to go!
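The same check can be done from a script instead of a browser. The sketch below uses only the standard library and treats any HTTP 200 answer as healthy; that criterion and the function name `server_is_up` are assumptions, since the body returned by the healthcheck endpoint is not described here.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:8000/healthcheck", timeout=5.0):
    """Return True if a GET request to url answers with HTTP 200.

    Connection failures, timeouts and HTTP error statuses
    all count as "not up" and return False.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

A deployment script could call `server_is_up()` in a short retry loop after launching uvicorn and only start sending audio once it returns `True`.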
The endpoint accepts form-data, where each file is attached to a form field named audios. See the Python example:
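import requests

# The URL needs an explicit scheme, otherwise requests rejects it.
url = "http://localhost:8000/transcribe"
# Replace these placeholders with your own file name and path.
file_name = "<file-name>"
file_path = "<file-path>"
with open(file_path, "rb") as audio:
    # Each file goes in a form field named "audios"; add more tuples to send more files.
    files = [("audios", (file_name, audio, "audio/wav"))]
    response = requests.post(url, files=files)
print(response.text)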
Read more at this repository. It provides an easy way to deploy a Kaldi tdnn-chain model to a WebRTC server.
To run the training pipeline, go to the recipe directory and run the run.sh script:
$ cd /opt/kaldi/egs/commonvoice-th/s5
$ ./run.sh --stage 0
Here are some experiment results evaluated on dev set:
| Model | dev WER | dev CER | dev-unique WER | dev-unique CER |
|---|---|---|---|---|
| mono | 79.13% | 57.31% | 77.79% | 48.97% |
| tri1 | 56.55% | 37.88% | 53.26% | 27.99% |
| tri2b | 50.64% | 32.85% | 47.38% | 21.89% |
| tri3b | 50.52% | 32.70% | 47.06% | 21.67% |
| tri4b | 46.81% | 29.47% | 43.18% | 18.05% |
| tdnn-chain | 29.15% | 14.96% | 30.84% | 8.75% |
| tdnn-chain-online | 29.02% | 14.64% | 30.41% | 8.28% |
Here are the final results on the test set, evaluated with the tdnn-chain model:
| Model | test WER | test CER | test-unique WER | test-unique CER |
|---|---|---|---|---|
| tdnn-chain-online | 9.71% | 3.12% | 23.04% | 7.57% |
| airesearch/wav2vec2-xlsr-53-th | - | - | 13.63% | 2.81% |
| Google Web Speech API | - | - | 13.71% | 7.36% |
| Microsoft Bing Search API | - | - | 12.58% | 5.01% |
| Amazon Transcribe | - | - | 21.86% | 7.08% |
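For readers unfamiliar with the metrics, WER and CER are conventionally defined as the edit distance between the reference and the hypothesis, divided by the reference length, counted over words or characters respectively. The sketch below illustrates that conventional definition only; it is not the scoring script used by this recipe. The names `edit_distance` and `error_rate` are hypothetical, and the code assumes transcripts are already split into words by spaces (Thai text needs a word tokenizer first, which is not shown).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent.

    unit="word" gives WER over space-separated tokens (assumed
    pre-tokenized); unit="char" gives CER with spaces removed.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            r, h = ref.split(), hyp.split()
        else:
            r, h = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
        total_edits += edit_distance(r, h)
        total_len += len(r)
    return 100.0 * total_edits / total_len
```

Because the denominator is the reference length, the rate is accumulated over the whole corpus rather than averaged per utterance, so long utterances weigh more than short ones.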
Chompakorn Chaksangchaichot