Kaldi-based audio-visual speech recognition

This repository provides a baseline system for audio-visual speech recognition (AVSR) using the Kaldi speech recognition toolkit [1].

The scripts implement the early-integration approach presented in:

H. Meutzner, N. Ma, R. Nickel, C. Schymura, D. Kolossa, "Improving Audio-Visual Speech Recognition using Deep Neural Networks with Dynamic Stream Reliability Estimates", ICASSP, New Orleans, USA, March 2017.

Future releases will also include the late-integration approach using dynamic stream weights.
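Conceptually, early integration fuses the two modalities at the feature level, so a single acoustic model is trained on joint audio-visual observations. The sketch below (not part of the recipes; dimensions and names are hypothetical) illustrates the idea with NumPy:

```python
import numpy as np

def early_integration(audio_feats, video_feats):
    """Feature-level (early) fusion: concatenate per-frame audio and video
    feature vectors so a single acoustic model sees both streams.

    audio_feats: (num_frames, audio_dim), e.g. MFCCs from Kaldi.
    video_feats: (num_frames, video_dim), e.g. 63-dim DCT coefficients,
                 already interpolated to the audio frame rate.
    """
    assert audio_feats.shape[0] == video_feats.shape[0], "frame counts must match"
    return np.concatenate([audio_feats, video_feats], axis=1)

# Example with hypothetical dimensions: 13-dim MFCCs + 63-dim video features.
audio = np.random.randn(298, 13)
video = np.random.randn(298, 63)
av = early_integration(audio, video)
print(av.shape)  # (298, 76)
```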

Data description

The experiments are based on the audio data of the CHiME-2 challenge and the video data of the GRID audio-visual speech corpus [3,4].

The audio data has to be manually obtained from the official CHiME-2 track 1 website [2].

The video features have been precomputed from the video files of the GRID corpus and are downloaded automatically from http://doi.org/10.5281/zenodo.260211 when the scripts are run.

The video features contain the 63-dimensional DCT coefficients of the landmark points extracted using the Viola-Jones algorithm. The features have been end-pointed and interpolated using a digital differential analyser (DDA) so that their length matches that of the audio utterances at a frame length of 25 ms and a frame shift of 10 ms, which is the default configuration of Kaldi's feature extraction scripts.
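A digital differential analyser amounts to incremental linear interpolation between successive video frames. The sketch below resamples video features (assuming 25 fps video, a hypothetical parameter) to a 10 ms audio frame shift using plain linear interpolation in NumPy; it illustrates the idea under these assumptions and is not the exact preprocessing used to produce the released features:

```python
import numpy as np

def interpolate_video_features(video_feats, video_fps=25.0,
                               frame_shift_s=0.010, num_audio_frames=None):
    """Upsample per-frame video features (e.g. 63-dim DCT coefficients)
    to the audio frame rate by linear interpolation.

    video_feats: (num_video_frames, feat_dim) array.
    Returns an array of shape (num_audio_frames, feat_dim).
    """
    num_video_frames, feat_dim = video_feats.shape
    # Time stamp of each video frame (seconds).
    video_times = np.arange(num_video_frames) / video_fps
    if num_audio_frames is None:
        # Roughly match the utterance duration at the audio frame shift.
        num_audio_frames = int(round(video_times[-1] / frame_shift_s)) + 1
    audio_times = np.arange(num_audio_frames) * frame_shift_s
    # Interpolate each feature dimension independently onto the audio time axis.
    out = np.empty((num_audio_frames, feat_dim))
    for d in range(feat_dim):
        out[:, d] = np.interp(audio_times, video_times, video_feats[:, d])
    return out

# Example: 75 video frames (3 s at 25 fps), 63-dim features,
# matched to 298 audio frames (25 ms window, 10 ms shift).
video_feats = np.random.randn(75, 63)
audio_rate_feats = interpolate_video_features(video_feats, num_audio_frames=298)
print(audio_rate_feats.shape)  # (298, 63)
```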

[1] http://kaldi-asr.org

[2] http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/chime2_task1.html

[3] http://spandh.dcs.shef.ac.uk/gridcorpus

[4] Martin Cooke, Jon Barker, Stuart Cunningham, and Xu Shao, "An audio-visual corpus for speech perception and automatic speech recognition", The Journal of the Acoustical Society of America 120, 2421 (2006); http://doi.org/10.1121/1.2229005

License and Citation

The scripts are released under the Apache 2.0 license. The video features are released under the Creative Commons Attribution-NonCommercial 4.0 license.

When using these scripts for your research, please cite the following paper:

@inproceedings{meutzner2017,
  author = {Hendrik Meutzner and Ning Ma and Robert Nickel and Christopher Schymura and Dorothea Kolossa},
  title = {{Improving Audio-Visual Speech Recognition using Deep Neural Networks with Dynamic Stream Reliability Estimates}},
  booktitle = {{IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}},
  year = {2017}
}