XLS-R

XLS-R is a set of large-scale models for self-supervised cross-lingual speech representation learning based on wav2vec 2.0. The models were pretrained on 128 languages and approximately 436K hours of unlabeled speech data. With finetuning, they achieve state-of-the-art performance in speech translation, speech recognition and language identification. We evaluate the models across multiple benchmarks: CoVoST-2 for speech translation, BABEL / MLS / CommonVoice / VoxPopuli for automatic speech recognition, VoxLingua107 for language identification, and VoxCeleb1 for speaker identification. More details about this work can be found in our paper, and download links can be found below.

Model	Link
XLS-R 300M	download
XLS-R 1B	download
XLS-R 2B	download

You can also download these models here and read more about them in the blog post from Hugging Face.

Speech Translation Finetuned Models

We multilingually finetune XLS-R models on CoVoST 2, which has 21 into-English and 15 out-of-English directions.

Model	Directions	Link
XLS-R 300M	21 langs → En	download
XLS-R 300M	En → 15 langs	download
XLS-R 1B	21 langs → En	download
XLS-R 1B	En → 15 langs	download
XLS-R 2B	21 langs → En	download
XLS-R 2B	En → 15 langs	download
XLS-R 2B	21 langs → En + En → 15 langs	download

ASR Finetuning

You can refer to the original wav2vec documentation here for detailed instructions on how to finetune a pretrained model with CTC. Below is an example command, followed by the hyperparameter values needed to reproduce the results in our paper.

$ fairseq-hydra-train \
    distributed_training.distributed_port=$PORT \
    task.data=/path/to/data \
    model.w2v_path=/path/to/model.pt \
    --config-dir /path/to/fairseq-py/examples/wav2vec/xlsr/config \
    --config-name finetune

For finetuning the 300M and 1B models, we use the same hyperparameter settings defined in finetune.yaml. We vary optimization.max_update as described in the table below, and optimization.lr is picked from the interval [2e-5, 3e-4] based on dev word error rate.

Benchmark	Total Number of Updates
Babel	26000
Common Voice	13000
VoxPopuli	50000
MLS 10h	20000
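
The update counts in the table can be passed as overrides to the example command. The following is a minimal sketch, not part of fairseq: the helper name build_finetune_command and the benchmark keys are hypothetical, the learning rate is a placeholder inside the stated interval, and writing optimization.lr as a one-element list is an assumption based on how fairseq configs usually declare it. The distributed_training.distributed_port override from the example command is left out for brevity.

```python
import shlex

# Update counts taken from the table above (300M and 1B models).
MAX_UPDATES = {
    "babel": 26000,
    "common_voice": 13000,
    "voxpopuli": 50000,
    "mls_10h": 20000,
}

# Interval from which optimization.lr is picked, as stated in the text.
LR_RANGE = (2e-5, 3e-4)


def build_finetune_command(benchmark, lr, data_dir, model_path, config_dir):
    """Return an argv list for fairseq-hydra-train (hypothetical helper)."""
    if not LR_RANGE[0] <= lr <= LR_RANGE[1]:
        raise ValueError(f"lr {lr} is outside {LR_RANGE}")
    return [
        "fairseq-hydra-train",
        f"task.data={data_dir}",
        f"model.w2v_path={model_path}",
        f"optimization.max_update={MAX_UPDATES[benchmark]}",
        # Assumption: lr is given as a one-element list.
        f"optimization.lr=[{lr}]",
        "--config-dir", config_dir,
        "--config-name", "finetune",
    ]


cmd = build_finetune_command(
    "common_voice",
    1e-4,  # placeholder value; the text selects lr by dev word error rate
    "/path/to/data",
    "/path/to/model.pt",
    "/path/to/fairseq-py/examples/wav2vec/xlsr/config",
)
# shlex.join quotes arguments such as the bracketed lr for safe shell use.
print(shlex.join(cmd))
```

Sweeping the learning rate then amounts to calling the helper once per candidate value and keeping the run with the lowest dev word error rate.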

For finetuning the 2B model, we make some additional changes to finetune.yaml. We set distributed_training.ddp_backend to fully_sharded, which is provided by the fairscale library, and set model.activation_checkpoint to true. We also increase dataset.max_tokens to 2560000 and use a total effective batch size of 2560000*24. We sweep for the best optimization.lr within the interval [3e-6, 3e-5] using dev error rate. For the Common Voice dataset, we pick model.mask_prob for each language from {0.30, 0.40} based on the best dev error rate.
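
To put the 2B batch size in perspective, the arithmetic can be worked through as below. This is a rough sketch under two assumptions that the text does not state: that dataset.max_tokens counts raw audio samples, and that the audio is sampled at 16 kHz as in wav2vec 2.0. The overrides list simply restates the settings named in the paragraph above.

```python
SAMPLE_RATE_HZ = 16_000   # assumption: 16 kHz audio
MAX_TOKENS = 2_560_000    # dataset.max_tokens from the text
BATCH_MULTIPLIER = 24     # factor from the text (2560000*24)

# Settings for the 2B model, written as command-line overrides.
overrides_2b = [
    "distributed_training.ddp_backend=fully_sharded",
    "model.activation_checkpoint=true",
    f"dataset.max_tokens={MAX_TOKENS}",
]

effective_batch_samples = MAX_TOKENS * BATCH_MULTIPLIER
seconds_per_unit = MAX_TOKENS / SAMPLE_RATE_HZ
effective_batch_hours = effective_batch_samples / SAMPLE_RATE_HZ / 3600

print(effective_batch_samples)          # 61440000
print(seconds_per_unit)                 # 160.0
print(round(effective_batch_hours, 2))  # 1.07
```

Under these assumptions, each update sees roughly one hour of audio.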

LID Inference

Model	Link
XLS-R 300M + ft VoxLingua107	download

How to run inference & calculate accuracy (step-by-step):

Download the VoxLingua107 checkpoint from the table above.
Use this Python script to extract logits or embeddings from the XLS-R model: https://github.com/fairinternal/fairseq-py/blob/xlsr2/examples/wav2vec/gen_audio_embedding.py

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 examples/wav2vec/gen_audio_embedding.py \
    /fsx/data/VoxLingua107/manifest --path "/path/to/checkpoint.pt" \
    --task audio_classification --batch-size 90 --gen-subset test \
    --infer-manifest /fsx/data/VoxLingua107/manifest/test.tsv \
    --infer-xtimes 10 --infer-max-sample-size 160000 --output-path /tmp/tmp_voxling_infer.npz

Calculate the overall accuracy, as well as the accuracy for the 0-5 second and 5-20 second buckets:

PYTHONPATH='.' python examples/wav2vec/eval_speaker_clf_task.py \
    --task cls --merge mean_logit --data /tmp/tmp_voxling_infer.npz

Output: 
| run classification evaluation
| acc = 94.34% -- err = 5.66% -- correct=1518 total=1609
| acc 0to5 = 90.91% -- err = 9.09% -- c_5=230.0 t_5=253
| acc 5to20 = 94.99% -- err = 5.01% -- c_20=1288.0 t_20=1356
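
The reported percentages can be reproduced from the counts in the output above. The sketch below also shows one plausible reading of --merge mean_logit: averaging the logit vectors of several crops of the same utterance before taking the argmax. This is an assumption for illustration, not the actual logic of eval_speaker_clf_task.py, and merge_mean_logit is a hypothetical helper.

```python
def accuracy(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total


def merge_mean_logit(logit_rows):
    """Average per-crop logit vectors and return the argmax class index.

    Hypothetical helper: assumes each row holds the logits of one crop
    of the same utterance.
    """
    n = len(logit_rows)
    mean = [sum(col) / n for col in zip(*logit_rows)]
    return max(range(len(mean)), key=mean.__getitem__)


# (correct, total) per duration bucket, copied from the output above.
buckets = {"0to5": (230, 253), "5to20": (1288, 1356)}

overall_correct = sum(c for c, _ in buckets.values())
overall_total = sum(t for _, t in buckets.values())

print(f"acc = {accuracy(overall_correct, overall_total):.2f}%")  # acc = 94.34%
for name, (c, t) in buckets.items():
    print(f"acc {name} = {accuracy(c, t):.2f}%")
# acc 0to5 = 90.91%
# acc 5to20 = 94.99%

# Three crops, two classes: the averaged logits favor class 1.
print(merge_mean_logit([[2.0, 0.0], [0.0, 1.0], [0.0, 1.5]]))  # 1
```

The two buckets sum to the overall counts (1518 correct out of 1609), so every evaluated utterance falls into one of them.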

Citation

Please cite as:

@article{babu2021xlsr,
      title={XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale}, 
      author={Arun Babu and Changhan Wang and Andros Tjandra and Kushal Lakhotia and Qiantong Xu and Naman Goyal and Kritika Singh and Patrick von Platen and Yatharth Saraf and Juan Pino and Alexei Baevski and Alexis Conneau and Michael Auli},
      year={2021},
      volume={abs/2111.09296},
      journal={arXiv},
}
