How to use our public dimensional emotion model

An introduction to our model for dimensional speech emotion recognition based on wav2vec 2.0. The model is available from doi:10.5281/zenodo.6221127 and is released under CC BY-NC-SA 4.0. It was created by fine-tuning the pre-trained wav2vec2-large-robust model on MSP-Podcast (v1.7), after pruning the pre-trained model from 24 to 12 transformer layers. In this tutorial we use the ONNX export of the model; the original Torch model is hosted at Hugging Face. Further details are given in the associated paper.
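
The Torch trunk can also be pulled from Hugging Face with transformers. Below is a minimal sketch, assuming the repository name audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim; note that the dimensional regression head is not part of the standard Wav2Vec2Model class, so this only yields transformer states, and the custom head class published on the model card is needed to get arousal/dominance/valence scores.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# assumed Hugging Face repository of the original Torch model
name = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
trunk = Wav2Vec2Model.from_pretrained(name)  # regression head weights are skipped

# one second of silence at the expected 16 kHz sampling rate
inputs = feature_extractor([[0.0] * 16000], sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    states = trunk(inputs.input_values).last_hidden_state  # (1, frames, 1024)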

License

The model can be used for non-commercial purposes, see CC BY-NC-SA 4.0. For commercial usage, a license for devAIce must be obtained. The source code in this GitHub repository is released under the license provided in the repository.

Quick start

Create and activate a Python virtual environment, then install audonnx:

$ python -m venv venv && source venv/bin/activate
$ pip install audonnx

Load the model and test it on a random signal:

import audeer
import audonnx
import numpy as np


url = 'https://zenodo.org/record/6221127/files/w2v2-L-robust-12.6bc4a7fd-1.1.0.zip'
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')

# download the ONNX archive from Zenodo and extract it into model_root
archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)
model = audonnx.load(model_root)

# the model expects a mono float32 signal sampled at 16 kHz;
# here we feed it one second of random noise
sampling_rate = 16000
signal = np.random.normal(size=sampling_rate).astype(np.float32)
model(signal, sampling_rate)
{'hidden_states': array([[-0.00711814,  0.00615957, -0.00820673, ...,  0.00666412,
          0.00952989,  0.00269193]], dtype=float32),
 'logits': array([[0.6717072 , 0.6421313 , 0.49881312]], dtype=float32)}

The logits output holds the predictions in the order arousal, dominance, valence. The hidden states can be used as embeddings for related speech emotion recognition tasks, as shown in the sketch below.
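
Building on the quick-start snippet, the two named outputs can be unpacked as follows (a minimal sketch; the 1024-dimensional embedding size is an assumption based on the wav2vec2-large architecture):

# run the model again and unpack its two named outputs
outputs = model(signal, sampling_rate)
arousal, dominance, valence = outputs['logits'][0]
print(f'arousal={arousal:.2f}, dominance={dominance:.2f}, valence={valence:.2f}')

# pooled transformer states, usable as an utterance-level embedding
embedding = outputs['hidden_states'][0]  # assumed shape: (1024,)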

Tutorial

For a detailed introduction, please check out the notebook.

$ pip install -r requirements.txt
$ jupyter notebook notebook.ipynb
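
The notebook applies the model to real speech recordings rather than random noise. As a minimal sketch of that workflow, reusing the model object from the quick start (the audiofile and audresample packages and the local file speech.wav are assumptions, neither ships with this repository):

import audiofile
import audresample

# read the recording as a float32 signal plus its sampling rate
signal, original_rate = audiofile.read('speech.wav')

# the model expects 16 kHz input, so resample if necessary
target_rate = 16000
signal = audresample.resample(signal, original_rate, target_rate)

# flatten to a 1-d array, as expected for a mono signal
model(signal.flatten(), target_rate)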

Citation

If you use our model in your own work, please cite the following paper:

@article{wagner2023dawn,
    title={Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap},
    author={Wagner, Johannes and Triantafyllopoulos, Andreas and Wierstorf, Hagen and Schmitt, Maximilian and Burkhardt, Felix and Eyben, Florian and Schuller, Bj{\"o}rn W},
    journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
    pages={1--13},
    year={2023},
}