
v1.0.0

@mravanelli released this 26 Feb 2024, 20:54 · 137 commits to develop since this release

Please help our community project by starring it on GitHub!

🚀 What's New in SpeechBrain 1.0?

📅 In February 2024, we released SpeechBrain 1.0, the result of a year-long collaborative effort by a large international network of developers, led by our exceptional core development team.

📊 Some Numbers:

  • SpeechBrain has evolved into a significant project and stands among the most widely used open-source toolkits for speech processing.
  • Over 140 developers have contributed to our repository, which has earned more than 7.3k stars on GitHub.
  • Monthly downloads from PyPI have reached an impressive 200k.
  • The toolkit has expanded to over 200 recipes for Conversational AI and features more than 100 pretrained models on HuggingFace.

🌟 Key Updates:

  • SpeechBrain 1.0 introduces significant advancements, expanding support for diverse datasets and tasks, including NLP and EEG processing.

  • The toolkit now excels in Conversational AI and various sequence processing applications.

  • Improvements encompass key speech recognition techniques, streamable Conformer transducers, integration with K2 for Finite State Transducers, CTC decoding with n-gram rescoring, a new CTC/joint attention beam search interface, enhanced compatibility with HuggingFace models (including GPT2 and Llama2), and refined data augmentation, training, and inference pipelines.

  • We have created a new repository dedicated to benchmarks, accessible here. At present, it features benchmarks for several domains, including speech self-supervised models (MP3S), continual learning (CL-MASR), and EEG processing (SpeechBrain-MOABB).

For detailed technical information, please refer to the section below.

🔄 Breaking Changes

People familiar with SpeechBrain know that we do our best to avoid backward-incompatible changes, but this new major version presented an opportunity for significant enhancements and refactorings.

  1. 🤗 HuggingFace Interface Refactor:

    • Previously, our interfaces were limited to specific models like Whisper, HuBERT, WavLM, and wav2vec 2.0.
    • We've refactored the interface to be more general, now supporting any transformer model from HuggingFace, including LLMs.
    • Simply inherit from our new interface and enjoy the flexibility.
    • The updated interfaces can be accessed here.
  2. 🔍 BeamSearch Refactor:

    • The previous beam search interface, while functional, was challenging to understand and modify because the search and rescoring logic were intertwined.
    • We've introduced a new interface where scoring and search are separated, managed by distinct functions, resulting in simpler and more readable code.
    • This update allows users to easily incorporate various scorers, including n-gram LM and custom heuristics, in the search part.
    • Additionally, support for pure CTC training and decoding, batch and GPU decoding, partial or full candidate scoring, and N-best hypothesis output with neural LM rescorers has been added.
    • An interface to K2 for search based on Finite State Transducers (FST) is now available.
    • The updated decoders are available here.
  3. 🎨 Data Augmentation Refactor:

    • The data augmentation capabilities have been enhanced, offering users access to various functions in speechbrain/augment.
    • New techniques, such as CodecAugment, RandomShift (Time), RandomShift (Frequency), DoClip, RandAmp, ChannelDrop, ChannelSwap, CutCat, and DropBitResolution, have been introduced.
    • Augmentation can now be customized and combined using the Augmenter interface in speechbrain/augment/augmenter.py, providing more control during training.
    • Take a look here for a tutorial on speech augmentation.
    • The updated augmenters are available here; a usage sketch follows this list.
  4. 🧠 Brain Class Refactor:

    • The fit_batch method in the Brain Class has been refactored to minimize the need for overrides in training scripts.
    • Native support for different precisions (fp32, fp16, bf16), mixed precision, compilation, multiple optimizers, and improved multi-GPU training with torchrun is now available.
    • Take a look at the refactored brain class here.
  5. 🔍 Inference Interfaces Refactor:

    • Inference interfaces, once stored in a single file (speechbrain/pretrained/interfaces.py), are now organized into smaller libraries in speechbrain/inference, enhancing clarity and intuitiveness.
    • You can access the new inference interfaces here.
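
To illustrate the reorganized inference interfaces, here is a minimal sketch that loads a pretrained ASR model and transcribes a file. It assumes the EncoderDecoderASR class in speechbrain.inference.ASR and a publicly released SpeechBrain checkpoint on HuggingFace; adjust both to your setup.

```python
# Minimal sketch of the reorganized inference interfaces (speechbrain.inference).
# The class and checkpoint below are assumptions; adapt them to your setup.
from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",  # pretrained checkpoint on HuggingFace
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr_model.transcribe_file("example.wav"))
```

The data augmentation refactor can be sketched in a similar spirit. The snippet below combines two augmentations through the Augmenter interface; the SpeedPerturb and DropChunk classes and the Augmenter arguments shown are assumptions about the speechbrain.augment layout and may differ in detail.

```python
# Hedged sketch: combining augmentations via the Augmenter interface
# (speechbrain/augment/augmenter.py). Class names and arguments are assumptions.
import torch
from speechbrain.augment.augmenter import Augmenter
from speechbrain.augment.time_domain import SpeedPerturb, DropChunk

augment = Augmenter(
    min_augmentations=1,
    max_augmentations=2,  # apply one or two of the listed augmentations per batch
    augmentations=[
        SpeedPerturb(orig_freq=16000, speeds=[90, 100, 110]),
        DropChunk(drop_length_low=1000, drop_length_high=3000),
    ],
)

signals = torch.rand(4, 16000)  # [batch, time] dummy waveforms
lengths = torch.ones(4)         # relative lengths in [0, 1]
aug_signals, aug_lengths = augment(signals, lengths)
```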

🔊 Automatic Speech Recognition

  • Developed a new recipe for training a Streamable Conformer Transducer on the LibriSpeech dataset (accessible here). The streamable model achieves a Word Error Rate (WER) of 2.72% on the test-clean subset.
  • Implemented a dedicated inference interface to support streamable ASR (accessible here); a usage sketch follows this list.
  • New models, including HyperConformer and Branchformer, have been introduced. Examples of recipes utilizing them can be found here.
  • Added support for additional datasets such as RescueSpeech, CommonVoice 14.0, AMI, and Tedlium 2.
  • The ASR search pipeline has undergone a complete refactoring and enhancement (see the BeamSearch refactor above).
  • A new recipe for Bayesian ASR has been added here.
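
To illustrate the streamable ASR support mentioned above, here is a hedged sketch. The StreamingASR class, the DynChunkTrainConfig helper, and the checkpoint identifier are assumptions about the new inference layout and may need adjusting.

```python
# Hedged sketch of streamable ASR inference. The class names, the chunking
# configuration, and the checkpoint identifier are assumptions.
from speechbrain.inference.ASR import StreamingASR
from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig

asr = StreamingASR.from_hparams(
    source="speechbrain/asr-streaming-conformer-librispeech",  # assumed checkpoint id
    savedir="pretrained_models/asr-streaming-conformer-librispeech",
)
# Whole-file transcription routed through the streaming pipeline,
# with an assumed chunk size of 24 frames and 4 chunks of left context.
print(asr.transcribe_file("example.wav", DynChunkTrainConfig(24, 4)))
```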

🔄 Interface with Kaldi2 (K2-FSA)

  • Integration of an interface that seamlessly connects SpeechBrain with K2-FSA, allowing for constrained search and more.
  • Support for K2 CTC training and lexicon decoding, along with integration of K2 HLG and n-gram rescoring.
  • Competitive results achieved with Wav2vec2 on LibriSpeech test sets.
  • Explore an example recipe utilizing K2 here.

🎙 Speech Synthesis (TTS)

🌐 Speech-to-Speech Translation:

  • Introduction of new recipes for CVSS datasets and IWSLT 2022 Low-resource Task, based on mBART/NLLB and SAMU wav2vec.

🌟 Speech Generation

  • Implementation of diffusion and latent diffusion techniques with an example recipe showcased on AudioMNIST.

🎧 Interpretability of Audio Signals

  • Implementation of Listen to Interpret (L2I) and PIQ techniques, with example recipes demonstrated on ESC50.

😊 Speech Emotion Diarization

  • Support for Speech Emotion Diarization, featuring an example recipe on the Zaion Emotion Dataset. See the training recipe here.

πŸŽ™οΈ Speaker Recognition

🔊 Speech Enhancement

  • Release of a new Speech Enhancement baseline based on the DNS dataset.

🎡 Discrete Audio Representations

  • Support for pretrained models with discrete audio representations, including EnCodec and DAC (see the sketch after this list).
  • Support for discretization of continuous representations provided by popular self-supervised models such as HuBERT and wav2vec 2.0.
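
As a stand-in illustration of discrete audio representations, the snippet below extracts EnCodec token codes with the upstream HuggingFace transformers API rather than SpeechBrain's own wrapper; SpeechBrain exposes analogous interfaces for these codecs.

```python
# Stand-in sketch: extracting discrete EnCodec codes with the upstream
# `transformers` API; SpeechBrain wraps such codecs behind its own interfaces.
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

audio = torch.rand(24000).numpy()  # one second of dummy 24 kHz audio
inputs = processor(raw_audio=audio, sampling_rate=24000, return_tensors="pt")
encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
print(encoded.audio_codes.shape)   # discrete codes: [chunks, batch, codebooks, frames]
```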

🤖 Interfaces with Large Language Models

  • Creation of interfaces with popular open-source Large Language Models, such as GPT2 and Llama2.
  • These models can be easily fine-tuned in SpeechBrain for tasks like Response Generation, exemplified with a recipe for the MultiWOZ dataset.
  • Large Language Models can also be employed to rescore n-best ASR hypotheses, as sketched below.
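
The rescoring idea can be sketched with the upstream transformers GPT-2 model: score each n-best hypothesis with the language model and keep the most likely one. This is an illustrative sketch, not SpeechBrain's own rescorer interface.

```python
# Illustrative sketch of n-best rescoring with GPT-2 (upstream `transformers`),
# not SpeechBrain's own rescorer interface.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_log_likelihood(text: str) -> float:
    """Approximate total log-likelihood of `text` under the LM (higher is better)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item() * ids.size(1)

nbest = ["the cat sat on the mat", "the cat sat on the matt"]
print(max(nbest, key=lm_log_likelihood))
```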

🔄 Continuous Integration

  • All recipes undergo automatic testing with one or multiple GPUs, ensuring robust performance.
  • HuggingFace interfaces are automatically verified, contributing to a seamless integration process.
  • Continuous improvement of integration and unit tests to cover most functionalities within SpeechBrain.

πŸ” Profiling

  • We have simplified the Profiler to enable easier identification of computing bottlenecks and quicker evaluation of model efficiency.
  • Now, you can profile your model during training effortlessly with:
python train.py hparams/config.yaml --profile_training --profile_warmup 10 --profile_steps 5
  • Check out our tutorial for more detailed information.

📈 Benchmarks

  • Release of a new benchmark repository, aimed at aiding the community in standardization across various areas.

    1. CL-MASR (Continual Learning Benchmark for Multilingual ASR):
    • A benchmark designed to assess continual learning techniques on multilingual speech recognition tasks.
    • Provides scripts to train multilingual ASR systems, specifically Whisper- and WavLM-based, on a subset of 20 languages selected from Common Voice 13 in a continual learning fashion.
    • Implementation of various methods, including rehearsal-based, architecture-based, and regularization-based approaches.
    2. Multi-probe Speech Self-Supervision Benchmark (MP3S):
    • A benchmark for accurate assessment of speech self-supervised models.
    • Noteworthy for allowing users to select multiple probing heads for downstream training.
    3. SpeechBrain-MOABB:
    • A benchmark offering recipes for processing electroencephalographic (EEG) signals, seamlessly integrated with the popular Mother of all BCI Benchmarks (MOABB).
    • Facilitates the integration and evaluation of new models on all supported tasks, presenting an interface for easy model integration and testing, along with a fair and robust method for comparing different architectures.

🔄 Transitioning to SpeechBrain 1.0

  • Please refer to this tutorial for in-depth technical information regarding the transition to SpeechBrain 1.0.

New Contributors

Full Changelog: v0.5.16...v1.0.0