
ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

This is the official website for the paper
"Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation"
from Microsoft Applied Science Group and UC Berkeley
by Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, and Somayeh Sojoudi.

[Preprint Paper]      [Project Homepage]      [Code]      [Model Checkpoints]      [Generation Examples]

Main Experiment Results

Our method reduces the computation of the core step of diffusion-based text-to-audio generation by a factor of 400, while incurring minimal performance degradation in terms of Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence, and CLAP scores.
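The Fréchet-style metrics above compare the distribution of embeddings of generated audio against that of reference audio (FAD and FD differ mainly in the embedding network used). As a hedged illustration, here is a minimal NumPy/SciPy sketch of the Fréchet distance between two embedding sets; the function name and array shapes are our own, not from the paper's evaluation code:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussian fits of two embedding sets.

    Inputs are arrays of shape [n_samples, dim]; in the actual metrics,
    these would be audio embeddings (e.g., VGGish for FAD).
    """
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s1 @ s2)        # matrix square root of the product
    if np.iscomplexobj(covmean):    # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2 * covmean))
```

Lower is better: identical embedding distributions yield a distance near zero, and the distance grows as the generated distribution drifts from the reference.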

| Model | # queries (↓) | CLAP_T (↑) | CLAP_A (↑) | FAD (↓) | FD (↓) | KLD (↓) |
|---|---|---|---|---|---|---|
| Diffusion (Baseline) | 400 | 24.57 | 72.79 | 1.908 | 19.57 | 1.350 |
| Consistency + CLAP FT (Ours) | 1 | 24.69 | 72.54 | 2.406 | 20.97 | 1.358 |
| Consistency (Ours) | 1 | 22.50 | 72.30 | 2.575 | 22.08 | 1.354 |

This benchmark demonstrates how our single-step models compare with previous methods, most of which require hundreds of generation steps.
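The "# queries" column is where the 400x speedup comes from: a diffusion sampler queries the denoising network once per step, while a consistency model maps noise to the final latent in a single query. The following PyTorch sketch illustrates only this query-count difference; the model interface, latent shape, and the simplified update rule are illustrative stand-ins, not the paper's actual sampler:

```python
import torch

LATENT_SHAPE = (1, 8, 256, 16)  # illustrative latent shape, not the paper's

def diffusion_generate(model, text_emb, num_steps=400):
    """Iterative diffusion sampling: one network query per denoising step,
    so 400 steps means 400 queries. The update rule is a toy placeholder."""
    z = torch.randn(LATENT_SHAPE)
    for t in reversed(range(num_steps)):
        eps = model(z, t, text_emb)   # one network query per step
        z = z - eps / num_steps       # simplified denoising update
    return z

def consistency_generate(model, text_emb, num_steps=400):
    """Consistency model: the distilled network maps noise directly to the
    final latent, replacing the 400 queries above with a single one."""
    z = torch.randn(LATENT_SHAPE)
    return model(z, num_steps - 1, text_emb)  # one network query total
```

Distillation trains the consistency model so that its one-shot output matches what the teacher diffusion model would produce after its full sampling trajectory, which is why quality in the table degrades only slightly.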

Cite Our Work (BibTeX)

@article{bai2023accelerating,
  title={Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
  author={Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
  journal={arXiv preprint arXiv:2309.10740},
  year={2023}
}
