Recent advances in generative deep learning models have revolutionized fields such as Natural Language Processing (NLP) and Computer Vision (CV). Where specialized, supervised models once dominated these domains, there is now a clear shift towards generalized models capable of performing diverse tasks with minimal explicit guidance.
Large language models (LLMs) in NLP have shown versatility by successfully tackling tasks like question answering, sentiment analysis, and text summarization despite not being specifically designed for them. Similarly, in CV, models pre-trained on extensive image-caption pairs have achieved top performance on image-to-text benchmarks and demonstrated remarkable results in text-to-image tasks. This progress has been facilitated largely by Transformer-based architectures, which leverage significantly larger datasets than previous models.
A similar trend of advancement has emerged in Speech Processing and Text-to-Speech (TTS). Models now leverage thousands of hours of data to produce speech that is increasingly close to human-like quality. Until 2022, neural TTS models were primarily trained on a few hundred hours of audio data, limiting their ability to generalize beyond the training data and to expressively render complex and ambiguous texts.
To address this limitation, researchers at Amazon AGI have introduced BASE TTS, a large TTS (LTTS) system trained on approximately 100K hours of public domain speech data. BASE TTS is designed to model the joint distribution of text tokens and discrete speech representations, known as speech codes. These speech codes are crucial as they allow the direct application of methods developed for LLMs. By employing a decoder-only autoregressive Transformer, BASE TTS can capture complex probability distributions of expressive speech, thus improving prosody rendering compared to early neural TTS systems.
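To make the core idea concrete, here is a minimal sketch, not Amazon's actual code, of what "modeling the joint distribution of text tokens and speech codes with a decoder-only autoregressive Transformer" looks like in practice: text tokens and speech codes share one vocabulary and one sequence, and a causally masked Transformer predicts the next token. All sizes (vocabulary, dimensions, layer counts) are illustrative assumptions.

```python
# Hedged sketch of an LLM-style TTS backbone: a decoder-only Transformer
# predicts discrete speech codes conditioned on preceding text tokens.
# Shapes and hyperparameters are made up for illustration.

import torch
import torch.nn as nn

class SpeechLM(nn.Module):
    def __init__(self, n_text_tokens=10_000, n_speech_codes=4_096,
                 d_model=512, n_heads=8, n_layers=6, max_len=2_048):
        super().__init__()
        vocab = n_text_tokens + n_speech_codes  # shared token vocabulary
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # tokens: (batch, seq) -- text tokens followed by speech codes
        b, t = tokens.shape
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        # causal mask enforces left-to-right (autoregressive) prediction
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))
```

Training such a model with a next-token cross-entropy loss over the concatenated sequence is what lets it learn p(speech codes | text) autoregressively, the same recipe that powers text-only LLMs.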
The researchers also propose speaker-disentangled speech codes built on a WavLM Self-Supervised Learning (SSL) speech model. These speech codes, which aim to capture only phonemic and prosodic information, outperform baseline quantization methods. They can be decoded into high-quality waveforms using a simple, fast, and streamable decoder, even at a high level of compression.
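The step from continuous SSL features to discrete speech codes can be illustrated with plain nearest-neighbor vector quantization. Note this is only a hedged simplification: the paper's pipeline additionally disentangles speaker information, and the codebook here is a random stand-in.

```python
# Illustrative quantization of SSL speech features into discrete codes.
# The real system builds on WavLM features with speaker disentanglement;
# this sketch shows only plain codebook lookup, with hypothetical sizes.

import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """features: (frames, dim); codebook: (n_codes, dim).
    Returns the index of the nearest codebook entry per frame."""
    dists = torch.cdist(features, codebook)  # (frames, n_codes)
    return dists.argmin(dim=-1)              # discrete speech codes

frames = torch.randn(100, 768)      # 768-dim frame features, as in WavLM-base
codebook = torch.randn(4096, 768)   # hypothetical codebook size
codes = quantize(frames, codebook)  # (100,) integer speech codes
```

The resulting integer codes are what the autoregressive Transformer above predicts, and what the streamable decoder turns back into a waveform.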
Their contributions include BASE TTS, the largest TTS model to date; a demonstration that scaling to larger datasets and model sizes enhances its capability to render appropriate prosody for complex texts; and novel discrete speech representations that outperform existing methods. These advancements represent significant progress in the field of TTS and lay the groundwork for future research and development.
Check out the Paper. All credit for this research goes to the researchers of this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Int. MSc Physics from the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive technological advancement. He is passionate about understanding nature at a fundamental level with the help of tools like mathematical models, ML models, and AI.