A key aspect of generative AI is audio generation. In recent years, the popularity of generative AI has led to increasingly diverse and emerging needs in audio production. Beyond established tasks such as text-to-speech (TTS), voice conversion (VC), and singing voice synthesis (SVS), newer technologies such as text-to-sound and text-to-music are expected to produce audio directly from human requests. Most earlier efforts on audio generation use task-specific designs that rely heavily on domain expertise and are only usable in fixed configurations. This study aims at universal audio generation: handling numerous audio generation tasks with a single unified model rather than addressing each task individually.
A universal audio generation model is expected to accumulate sufficient prior knowledge of audio and related modalities, offering straightforward and efficient solutions to the growing need for diverse audio. The exceptional performance of Large Language Model (LLM) technology on text generation tasks has inspired several LLM-based audio generation models. Among these studies, LLMs applied to single tasks such as TTS and music generation have been extensively examined and perform competitively. However, the potential of LLMs to handle multiple tasks remains underexplored in audio generation research, since most LLM-based works still focus on a single task.
The authors contend that the LLM paradigm holds promise for achieving universality and versatility in audio generation but has yet to be thoroughly investigated. In this study, researchers from The Chinese University of Hong Kong, Carnegie Mellon University, Microsoft Research Asia, and Zhejiang University introduce UniAudio, which uses LLM techniques to produce a variety of audio types (speech, sounds, music, and singing) conditioned on several input modalities, including phoneme sequences, textual descriptions, and audio itself. The key features of the proposed UniAudio are as follows. All audio formats and input modalities are first tokenized as discrete sequences: a universal neural codec model is developed to tokenize audio regardless of format, and several tokenizers are employed to tokenize the various input modalities.
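To give a sense of how a neural codec turns audio into discrete tokens, here is a minimal sketch of residual vector quantization (RVQ), the general technique such codecs rely on. All shapes, names, and random codebooks below are illustrative assumptions, not UniAudio's actual implementation:

```python
# Minimal RVQ sketch: each audio frame embedding is encoded as several
# discrete tokens, one per quantization level. Codebooks here are random
# stand-ins; a real codec learns them during training.
import torch

def rvq_encode(frames, codebooks):
    """frames: (T, D) frame embeddings; codebooks: list of (K, D) tensors.
    Returns (T, n_q) token ids: n_q tokens per frame."""
    residual = frames
    ids = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # distance to every codeword, (T, K)
        idx = dists.argmin(dim=-1)          # nearest codeword per frame, (T,)
        residual = residual - cb[idx]       # quantize the leftover residual next
        ids.append(idx)
    return torch.stack(ids, dim=-1)         # (T, n_q)

T, D, K, n_q = 8, 16, 64, 3
codebooks = [torch.randn(K, D) for _ in range(n_q)]
tokens = rvq_encode(torch.randn(T, D), codebooks)
print(tokens.shape)  # torch.Size([8, 3]) -> 3 tokens per frame
```

Note the consequence the paper has to address: every frame expands into n_q tokens, so sequence length grows multiplicatively.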
UniAudio then concatenates each source-target pair into a single sequence and performs next-token prediction with an LLM. Because the tokenization uses residual vector quantization from neural codecs, it yields overly long token sequences (each frame corresponds to several tokens) that an LLM cannot process efficiently. To reduce computational complexity, a multi-scale Transformer architecture models inter- and intra-frame correlations separately: a global Transformer module captures the correlation between frames (e.g., at the semantic level), while a local Transformer module models the correlation within frames (e.g., at the acoustic level). UniAudio is built in two stages to demonstrate its scalability to new tasks.
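The multi-scale idea can be sketched in a few lines: run one Transformer over per-frame summaries, then a second, cheaper Transformer over the tokens inside each frame. This is a hedged toy sketch, not the paper's architecture; module names and dimensions are assumptions, and causal masking is omitted for brevity:

```python
# Toy multi-scale LM: global Transformer over frames (inter-frame),
# local Transformer over the n_q tokens within each frame (intra-frame).
import torch
import torch.nn as nn

class MultiScaleLM(nn.Module):
    def __init__(self, vocab, d_model=64, n_q=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead=4, batch_first=True)
        self.global_tf = nn.TransformerEncoder(layer(), num_layers=2)
        self.local_tf = nn.TransformerEncoder(layer(), num_layers=1)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                  # tokens: (B, T, n_q)
        B, T, n_q = tokens.shape
        x = self.embed(tokens).mean(dim=2)      # one summary vector per frame
        h = self.global_tf(x)                   # inter-frame correlation, (B, T, d)
        # broadcast each frame's global context to its n_q local positions
        y = self.embed(tokens) + h.unsqueeze(2)
        y = self.local_tf(y.reshape(B * T, n_q, -1))  # intra-frame correlation
        return self.head(y).reshape(B, T, n_q, -1)    # per-token logits

logits = MultiScaleLM(vocab=256)(torch.randint(0, 256, (2, 8, 3)))
print(logits.shape)  # torch.Size([2, 8, 3, 256])
```

The design pay-off is that the expensive global attention runs over T frames rather than T * n_q tokens, while the local module handles the short intra-frame sequences.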
First, UniAudio is trained on multiple audio generation tasks simultaneously, giving the model sufficient prior knowledge of both the intrinsic properties of audio and the relationships between audio and other input modalities. Second, with lightweight fine-tuning, the trained model can accommodate unseen audio generation tasks. Because it can continually absorb emerging demands in audio generation, UniAudio has the potential to become a foundation model for universal audio generation. Experimentally, UniAudio supports 11 audio generation tasks: the training stage covers seven tasks, and the fine-tuning stage adds four more. The UniAudio build is scaled up to 165k hours of audio and 1B parameters. A compact sketch of this two-stage recipe follows below.
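The sketch below illustrates the two-stage recipe under stated assumptions: every example is flattened into one [task, source, target] token sequence, a tiny stand-in model replaces the LLM, and the task ids and data are invented for illustration:

```python
# Two-stage sketch: joint multi-task training, then brief fine-tuning on a
# new task id, reusing the same flat sequence format and the same weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 256
model = nn.Sequential(nn.Embedding(VOCAB, 32),
                      nn.Linear(32, VOCAB))    # stand-in for the LLM
opt = torch.optim.Adam(model.parameters())

def make_sequence(task_id, source, target):
    # One example becomes a single flat token sequence: [task, source, target].
    return torch.cat([torch.tensor([task_id]), source, target])

def train_step(seq):
    logits = model(seq[:-1])                   # predict each next token
    loss = F.cross_entropy(logits, seq[1:])
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 1: joint training across many tasks (two toy task ids here).
for tid in (0, 1):
    train_step(make_sequence(tid, torch.randint(2, VOCAB, (5,)),
                                  torch.randint(2, VOCAB, (5,))))
# Stage 2: brief fine-tuning on a previously unseen task id.
train_step(make_sequence(7, torch.randint(2, VOCAB, (5,)),
                            torch.randint(2, VOCAB, (5,))))
```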
UniAudio consistently achieves competitive performance across all 11 tasks, as judged by objective and subjective metrics, and even attains state-of-the-art results on most of them. Further analysis indicates that training multiple tasks jointly benefits every task involved. In addition, UniAudio outperforms task-specific models by a non-trivial margin and can quickly adapt to new audio generation tasks. In conclusion, this work shows that building universal audio generation models is necessary, promising, and beneficial.
The following is a summary of this work’s key contributions:
(1) Toward universal audio generation, UniAudio is presented as a single solution for 11 audio generation tasks, more than any previous effort in the field.
(2) On the technical side, UniAudio offers novel approaches to (i) sequential representations of audio and other input modalities, (ii) a consistent formulation for LLM-based audio generation tasks, and (iii) an efficient model architecture designed specifically for audio generation.
(3) Extensive experimental results verify UniAudio's overall performance and demonstrate the benefits of building a versatile audio generation paradigm.
(4) UniAudio's demo and source code are released publicly, in the hope that it can serve as a foundation model for emerging audio generation in future research.
Check out the Paper and Github. All credit for this research goes to the researchers on this project.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.