Speech carries more information than text, since it conveys not only semantic content but also paralinguistic cues such as tone. Speaking is also a more practical and natural way for people to interact with AI. Consequently, equipping a general-purpose assistant with speech-and-language instruction-following abilities is essential. However, most large language models accept only text input, which limits their potential. Although multi-modal vision-and-language models have enabled significant progress toward artificial general intelligence (AGI), it remains cumbersome for humans to specify tasks by typing text instructions.
Cascade-paradigm approaches use an automatic speech recognition (ASR) model to convert speech input into text, which the language model then processes. This modality conversion from speech to text still causes information loss and can propagate ASR errors into the system. More recently, speech-language multi-modal models built around a large language model that consumes and produces both speech and text have shown the ability to understand and generate multi-modal content. In these models, the speech signals are discretized into tokens and added to the LLM's vocabulary, which means the LLM must be retrained on extensive multi-modal data with substantial computational resources.
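To make the contrast concrete, here is a minimal sketch of the cascade paradigm, assuming the open-source openai-whisper package for ASR and a placeholder `text_llm_generate` callable standing in for any text-only LLM; the LLM never sees the raw audio, so tone and other paralinguistic cues are lost.

```python
import whisper  # pip install openai-whisper

# Minimal sketch of the cascade paradigm: speech -> ASR transcript -> text-only LLM.
# The LLM only sees the transcript, so paralinguistic cues are dropped and any
# ASR mistakes propagate into the response.
asr_model = whisper.load_model("base")

def cascade_respond(audio_path: str, text_llm_generate) -> str:
    transcript = asr_model.transcribe(audio_path)["text"]  # modality conversion step
    return text_llm_generate(transcript)                   # hypothetical text-only LLM call
```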
In this work, researchers from LinkSoul.AI, Peking University, and 01.ai propose LLaSM, a large speech-and-language model with cross-modal conversational abilities that can understand and follow spoken instructions. Much like LLaVA, they reuse a well-trained speech encoder and an existing LLM, which makes LLaSM more resource-friendly. Specifically, they employ Whisper as the speech encoder to embed the speech signals. A modal adaptor aligns the speech embeddings with the large language model's input text embeddings; the speech and text embeddings are then combined into interleaved sequences, which are fed into the LLM for supervised fine-tuning.
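A rough sketch of this design is below, assuming a frozen Whisper-style encoder producing 1024-dimensional frame features and an LLM with 4096-dimensional text embeddings; both dimensions and the simple linear projection are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ModalAdaptor(nn.Module):
    """Projects frozen speech-encoder features into the LLM's text-embedding space."""
    def __init__(self, speech_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(speech_dim, llm_dim)  # assumed simple linear projection

    def forward(self, speech_features: torch.Tensor) -> torch.Tensor:
        # speech_features: (batch, frames, speech_dim) from a frozen Whisper encoder
        return self.proj(speech_features)

def interleave(prompt_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
    # Splice the projected speech embeddings into the text-prompt embeddings so the
    # LLM consumes one interleaved sequence during supervised fine-tuning.
    return torch.cat([prompt_emb, speech_emb], dim=1)
```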
The training procedure has two stages. In the first stage, they use public ASR datasets for modality-adaptation pre-training: only the modal adaptor is trained to align the speech and text embeddings, while the LLM and the speech encoder stay frozen. Because the modal adaptor introduces only a small fraction of the model's parameters and most parameters remain frozen, this stage is not resource-intensive. In the second stage, cross-modal instruction data is used to train the model to handle multi-modal instructions and analyze cross-modal interactions. Here the speech encoder stays frozen while the parameters of the modal adaptor and the language model are updated for cross-modal instruction fine-tuning.
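The staged freezing can be sketched as follows; `model.speech_encoder`, `model.modal_adaptor`, and `model.llm` are hypothetical attribute names used only to illustrate which parameters are trainable in each stage.

```python
def configure_stage(model, stage: int) -> None:
    # Stage 1: modality-adaptation pre-training on ASR data -> only the adaptor trains.
    # Stage 2: cross-modal instruction fine-tuning -> adaptor and LLM train together.
    for p in model.speech_encoder.parameters():
        p.requires_grad = False              # speech encoder frozen in both stages
    for p in model.modal_adaptor.parameters():
        p.requires_grad = True               # adaptor trained in both stages
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)       # LLM unfrozen only in stage 2
```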
Notably, few open-source speech-text cross-modal instruction-following datasets are available, so the authors built and released the LLaSM-Audio-Instructions dataset. It was created by carefully selecting conversations from GPT4-LLM, ShareGPT, and WizardLM and then generating a large amount of conversational audio with text-to-speech technology. To their knowledge, it is the largest Chinese and English speech-text cross-modal instruction-following dataset, with 199k dialogues, 80k Chinese audio samples, and 428k English audio samples.
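A hedged sketch of how such a dataset could be assembled is shown below; the conversation format and the `synthesize` text-to-speech callable are illustrative placeholders, not the authors' actual pipeline.

```python
import json

def build_audio_instructions(conversations, synthesize, out_path):
    """Turn text-only instruction conversations into speech-text training records.

    `conversations` is a list of turn lists like [{"role": "user", "text": ...},
    {"role": "assistant", "text": ...}, ...]; `synthesize(text, filename)` is a
    placeholder for any TTS backend that writes a wav file and returns its path.
    """
    records = []
    for conv_id, turns in enumerate(conversations):
        for i, turn in enumerate(turns):
            if turn["role"] != "user" or i + 1 >= len(turns):
                continue  # only voice user instructions that have an assistant reply
            wav_path = synthesize(turn["text"], f"{conv_id:06d}_{i:02d}.wav")
            records.append({
                "audio": wav_path,
                "instruction": turn["text"],
                "response": turns[i + 1]["text"],
            })
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```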
Their study makes the following contributions:
• They build a speech-language multi-modal model that can understand and follow speech-language instructions, offering a more practical and natural way for people to interact with artificial intelligence.
• They construct and release LLaSM-Audio-Instructions, a large-scale Chinese and English cross-modal instruction-following dataset that pairs speech with text.
• The demo can be viewed online at HuggingFace, and the code is available on GitHub.