A team of researchers from Nankai University and ByteDance introduced a novel framework called ChatAnything, designed to generate anthropomorphized personas for large language model (LLM)-based characters in an online manner. The aim is to create personas with customized visual appearance, personality, and tones based solely on text descriptions. The researchers leverage the in-context learning capability of LLMs to generate personalities using carefully designed system prompts. They propose two innovative concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD) for diverse voice and appearance generation.
MoV employs text-to-speech (TTS) algorithms with pre-defined tones, selecting the most matching one based on user-provided text descriptions. MoD combines text-to-image generation techniques and talking head algorithms to streamline the process of generating talking objects. However, the researchers observe a challenge where anthropomorphic objects generated by current models are often undetectable by pre-trained face landmark detectors, leading to failure in face motion generation. To address this, they incorporate pixel-level guidance during image generation to infuse human face landmarks. This pixel-level injection significantly increases the face landmark detection rate, enabling automatic face animation based on generated speech content.
The paper discusses recent advancements in large language models (LLMs) and their in-context learning capabilities, positioning them at the forefront of academic discussions. The researchers emphasize the need for a framework that generates LLM-enhanced personas with customized personalities, voices, and visual appearances. For personality generation, they leverage the in-context learning capability of LLMs, creating a pool of voice modules using text-to-speech (TTS) APIs. The mixture of voices (MoV) module selects tones based on user text inputs.
The visual appearance of speech-driven talking motions and expressions is addressed using recent talking head algorithms. However, the researchers encounter challenges when using images generated by diffusion models as input for talking head models. Only 30% of images are detectable by state-of-the-art talking head models, indicating a distribution misalignment. To bridge this gap, the researchers propose a zero-shot method, injecting face landmarks during the image generation phase.
The proposed ChatAnything framework comprises four main blocks: LLM-based control module, portrait initializer, mixture of text-to-speech modules, and motion generation module. The researchers incorporated diffusion models, voice changers, and structural control to create a modular and flexible system. To validate the effectiveness of guided diffusion, the researchers created a validation dataset with prompts from different categories. They use a pre-trained face keypoint detector to assess face landmark detection rates, showcasing the impact of their proposed method.
The researchers introduce a comprehensive framework, ChatAnything, for generating LLM-enhanced personas with anthropomorphic characteristics. They address challenges in face landmark detection and propose innovative solutions, presenting promising results in their validation dataset. This work opens avenues for future research in integrating generative models with talking head algorithms and improving the alignment of data distributions.
Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also, don’t forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
If you like our work, you will love our newsletter..
Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology(IIT), Kharagpur. She is a tech enthusiast and has a keen interest in the scope of software and data science applications. She is always reading about the developments in different field of AI and ML.