The charm of conversational interfaces lies in their simplicity and uniformity across different applications. If the future of user interfaces is that all apps look more or less the same, is the job of the UX designer doomed? Definitely not — conversation is an art to be taught to your LLM so it can conduct conversations that are helpful, natural, and comfortable for your users. Good conversational design emerges when we combine our knowledge of human psychology, linguistics, and UX design. In the following, we will first consider two basic choices when building a conversational system, namely whether you will use voice and/or chat, as well as the larger context of your system. Then, we will look at the conversations themselves, and see how you can design the personality of your assistant while teaching it to engage in helpful and cooperative conversations.
Conversational interfaces can be implemented using chat or voice. In a nutshell, voice is faster while chat allows users to stay private and to benefit from enriched UI functionality. Let’s dive a bit deeper into the two options since this is one of the first and most important decisions you will face when building a conversational app.
To pick between the two alternatives, start by considering the physical setting in which your app will be used. For example, why are almost all conversational systems in cars, such as those offered by Nuance Communications, based on voice? Because the hands of the driver are already busy and they cannot constantly switch between the steering wheel and a keyboard. This also applies to other activities like cooking, where users want to stay in the flow of their activity while using your app. Cars and kitchens are mostly private settings, so users can experience the joy of voice interaction without worrying about privacy or about bothering others. By contrast, if your app is to be used in a public setting like the office, a library, or a train station, voice might not be your first choice.
After understanding the physical setting, consider the emotional side. Voice can be used intentionally to transmit tone, mood, and personality. Does this add value in your context? If you are building your app for leisure, voice might increase the fun factor, while an assistant for mental health could accommodate more empathy and give a potentially troubled user a wider range of expression. By contrast, if your app will assist users in a professional setting like trading or customer service, a more anonymous, text-based interaction might contribute to more objective decisions and spare you the hassle of designing an overly emotional experience.
As a next step, think about the functionality. The text-based interface allows you to enrich the conversations with other media like images, as well as graphical UI elements such as buttons. For example, in an e-commerce assistant, an app that suggests products by posting their pictures and structured descriptions will be way more user-friendly than one that describes products via voice and potentially provides their identifiers.
Finally, let’s talk about the additional design and development challenges of building a voice UI:
- There is an additional step of speech recognition that happens before user inputs can be processed with LLMs and Natural Language Processing (NLP).
- Voice is a more personal and emotional medium of communication — thus, the requirements for designing a consistent, appropriate, and enjoyable persona behind your virtual assistant are higher, and you will need to take into account additional factors of “voice design” such as timbre, stress, tone, and speaking speed.
- Users expect your voice conversation to proceed at the same speed as a human conversation. To offer a natural interaction via voice, you need a much shorter latency than for chat. In human conversations, the typical gap between turns is 200 milliseconds. This prompt response is possible because we start constructing our turns while listening to our partner’s speech. Your voice assistant will need to match this degree of fluency. By contrast, for chatbots, you compete with time spans of seconds, and some developers even introduce an additional delay to make the conversation feel like a typed chat between humans.
- Communication via voice is a linear, one-off enterprise — if your user didn’t get what you said, you are in for a tedious, error-prone clarification loop. Thus, your turns need to be as concise, clear, and informative as possible.
If you go for the voice solution, make sure that you not only clearly understand the advantages as compared to chat, but also have the skills and resources to address these additional challenges.
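To make the latency challenge concrete, here is a minimal sketch of a per-turn latency budget. It assumes a simple pipeline whose stage delays add up; the stage names and the numbers in the usage note are hypothetical placeholders, not measurements. Only the 200 millisecond target comes from the discussion above.

```python
# Typical gap between turns in human conversation, as cited above.
TARGET_GAP_MS = 200.0


def turn_latency_ms(stages: dict[str, float]) -> float:
    """Total delay of one assistant turn, assuming the stages run one after another."""
    return sum(stages.values())


def over_budget_ms(stages: dict[str, float], target_ms: float = TARGET_GAP_MS) -> float:
    """How many milliseconds the turn exceeds the target (0.0 if it fits)."""
    return max(0.0, turn_latency_ms(stages) - target_ms)


def slowest_stage(stages: dict[str, float]) -> str:
    """Name of the stage that contributes most to the delay."""
    return max(stages, key=stages.get)
```

For example, with placeholder values such as `{"speech_recognition": 150.0, "llm_response": 400.0, "speech_synthesis": 120.0}`, you would measure your own stages, see how far the sum lies above the target, and start optimizing at the stage returned by `slowest_stage`.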
Now, let’s consider the larger context in which you can integrate conversational AI. All of us are familiar with chatbots on company websites — those widgets on the right of your screen that pop up when we open the website of a business. Personally, more often than not, my intuitive reaction is to look for the Close button. Why is that? Through initial attempts to “converse” with these bots, I have learned that they cannot satisfy more specific information requirements, and in the end, I still need to comb through the website. The moral of the story? Don’t build a chatbot because it’s cool and trendy — rather, build it because you are sure it can create additional value for your users.
Beyond the controversial widget on a company website, there are several exciting contexts to integrate those more general chatbots that have become possible with LLMs:
- Copilots: These assistants guide and advise you through specific processes and tasks, like GitHub Copilot for programming. Normally, copilots are “tied” to a specific application (or a small suite of related applications).
- Synthetic humans (also digital humans): These creatures “emulate” real humans in the digital world. They look, act, and talk like humans and thus also need rich conversational abilities. Synthetic humans are often used in immersive applications such as gaming, and augmented and virtual reality.
- Digital twins: Digital twins are digital “copies” of real-world processes and objects, such as factories, cars, or engines. They are used to simulate, analyze, and optimize the design and behavior of the real object. Natural language interactions with digital twins allow for smoother and more versatile access to the data and models.
- Databases: Nowadays, data is available on any topic, be it investment recommendations, code snippets, or educational materials. What is often hard is to find the very specific data that users need in a specific situation. Graphical interfaces to databases are either too coarse-grained or covered with endless search and filter widgets. Versatile query languages such as SQL and GraphQL are only accessible to users with the corresponding skills. Conversational solutions allow users to query the data in natural language, while the LLM that processes the requests automatically converts them into the corresponding query language (cf. this article for an explanation of Text2SQL).
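To illustrate the database scenario, here is a minimal Text2SQL sketch. Everything in it is an assumption made for illustration: the `products` schema, the prompt wording, and `fake_llm`, which stands in for a real model call and simply returns a canned query. The part worth noting is the flow: put the schema and the question into a prompt, let the model produce SQL, and check the result before running it.

```python
import sqlite3

# Hypothetical schema used only for this example.
SCHEMA = """
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    color TEXT NOT NULL,
    price REAL NOT NULL
);
"""


def build_prompt(question: str, schema: str = SCHEMA) -> str:
    """Combine the schema and the user's question into a Text2SQL prompt."""
    return (
        "Translate the question into a single SQLite SELECT statement.\n"
        f"Schema:\n{schema.strip()}\n"
        f"Question: {question}\n"
        "SQL:"
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same canned query."""
    return "SELECT name, price FROM products WHERE color = 'yellow' ORDER BY price"


def is_read_only(sql: str) -> bool:
    """Very coarse check: accept only one statement that starts with SELECT."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def answer(question: str, conn: sqlite3.Connection, llm=fake_llm) -> list[tuple]:
    """Generate SQL for the question, validate it, and run it."""
    sql = llm(build_prompt(question))
    if not is_read_only(sql):
        raise ValueError(f"Refusing to run non-SELECT statement: {sql!r}")
    return conn.execute(sql).fetchall()


def demo_connection() -> sqlite3.Connection:
    """In-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO products (name, color, price) VALUES (?, ?, ?)",
        [
            ("Summer dress", "yellow", 59.0),
            ("Evening dress", "black", 120.0),
            ("Wrap dress", "yellow", 45.0),
        ],
    )
    return conn
```

Calling `answer("Which yellow items do you have?", demo_connection())` returns the two yellow products, cheapest first. In a real system, the string check in `is_read_only` would not be enough on its own; a read-only database user is a more robust safeguard.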
As humans, we are wired to anthropomorphize, i.e. to attribute human traits to anything that vaguely resembles a human. Language is one of the most unique and fascinating characteristics of humankind, and conversational products will automatically be associated with humans. People will imagine a person behind their screen or device. It is good practice not to leave this person to your users’ imagination, but rather to give it a consistent personality that matches your product and brand. This process is called “persona design”.
The first step of persona design is understanding the character traits you would like your persona to display. Ideally, this is already done at the level of the training data — for example, when using RLHF, you can ask your annotators to rank the data according to traits like helpfulness, politeness, fun, etc., in order to bias the model towards the desired characteristics. These characteristics can be matched with your brand attributes to create a consistent image that continuously promotes your branding via the product experience.
Beyond general characteristics, you should also think about how your virtual assistant will deal with specific situations beyond the “happy path”. For example, how will it respond to user requests that are beyond its scope, reply to questions about itself, and deal with abusive or vulgar language?
It is important to develop explicit internal guidelines on your persona that can be used by data annotators and conversation designers. This will allow you to design your persona in a purposeful way and keep it consistent across your team and over time, as your application undergoes multiple iterations and refinements.
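One way to keep such guidelines explicit and shareable is to store them as structured data that can also be rendered into a system prompt. The following is a minimal sketch under that assumption; the `Persona` class, the name "Mira", and the example guidelines are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    """Persona guidelines shared by annotators and conversation designers."""

    name: str
    traits: list[str]
    # Behavior beyond the happy path, keyed by situation.
    edge_cases: dict[str, str] = field(default_factory=dict)

    def to_system_prompt(self) -> str:
        """Render the guidelines as plain text for use in a system prompt."""
        lines = [
            f"You are {self.name}, a virtual assistant.",
            "Character traits: " + ", ".join(self.traits) + ".",
        ]
        if self.edge_cases:
            lines.append("Handling special situations:")
            for situation, guideline in self.edge_cases.items():
                lines.append(f"- {situation}: {guideline}")
        return "\n".join(lines)


# Hypothetical persona for illustration.
PERSONA = Persona(
    name="Mira",
    traits=["helpful", "polite", "lightly humorous"],
    edge_cases={
        "out-of-scope request": "Say what you can help with and offer an alternative.",
        "question about yourself": "State that you are an AI assistant for this product.",
        "abusive language": "Stay calm, do not retaliate, and steer back to the task.",
    },
)
```

Keeping the persona in one place like this means that a change to a trait or an edge-case rule reaches the annotation guidelines and the prompt at the same time.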
Have you ever had the impression of talking to a brick wall when you were actually speaking with a human? Sometimes, we find our conversation partners are just not interested in leading the conversation to success. Fortunately, in most cases, things are smoother, and humans will intuitively follow the “principle of cooperation” that was introduced by the language philosopher Paul Grice. According to this principle, humans who successfully communicate with each other follow four maxims, namely quantity, quality, relevance, and manner.
Maxim of quantity
The maxim of quantity asks the speaker to be informative and make their contribution as informative as required. On the side of the virtual assistant, this also means actively moving the conversation forward. For example, consider this snippet from an e-commerce fashion app:
Assistant: What kind of clothing items are you looking for?
User: I am looking for a dress in orange.
Assistant: Don’t: Sorry, we don’t have orange dresses at the moment.
Do: Sorry, we don’t have dresses in orange, but we have this great and very comfortable dress in yellow: …
The user hopes to leave your app with a suitable item. By stopping the conversation because you don’t have items that would fit the exact description, you kill off the possibility of success. However, if your app makes suggestions about alternative items, it will appear more helpful and leave the option of a successful interaction open.
Especially in voice interactions, it is important to find the right balance between providing all the information the user might need for success, while not overwhelming them with unnecessary information which might cloud the interaction.
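The “Do” reply in the dress example can be sketched as a search with a fallback: if there is no exact match, the assistant offers a small number of alternatives from the same category instead of ending the conversation. The catalog, the `Item` class, and the `max_alternatives` limit are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    category: str
    color: str


# Made-up catalog for the example.
CATALOG = [
    Item("Linen dress", "dress", "yellow"),
    Item("Wrap dress", "dress", "red"),
    Item("Maxi dress", "dress", "blue"),
    Item("Cotton shirt", "shirt", "orange"),
]


def reply(category: str, color: str, catalog=CATALOG, max_alternatives: int = 2) -> str:
    """Answer a product request and keep the conversation moving forward."""
    same_category = [item for item in catalog if item.category == category]
    exact = [item for item in same_category if item.color == color]
    if exact:
        names = ", ".join(item.name for item in exact)
        return f"I found these {color} {category} options: {names}."
    if same_category:
        # No exact match: suggest a few alternatives rather than stopping.
        alternatives = same_category[:max_alternatives]
        names = ", ".join(f"{item.name} in {item.color}" for item in alternatives)
        return f"Sorry, we don't have a {category} in {color}, but we do have: {names}."
    return f"Sorry, we don't carry any {category} items at the moment."
```

The `max_alternatives` limit reflects the balance described above: enough information to keep success possible, but not so much that the user is overwhelmed.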
Maxim of quality
The maxim of quality asks speakers to be truthful and not say things they believe are false, or for which they lack adequate evidence. There is a lot of subjectivity and personal belief involved here, so this maxim cannot be directly projected onto LLMs. As we have seen, LLMs tend to hallucinate, and this risk is especially high when the conversation drifts off into less familiar territories. Additional guardrails can be used to avoid misleading statements and bring the conversation back on track.
Maxim of relevance
The maxim of relevance asks us to be — well, relevant. As a precondition, the virtual assistant has to understand the true intent of the user. Here is an example of an airline assistant that violates this rule:
User: I need a flight from Berlin to Zurich tomorrow morning.
Assistant: OK, I found the following flights from Berlin to Zurich for tomorrow morning. Flight BZ36294 leaves Berlin at 6.30am and arrives in Zurich at 8am. Flight BZ96294 leaves Berlin at 8am and arrives in Zurich at 9.30am. Flight BZ95614 leaves Berlin at 11am and arrives in Zurich at 12.30pm. There is also the possibility of a connecting flight, in which case you would leave Berlin at 11.45am and arrive in Zurich at 3pm.
User: Sorry… could you repeat this again?
The assistant is overwhelming the user with detailed information that might not be relevant. By breaking this up and guiding the user to limit their search space, the dialogue becomes much easier to digest:
User: I need a flight from Berlin to Zurich tomorrow morning.
Assistant: OK, I found flights leaving at 6.30, 8, and 11am. Which time would you like to leave?
User: I need to be in Zurich before 9am.
Assistant: OK, so you can take the flight BZ36294. It leaves at 6.30 and arrives at 8am. Should I buy the ticket for you?
User: Yes, thanks.
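The improved dialogue follows a simple pattern: summarize the options briefly, ask a narrowing question, then filter by the user's constraint. Here is a minimal sketch of that pattern, using the three direct flights from the example. The `Flight` class and the helper names are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Flight:
    number: str
    departs: time
    arrives: time


# The three direct flights from the example dialogue.
FLIGHTS = [
    Flight("BZ36294", time(6, 30), time(8, 0)),
    Flight("BZ96294", time(8, 0), time(9, 30)),
    Flight("BZ95614", time(11, 0), time(12, 30)),
]


def fmt(t: time) -> str:
    """Format a time compactly, e.g. 6:30."""
    return f"{t.hour}:{t.minute:02d}"


def summarize(flights: list[Flight]) -> str:
    """Give only the departure times and ask a narrowing question."""
    times = ", ".join(fmt(f.departs) for f in flights)
    return f"I found flights leaving at {times}. Which time would you like to leave?"


def arriving_before(flights: list[Flight], deadline: time) -> list[Flight]:
    """Keep only the flights that land before the user's deadline."""
    return [f for f in flights if f.arrives < deadline]
```

With the constraint “before 9am”, `arriving_before(FLIGHTS, time(9, 0))` leaves only flight BZ36294, which the assistant can then describe in full.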
Maxim of manner
Finally, the maxim of manner states that our speech acts should be clear, concise, and orderly, avoiding ambiguity and obscurity of expression. Your virtual assistant should avoid technical or internal jargon, and favor simple, universally understandable formulations.
While Grice’s principles are valid for all conversations independently of a specific domain, LLMs that were not trained specifically for conversation will often fail to fulfill them. Thus, when compiling your training data, it is important to have enough dialogue samples that allow your model to learn these principles.
The domain of conversational design is developing rather quickly. Whether you are already building AI products or thinking about your career path in AI, I encourage you to dig deeper into this topic (cf. the excellent introductions in [5] and [6]). As AI is turning into a commodity, good design together with a defensible data strategy will become two important differentiators for AI products.
Let’s summarize the key takeaways from the article. Additionally, figure 6 shows a “cheatsheet” with the main points that you can download as a reference.
- LLMs enhance conversational AI: Large Language Models (LLMs) have significantly improved the quality and scalability of conversational AI applications across various industries and use cases.
- Conversational AI can add a lot of value to applications with lots of similar user requests (e.g. customer service), or which need to access a large quantity of unstructured data (e.g. knowledge management).
- Data: Fine-tuning LLMs for conversational tasks requires high-quality conversational data that closely mirrors real-world interactions. Crowdsourcing and LLM-generated data can be valuable resources for scaling data collection.
- Putting the system together: Developing conversational AI systems is an iterative and experimental process, involving constant optimization of data, fine-tuning strategies, and component integration.
- Teaching conversation skills to LLMs: Fine-tuning LLMs involves training them to recognize and respond to specific communicative intents and situations.
- Adding external data with semantic search: Integrating external and internal data sources using semantic search enhances the AI’s responses by providing more contextually relevant information.
- Memory and context awareness: Effective conversational systems must maintain context awareness, including tracking the history of the current conversation and past interactions, to provide meaningful and coherent responses.
- Setting guardrails: To ensure responsible behavior, conversational AI systems should employ guardrails to prevent inaccuracies, hallucinations, and breaches of privacy.
- Persona design: Designing a consistent persona for your conversational assistant is essential to create a cohesive and branded user experience. Persona characteristics should align with your product and brand attributes.
- Voice vs. chat: Choosing between voice and chat interfaces depends on factors like the physical setting, emotional context, functionality, and design challenges. Consider these factors when deciding on the interface for your conversational AI.
- Integration in various contexts: Conversational AI can be integrated in different contexts, including copilots, synthetic humans, digital twins, and databases, each with specific use cases and requirements.
- Observing the Principle of Cooperation: Following the principles of quantity, quality, relevance, and manner in conversations can make interactions with conversational AI more helpful and user-friendly.