Navigating Cost-Complexity: Mixture of Thought LLM Cascades Illuminate a Path to Efficient Large Language Model Deployment | by Yuval Zukerman | Mar, 2024

What if I told you that you could cut 60% or more from your LLM API spending without compromising on accuracy? Surprisingly, now you can.

Large Language Models (LLMs) are now part of our everyday lives. Companies use the technology to automate processes, improve customer experiences, build better products, save money, and more.

Hosting your own LLMs is challenging. They offer broad capabilities but are expensive to run, and they often require complex infrastructure and massive amounts of data. That cost and complexity is why you rely on prompt engineering instead. You may even use retrieval-augmented generation (RAG) to improve context and reduce hallucinations. With both techniques, you offload running LLMs to the likes of OpenAI, Cohere, or Google. Yet scaling LLM adoption to new use cases, especially with the latest powerful models, can drive up API costs that were previously unaccounted for. Weaker models may be cheaper, but can you trust them with complex questions? Now, new research shows how to save money and get results that are as good, and sometimes better.

Get to Know LLM Cascades

In the search for lower LLM costs, researchers turned to the concept of LLM Cascades. In the dark ages, before the launch of ChatGPT, a team from Google and The University of Toronto defined this term as programs that use probability calculations to get the best results using multiple LLMs.

More recently, the FrugalGPT paper defined cascades as sending a user query to a list of LLMs, one after the other, from weaker to stronger, until the answer is good enough. FrugalGPT cascades use a dedicated scoring model to decide whether an answer meets a quality threshold.
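
To make the idea concrete, here is a minimal Python sketch of a cascade in the FrugalGPT sense. It is an illustration rather than the paper's implementation: `cascade`, the two stub models, and `toy_scorer` are hypothetical names, and a real scorer would be a trained model rather than a string check.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a "model" maps a prompt to an answer, and a
# "scorer" rates a (query, answer) pair between 0 and 1.
Model = Callable[[str], str]
Scorer = Callable[[str, str], float]


def cascade(query: str, models: List[Model], scorer: Scorer,
            threshold: float) -> Tuple[str, int]:
    """Try models from weakest to strongest and stop at the first
    answer whose score reaches the threshold.

    Returns the answer and the index of the model that produced it.
    The last model's answer is always accepted, since nothing stronger
    is left to ask.
    """
    if not models:
        raise ValueError("need at least one model")
    for index, model in enumerate(models[:-1]):
        answer = model(query)
        if scorer(query, answer) >= threshold:
            return answer, index
    return models[-1](query), len(models) - 1


# Stub models and scorer, standing in for real API calls.
def weak_model(query: str) -> str:
    return "maybe 4"


def strong_model(query: str) -> str:
    return "4"


def toy_scorer(query: str, answer: str) -> float:
    # Assumption for the demo only: hedged answers score low.
    return 0.0 if answer.startswith("maybe") else 1.0


answer, used_index = cascade("What is 2 + 2?", [weak_model, strong_model],
                             toy_scorer, threshold=0.5)
print(answer, used_index)
```

The design point is that every query pays for the cheap model first, and only the queries the scorer rejects pay for the expensive one.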

A recent paper titled ‘Large Language Model Cascades With Mixture of Thought Representations for Cost-Efficient Reasoning’ from George Mason University, Microsoft, and Virginia Tech offers an alternative: a function that can determine whether the answer is good enough without fine-tuning another model.

Mixture of Thought LLM Cascades

Instead of using several LLMs, ‘Mixture of Thought’ (MoT) reasoning uses just two: GPT-3.5 Turbo and GPT-4. The former is the ‘weaker’ LLM, while the latter is the ‘stronger’ LLM. The authors harnessed LLM ‘answer consistency’ to flag whether an LLM’s response is good enough. LLMs tend to produce consistent answers to similar prompts when they are confident the answers are correct. Therefore, when the weaker LLM’s answers are consistent, there is no need to call the stronger LLM. Conversely, LLMs tend to produce inconsistent answers when they lack confidence. That’s when you need the stronger LLM to answer the prompt. (Note: you can use a weaker/stronger LLM pair of your choice as well.)

The prompts themselves use few-shot in-context prompting to improve LLM answer quality. Such prompts guide the LLM’s response by giving examples of similar questions and answers.
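
As a rough illustration of few-shot in-context prompting, the sketch below assembles a prompt from worked examples followed by the new question. The `Q:`/`A:` layout and the function name `build_few_shot_prompt` are assumptions made for this example, not the exact format used in the paper.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Join (question, answer) examples and the new question into one prompt.

    The prompt ends with an open "A:" so the model continues with its answer.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical demonstration examples.
examples = [
    ("What is 1 + 1?", "2"),
    ("What is 2 + 3?", "5"),
]
prompt = build_few_shot_prompt(examples, "What is 4 + 4?")
print(prompt)
```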

To improve model reasoning and simplify consistency measurement, the researchers introduce a new prompting technique for reasoning tasks by ‘mixing’ two prompting techniques:

Chain of Thought (CoT) prompting encourages LLMs to generate intermediate reasoning steps before arriving at a final answer. Generating these steps improves accuracy, especially on complicated tasks.
Program of Thought (PoT) prompting builds on Chain of Thought but asks the model to express its reasoning as code instead of human language. An interpreter then runs that code to compute the final answer.
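
The difference between the two representations is easiest to see in how the final answer is recovered. The sketch below assumes a CoT response that ends with "The answer is N." and a PoT response that is Python code storing its result in a variable named `ans`. Both conventions, the sample outputs, and the helper names are assumptions for illustration.

```python
import re
from typing import Optional

# Hypothetical model outputs for the same word problem.
cot_output = (
    "Olivia has 23 dollars. Five bagels at 3 dollars each cost 15 dollars. "
    "23 - 15 = 8. The answer is 8."
)
pot_output = (
    "money = 23\n"
    "bagels = 5\n"
    "price = 3\n"
    "ans = money - bagels * price\n"
)


def answer_from_cot(text: str) -> Optional[float]:
    """Pull the number that follows 'The answer is' out of free text."""
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def answer_from_pot(code: str) -> Optional[float]:
    """Run the generated program and read the variable `ans`.

    Emptying the builtins only limits what the code can reach; it is
    NOT a real sandbox. Untrusted code should run in an isolated process.
    """
    namespace: dict = {}
    try:
        exec(code, {"__builtins__": {}}, namespace)
    except Exception:
        return None  # Broken programs simply yield no answer.
    value = namespace.get("ans")
    return float(value) if isinstance(value, (int, float)) else None


print(answer_from_cot(cot_output), answer_from_pot(pot_output))
```

Because both helpers return a plain number, answers from the two representations can be compared directly.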

The paper also introduces two methods to determine answer consistency:

Voting: This method samples multiple answers from the LLM, either by using similar prompts or by varying the response temperature option. It then measures how much the answers agree with each other. The answer that agrees the most with all the other answers is assumed to be correct. The team also defined a flexible ‘threshold’ value that balances answer consistency against budget constraints.
Verification: This approach compares the LLM’s most consistent answers across two distinct thought representations (e.g., CoT and PoT). The algorithm accepts the weaker LLM’s answer if the two prompt responses are identical.

Since voting requires multiple prompts, it may be more suitable when a budget exists to guide the threshold number.
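
Both checks can be sketched in a few lines of Python. This is a simplified reading of the two methods, assuming answers have already been normalized to comparable strings. The function names and the use of an agreement fraction as the consistency score are assumptions for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple


def vote(samples: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples that match it."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)


def accept_by_voting(samples: List[str], threshold: float) -> Optional[str]:
    """Accept the weaker model's majority answer if agreement reaches the threshold.

    Returning None means: escalate to the stronger model.
    """
    answer, agreement = vote(samples)
    return answer if agreement >= threshold else None


def accept_by_verification(cot_samples: List[str],
                           pot_samples: List[str]) -> Optional[str]:
    """Accept only if the majority answers from the two representations match."""
    cot_answer, _ = vote(cot_samples)
    pot_answer, _ = vote(pot_samples)
    return cot_answer if cot_answer == pot_answer else None


# Hypothetical sampled answers from the weaker model.
print(accept_by_voting(["8", "8", "8", "7"], threshold=0.7))
print(accept_by_verification(["8", "8", "7"], ["8", "8"]))
```

A higher threshold sends more questions to the stronger model, which is where the trade-off between consistency and budget shows up.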

The Bottom Line: Mixture of Thought Saves You Money

Let’s look at how much money the MoT technique saves and its impact on answer accuracy.

The researchers used the following sum to calculate prompt cost:

The cost of prompting the weaker model, counted once per call because we may prompt it several times
The cost of the answer evaluation process
If the evaluation process rejects the answer, we add the cost of prompting the strong model
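
The sum above can be worked through with a small example. All numbers here are made up for illustration: they are not real API prices and not figures from the paper, and the evaluation step is assumed to be free because it only compares strings.

```python
def question_cost(weak_calls: int, weak_call_cost: float,
                  evaluation_cost: float, escalated: bool,
                  strong_call_cost: float) -> float:
    """Cost of one question: weak-model calls, plus evaluation,
    plus one strong-model call if the weak answer was rejected."""
    cost = weak_calls * weak_call_cost + evaluation_cost
    if escalated:
        cost += strong_call_cost
    return cost


# Hypothetical unit costs (arbitrary units, NOT real pricing).
WEAK_CALL_COST = 1.0
STRONG_CALL_COST = 20.0
WEAK_CALLS = 4          # answers sampled from the weaker model per question
EVALUATION_COST = 0.0   # assumption: comparing answers costs nothing

# Ten questions, three of which are escalated to the stronger model.
escalations = [False] * 7 + [True] * 3

cascade_total = sum(
    question_cost(WEAK_CALLS, WEAK_CALL_COST, EVALUATION_COST,
                  escalated, STRONG_CALL_COST)
    for escalated in escalations
)
strong_only_total = STRONG_CALL_COST * len(escalations)
relative_cost = cascade_total / strong_only_total

print(cascade_total, strong_only_total, relative_cost)
```

With these made-up numbers the cascade costs half as much as sending every question to the stronger model. The saving shrinks as the escalation rate or the number of weak-model samples grows.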

The results were dramatic:

Using MoT variants — combining voting and verification with CoT and PoT — can lead to comparable performance at 40% of the cost of solely using GPT-4.
In testing against the CREPE Q&A dataset, MoT outperformed GPT-4 at 47% of its cost.
Mixing PoT and CoT improves decision-making compared to using one of the techniques alone.
Increasing the threshold when using the voting method did not significantly impact quality despite the additional cost.
The consistency check proved reliable at identifying correct LLM answers. It successfully predicted when to fall back on the strong model to obtain the best results.

Hosting and managing Large Language Models (LLMs) in-house comes with significant challenges. They bring complexity, high costs, and the need for extensive infrastructure and data resources. As a result, LLMs present substantial hurdles for organizations seeking to harness their broad capabilities. That may lead you to turn to hosted LLMs. Yet, this approach presents companies with unforeseen cost increases and budget challenges as they expand to new use cases. That is particularly evident when integrating the latest powerful models. To avoid that fate, you face a new dilemma: Can you trust weaker, more affordable models? Can you overcome concerns about their accuracy in handling complex questions?

LLM Cascades with Mixture of Thought (MoT) offers two significant steps forward:

Substantial cost savings over exclusively using the latest models.
Demonstrable results on par with the latest models.

This breakthrough provides organizations with a practical and efficient approach to navigating the delicate balance between the powerful capabilities of LLMs and the imperative to manage costs effectively.

Domino Staff Software Engineer Subir Mansukhani contributed to this post.
