Retrieval-augmented language models can better adapt to changes in world state and incorporate long-tail knowledge, but most existing methods retrieve only a few short, contiguous chunks from a corpus. This limits holistic understanding of the overall document context and the ability to represent and leverage large-scale discourse structure. The limitation is particularly relevant for thematic questions that require integrating knowledge from multiple parts of a text, such as understanding an entire book.
Recent developments in Large Language Models (LLMs) demonstrate their effectiveness as standalone knowledge stores, encoding facts within their parameters, and fine-tuning on downstream tasks further enhances their performance. However, keeping LLMs up to date with evolving world knowledge remains challenging. An alternative approach is to index text in an information retrieval system and present the retrieved passages to the LLM, giving it access to current, domain-specific knowledge. Existing retrieval-augmented methods, however, retrieve only short, contiguous text chunks, which hinders the representation of large-scale discourse structure; this structure is crucial for thematic questions and for comprehensive understanding of long texts such as those in the NarrativeQA dataset.
Researchers from Stanford University propose RAPTOR, an indexing and retrieval system designed to address these limitations. RAPTOR uses a tree structure to capture both the high-level and low-level details of a text: it clusters text chunks, generates summaries for each cluster, and constructs the tree from the bottom up. This structure makes it possible to load text chunks from different levels into an LLM's context, enabling efficient and effective answering of questions at various levels of detail. The key contribution is the use of recursive text summarization for retrieval augmentation, which improves context representation across different scales, as demonstrated in experiments on long-document collections.
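To make the bottom-up construction concrete, here is a minimal sketch of the build loop. The `embed`, `cluster`, and `summarize` callables are hypothetical stand-ins for an embedding model, a soft-clustering step, and an LLM summarizer; this is an illustration of the idea, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    embedding: list
    children: list = field(default_factory=list)

def build_tree(chunks, embed, cluster, summarize, max_levels=3):
    # Leaf layer: one node per raw text chunk.
    layer = [Node(text=c, embedding=embed(c)) for c in chunks]
    tree = [layer]
    for _ in range(max_levels):
        if len(layer) <= 1:
            break
        # Group semantically similar nodes; `cluster` is assumed to return
        # lists of node indices, one list per cluster.
        groups = cluster([n.embedding for n in layer])
        parents = []
        for idx in groups:
            members = [layer[i] for i in idx]
            # Summarize each cluster to form its parent node.
            summary = summarize(" ".join(m.text for m in members))
            parents.append(Node(text=summary, embedding=embed(summary),
                                children=members))
        tree.append(parents)
        layer = parents
    return tree  # layers from leaves (raw chunks) up to top-level summaries
```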
RAPTOR addresses the challenge of capturing both semantic depth and connections across a document by constructing a recursive tree structure that combines broad thematic comprehension with granular detail. The process segments the retrieval corpus into chunks, embeds them using SBERT, and groups them with a soft clustering algorithm based on Gaussian Mixture Models (GMMs), with Uniform Manifold Approximation and Projection (UMAP) used for dimensionality reduction; each cluster is then summarized, and the procedure is repeated on the summaries to build the tree. The resulting structure can be queried efficiently either by tree traversal or with a collapsed-tree approach, retrieving relevant information at different levels of specificity.
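The collapsed-tree strategy can be sketched as follows: nodes from every layer are pooled into one candidate set, ranked against the query by cosine similarity, and added to the context until a token budget is exhausted. The function and parameter names (`collapsed_tree_retrieve`, `token_budget`) are illustrative assumptions, and the token count is a crude whitespace approximation.

```python
import numpy as np

def collapsed_tree_retrieve(tree, query, embed, token_budget=2000):
    # Pool nodes from every layer of the tree into a single candidate set.
    nodes = [n for layer in tree for n in layer]
    q = np.asarray(embed(query), dtype=float)
    scored = []
    for n in nodes:
        v = np.asarray(n.embedding, dtype=float)
        sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
        scored.append((sim, n))
    # Greedily add the most similar nodes until the token budget is spent.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    context, used = [], 0
    for _, n in scored:
        cost = len(n.text.split())  # crude whitespace token count
        if used + cost > token_budget:
            break
        context.append(n.text)
        used += cost
    return context  # concatenated and passed to the reader LLM
```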
RAPTOR outperforms baseline methods across three question-answering datasets: NarrativeQA, QASPER, and QuALITY. Controlled comparisons using UnifiedQA 3B as the reader show consistent superiority of RAPTOR over BM25 and DPR. Paired with GPT-4, RAPTOR achieves state-of-the-art results on the QASPER and QuALITY datasets, showcasing its effectiveness in handling thematic and multi-hop queries. Ablations validate the contribution of the tree structure, showing that upper-level summary nodes are important for capturing a broader understanding of the text and for enhancing retrieval.
In conclusion, Stanford University researchers introduce RAPTOR, a tree-based retrieval system that augments large language models with contextual information at different levels of abstraction. RAPTOR constructs a hierarchical tree through recursive clustering and summarization, enabling effective synthesis of information from diverse sections of a retrieval corpus. Controlled experiments show RAPTOR's superiority over traditional retrieval methods, establishing new state-of-the-art results on several question-answering tasks. Overall, RAPTOR is a promising approach for advancing the capabilities of language models through enhanced contextual retrieval.
Check out the Paper. All credit for this research goes to the researchers of this project.