Cardinality estimation (CE) is crucial to optimizing query performance in relational databases. It involves predicting how many rows a query, and each intermediate step in its plan, will return, which directly influences the execution plan the query optimizer chooses. Accurate cardinality estimates are essential for selecting efficient join orders, deciding whether to use an index, and choosing the best join method. These decisions significantly impact query execution times and overall database performance. Inaccurate estimates can lead to poor execution plans that run slower, sometimes by several orders of magnitude. This makes CE a fundamental aspect of database management, with extensive research dedicated to improving its accuracy and efficiency.
The challenge, however, lies in the limitations of current methods for cardinality estimation. Traditional CE techniques, widely used in modern database systems, rely on heuristics and simplified models, such as assuming data uniformity and column independence. While computationally efficient, these methods often fail to predict cardinalities accurately, especially in complex queries involving multiple tables and filters. Learned CE models have emerged as a promising alternative, offering better accuracy by leveraging data-driven approaches. However, these models must overcome significant barriers to adoption in practical settings. High training overheads, the need for large datasets, and the lack of a systematic benchmark for evaluating these models’ performance across diverse databases have hindered their widespread use.
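To see why the column-independence assumption can mislead an optimizer, consider a minimal Python sketch. It is illustrative only: the toy table, the column layout, and the function names are assumptions made for this example and do not come from any particular database system. The two columns are perfectly correlated, so multiplying per-column selectivities underestimates the true result size.

```python
def count_matches(rows, col, value):
    """Number of rows whose column `col` equals `value`."""
    return sum(1 for row in rows if row[col] == value)


def independence_estimate(rows, a_value, b_value):
    """Estimate |a = a_value AND b = b_value| assuming the columns are independent.

    The estimate is N * sel(a) * sel(b), written here as
    count(a) * count(b) / N to keep the arithmetic exact.
    """
    n = len(rows)
    return count_matches(rows, 0, a_value) * count_matches(rows, 1, b_value) / n


def true_cardinality(rows, a_value, b_value):
    """Exact number of rows satisfying both predicates."""
    return sum(1 for row in rows if row[0] == a_value and row[1] == b_value)


# Hypothetical toy table: column b always equals column a (full correlation).
rows = [(i % 10, i % 10) for i in range(1000)]

print(independence_estimate(rows, 3, 3))  # 10.0
print(true_cardinality(rows, 3, 3))       # 100
```

Here the heuristic is off by a factor of ten on a single table with two predicates; errors of this kind can compound as joins are added.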
Existing methods, including traditional heuristic-based approaches, have been supplemented by learned models that utilize instance-specific features from the data. These learned models can improve accuracy but often at the cost of extensive training requirements. For example, workload-driven approaches necessitate running tens of thousands of queries to collect true cardinalities for training, leading to significant computational overheads. More recent data-driven methods attempt to model the data distribution within and across tables without executing queries, reducing some overhead but still requiring re-training as data changes. Despite these advancements, the lack of a comprehensive benchmark has made it difficult to compare different models and assess their generalizability across various datasets.
Researchers from Google Inc. have introduced CardBench, a benchmark designed to address the need for a systematic evaluation framework for learned cardinality estimation models. CardBench is a comprehensive benchmark that includes thousands of queries across 20 distinct real-world databases, significantly more than any previous benchmarks. This allows for a more thorough evaluation of learned CE models under various conditions. The benchmark supports three key setups: instance-based models, which are trained on a single dataset; zero-shot models, which are pre-trained on multiple datasets and then tested on an unseen dataset; and fine-tuned models, which are pre-trained and then fine-tuned with a small amount of data from the target dataset.
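The three setups differ only in which data a model sees before it is tested on the target dataset. The following Python sketch makes that explicit. It is not part of CardBench: the `Setup` class, the `make_setups` function, and the placeholder dataset names are hypothetical, and the default of 500 fine-tuning queries is borrowed from the evaluation figure reported later in the article.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Setup:
    name: str
    # Datasets used for pre-training (empty for instance-based models).
    pretrain_datasets: Tuple[str, ...]
    # Training queries taken from the target dataset; None means "all available".
    target_queries: Optional[int]


def make_setups(datasets, target, fine_tune_queries=500):
    """Describe the three evaluation setups for one held-out target dataset."""
    if target not in datasets:
        raise ValueError(f"unknown target dataset: {target}")
    others = tuple(d for d in datasets if d != target)
    return [
        # Trained from scratch on the target dataset only.
        Setup("instance-based", (), None),
        # Pre-trained on the other datasets, never sees the target before testing.
        Setup("zero-shot", others, 0),
        # Pre-trained on the other datasets, then adapted with a few target queries.
        Setup("fine-tuned", others, fine_tune_queries),
    ]


# Placeholder dataset names, chosen for illustration only.
for setup in make_setups(["db_a", "db_b", "db_c"], target="db_c"):
    print(setup)
```

In every case the model is evaluated on queries from the target dataset, so the comparison isolates the effect of pre-training and of the amount of target-specific training data.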
CardBench’s design includes tools for calculating necessary data statistics, generating realistic SQL queries, and creating annotated query graphs for training CE models. The benchmark offers two sets of training data: one for single table queries with multiple filter predicates and another for binary join queries involving two tables. The benchmark includes 9125 single table queries and 8454 binary join queries for one of its smaller datasets, ensuring a robust and challenging environment for model evaluation. The training data labels, derived from Google BigQuery, required seven CPU years of query execution time, highlighting the significant computational investment in creating this benchmark. By providing these datasets and tools, CardBench lowers the barrier for researchers interested in developing and testing new CE models.
Performance evaluations using CardBench show promising results, particularly for fine-tuned models. While zero-shot models struggle with accuracy when applied to unseen datasets, especially in complex queries involving joins, fine-tuned models achieve accuracy comparable to instance-based methods with far less training data. For instance, fine-tuned graph neural network (GNN) models achieved a median q-error of 1.32 and a 95th percentile q-error of 120 in binary join queries, significantly outperforming zero-shot models. The results suggest fine-tuning pre-trained models can substantially improve their performance even with 500 queries. This makes them viable for practical applications where training data may be limited.
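The q-error figures quoted above use the standard metric for cardinality estimates: the ratio between the estimate and the true value, taken whichever way round is at least 1, so a perfect estimate scores 1 and over- and underestimates are penalized symmetrically. The sketch below shows one way to compute it along with the median and 95th percentile. Clamping values to a minimum of 1 and using the nearest-rank percentile are assumptions made here for simplicity; the article does not say which conventions the benchmark uses.

```python
import math


def q_error(estimate, truth):
    """Symmetric ratio error: max(est / true, true / est), always >= 1."""
    # Assumption: clamp to 1 so that empty results do not cause division by zero.
    est = max(float(estimate), 1.0)
    tru = max(float(truth), 1.0)
    return max(est / tru, tru / est)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical (estimate, true cardinality) pairs for five queries.
pairs = [(10, 100), (100, 10), (50, 50), (30, 60), (400, 100)]
errors = [q_error(est, tru) for est, tru in pairs]

print(percentile(errors, 50))  # 4.0
print(percentile(errors, 95))  # 10.0
```

A median close to 1 alongside a much larger 95th percentile, as in the reported results, indicates that most estimates are accurate while a tail of queries is still badly misestimated.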
In conclusion, CardBench represents a significant advancement in learned cardinality estimation. By providing a comprehensive and diverse benchmark, it lets researchers systematically evaluate and compare different CE models, fostering further innovation in this critical area. The benchmark’s support for fine-tuned models, which require less data and training time, points to a practical option for real-world applications where the cost of training new models can be prohibitive.
Check out the Paper. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.