When building machine learning (ML) models from existing datasets, practitioners must first familiarize themselves with the data, decipher its structure, and decide which subset to use as features. This overhead is large enough that a basic barrier, the wide range of data formats, is slowing progress in ML.
Text, structured data, images, audio, and video are just a few of the content types found in ML datasets. Even among datasets that cover the same subject matter, there is no standard file layout or data format. This obstacle lowers productivity throughout ML development, from data discovery to model training. It also makes it harder to build essential tools for working with very large datasets.
Dataset metadata can be expressed in various formats, including schema.org and DCAT. Unfortunately, these formats weren’t designed with machine learning data in mind. ML data has unique requirements, such as combining and extracting data from structured and unstructured sources, carrying metadata that supports responsible data use, and describing ML usage characteristics like training, test, and validation sets.
Google has recently introduced Croissant, a new metadata format for ML-ready datasets. The 1.0 release includes the format specification, example datasets, and an open-source Python library for validating, consuming, and generating Croissant metadata. It also includes an open-source visual editor for loading, inspecting, and intuitively creating Croissant dataset descriptions.
Although it offers a consistent way to describe and organize data, the Croissant format does not change the data’s actual representation (such as image or text file formats). Croissant is an extension of schema.org, the gold standard for publishing structured data online, which over 40 million datasets currently use. On top of that standard, Croissant adds extensive layers for data resources, default ML semantics, metadata, and data management to make it more relevant to ML.
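To make the layering concrete, here is a minimal sketch of what a Croissant-style description might look like, written as a Python dictionary and serialized to JSON-LD. It is a simplified illustration, not a complete or validated Croissant file: the `@context` is shortened, required details such as file checksums are left out, and the dataset name, URL, and column names are hypothetical. The property names (`distribution`, `recordSet`, `field`, `source`) follow the 1.0 specification as best recalled, so check them against the official spec before relying on them.

```python
import json

# Simplified Croissant-style description of a one-file CSV dataset.
# The dataset name, URL, and column names are hypothetical placeholders,
# and the @context is abbreviated compared with real Croissant files.
croissant_metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    # schema.org layer: general dataset metadata.
    "@type": "sc:Dataset",
    "name": "toy_reviews",
    "description": "Hypothetical dataset of short product reviews.",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    # Resource layer: the files that make up the dataset.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/data/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Structure layer: how records and fields map onto those files.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "reviews/label",
                    "dataType": "sc:Integer",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "label"},
                    },
                },
            ],
        }
    ],
}


def list_fields(metadata: dict) -> list:
    """Return (field id, data type) pairs for every field in every record set."""
    return [
        (field["@id"], field["dataType"])
        for record_set in metadata.get("recordSet", [])
        for field in record_set.get("field", [])
    ]


# Serialize to JSON-LD text, as it would be stored in a metadata file.
croissant_json = json.dumps(croissant_metadata, indent=2)
print(list_fields(croissant_metadata))
```

Note that the description only points at `reviews.csv` and explains how to read it; the data file itself stays in its original format.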
From the beginning, a primary objective of the Croissant initiative has been to promote Responsible AI (RAI). The team also announced the first release of the Croissant RAI vocabulary extension, which adds properties describing various RAI use cases. These include data life cycle management, labeling, participatory data, ML safety and fairness evaluation, explainability, compliance, and more.
Dataset repositories and search engines can use metadata to help users locate the correct dataset. The data resources and organization information make tools for data cleaning, refining, and analysis easier to design. Thanks to this metadata and default ML semantics, ML frameworks can use data for model training and testing with little coding. Taken as a whole, these enhancements significantly lessen the load of data development.
Dataset authors also care about the discoverability and usability of their datasets. Thanks to readily available generation tools and support from ML data platforms, adopting Croissant increases the value of their datasets with little effort.
Dataset authors can use the Croissant editor UI (available on GitHub) to inspect and modify the metadata.
By analyzing the data the user provides, the editor can automatically generate a large portion of the Croissant metadata. Authors can then fill in important fields, such as RAI properties, and publish their datasets.
Authors can make the Croissant metadata easily discoverable and reusable by publishing it on their dataset’s web page.
Authors who upload their data to a Croissant-compatible repository (e.g., OpenML, Kaggle, or Hugging Face) get Croissant metadata generated automatically.
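Publishing the metadata on a dataset web page usually means embedding the JSON-LD in the page itself, which is the common practice for schema.org markup in general. The sketch below shows one way to do that with the Python standard library. The function name `croissant_script_tag` is hypothetical, and the snippet is an illustration rather than part of any Croissant tooling.

```python
import json


def croissant_script_tag(metadata: dict) -> str:
    """Wrap JSON-LD metadata in a <script> tag suitable for an HTML page."""
    payload = json.dumps(metadata, indent=2)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape, so parsers still read the original value.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical, heavily shortened metadata used only for demonstration.
html_snippet = croissant_script_tag({"@type": "sc:Dataset", "name": "toy_reviews"})
print(html_snippet)
```

The resulting snippet can be placed in the page’s HTML, where crawlers that understand schema.org markup can pick it up.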
Several widely used tools and repositories support the format. Kaggle, Hugging Face, and OpenML, three popular ML dataset collections, are starting to support the Croissant format today. Users can search for Croissant datasets on the web with the Dataset Search tool. Three popular ML frameworks, TensorFlow, PyTorch, and JAX, can load Croissant datasets easily through the TensorFlow Datasets (TFDS) package.
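For readers who want to try consuming a Croissant dataset directly, the following is a minimal sketch using the open-source `mlcroissant` Python library. It assumes the library is installed (`pip install mlcroissant`) and that its `Dataset(jsonld=...)` constructor and `records(record_set=...)` method behave as described in the project’s README; treat both as assumptions to verify against the current documentation. The helper name `preview_records` is hypothetical.

```python
from itertools import islice


def preview_records(jsonld_url: str, record_set: str, limit: int = 3) -> list:
    """Load a Croissant description and return the first few records."""
    # Imported lazily so this module still loads when mlcroissant is absent.
    import mlcroissant as mlc  # third-party package, assumed to be installed

    # Parse the Croissant JSON-LD file found at the given URL or local path.
    dataset = mlc.Dataset(jsonld=jsonld_url)
    # Iterate lazily over the named record set and keep only `limit` records.
    return list(islice(dataset.records(record_set=record_set), limit))
```

A call would look like `preview_records("https://example.org/dataset/croissant.json", "default")`, where both the URL and the record set name are placeholders that depend on the dataset being loaded.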
The team strongly recommends that platforms hosting datasets make Croissant files available for download and include Croissant information on dataset web pages, which will help dataset search engines find them more easily. Tools that help users work with ML datasets, such as data analysis and labeling tools, should also consider adding support for Croissant datasets. By working together, the team believes, the community can ease the burden of data development and pave the way for a more robust ML research and development environment.
All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a computer science engineer with experience at FinTech companies across the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier.