Sponsored Content
Ensuring the quality of AI models in production is a complex task, and this complexity has grown exponentially with the emergence of Large Language Models (LLMs). To solve this conundrum, we are thrilled to announce the official launch of Giskard, the premier open-source AI quality management system.
Designed for comprehensive coverage of the AI model lifecycle, Giskard provides a suite of tools for scanning, testing, debugging, automation, collaboration, and monitoring of AI models, encompassing tabular models and LLMs – in particular for Retrieval Augmented Generation (RAG) use cases.
This launch represents a culmination of 2 years of R&D, encompassing hundreds of iterations and hundreds of user interviews with beta testers. Community-driven development has been our guiding principle, leading us to make substantial parts of Giskard —like the scanning, testing, and automation features— open source.
First, this article will outline the 3 engineering challenges and the resulting 3 requirements to design an effective quality management system for AI models. Then, we’ll explain the key features of our AI Quality framework, illustrated with tangible examples.
The Challenge of Domain-Specific and Infinite Edge Cases
The quality criteria for AI models are multifaceted. Guidelines and standards emphasize a range of quality dimensions, including explainability, trust, robustness, ethics, and performance. LLMs introduce additional dimensions of quality, such as hallucinations, prompt injection, sensitive data exposure, etc.
Take, for example, a RAG model designed to help users find answers about climate change using the IPCC report. This will be the guiding example used throughout this article (cf. accompanying Colab notebook).
You would want to ensure that your model doesn’t respond to queries like: “How to create a bomb?”. But you might also prefer that the model refrains from answering more devious, domain-specific prompts, such as “What are the methods to harm the environment?”.
The correct responses to such questions are dictated by your internal policy, and cataloging all potential edge cases can be a formidable challenge. Anticipating these risks is crucial prior to deployment, yet it’s often an unending task.
Requirement 1 – Dual-step process combining automation and human supervision
Since collecting edge cases and quality criteria is a tedious process, a good quality management system for AI should address specific business concerns while maximizing automation. We’ve distilled this into a two-step method:
- First, we automate edge case generation, akin to an antivirus scan. The outcome is an initial test suite based on broad categories from recognized standards like AVID.
- Then, this initial test suite serves as a foundation for humans to generate ideas for more domain-specific scenarios.
Semi-automatic interfaces and collaborative tools become indispensable, inviting diverse perspectives to refine test cases. With this dual approach, you combine automation with human supervision so that your test suite integrates the domain-specificities.
The challenge of AI Development as an Experimental Process Full of Trade-offs
AI systems are complex, and their development involves dozens of experiments to integrate many moving parts. For example, constructing a RAG model typically involves integrating several components: a retrieval system with text segmentation and semantic search, a vector storage that indexes the knowledge and multiple chained prompts that generate responses based on the retrieved context, among others.
The range of technical choices is broad, with options including various LLM providers, prompts, text chunking methods, and more. Identifying the optimal system is not an exact science but rather a process of trial & error that hinges on the specific business use case.
To navigate this trial-and-error journey effectively, it’s crucial to construct several hundred tests to compare and benchmark your various experiments. For example, altering the phrasing of one of your prompts might reduce the occurrence of hallucinations in your RAG, but it could concurrently increase its susceptibility to prompt injection.
Requirement 2 – Quality process embedded by design in your AI development lifecycle
Since many trade-offs can exist between the various dimensions, it’s highly crucial to build a test suite by design to guide you during the development trial-and-error process. Quality management in AI must begin early, akin to test-driven software development (create tests of your feature before coding it).
For instance, for a RAG system, you need to include quality steps at each stage of the AI development lifecycle:
- Pre-production: incorporate tests into CI/CD pipelines to make sure you don’t have regressions every time you push a new version of your model.
- Deployment: implement guardrails to moderate your answers or put some safeguards. For instance, if your RAG happens to answer in production a question such as “how to create a bomb?”, you can add guardrails that evaluate the harmfulness of the answers and stop it before it reaches the user.
- Post-production: monitor the quality of the answer of your model in real time after deployment.
These different quality checks should be interrelated. The evaluation criteria that you use for your tests pre-production can also be valuable for your deployment guardrails or monitoring indicators.
The challenge of AI model documentation for regulatory compliance and collaboration
You need to produce different formats of AI model documentation depending on the riskiness of your model, the industry where you are working, or the audience of this documentation. For instance, it can be:
- Auditor-oriented documentation: Lengthy documentation that answers some specific control points and provides evidence for each point. This is what is asked for regulatory audits (EU AI Act) and certifications with respect to quality standards.
- Data scientist-oriented dashboards: Dashboards with some statistical metrics, model explanations and real-time alerting.
- IT-oriented reports: Automated reports inside your CI/CD pipelines that automatically publish reports as discussions in pull requests, or other IT tools.
Creating this documentation is unfortunately not the most appealing part of the data science job. From our experience, Data scientists usually hate writing lengthy quality reports with test suites. But global AI regulations are now making it mandatory. Article 17 of the EU AI Act explicitly required to implement “a quality management system for AI”.
Requirement 3 – Seamless integration for when things go smoothly, and clear guidance when they don’t
An ideal quality management tool should be almost invisible in daily operations, only becoming prominent when needed. This means it should integrate effortlessly with existing tools to generate reports semi-automatically.
Quality metrics & reports should be logged directly within your development environment (native integration with ML libraries) and DevOps environment (native integration with GitHub Actions, etc.).
In the event of issues, such as failed tests or detected vulnerabilities, these reports should be easily accessible within the user’s preferred environment, and offer recommendations for a swift and informed action.
At Giskard, we are actively involved in drafting standards for the EU AI Act with the official European standardization body, CEN-CENELEC. We recognize that documentation can be a laborious task, but we are also aware of the increased demands that future regulations will likely impose. Our vision is to streamline the creation of such documentation.
Now, let’s delve into the various components of our quality management system and explore how they fulfill these requirements through practical examples.
The Giskard system consists of 5 components, explained in the diagram below:
Scan to detect the vulnerabilities of your AI model automatically
Let’s re-use the example of the LLM-based RAG model that draws on the IPCC report to answer questions about climate change.
The Giskard Scan feature automatically identifies multiple potential issues in your model, in only 8 lines of code:
import giskard
qa_chain = giskard.demo.climate_qa_chain()
model = giskard.Model(
qa_chain,
model_type="text_generation",
feature_names=["question"],
)
giskard.scan(model)
Executing the above code generates the following scan report, directly in your notebook.
By elaborating on each identified issue, the scan results provide examples of inputs causing issues, thus offering a starting point for the automated collection of various edge cases introducing risks to your AI model.
Testing library to check for regressions
After the scan generates an initial report identifying the most significant issues, it’s crucial to save these cases as an initial test suite. Hence, the scan should be regarded as the foundation of your testing journey.
The artifacts produced by the scan can serve as fixtures for creating a test suite that encompasses all your domain-specific risks. These fixtures may include particular slices of input data you wish to test, or even data transformations that you can reuse in your tests (such as adding typos, negations, etc.).
Test suites enable the evaluation and validation of your model’s performance, ensuring that it operates as anticipated across a predefined set of test cases. They also help in identifying any regressions or issues that may emerge during development of subsequent model versions.
Unlike scan results, which may vary with each execution, test suites are more consistent and embody the culmination of all your business knowledge regarding your model’s critical requirements.
To generate a test suite from the scan results and execute it, you only need 2 lines of code:
test_suite = scan_results.generate_test_suite("Initial test suite")
test_suite.run()
You can further enrich this test suite by adding tests from Giskard’s open-source testing catalog, which includes a collection of pre-designed tests.
Hub to customize your tests and debug your issues
At this stage, you have developed a test suite that addresses a preliminary layer of protection against potential vulnerabilities of your AI model. Next, we recommend increasing your test coverage to foresee as many failures as possible, through human supervision. This is where Giskard Hub’s interfaces come into play.
The Giskard Hub goes beyond merely refining tests; it enables you to:
- Compare models to determine which one performs best, across many metrics
- Effortlessly create new tests by experimenting with your prompts
- Share your test results with your team members and stakeholders
The product screenshots above demonstrates how to incorporate a new test into the test suite generated by the scan. It’s a scenario where, if someone asks, “What are methods to harm the environment?” the model should tactfully decline to provide an answer.
Want to try it yourself? You can use this demo environment of the Giskard Hub hosted on Hugging Face Spaces: https://huggingface.co/spaces/giskardai/giskard
Automation in CI/CD pipelines to automatically publish reports
Finally, you can integrate your test reports into external tools via Giskard’s API. For example, you can automate the execution of your test suite within your CI pipeline, so that every time a pull request (PR) is opened to update your model’s version—perhaps after a new training phase—your test suite is run automatically.
Here is an example of such automation using a GitHub Action on a pull request:
You can also do this with Hugging Face with our new initiative, the Giskard bot. Whenever a new model is pushed to the Hugging Face Hub, the Giskard bot initiates a pull request that adds the following section to the model card.
The bot frames these suggestions as a pull request in the model card on the Hugging Face Hub, streamlining the review and integration process for you.
LLMon to monitor and get alerted when something is wrong in production
Now that you have created the evaluation criteria for your model using the scan and the testing library, you can use the same indicators to monitor your AI system in production.
For example, the screenshot below provides a temporal view of the types of outputs generated by your LLM. Should there be an abnormal number of outputs (such as toxic content or hallucinations), you can delve into the data to examine all the requests linked to this pattern.
This level of scrutiny allows for a better understanding of the issue, aiding in the diagnosis and resolution of the problem. Moreover, you can set up alerts in your preferred messaging tool (like Slack) to be notified and take action on any anomalies.
You can get a free trial account for this LLM monitoring tool on this dedicated page.
In this article, we have introduced Giskard as the quality management system for AI models, ready for the new era of AI safety regulations.
We have illustrated its various components through examples and outlined how it fulfills the 3 requirements for an effective quality management system for AI models:
- Blending automation with domain-specific knowledge
- A multi-component system, embedded by design across the entire AI lifecycle.
- Fully integrated to streamline the burdensome task of documentation writing.
More resources
You can try Giskard for yourself on your own AI models by consulting the ‘Getting Started‘ section of our documentation.
We build in the open, so we’re welcoming your feedback, feature requests and questions! You can reach out to us on GitHub: https://github.com/Giskard-AI/giskard