How to add Llama Guard to your RAG pipelines to moderate LLM inputs and outputs and combat prompt injection
LLM security is an area that we all know deserves ample attention. Organizations eager to adopt Generative AI, big and small, face a major challenge in securing their LLM apps. How to combat prompt injection, handle insecure outputs, and prevent sensitive information disclosure are pressing questions every AI architect and engineer needs to answer. Enterprise, production-grade LLM apps cannot survive in the wild without solid solutions to address LLM security.
Llama Guard, open-sourced by Meta on December 7th, 2023, offers a viable solution for addressing LLM input-output vulnerabilities and combating prompt injection. Llama Guard falls under the umbrella project Purple Llama, “featuring open trust and safety tools and evaluations meant to level the playing field for developers to deploy generative AI models responsibly.”[1]
We explored the OWASP Top 10 for LLM Applications a month ago. With Llama Guard, we now have a reasonable solution to start addressing some of those vulnerabilities, namely:
- LLM01: Prompt injection
- LLM02: Insecure output handling
- LLM06: Sensitive information disclosure
In this article, we will explore how to add Llama Guard to a RAG pipeline to:
- Moderate the user inputs
- Moderate the LLM outputs
- Experiment with customizing the out-of-the-box unsafe categories to tailor to your use case
- Combat prompt injection attempts
Llama Guard “is a 7B parameter Llama 2-based input-output safeguard model. It can be used to classify content in both LLM inputs (prompt classification) and LLM responses (response classification). It acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe/unsafe, and if unsafe based on a policy, it also lists the violating subcategories.”[2]
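To make this concrete, here is a minimal sketch of both prompt classification and response classification using the Hugging Face transformers library. It assumes you have been granted access to the gated meta-llama/LlamaGuard-7b checkpoint and have a CUDA GPU available; the `moderate` helper name is illustrative, not part of any official API.

```python
# Minimal sketch: classify a user prompt and a model response with Llama Guard.
# Assumes access to the gated meta-llama/LlamaGuard-7b checkpoint and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map=device
)

def moderate(chat: list[dict]) -> str:
    """Return Llama Guard's verdict: 'safe', or 'unsafe' plus the violated categories."""
    # The tokenizer's chat template wraps the conversation in Llama Guard's
    # safety-policy prompt before the model classifies it.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

# Prompt classification: screen the user input before it reaches the RAG pipeline.
print(moderate([
    {"role": "user", "content": "How do I kill a Linux process?"},
]))
# -> "safe"

# Response classification: screen the LLM output before returning it to the user.
print(moderate([
    {"role": "user", "content": "How do I kill a Linux process?"},
    {"role": "assistant", "content": "Use `kill <PID>` or `pkill <name>` from the shell."},
]))
# -> "safe"
```

When content is flagged, the model's generated text starts with `unsafe` followed by the violating category code(s) on the next line, so a pipeline can gate on the first generated token and log the categories for review. We will build on this pattern in the rest of the article.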