In this post we’ll implement text-to-image search (allowing us to search for an image via text) and image-to-image search (allowing us to search for an image based on a reference image) using a lightweight pre-trained model. The model we’ll be using to calculate image and text similarity is inspired by Contrastive Language-Image Pre-Training (CLIP), which I discuss in another article.
Who is this useful for? Any developers who want to implement image search, data scientists interested in practical applications, or non-technical readers who want to learn about A.I. in practice.
How advanced is this post? This post will walk you through implementing image search as quickly and simply as possible.
Pre-requisites: Basic coding experience.
This article is a companion piece to my article on “Contrastive Language-Image Pre-Training”. Feel free to check it out if you want a more thorough understanding of the theory:
CLIP models are trained to predict if an arbitrary caption belongs with an arbitrary image. We’ll be using this general functionality to create our image search system. Specifically, we’ll be using the image and text encoders from CLIP to condense inputs into a vector, called an embedding, which can be thought of as a summary of the input.
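To make that concrete, here’s a minimal sketch of producing those embeddings. It assumes the openly available `openai/clip-vit-base-patch32` checkpoint loaded through the Hugging Face `transformers` library, which may differ from the exact lightweight model we’ll use later in the post; the image path and caption are placeholders.

```python
# A minimal sketch of turning a caption and an image into embeddings
# with a CLIP-style model. The checkpoint, image path, and caption
# below are assumptions for illustration only.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")            # hypothetical image file
caption = "a photo of a dog playing fetch"   # hypothetical caption

# The processor tokenizes the text and resizes/normalizes the image.
inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

image_embedding = outputs.image_embeds  # shape (1, 512): the image "summary"
text_embedding = outputs.text_embeds    # shape (1, 512): the text "summary"
```

Each embedding is just a fixed-length vector, so text and images end up in the same space and can be compared directly.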
The whole idea behind CLIP is that similar text and images will have similar vector embeddings.
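As a quick illustration of that idea, the sketch below scores a few candidate images against a text query using cosine similarity between their embeddings, which is the core of text-to-image search. It again assumes the `openai/clip-vit-base-patch32` checkpoint; the image paths and query are hypothetical placeholders.

```python
# A sketch of text-to-image search: embed a text query and a handful of
# candidate images, then rank the images by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidate_paths = ["cat.jpg", "dog.jpg", "car.jpg"]   # hypothetical images
candidates = [Image.open(p) for p in candidate_paths]
query = "a photo of a dog playing fetch"              # hypothetical query

inputs = processor(text=[query], images=candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Normalize the embeddings so a dot product equals cosine similarity.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)

scores = (txt @ img.T).squeeze(0)               # one similarity score per image
best = candidate_paths[scores.argmax().item()]  # image closest to the query
print(list(zip(candidate_paths, scores.tolist())), best)
```

Swapping the text query’s embedding for a reference image’s embedding gives image-to-image search with exactly the same comparison step.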