This is the second part of a series on enhancing Zero-Shot CLIP performance. In the first part, I provided a detailed explanation of how the CLIP model operates and described a straightforward way to improve its performance: extending standard prompts like “A picture of {class}” with customized prompts generated by a large language model (LLM). If you haven’t already, you can find part 1 here. In this article, we present a closely related method for improving zero-shot CLIP performance that is, in addition, highly explainable.
The CLIP model is an impressive zero-shot predictor, capable of making predictions on tasks it has not explicitly been trained for. Despite these inherent capabilities, several strategies exist to notably improve its performance. In the first article we saw one such strategy; however, while better performance is valuable, there are cases where we may be willing to trade some of it for better explainability. In this second article of the series, we explore a method that not only enhances the performance of the zero-shot CLIP model but also ensures that its predictions are easily understandable and interpretable.
Various explainability techniques are available for deep learning models today. In a previous article, I delved into Integrated Gradients, a method that quantifies how each feature of an input influences the output of a machine learning model, especially a deep neural network. Another popular approach to model interpretation relies on SHAP values, which attribute the model’s output to individual features using concepts from cooperative game theory. While these methods are versatile and can be applied to any deep learning model, they can be somewhat challenging to implement and interpret. CLIP, which has been trained to map image and text features into the same embedding space, enables an alternative, text-based explainability method. This approach is more user-friendly, easy to interpret, and offers a different perspective on model explanation.
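To make the idea of text-based explanation concrete, here is a minimal sketch that scores a single image against a few human-readable descriptors in CLIP’s shared embedding space; the per-descriptor similarities can then be read directly as evidence for a prediction. The sketch assumes the Hugging Face transformers CLIP implementation, and the checkpoint name, image path, and descriptor strings are purely illustrative, not the exact setup used later in this series.

```python
# A minimal sketch of text-based explainability with CLIP, using the
# Hugging Face `transformers` implementation. The image path and the
# descriptor strings below are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Human-readable descriptors acting as textual "evidence" for a class.
descriptors = [
    "a photo of an animal with feathers",
    "a photo of an animal with a beak",
    "a photo of an animal with wings",
]

image = Image.open("example.jpg")  # hypothetical input image

inputs = processor(
    text=descriptors, images=image, return_tensors="pt", padding=True
)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

# Cosine similarity in the shared embedding space: one score per descriptor.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(0)

# Each score indicates how strongly the image supports the corresponding
# textual description, which is what makes the explanation human-readable.
for descriptor, score in zip(descriptors, scores.tolist()):
    print(f"{score:.3f}  {descriptor}")
```

Because the image and the descriptors live in the same embedding space, no auxiliary attribution machinery is needed: the explanation is simply which descriptions the image matches best.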