The ability to deal with unseen objects in a zero-shot manner makes machine learning models very attractive for applications in robotics, allowing robots to enter previously unseen environments and manipulate unknown objects therein.
While their accuracy in doing so is incredible compared with what was conceivable just a few years ago, uncertainty is here to stay, and when model outputs are used for decision making it requires a different treatment than is customary in machine learning.
This article describes recent results on dealing with what we call “trial-and-error” tasks and explains how optimal decisions can be derived by modeling the system as a continuous-time Markov chain, also known as a Markov jump process.
The image above shows the average performance for zero-shot image labeling from CLIP, a groundbreaking model from OpenAI that forms the basis for large multi-modal models such as LLaVA and GPT-4V. Let’s assume it is able to label an image containing a chicken with 70% accuracy. While this is incredible performance, the label will be wrong in 30% of the cases.
Labeling itself, however, is not the use case we are interested in when using this output for decision making. For example, if we want to operate an automated chicken repeller, we need a clear answer as to whether there is a chicken or not. Unfortunately, things are not as simple as a “yes” or “no” answer. Instead, we have to consider four cases:
- True Positive: There is a chicken and the vision model sees it
- False Positive: There is no chicken, but the vision model sees one, for example by mistaking a dog, a cat, or a screwdriver for a chicken.
- True Negative: There is no chicken, and the model thinks so too.
- False Negative: There is a chicken, but the vision model does not see it.
These cases are summarized in the image above. As you can see, the figure reported as the model’s “accuracy” only covers the “True Positive” case. The probabilities of the other three outcomes remain unknown.
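To make the four cases concrete, here is a minimal Python sketch. It assumes the 70% figure is the true-positive rate. The false-positive rate and the prior probability that a chicken is present are not given in this article, so the values below are invented purely for illustration, and the function names are hypothetical.

```python
def outcome_probabilities(prior, tpr, fpr):
    """Joint probability of each of the four outcomes.

    prior: probability that a chicken is actually present (assumed value)
    tpr:   true-positive rate, P(model says chicken | chicken)
    fpr:   false-positive rate, P(model says chicken | no chicken) (assumed value)
    """
    return {
        "true_positive": prior * tpr,
        "false_negative": prior * (1.0 - tpr),
        "false_positive": (1.0 - prior) * fpr,
        "true_negative": (1.0 - prior) * (1.0 - fpr),
    }


def probability_chicken_given_detection(prior, tpr, fpr):
    """P(chicken | model says chicken), via Bayes' rule."""
    p = outcome_probabilities(prior, tpr, fpr)
    return p["true_positive"] / (p["true_positive"] + p["false_positive"])


# Illustrative numbers only: 0.7 comes from the text above,
# the prior (0.1) and false-positive rate (0.05) are made up.
probs = outcome_probabilities(prior=0.1, tpr=0.7, fpr=0.05)
posterior = probability_chicken_given_detection(prior=0.1, tpr=0.7, fpr=0.05)

for name, value in probs.items():
    print(f"{name}: {value:.3f}")
print(f"P(chicken | detection): {posterior:.3f}")
```

With these made-up numbers, a detection would correspond to an actual chicken only about 61% of the time, even though the model recognizes 70% of the chickens it is shown. Changing the assumed prior or false-positive rate changes that figure substantially, which is why the true-positive rate alone is not enough for decision making.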