I first used it in my Visual-Language Models for Object Detection and Segmentation project for a startup (THEKER), and now I've encountered it again in the Generative AI in Robotic Applications course at Twente.

Resources: the original paper, this blog on Medium

CLIP was trained on 400 million (image, text) pairs collected from the internet.

CLIP is a pretrained model that tells you how well a given image and a given piece of text fit together.

  • In training CLIP, the similarity scores of the correct image-text pairs lie on the diagonal of the similarity score matrix computed for the current batch.

During training, CLIP maximizes the cosine similarity between the vectors of correct image-caption pairs and minimizes the similarity scores of all incorrect pairs in the batch (see the sketch below).
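A minimal sketch of that similarity matrix in plain PyTorch (not the official CLIP code; the batch size, embedding dimension, and random embeddings are just stand-ins for the encoder outputs):

```python
import torch
import torch.nn.functional as F

batch_size, embed_dim = 8, 512                        # illustrative sizes
image_embeds = torch.randn(batch_size, embed_dim)     # stand-in for image-encoder output
text_embeds = torch.randn(batch_size, embed_dim)      # stand-in for text-encoder output

# L2-normalize so that a dot product equals cosine similarity
image_embeds = F.normalize(image_embeds, dim=-1)
text_embeds = F.normalize(text_embeds, dim=-1)

# (batch_size x batch_size) matrix of cosine similarities between every image and every caption
similarity = image_embeds @ text_embeds.T

# the diagonal holds the scores of the correct (image, caption) pairs
correct_pair_scores = similarity.diagonal()
```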

At inference time, it computes the similarity scores between the vector of a single image and the vectors of a set of candidate captions, and picks the caption with the highest score.
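A sketch of that zero-shot "pick the best caption" inference using the Hugging Face `transformers` wrappers around CLIP (the checkpoint name, image path, and captions below are my own illustrative choices):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")   # any local image
captions = ["a photo of a cat", "a photo of a dog", "a photo of a robot arm"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the (temperature-scaled) image-text similarity scores
probs = outputs.logits_per_image.softmax(dim=-1)
print(captions[probs.argmax().item()])   # caption with the highest similarity
```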

The training objective described above is a contrastive loss: a loss function that updates the model's weights so that correct image-caption pairs get high similarity scores and incorrect pairs get low ones. It is very similar in spirit to the Skip-gram loss function from Word2Vec.
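A minimal sketch of this contrastive loss, following the symmetric cross-entropy formulation in the CLIP paper's pseudocode and assuming normalized embeddings like in the snippet above (the fixed temperature value is an assumption; in real CLIP it is a learned parameter):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # cosine similarity matrix between every image and every caption in the batch
    logits = (image_embeds @ text_embeds.T) / temperature

    # the correct caption for image i is caption i, so the targets are the diagonal indices
    targets = torch.arange(logits.shape[0], device=logits.device)

    # symmetric cross-entropy: classify each image over captions and each caption over images
    loss_images = F.cross_entropy(logits, targets)     # rows: image -> correct caption
    loss_texts = F.cross_entropy(logits.T, targets)    # columns: caption -> correct image
    return (loss_images + loss_texts) / 2
```

Minimizing this loss pushes the diagonal (correct-pair) scores up and all off-diagonal (incorrect-pair) scores down at the same time.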