I first used them in Visual-Language Models for Object Detection and Segmentation, and encountered them again in my GenAI Models and Robotic Applications course at Twente.

VLMs are a type of generative model that takes image and text inputs and generates text outputs. These models can output bounding boxes or segmentation masks when prompted to detect or segment a particular subject, or they can localize different entities and answer questions about their relative or absolute positions.
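As a concrete illustration, here is a minimal sketch of prompting a generative VLM for detection via Hugging Face `transformers`, assuming a PaliGemma-style checkpoint; the model id, the image URL, and the "detect ..." prompt format are assumptions, not a definitive recipe.

```python
# Sketch: prompting a generative VLM to detect an object in an image.
# Assumes `transformers` with PaliGemma support; checkpoint name is an assumption.
from PIL import Image
import requests
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Placeholder image URL for illustration.
image = Image.open(requests.get("https://example.com/dog.jpg", stream=True).raw)

# The model conditions on image tokens plus the text prompt, then generates
# text tokens autoregressively -- here, location tokens encoding a bounding box.
inputs = processor(text="detect dog", images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(generated[0], skip_special_tokens=True))
```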

CLIP is a VLM, but it’s not generative (it’s contrastive).

There are a few setups:

  • In captioning / VQA tasks, the VLM conditions on image tokens and then predicts text tokens autoregressively (e.g., “a dog chasing a ball”).
  • In multimodal generation (e.g., image generation with text prompts), the VLM conditions on text tokens and predicts image tokens, which are then decoded back into pixels.
  • In bidirectional encoders (e.g., CLIP, ALIGN), the VLM doesn’t “predict” tokens autoregressively at all; instead it aligns text tokens with visual embeddings in a joint latent space (see the CLIP sketch after this list).
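To make the contrastive case concrete, here is a minimal sketch of scoring image-text similarity with CLIP in its joint embedding space, using the Hugging Face `transformers` CLIP classes; the image URL and candidate captions are placeholders.

```python
# Sketch: contrastive image-text alignment with CLIP (no autoregressive decoding;
# image and text are embedded into a joint latent space and compared).
import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image URL and captions for illustration.
image = Image.open(requests.get("https://example.com/dog.jpg", stream=True).raw)
texts = ["a dog chasing a ball", "a cat on a sofa"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image
# embedding and each text embedding; softmax turns them into a distribution.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```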

Visual Question Answering (VQA)

VQA is the task of taking an image and asking a question about it, with the model generating a text answer:
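For instance, a minimal sketch using the Hugging Face `transformers` visual-question-answering pipeline; the BLIP checkpoint name and the image path are assumptions.

```python
# Sketch: visual question answering with a pretrained BLIP checkpoint.
from transformers import pipeline

# Checkpoint name is an assumption; any VQA-capable model could be substituted.
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")

# Placeholder image path and question.
answers = vqa(image="dog_and_ball.jpg", question="What is the dog chasing?")
print(answers)  # e.g. [{"answer": "ball", "score": ...}]
```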