I first used them in Visual-Language Models for Object Detection and Segmentation, and encountered them again in my GenAI Models and Robotic Applications course at Twente.

VLMs are a type of generative model that takes image and text inputs and generates text outputs. These models can output bounding boxes or segmentation masks when prompted to detect or segment a particular subject, or they can localize different entities and answer questions about their relative or absolute positions.
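As a concrete illustration, here is a minimal sketch of prompting a generative VLM for detection via Hugging Face `transformers`, assuming a PaliGemma-style checkpoint; the model id, the image URL, and the "detect ..." prompt format are assumptions, not a definitive recipe.

```python
# Sketch: prompting a generative VLM to detect an object in an image.
# Assumes `transformers` with PaliGemma support; checkpoint name is an assumption.
from PIL import Image
import requests
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Placeholder image URL for illustration.
image = Image.open(requests.get("https://example.com/dog.jpg", stream=True).raw)

# The model conditions on image tokens plus the text prompt, then generates
# text tokens autoregressively -- here, location tokens encoding a bounding box.
inputs = processor(text="detect dog", images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(generated[0], skip_special_tokens=True))
```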

CLIP is a VLM, but it’s not generative (it’s contrastive).

There are a few setups:

  • In captioning / VQA tasks, the VLM conditions on image tokens and then predicts text tokens autoregressively (e.g., “a dog chasing a ball”).
  • In multimodal generation (e.g., image generation with text prompts), the VLM conditions on text tokens and predicts image tokens, which are then decoded back into pixels.
  • In bidirectional encoders (e.g., CLIP, ALIGN), the VLM doesn’t “predict” tokens autoregressively at all; instead it aligns text tokens with visual embeddings in a joint latent space (see the CLIP sketch after this list).
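To make the contrastive case concrete, here is a minimal sketch of scoring image-text similarity with CLIP in its joint embedding space, using the Hugging Face `transformers` CLIP classes; the image URL and candidate captions are placeholders.

```python
# Sketch: contrastive image-text alignment with CLIP (no autoregressive decoding;
# image and text are embedded into a joint latent space and compared).
import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image URL and captions for illustration.
image = Image.open(requests.get("https://example.com/dog.jpg", stream=True).raw)
texts = ["a dog chasing a ball", "a cat on a sofa"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image
# embedding and each text embedding; softmax turns them into a distribution.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```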

Visual Question Answering (VQA)

VQA is the task of taking an image and asking a question about it, with the model generating a text answer:
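For instance, a minimal sketch using the Hugging Face `transformers` visual-question-answering pipeline; the BLIP checkpoint name and the image path are assumptions.

```python
# Sketch: visual question answering with a pretrained BLIP checkpoint.
from transformers import pipeline

# Checkpoint name is an assumption; any VQA-capable model could be substituted.
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")

# Placeholder image path and question.
answers = vqa(image="dog_and_ball.jpg", question="What is the dog chasing?")
print(answers)  # e.g. [{"answer": "ball", "score": ...}]
```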