Multimodal Models: CLIP and Beyond

Overview

Multimodal models learn to understand and connect information from multiple modalities (text, images, audio, video). They enable tasks like image captioning, visual question answering, and cross-modal retrieval.

CLIP: Contrastive Language-Image Pre-training

Background and Motivation

Problem:

Traditional vision models require labeled datasets
Limited to specific tasks (classification, detection)
Can't understand natural language descriptions

Solution:

Learn visual concepts from natural language supervision
Use text-image pairs from the internet
Contrastive learning to align text and image representations

Key Insight: Instead of predicting fixed labels, predict which text goes with which image.

Architecture

CLIP consists of two encoders trained jointly with a contrastive objective:

1. Text Encoder:

Transformer-based (similar to GPT)
Takes text as input
Outputs text embeddings

2. Image Encoder:

Vision Transformer (ViT) or ResNet
Takes images as input
Outputs image embeddings

3. Contrastive Learning:

Project both encoder outputs into the same embedding space (via a learned linear projection)
Learn to match corresponding text-image pairs
Maximize similarity for matching pairs
Minimize similarity for non-matching pairs
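
A minimal sketch of the projection step, assuming each encoder emits a feature vector of its own width and a linear map brings both to a shared dimension. The function `project` and the matrices `W_img` and `W_txt` are hypothetical names with hand-picked values; in a trained model the weights would be learned.

```python
# Toy sketch: project encoder outputs of different widths into one shared space.
# W_img and W_txt are hypothetical, hand-picked weights (assumption, not CLIP's values).

def project(features, weights):
    """Multiply a feature vector (length k) by a k x d weight matrix."""
    d = len(weights[0])
    return [sum(f * row[j] for f, row in zip(features, weights)) for j in range(d)]

image_features = [1.0, 2.0, 3.0]   # pretend image-encoder output, width 3
text_features = [0.5, -1.0]        # pretend text-encoder output, width 2

W_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 x 2
W_txt = [[2.0, 0.0], [0.0, -1.0]]              # 2 x 2

image_embedding = project(image_features, W_img)   # length 2
text_embedding = project(text_features, W_txt)     # length 2
```

After projection both vectors have the same length, so they can be compared with a dot product.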

Training Procedure

Step 1: Data Collection

Collect 400M text-image pairs from the internet
Natural language descriptions of images
No manual labeling needed!

Step 2: Preprocessing

Images: Resize, normalize
Text: Tokenize, truncate/pad to fixed length
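
A small sketch of the two preprocessing ideas, under stated assumptions: token ids are already available as a list of integers, `pad_id=0` is an arbitrary padding id, and the `mean`/`std` of 0.5 are placeholder statistics rather than the values used by any released model. A real tokenizer also handles start and end tokens, which are omitted here.

```python
def pad_or_truncate(token_ids, context_length, pad_id=0):
    """Return exactly context_length ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:context_length]
    return clipped + [pad_id] * (context_length - len(clipped))

def normalize_pixel(value, mean=0.5, std=0.5):
    """Scale an 8-bit channel value to [0, 1], then standardize it."""
    return (value / 255.0 - mean) / std

short_ids = pad_or_truncate([5, 9, 2], context_length=5)             # padded on the right
long_ids = pad_or_truncate([5, 9, 2, 7, 7, 7, 1], context_length=5)  # cut to 5 ids
brightest = normalize_pixel(255)
darkest = normalize_pixel(0)
```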

Step 3: Forward Pass

For a batch of N text-image pairs:
  1. Encode images: I = ImageEncoder(images)  # Shape: (N, d)
  2. Encode texts: T = TextEncoder(texts)     # Shape: (N, d)
  3. L2-normalize each row: I = I / ||I||, T = T / ||T||  # norms taken per embedding
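
The normalization in step 3 matters because, once every embedding has unit length, a plain dot product equals cosine similarity. A minimal sketch with toy vectors (the helper names `l2_normalize` and `dot` are illustrative):

```python
import math

def l2_normalize(vector):
    """Divide a vector by its Euclidean length so it has unit norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

image_embedding = l2_normalize([3.0, 4.0])
text_embedding = l2_normalize([6.0, 8.0])   # same direction, twice the length
cosine = dot(image_embedding, text_embedding)
```

The two toy vectors point in the same direction, so their cosine similarity is 1 (up to floating-point rounding) even though their original lengths differ.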

Step 4: Contrastive Loss

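# Compute similarity matrix, scaled by a learned temperature
# (t is a learned log-scale parameter, as in the CLIP paper)
logits = (I @ T^T) * exp(t)  # Shape: (N, N)
# logits[i, j] = scaled similarity between image i and text j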

# Create labels (diagonal = matching pairs)
labels = range(N)  # Image i matches text i

# Symmetric loss (image-to-text and text-to-image)
loss_i2t = CrossEntropy(logits, labels)
loss_t2i = CrossEntropy(logits^T, labels)
loss = (loss_i2t + loss_t2i) / 2
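
The symmetric loss above can be sketched in plain Python to show what each term computes. This is an illustrative version, not a training-ready implementation: it assumes the embeddings are already L2-normalized, and the `scale` argument is an assumption added here to play the role of the temperature scaling used in the CLIP paper (`scale=1.0` leaves the logits unscaled).

```python
import math

def cross_entropy(logits, labels):
    """Mean cross-entropy over rows; labels[i] is the correct column of row i."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[label]
    return total / len(logits)

def clip_loss(image_emb, text_emb, scale=1.0):
    """Symmetric contrastive loss for N normalized image/text embedding pairs."""
    n = len(image_emb)
    # logits[i][j] = scaled similarity between image i and text j
    logits = [[scale * sum(a * b for a, b in zip(img, txt)) for txt in text_emb]
              for img in image_emb]
    logits_t = [list(col) for col in zip(*logits)]   # transpose: text-to-image view
    labels = list(range(n))                          # image i matches text i
    return (cross_entropy(logits, labels) + cross_entropy(logits_t, labels)) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(images, [[1.0, 0.0], [0.0, 1.0]], scale=10.0)   # pairs match
swapped = clip_loss(images, [[0.0, 1.0], [1.0, 0.0]], scale=10.0)   # pairs mismatched
```

With correctly paired toy embeddings the loss is close to zero, while swapping the texts makes it large, which is the behavior the objective is meant to reward and penalize.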

Why Contrastive Learning Works:

Positive pairs: Image and its description (high similarity)
Negative pairs: Image and other descriptions (low similarity)
Model learns to distinguish matching from non-matching pairs
Creates aligned embedding space

Step 5: Optimization

Large batch size (32,768) for many negatives
Adam optimizer with decoupled weight decay
Cosine learning rate schedule
Train for 32 epochs (as reported in the CLIP paper)
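
A minimal sketch of a learning rate schedule, assuming cosine decay preceded by a linear warmup. The CLIP paper reports cosine decay; the warmup length, step counts, and base learning rate below are illustrative values only, and `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Illustrative numbers only: 10 warmup steps, 110 steps in total.
schedule = [cosine_lr(s, base_lr=1e-3, warmup_steps=10, total_steps=110)
            for s in range(111)]
```

The rate rises during warmup, peaks at `base_lr`, passes through half of `base_lr` midway through the decay, and reaches zero at the final step.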

Key Design Choices

1. Contrastive Objective:

Instead of predicting exact text, predict which text matches
An easier learning problem than generating the caption word by word
Scales to large datasets

2. Large Batch Size:

More negative examples per batch
Better contrastive learning
Higher chance of hard negatives (captions that are similar but wrong)

3. Simple Architecture:

No task-specific heads
Just encoders + contrastive loss
General-purpose representations

4. Web-Scale Data:

Use naturally occurring text-image pairs
No manual annotation
Diverse concepts and styles
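
To make the batch-size point concrete: with one matching text per image, every other text in the batch acts as a negative. A small sketch of the arithmetic (the helper names are illustrative):

```python
def negatives_per_image(batch_size):
    """Each image has one positive text; all other texts in the batch are negatives."""
    return batch_size - 1

def negative_pairs_in_batch(batch_size):
    """Off-diagonal entries of the batch_size x batch_size similarity matrix."""
    return batch_size * batch_size - batch_size

per_image = negatives_per_image(32768)
total = negative_pairs_in_batch(32768)
```

For a batch of 32,768 pairs this gives 32,767 negatives per image and roughly 1.07 billion negative image-text combinations per batch.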

Zero-Shot Transfer

How CLIP Works for New Tasks:

1. Image Classification:

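# Create text prompts
prompts = ["a photo of a cat", "a photo of a dog", ...]

# Encode prompts and normalize to unit length
text_features = TextEncoder(prompts)
text_features = text_features / ||text_features||

# Encode image and normalize to unit length
image_features = ImageEncoder(image)
image_features = image_features / ||image_features||

# Find most similar prompt (cosine similarity)
similarity = image_features @ text_features^T
prediction = argmax(similarity)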
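
The classification recipe can be sketched end to end with toy numbers. The embeddings below are hand-picked stand-ins for encoder outputs (an assumption, since running the real encoders is outside the scope of these notes), and `zero_shot_classify` is a hypothetical helper name.

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def zero_shot_classify(image_embedding, prompt_embeddings, class_names):
    """Return the class whose prompt embedding has the highest cosine similarity."""
    image = l2_normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(image, l2_normalize(p)))
              for p in prompt_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best], scores

# Hypothetical embeddings standing in for TextEncoder / ImageEncoder outputs.
class_names = ["cat", "dog"]
prompt_embeddings = [[1.0, 0.1], [0.1, 1.0]]   # "a photo of a cat", "a photo of a dog"
image_embedding = [0.9, 0.2]

label, scores = zero_shot_classify(image_embedding, prompt_embeddings, class_names)
```

No classifier is trained here: changing the label set only requires writing new prompts and encoding them.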

2. Image-Text Retrieval:

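# Find images matching text query
query = "a red car"
query_features = TextEncoder(query)
image_features = ImageEncoder(images)

# Normalize so the dot product is cosine similarity
query_features = query_features / ||query_features||
image_features = image_features / ||image_features||

# Rank images from most to least similar
similarity = query_features @ image_features^T
top_images = argsort(similarity, descending=True)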
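
A runnable sketch of the retrieval ranking, again with hand-picked vectors in place of real encoder outputs (an assumption). The function `retrieve` and its `top_k` parameter are illustrative names, not part of any library.

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def retrieve(query_embedding, image_embeddings, top_k):
    """Return indices of the top_k images ranked by cosine similarity to the query."""
    query = l2_normalize(query_embedding)
    scores = [sum(a * b for a, b in zip(query, l2_normalize(e)))
              for e in image_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]

# Hypothetical embeddings: three indexed images and one text query.
image_embeddings = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
query_embedding = [2.0, 0.1]

top = retrieve(query_embedding, image_embeddings, top_k=2)
```

In practice the image embeddings would be computed once and stored, so each new query costs one text encoding plus a similarity search.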

3. Image Captioning (with additional training):