Multimodal Models: CLIP and Beyond

Overview

Multimodal models learn to understand and connect information from multiple modalities (text, images, audio, video). They enable tasks like image captioning, visual question answering, and cross-modal retrieval.

CLIP: Contrastive Language-Image Pre-training

Background and Motivation

Problem:

  • Traditional vision models require labeled datasets
  • Limited to specific tasks (classification, detection)
  • Can't understand natural language descriptions

Solution:

  • Learn visual concepts from natural language supervision
  • Use text-image pairs from the internet
  • Contrastive learning to align text and image representations

Key Insight: Instead of predicting fixed labels, predict which text goes with which image.

Architecture

CLIP consists of two encoders:

1. Text Encoder:

  • Transformer-based (similar to GPT)
  • Takes text as input
  • Outputs text embeddings

2. Image Encoder:

  • Vision Transformer (ViT) or ResNet
  • Takes images as input
  • Outputs image embeddings

3. Contrastive Learning:

  • Project both to same embedding space
  • Learn to match corresponding text-image pairs
  • Maximize similarity for matching pairs
  • Minimize similarity for non-matching pairs

Training Procedure

Step 1: Data Collection

  • Collect 400M text-image pairs from internet
  • Natural language descriptions of images
  • No manual labeling needed!

Step 2: Preprocessing

  • Images: Resize, normalize
  • Text: Tokenize, truncate/pad to fixed length

Step 3: Forward Pass

For batch of N text-image pairs:
  1. Encode images: I = ImageEncoder(images)  # Shape: (N, d)
  2. Encode texts: T = TextEncoder(texts)     # Shape: (N, d)
  3. Normalize embeddings: I = I / ||I||, T = T / ||T||

Step 4: Contrastive Loss

# Compute similarity matrix
logits = I @ T^T  # Shape: (N, N)
# logits[i, j] = similarity between image i and text j

# Create labels (diagonal = matching pairs)
labels = range(N)  # Image i matches text i

# Symmetric loss (image-to-text and text-to-image)
loss_i2t = CrossEntropy(logits, labels)
loss_t2i = CrossEntropy(logits^T, labels)
loss = (loss_i2t + loss_t2i) / 2

Why Contrastive Learning Works:

  • Positive pairs: Image and its description (high similarity)
  • Negative pairs: Image and other descriptions (low similarity)
  • Model learns to distinguish matching from non-matching pairs
  • Creates aligned embedding space

Step 5: Optimization

  • Large batch size (32,768) for many negatives
  • Adam optimizer
  • Learning rate schedule
  • Train for many epochs

Key Design Choices

1. Contrastive Objective:

  • Instead of predicting exact text, predict which text matches
  • Much easier learning problem
  • Scales to large datasets

2. Large Batch Size:

  • More negative examples per batch
  • Better contrastive learning
  • Harder negatives (more similar but wrong)

3. Simple Architecture:

  • No task-specific heads
  • Just encoders + contrastive loss
  • General-purpose representations

4. Web-Scale Data:

  • Use naturally occurring text-image pairs
  • No manual annotation
  • Diverse concepts and styles

Zero-Shot Transfer

How CLIP Works for New Tasks:

1. Image Classification:

# Create text prompts
prompts = ["a photo of a cat", "a photo of a dog", ...]

# Encode prompts
text_features = TextEncoder(prompts)

# Encode image
image_features = ImageEncoder(image)

# Find most similar prompt
similarity = image_features @ text_features^T
prediction = argmax(similarity)

2. Image-Text Retrieval:

# Find images matching text query
query = "a red car"
query_features = TextEncoder(query)
image_features = ImageEncoder(images)
similarity = query_features @ image_features^T
top_images = argsort(similarity, descending=True)

3. Image Captioning (with additional training):

  • Fine-tune on captioning dataset
  • Use CLIP features as initialization

Evaluation

1. Zero-Shot Image Classification:

  • Datasets: ImageNet, CIFAR-10, etc.
  • Method: Create text prompts for each class
  • Metric: Accuracy
  • Result: CLIP matches supervised models on many datasets!

2. Image-Text Retrieval:

  • Flickr30K, MS-COCO: Find matching images/texts
  • Metrics: Recall@1, Recall@5, Recall@10
  • Result: Strong performance without task-specific training

3. Visual Question Answering:

  • VQA datasets: Answer questions about images
  • Method: Combine CLIP with language model
  • Result: Good performance with minimal fine-tuning

4. Robustness:

  • Out-of-distribution: Test on different domains
  • Adversarial: Test robustness to perturbations
  • Result: More robust than supervised models

Limitations

1. Fine-grained Tasks:

  • Counting objects (e.g., "how many cats?")
  • Spatial reasoning
  • Not as good as task-specific models

2. Text Understanding:

  • Limited to short descriptions
  • Can't handle complex reasoning

3. Bias:

  • Inherits biases from web data
  • Can produce biased outputs

4. Computational Cost:

  • Large models (hundreds of millions of parameters)
  • Requires significant compute

Applications

1. Image Search:

  • Search images by natural language
  • "Find images of sunset over ocean"

2. Content Moderation:

  • Detect inappropriate content
  • Match images to text descriptions

3. Accessibility:

  • Generate alt text for images
  • Describe images to visually impaired

4. Creative Tools:

  • Text-to-image generation (DALL-E uses CLIP)
  • Image editing with text

Other Multimodal Models

ALIGN (A Large-scale ImaGe and Noisy-text embedding)

Similar to CLIP:

  • Contrastive learning
  • Web-scale data (1.8B pairs)
  • Larger scale → better performance

DALL-E / DALL-E 2

Text-to-Image Generation:

  • Uses CLIP for guidance
  • Generates images from text
  • Autoregressive or diffusion-based

Flamingo

Few-Shot Learning:

  • Interleaves images and text
  • Few-shot visual question answering
  • Can learn from examples

GPT-4V (Vision)

Multimodal LLM:

  • Processes images and text together
  • Can reason about visual content
  • General-purpose multimodal understanding

Training Multimodal Models

General Pipeline

1. Data Collection:

  • Collect paired data (text-image, text-audio, etc.)
  • Web scraping, existing datasets
  • Ensure quality and diversity

2. Preprocessing:

  • Images: Resize, normalize, augment
  • Text: Tokenize, clean
  • Alignment: Ensure pairs are correctly matched

3. Architecture:

  • Separate encoders for each modality
  • Projection layers to common space
  • Contrastive or generative objective

4. Training:

  • Large batch size (for contrastive)
  • Long training (many epochs)
  • Careful learning rate scheduling

5. Evaluation:

  • Zero-shot transfer
  • Downstream tasks
  • Robustness testing

Challenges

1. Alignment:

  • Ensuring modalities are properly aligned
  • Handling noisy pairs

2. Scale:

  • Need large datasets
  • Computational resources

3. Evaluation:

  • How to measure multimodal understanding?
  • Need diverse benchmarks

4. Bias:

  • Models inherit biases from data
  • Need careful evaluation

Summary

Key Points:

  1. CLIP: Contrastive learning for text-image alignment
  2. Zero-shot: Works on new tasks without fine-tuning
  3. Web-scale data: Uses naturally occurring pairs
  4. Simple architecture: Just encoders + contrastive loss
  5. Strong performance: Matches supervised models on many tasks

Future Directions:

  • Better alignment methods
  • More modalities (video, audio)
  • Fewer-shot learning
  • Better evaluation metrics