Multimodal Models: CLIP and Beyond
Overview
Multimodal models learn to understand and connect information from multiple modalities (text, images, audio, video). They enable tasks like image captioning, visual question answering, and cross-modal retrieval.
CLIP: Contrastive Language-Image Pre-training
Background and Motivation
Problem:
- Traditional vision models require labeled datasets
- Limited to specific tasks (classification, detection)
- Can't understand natural language descriptions
Solution:
- Learn visual concepts from natural language supervision
- Use text-image pairs from the internet
- Contrastive learning to align text and image representations
Key Insight: Instead of predicting fixed labels, predict which text goes with which image.
Architecture
CLIP consists of two encoders:
1. Text Encoder:
- Transformer-based (similar to GPT)
- Takes text as input
- Outputs text embeddings
2. Image Encoder:
- Vision Transformer (ViT) or ResNet
- Takes images as input
- Outputs image embeddings
3. Contrastive Learning:
- Project both to same embedding space
- Learn to match corresponding text-image pairs
- Maximize similarity for matching pairs
- Minimize similarity for non-matching pairs
Training Procedure
Step 1: Data Collection
- Collect 400M text-image pairs from internet
- Natural language descriptions of images
- No manual labeling needed!
Step 2: Preprocessing
- Images: Resize, normalize
- Text: Tokenize, truncate/pad to fixed length
Step 3: Forward Pass
For batch of N text-image pairs:
1. Encode images: I = ImageEncoder(images) # Shape: (N, d)
2. Encode texts: T = TextEncoder(texts) # Shape: (N, d)
3. Normalize embeddings: I = I / ||I||, T = T / ||T||
Step 4: Contrastive Loss
# Compute similarity matrix
logits = I @ T^T # Shape: (N, N)
# logits[i, j] = similarity between image i and text j
# Create labels (diagonal = matching pairs)
labels = range(N) # Image i matches text i
# Symmetric loss (image-to-text and text-to-image)
loss_i2t = CrossEntropy(logits, labels)
loss_t2i = CrossEntropy(logits^T, labels)
loss = (loss_i2t + loss_t2i) / 2
Why Contrastive Learning Works:
- Positive pairs: Image and its description (high similarity)
- Negative pairs: Image and other descriptions (low similarity)
- Model learns to distinguish matching from non-matching pairs
- Creates aligned embedding space
Step 5: Optimization
- Large batch size (32,768) for many negatives
- Adam optimizer
- Learning rate schedule
- Train for many epochs
Key Design Choices
1. Contrastive Objective:
- Instead of predicting exact text, predict which text matches
- Much easier learning problem
- Scales to large datasets
2. Large Batch Size:
- More negative examples per batch
- Better contrastive learning
- Harder negatives (more similar but wrong)
3. Simple Architecture:
- No task-specific heads
- Just encoders + contrastive loss
- General-purpose representations
4. Web-Scale Data:
- Use naturally occurring text-image pairs
- No manual annotation
- Diverse concepts and styles
Zero-Shot Transfer
How CLIP Works for New Tasks:
1. Image Classification:
# Create text prompts
prompts = ["a photo of a cat", "a photo of a dog", ...]
# Encode prompts
text_features = TextEncoder(prompts)
# Encode image
image_features = ImageEncoder(image)
# Find most similar prompt
similarity = image_features @ text_features^T
prediction = argmax(similarity)
2. Image-Text Retrieval:
# Find images matching text query
query = "a red car"
query_features = TextEncoder(query)
image_features = ImageEncoder(images)
similarity = query_features @ image_features^T
top_images = argsort(similarity, descending=True)
3. Image Captioning (with additional training):
- Fine-tune on captioning dataset
- Use CLIP features as initialization
Evaluation
1. Zero-Shot Image Classification:
- Datasets: ImageNet, CIFAR-10, etc.
- Method: Create text prompts for each class
- Metric: Accuracy
- Result: CLIP matches supervised models on many datasets!
2. Image-Text Retrieval:
- Flickr30K, MS-COCO: Find matching images/texts
- Metrics: Recall@1, Recall@5, Recall@10
- Result: Strong performance without task-specific training
3. Visual Question Answering:
- VQA datasets: Answer questions about images
- Method: Combine CLIP with language model
- Result: Good performance with minimal fine-tuning
4. Robustness:
- Out-of-distribution: Test on different domains
- Adversarial: Test robustness to perturbations
- Result: More robust than supervised models
Limitations
1. Fine-grained Tasks:
- Counting objects (e.g., "how many cats?")
- Spatial reasoning
- Not as good as task-specific models
2. Text Understanding:
- Limited to short descriptions
- Can't handle complex reasoning
3. Bias:
- Inherits biases from web data
- Can produce biased outputs
4. Computational Cost:
- Large models (hundreds of millions of parameters)
- Requires significant compute
Applications
1. Image Search:
- Search images by natural language
- "Find images of sunset over ocean"
2. Content Moderation:
- Detect inappropriate content
- Match images to text descriptions
3. Accessibility:
- Generate alt text for images
- Describe images to visually impaired
4. Creative Tools:
- Text-to-image generation (DALL-E uses CLIP)
- Image editing with text
Other Multimodal Models
ALIGN (A Large-scale ImaGe and Noisy-text embedding)
Similar to CLIP:
- Contrastive learning
- Web-scale data (1.8B pairs)
- Larger scale → better performance
DALL-E / DALL-E 2
Text-to-Image Generation:
- Uses CLIP for guidance
- Generates images from text
- Autoregressive or diffusion-based
Flamingo
Few-Shot Learning:
- Interleaves images and text
- Few-shot visual question answering
- Can learn from examples
GPT-4V (Vision)
Multimodal LLM:
- Processes images and text together
- Can reason about visual content
- General-purpose multimodal understanding
Training Multimodal Models
General Pipeline
1. Data Collection:
- Collect paired data (text-image, text-audio, etc.)
- Web scraping, existing datasets
- Ensure quality and diversity
2. Preprocessing:
- Images: Resize, normalize, augment
- Text: Tokenize, clean
- Alignment: Ensure pairs are correctly matched
3. Architecture:
- Separate encoders for each modality
- Projection layers to common space
- Contrastive or generative objective
4. Training:
- Large batch size (for contrastive)
- Long training (many epochs)
- Careful learning rate scheduling
5. Evaluation:
- Zero-shot transfer
- Downstream tasks
- Robustness testing
Challenges
1. Alignment:
- Ensuring modalities are properly aligned
- Handling noisy pairs
2. Scale:
- Need large datasets
- Computational resources
3. Evaluation:
- How to measure multimodal understanding?
- Need diverse benchmarks
4. Bias:
- Models inherit biases from data
- Need careful evaluation
Summary
Key Points:
- CLIP: Contrastive learning for text-image alignment
- Zero-shot: Works on new tasks without fine-tuning
- Web-scale data: Uses naturally occurring pairs
- Simple architecture: Just encoders + contrastive loss
- Strong performance: Matches supervised models on many tasks
Future Directions:
- Better alignment methods
- More modalities (video, audio)
- Fewer-shot learning
- Better evaluation metrics