Multimodal Model Evaluation

Overview

Evaluating multimodal models is challenging because they need to be tested on multiple modalities and their interactions. This document covers evaluation methods for multimodal models, especially CLIP and similar models.

Evaluation Categories

1. Zero-Shot Image Classification

What it tests:

Can model classify images without task-specific training?
Does it understand natural language descriptions?

Method:

Create text prompts for each class: "a photo of a {class}"
Encode prompts using text encoder
Encode test images using image encoder
Find most similar prompt for each image
Compute accuracy

Datasets:

ImageNet: 1,000 classes, standard benchmark
CIFAR-10/100: 10/100 classes
Oxford-IIIT Pet: 37 pet breeds
Food-101: 101 food categories

Metrics:

Accuracy: Percentage of correct predictions
Top-5 Accuracy: Correct class in top 5 predictions

Example:

# Create prompts
prompts = ["a photo of a cat", "a photo of a dog", ...]

# Encode
text_features = clip.encode_text(prompts)
image_features = clip.encode_image(images)

# Classify
similarities = image_features @ text_features.T
predictions = np.argmax(similarities, axis=1)
accuracy = (predictions == labels).mean()

Results:

CLIP matches or exceeds supervised models on many datasets
Especially good on natural images
Struggles on fine-grained tasks (e.g., counting)

2. Image-Text Retrieval

What it tests:

Can model find relevant images given text query?
Can model find relevant text given image query?

Tasks:

a) Image Retrieval (Text → Image):

Given text query, find matching images
Example: "a red car" → find images of red cars

b) Text Retrieval (Image → Text):

Given image, find matching captions
Example: Image of cat → find "a photo of a cat"

Datasets:

Flickr30K: 31,000 images, 5 captions each
MS-COCO: 123,000 images, 5 captions each
Conceptual Captions: 3.3M images with captions

Metrics:

Recall@K: Percentage of queries where correct result in top K
- Recall@1: Correct result is #1
- Recall@5: Correct result in top 5
- Recall@10: Correct result in top 10
Median Rank: Median rank of correct result (lower is better)

Example:

# Image retrieval
query = "a red car"
query_features = clip.encode_text([query])
image_features = clip.encode_image(images)

similarities = query_features @ image_features.T
top_k_images = np.argsort(similarities[0])[-k:][::-1]

# Compute Recall@K
recall_at_k = compute_recall_at_k(correct_images, top_k_images, k)

Results:

CLIP achieves strong retrieval performance
Better than previous methods without task-specific training
Scales well with model size

3. Visual Question Answering (VQA)

What it tests:

Can model answer questions about images?
Does it understand both visual and textual information?

Method:

Combine CLIP with language model
Encode image and question
Generate answer

Datasets:

VQA v2: 1.1M questions, 11M answers
GQA: Scene graph-based questions
TextVQA: Questions about text in images

Metrics:

Accuracy: Percentage of correct answers
Per-question-type accuracy: Accuracy by question type

Challenges:

Requires reasoning about image content
Need to combine visual and textual understanding
Some questions require counting, spatial reasoning

Results:

CLIP alone not sufficient (needs additional components)
CLIP + language model achieves good performance
Better on visual questions than text-heavy questions

4. Image Captioning

What it tests:

Can model describe images in natural language?
Quality of generated captions

Method:

Fine-tune CLIP + language model on captioning data
Generate captions autoregressively

Datasets:

MS-COCO Captions: 123,000 images with captions
Flickr30K: 31,000 images with captions

Metrics:

BLEU: N-gram precision
METEOR: Considers synonyms, paraphrases
CIDEr: Consensus-based image description evaluation
SPICE: Semantic propositional image caption evaluation
Human evaluation: Best but expensive

Results:

CLIP features help with captioning
Better than training from scratch
Still not as good as task-specific models

5. Robustness Evaluation

What it tests:

How robust is model to distribution shifts?
Performance on out-of-distribution data

Tests:

a) Distribution Shift:

Train on one domain, test on another
Example: Train on photos, test on sketches

b) Adversarial Examples:

Small perturbations to images
Test if model is fooled

c) Natural Variations:

Different lighting, angles, backgrounds
Test generalization

Datasets:

ImageNet-A: Adversarial examples
ImageNet-R: Renditions (art, cartoons, etc.)
ImageNet-Sketch: Sketch versions

Results:

CLIP more robust than supervised models
Better generalization to new domains
Still vulnerable to adversarial attacks

6. Bias Evaluation

What it tests:

Does model have biases?
Fairness across different groups

Tests:

a) Gender Bias:

Test if model associates certain occupations with gender
Example: "doctor" → mostly male images?

b) Racial Bias:

Test if model has racial biases
Example: Crime-related queries → certain groups?

c) Cultural Bias:

Test if model understands diverse cultures
Example: Wedding images → only Western weddings?

Methods:

Analyze predictions by demographic groups
Test on diverse datasets
Measure fairness metrics

Results:

CLIP inherits biases from training data
Can produce biased outputs
Need careful evaluation and mitigation

7. Few-Shot Learning

What it tests:

Can model learn from few examples?
Transfer learning capability

Method:

Provide few examples (1-16) of new task
Test if model can generalize
Compare to training from scratch

Tasks:

Few-shot image classification
Few-shot object detection
Few-shot visual question answering

Results:

CLIP good at few-shot learning
Better than training from scratch
Can learn from just a few examples

Evaluation Best Practices

1. Multiple Metrics

Don't rely on single metric:

Accuracy can be misleading
Use multiple metrics (accuracy, recall, precision)
Consider per-category performance

2. Diverse Datasets

Test on diverse data:

Different domains (photos, art, sketches)
Different languages
Different cultures

3. Human Evaluation

When possible, use human evaluation:

Automatic metrics have limitations
Human judgment is gold standard
Especially for generation tasks

4. Error Analysis

Understand failure cases:

What types of errors does model make?
Are errors systematic?
How to improve?

5. Fairness Testing

Test for biases:

Evaluate across demographic groups
Measure fairness metrics
Mitigate biases if found

Summary

Key Evaluation Methods:

Zero-shot classification: Test generalization
Image-text retrieval: Test alignment
VQA: Test reasoning
Captioning: Test generation
Robustness: Test generalization
Bias: Test fairness
Few-shot: Test transfer learning

Best Practices:

Use multiple metrics
Test on diverse datasets
Consider human evaluation
Analyze errors
Test for biases

Challenges:

No single metric captures everything
Need diverse evaluation
Human evaluation expensive
Bias evaluation complex

ML & LLM Interview Prep — Deep Dives