Topic 38: Multimodal Models & Embedding History

🔥 For interviews, read these first:

MULTIMODAL_EMBEDDING_DEEP_DIVE.md — frontier-lab deep dive: BoW/TF-IDF → Word2Vec/GloVe → BERT → Sentence-BERT → CLIP → multimodal LLMs (Flamingo, LLaVA), full CLIP loss derivation, InfoNCE as MI bound, SigLIP, vector search (HNSW/IVF-PQ), hybrid retrieval.

INTERVIEW_GRILL.md — 55 active-recall questions.

What You'll Learn

The lineage of representation learning:

Bag of words and TF-IDF
Distributed word embeddings (Word2Vec, GloVe)
Contextual embeddings (ELMo, BERT, GPT)
Sentence embeddings (Sentence-BERT, modern retrievers)
Image-text contrastive learning (CLIP, ALIGN, SigLIP)
Multimodal LLMs (Flamingo, LLaVA, GPT-4V, Gemini)
Vector search infrastructure (HNSW, IVF-PQ, hybrid)

Modern systems (RAG, search, recommendation, multimodal agents) all rest on embeddings. Frontier interviews probe whether you understand why CLIP works, what InfoNCE optimizes, and how multimodal LLMs are actually built. The historical lineage is the cleanest way to convey design judgment.

Next Steps

Topic 33: Information theory — MI bounds underlying contrastive learning.
Topic 39: RAG — production retrieval systems built on these embeddings.
Topic 15: Tokenization — multi-modal tokenization (image patches, audio frames).

ML & LLM Interview Prep — Deep Dives

Topic 38: Multimodal Models & Embedding History

What You'll Learn

Why This Matters

Next Steps