History of NLP Embeddings: From TF-IDF to Modern Embeddings

Overview

This document traces the evolution of NLP embeddings from simple statistical methods to modern neural approaches, explaining how each method was trained and why it was developed.

Timeline

TF-IDF (1970s): Statistical weighting
N-grams (1980s-1990s): Sequence modeling
Word2Vec (2013): Neural word embeddings
GloVe (2014): Global word vectors
Contextual Embeddings (2018+): BERT, ELMo, etc.
Modern LLMs (2020+): GPT, T5, etc.

1. TF-IDF (Term Frequency-Inverse Document Frequency)

Background (1970s)

Problem:

Need to represent documents as vectors
Simple word counts don't capture importance
Common words (the, is, a) appear everywhere

Solution:

Weight words by frequency in document (TF)
Penalize words that appear in many documents (IDF)
TF-IDF = TF × IDF

How It Was "Trained"

Not really "training" - it's a statistical computation:

Step 1: Compute Term Frequency (TF)

For each document d and term t:
  TF(t, d) = count(t in d) / total_terms_in_d

Step 2: Compute Inverse Document Frequency (IDF)

For each term t:
  IDF(t) = log(N / documents_containing_t)
  
Where N = total number of documents

Step 3: Compute TF-IDF

TF-IDF(t, d) = TF(t, d) × IDF(t)

Step 4: Create Document Vectors

For each document:
  vector = [TF-IDF(term1, doc), TF-IDF(term2, doc), ...]

Characteristics:

No learning: Just statistical computation
Sparse vectors: Most terms have 0 TF-IDF
High-dimensional: One dimension per unique term
Interpretable: Can see which terms are important

Limitations:

No semantic understanding
Can't capture word relationships
High-dimensional, sparse

2. N-gram Models (1980s-1990s)

Background

Problem:

Need to model language sequences
Predict next word given context
Capture local dependencies

Solution:

Model sequences of n words (n-grams)
Estimate probabilities from counts
Use for language modeling

How They Were Trained

Step 1: Collect Text Corpus

Large collection of text
Books, news articles, web text

Step 2: Count N-grams

For each n-gram (w₁, w₂, ..., wₙ):
  count(w₁, w₂, ..., wₙ) += 1

Step 3: Estimate Probabilities

P(wₙ|w₁, ..., wₙ₋₁) = count(w₁, ..., wₙ) / count(w₁, ..., wₙ₋₁)

Step 4: Apply Smoothing (Laplace, Kneser-Ney)

P(wₙ|w₁, ..., wₙ₋₁) = (count + k) / (total + k×V)

Characteristics:

Statistical: Based on counts
Local context: Only n-1 previous words
Sparse: Many n-grams never seen
No embeddings: Just probabilities

Limitations:

Can't capture long-range dependencies
Data sparsity (many unseen n-grams)
No semantic understanding
Curse of dimensionality

3. Word2Vec (2013)

Background

Problem:

Need dense, low-dimensional word representations
Want to capture semantic relationships
"King - Man + Woman ≈ Queen"

Solution:

Neural network to learn word embeddings
Two architectures: Skip-gram and CBOW
Train on large text corpus

Architecture: Skip-gram

Idea:

Predict context words given center word
"The cat sat on the mat"
Given "sat", predict ["The", "cat", "on", "the"]

Architecture:

Input: One-hot vector for center word
Hidden: Embedding layer (V × d, where V=vocab size, d=embedding dim)
Output: Softmax over vocabulary (predict context words)

Training:

Step 1: Create Training Pairs

For each sentence:
  For each word w at position i:
    For each context word c in window [i-w, i+w]:
      Add training pair (w, c)

Step 2: Forward Pass

1. Embed center word: h = W_embedding × one_hot(w)
2. Predict context: o = W_output × h
3. Apply softmax: p = softmax(o)

Step 3: Loss Function

Loss = -log P(context_words | center_word)
     = -Σ log P(c | w) for each context word c

Step 4: Backpropagation

Update embedding matrix W_embedding
Update output matrix W_output

Step 5: Negative Sampling (Optimization) Instead of softmax over all V words (expensive):

Sample k negative examples (random words)
Binary classification: positive (context) vs negative
Much faster training!

Loss with Negative Sampling:

Loss = -log σ(v_c · v_w) - Σ log σ(-v_neg · v_w)
     where σ = sigmoid

Training Details:

Data: Billions of words from web
Window size: 5-10 words
Embedding dimension: 100-300
Negative samples: 5-20
Training time: Hours to days on CPU

Characteristics:

Dense vectors: Low-dimensional (100-300 dim)
Semantic relationships: Similar words close in space
Fast training: Negative sampling makes it efficient
Fixed embeddings: Same word always has same embedding

Limitations:

No context: "bank" (river) and "bank" (financial) have same embedding
Limited to words: Can't handle subwords or phrases well
No sentence-level: Just word-level

Architecture: CBOW (Continuous Bag of Words)

Idea:

Predict center word given context
Opposite of Skip-gram

Architecture:

Input: Average of context word embeddings
Hidden: Embedding layer
Output: Softmax over vocabulary (predict center word)

Training:

Similar to Skip-gram but reversed
Usually Skip-gram performs better

4. GloVe (Global Vectors for Word Representation, 2014)

Background

Problem:

Word2Vec uses local context (windows)
Want to use global co-occurrence statistics
Combine benefits of global and local methods

Solution:

Use global word co-occurrence matrix
Learn embeddings that preserve co-occurrence ratios
Matrix factorization approach

How It Was Trained

Step 1: Build Co-occurrence Matrix

For each word pair (i, j) in corpus:
  X_ij = count of times word j appears in context of word i
  
Context: Words within window of word i

Step 2: Define Objective

Want: w_i · w_j ≈ log(X_ij)

Where:
- w_i, w_j: Word embeddings
- X_ij: Co-occurrence count

Step 3: Weighted Least Squares

Loss = Σ f(X_ij) (w_i · w_j + b_i + b_j - log X_ij)²

Where:
- f(X_ij): Weighting function (more weight to frequent pairs)
- b_i, b_j: Bias terms

Weighting Function:

f(x) = (x/x_max)^α if x < x_max else 1

Where α = 3/4 (empirically chosen)

Step 4: Optimization

Stochastic gradient descent
Iterate over co-occurrence matrix
Update embeddings and biases

Training Details:

Data: 6B tokens from Wikipedia + Gigaword
Window size: 10 words
Embedding dimension: 50-300
Training time: Hours on CPU

Characteristics:

Global statistics: Uses entire corpus statistics
Interpretable: Preserves co-occurrence ratios
Fast training: More efficient than Word2Vec
Good performance: Often better than Word2Vec

Key Insight:

w_i · w_j = log P(j|i) - log P(j)

This captures: "ice" is to "steam" as "solid" is to "gas"
Because: P(solid|ice) / P(solid|steam) ≈ P(gas|ice) / P(gas|steam)

Limitations:

Still fixed embeddings (no context)
Limited to word level
Can't handle OOV (out-of-vocabulary) words

5. Contextual Embeddings (2018+)

Background

Problem:

Word2Vec/GloVe: Same embedding for "bank" (river) and "bank" (financial)
Need context-dependent representations
Want sentence-level understanding

Solution:

Use deep language models (LSTM, Transformer)
Generate different embeddings based on context
Pre-train on large corpus, fine-tune for tasks

ELMo (Embeddings from Language Models, 2018)

Architecture:

Bidirectional LSTM language model
Forward: Predict next word given previous
Backward: Predict previous word given next

Training:

1. Train forward LM: P(w_t | w_1, ..., w_{t-1})
2. Train backward LM: P(w_t | w_{t+1}, ..., w_n)
3. Use hidden states as embeddings

Usage:

Concatenate forward and backward representations
Weighted combination of all layers
Task-specific fine-tuning

Characteristics:

Contextual: Different embeddings for same word in different contexts
Bidirectional: Uses both left and right context
Task-specific: Fine-tune for downstream tasks

BERT (Bidirectional Encoder Representations from Transformers, 2018)

Architecture:

Transformer encoder (self-attention)
Bidirectional (unlike GPT which is unidirectional)

Pre-training Tasks:

1. Masked Language Modeling (MLM):

Input: "The [MASK] sat on the mat"
Predict: "cat" (the masked word)

Randomly mask 15% of tokens
Predict masked tokens
Learn bidirectional context

2. Next Sentence Prediction (NSP):

Input: [CLS] sentence1 [SEP] sentence2 [SEP]
Predict: Is sentence2 the next sentence after sentence1?

Learn sentence relationships
Useful for QA, NLI tasks

Training:

1. Collect large corpus (BooksCorpus + Wikipedia, 3.3B words)
2. Create training examples:
   - Mask tokens (MLM)
   - Pair sentences (NSP)
3. Train with both objectives
4. Many epochs, large batch size

Characteristics:

Contextual: Different embeddings for same word
Bidirectional: Full context understanding
Transfer learning: Pre-train once, fine-tune for many tasks
State-of-the-art: Best performance on many tasks

Limitations:

Computationally expensive: Large models, long training
Fixed context length: Can't handle very long sequences
No generation: Only encoder, not decoder

GPT (Generative Pre-trained Transformer, 2018-2020)

Architecture:

Transformer decoder (causal attention)
Unidirectional (left-to-right)

Training:

1. Language modeling objective:
   P(w_t | w_1, ..., w_{t-1})
2. Autoregressive: Predict next token
3. Train on large corpus

Characteristics:

Generative: Can generate text
Unidirectional: Only left context
Scaling: Larger models → better performance

Evolution:

GPT-1 (2018): 117M parameters
GPT-2 (2019): 1.5B parameters
GPT-3 (2020): 175B parameters
GPT-4 (2023): Much larger, multimodal

6. Modern LLMs (2020+)

T5 (Text-to-Text Transfer Transformer, 2020)

Idea:

Frame all tasks as text-to-text
"Translate English to German: hello → hallo"
"Summarize: long text → short text"

Training:

Pre-train on large corpus
Fine-tune on multiple tasks
Unified framework

Modern Trends

1. Scaling:

Larger models (billions of parameters)
More data (trillions of tokens)
Better performance

2. Multimodal:

CLIP: Text-image
GPT-4V: Vision-language
DALL-E: Text-to-image

3. Instruction Tuning:

Fine-tune on instructions
Better following user intent
RLHF for alignment

4. Efficient Training:

LoRA, QLoRA
Mixed precision
Gradient checkpointing

Comparison Table

Method	Year	Training	Context	Embedding Type
TF-IDF	1970s	Statistical	None	Sparse, high-dim
N-grams	1980s	Count-based	Local (n words)	Probabilities
Word2Vec	2013	Neural (local)	Window (5-10)	Dense, fixed
GloVe	2014	Matrix factorization	Global	Dense, fixed
ELMo	2018	Bidirectional LM	Sentence	Dense, contextual
BERT	2018	Masked LM	Bidirectional	Dense, contextual
GPT	2018+	Autoregressive LM	Left-to-right	Dense, contextual
Modern LLMs	2020+	Large-scale LM	Long context	Dense, contextual

Key Insights

1. Evolution:

Statistical → Neural
Local → Global → Contextual
Fixed → Context-dependent

2. Training Data:

TF-IDF: Document collection
Word2Vec: Billions of words
BERT: 3.3B words
GPT-3: Trillions of tokens

3. Computational Requirements:

TF-IDF: Minutes
Word2Vec: Hours
BERT: Days (GPU)
GPT-3: Months (many GPUs)

4. Capabilities:

TF-IDF: Keyword matching
Word2Vec: Semantic similarity
BERT: Context understanding
GPT: Text generation

Summary

History shows:

From sparse to dense: TF-IDF → Word2Vec
From local to global: Word2Vec → GloVe
From fixed to contextual: GloVe → BERT
From understanding to generation: BERT → GPT
From single to multimodal: Text → Text+Image+Audio

Current State:

Large language models dominate
Multimodal is the future
Scaling continues to improve performance

ML & LLM Interview Prep — Deep Dives