From BERT to Foundation Models: The Evolution

Overview

This document traces the evolution from BERT (2018) to modern foundation models (GPT-4, Claude, etc.), explaining the key innovations, architectural changes, and paradigm shifts that led to today's large language models.

Timeline: Key Milestones

2018: BERT (Bidirectional Encoder)
2019: GPT-2 (Generative Pre-trained Transformer)
2020: GPT-3 (Scaling Laws, In-Context Learning)
2021: Codex, InstructGPT (RLHF)
2022: ChatGPT, PaLM, Chinchilla
2023: GPT-4, Claude, LLaMA
2024: GPT-4 Turbo, Claude 3, Gemini

Phase 1: BERT (2018) - Bidirectional Understanding

What BERT Did

Key Innovation:

Bidirectional context: Unlike previous models, BERT reads text in both directions
Masked Language Modeling (MLM): Predict masked tokens using full context
Pre-training + Fine-tuning: Train on large corpus, fine-tune on specific tasks

Architecture:

BERT-Base: 110M parameters, 12 layers
BERT-Large: 340M parameters, 24 layers

Training:

Pre-training:
- Masked Language Modeling (15% tokens masked)
- Next Sentence Prediction (NSP)
- Data: BooksCorpus + English Wikipedia (3.3B tokens)
Fine-tuning: Add task-specific head, train on labeled data

Impact:

State-of-the-art on 11 NLP tasks
Showed power of pre-training
Established transformer encoder as standard

Limitations:

Encoder-only: Can't generate text
Fine-tuning required: Need labeled data for each task
Task-specific: Different model for each task

Phase 2: GPT-2 (2019) - Generative Capabilities

What GPT-2 Did

Key Innovation:

Generative: Can generate coherent text
Zero-shot: No fine-tuning needed for some tasks
Unidirectional: Autoregressive generation (left-to-right)

Architecture:

GPT-2 Small: 117M parameters
GPT-2 Medium: 345M parameters
GPT-2 Large: 762M parameters
GPT-2 XL: 1.5B parameters

Training:

Pre-training: Next token prediction (language modeling)
Data: WebText (40GB, 8M documents)
No fine-tuning: Directly use for generation

Key Insight:

Language modeling is transfer learning: Pre-training on language modeling transfers to many tasks
Zero-shot learning: Model can do tasks without explicit training

Impact:

Showed generative models can be powerful
Demonstrated zero-shot capabilities
Raised concerns about misuse (initially not released)

Limitations:

Unidirectional: Only left-to-right context
No bidirectional understanding: Can't see future tokens
Limited context: 1024 tokens
Still needs fine-tuning: For best performance on specific tasks

Phase 3: GPT-3 (2020) - Scaling and In-Context Learning

What GPT-3 Did

Key Innovation:

Massive scale: 175B parameters (100x larger than GPT-2)
In-context learning: Few-shot learning without gradient updates
Scaling laws: Showed performance improves predictably with scale

Architecture:

GPT-3: 175B parameters
- 96 transformer layers
- 12,288 dimensions
- Context: 2048 tokens

Training:

Data: Common Crawl, WebText2, Books, Wikipedia (300B tokens)
Compute: Massive (estimated $4.6M in compute)
Few-shot: Provide examples in prompt, model learns from context

Key Insights:

1. Scaling Laws:

Performance ∝ (Model Size)^α × (Data Size)^β × (Compute)^γ

Performance improves predictably with scale
Larger models = better performance
More data = better performance

2. In-Context Learning:

Zero-shot: "Translate to French: hello →"
Few-shot: "Translate to French: hello → bonjour, cat → chat, dog →"
One-shot: Single example

3. Emergent Abilities:

Arithmetic: Can do math (not explicitly trained)
Code generation: Can write code
Reasoning: Some logical reasoning
Emerges at scale: Not present in smaller models

Impact:

Proved scaling works
In-context learning paradigm
Foundation for modern LLMs
API-based access (no open-source)

Limitations:

Hallucination: Makes up facts
No fine-tuning: Can't update model
Limited context: 2048 tokens
Expensive: Very costly to train/run

Phase 4: InstructGPT (2021) - Alignment and RLHF

What InstructGPT Did

Key Innovation:

Reinforcement Learning from Human Feedback (RLHF): Align model with human preferences
Instruction following: Model follows instructions better
Helpful, harmless, honest: Three principles

Training Process:

Step 1: Supervised Fine-tuning (SFT)

1. Collect human-written prompts and responses
2. Fine-tune GPT-3 on this data
3. Model learns to follow instructions

Step 2: Reward Modeling

1. Collect comparisons: Which response is better?
2. Train reward model to predict human preferences
3. Reward model scores responses

Step 3: Reinforcement Learning (PPO)

1. Generate responses from SFT model
2. Score with reward model
3. Update model to maximize reward
4. Use PPO (Proximal Policy Optimization)

Key Insight:

Alignment matters: Model behavior ≠ model capability
Human feedback: Better than just next-token prediction
Safety: Can make models safer and more helpful

Impact:

Foundation for ChatGPT
RLHF becomes standard
Alignment research grows
Better user experience

Phase 5: ChatGPT (2022) - Conversational AI

What ChatGPT Did

Key Innovation:

Conversational interface: Natural dialogue
RLHF: Aligned with human preferences
System prompts: Can control behavior
Multi-turn: Maintains context across turns

Architecture:

Based on GPT-3.5 (InstructGPT)
Fine-tuned with RLHF
Optimized for dialogue

Key Features:

Conversational: Natural back-and-forth
Helpful: Tries to be useful
Admits mistakes: Can say "I don't know"
Refuses harmful requests: Safety built-in

Impact:

Viral adoption: 100M users in 2 months
Paradigm shift: From tools to assistants
Industry transformation: Every company wants LLM
Research acceleration: Massive investment

Phase 6: GPT-4 (2023) - Multimodal and Reasoning

What GPT-4 Did

Key Innovation:

Multimodal: Text + images
Better reasoning: Improved logical reasoning
Larger context: 8K tokens (later 32K, 128K)
Better performance: State-of-the-art on many benchmarks

Architecture:

Exact details not disclosed
Estimated: 1.7T parameters (mixture of experts)
Multimodal: Vision + language

Key Improvements:

Reasoning: Better at complex reasoning
Code: Better code generation
Safety: Improved safety measures
Steerability: Better instruction following

Training:

Pre-training: Large-scale data
RLHF: Human feedback
Red teaming: Safety testing

Impact:

State-of-the-art: Best performance on many tasks
Multimodal: Can process images
Production use: Used in many applications
Research: Drives research directions

Phase 7: Modern Foundation Models (2023-2024)

Key Models

OpenAI:

GPT-4, GPT-4 Turbo
Multimodal, large context

Anthropic:

Claude 2, Claude 3
Constitutional AI, better safety

Google:

PaLM, PaLM 2, Gemini
Multimodal, large scale

Meta:

LLaMA, LLaMA 2
Open-source, efficient

Others:

Mistral, Mixtral
Open-source alternatives

Key Trends

1. Scaling Continues:

Models getting larger
More parameters
More data
More compute

2. Efficiency:

Mixture of Experts (MoE): Sparse models
Quantization: Lower precision
Distillation: Smaller models
Better architectures: More efficient

3. Multimodality:

Text + images
Text + audio
Text + video
Unified models

4. Alignment:

RLHF: Standard practice
Constitutional AI: Alternative to RLHF
Safety: Ongoing focus
Red teaming: Testing for vulnerabilities

5. Open Source:

LLaMA, Mistral
Community models
Fine-tuning frameworks (LoRA, QLoRA)

6. Specialization:

Code models: Codex, StarCoder
Scientific: Galactica, Minerva
Domain-specific: Medical, legal, etc.

Key Architectural Evolution

From BERT to GPT

BERT (Encoder):

Input → Encoder → [CLS] token → Task head
- Bidirectional
- Good for understanding
- Can't generate

GPT (Decoder):

Input → Decoder → Next token
- Unidirectional
- Good for generation
- Can do understanding (with prompting)

T5 (Encoder-Decoder):

Input → Encoder → Decoder → Output
- Both understanding and generation
- Good for tasks like summarization

Modern Architecture Choices

Decoder-only (GPT-style):

Pros: Simple, good for generation, in-context learning
Cons: Unidirectional, can't see future
Use: GPT-3, GPT-4, LLaMA, Claude

Encoder-Decoder (T5-style):

Pros: Bidirectional understanding, good for tasks
Cons: More complex, less efficient
Use: T5, BART, some specialized models

Encoder-only (BERT-style):

Pros: Bidirectional, efficient
Cons: Can't generate
Use: BERT, RoBERTa, specialized understanding tasks

Training Evolution

Pre-training

BERT Era:

Masked language modeling
Next sentence prediction
~3B tokens

GPT-2 Era:

Next token prediction
~40GB text
Simple objective

GPT-3 Era:

Next token prediction
~300B tokens
Massive scale

Modern Era:

Next token prediction
Trillions of tokens
Filtered, high-quality data
Multimodal data

Fine-tuning Evolution

BERT Era:

Task-specific fine-tuning
Different model per task
Supervised learning

GPT-2 Era:

Zero-shot (no fine-tuning)
Prompt engineering
In-context learning

GPT-3 Era:

Few-shot in-context learning
Prompt engineering
No gradient updates

Modern Era:

RLHF: Human feedback
Instruction tuning: Follow instructions
Multi-task: Single model for many tasks
Fine-tuning: Still used for specialization

Key Paradigm Shifts

1. From Task-Specific to General

Before (BERT):

Train model for specific task
Different model per task
Need labeled data

After (GPT-3+):

Single general model
Works for many tasks
In-context learning

2. From Fine-tuning to Prompting

Before:

Fine-tune model on task
Update weights
Task-specific model

After:

Provide examples in prompt
No weight updates
Same model for all tasks

3. From Understanding to Generation

Before:

Models for understanding (classification, NER)
Encoder architectures

After:

Models for generation
Decoder architectures
Can do both with prompting

4. From Supervised to Self-Supervised

Before:

Need labeled data
Supervised learning
Expensive annotation

After:

Self-supervised pre-training
Unlabeled data
Fine-tune with less data

5. From Capability to Alignment

Before:

Focus on capability
Better performance on benchmarks

After:

Focus on alignment
Helpful, harmless, honest
RLHF, safety measures

Scaling Laws and Insights

Neural Scaling Laws

Key Findings:

Performance = f(Model Size, Data Size, Compute)

1. Performance improves predictably with scale
2. Larger models need more data
3. Optimal compute allocation
4. Predictable improvements

Implications:

Bigger is better: Larger models perform better
Data matters: Need more data for larger models
Compute: Massive compute needed
Predictable: Can predict performance

Emergent Abilities

What are Emergent Abilities?

Abilities that appear only at large scale
Not present in smaller models
Examples: Arithmetic, code, reasoning

Examples:

Arithmetic: Can do math (not explicitly trained)
Code generation: Can write code
Few-shot learning: Learns from examples
Reasoning: Some logical reasoning

Why Important:

Shows scale matters
Unexpected capabilities
Hard to predict what will emerge

Modern Foundation Model Characteristics

1. Scale

Parameters:

GPT-3: 175B
GPT-4: ~1.7T (estimated, MoE)
PaLM: 540B
LLaMA 2: 70B (open-source)

Data:

Trillions of tokens
Filtered, high-quality
Multimodal

Compute:

Massive training costs
Millions of dollars
Specialized hardware

2. Capabilities

Text:

Generation, understanding
Many tasks
Few-shot learning

Multimodal:

Images, audio, video
Unified models

Reasoning:

Logical reasoning
Math, code
Problem-solving

3. Alignment

RLHF:

Human feedback
Aligned with preferences
Helpful, harmless

Safety:

Refuses harmful requests
Admits limitations
Red teaming

4. Access

API:

OpenAI, Anthropic
Pay-per-use
No model access

Open Source:

LLaMA, Mistral
Community models
Fine-tuning frameworks

Challenges and Future Directions

Current Challenges

1. Hallucination:

Makes up facts
Confident but wrong
Hard to detect

2. Context Length:

Limited context
Can't handle very long documents
Working on longer contexts

3. Cost:

Expensive to train
Expensive to run
Need efficiency

4. Safety:

Can be misused
Bias issues
Alignment challenges

5. Evaluation:

Hard to evaluate
Benchmarks may not reflect real use
Need better metrics

Future Directions

1. Longer Context:

1M+ tokens
Better attention mechanisms
Efficient processing

2. Better Reasoning:

Chain-of-thought
Tool use
Multi-step reasoning

3. Multimodality:

More modalities
Better integration
Unified models

4. Efficiency:

Smaller models
Better architectures
Quantization, distillation

5. Alignment:

Better alignment methods
Safety guarantees
Interpretability

6. Specialization:

Domain-specific models
Fine-tuning frameworks
Task-specific optimization

Summary: The Journey

2018 - BERT:

Bidirectional understanding
Pre-training + fine-tuning
Task-specific models

2019 - GPT-2:

Generative capabilities
Zero-shot learning
Unidirectional

2020 - GPT-3:

Massive scale (175B)
In-context learning
Scaling laws

2021 - InstructGPT:

RLHF
Alignment
Instruction following

2022 - ChatGPT:

Conversational AI
RLHF
Viral adoption

2023 - GPT-4:

Multimodal
Better reasoning
Large context

2024 - Modern Era:

Foundation models
Multimodal
Open source alternatives
Specialization

Key Insights:

Scale matters: Larger models = better performance
In-context learning: Few-shot without fine-tuning
Alignment: RLHF makes models more useful
Emergent abilities: Unexpected capabilities at scale
Multimodality: Text + other modalities
Efficiency: Need for efficient models

The Path Forward:

Longer contexts
Better reasoning
More efficient
Better alignment
Specialization
Open source

Key Takeaways

From BERT to GPT: Encoder → Decoder, Understanding → Generation
Scaling Works: Larger models perform better predictably
In-Context Learning: Few-shot without fine-tuning
Alignment Matters: RLHF makes models more useful
Emergent Abilities: Unexpected capabilities at scale
Multimodality: Text + images + more
Foundation Models: Single model for many tasks
Open Source: Community-driven alternatives

The evolution from BERT to modern foundation models represents one of the most significant advances in AI, transforming how we build and use language models.

ML & LLM Interview Prep — Deep Dives