From BERT to Foundation Models: The Evolution
Overview
This document traces the evolution from BERT (2018) to modern foundation models (GPT-4, Claude, etc.), explaining the key innovations, architectural changes, and paradigm shifts that led to today's large language models.
Timeline: Key Milestones
2018: BERT (Bidirectional Encoder)
2019: GPT-2 (Generative Pre-trained Transformer)
2020: GPT-3 (Scaling Laws, In-Context Learning)
2021: Codex
2022: InstructGPT (RLHF), ChatGPT, PaLM, Chinchilla
2023: GPT-4, Claude, LLaMA, GPT-4 Turbo, Gemini
2024: Claude 3, Gemini 1.5
Phase 1: BERT (2018) - Bidirectional Understanding
What BERT Did
Key Innovation:
- Bidirectional context: Every token attends to both its left and right context in every layer, unlike left-to-right language models
- Masked Language Modeling (MLM): Predict masked tokens using full context
- Pre-training + Fine-tuning: Train on large corpus, fine-tune on specific tasks
Architecture:
BERT-Base: 110M parameters, 12 layers
BERT-Large: 340M parameters, 24 layers
Training:
- Pre-training:
  - Masked Language Modeling (15% of tokens selected for prediction)
  - Next Sentence Prediction (NSP)
  - Data: BooksCorpus + English Wikipedia (about 3.3B words)
- Fine-tuning: Add a task-specific head, then train on labeled data
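The masking step can be sketched in a few lines. This is a minimal illustration, not BERT's actual preprocessing code: it assumes whitespace tokens instead of WordPiece, and it uses the 80/10/10 replacement rule described in the BERT paper (80% `[MASK]`, 10% random token, 10% unchanged). The function name `mask_for_mlm` is hypothetical.

```python
import random

MASK = "[MASK]"

def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # model must recover the original
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the token unchanged
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_for_mlm(tokens, vocab=sorted(set(tokens)), rng=random.Random(0))
```

The loss is computed only at positions where `labels` is not `None`, which is why the model can use the full left and right context of every other token.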
Impact:
- State-of-the-art on 11 NLP tasks
- Showed power of pre-training
- Established transformer encoder as standard
Limitations:
- Encoder-only: Not designed for free-form text generation
- Fine-tuning required: Needs labeled data for each task
- Task-specific: A separately fine-tuned copy of the model for each task
Phase 2: GPT-2 (2019) - Generative Capabilities
What GPT-2 Did
Key Innovation:
- Generative: Can generate coherent text
- Zero-shot: No fine-tuning needed for some tasks
- Unidirectional: Autoregressive generation (left-to-right)
Architecture:
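GPT-2 Small: 117M parameters
GPT-2 Medium: 345M parameters
GPT-2 Large: 762M parameters
GPT-2 XL: 1.5B parameters
(Counts as reported in the paper; the released checkpoints are usually cited as 124M, 355M, 774M, and 1.5B.)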
Training:
- Pre-training: Next token prediction (language modeling)
- Data: WebText (40GB, 8M documents)
- No fine-tuning: Directly use for generation
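Next-token prediction can be illustrated without any model at all. The sketch below shows how one sequence yields a training example at every position, and how perplexity follows from the probabilities a model assigns to the true next tokens. The function names and the toy token IDs are made up for illustration.

```python
import math

def next_token_pairs(token_ids):
    """Each prefix is used to predict the token that follows it (left-to-right)."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def perplexity(true_token_probs):
    """exp of the mean negative log-likelihood of the observed next tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))

pairs = next_token_pairs([5, 9, 2, 7])
# A model that spreads probability uniformly over 4 choices has perplexity 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25])
```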
Key Insight:
- Language modeling is transfer learning: Pre-training on language modeling transfers to many tasks
- Zero-shot learning: Model can do tasks without explicit training
Impact:
- Showed generative models can be powerful
- Demonstrated zero-shot capabilities
- Raised concerns about misuse (staged release: the full 1.5B model was withheld until November 2019)
Limitations:
- Unidirectional: Each token sees only the tokens to its left, so there is no bidirectional context
- Limited context: 1024 tokens
- Still needs fine-tuning: For best performance on specific tasks
Phase 3: GPT-3 (2020) - Scaling and In-Context Learning
What GPT-3 Did
Key Innovation:
- Massive scale: 175B parameters (over 100x larger than GPT-2's 1.5B)
- In-context learning: Few-shot learning without gradient updates
- Scaling laws: Performance improved smoothly across model sizes, consistent with the power-law scaling reported by Kaplan et al. (2020)
Architecture:
GPT-3: 175B parameters
- 96 transformer layers
- 12,288 dimensions
- Context: 2048 tokens
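The 175B figure can be roughly reproduced from the layer count and width above. The sketch below is a back-of-the-envelope estimate, not an official formula: it assumes a standard decoder block (four attention projections plus an MLP with a 4x hidden width), a vocabulary of 50,257 BPE tokens, learned position embeddings, and it ignores biases and layer-norm weights.

```python
def approx_decoder_params(n_layers, d_model, vocab_size, n_ctx):
    """Rough parameter count that ignores biases and layer-norm weights."""
    attention = 4 * d_model * d_model            # Q, K, V and output projections
    mlp = 8 * d_model * d_model                  # two linear maps, 4x hidden width
    embeddings = (vocab_size + n_ctx) * d_model  # token + learned position embeddings
    return n_layers * (attention + mlp) + embeddings

gpt3 = approx_decoder_params(n_layers=96, d_model=12288, vocab_size=50257, n_ctx=2048)
print(f"{gpt3 / 1e9:.1f}B")  # prints 174.6B
```

Almost all of the parameters sit in the 12 × d_model² term per layer, which is why width and depth dominate the count and the embeddings barely matter at this scale.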
Training:
- Data: Common Crawl, WebText2, Books, Wikipedia (300B tokens)
- Compute: About 3.14 × 10^23 FLOPs; third-party estimates put the cost of a single training run at roughly $4.6M
- Few-shot: Provide examples in prompt, model learns from context
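In-context learning needs no training code, only prompt construction. A minimal sketch, assuming the comma-separated "source → target" format used in this document's translation example (the helper name `build_prompt` is hypothetical):

```python
def build_prompt(instruction, examples, query):
    """Format k solved examples followed by the unsolved query (k = 0 is zero-shot)."""
    shots = [f"{src} → {tgt}" for src, tgt in examples]
    return f"{instruction}: " + ", ".join(shots + [f"{query} →"])

zero_shot = build_prompt("Translate to French", [], "hello")
few_shot = build_prompt(
    "Translate to French", [("hello", "bonjour"), ("cat", "chat")], "dog"
)
```

The model's weights are identical in both cases; only the text it conditions on changes.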
Key Insights:
1. Scaling Laws:
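Loss follows an approximate power law in each factor when the other two are not the bottleneck (Kaplan et al., 2020):
L(N) ∝ N^(-α_N), L(D) ∝ D^(-α_D), L(C) ∝ C^(-α_C)
where N is model size, D is data size, and C is training compute
- Performance improves predictably with scale
- Larger models reach lower loss, given enough data
- More data lowers loss, given enough model capacity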
2. In-Context Learning:
Zero-shot: "Translate to French: hello →"
One-shot: "Translate to French: hello → bonjour, dog →"
Few-shot: "Translate to French: hello → bonjour, cat → chat, dog →"
3. Emergent Abilities:
- Arithmetic: Can do math (not explicitly trained)
- Code generation: Can write code
- Reasoning: Some logical reasoning
- Emerges at scale: Weak or absent in smaller models, although how sharp the jump looks depends partly on the metric used
Impact:
- Proved scaling works
- In-context learning paradigm
- Foundation for modern LLMs
- API-based access (model weights not released)
Limitations:
- Hallucination: Makes up facts
- Static knowledge: Weights are fixed after training, so the model knows nothing beyond its training data
- Limited context: 2048 tokens
- Expensive: Very costly to train/run
Phase 4: InstructGPT (2022) - Alignment and RLHF
What InstructGPT Did
Key Innovation:
- Reinforcement Learning from Human Feedback (RLHF): Align model with human preferences
- Instruction following: Model follows instructions better
- Helpful, harmless, honest: Three principles
Training Process:
Step 1: Supervised Fine-tuning (SFT)
1. Collect human-written prompts and responses
2. Fine-tune GPT-3 on this data
3. Model learns to follow instructions
Step 2: Reward Modeling
1. Collect comparisons: Which response is better?
2. Train reward model to predict human preferences
3. Reward model scores responses
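Reward models of this kind are commonly trained with a pairwise loss on each comparison: the loss is small when the preferred response gets the higher score. The sketch below assumes that standard formulation, -log σ(r_chosen - r_rejected), applied to scalar scores; the function name is hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) = log(1 + exp(-margin)) = -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When both responses get the same score the loss is log 2, and it falls toward zero as the preferred response's score pulls ahead.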
Step 3: Reinforcement Learning (PPO)
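1. Initialize the policy from the SFT model and generate responses
2. Score each response with the reward model
3. Update the policy with PPO (Proximal Policy Optimization) to maximize reward
4. Apply a KL penalty against the SFT model so the policy does not drift too far from it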
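Two pieces of the reinforcement learning step are easy to show in isolation. The sketch below assumes scalar log-probabilities and advantages for a single sample, and the `beta` and `clip_eps` values are illustrative defaults rather than the settings of any particular system. The KL-style penalty against the SFT model is a standard part of RLHF setups and is included here as an assumption beyond the steps listed above.

```python
import math

def kl_shaped_reward(rm_score, logp_policy, logp_reference, beta=0.02):
    """Reward-model score minus a penalty for drifting from the reference (SFT) model."""
    return rm_score - beta * (logp_policy - logp_reference)

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

Clipping keeps each update close to the policy that generated the samples, which is what makes PPO stable enough to run on top of a large language model.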
Key Insight:
- Alignment matters: Model behavior ≠ model capability (labelers preferred outputs from the 1.3B InstructGPT model over those from the 175B GPT-3)
- Human feedback: Better than just next-token prediction
- Safety: Can make models safer and more helpful
Impact:
- Foundation for ChatGPT
- RLHF becomes standard
- Alignment research grows
- Better user experience
Phase 5: ChatGPT (2022) - Conversational AI
What ChatGPT Did
Key Innovation:
- Conversational interface: Natural dialogue
- RLHF: Aligned with human preferences
- System prompts: Can control behavior
- Multi-turn: Maintains context across turns
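System prompts and multi-turn context can be pictured as plain text assembly: the whole conversation is flattened into one prompt and the model continues from the final assistant marker. The "role: text" template below is hypothetical; real chat models use their own special tokens and formats.

```python
def render_chat(system_prompt, turns, max_turns=None):
    """Flatten a system prompt plus (role, text) turns into one prompt string."""
    if max_turns is not None:
        # Keep only the most recent turns so the prompt fits the context window.
        turns = turns[max(0, len(turns) - max_turns):]
    lines = [f"system: {system_prompt}"]
    lines += [f"{role}: {text}" for role, text in turns]
    lines.append("assistant:")  # the model continues from here
    return "\n".join(lines)

history = [("user", "Hi"), ("assistant", "Hello!"), ("user", "Name a color.")]
prompt = render_chat("Be brief.", history)
```

Because the model itself is stateless between calls, "maintaining context" means resending the relevant history each turn, and dropping older turns is one simple way to stay within the context limit.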
Architecture:
- Based on a GPT-3.5 series model (a sibling of InstructGPT)
- Fine-tuned with RLHF
- Optimized for dialogue
Key Features:
- Conversational: Natural back-and-forth
- Helpful: Tries to be useful
- Admits mistakes: Can say "I don't know"
- Refuses harmful requests: Safety built-in
Impact:
- Viral adoption: An estimated 100M monthly users about 2 months after launch
- Paradigm shift: From tools to assistants
- Industry transformation: Every company wants LLM
- Research acceleration: Massive investment
Phase 6: GPT-4 (2023) - Multimodal and Reasoning
What GPT-4 Did
Key Innovation:
- Multimodal: Accepts text and image input, produces text output
- Better reasoning: Improved logical reasoning
- Larger context: 8K and 32K tokens at launch (128K with GPT-4 Turbo)
- Better performance: State-of-the-art on many benchmarks
Architecture:
- Exact details not disclosed
- Parameter count not disclosed; unofficial, unconfirmed reports suggest about 1.7T parameters in a mixture-of-experts design
- Multimodal: Vision + language
Key Improvements:
- Reasoning: Better at complex reasoning
- Code: Better code generation
- Safety: Improved safety measures
- Steerability: Better instruction following
Training:
- Pre-training: Large-scale data
- RLHF: Human feedback
- Red teaming: Safety testing
Impact:
- State-of-the-art: Best performance on many tasks
- Multimodal: Can process images
- Production use: Used in many applications
- Research: Drives research directions
Phase 7: Modern Foundation Models (2023-2024)
Key Models
OpenAI:
- GPT-4, GPT-4 Turbo
- Multimodal, large context
Anthropic:
- Claude 2, Claude 3
- Constitutional AI, better safety
Google:
- PaLM, PaLM 2, Gemini
- Multimodal, large scale
Meta:
- LLaMA, LLaMA 2
- Openly released weights (under custom licenses), efficient
Others:
- Mistral, Mixtral
- Open-source alternatives
Key Trends
1. Scaling Continues:
- Models getting larger
- More parameters
- More data
- More compute
2. Efficiency:
- Mixture of Experts (MoE): Sparse models
- Quantization: Lower precision
- Distillation: Smaller models
- Better architectures: More efficient
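Quantization is the easiest of these to show concretely. The sketch below assumes the simplest variant, symmetric per-tensor int8 quantization on a plain list of floats; production schemes typically work per channel or per group and are more involved.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the shared scale."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.0, 0.49]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in 8 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale per weight.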
3. Multimodality:
- Text + images
- Text + audio
- Text + video
- Unified models
4. Alignment:
- RLHF: Standard practice
- Constitutional AI: Replaces much of the human feedback with AI feedback guided by a written set of principles
- Safety: Ongoing focus
- Red teaming: Testing for vulnerabilities
5. Open Source:
- LLaMA, Mistral
- Community models
- Fine-tuning frameworks (LoRA, QLoRA)
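The idea behind LoRA fits in a few lines: the pretrained weight matrix W stays frozen and a low-rank product B·A is trained and added to it. The sketch below uses plain Python lists and a hypothetical `lora_forward` helper to show the forward pass and the parameter savings; it is not the API of any fine-tuning library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    r = len(A)                        # rank: A is r x d_in, B is d_out x r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def lora_trainable_params(d_in, d_out, r):
    """Number of trainable values in A and B combined."""
    return r * (d_in + d_out)

full = 4096 * 4096                             # 16,777,216 weights in one square layer
lora = lora_trainable_params(4096, 4096, r=8)  # 65,536 trainable values
```

B is usually initialized to zero, so the adapted model starts out identical to the pretrained one and training only gradually moves it away.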
6. Specialization:
- Code models: Codex, StarCoder
- Scientific: Galactica, Minerva
- Domain-specific: Medical, legal, etc.
Key Architectural Evolution
From BERT to GPT
BERT (Encoder):
Input → Encoder → [CLS] token → Task head
- Bidirectional
- Good for understanding
- Can't generate
GPT (Decoder):
Input → Decoder → Next token
- Unidirectional
- Good for generation
- Can do understanding (with prompting)
T5 (Encoder-Decoder):
Input → Encoder → Decoder → Output
- Both understanding and generation
- Good for tasks like summarization
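The practical difference between the encoder and decoder styles comes down to the attention mask. A minimal sketch, using nested lists in place of tensors: a BERT-style mask lets every position attend to every other position, while a GPT-style causal mask allows only the current and earlier positions.

```python
import math

def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over the allowed positions; disallowed positions get weight 0."""
    allowed = [s for s, ok in zip(scores, mask_row) if ok]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

bert_style = attention_mask(3, causal=False)
gpt_style = attention_mask(3, causal=True)
# Under the causal mask the first token can only attend to itself.
first_token_weights = masked_softmax([0.2, 1.5, -0.3], gpt_style[0])
```

The causal mask is what makes left-to-right generation possible: the prediction at each position never depends on tokens that have not been produced yet.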
Modern Architecture Choices
Decoder-only (GPT-style):
- Pros: Simple, good for generation, in-context learning
- Cons: Unidirectional, can't see future
- Use: GPT-3, LLaMA; GPT-4 and Claude are generally assumed to be decoder-only, though their architectures are not disclosed
Encoder-Decoder (T5-style):
- Pros: Bidirectional understanding, good for tasks
- Cons: More complex, less efficient
- Use: T5, BART, some specialized models
Encoder-only (BERT-style):
- Pros: Bidirectional, efficient
- Cons: Can't generate
- Use: BERT, RoBERTa, specialized understanding tasks
Training Evolution
Pre-training
BERT Era:
- Masked language modeling
- Next sentence prediction
- ~3.3B words
GPT-2 Era:
- Next token prediction
- ~40GB text
- Simple objective
GPT-3 Era:
- Next token prediction
- ~300B tokens
- Massive scale
Modern Era:
- Next token prediction
- Trillions of tokens
- Filtered, high-quality data
- Multimodal data
Fine-tuning Evolution
BERT Era:
- Task-specific fine-tuning
- Different model per task
- Supervised learning
GPT-2 Era:
- Zero-shot (no fine-tuning)
- Prompt engineering
- Task described directly in the prompt
GPT-3 Era:
- Few-shot in-context learning
- Prompt engineering
- No gradient updates
Modern Era:
- RLHF: Human feedback
- Instruction tuning: Follow instructions
- Multi-task: Single model for many tasks
- Fine-tuning: Still used for specialization
Key Paradigm Shifts
1. From Task-Specific to General
Before (BERT):
- Train model for specific task
- Different model per task
- Need labeled data
After (GPT-3+):
- Single general model
- Works for many tasks
- In-context learning
2. From Fine-tuning to Prompting
Before:
- Fine-tune model on task
- Update weights
- Task-specific model
After:
- Provide examples in prompt
- No weight updates
- Same model for all tasks
3. From Understanding to Generation
Before:
- Models for understanding (classification, NER)
- Encoder architectures
After:
- Models for generation
- Decoder architectures
- Can do both with prompting
4. From Supervised to Self-Supervised
Before:
- Need labeled data
- Supervised learning
- Expensive annotation
After:
- Self-supervised pre-training
- Unlabeled data
- Fine-tune with less data
5. From Capability to Alignment
Before:
- Focus on capability
- Better performance on benchmarks
After:
- Focus on alignment
- Helpful, harmless, honest
- RLHF, safety measures
Scaling Laws and Insights
Neural Scaling Laws
Key Findings:
Performance = f(Model Size, Data Size, Compute)
1. Performance improves predictably with scale
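2. Larger models need more data to reach their potential
3. Optimal compute allocation: Chinchilla (2022) found that parameters and training tokens should grow in roughly equal proportion, about 20 tokens per parameter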
4. Predictable improvements
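"Predictable" here means power-law behavior: every 10x increase in model size multiplies the loss by the same factor. The sketch below assumes the form L(N) = (N_c / N)^α with constants close to those reported by Kaplan et al. (2020) for non-embedding parameters; treat the specific numbers as illustrative rather than exact.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """L(N) = (N_c / N) ** alpha: loss as a function of model size N."""
    return (n_c / n_params) ** alpha

# The ratio is the same for any 10x step, because only the exponent matters.
ratio_per_10x = power_law_loss(1e10) / power_law_loss(1e9)
same_ratio_elsewhere = power_law_loss(1e12) / power_law_loss(1e11)
```

With an exponent this small, each 10x in parameters cuts the loss by only about 16%, which is why meaningful gains have required orders-of-magnitude increases in scale.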
Implications:
- Bigger is better: Larger models perform better, provided data and compute grow with them
- Data matters: Need more data for larger models
- Compute: Massive compute needed
- Predictable: Can predict performance
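These implications can be turned into rough arithmetic. The sketch below relies on two widely used rules of thumb, both assumptions rather than exact laws: training costs about 6 FLOPs per parameter per token, and a compute-optimal model in the Chinchilla sense uses about 20 training tokens per parameter.

```python
import math

def training_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

def compute_optimal_split(flops_budget, tokens_per_param=20):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(flops_budget / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

gpt3_flops = training_flops(175e9, 300e9)  # about 3.15e23 FLOPs
n_opt, d_opt = compute_optimal_split(gpt3_flops)
```

Under these assumptions, GPT-3's training budget would be better spent on a model of roughly 51B parameters trained on about 1T tokens, which illustrates why later models grew their datasets faster than their parameter counts.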
Emergent Abilities
What are Emergent Abilities?
- Abilities that appear only at large scale
- Weak or absent in smaller models
- Examples: Arithmetic, code, reasoning
Examples:
- Arithmetic: Can do math (not explicitly trained)
- Code generation: Can write code
- Few-shot learning: Learns from examples
- Reasoning: Some logical reasoning
Why Important:
- Shows scale matters
- Unexpected capabilities
- Hard to predict what will emerge
Modern Foundation Model Characteristics
1. Scale
Parameters:
- GPT-3: 175B
- GPT-4: Not disclosed (unconfirmed reports suggest ~1.7T, MoE)
- PaLM: 540B
- LLaMA 2: Up to 70B (open weights)
Data:
- Trillions of tokens
- Filtered, high-quality
- Multimodal
Compute:
- Massive training costs
- Millions of dollars
- Specialized hardware
2. Capabilities
Text:
- Generation, understanding
- Many tasks
- Few-shot learning
Multimodal:
- Images, audio, video
- Unified models
Reasoning:
- Logical reasoning
- Math, code
- Problem-solving
3. Alignment
RLHF:
- Human feedback
- Aligned with preferences
- Helpful, harmless
Safety:
- Refuses harmful requests
- Admits limitations
- Red teaming
4. Access
API:
- OpenAI, Anthropic
- Pay-per-use
- No access to model weights
Open Source:
- LLaMA, Mistral
- Community models
- Fine-tuning frameworks
Challenges and Future Directions
Current Challenges
1. Hallucination:
- Makes up facts
- Confident but wrong
- Hard to detect
2. Context Length:
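- Context window is finite, so very long documents must be truncated or split
- Standard attention cost grows quadratically with sequence length
- Longer contexts are an active area of work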
3. Cost:
- Expensive to train
- Expensive to run
- Need efficiency
4. Safety:
- Can be misused
- Bias issues
- Alignment challenges
5. Evaluation:
- Hard to evaluate
- Benchmarks may not reflect real use
- Need better metrics
Future Directions
1. Longer Context:
- 1M+ tokens
- Better attention mechanisms
- Efficient processing
2. Better Reasoning:
- Chain-of-thought
- Tool use
- Multi-step reasoning
3. Multimodality:
- More modalities
- Better integration
- Unified models
4. Efficiency:
- Smaller models
- Better architectures
- Quantization, distillation
5. Alignment:
- Better alignment methods
- Safety guarantees
- Interpretability
6. Specialization:
- Domain-specific models
- Fine-tuning frameworks
- Task-specific optimization
Summary: The Journey
2018 - BERT:
- Bidirectional understanding
- Pre-training + fine-tuning
- Task-specific models
2019 - GPT-2:
- Generative capabilities
- Zero-shot learning
- Unidirectional
2020 - GPT-3:
- Massive scale (175B)
- In-context learning
- Scaling laws
2022 - InstructGPT:
- RLHF
- Alignment
- Instruction following
2022 - ChatGPT:
- Conversational AI
- RLHF
- Viral adoption
2023 - GPT-4:
- Multimodal
- Better reasoning
- Large context
2024 - Modern Era:
- Foundation models
- Multimodal
- Open source alternatives
- Specialization
Key Insights:
- Scale matters: Larger models perform better when data and compute grow with them
- In-context learning: Few-shot without fine-tuning
- Alignment: RLHF makes models more useful
- Emergent abilities: Unexpected capabilities at scale
- Multimodality: Text + other modalities
- Efficiency: Need for efficient models
The Path Forward:
- Longer contexts
- Better reasoning
- More efficient
- Better alignment
- Specialization
- Open source
Key Takeaways
- From BERT to GPT: Encoder → Decoder, Understanding → Generation
- Scaling Works: Performance improves predictably with model size, data, and compute
- In-Context Learning: Few-shot without fine-tuning
- Alignment Matters: RLHF makes models more useful
- Emergent Abilities: Unexpected capabilities at scale
- Multimodality: Text + images + more
- Foundation Models: Single model for many tasks
- Open Source: Community-driven alternatives
The evolution from BERT to modern foundation models represents one of the most significant advances in AI, transforming how we build and use language models.