From BERT to Foundation Models: The Evolution

Overview

This document traces the evolution from BERT (2018) to modern foundation models (GPT-4, Claude, etc.), explaining the key innovations, architectural changes, and paradigm shifts that led to today's large language models.


Timeline: Key Milestones

2018: BERT (Bidirectional Encoder)
2019: GPT-2 (Generative Pre-trained Transformer)
2020: GPT-3 (Scaling Laws, In-Context Learning)
2021: Codex, InstructGPT (RLHF)
2022: ChatGPT, PaLM, Chinchilla
2023: GPT-4, Claude, LLaMA
2024: GPT-4 Turbo, Claude 3, Gemini

Phase 1: BERT (2018) - Bidirectional Understanding

What BERT Did

Key Innovation:

  • Bidirectional context: Unlike previous models, BERT reads text in both directions
  • Masked Language Modeling (MLM): Predict masked tokens using full context
  • Pre-training + Fine-tuning: Train on large corpus, fine-tune on specific tasks

Architecture:

BERT-Base: 110M parameters, 12 layers
BERT-Large: 340M parameters, 24 layers

Training:

  • Pre-training:
    • Masked Language Modeling (15% tokens masked)
    • Next Sentence Prediction (NSP)
    • Data: BooksCorpus + English Wikipedia (3.3B tokens)
  • Fine-tuning: Add task-specific head, train on labeled data

Impact:

  • State-of-the-art on 11 NLP tasks
  • Showed power of pre-training
  • Established transformer encoder as standard

Limitations:

  • Encoder-only: Can't generate text
  • Fine-tuning required: Need labeled data for each task
  • Task-specific: Different model for each task

Phase 2: GPT-2 (2019) - Generative Capabilities

What GPT-2 Did

Key Innovation:

  • Generative: Can generate coherent text
  • Zero-shot: No fine-tuning needed for some tasks
  • Unidirectional: Autoregressive generation (left-to-right)

Architecture:

GPT-2 Small: 117M parameters
GPT-2 Medium: 345M parameters
GPT-2 Large: 762M parameters
GPT-2 XL: 1.5B parameters

Training:

  • Pre-training: Next token prediction (language modeling)
  • Data: WebText (40GB, 8M documents)
  • No fine-tuning: Directly use for generation

Key Insight:

  • Language modeling is transfer learning: Pre-training on language modeling transfers to many tasks
  • Zero-shot learning: Model can do tasks without explicit training

Impact:

  • Showed generative models can be powerful
  • Demonstrated zero-shot capabilities
  • Raised concerns about misuse (initially not released)

Limitations:

  • Unidirectional: Only left-to-right context
  • No bidirectional understanding: Can't see future tokens
  • Limited context: 1024 tokens
  • Still needs fine-tuning: For best performance on specific tasks

Phase 3: GPT-3 (2020) - Scaling and In-Context Learning

What GPT-3 Did

Key Innovation:

  • Massive scale: 175B parameters (100x larger than GPT-2)
  • In-context learning: Few-shot learning without gradient updates
  • Scaling laws: Showed performance improves predictably with scale

Architecture:

GPT-3: 175B parameters
- 96 transformer layers
- 12,288 dimensions
- Context: 2048 tokens

Training:

  • Data: Common Crawl, WebText2, Books, Wikipedia (300B tokens)
  • Compute: Massive (estimated $4.6M in compute)
  • Few-shot: Provide examples in prompt, model learns from context

Key Insights:

1. Scaling Laws:

Performance ∝ (Model Size)^α × (Data Size)^β × (Compute)^γ
  • Performance improves predictably with scale
  • Larger models = better performance
  • More data = better performance

2. In-Context Learning:

Zero-shot: "Translate to French: hello →"
Few-shot: "Translate to French: hello → bonjour, cat → chat, dog →"
One-shot: Single example

3. Emergent Abilities:

  • Arithmetic: Can do math (not explicitly trained)
  • Code generation: Can write code
  • Reasoning: Some logical reasoning
  • Emerges at scale: Not present in smaller models

Impact:

  • Proved scaling works
  • In-context learning paradigm
  • Foundation for modern LLMs
  • API-based access (no open-source)

Limitations:

  • Hallucination: Makes up facts
  • No fine-tuning: Can't update model
  • Limited context: 2048 tokens
  • Expensive: Very costly to train/run

Phase 4: InstructGPT (2021) - Alignment and RLHF

What InstructGPT Did

Key Innovation:

  • Reinforcement Learning from Human Feedback (RLHF): Align model with human preferences
  • Instruction following: Model follows instructions better
  • Helpful, harmless, honest: Three principles

Training Process:

Step 1: Supervised Fine-tuning (SFT)

1. Collect human-written prompts and responses
2. Fine-tune GPT-3 on this data
3. Model learns to follow instructions

Step 2: Reward Modeling

1. Collect comparisons: Which response is better?
2. Train reward model to predict human preferences
3. Reward model scores responses

Step 3: Reinforcement Learning (PPO)

1. Generate responses from SFT model
2. Score with reward model
3. Update model to maximize reward
4. Use PPO (Proximal Policy Optimization)

Key Insight:

  • Alignment matters: Model behavior ≠ model capability
  • Human feedback: Better than just next-token prediction
  • Safety: Can make models safer and more helpful

Impact:

  • Foundation for ChatGPT
  • RLHF becomes standard
  • Alignment research grows
  • Better user experience

Phase 5: ChatGPT (2022) - Conversational AI

What ChatGPT Did

Key Innovation:

  • Conversational interface: Natural dialogue
  • RLHF: Aligned with human preferences
  • System prompts: Can control behavior
  • Multi-turn: Maintains context across turns

Architecture:

  • Based on GPT-3.5 (InstructGPT)
  • Fine-tuned with RLHF
  • Optimized for dialogue

Key Features:

  • Conversational: Natural back-and-forth
  • Helpful: Tries to be useful
  • Admits mistakes: Can say "I don't know"
  • Refuses harmful requests: Safety built-in

Impact:

  • Viral adoption: 100M users in 2 months
  • Paradigm shift: From tools to assistants
  • Industry transformation: Every company wants LLM
  • Research acceleration: Massive investment

Phase 6: GPT-4 (2023) - Multimodal and Reasoning

What GPT-4 Did

Key Innovation:

  • Multimodal: Text + images
  • Better reasoning: Improved logical reasoning
  • Larger context: 8K tokens (later 32K, 128K)
  • Better performance: State-of-the-art on many benchmarks

Architecture:

  • Exact details not disclosed
  • Estimated: 1.7T parameters (mixture of experts)
  • Multimodal: Vision + language

Key Improvements:

  • Reasoning: Better at complex reasoning
  • Code: Better code generation
  • Safety: Improved safety measures
  • Steerability: Better instruction following

Training:

  • Pre-training: Large-scale data
  • RLHF: Human feedback
  • Red teaming: Safety testing

Impact:

  • State-of-the-art: Best performance on many tasks
  • Multimodal: Can process images
  • Production use: Used in many applications
  • Research: Drives research directions

Phase 7: Modern Foundation Models (2023-2024)

Key Models

OpenAI:

  • GPT-4, GPT-4 Turbo
  • Multimodal, large context

Anthropic:

  • Claude 2, Claude 3
  • Constitutional AI, better safety

Google:

  • PaLM, PaLM 2, Gemini
  • Multimodal, large scale

Meta:

  • LLaMA, LLaMA 2
  • Open-source, efficient

Others:

  • Mistral, Mixtral
  • Open-source alternatives

1. Scaling Continues:

  • Models getting larger
  • More parameters
  • More data
  • More compute

2. Efficiency:

  • Mixture of Experts (MoE): Sparse models
  • Quantization: Lower precision
  • Distillation: Smaller models
  • Better architectures: More efficient

3. Multimodality:

  • Text + images
  • Text + audio
  • Text + video
  • Unified models

4. Alignment:

  • RLHF: Standard practice
  • Constitutional AI: Alternative to RLHF
  • Safety: Ongoing focus
  • Red teaming: Testing for vulnerabilities

5. Open Source:

  • LLaMA, Mistral
  • Community models
  • Fine-tuning frameworks (LoRA, QLoRA)

6. Specialization:

  • Code models: Codex, StarCoder
  • Scientific: Galactica, Minerva
  • Domain-specific: Medical, legal, etc.

Key Architectural Evolution

From BERT to GPT

BERT (Encoder):

Input → Encoder → [CLS] token → Task head
- Bidirectional
- Good for understanding
- Can't generate

GPT (Decoder):

Input → Decoder → Next token
- Unidirectional
- Good for generation
- Can do understanding (with prompting)

T5 (Encoder-Decoder):

Input → Encoder → Decoder → Output
- Both understanding and generation
- Good for tasks like summarization

Modern Architecture Choices

Decoder-only (GPT-style):

  • Pros: Simple, good for generation, in-context learning
  • Cons: Unidirectional, can't see future
  • Use: GPT-3, GPT-4, LLaMA, Claude

Encoder-Decoder (T5-style):

  • Pros: Bidirectional understanding, good for tasks
  • Cons: More complex, less efficient
  • Use: T5, BART, some specialized models

Encoder-only (BERT-style):

  • Pros: Bidirectional, efficient
  • Cons: Can't generate
  • Use: BERT, RoBERTa, specialized understanding tasks

Training Evolution

Pre-training

BERT Era:

  • Masked language modeling
  • Next sentence prediction
  • ~3B tokens

GPT-2 Era:

  • Next token prediction
  • ~40GB text
  • Simple objective

GPT-3 Era:

  • Next token prediction
  • ~300B tokens
  • Massive scale

Modern Era:

  • Next token prediction
  • Trillions of tokens
  • Filtered, high-quality data
  • Multimodal data

Fine-tuning Evolution

BERT Era:

  • Task-specific fine-tuning
  • Different model per task
  • Supervised learning

GPT-2 Era:

  • Zero-shot (no fine-tuning)
  • Prompt engineering
  • In-context learning

GPT-3 Era:

  • Few-shot in-context learning
  • Prompt engineering
  • No gradient updates

Modern Era:

  • RLHF: Human feedback
  • Instruction tuning: Follow instructions
  • Multi-task: Single model for many tasks
  • Fine-tuning: Still used for specialization

Key Paradigm Shifts

1. From Task-Specific to General

Before (BERT):

  • Train model for specific task
  • Different model per task
  • Need labeled data

After (GPT-3+):

  • Single general model
  • Works for many tasks
  • In-context learning

2. From Fine-tuning to Prompting

Before:

  • Fine-tune model on task
  • Update weights
  • Task-specific model

After:

  • Provide examples in prompt
  • No weight updates
  • Same model for all tasks

3. From Understanding to Generation

Before:

  • Models for understanding (classification, NER)
  • Encoder architectures

After:

  • Models for generation
  • Decoder architectures
  • Can do both with prompting

4. From Supervised to Self-Supervised

Before:

  • Need labeled data
  • Supervised learning
  • Expensive annotation

After:

  • Self-supervised pre-training
  • Unlabeled data
  • Fine-tune with less data

5. From Capability to Alignment

Before:

  • Focus on capability
  • Better performance on benchmarks

After:

  • Focus on alignment
  • Helpful, harmless, honest
  • RLHF, safety measures

Scaling Laws and Insights

Neural Scaling Laws

Key Findings:

Performance = f(Model Size, Data Size, Compute)

1. Performance improves predictably with scale
2. Larger models need more data
3. Optimal compute allocation
4. Predictable improvements

Implications:

  • Bigger is better: Larger models perform better
  • Data matters: Need more data for larger models
  • Compute: Massive compute needed
  • Predictable: Can predict performance

Emergent Abilities

What are Emergent Abilities?

  • Abilities that appear only at large scale
  • Not present in smaller models
  • Examples: Arithmetic, code, reasoning

Examples:

  • Arithmetic: Can do math (not explicitly trained)
  • Code generation: Can write code
  • Few-shot learning: Learns from examples
  • Reasoning: Some logical reasoning

Why Important:

  • Shows scale matters
  • Unexpected capabilities
  • Hard to predict what will emerge

Modern Foundation Model Characteristics

1. Scale

Parameters:

  • GPT-3: 175B
  • GPT-4: ~1.7T (estimated, MoE)
  • PaLM: 540B
  • LLaMA 2: 70B (open-source)

Data:

  • Trillions of tokens
  • Filtered, high-quality
  • Multimodal

Compute:

  • Massive training costs
  • Millions of dollars
  • Specialized hardware

2. Capabilities

Text:

  • Generation, understanding
  • Many tasks
  • Few-shot learning

Multimodal:

  • Images, audio, video
  • Unified models

Reasoning:

  • Logical reasoning
  • Math, code
  • Problem-solving

3. Alignment

RLHF:

  • Human feedback
  • Aligned with preferences
  • Helpful, harmless

Safety:

  • Refuses harmful requests
  • Admits limitations
  • Red teaming

4. Access

API:

  • OpenAI, Anthropic
  • Pay-per-use
  • No model access

Open Source:

  • LLaMA, Mistral
  • Community models
  • Fine-tuning frameworks

Challenges and Future Directions

Current Challenges

1. Hallucination:

  • Makes up facts
  • Confident but wrong
  • Hard to detect

2. Context Length:

  • Limited context
  • Can't handle very long documents
  • Working on longer contexts

3. Cost:

  • Expensive to train
  • Expensive to run
  • Need efficiency

4. Safety:

  • Can be misused
  • Bias issues
  • Alignment challenges

5. Evaluation:

  • Hard to evaluate
  • Benchmarks may not reflect real use
  • Need better metrics

Future Directions

1. Longer Context:

  • 1M+ tokens
  • Better attention mechanisms
  • Efficient processing

2. Better Reasoning:

  • Chain-of-thought
  • Tool use
  • Multi-step reasoning

3. Multimodality:

  • More modalities
  • Better integration
  • Unified models

4. Efficiency:

  • Smaller models
  • Better architectures
  • Quantization, distillation

5. Alignment:

  • Better alignment methods
  • Safety guarantees
  • Interpretability

6. Specialization:

  • Domain-specific models
  • Fine-tuning frameworks
  • Task-specific optimization

Summary: The Journey

2018 - BERT:

  • Bidirectional understanding
  • Pre-training + fine-tuning
  • Task-specific models

2019 - GPT-2:

  • Generative capabilities
  • Zero-shot learning
  • Unidirectional

2020 - GPT-3:

  • Massive scale (175B)
  • In-context learning
  • Scaling laws

2021 - InstructGPT:

  • RLHF
  • Alignment
  • Instruction following

2022 - ChatGPT:

  • Conversational AI
  • RLHF
  • Viral adoption

2023 - GPT-4:

  • Multimodal
  • Better reasoning
  • Large context

2024 - Modern Era:

  • Foundation models
  • Multimodal
  • Open source alternatives
  • Specialization

Key Insights:

  1. Scale matters: Larger models = better performance
  2. In-context learning: Few-shot without fine-tuning
  3. Alignment: RLHF makes models more useful
  4. Emergent abilities: Unexpected capabilities at scale
  5. Multimodality: Text + other modalities
  6. Efficiency: Need for efficient models

The Path Forward:

  • Longer contexts
  • Better reasoning
  • More efficient
  • Better alignment
  • Specialization
  • Open source

Key Takeaways

  1. From BERT to GPT: Encoder → Decoder, Understanding → Generation
  2. Scaling Works: Larger models perform better predictably
  3. In-Context Learning: Few-shot without fine-tuning
  4. Alignment Matters: RLHF makes models more useful
  5. Emergent Abilities: Unexpected capabilities at scale
  6. Multimodality: Text + images + more
  7. Foundation Models: Single model for many tasks
  8. Open Source: Community-driven alternatives

The evolution from BERT to modern foundation models represents one of the most significant advances in AI, transforming how we build and use language models.