Topic 43: Language Modeling Training Losses

🔥 For interviews, read these first:

LM_LOSSES_DEEP_DIVE.md — frontier-lab interview deep dive: CLM/MLM/Span-corruption/PrefixLM/MoD/ELECTRA, why CLM dominates, why NSP died, how ICL emerges from CLM, multi-token prediction, prompt masking for SFT.

INTERVIEW_GRILL.md — 40 active-recall questions.

What You'll Learn

This topic covers the main language modeling training objectives:

MLM (Masked Language Modeling) - BERT-style
CLM (Causal Language Modeling) - GPT-style
NSP (Next Sentence Prediction) - BERT-style
Mathematical formulations
Implementation details
When to use each

Why We Need This

Interview Importance

Common question: "Explain MLM vs CLM"
Implementation: May ask to implement these losses
Understanding: Core to understanding BERT vs GPT

Real-World Application

BERT: Uses MLM + NSP
GPT: Uses CLM
Modern models: Mostly CLM for decoder-only LLMs, with variants such as span corruption and ELECTRA-style objectives elsewhere
Foundation: Understanding pre-training objectives

Industry Use Cases

1. MLM (Masked Language Modeling)

Use Case: BERT, RoBERTa

Bidirectional context
Better for understanding tasks
Used in encoder models

2. CLM (Causal Language Modeling)

Use Case: GPT, LLaMA

Autoregressive generation
Better for generation tasks
Used in decoder models

3. NSP (Next Sentence Prediction)

Use Case: BERT (original)

Sentence pair understanding
Less commonly used now
Replaced by other objectives

Core Intuition

The training objective determines what conditional distribution the model learns.

That is why BERT-style and GPT-style models behave so differently.

MLM

Masked language modeling trains the model to fill in missing tokens using both left and right context.

That makes it naturally suited for bidirectional understanding.

CLM

Causal language modeling trains the model to predict the next token from previous tokens only.

That makes it naturally suited for generation.
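To make the link between next-token prediction and generation concrete, here is a minimal sketch of the sampling loop. The bigram table NEXT_TOKEN_PROBS and the generate function are hypothetical stand-ins for a trained model: a real CLM conditions on the whole prefix, not just the last token.

```python
import random

# Hypothetical toy "model": next-token probabilities given only the previous token.
# A real CLM would condition on the entire prefix x_{<t}.
NEXT_TOKEN_PROBS = {
    "<bos>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<eos>": 1.0},
}

def generate(rng, max_len=10):
    """Sample left to right: each new token is drawn from P(next | previous)."""
    tokens = ["<bos>"]
    while tokens[-1] != "<eos>" and len(tokens) < max_len:
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        choices, weights = zip(*probs.items())
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens

print(generate(random.Random(0)))
```

The loop uses exactly the distribution the objective trains, which is why no extra machinery is needed to turn a CLM into a generator.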

NSP

NSP was designed to encourage sentence-pair reasoning. Later work, notably RoBERTa, found that removing it matched or slightly improved downstream performance.

Technical Details Interviewers Often Want

Objective Shapes Architecture Use

This is the key answer:

MLM aligns naturally with encoder-style bidirectional models
CLM aligns naturally with decoder-style autoregressive generation
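The difference shows up directly in the attention mask. The sketch below builds both masks for a 4-token sequence; visibility_mask is an illustrative helper, not a function from any library.

```python
def visibility_mask(seq_len, causal):
    """Return mask[i][j] = True if position i may attend to position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

bidirectional_mask = visibility_mask(4, causal=False)  # encoder / MLM style
causal_mask = visibility_mask(4, causal=True)          # decoder / CLM style

for name, mask in [("bidirectional (MLM)", bidirectional_mask),
                   ("causal (CLM)", causal_mask)]:
    print(name)
    for row in mask:
        print(" ".join("1" if visible else "0" for visible in row))
```

The bidirectional mask is all ones. The causal mask is lower-triangular, so position i never sees a token it will later be asked to predict.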

Why MLM Is Not Directly a Generation Objective

MLM predicts masked positions, not a left-to-right continuation process.

That is why BERT-style objectives are less natural for open-ended generation.
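One way to make this precise, as a sketch in standard notation (T for the sequence length and M for the set of masked positions are symbols introduced here): by the chain rule, the CLM loss is the exact negative log-likelihood of the whole sequence, so sampling one token at a time draws from the learned joint distribution. The MLM loss only scores masked positions given the rest and does not factorize the joint in this way.

```latex
\begin{aligned}
\log P(x_1, \dots, x_T) &= \sum_{t=1}^{T} \log P(x_t \mid x_{<t})
  && \text{(chain rule, exact)} \\
\mathcal{L}_{\mathrm{CLM}} &= -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})
  = -\log P(x_1, \dots, x_T) \\
\mathcal{L}_{\mathrm{MLM}} &= -\sum_{i \in M} \log P(x_i \mid x_{\setminus M})
  && \text{(no left-to-right factorization of the joint)}
\end{aligned}
```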

Why NSP Became Less Central

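NSP is historically important, but later models dropped or replaced it. Its negative examples come from different documents, so the task can often be solved from topic cues alone, and ablations showed it added little beyond MLM. ALBERT, for example, replaced it with sentence-order prediction.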

Common Failure Modes

treating MLM and CLM as minor variants instead of fundamentally different learning problems
forgetting that objective choice shapes downstream use
overemphasizing NSP in modern practice
not being able to explain why GPT-style models generate naturally

Edge Cases and Follow-Up Questions

Why is MLM naturally bidirectional?
Why is CLM naturally generative?
Why can't BERT-style MLM be used directly for autoregressive generation?
Why did NSP become less central in later models?
Why does pretraining objective shape model behavior so strongly?

What to Practice Saying Out Loud

The difference between MLM and CLM in one minute
Why objective choice changes what the model is good at
Why BERT and GPT feel different even before fine-tuning

Theory

MLM (Masked Language Modeling)

What it is:

Randomly select a subset of input tokens (15% in BERT) and mask them
Predict the original tokens at those positions from the surrounding context
Bidirectional: the model sees both left and right context
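The masking step can be sketched as follows, assuming BERT's published recipe: select 15% of positions, then replace 80% of those with [MASK], 10% with a random token, and leave 10% unchanged. MASK_ID, VOCAB_SIZE, and mask_tokens are hypothetical names chosen for illustration, and IGNORE = -100 mirrors the default ignore_index in PyTorch's cross-entropy loss.

```python
import random

MASK_ID = 103      # assumed id of the [MASK] token (hypothetical vocabulary)
VOCAB_SIZE = 1000  # assumed vocabulary size
IGNORE = -100      # label meaning "do not score this position"

def mask_tokens(token_ids, mask_prob=0.15, rng=random):
    """Return (inputs, labels) for MLM using the 80/10/10 corruption rule."""
    inputs = list(token_ids)
    labels = [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                      # position not selected
        labels[i] = tok                   # model must recover the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID           # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels

inputs, labels = mask_tokens(list(range(200, 220)), rng=random.Random(0))
print(inputs)
print(labels)
```

Only positions whose label is not IGNORE contribute to the loss, including the ones left unchanged in the input.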

Mathematical Formulation:

L_MLM = -∑_{i ∈ M} log P(x_i | x_{\M}), where M is the set of masked positions and x_{\M} is the input with those positions masked
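A framework-free sketch of this loss, assuming per-position logits over the vocabulary and a label list that holds the original token id at masked positions and IGNORE elsewhere. The names mlm_loss and log_softmax are illustrative, and the loss is averaged over masked positions instead of summed, which is a common convention.

```python
import math

IGNORE = -100  # label meaning "position was not masked, do not score it"

def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def mlm_loss(logits, labels):
    """Mean negative log-likelihood over masked positions only."""
    losses = [
        -log_softmax(row)[y]
        for row, y in zip(logits, labels)
        if y != IGNORE
    ]
    return sum(losses) / len(losses) if losses else 0.0

# Three positions, vocabulary of 3; only the last position is masked.
logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [IGNORE, IGNORE, 2]
print(mlm_loss(logits, labels))  # uniform logits give log(3)
```

In a tensor library the same effect usually comes from a cross-entropy call that skips an ignore index.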

CLM (Causal Language Modeling)

What it is:

Predict next token given previous tokens
Autoregressive: only see left context
Standard language modeling objective

Mathematical Formulation:

L_CLM = -∑_{t=1}^{T} log P(x_t | x_{<t}), where T is the sequence length
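A framework-free sketch of this loss. The key implementation detail is the shift: the logits produced at position t are scored against the token at position t + 1, so the last position has no target. The names clm_loss and log_softmax are illustrative, and the loss is averaged over predicted positions instead of summed.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def clm_loss(logits, token_ids):
    """Mean next-token negative log-likelihood.

    logits[t] are the model's scores after reading token_ids[: t + 1],
    so they are compared against token_ids[t + 1].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form one prediction")
    losses = [
        -log_softmax(logits[t])[token_ids[t + 1]]
        for t in range(len(token_ids) - 1)
    ]
    return sum(losses) / len(losses)

# Sequence of 3 tokens, vocabulary of 3, uniform predictions everywhere.
token_ids = [0, 1, 2]
logits = [[0.0, 0.0, 0.0] for _ in token_ids]
print(clm_loss(logits, token_ids))  # uniform logits give log(3)
```

Unlike MLM, every position except the last yields a training signal from a single forward pass.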

NSP (Next Sentence Prediction)

What it is:

Predict if sentence B follows sentence A
Binary classification task
Helps with sentence pair understanding

Mathematical Formulation:

L_NSP = -log P(y | sentence_A, sentence_B), where y ∈ {is_next, not_next} is the true label
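A minimal sketch of this loss as binary cross-entropy on a single score. It assumes the model emits one scalar logit for "B follows A". BERT itself uses a two-way softmax over the [CLS] representation, which is an equivalent formulation. The name nsp_loss is illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def nsp_loss(logit, is_next):
    """Binary cross-entropy for one sentence pair.

    logit: scalar score that sentence B follows sentence A
    is_next: True if B really follows A, False for a random pairing
    """
    # P(is_next) = sigmoid(logit); P(not_next) = sigmoid(-logit)
    return -log_sigmoid(logit) if is_next else -log_sigmoid(-logit)

print(nsp_loss(0.0, True))    # undecided model: log(2)
print(nsp_loss(4.0, True))    # confident and correct: small loss
print(nsp_loss(4.0, False))   # confident and wrong: large loss
```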

Industry-Standard Boilerplate Code

See detailed files for complete implementations:

language_modeling_losses.py: Complete implementations
loss_explanations.md: Detailed theoretical explanations
loss_comparison.md: Comparison and when to use each

Exercises

Implement MLM loss
Implement CLM loss
Implement NSP loss
Compare different masking strategies
Understand bidirectional vs unidirectional

Next Steps

Review transformer architectures
Compare encoder vs decoder models
Explore modern pre-training objectives
