Topic 46: RNN and LSTM
🔥 For interviews, read these first:
- RNN_LSTM_DEEP_DIVE.md: frontier-lab deep dive covering the vanilla RNN forward pass and BPTT, vanishing/exploding gradients (with Jacobian product analysis), LSTM gates and the additive cell-state update, GRU, bidirectional RNNs, seq2seq with attention (Bahdanau/Luong), the transition to transformers, and the connection to modern SSMs.
- INTERVIEW_GRILL.md: 50 active-recall questions.
What You'll Learn
This topic teaches you RNN and LSTM with simple, precise code:
- RNN (Recurrent Neural Network) from scratch
- LSTM (Long Short-Term Memory) from scratch
- Simple, interview-writable implementations
- Key concepts and differences
Why We Need This
Interview Importance
- Common question: "Implement RNN/LSTM from scratch"
- Understanding: Foundation for sequence modeling
- Historical context: Before transformers
Real-World Application
- RNN: Simple sequence modeling
- LSTM: Long-term dependencies
- Historical: Used before transformers
- Still relevant: Understanding sequence models
Industry Use Cases
1. RNN
Use Case: Simple sequence tasks
- Character-level language modeling
- Simple time series
- Basic sequence classification
2. LSTM
Use Case: Long-term dependencies
- Machine translation (before transformers)
- Speech recognition
- Time series forecasting
Core Intuition
RNNs process sequences one step at a time while carrying a hidden state forward.
That makes them natural sequence models, but also creates optimization challenges across long time ranges.
RNN
A plain RNN updates a hidden state recurrently.
Its intuition is simple:
- current state summarizes the past
- new input updates that summary
LSTM
LSTM was introduced because plain RNNs struggle with long-term dependencies.
The gating mechanism helps control:
- what to forget
- what to remember
- what to expose
That makes gradient flow and memory behavior more stable.
Technical Details Interviewers Often Want
Why RNNs Struggle with Long-Term Dependencies
Repeated multiplication through time can make gradients:
- shrink
- explode
That is the vanishing/exploding gradient problem in recurrent form.
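The effect of repeated multiplication can be seen in a toy case. The sketch below assumes a scalar, linear recurrence h_t = w * h_{t-1} (no tanh, no input), so the gradient of h_T with respect to h_0 is simply w multiplied by itself T times. The function name `backprop_factor` is hypothetical. In a real tanh RNN each factor also includes the tanh derivative, which is at most 1 and can only shrink the product further.

```python
def backprop_factor(w, steps):
    """Return |d h_T / d h_0| for the linear scalar recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        # Each step back in time multiplies the gradient by the same factor w.
        grad *= w
    return abs(grad)

# Slightly below 1 shrinks toward zero; slightly above 1 blows up.
for w in (0.9, 1.0, 1.1):
    print(f"w={w}: gradient factor after 100 steps = {backprop_factor(w, 100):.3e}")
```

In the matrix case the same role is played by the product of per-step Jacobians, so the behavior depends on the size of W_hh rather than on a single scalar.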
Why LSTM Gates Help
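LSTM gates create controlled paths for information and gradient flow.
The cell state is updated additively instead of being pushed through a weight matrix and nonlinearity at every step, so gradients along it decay more slowly. That is why LSTMs often retain useful information longer than plain RNNs.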
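A short sketch of the argument, using the standard LSTM cell update. The symbols are assumptions not defined earlier in this file: c_t is the cell state, f_t and i_t are the forget and input gate activations, and g_t is the candidate value.

```latex
c_t = f_t \odot c_{t-1} + i_t \odot g_t
\qquad\Longrightarrow\qquad
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
```

The approximation keeps only the direct path and ignores the indirect dependence of the gates on c_{t-1} through h_{t-1}. Along that direct path, the gradient across many steps is a product of forget-gate values, so when the forget gate stays close to 1 the signal is preserved rather than being repeatedly multiplied by W_hh and a tanh derivative.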
Why Transformers Replaced Them in Many NLP Tasks
Transformers parallelize training better and handle long-range interactions more directly.
But RNN/LSTM understanding is still valuable because:
- it builds sequence-modeling intuition
- it clarifies why attention was such a major shift
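The parallelization contrast can be sketched side by side. This is an illustration under assumptions, not code from this repository: `rnn_states` and `self_attention` are hypothetical names, and the attention shown is a single-head, unmasked, scaled dot-product version without biases. The point is only the control flow: the RNN needs a loop over time, while attention is a handful of matrix operations over the whole sequence.

```python
import numpy as np

def rnn_states(xs, W_xh, W_hh):
    """Sequential: step t cannot start until step t-1 has produced h."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        out.append(h)
    return np.stack(out)

def self_attention(xs, W_q, W_k, W_v):
    """Parallel: all positions are computed together from the full sequence."""
    Q, K, V = xs @ W_q, xs @ W_k, xs @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
T, d = 6, 4
xs = rng.normal(size=(T, d))
hs = rnn_states(xs, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
zs = self_attention(xs, *(rng.normal(size=(d, d)) for _ in range(3)))
print(hs.shape, zs.shape)
```

In the attention function, any position can draw on any other position in one step, whereas in the RNN information from step 0 reaches step T only by passing through every intermediate hidden state.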
Common Failure Modes
- treating LSTM as just a bigger RNN without understanding gating
- not being able to explain vanishing gradients in recurrent settings
- forgetting that RNNs are sequential in time and hard to parallelize across tokens
- assuming LSTMs are obsolete rather than historically and conceptually important
Edge Cases and Follow-Up Questions
- Why do plain RNNs struggle with long dependencies?
- How do forget, input, and output gates help?
- Why are RNNs harder to parallelize than transformers?
- Why did attention become such a major replacement idea?
- When might recurrent models still make sense?
What to Practice Saying Out Loud
- Why an RNN hidden state is a running summary of the past
- Why LSTM gates help memory and gradient flow
- Why transformers changed sequence modeling so much
Theory
RNN
What it is:
- Processes sequences step by step
- Maintains hidden state
- Simple but limited memory
Key Equation:
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b)
y_t = W_hy * h_t + b_y
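The two equations above translate almost line for line into NumPy. The following is a minimal sketch, not the contents of rnn_lstm_code.py: the function name `rnn_forward`, the argument names `xs` and `h0`, the shapes, and the small random initialization are all assumptions chosen for illustration.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, W_hy, b, b_y):
    """Run a vanilla RNN over a sequence.

    xs: array of shape (T, input_dim); h0: array of shape (hidden_dim,).
    Returns hidden states (T, hidden_dim) and outputs (T, output_dim).
    """
    h = h0
    hs, ys = [], []
    for x_t in xs:
        # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        # y_t = W_hy h_t + b_y
        y = W_hy @ h + b_y
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

# Example usage with arbitrary sizes (assumed for the demo).
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
xs = rng.normal(size=(T, input_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)
hs, ys = rnn_forward(xs, np.zeros(hidden_dim), W_xh, W_hh, W_hy, b, b_y)
print(hs.shape, ys.shape)
```

The same weights are reused at every time step; only the hidden state `h` changes as the loop advances.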
LSTM
What it is:
- RNN with memory cells
- Can remember long-term dependencies
- Uses gates (forget, input, output)
Key Components:
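- Forget gate: How much of the previous cell state to keep
- Input gate: How much new information to write into the cell state
- Output gate: How much of the cell state to expose as the hidden state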
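The three gates can be put together in a single time step. This is a minimal sketch of the standard LSTM cell, not the contents of rnn_lstm_code.py. It assumes one stacked weight matrix `W` applied to the concatenation of the previous hidden state and the input, with rows ordered forget, input, candidate, output; that layout and the names `lstm_step` and `sigmoid` are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden_dim, hidden_dim + input_dim) and stacks the
    forget, input, candidate, and output weights; b has shape (4 * hidden_dim,).
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])            # forget gate: how much of c_prev to keep
    i = sigmoid(z[H:2 * H])        # input gate: how much new content to write
    g = np.tanh(z[2 * H:3 * H])    # candidate values to write
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much of the cell to expose
    c = f * c_prev + i * g         # additive cell-state update
    h = o * np.tanh(c)             # new hidden state
    return h, c

# Example usage: run the cell over a short random sequence (sizes assumed).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)
h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Note that the cell state `c` is carried forward by an elementwise multiply and add, while the hidden state `h` is a gated view of it.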
Industry-Standard Boilerplate Code
See detailed files for complete implementations:
- rnn_lstm_code.py: Simple, precise implementations
- rnn_lstm_explanations.md: Key concepts explained
Exercises
- Implement RNN from scratch
- Implement LSTM from scratch
- Compare RNN vs LSTM
- Explain the vanishing gradient problem and why it arises in RNNs
Next Steps
- Review transformers (which replaced RNNs/LSTMs in many NLP tasks)
- Understand attention mechanism
- Explore modern sequence models