Topic 31: Neural Networks from Scratch

🔥 For interviews, read these first:

NEURAL_NETWORKS_DEEP_DIVE.md — frontier-lab interview deep dive: MLP fundamentals, universal approximation, activations (ReLU/GELU/SwiGLU), loss-activation pairing, backpropagation derivation, He/Xavier init, vanishing/exploding gradients, residual connections, modern training tricks.

INTERVIEW_GRILL.md — 60 active-recall questions.

What You'll Learn

This topic teaches you to build and train neural networks from scratch:

Forward pass
Backpropagation (detailed mathematical explanation)
Activation functions
Loss functions
Training loop
Simple implementations in pure Python/NumPy

Why We Need This

Interview Importance

Common question: "Implement backpropagation from scratch"
Deep understanding: Shows you understand fundamentals
Foundation: All deep learning builds on this

Real-World Application

Understanding: Know how neural networks work internally
Debugging: Understand what's happening during training
Customization: Build custom architectures

Core Intuition

A neural network is a composition of learnable functions that gradually transforms inputs into representations useful for prediction.

The two core algorithmic ideas are:

forward pass computes outputs
backpropagation computes how each parameter affected the loss

Forward Pass

A forward pass repeats two steps at every layer:

linear transform
nonlinearity

Without nonlinearities, a deep network would collapse to a single linear transformation.
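
The collapse can be checked numerically. The sketch below is illustrative only: the layer sizes, the random seed, and the names W_combined and b_combined are assumptions, not part of this topic's code. It stacks two linear layers with no activation between them and compares the result with a single equivalent linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 input features, 5 hidden units, 3 outputs.
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
x = rng.normal(size=4)

# Two stacked linear layers with no activation in between.
deep_output = W2 @ (W1 @ x + b1) + b2

# The same map written as one linear layer:
# W2 @ (W1 @ x + b1) + b2 == (W2 @ W1) @ x + (W2 @ b1 + b2)
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
single_output = W_combined @ x + b_combined

print(np.allclose(deep_output, single_output))
```

The two outputs agree up to floating-point rounding, so the extra layer adds parameters but no extra expressive power.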

Backpropagation

Backpropagation is the chain rule applied repeatedly, from the loss back to each parameter.

It tells each layer how changing its outputs would affect the final loss, and from that computes parameter gradients.

Technical Details Interviewers Often Want

Why Activations Matter

Nonlinear activations are what allow deep networks to model nonlinear patterns.

Why Gradients Vanish or Explode

Backprop multiplies many derivatives across depth.

If those derivatives are consistently:

too small -> gradients vanish
too large -> gradients explode
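
A rough way to see the effect is to pretend every layer scales the gradient by the same factor. This is a deliberate simplification (real networks multiply by different Jacobians at each layer), and gradient_scale is a hypothetical helper. The factor 0.25 is the largest value the sigmoid derivative can take.

```python
def gradient_scale(per_layer_factor, depth):
    """Gradient magnitude after being scaled by the same factor at every layer."""
    return per_layer_factor ** depth

# 0.25: best case for a sigmoid derivative; 1.0: perfectly preserved; 1.5: growing.
for factor in (0.25, 1.0, 1.5):
    print(f"factor={factor}: scale after 50 layers = {gradient_scale(factor, 50):.3e}")
```

Only a factor close to 1 keeps the gradient usable across 50 layers, which is one way to motivate careful initialization and residual connections.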

Why Shape Tracking Matters

In interviews, shape mistakes are often the real bug, not the calculus.

You need to know both the derivative logic and the tensor shapes.
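
One habit that helps is asserting shapes while deriving. The sketch below assumes column vectors and hypothetical layer sizes. It leaves out the activations, since they never change a shape, and reuses z1 as a stand-in for h1 for the same reason.

```python
import numpy as np

n_features, n_hidden, n_outputs = 4, 5, 3  # hypothetical sizes

x = np.zeros((n_features, 1))              # column vector
W1 = np.zeros((n_hidden, n_features))
b1 = np.zeros((n_hidden, 1))
W2 = np.zeros((n_outputs, n_hidden))
b2 = np.zeros((n_outputs, 1))

# Forward shapes (activations omitted: they are elementwise).
z1 = W1 @ x + b1
z2 = W2 @ z1 + b2
assert z1.shape == (n_hidden, 1)
assert z2.shape == (n_outputs, 1)

# Backward shapes: every gradient has the shape of the thing it differentiates.
dz2 = np.ones_like(z2)   # placeholder for dL/dz2
dW2 = dz2 @ z1.T         # (n_outputs, 1) @ (1, n_hidden)
dh1 = W2.T @ dz2         # (n_hidden, n_outputs) @ (n_outputs, 1)
assert dW2.shape == W2.shape
assert dh1.shape == z1.shape
```

If a candidate formula cannot produce a gradient with the same shape as its parameter, the formula is wrong, whatever the calculus says.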

Common Failure Modes

forgetting that no nonlinearity means the model stays linear
getting matrix dimensions wrong
deriving gradients mechanically without understanding dependencies
ignoring activation saturation
mismatching output activation and loss

Edge Cases and Follow-Up Questions

Why would a deep network without nonlinearities still be linear?
Why do gradients vanish or explode?
Why is backprop really just repeated chain rule?
Why does activation choice affect optimization?
Why should output-layer activation match the task?

What to Practice Saying Out Loud

The role of nonlinearity in neural networks
Why backpropagation works conceptually
Why shape tracking is part of the derivation

Detailed Theory

Forward Pass

Mathematical Formulation:

For a simple 2-layer neural network:

Input: x (n_features,)
Layer 1: h1 = activation(W1 @ x + b1)
Layer 2: h2 = activation(W2 @ h1 + b2)
Output: y = h2

Step-by-step:

Input layer: Raw features x
Hidden layer 1:
- Linear transformation: z1 = W1 @ x + b1
- Apply activation: h1 = σ(z1) where σ is activation function
Layer 2 (output layer):
- Linear transformation: z2 = W2 @ h1 + b2
- Apply activation: h2 = σ(z2)
Output: Final prediction y = h2
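
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the contents of neural_network.py: it assumes sigmoid for σ in both layers, and the function name forward, the layer sizes, and the seed are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of the 2-layer network; returns every intermediate value."""
    z1 = W1 @ x + b1   # linear transformation, layer 1
    h1 = sigmoid(z1)   # activation, layer 1
    z2 = W2 @ h1 + b2  # linear transformation, layer 2
    h2 = sigmoid(z2)   # activation, layer 2 (the prediction y)
    return z1, h1, z2, h2

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # 3 input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 2 outputs

z1, h1, z2, h2 = forward(x, W1, b1, W2, b2)
print(h2)
```

Returning the intermediates, and not just h2, is a design choice that pays off later: backpropagation needs z1, h1, and z2 again.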

Why activation functions?

Without activations, a neural network is just a linear transformation
Activations introduce non-linearity
Non-linearity lets the network learn complex patterns

Backpropagation (Detailed Explanation)

Backpropagation is the algorithm that computes the gradient of the loss with respect to every parameter.

Mathematical Foundation:

We want to compute: ∂L/∂W and ∂L/∂b for all layers

Chain Rule: If y = f(g(x)), then dy/dx = (df/dg) × (dg/dx)
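
As a quick sanity check of the rule, the sketch below picks a hypothetical pair of functions, g(x) = x² and f(u) = sin(u), and compares the chain-rule derivative with a central finite difference.

```python
import math

def g(x):
    return x ** 2

def f(u):
    return math.sin(u)

def dy_dx(x):
    # (df/dg) evaluated at g(x), times (dg/dx)
    return math.cos(g(x)) * 2 * x

x0 = 1.3
eps = 1e-6
numeric = (f(g(x0 + eps)) - f(g(x0 - eps))) / (2 * eps)
print(dy_dx(x0), numeric)
```

The two values should agree to several decimal places. The same finite-difference trick is a standard way to test a full backpropagation implementation.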

Step-by-step Backpropagation:

Step 1: Forward Pass

x → z1 = W1 @ x + b1 → h1 = σ(z1) → z2 = W2 @ h1 + b2 → h2 = σ(z2) → L

Step 2: Compute Output Layer Gradients

For output layer (layer 2):

Loss gradient w.r.t. output: ∂L/∂h2
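This depends on the loss function (e.g., MSE averaged over n outputs: ∂L/∂h2 = (2/n)(h2 - y_true))
Gradient w.r.t. z2: ∂L/∂z2 = (∂L/∂h2) × σ'(z2), where × is the elementwise product
Gradient w.r.t. W2: ∂L/∂W2 = (∂L/∂z2) @ h1^T (treating vectors as columns, so the result has the same shape as W2)
Gradient w.r.t. b2: ∂L/∂b2 = ∂L/∂z2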

Step 3: Backpropagate to Hidden Layer

For hidden layer (layer 1):

Gradient w.r.t. h1: ∂L/∂h1 = W2^T @ (∂L/∂z2)
Gradient w.r.t. z1: ∂L/∂z1 = (∂L/∂h1) × σ'(z1) (elementwise product)
Gradient w.r.t. W1: ∂L/∂W1 = (∂L/∂z1) @ x^T
Gradient w.r.t. b1: ∂L/∂b1 = ∂L/∂z1
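
Steps 1 to 3 can be sketched for a single example as below. This is an illustration under stated assumptions: σ is sigmoid, the loss is MSE averaged over the outputs (hence the factor 2/n), and vectors are 1-D, so the outer products from the formulas are written with np.outer. The function name loss_and_grads, the sizes, and the seed are hypothetical. A finite-difference check on one weight follows the function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, y_true, W1, b1, W2, b2):
    # Step 1: forward pass
    z1 = W1 @ x + b1
    h1 = sigmoid(z1)
    z2 = W2 @ h1 + b2
    h2 = sigmoid(z2)
    loss = np.mean((h2 - y_true) ** 2)

    # Step 2: output layer gradients
    dh2 = 2.0 * (h2 - y_true) / y_true.size  # dL/dh2 for mean squared error
    dz2 = dh2 * h2 * (1.0 - h2)              # sigmoid'(z2) = h2 * (1 - h2)
    dW2 = np.outer(dz2, h1)                  # (dL/dz2) @ h1^T
    db2 = dz2

    # Step 3: backpropagate to the hidden layer
    dh1 = W2.T @ dz2
    dz1 = dh1 * h1 * (1.0 - h1)
    dW1 = np.outer(dz1, x)                   # (dL/dz1) @ x^T
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y_true = np.array([0.0, 1.0])
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

loss, (dW1, db1, dW2, db2) = loss_and_grads(x, y_true, W1, b1, W2, b2)

# Finite-difference check on a single entry of W1.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_and_grads(x, y_true, W1_plus, b1, W2, b2)[0]
           - loss_and_grads(x, y_true, W1_minus, b1, W2, b2)[0]) / (2 * eps)
print(abs(numeric - dW1[0, 0]))
```

The printed difference should be tiny. A large value usually points to a missing transpose or a missing activation derivative.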

Why it's called "backpropagation":

Gradients flow backwards from output to input
Each layer reuses the gradient already computed for the layer after it
Computationally efficient: one forward pass plus one backward pass yields the gradients for every parameter

Activation Functions

Sigmoid:

Formula: σ(x) = 1 / (1 + e^(-x))
Range: (0, 1)
Derivative: σ'(x) = σ(x)(1 - σ(x))
Use: Output layer for binary classification
Problem: Vanishing gradients (derivative → 0 for large |x|)

ReLU (Rectified Linear Unit):

Formula: ReLU(x) = max(0, x)
Derivative: 1 if x > 0, else 0
Use: Hidden layers (most common)
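Advantage: Mitigates vanishing gradients, because the derivative is exactly 1 for every positive input and does not saturate there
Problem: Dead ReLU (a unit whose pre-activation is negative for every training input outputs 0, receives zero gradient, and stops updating)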

Tanh:

Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Range: (-1, 1)
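Derivative: tanh'(x) = 1 - tanh²(x)
Use: Hidden layers (outputs are centered around 0)
Compared with sigmoid: larger gradients near 0 (tanh'(0) = 1 vs. σ'(0) = 0.25), but it still saturates for large |x|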
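
The three activations and their derivatives can be written in a few lines of NumPy. This is a minimal sketch: the *_grad function names are hypothetical, and the ReLU derivative at exactly 0 is set to 0 by convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 where x > 0, else 0 (the value at x == 0 is a convention).
    return (x > 0).astype(float)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

xs = np.array([-10.0, 0.0, 10.0])
print("sigmoid'", sigmoid_grad(xs))  # near 0 at both extremes: saturation
print("relu'   ", relu_grad(xs))
print("tanh'   ", tanh_grad(xs))     # also near 0 at both extremes
```

Evaluating the derivatives at large |x| makes saturation visible: sigmoid and tanh gradients almost vanish there, while the ReLU gradient stays at 1 for positive inputs.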

Loss Functions

Mean Squared Error (MSE):

Formula: L = (1/n) Σ(y_pred - y_true)²
Use: Regression
Derivative: ∂L/∂y_pred = (2/n)(y_pred - y_true)

Cross-Entropy:

Formula: L = -Σ y_true × log(y_pred)
Use: Classification
Derivative: ∂L/∂y_pred = -y_true / y_pred (elementwise); with a softmax output layer this simplifies to ∂L/∂z = y_pred - y_true
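
Pairing cross-entropy with a softmax output is the usual reason the output activation and the loss should match: the combined gradient with respect to the logits reduces to y_pred - y_true. The sketch below assumes a softmax output layer, which this section does not define, and uses hypothetical logits. The small eps inside the log is only a guard against log(0).

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)  # subtracting the max avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum()

def cross_entropy(y_pred, y_true, eps=1e-12):
    return -np.sum(y_true * np.log(y_pred + eps))

z = np.array([2.0, 1.0, 0.1])       # hypothetical logits
y_true = np.array([1.0, 0.0, 0.0])  # one-hot label

y_pred = softmax(z)
loss = cross_entropy(y_pred, y_true)

# Gradient of the loss w.r.t. the logits for softmax + cross-entropy.
grad_z = y_pred - y_true
print(loss, grad_z)
```

Because the division by y_pred cancels analytically, this combined form avoids the numerical trouble of computing -y_true / y_pred when a predicted probability is close to 0.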

Industry-Standard Boilerplate Code

See neural_network.py for the complete implementation.
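
That file is not reproduced here. For orientation only, a training loop might look like the sketch below, which is not taken from neural_network.py. It assumes full-batch gradient descent on XOR, a tanh hidden layer, a sigmoid output, and MSE loss. The hidden size, learning rate, step count, and seed are hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data stored column-wise so that W @ X matches the W @ x convention.
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])  # shape (2, 4): 2 features, 4 examples
Y = np.array([[0.0, 1.0, 1.0, 0.0]])  # shape (1, 4)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
learning_rate = 0.5  # hypothetical value
losses = []

for step in range(2000):
    # Forward pass
    Z1 = W1 @ X + b1
    H1 = np.tanh(Z1)
    Z2 = W2 @ H1 + b2
    H2 = sigmoid(Z2)
    losses.append(np.mean((H2 - Y) ** 2))

    # Backward pass (all gradients use the pre-update weights)
    dZ2 = 2.0 * (H2 - Y) / Y.size * H2 * (1.0 - H2)
    dW2 = dZ2 @ H1.T
    db2 = dZ2.sum(axis=1, keepdims=True)  # bias gradient sums over examples
    dZ1 = (W2.T @ dZ2) * (1.0 - H1 ** 2)  # tanh'(Z1) = 1 - H1^2
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)

    # Gradient descent update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The loss should fall over the run. Whether the network fully solves XOR depends on the initialization and hyperparameters, which is worth experimenting with in the exercises.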

Exercises

Implement forward pass
Implement backpropagation
Train on simple dataset
Visualize training process

Next Steps

Topic 32: Isolation Forest and Anomaly Detection
Review neural network fundamentals

ML & LLM Interview Prep — Deep Dives