Linear Regression: Detailed Derivation with Intuitive Explanations
Overview
This document provides a complete, intuitive derivation of linear regression from first principles. We'll build understanding step-by-step, explaining not just the math, but why it works.
Intuitive Understanding
What is Linear Regression?
Simple Explanation: Linear regression finds the best straight line that fits your data points.
Example:
- You have data: (house size, house price)
- You want to predict: Given a new house size, what's the price?
- Linear regression finds: price = a × size + b (a line!)
Visual:
Price ($)
|
| * (data point)
| *
| *
|*_________________ (fitted line)
|
+------------------- Size (sq ft)
Why "Linear"?
Linear means:
- Relationship is a straight line (not curved)
- Formula: y = mx + b (or y = w₁x + w₀)
- One unit change in x → constant change in y
Example:
- If price = 100 × size + 50000
- Increasing size by 1 sq ft → price increases by $100 (constant!)
Mathematical Setup
Problem Statement
Given:
- Data points: {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
- Model: y = w₁x + w₀ (or y = wx + b)
- Goal: Find w₁ and w₀ that best fit the data
What does "best fit" mean?
- Line should be close to all data points
- Minimize the distance between line and points
- This distance is called "error" or "residual"
Visual Intuition
Error for one point:
* (data point)
|
| error = distance from point to line
|
---*--- (fitted line)
Total error:
- Sum of all errors (squared, to avoid cancellation)
- Want to minimize this total error
Step-by-Step Derivation
Step 1: Define the Model
Linear Model:
ŷᵢ = w₁xᵢ + w₀
Where:
- ŷᵢ: Predicted value for point i
- xᵢ: Input value for point i
- w₁: Slope (weight)
- w₀: Intercept (bias)
Why this form?
- Simple: Just two parameters
- Interpretable: w₁ = how much y changes per unit x
- Flexible: Can fit any linear relationship
Step 2: Define the Error (Residual)
For each data point:
Error for point i = yᵢ - ŷᵢ
= yᵢ - (w₁xᵢ + w₀)
Why yᵢ - ŷᵢ?
- yᵢ = actual value (what we observed)
- ŷᵢ = predicted value (what model says)
- Error = difference between actual and predicted
Visual:
Actual point: (xᵢ, yᵢ) = (5, 10)
Predicted: ŷᵢ = 2×5 + 1 = 11
Error: 10 - 11 = -1 (predicted too high by 1)
Step 3: Define the Cost Function
Problem: We have errors for each point. How do we combine them?
Option 1: Sum of errors
Σ (yᵢ - ŷᵢ)
Problem: Positive and negative errors cancel out!
- Point 1: error = +5
- Point 2: error = -5
- Sum = 0 (looks perfect, but it's not!)
Option 2: Sum of absolute errors
Σ |yᵢ - ŷᵢ|
Problem: Not differentiable everywhere (hard to optimize)
Option 3: Sum of squared errors (MSE)
MSE = (1/n) Σ (yᵢ - ŷᵢ)²
= (1/n) Σ (yᵢ - w₁xᵢ - w₀)²
Why squared?
- Always positive: No cancellation
- Penalizes large errors: Squaring makes big errors much worse
- Differentiable: Smooth function, easy to optimize
- Mathematically nice: Leads to closed-form solution
Visual Intuition:
Small error: 1² = 1
Large error: 5² = 25 (much worse!)
This encourages model to avoid large errors.
Step 4: The Optimization Problem
Goal: Find w₁ and w₀ that minimize MSE
Mathematically:
minimize: MSE(w₁, w₀) = (1/n) Σ (yᵢ - w₁xᵢ - w₀)²
How to solve?
- Take derivatives with respect to w₁ and w₀
- Set derivatives to zero
- Solve for w₁ and w₀
Why derivatives?
- Derivative = slope of function
- At minimum, slope = 0
- So we find where derivative = 0
Step 5: Take Derivatives
Cost function:
MSE = (1/n) Σ (yᵢ - w₁xᵢ - w₀)²
Derivative with respect to w₀ (intercept):
Step-by-step:
∂MSE/∂w₀ = ∂/∂w₀ [(1/n) Σ (yᵢ - w₁xᵢ - w₀)²]
= (1/n) Σ ∂/∂w₀ (yᵢ - w₁xᵢ - w₀)²
= (1/n) Σ 2(yᵢ - w₁xᵢ - w₀) × (-1)
= -(2/n) Σ (yᵢ - w₁xᵢ - w₀)
= -(2/n) [Σ yᵢ - w₁Σ xᵢ - nw₀]
Set to zero:
-(2/n) [Σ yᵢ - w₁Σ xᵢ - nw₀] = 0
Σ yᵢ - w₁Σ xᵢ - nw₀ = 0
nw₀ = Σ yᵢ - w₁Σ xᵢ
w₀ = (1/n)Σ yᵢ - w₁(1/n)Σ xᵢ
w₀ = ȳ - w₁x̄
Where:
- ȳ = (1/n)Σ yᵢ (mean of y)
- x̄ = (1/n)Σ xᵢ (mean of x)
Intuition:
- Intercept = mean of y - slope × mean of x
- Line passes through point (x̄, ȳ)!
- This makes sense: line should go through center of data
Derivative with respect to w₁ (slope):
Step-by-step:
∂MSE/∂w₁ = ∂/∂w₁ [(1/n) Σ (yᵢ - w₁xᵢ - w₀)²]
= (1/n) Σ ∂/∂w₁ (yᵢ - w₁xᵢ - w₀)²
= (1/n) Σ 2(yᵢ - w₁xᵢ - w₀) × (-xᵢ)
= -(2/n) Σ xᵢ(yᵢ - w₁xᵢ - w₀)
= -(2/n) [Σ xᵢyᵢ - w₁Σ xᵢ² - w₀Σ xᵢ]
Set to zero:
-(2/n) [Σ xᵢyᵢ - w₁Σ xᵢ² - w₀Σ xᵢ] = 0
Σ xᵢyᵢ - w₁Σ xᵢ² - w₀Σ xᵢ = 0
Substitute w₀ = ȳ - w₁x̄:
Σ xᵢyᵢ - w₁Σ xᵢ² - (ȳ - w₁x̄)Σ xᵢ = 0
Σ xᵢyᵢ - w₁Σ xᵢ² - ȳΣ xᵢ + w₁x̄Σ xᵢ = 0
Σ xᵢyᵢ - ȳΣ xᵢ = w₁(Σ xᵢ² - x̄Σ xᵢ)
w₁ = (Σ xᵢyᵢ - ȳΣ xᵢ) / (Σ xᵢ² - x̄Σ xᵢ)
Simplify: Note that:
- Σ xᵢ = nx̄
- Σ xᵢ² - x̄Σ xᵢ = Σ xᵢ² - nx̄² = Σ(xᵢ - x̄)²
- Σ xᵢyᵢ - ȳΣ xᵢ = Σ xᵢyᵢ - nȳx̄ = Σ(xᵢ - x̄)(yᵢ - ȳ)
Final formula:
w₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
Intuition:
- Numerator: Covariance (how x and y vary together)
- Denominator: Variance of x (how x varies)
- Slope = covariance / variance
- If x and y vary together → positive slope
- If x increases, y increases → positive slope
Step 6: Final Solution
Optimal parameters:
w₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
w₀ = ȳ - w₁x̄
This is the closed-form solution!
Why it works:
- Minimizes sum of squared errors
- Unique solution (if data not all same x)
- Best possible line (in least squares sense)
Matrix Formulation (Multiple Features)
Setup
Multiple features:
- Input: x = [x₁, x₂, ..., xₚ] (p features)
- Model: ŷ = w₀ + w₁x₁ + w₂x₂ + ... + wₚxₚ
Matrix notation:
X = [1 x₁₁ x₁₂ ... x₁ₚ] (design matrix)
[1 x₂₁ x₂₂ ... x₂ₚ]
[...]
[1 xₙ₁ xₙ₂ ... xₙₚ]
w = [w₀] (weights)
[w₁]
[...]
[wₚ]
y = [y₁] (targets)
[y₂]
[...]
[yₙ]
Model:
ŷ = Xw
Cost function:
MSE = (1/n) ||y - Xw||²
= (1/n) (y - Xw)ᵀ(y - Xw)
Derivation
Expand:
MSE = (1/n) (yᵀy - 2yᵀXw + wᵀXᵀXw)
Take derivative:
∂MSE/∂w = (1/n) (-2Xᵀy + 2XᵀXw)
= (2/n) (XᵀXw - Xᵀy)
Set to zero:
(2/n) (XᵀXw - Xᵀy) = 0
XᵀXw = Xᵀy
w = (XᵀX)⁻¹Xᵀy
This is the normal equation!
Intuition:
- XᵀX: Correlation matrix (how features relate)
- Xᵀy: Correlation between features and target
- (XᵀX)⁻¹: Invert to solve for weights
- Result: Optimal weights that minimize error
Why This Works: Geometric Intuition
Projection Interpretation
Linear regression = projection onto column space of X
Visual:
y (actual vector)
|
| /
| / error = y - ŷ
|/
*---ŷ (projection onto column space)
|
Column space of X
What this means:
- Find point in column space closest to y
- This point is ŷ = Xw
- Error is minimized (perpendicular distance)
Why Least Squares?
Geometric reason:
- Perpendicular distance is shortest
- This is exactly what squared error measures
- Pythagorean theorem: distance² = sum of squared differences
Assumptions and When They Break
Assumptions
1. Linearity:
- Relationship is linear
- Breaks when: Curved relationships (use polynomial regression)
2. Independence:
- Errors are independent
- Breaks when: Time series, repeated measurements
3. Homoscedasticity:
- Constant variance of errors
- Breaks when: Variance changes with x (use weighted regression)
4. Normality:
- Errors are normally distributed
- Breaks when: Outliers, skewed errors
5. No multicollinearity:
- Features not highly correlated
- Breaks when: Redundant features (use regularization)
What Happens When Assumptions Break?
Non-linearity:
- Model can't capture relationship
- Solution: Polynomial features, transformations
Heteroscedasticity:
- Predictions less reliable in some regions
- Solution: Weighted least squares
Outliers:
- One bad point can skew entire line
- Solution: Robust regression, remove outliers
Gradient Descent Alternative
Why Gradient Descent?
Normal equation problems:
- (XᵀX)⁻¹ expensive to compute (O(p³))
- Doesn't work if XᵀX not invertible
- Happens with perfect multicollinearity (one feature is linear combination of others)
- Happens with duplicate/redundant features or dummy variable trap (all one-hot columns + intercept)
- Happens when features > samples (p > n), so full column rank is impossible
- Practical fixes: drop collinear columns, use pseudo-inverse
X⁺(np.linalg.pinv), or use Ridge regression
- Doesn't scale to very large datasets
Gradient descent:
- Iterative method
- Works for any dataset size
- Can handle non-invertible cases
Algorithm
Initialize: w = random or zeros
Repeat until convergence:
w = w - α × ∂MSE/∂w
Where:
- α: Learning rate
- ∂MSE/∂w: Gradient (direction of steepest increase)
- We move opposite to gradient (steepest decrease)
Gradient:
∂MSE/∂w = (2/n) Xᵀ(Xw - y)
Update rule:
w = w - α × (2/n) Xᵀ(Xw - y)
Intuition:
- Compute gradient (slope)
- Move in direction that decreases error
- Small steps (learning rate)
- Eventually reach minimum
Visual:
Error
|
| /\
| / \
| / \
| / \
|/ \
+---------- w
Start Minimum
Summary
Key Insights:
- Goal: Find line that minimizes sum of squared errors
- Solution: Closed-form (normal equation) or gradient descent
- Intuition: Line passes through center, slope = covariance/variance
- Geometric: Projection onto column space
- Assumptions: Linearity, independence, constant variance, normality
Formulas:
Simple: w₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
w₀ = ȳ - w₁x̄
Matrix: w = (XᵀX)⁻¹Xᵀy
Why it works:
- Minimizes squared error (geometrically optimal)
- Unique solution (if data not degenerate)
- Interpretable (slope, intercept have meaning)