Logistic Regression: Detailed Derivation with Intuitive Explanations
Overview
This document provides a complete, intuitive derivation of logistic regression from first principles. We'll explain why we need it, how it works, and derive everything step-by-step.
Intuitive Understanding
Why Logistic Regression?
Problem with Linear Regression for Classification:
- Linear regression: y = wx + b (can be any value)
- Classification: y should be 0 or 1 (probability)
- Linear regression can give negative values or > 1 (not probabilities!)
Example:
- Predict: Will it rain? (yes/no)
- Linear regression: Might give 1.5 or -0.3 (not valid probabilities)
- Need: Values between 0 and 1 (probabilities)
Solution:
- Use sigmoid function to squash output to [0, 1]
- This is logistic regression!
What is Logistic Regression?
Simple Explanation: Logistic regression predicts probabilities (0 to 1) instead of continuous values.
Formula:
P(y=1|x) = 1 / (1 + e^(-(wx + b)))
Where:
- P(y=1|x): Probability that y=1 given x
- w, b: Parameters to learn
- e: Euler's number (~2.718)
Visual:
Probability
1.0 | *---* (sigmoid curve)
| *
| *
| *
0.5 | *
|*
0.0 |*_________________ x
S-shaped curve (sigmoid)
Key Property:
- Always between 0 and 1
- Smooth, differentiable
- S-shaped (sigmoid)
Step-by-Step Derivation
Step 1: The Problem
Given:
- Data: {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
- yᵢ ∈ {0, 1} (binary classification)
- Goal: Predict P(y=1|x) for new x
Why not use linear regression?
- Linear: ŷ = wx + b (can be < 0 or > 1)
- Need: P(y=1|x) ∈ [0, 1]
Step 2: The Odds Ratio
Intuition: Instead of probability, think about "odds"
Odds:
Odds = P / (1 - P)
Example:
- P = 0.8 → Odds = 0.8 / 0.2 = 4 (4:1 in favor)
- P = 0.5 → Odds = 0.5 / 0.5 = 1 (1:1, even)
- P = 0.2 → Odds = 0.2 / 0.8 = 0.25 (1:4 against)
Why odds?
- Odds can be any positive number (0 to ∞)
- Can model with linear function!
Log Odds (Logit):
log(Odds) = log(P / (1 - P))
Why log?
- Odds: [0, ∞)
- Log odds: (-∞, ∞) (unbounded!)
- Can use linear function: log(Odds) = wx + b
Visual:
Probability → Odds → Log Odds
0.1 → 0.11 → -2.2
0.5 → 1.0 → 0.0
0.9 → 9.0 → 2.2
Step 3: The Model
Start with log odds:
log(P / (1 - P)) = wx + b
Solve for P:
P / (1 - P) = e^(wx + b)
P = (1 - P) × e^(wx + b)
P = e^(wx + b) - P × e^(wx + b)
P + P × e^(wx + b) = e^(wx + b)
P(1 + e^(wx + b)) = e^(wx + b)
P = e^(wx + b) / (1 + e^(wx + b))
P = 1 / (1 + e^(-(wx + b)))
This is the sigmoid function!
Notation:
σ(z) = 1 / (1 + e^(-z)) (sigmoid)
So: P(y=1|x) = σ(wx + b)
Step 4: Why Sigmoid?
Properties of sigmoid:
- Bounded: Always between 0 and 1
- Smooth: Differentiable everywhere
- Symmetric: σ(-z) = 1 - σ(z)
- Monotonic: Increases with z
Visual:
σ(z) = 1 / (1 + e^(-z))
z → -∞: σ(z) → 0
z = 0: σ(z) = 0.5
z → +∞: σ(z) → 1
Why this shape?
- When wx + b is very negative → P ≈ 0 (very unlikely)
- When wx + b = 0 → P = 0.5 (even chance)
- When wx + b is very positive → P ≈ 1 (very likely)
Step 5: The Likelihood Function
For binary classification:
- If y = 1: Want P(y=1|x) to be high
- If y = 0: Want P(y=0|x) = 1 - P(y=1|x) to be high
Combined:
P(y|x) = P(y=1|x)^y × (1 - P(y=1|x))^(1-y)
Why?
- If y = 1: P(y=1|x)^1 × (1 - P(y=1|x))^0 = P(y=1|x) ✓
- If y = 0: P(y=1|x)^0 × (1 - P(y=1|x))^1 = 1 - P(y=1|x) ✓
For all data points (assuming independence):
L(w, b) = ∏ᵢ P(yᵢ|xᵢ)
= ∏ᵢ [P(y=1|xᵢ)^yᵢ × (1 - P(y=1|xᵢ))^(1-yᵢ)]
This is the likelihood function!
Intuition:
- Likelihood = probability of observing data given parameters
- Higher likelihood = better fit
- Want to maximize likelihood
Step 6: Log-Likelihood
Why log?
- Products become sums (easier!)
- More numerically stable
- Same maximum (log is monotonic)
Log-likelihood:
log L(w, b) = log ∏ᵢ [P(y=1|xᵢ)^yᵢ × (1 - P(y=1|xᵢ))^(1-yᵢ)]
= Σᵢ log [P(y=1|xᵢ)^yᵢ × (1 - P(y=1|xᵢ))^(1-yᵢ)]
= Σᵢ [yᵢ log P(y=1|xᵢ) + (1-yᵢ) log(1 - P(y=1|xᵢ))]
Substitute P(y=1|xᵢ) = σ(wxᵢ + b):
log L(w, b) = Σᵢ [yᵢ log σ(wxᵢ + b) + (1-yᵢ) log(1 - σ(wxᵢ + b))]
This is what we want to maximize!
Step 7: Cost Function (Negative Log-Likelihood)
Convention:
- Maximize likelihood = Minimize negative log-likelihood
- This is the cost function
Cost function:
J(w, b) = -log L(w, b)
= -Σᵢ [yᵢ log σ(wxᵢ + b) + (1-yᵢ) log(1 - σ(wxᵢ + b))]
This is cross-entropy loss!
Intuition:
-
If y = 1: Want σ(wx + b) close to 1
- Cost = -log(σ(wx + b))
- If σ(wx + b) = 1 → cost = 0 ✓
- If σ(wx + b) = 0.1 → cost = -log(0.1) = 2.3 (high cost!)
-
If y = 0: Want σ(wx + b) close to 0
- Cost = -log(1 - σ(wx + b))
- If σ(wx + b) = 0 → cost = 0 ✓
- If σ(wx + b) = 0.9 → cost = -log(0.1) = 2.3 (high cost!)
Visual:
Cost
|
|\ (y=0)
| \ /
| \ /
| *
| / \
| / \ (y=1)
|/ \
+------- Probability
0 1
Step 8: Gradient Descent
Goal: Minimize J(w, b)
Method: Gradient descent
Gradient with respect to w:
∂J/∂w = ∂/∂w [-Σᵢ [yᵢ log σ(wxᵢ + b) + (1-yᵢ) log(1 - σ(wxᵢ + b))]]
= -Σᵢ [yᵢ × (1/σ(wxᵢ + b)) × ∂σ/∂w + (1-yᵢ) × (1/(1-σ(wxᵢ + b))) × ∂(1-σ)/∂w]
Key derivative:
∂σ(z)/∂z = σ(z)(1 - σ(z))
Why?
σ(z) = 1 / (1 + e^(-z))
∂σ/∂z = e^(-z) / (1 + e^(-z))²
= [1 / (1 + e^(-z))] × [e^(-z) / (1 + e^(-z))]
= σ(z) × [1 - σ(z)]
Continue gradient:
∂J/∂w = -Σᵢ [yᵢ × (1/σ(wxᵢ + b)) × σ(wxᵢ + b)(1 - σ(wxᵢ + b)) × xᵢ
+ (1-yᵢ) × (1/(1-σ(wxᵢ + b))) × (-σ(wxᵢ + b)(1 - σ(wxᵢ + b))) × xᵢ]
= -Σᵢ [yᵢ × (1 - σ(wxᵢ + b)) × xᵢ - (1-yᵢ) × σ(wxᵢ + b) × xᵢ]
= -Σᵢ [yᵢxᵢ - yᵢσ(wxᵢ + b)xᵢ - (1-yᵢ)σ(wxᵢ + b)xᵢ]
= -Σᵢ [yᵢxᵢ - σ(wxᵢ + b)xᵢ]
= Σᵢ [σ(wxᵢ + b) - yᵢ] × xᵢ
Similarly for b:
∂J/∂b = Σᵢ [σ(wxᵢ + b) - yᵢ]
Update rules:
w = w - α × ∂J/∂w
= w - α × Σᵢ [σ(wxᵢ + b) - yᵢ] × xᵢ
b = b - α × ∂J/∂b
= b - α × Σᵢ [σ(wxᵢ + b) - yᵢ]
Intuition:
- Error = σ(wx + b) - y (predicted - actual)
- If error > 0: Predicted too high → decrease w, b
- If error < 0: Predicted too low → increase w, b
- Update proportional to error and input x
Matrix Formulation
Setup
Multiple features:
X = [1 x₁₁ x₁₂ ... x₁ₚ] (design matrix with bias column)
[1 x₂₁ x₂₂ ... x₂ₚ]
[...]
[1 xₙ₁ xₙ₂ ... xₙₚ]
w = [w₀] (weights including bias)
[w₁]
[...]
[wₚ]
y = [y₁] (targets)
[y₂]
[...]
[yₙ]
Model:
P(y=1|X) = σ(Xw)
Cost:
J(w) = -Σᵢ [yᵢ log σ(Xᵢw) + (1-yᵢ) log(1 - σ(Xᵢw))]
Gradient:
∂J/∂w = Xᵀ(σ(Xw) - y)
Update:
w = w - α × Xᵀ(σ(Xw) - y)
Why Cross-Entropy Loss?
Information Theory Perspective
Cross-entropy:
- Measures difference between two distributions
- True distribution: [y, 1-y] (one-hot)
- Predicted distribution: [σ(wx+b), 1-σ(wx+b)]
Minimizing cross-entropy:
- Makes predicted distribution close to true
- Equivalent to maximum likelihood
- Information-theoretically optimal
Comparison to Other Losses
Mean Squared Error (MSE):
MSE = (1/n) Σ (y - σ(wx + b))²
Problem:
- Not convex for classification
- Can get stuck in local minima
- Doesn't work well for probabilities
Cross-Entropy:
CE = -Σ [y log σ(wx + b) + (1-y) log(1 - σ(wx + b))]
Advantages:
- Convex (guaranteed global minimum)
- Works well with probabilities
- Information-theoretically motivated
Decision Boundary
How Classification Works
After training, we have:
P(y=1|x) = σ(wx + b)
Decision rule:
If P(y=1|x) > 0.5: Predict y = 1
If P(y=1|x) < 0.5: Predict y = 0
What does P(y=1|x) = 0.5 mean?
0.5 = 1 / (1 + e^(-(wx + b)))
1 + e^(-(wx + b)) = 2
e^(-(wx + b)) = 1
-(wx + b) = 0
wx + b = 0
Decision boundary:
- Line: wx + b = 0
- Points on line: P = 0.5 (uncertain)
- Points above line: P > 0.5 (predict class 1)
- Points below line: P < 0.5 (predict class 0)
Visual:
Class 1: * * *
|
| (decision boundary: wx + b = 0)
|
Class 0: * * *
Assumptions
Key Assumptions
1. Linearity in log odds:
- log(P/(1-P)) = wx + b is linear
- Breaks when: Non-linear relationships (use polynomial features)
2. Independence:
- Observations are independent
- Breaks when: Time series, repeated measurements
3. No multicollinearity:
- Features not highly correlated
- Breaks when: Redundant features
4. Large sample size:
- Need sufficient data for stable estimates
- Breaks when: Small datasets
What Happens When Assumptions Break?
Non-linearity:
- Can't capture curved decision boundaries
- Solution: Polynomial features, feature engineering
Multicollinearity:
- Unstable coefficients
- Solution: Regularization, feature selection
Regularization
Why Regularize?
Problem:
- Overfitting with many features
- Unstable coefficients
- Poor generalization
Solution:
- Add penalty to cost function
- Shrink coefficients toward zero
L2 Regularization (Ridge)
Cost function:
J(w) = -log L(w) + λ||w||²
Gradient:
∂J/∂w = Xᵀ(σ(Xw) - y) + 2λw
Effect:
- Shrinks all coefficients
- Prevents overfitting
- More stable
L1 Regularization (Lasso)
Cost function:
J(w) = -log L(w) + λ||w||₁
Effect:
- Sets some coefficients to exactly 0
- Feature selection
- Sparse model
Summary
Key Insights:
- Problem: Need probabilities (0 to 1), not continuous values
- Solution: Use sigmoid to map linear function to [0, 1]
- Derivation: Start with log odds, solve for probability
- Cost: Cross-entropy loss (negative log-likelihood)
- Optimization: Gradient descent (no closed-form solution)
Formulas:
Model: P(y=1|x) = σ(wx + b) = 1 / (1 + e^(-(wx + b)))
Cost: J(w, b) = -Σ [y log σ(wx + b) + (1-y) log(1 - σ(wx + b))]
Gradient: ∂J/∂w = Σ [σ(wx + b) - y] × x
∂J/∂b = Σ [σ(wx + b) - y]
Update: w = w - α × gradient
Why it works:
- Sigmoid ensures probabilities in [0, 1]
- Cross-entropy is optimal for classification
- Gradient descent finds optimal parameters
- Decision boundary is linear in feature space
Intuition:
- Log odds are linear → probabilities are sigmoid
- Maximize likelihood → minimize cross-entropy
- Gradient points to steepest increase → move opposite direction