Topic 1: Classical ML Algorithms

Linear Regression: Step-by-step derivation from first principles
- Why least squares?
- Normal equation derivation
- Geometric intuition
- Gradient descent alternative
- See linear_regression_derivation.md
Logistic Regression: Complete derivation with intuitive explanations
- Why sigmoid function?
- Log odds transformation
- Cross-entropy loss derivation
- Gradient descent
- Decision boundary
- See logistic_regression_derivation.md

🔥 Logistic Regression — Interview-Grade Files (READ THESE FOR INTERVIEWS)

The base derivation file is fine for understanding. But if you're prepping for applied scientist or ML engineer interviews, the bar is much higher. The two files below cover the deep, intuitive, gotcha-style questions that interviewers actually ask:

LOGISTIC_REGRESSION_DEEP_DIVE.md — Frontier-lab interview deep dive: linear log-odds assumption, MLE derivation, why CE not MSE (likelihood + convexity), separability and divergence, IRLS, calibration, multicollinearity, L1/L2 geometry, max entropy, connections to softmax/NB/SVM/neural networks. Read this whole document before any classical-ML interview.
LOGISTIC_REGRESSION_INTERVIEW_GRILL.md — 60 active-recall questions with strong answers. Drill until you can answer 40+ cold.

Core Intuition

Classical ML is still high-value interview material because it exposes the basic logic that more complex models build on.

Linear Regression

Linear regression tries to fit a straight-line or linear relationship between input features and a continuous target.

Easy way to think about it:

each feature gets a weight
the prediction is a weighted sum plus a bias
learning means adjusting those weights to reduce squared error

Why least squares?

large mistakes should hurt more than small mistakes
the math becomes clean and differentiable
the closed-form solution and gradient solution are both easy to explain

Logistic Regression

Logistic regression is not "linear regression but for labels."

The key difference is that it models probability:

P(y = 1 | x) = sigmoid(w^T x + b)

Why sigmoid?

it maps any real number to [0, 1]
it gives a probabilistic interpretation
it connects naturally to Bernoulli likelihood and cross-entropy

KNN

KNN does not learn parameters in the usual sense.

Instead:

store the dataset
compare a new point to stored points
let nearby points vote or average

This is useful in interviews because it helps explain the idea of local similarity very directly.

K-Means

K-means is trying to summarize a dataset using a small number of centroids.

The loop is simple:

assign each point to the nearest centroid
move each centroid to the mean of the assigned points
repeat

The main intuition is that it alternates between:

deciding cluster membership
refining cluster representatives

Technical Details Interviewers Often Care About

Linear Regression

Objective: minimize mean squared error
Gradients: dw = X^T (y_pred - y) / n, db = mean(y_pred - y)
Closed-form solution exists for small well-behaved problems
Sensitive to collinearity and outliers

Logistic Regression

Uses sigmoid plus cross-entropy
Decision boundary is linear in feature space
The gradient simplifies nicely because BCE with sigmoid gives p - y
Still a linear classifier even though the output is probabilistic

KNN

Prediction cost is high because you compare with stored points
Distance metric choice matters
Feature scaling matters a lot
Small k means high variance; large k means high bias

K-Means

Assumes roughly spherical clusters
Sensitive to initialization
Uses Euclidean-style geometry
Empty clusters are a real edge case in implementation

Common Failure Modes

Linear / Logistic Regression

unscaled features slow or distort optimization
collinear features make weights unstable
outliers can dominate the fit
using the wrong loss for the task

KNN

bad feature scaling breaks nearest-neighbor behavior
large datasets make prediction slow
irrelevant features pollute distance

K-Means

non-spherical clusters are handled poorly
wrong number of clusters
bad initialization leads to poor local minima
empty clusters during updates

What the Interviewer May Ask Next

Why does logistic regression use cross-entropy instead of MSE?
Why is KNN called a lazy learner?
Why does feature scaling matter for KNN and K-means?
When would linear regression beat a larger nonlinear model?
What assumptions make linear regression a reasonable model?

What to Practice Saying Out Loud

Why is logistic regression still a linear classifier?
Why does K-means converge, and what does it converge to?
Why can KNN be strong as a baseline but weak at scale?
What is the geometric interpretation of linear regression?

Industry-Standard Boilerplate Code

Linear Regression (From Scratch)

"""
Linear Regression from Scratch
Interview question: "Implement linear regression without using sklearn"
"""
import numpy as np
from typing import Tuple

class LinearRegression:
    """
    Linear Regression: y = w*x + b
    Industry standard implementation pattern
    """
    
    def __init__(self, learning_rate: float = 0.01, n_iterations: int = 1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Train the model using gradient descent
        
        Args:
            X: Features (n_samples, n_features)
            y: Target values (n_samples,)
        """
        n_samples, n_features = X.shape
        
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        self.cost_history = []
        
        # Gradient descent
        for i in range(self.n_iterations):
            # Predictions
            y_pred = X.dot(self.weights) + self.bias
            
            # Compute gradients
            dw = (1/n_samples) * X.T.dot(y_pred - y)
            db = (1/n_samples) * np.sum(y_pred - y)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # Compute cost (MSE)
            cost = (1/(2*n_samples)) * np.sum((y_pred - y)**2)
            self.cost_history.append(cost)
        
        return self
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Make predictions"""
        return X.dot(self.weights) + self.bias
    
    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        """Calculate R² score"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred)**2)
        ss_tot = np.sum((y - np.mean(y))**2)
        return 1 - (ss_res / ss_tot)


# Usage Example
if __name__ == "__main__":
    # Generate sample data
    np.random.seed(42)
    X = np.random.randn(100, 1)
    y = 2 * X.flatten() + 1 + 0.1 * np.random.randn(100)
    
    # Train model
    model = LinearRegression(learning_rate=0.01, n_iterations=1000)
    model.fit(X, y)
    
    # Predict
    predictions = model.predict(X)
    
    print(f"Weights: {model.weights}")
    print(f"Bias: {model.bias}")
    print(f"R² Score: {model.score(X, y):.4f}")

Linear Regression (PyTorch)

"""
Linear Regression using PyTorch
Very simple PyTorch version
"""
import torch
import torch.nn as nn

class LinearRegressionTorch(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)  # Input size 1, output size 1
    
    def forward(self, x):
        return self.linear(x)

# Usage
model = LinearRegressionTorch()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(1000):
    predictions = model(X)
    loss = criterion(predictions, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Logistic Regression (Pure Python/NumPy)

"""
Logistic Regression from Scratch
Interview question: "Implement logistic regression"
"""
import numpy as np

class LogisticRegression:
    """
    Logistic Regression: P(y=1|x) = 1 / (1 + exp(-(w*x + b)))
    """
    
    def __init__(self, learning_rate: float = 0.01, n_iterations: int = 1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []
    
    def _sigmoid(self, z: np.ndarray) -> np.ndarray:
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-np.clip(z, -250, 250)))  # Clip to prevent overflow
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """Train the model"""
        n_samples, n_features = X.shape
        
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        self.cost_history = []
        
        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass
            linear_model = X.dot(self.weights) + self.bias
            y_pred = self._sigmoid(linear_model)
            
            # Compute gradients
            dw = (1/n_samples) * X.T.dot(y_pred - y)
            db = (1/n_samples) * np.sum(y_pred - y)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # Compute cost (log loss)
            cost = -(1/n_samples) * np.sum(
                y * np.log(y_pred + 1e-15) + 
                (1 - y) * np.log(1 - y_pred + 1e-15)
            )
            self.cost_history.append(cost)
        
        return self
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict binary class"""
        linear_model = X.dot(self.weights) + self.bias
        y_pred = self._sigmoid(linear_model)
        return (y_pred >= 0.5).astype(int)
    
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict probabilities"""
        linear_model = X.dot(self.weights) + self.bias
        return self._sigmoid(linear_model)

K-Nearest Neighbors (From Scratch)

"""
K-Nearest Neighbors from Scratch
Interview question: "Implement KNN"
"""
import numpy as np
from collections import Counter

class KNN:
    """
    K-Nearest Neighbors classifier
    """
    
    def __init__(self, k: int = 3):
        self.k = k
        self.X_train = None
        self.y_train = None
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data (lazy learning)"""
        self.X_train = X
        self.y_train = y
        return self
    
    def _euclidean_distance(self, x1: np.ndarray, x2: np.ndarray) -> float:
        """Calculate Euclidean distance"""
        return np.sqrt(np.sum((x1 - x2)**2))
    
    def _get_neighbors(self, x: np.ndarray) -> list:
        """Get k nearest neighbors"""
        distances = []
        for i, x_train in enumerate(self.X_train):
            dist = self._euclidean_distance(x, x_train)
            distances.append((dist, self.y_train[i]))
        
        # Sort by distance
        distances.sort(key=lambda x: x[0])
        
        # Get k nearest
        neighbors = [label for _, label in distances[:self.k]]
        return neighbors
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class for each sample"""
        predictions = []
        for x in X:
            neighbors = self._get_neighbors(x)
            # Majority vote
            most_common = Counter(neighbors).most_common(1)[0][0]
            predictions.append(most_common)
        return np.array(predictions)

K-Means Clustering (From Scratch)

"""
K-Means Clustering from Scratch
Interview question: "Implement K-means"
"""
import numpy as np

class KMeans:
    """
    K-Means clustering algorithm
    """
    
    def __init__(self, k: int = 3, max_iters: int = 100, random_state: int = 42):
        self.k = k
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None
        self.labels = None
    
    def _initialize_centroids(self, X: np.ndarray):
        """Initialize centroids randomly"""
        np.random.seed(self.random_state)
        n_samples, n_features = X.shape
        self.centroids = X[np.random.choice(n_samples, self.k, replace=False)]
    
    def _assign_clusters(self, X: np.ndarray) -> np.ndarray:
        """Assign each point to nearest centroid"""
        distances = np.sqrt(((X - self.centroids[:, np.newaxis])**2).sum(axis=2))
        return np.argmin(distances, axis=0)
    
    def _update_centroids(self, X: np.ndarray, labels: np.ndarray):
        """Update centroids based on cluster assignments"""
        for i in range(self.k):
            cluster_points = X[labels == i]
            if len(cluster_points) > 0:
                self.centroids[i] = cluster_points.mean(axis=0)
    
    def fit(self, X: np.ndarray):
        """Fit K-means to data"""
        self._initialize_centroids(X)
        
        for _ in range(self.max_iters):
            # Assign clusters
            labels = self._assign_clusters(X)
            
            # Update centroids
            old_centroids = self.centroids.copy()
            self._update_centroids(X, labels)
            
            # Check convergence
            if np.allclose(old_centroids, self.centroids):
                break
        
        self.labels = labels
        return self
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict cluster for new data"""
        return self._assign_clusters(X)

Theory

Linear Regression

Goal: Find line that best fits data
Cost function: Mean Squared Error (MSE)
Optimization: Gradient descent
Assumptions: Linear relationship, independence, homoscedasticity

Logistic Regression

Goal: Predict probability of binary outcome
Activation: Sigmoid function
Cost function: Log loss (cross-entropy)
Optimization: Gradient descent

KNN

Lazy learning: No training phase
Distance metric: Usually Euclidean
K value: Trade-off between bias and variance
Time complexity: O(n) for prediction

K-Means

Goal: Partition data into k clusters
Algorithm: Iterative optimization
Initialization: Important for convergence
Limitations: Assumes spherical clusters

Exercises

Implement linear regression with regularization
Add early stopping to logistic regression
Implement weighted KNN
Add k-means++ initialization

Next Steps

Topic 2: Gradient descent variants
Topic 3: Evaluation metrics

ML & LLM Interview Prep — Deep Dives