NLP Basics — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

Classical NLP — n-grams, smoothing, BM25, perplexity, edit distance — predates the deep-learning era but remains interview-relevant. Many production systems still use these methods (BM25 in retrieval, edit distance for spell check, Laplace smoothing in counting models). And the concepts (perplexity, smoothing, Zipf) carry through to modern LLMs.

1. Bag of words and TF-IDF

Bag of words

Represent a document as a vector of token counts. Vector dim = vocab size; sparse.

Limitations: ignores order, ignores semantics, weights all tokens equally.

TF-IDF

$TF - IDF (t, d) = TF (t, d) \cdot lo g \frac{N}{DF ( t )}$

with TF = term frequency, DF = document frequency, $N$ = total documents.

Why log on IDF? Without log, common terms like "the" overwhelm. Log compresses the dynamic range. Plus information-theoretic interpretation: $lo g (N / DF)$ is roughly the "surprise" of the term.

Variants:

Sublinear TF: $1 + lo g (TF)$ instead of raw count.
Maximum TF normalization: divide by max term count in the doc.
L2-normalized TF-IDF: each document vector unit-normalized for cosine similarity.

When still used

Sparse retrieval baseline.
Feature extraction before classical ML (e.g., logistic regression on TF-IDF for text classification).
Hybrid search (combined with dense embeddings).

2. N-gram language models

A statistical language model: assign probability to a sequence.

$P (w_{1}, \dots, w_{n}) = i \prod P (w_{i} ∣ w_{1}, \dots, w_{i - 1})$

$N$ -gram approximation: condition only on the previous $N - 1$ tokens (Markov assumption):

$P (w_{i} ∣ w_{1}, \dots, w_{i - 1}) \approx P (w_{i} ∣ w_{i - N + 1}, \dots, w_{i - 1})$

Unigram: $P (w_{i})$ . Bigram: $P (w_{i} ∣ w_{i - 1})$ . Trigram: $P (w_{i} ∣ w_{i - 2}, w_{i - 1})$ .

MLE

$P_{MLE} (w_{i} ∣ w_{i - 1}) = \frac{count ( w _{i - 1} , w _{i} )}{count ( w _{i - 1} )}$

Empirical conditional frequency. Problem: zero counts → zero probability for unseen events.

3. Smoothing

Crucial for n-gram models. Without smoothing, any OOV bigram → zero probability for entire sequence.

Laplace (add-one) smoothing

$P_{Lap} (w_{i} ∣ w_{i - 1}) = \frac{count ( w _{i - 1} , w _{i} ) + 1}{count ( w _{i - 1} ) + V}$

Adds 1 to every count. Robust but conservative — dilutes high-count probabilities.

Bayesian interpretation: corresponds to a Dirichlet prior $Dir (1, 1, \dots, 1)$ on the multinomial. (See MLE/MAP deep dive.)

Add- $α$ smoothing

$P = (count + α) / (\sum + α V)$ . Tune $α < 1$ for less aggressive smoothing.

Good-Turing

Estimate probability mass for unseen events from the count of singletons:

$P_{unseen} = \frac{N _{1}}{N}$

where $N_{1}$ = number of n-grams that appeared exactly once. Reallocate mass from seen to unseen events.

Backoff and interpolation

If trigram zero, fall back to bigram, then unigram:

Stupid backoff: $P (w ∣ w_{- 2}, w_{- 1}) = α \cdot P (w ∣ w_{- 1})$ if trigram unseen.

Linear interpolation:

$P_{interp} (w_{i} ∣ w_{i - 2}, w_{i - 1}) = λ_{3} P_{3} + λ_{2} P_{2} + λ_{1} P_{1}$

with $\sum λ = 1$ . Tune $λ$ on held-out data.

Kneser-Ney

State-of-the-art classical smoothing. Two innovations:

Absolute discounting: subtract a fixed $D$ from each non-zero count; redistribute the freed mass to lower-order distribution.
Continuation count: instead of using raw unigram count for backoff, use number of contexts in which the word appears. E.g., "Francisco" has high unigram count but appears almost always after "San" → low continuation count → not a great backoff target.

$P_{KN} (w_{i} ∣ w_{i - 1}) = \frac{max ( c ( w _{i - 1} , w _{i} ) - D , 0 )}{c ( w _{i - 1} )} + λ (w_{i - 1}) P_{cont} (w_{i})$

where $P_{cont} (w) \propto ∣ {w^{'} : c (w^{'}, w) > 0} ∣$ .

Modified Kneser-Ney (Chen & Goodman 1998) is what was used in production speech recognition before neural LMs took over.

4. Perplexity

The standard metric for evaluating a language model:

$PPL (w_{1}, \dots, w_{N}) = exp (- \frac{1}{N} i \sum lo g P (w_{i} ∣ w_{< i}))$

Lower is better. Equals $e^{H}$ where $H$ is the cross-entropy of the model on the test text.

Interpretation: average branching factor — if PPL = 100, the model is "as uncertain as if choosing uniformly among 100 options" at each position.

For n-gram LMs:

Unigram on natural English: PPL $\sim 1000$ .
Bigram: PPL $\sim 200$ .
Trigram: PPL $\sim 100$ .
Modern neural LM: PPL $\sim 10$ – $30$ on web text.

Properties

Vocab-size dependent: comparing across different vocabularies is tricky.
Sensitive to OOV handling: unsmoothed model has infinite PPL on any unseen token.
Lower-bounded by data entropy: cannot beat the true distribution's entropy.

Why neural LMs dominate

Distributional representations + parameter sharing → never assign zero probability + generalize across rare contexts. Modern transformers achieve PPL that n-gram models can't approach with any amount of data.

5. Zipf's law

Empirical observation: frequency of the $k$ -th most common word $\propto 1/ k$ .

Top word ("the") accounts for ~7% of all tokens. Top 100 words account for ~50%. Long tail of rare words.

Implications:

Vocabulary size grows with corpus (Heaps' law: vocab $\propto N^{β}$ , $β \approx 0.5$ ).
Rare events are inevitable — smoothing always matters.
Most words you encounter are common, but most unique words are rare.
Subword tokenization (BPE) handles this gracefully — common words become single tokens; rare words get split.

6. Edit distance

Levenshtein distance: minimum number of insertions, deletions, substitutions to transform string $a$ into $b$ .

Dynamic programming

$d (i, j) = min$ of:

$d (i - 1, j) + 1$ (delete from $a$ )
$d (i, j - 1) + 1$ (insert into $a$ )
$d (i - 1, j - 1) + [a_{i} \neq = b_{j}]$ (substitute)

Time: $O (∣ a ∣ \cdot ∣ b ∣)$ . Space: $O (∣ a ∣ \cdot ∣ b ∣)$ (or $O (min (∣ a ∣, ∣ b ∣))$ optimized).

Variants

Hamming: only substitutions (same-length strings).
Damerau-Levenshtein: also allows transposition.
Smith-Waterman: local alignment (used in bioinformatics).

Applications

Spell check (find closest word in dictionary).
DNA alignment.
Plagiarism detection.
BLEU score for MT (n-gram overlap rather than edit distance, but similar structural idea).

7. BM25 — the classical retrieval workhorse

Improvement over TF-IDF for ranking documents by relevance to a query.

Formula

$BM25 (q, d) = t \in q \sum IDF (t) \cdot \frac{TF ( t , d ) \cdot ( k _{1} + 1 )}{TF ( t , d ) + k _{1} \cdot ( 1 - b + b \cdot ∣ d ∣/ avgdl )}$

with $IDF (t) = lo g \frac{N - DF ( t ) + 0.5}{DF ( t ) + 0.5}$ .

Hyperparameters: $k_{1}$ (TF saturation, typical 1.2–2.0), $b$ (length normalization, typical 0.75).

Why it works

TF saturation: doubling a term's count doesn't double its contribution.
Length normalization: long documents penalized so they don't always win.
IDF: rare terms more informative.

Why still used

Strong baseline; cheap; interpretable; doesn't need training. Hybrid systems combine BM25 (sparse) with dense embeddings (semantic) for best results.

8. Common interview gotchas

Question	Common wrong answer	Right answer
What's perplexity?	Loss	$e^{H}$ where $H$ = cross-entropy; "average branching factor"
Why smooth n-gram models?	"Tradition"	Without it, any unseen n-gram → zero probability for whole sequence
Kneser-Ney's key insight?	Better discount	Continuation count: backoff uses number of contexts, not raw unigram count
BM25 vs TF-IDF?	Same thing	BM25 adds TF saturation + length normalization
Why is Zipf's law relevant?	Just trivia	Most word types are rare → smoothing always matters; subword tokenization handles long tail
Edit distance complexity?	$O (N)$	$O (∣ a ∣ \cdot ∣ b ∣)$ DP
Backoff vs interpolation?	Same	Backoff uses lower order only if higher order zero. Interpolation always combines.

9. Eight most-asked interview questions

Walk me through n-gram language models with smoothing. (MLE → Laplace → backoff → Kneser-Ney.)
What's perplexity and how is it computed? ( $exp (cross - entropy)$ ; lower = better.)
Why is Kneser-Ney smoothing popular? (Continuation counts; absolute discounting; better than Laplace.)
Compute edit distance — describe the DP. (Recursive minimum of insert/delete/substitute.)
BM25 vs TF-IDF — what's the improvement? (TF saturation + length normalization.)
What does Zipf's law imply for tokenization? (Long tail of rare words → subword (BPE) handles gracefully.)
When would you still use TF-IDF / BM25 today? (Sparse retrieval baseline; strong + cheap + interpretable; hybrid with dense.)
How does perplexity compare for n-gram vs neural LM? (Neural LMs: PPL ~10-30 on web text vs ~100+ for n-grams.)

10. Drill plan

Recite TF-IDF formula + why log.
Derive Laplace smoothing as MAP under Dirichlet prior.
Walk through Kneser-Ney's continuation count idea.
Compute edit distance for two short strings on paper.
Recite BM25 formula and explain each hyperparameter.
For each smoothing method (Laplace, Good-Turing, KN), recite when used.

11. Further reading

Jurafsky & Martin, Speech and Language Processing — chapters 3 (n-grams), 4 (smoothing), 6 (vector semantics).
Manning, Raghavan, Schütze, Introduction to Information Retrieval — TF-IDF, BM25.
Chen & Goodman (1998), An Empirical Study of Smoothing Techniques for Language Modeling.
Robertson & Zaragoza (2009), The Probabilistic Relevance Framework: BM25 and Beyond.

ML & LLM Interview Prep — Deep Dives