Bayes' Theorem: Detailed Explanation

What is Bayes' Theorem?

Bayes' theorem is one of the most important principles in probability theory and statistics. It provides a mathematical framework for updating our beliefs about an event or hypothesis when we receive new evidence. The theorem is named after Thomas Bayes, an 18th-century English statistician and philosopher.

Mathematical Formulation

Basic Form:

P(A|B) = P(B|A) * P(A) / P(B)

Extended Form (with multiple hypotheses):

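P(A_i|B) = P(B|A_i) * P(A_i) / Σ_j [ P(B|A_j) * P(A_j) ]

where the sum in the denominator runs over all hypotheses A_j, which are assumed to be mutually exclusive and exhaustive (exactly one of them is true).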
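
The extended form can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `posterior` and the three-hypothesis numbers are made up for the example, and the hypotheses are assumed to be mutually exclusive and exhaustive.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for each hypothesis A_i.

    priors[i]      = P(A_i), assumed to sum to 1
    likelihoods[i] = P(B | A_i)
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | A_i) * P(A_i)
    evidence = sum(joint)                                 # P(B), the denominator
    if evidence == 0:
        raise ValueError("P(B) is zero; the posterior is undefined")
    return [j / evidence for j in joint]


# Hypothetical numbers: three hypotheses and how well each explains B.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.7]
post = posterior(priors, likelihoods)
print([round(p, 3) for p in post])
```

The third hypothesis starts with the smallest prior but ends with the largest posterior, because it explains the evidence best.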

Detailed Component Explanation

Prior Probability P(A)

The prior probability represents what we believe about event A before we see any evidence B. It's our initial knowledge, assumptions, or background information about the event.

Why it matters: The prior is crucial because it provides a starting point for our reasoning. If we have strong prior knowledge, it takes strong evidence to change our beliefs. If we have weak or uniform priors, we're more open to being convinced by evidence.

Example: In medical diagnosis, the prior is the base rate of the disease in the population. If a disease affects 1% of people, then P(disease) = 0.01 is our prior. This means that before we know anything about a specific person, we believe there's a 1% chance they have the disease.

How to determine prior:

Empirical prior: Use historical data or population statistics
Subjective prior: Use expert knowledge or beliefs
Uniform prior: Use when you have no prior information (all outcomes equally likely)
Conjugate prior: Chosen for mathematical convenience, so that the posterior stays in the same distribution family as the prior (common in Bayesian statistics)

Likelihood P(B|A)

The likelihood represents the probability of observing evidence B given that hypothesis A is true. It answers the question: "If A is true, how likely are we to see this evidence?"

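Why it matters: The likelihood connects our hypothesis to the observed data. A high likelihood means the evidence is what we would expect if the hypothesis were true. A low likelihood means the evidence would be surprising under the hypothesis. What drives the update is how the likelihood compares across competing hypotheses, not its absolute size.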

Example: In medical testing, if someone has a disease, how likely are they to test positive? If the test correctly detects 95% of people who have the disease (its sensitivity), then P(positive test | disease) = 0.95. This is the likelihood: it tells us how well the test detects the disease when it's actually present.

Key insight: The likelihood is not the same as P(A|B). P(B|A) asks "if A, then how likely is B?" while P(A|B) asks "if B, then how likely is A?" These are different questions, and Bayes' theorem connects them.

Evidence P(B)

The evidence, also called the marginal probability or normalizing constant, is the total probability of observing B regardless of whether A is true or not. It's computed by summing over all possible ways B could occur.

Mathematical computation:

P(B) = P(B|A) * P(A) + P(B|¬A) * P(¬A)

Or more generally:
P(B) = Σ P(B|A_i) * P(A_i)  (sum over all possible A_i)

Why it matters: The evidence serves as a normalization constant that ensures the posterior probabilities sum to 1. It accounts for all possible ways the evidence could have occurred, not just the way we're interested in.

Example: In medical testing, P(positive test) includes both true positives (people with disease who test positive) and false positives (people without disease who test positive). This total probability normalizes our calculation.

Key insight: We often don't need to compute P(B) explicitly if we're just comparing different hypotheses, because P(B) is the same for all of them. We can compute relative probabilities and normalize at the end.
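
A small sketch of this shortcut, using made-up priors and likelihoods for three hypothetical hypotheses H1 to H3. The best hypothesis is picked from the unnormalized scores alone, and normalization happens only at the end.

```python
# Hypothetical priors P(A_i) and likelihoods P(B | A_i).
priors = {"H1": 0.6, "H2": 0.3, "H3": 0.1}
likelihoods = {"H1": 0.2, "H2": 0.5, "H3": 0.9}

# Unnormalized scores P(B | A_i) * P(A_i). P(B) is never needed for ranking.
scores = {h: likelihoods[h] * priors[h] for h in priors}
best = max(scores, key=scores.get)

# Normalizing at the end recovers the actual posterior probabilities.
total = sum(scores.values())
posteriors = {h: s / total for h, s in scores.items()}

print(best)
print({h: round(p, 3) for h, p in posteriors.items()})
```

Dividing every score by the same constant cannot change their order, which is why the ranking is already decided before normalization.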

Posterior Probability P(A|B)

The posterior probability is our updated belief about A after observing evidence B. It combines our prior knowledge with the new evidence to give us a revised probability.

Why it matters: The posterior is what we actually care about for decision-making. It tells us, given the evidence we've seen, what's the probability our hypothesis is true? This is what we use to make predictions, diagnoses, or decisions.

Example: In medical diagnosis, P(disease | positive test) tells us: given that someone tested positive, what's the probability they actually have the disease? This is what the doctor needs to know to make a diagnosis.

Key insight: The posterior becomes the new prior if we get additional evidence. This is the basis of Bayesian updating - we can iteratively update our beliefs as we gather more information.
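
Iterative updating can be sketched as a loop in which each posterior is fed back in as the next prior. The helper name `update` is hypothetical. The example reuses the 1% prevalence and 95% detection rate from the examples above, and additionally assumes a 5% false positive rate and that repeated test results are conditionally independent given the person's true status.

```python
def update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """One Bayesian update for a binary hypothesis H."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1.0 - prior)
    return numerator / evidence


belief = 0.01  # initial prior P(disease)
for test_number in (1, 2):
    # Each positive result: P(positive | disease) = 0.95,
    # P(positive | no disease) = 0.05 (assumed).
    belief = update(belief, 0.95, 0.05)
    print(f"after positive test {test_number}: {belief:.3f}")
```

Under these assumptions one positive result moves the belief to about 16%, and a second independent positive result moves it to about 78%.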

Step-by-Step Example: Medical Diagnosis

Let's work through a detailed example to see how all the pieces fit together.

Problem Setup:

Disease prevalence: 1% of population has the disease
Test sensitivity: 95% (if you have the disease, 95% chance of a positive test)
Test false positive rate: 5% (if you don't have disease, 5% chance of positive test)
Question: If someone tests positive, what's the probability they have the disease?

Step 1: Identify Components

Prior P(disease): 0.01 (1% of population)
Prior P(no disease): 0.99 (99% of population)
Likelihood P(positive | disease): 0.95 (95% sensitivity)
Likelihood P(positive | no disease): 0.05 (5% false positive rate)
What we want: P(disease | positive)

Step 2: Compute Evidence P(positive)

The evidence is the total probability of a positive test, which can happen in two ways:

Person has disease and tests positive: P(positive | disease) * P(disease) = 0.95 * 0.01 = 0.0095
Person doesn't have disease but tests positive: P(positive | no disease) * P(no disease) = 0.05 * 0.99 = 0.0495

Total: P(positive) = 0.0095 + 0.0495 = 0.059

Step 3: Apply Bayes' Theorem

P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
                       = 0.95 * 0.01 / 0.059
                       = 0.0095 / 0.059
                       ≈ 0.161 (16.1%)
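
Steps 1 to 3 can be reproduced directly in Python. The variable names are chosen for this sketch only; the numbers are the ones from the problem setup.

```python
p_disease = 0.01              # prior: prevalence
p_pos_given_disease = 0.95    # likelihood: P(positive | disease)
p_pos_given_healthy = 0.05    # false positive rate: P(positive | no disease)

# Step 2: evidence via the law of total probability.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Step 3: Bayes' theorem.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(positive)           = {p_pos:.4f}")                # 0.0590
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # 0.161
```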

Step 4: Interpret Result

Even though the test catches 95% of true cases and has only a 5% false positive rate, if someone tests positive, there's only a 16.1% chance they actually have the disease! This seems counterintuitive but makes sense when you count people:

Out of 10,000 people: 100 have disease, 9,900 don't
True positives: 100 * 0.95 = 95 people
False positives: 9,900 * 0.05 = 495 people
Total positive tests: 95 + 495 = 590
Probability of disease given positive: 95 / 590 ≈ 16.1%

The large number of false positives (495) from the healthy population overwhelms the true positives (95) from the small diseased population.
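
The same head count can be written out with integer arithmetic, which avoids conditional probabilities entirely. This is only a restatement of the numbers above; the variable names are illustrative.

```python
population = 10_000
diseased = population * 1 // 100         # 1% prevalence -> 100 people
healthy = population - diseased          # 9,900 people

true_positives = diseased * 95 // 100    # 95 people
false_positives = healthy * 5 // 100     # 495 people

share = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, f"{share:.1%}")  # 95 495 16.1%
```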

Why Bayes' Theorem is Important in ML

1. Naive Bayes Classifier:

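Uses Bayes' theorem to pick the most probable class given the features
Assumes features are conditionally independent given the class
P(class | features) = P(features | class) * P(class) / P(features)
Under the independence assumption, P(features | class) is the product of the per-feature terms P(feature_k | class)
Often works well in practice despite the "naive" independence assumption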
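
A toy Naive Bayes text classifier, written from scratch to show the mechanics. Everything here is an assumption made for illustration: the five-document training set, the add-one (Laplace) smoothing, and the use of log probabilities to avoid numeric underflow. It is a sketch, not a substitute for a tested library implementation.

```python
import math
from collections import Counter

# Hypothetical toy training set of (words, label) pairs.
train = [
    (["win", "money", "now"], "spam"),
    (["free", "money"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "at", "noon"], "ham"),
    (["project", "meeting"], "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
for words, label in train:
    word_counts[label].update(words)
vocab = {w for words, _ in train for w in words}


def log_score(words, label):
    """log P(class) + sum of log P(word | class), with add-one smoothing."""
    score = math.log(class_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for w in words:
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score


def classify(words):
    # P(features) is identical for every class, so it is dropped.
    return max(class_counts, key=lambda label: log_score(words, label))


print(classify(["free", "money", "now"]))     # spam
print(classify(["meeting", "at", "noon"]))    # ham
```

Summing log probabilities is equivalent to multiplying the per-feature terms, which is exactly where the independence assumption enters.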

2. Bayesian Inference:

Update model parameters as we see more data
Start with prior beliefs, update with likelihood
Quantify uncertainty in predictions
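
One common textbook illustration of parameter updating is the Beta-Binomial model for a coin's heads probability. The sketch below assumes a uniform Beta(1, 1) prior and a made-up sequence of flips; it is meant only to show the prior-to-posterior mechanics and how the posterior spread quantifies uncertainty.

```python
# Beta(alpha, beta) prior over the probability of heads.
alpha, beta = 1.0, 1.0                     # uniform prior: no initial preference

observations = [1, 1, 0, 1, 1, 0, 1, 1]    # hypothetical flips, 1 = heads

for flip in observations:
    # Conjugate update: heads increments alpha, tails increments beta.
    alpha += flip
    beta += 1 - flip

posterior_mean = alpha / (alpha + beta)
posterior_var = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1))

print(f"posterior: Beta({alpha:.0f}, {beta:.0f})")
print(f"mean = {posterior_mean:.3f}, std = {posterior_var ** 0.5:.3f}")
```

With more flips the standard deviation shrinks, which is the sense in which Bayesian inference reports uncertainty alongside the estimate.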

3. Spam Detection:

Prior: Base rate of spam emails
Likelihood: Probability of seeing certain words given spam/not spam
Posterior: Probability email is spam given its words

4. Recommendation Systems:

Prior: User's general preferences
Likelihood: Probability of behavior given preferences
Posterior: Updated preferences given observed behavior

5. Medical Diagnosis:

Prior: Disease prevalence
Likelihood: Test sensitivity and false positive rate
Posterior: Disease probability given test result

Common Misconceptions

Misconception 1: "Prior doesn't matter"

Wrong! Prior is crucial, especially when evidence is weak
With strong evidence, prior matters less
With weak evidence, prior dominates
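
To see how much the prior matters, the sketch below holds the test fixed (95% detection rate, 5% false positive rate, as in the medical example) and varies only the prior. The function name `posterior_given_positive` and the list of priors are chosen for illustration.

```python
def posterior_given_positive(prior, sensitivity=0.95, false_positive_rate=0.05):
    """P(disease | positive) for a given prior P(disease)."""
    true_pos = sensitivity * prior
    false_pos = false_positive_rate * (1 - prior)
    return true_pos / (true_pos + false_pos)


for prior in (0.001, 0.01, 0.1, 0.5):
    print(f"prior = {prior:<5}  posterior = {posterior_given_positive(prior):.3f}")
```

The same positive result gives a posterior of under 2% when the prior is 0.1% and 95% when the prior is 50%.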

Misconception 2: "Likelihood and posterior are the same"

Wrong! P(B|A) ≠ P(A|B) in general
Likelihood: "If A, how likely is B?"
Posterior: "If B, how likely is A?"
These are different questions!

Misconception 3: "Bayes' theorem only works with probabilities"

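Wrong! It also works with odds (posterior odds = prior odds * likelihood ratio), probability densities, and unnormalized scores that are proportional to the posterior
Often we compute relative scores for each hypothesis and normalize at the end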
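
The odds form of the update can be sketched with the numbers from the medical example. The helper names `to_odds` and `to_probability` are introduced here for illustration.

```python
def to_odds(p):
    return p / (1 - p)


def to_probability(odds):
    return odds / (1 + odds)


prior_odds = to_odds(0.01)        # 1 : 99
likelihood_ratio = 0.95 / 0.05    # P(positive | disease) / P(positive | no disease)
posterior_odds = prior_odds * likelihood_ratio

print(f"posterior odds        = {posterior_odds:.3f}")                  # 0.192
print(f"posterior probability = {to_probability(posterior_odds):.3f}")  # 0.161
```

The result matches the 16.1% obtained with the probability form, and P(positive) never has to be computed.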

Practical Tips

1. Always consider the prior:

Don't ignore base rates
Rare events need strong evidence to change beliefs

2. Understand the likelihood:

Know what your evidence actually measures
Consider both true positives and false positives

3. Compute evidence correctly:

Account for all ways evidence could occur
Don't forget alternative hypotheses

4. Interpret posterior carefully:

Posterior depends on both prior and likelihood
Weak evidence + strong prior = posterior close to prior
Strong evidence + weak prior = posterior driven mostly by the evidence (with a uniform prior, the posterior is proportional to the likelihood)

Summary

Bayes' theorem is a powerful tool for updating beliefs with evidence. It shows us that:

Prior knowledge matters
Evidence strength matters
The combination gives us updated beliefs
Rare events need strong evidence to be convincing

Understanding Bayes' theorem is crucial for:

Probabilistic machine learning
Decision making under uncertainty
Interpreting test results
Building generative models

ML & LLM Interview Prep — Deep Dives