Bayes Basics
Bayes' Theorem is a fundamental concept in probability theory that describes how to update the probability of a hypothesis based on new evidence. It relates conditional probabilities and provides a mathematical framework for reasoning about uncertainty.
Formula:#
The general formula for Bayes' Theorem is:
[
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
]
Where:
- P(A | B): the posterior probability of A given B.
- P(B | A): the likelihood of B given A.
- P(A): the prior probability of A.
- P(B): the marginal probability of B (the evidence).
Key Idea:#
Bayes' Theorem allows you to update the probability of an event A (the hypothesis) after observing new evidence B: the prior P(A) is combined with the likelihood P(B | A) to produce the posterior P(A | B).
Derivation:#
Bayes' Theorem is derived from the definition of conditional probability:
[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
]
And:
[
P(B \mid A) = \frac{P(A \cap B)}{P(A)}
]
Rearranging the second equation gives:
[
P(A \cap B) = P(B \mid A) \cdot P(A)
]
Substituting this into the first equation yields Bayes' Theorem:
[
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
]
Example:#
Suppose a medical test for a disease has the following characteristics:
- If a person has the disease, the test is positive 99% of the time (P(Positive | Disease) = 0.99).
- If a person does not have the disease, the test is negative 95% of the time, so the false-positive rate is 5% (P(Positive | No Disease) = 0.05).
- The disease occurs in 1% of the population (P(Disease) = 0.01).
You take the test, and it’s positive. What’s the probability you have the disease, P(Disease | Positive)?
Solution:#
Using Bayes' Theorem:
[
P(\text{Disease} \mid \text{Positive}) = \frac{P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})}
]
First, calculate P(Positive) using the law of total probability:
[
P(\text{Positive}) = P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease}) + P(\text{Positive} \mid \text{No Disease}) \cdot P(\text{No Disease})
]
[
P(\text{Positive}) = (0.99 \cdot 0.01) + (0.05 \cdot 0.99) = 0.0099 + 0.0495 = 0.0594
]
Now, substitute into Bayes' Theorem:
[
P(\text{Disease} \mid \text{Positive}) = \frac{0.99 \cdot 0.01}{0.0594} = \frac{0.0099}{0.0594} \approx 0.1667
]
So, even with a positive test result, the probability of actually having the disease is about 16.67%, mainly because the disease is rare.
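As a quick check, here is a minimal Python sketch of the same calculation (the variable names are ours, chosen for illustration):

```python
# The medical-test calculation above, done in plain Python.
p_disease = 0.01               # prior P(Disease)
p_pos_given_disease = 0.99     # likelihood P(Positive | Disease)
p_pos_given_no_disease = 0.05  # false-positive rate P(Positive | No Disease)

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_no_disease * (1 - p_disease))

# Bayes' Theorem: posterior P(Disease | Positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(Positive) = {p_pos:.4f}")                           # 0.0594
print(f"P(Disease | Positive) = {p_disease_given_pos:.4f}")   # ~0.1667
```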
Can we say P(B|A) is the posterior?#
No. P(A | B) is called the posterior probability because it represents the updated probability of A after observing B (the evidence). P(B | A) is called the likelihood, which represents the probability of observing B given that A is true.
To summarize in the context of Bayes' theorem:
[
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
]
- P(A | B): the posterior.
- P(B | A): the likelihood.
- P(A): the prior.
- P(B): the marginal probability of the evidence.
Why P(B | A) is not the posterior#
The posterior is the probability of the hypothesis given the evidence. P(A | B) conditions the hypothesis A on the evidence B, so it is the posterior; P(B | A) conditions the evidence on the hypothesis, which makes it the likelihood.
So with the help of the prior and the likelihood we update the posterior?
Yes, exactly! Using prior and likelihood, we compute the posterior probability through Bayes' Theorem.
Key Components:#
- Prior (P(A)): Represents your initial belief about the probability of event A before observing any evidence.
  - Example: The probability of a person having a disease based on general population statistics.
- Likelihood (P(B | A)): Measures how consistent the evidence B is with the hypothesis A.
  - Example: The probability of a positive test result if the person has the disease.
- Posterior (P(A | B)): The updated probability of A, given the observed evidence B. It combines prior knowledge and the likelihood of the evidence.
  - Example: The updated probability of having the disease after getting a positive test result.
- Marginal Probability (P(B)): A normalizing constant ensuring the posterior probabilities sum to 1. It is the total probability of observing the evidence B, considering all possible causes.
Bayes' Theorem in Action:#
[
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
]
- Prior (P(A)): what you believed before seeing the evidence.
- Likelihood (P(B | A)): how well the evidence fits the hypothesis.
- Together, they update the posterior (P(A | B)), normalized by the marginal P(B).
Intuitive Understanding:#
- Prior: "Before seeing any data, this is my belief about A."
- Likelihood: "Given A, how likely is the evidence B?"
- Posterior: "After seeing B, this is my new belief about A."
Example:#
Let’s revisit the medical test example:
- Prior: P(Disease) = 0.01, the base rate in the population.
- Likelihood: P(Positive | Disease) = 0.99, the test's sensitivity.
- Using Bayes' Theorem, we update the belief about having the disease given a positive result: the posterior P(Disease | Positive) ≈ 0.1667.
The posterior reflects our updated belief about the disease's probability after accounting for the test's reliability and initial assumptions.
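To tie the three components together, here is a small reusable helper (the function name and arguments are our own) that takes a prior and the two conditional likelihoods and returns the posterior; calling it with the medical-test numbers reproduces the ≈ 16.7% result:

```python
def bayes_posterior(prior, lik_given_h, lik_given_not_h):
    """Posterior P(H | E) from prior P(H), likelihood P(E | H), and P(E | not H)."""
    evidence = lik_given_h * prior + lik_given_not_h * (1 - prior)  # marginal P(E)
    return lik_given_h * prior / evidence

# Medical-test numbers from the example above.
print(bayes_posterior(prior=0.01, lik_given_h=0.99, lik_given_not_h=0.05))  # ~0.1667
```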
Bayes' Theorem fits naturally into the framework of generative models, as these models aim to describe how data is generated, often involving latent variables or hidden structures.
Generative Models and Bayes' Theorem:#
A generative model seeks to model the joint probability distribution P(X, Z) of:
- X: the observed data.
- Z: the latent (hidden) variables or causes.
Using Bayes' Theorem, the generative model can infer the hidden variables or causes (Z) from the observed data (X).
How Bayes' Theorem Fits:#
- Generative Process:
  - A generative model defines P(X, Z) = P(X | Z) P(Z), where:
    - P(Z): Prior distribution over the latent variables Z.
    - P(X | Z): Likelihood, or conditional distribution, modeling how Z generates X.
- Inference with Bayes' Theorem:
  - To infer Z given X (i.e., to compute P(Z | X)), we apply Bayes' Theorem:
[
P(Z \mid X) = \frac{P(X \mid Z) \cdot P(Z)}{P(X)}
]
  - Here:
    - Prior (P(Z)): Encodes the model's assumptions about Z before observing X.
    - Likelihood (P(X | Z)): Captures how the latent variable Z generates the observed data X.
    - Posterior (P(Z | X)): Updates beliefs about Z after observing X.
- Marginal Likelihood (P(X)):
  - Normalizes the posterior and ensures it sums to 1.
  - Calculated as:
[
P(X) = \int P(X \mid Z) \, P(Z) \, dZ
]
  - In complex models, this integral can be computationally expensive and is often approximated (see the Monte Carlo sketch below).
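As a rough illustration of why this integral is usually approximated, the sketch below estimates P(X) by simple Monte Carlo sampling from the prior, for a toy model chosen only for this illustration (a standard normal prior and Gaussian likelihood, matching the VAE example later in this post):

```python
import numpy as np

# Toy model (assumed only for this illustration): Z ~ N(0, 1), X | Z ~ N(Z, 0.5^2)
rng = np.random.default_rng(0)

def likelihood(x, z, sigma=0.5):
    """Gaussian observation model P(X = x | Z = z)."""
    return np.exp(-(x - z) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x_obs = 3.0
z_samples = rng.normal(0.0, 1.0, size=100_000)   # draws from the prior P(Z)

# Monte Carlo estimate of the integral: P(X) = E_{Z ~ P(Z)}[ P(X | Z) ]
p_x = likelihood(x_obs, z_samples).mean()
print(f"Estimated P(X = {x_obs}) ≈ {p_x:.4f}")
```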
Example in Generative Models:#
1. Latent Dirichlet Allocation (LDA):#
- In LDA (a generative model for topic modeling), the observed data X are the words in documents, and Z are the latent topics.
- Bayes' Theorem is used to infer the posterior distribution of topics P(Z | X), given the observed words.
2. Variational Autoencoders (VAEs):#
- A VAE is a deep generative model where:
  - P(Z): Prior over the latent variables.
  - P(X | Z): Likelihood of reconstructing the data X given Z.
- The posterior P(Z | X) is approximated using a variational distribution Q(Z | X) because direct computation is intractable.
3. GANs (Generative Adversarial Networks):#
- While GANs are not explicitly Bayesian, they involve learning P(X) indirectly: the generator draws samples from an implicit P(X | Z) and is refined through the adversarial objective.
Why Bayes' Theorem is Important for Generative Models:#
- Inference: Helps compute posteriors P(Z | X), linking observed data to latent causes.
- Uncertainty: Allows quantification of uncertainty in predictions or latent variables.
- Learning: Bayesian principles underpin many generative learning methods, including variational inference and expectation-maximization.
In summary, Bayes' Theorem provides the theoretical foundation for making inferences in generative models, connecting observed data X to latent variables Z through the prior, likelihood, and posterior.
Let's walk through numerical examples illustrating how Bayes' Theorem is applied in generative models. These examples demonstrate inferring latent variables (Z) from observed data (X).
1. Topic Modeling in Latent Dirichlet Allocation (LDA)#
Problem:#
You have a document X = "apple banana orange" and want to infer which of two topics generated it:
- Topic 1: Fruits
- Topic 2: Technology
Generative Model Components:#
- Prior (P(Z)):
[
P(Z = \text{Fruits}) = 0.6, \quad P(Z = \text{Technology}) = 0.4
]
- Likelihood (P(X | Z)):
  Word probabilities for each topic give (these values are computed word by word in the likelihood section below):
  - If Z = Fruits: P(X | Fruits) = 0.12
  - If Z = Technology: P(X | Technology) = 0.0001
Applying Bayes' Theorem:#
We want to compute the posterior P(Z | X) for each topic.
- Compute P(X) (the normalizing factor):
[
P(X) = P(X \mid \text{Fruits}) \cdot P(\text{Fruits}) + P(X \mid \text{Technology}) \cdot P(\text{Technology})
]
[
P(X) = (0.12 \times 0.6) + (0.0001 \times 0.4) = 0.072 + 0.00004 = 0.07204
]
- Compute the posterior probabilities:
  - For Z = Fruits:
[
P(\text{Fruits} \mid X) = \frac{P(X \mid \text{Fruits}) \cdot P(\text{Fruits})}{P(X)} = \frac{0.12 \times 0.6}{0.07204} \approx 0.999
]
  - For Z = Technology:
[
P(\text{Technology} \mid X) = \frac{P(X \mid \text{Technology}) \cdot P(\text{Technology})}{P(X)} = \frac{0.0001 \times 0.4}{0.07204} \approx 0.001
]
Interpretation:#
The document X = "apple banana orange" is almost certainly (≈ 99.9%) generated by the Fruits topic, even though the Technology topic had a substantial prior.
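The same topic-posterior computation as a small Python sketch (the numbers are taken directly from the example; the dictionary layout is our own choice):

```python
# Topic inference for X = "apple banana orange", using the numbers above.
priors = {"Fruits": 0.6, "Technology": 0.4}            # P(Z)
likelihoods = {"Fruits": 0.12, "Technology": 0.0001}   # P(X | Z)

# Unnormalized posterior P(X | Z) * P(Z), evidence P(X), normalized posterior P(Z | X)
unnormalized = {z: likelihoods[z] * priors[z] for z in priors}
p_x = sum(unnormalized.values())
posterior = {z: value / p_x for z, value in unnormalized.items()}

print(round(p_x, 5))                                   # 0.07204
print({z: round(p, 3) for z, p in posterior.items()})  # {'Fruits': 0.999, 'Technology': 0.001}
```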
2. Variational Autoencoders (VAEs)#
Problem:#
You observe a datapoint X = 3.0 and want to infer the latent variable Z, where:
- Prior: Z follows a standard normal distribution, Z ~ N(0, 1).
- Likelihood: X given Z is Gaussian with mean Z and standard deviation 0.5, X | Z ~ N(Z, 0.5²).
Generative Model Components:#
- Prior (P(Z)):
[
P(Z) = \frac{1}{\sqrt{2\pi}} e^{-Z^2 / 2}
]
- Likelihood (P(X | Z)):
[
P(X \mid Z) = \frac{1}{\sqrt{2\pi \cdot 0.5^2}} e^{-(X - Z)^2 / (2 \cdot 0.5^2)}
]
- Evidence (P(X)):
  Marginalize over Z: P(X) = ∫ P(X | Z) P(Z) dZ. (In practice, this is approximated, e.g., with variational inference.)
Applying Bayes' Theorem:#
We want the posterior P(Z | X = 3.0).
- Posterior (P(Z | X)):
[
P(Z \mid X) \propto P(X \mid Z) \cdot P(Z)
]
- Compute the unnormalized posterior:
  Substitute the likelihood and prior into the formula:
[
P(Z \mid X = 3.0) \propto \left(\frac{1}{\sqrt{2\pi \cdot 0.5^2}} e^{-(3.0 - Z)^2 / (2 \cdot 0.5^2)}\right) \cdot \left(\frac{1}{\sqrt{2\pi}} e^{-Z^2 / 2}\right)
]
- Approximation:
  Because P(X) is challenging to compute exactly, VAEs approximate the posterior P(Z | X) using a learned variational distribution Q(Z | X).
Interpretation:#
The posterior P(Z | X = 3.0) is itself Gaussian, centered between the prior mean (0) and the observation (3.0), and pulled toward the observation because the likelihood (variance 0.25) is more precise than the prior (variance 1).
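For this toy Gaussian model the posterior can be checked numerically on a grid without any variational approximation; the sketch below (grid range and variable names are ours) normalizes the unnormalized posterior and recovers a mean of about 2.4, between the prior mean and the observation:

```python
import numpy as np

# Toy model from this example: Z ~ N(0, 1), X | Z ~ N(Z, 0.5^2), observed X = 3.0.
x_obs, sigma = 3.0, 0.5

z = np.linspace(-3.0, 6.0, 2001)   # grid over the latent variable (our choice)
dz = z[1] - z[0]

prior = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)                                   # P(Z)
lik = np.exp(-(x_obs - z)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)   # P(X | Z)

unnorm = lik * prior                        # proportional to P(Z | X)
posterior = unnorm / (unnorm.sum() * dz)    # normalize numerically on the grid

post_mean = (z * posterior).sum() * dz
print(f"Posterior mean ≈ {post_mean:.2f}")  # ≈ 2.4, between prior mean 0 and X = 3.0
```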
3. Gaussian Mixture Models (GMMs)#
Problem:#
You observe a data point X = 4.5 drawn from a mixture of two Gaussian components and want to infer which component Z generated it:
- Component 1 (Z = 1): mixture weight P(Z = 1) = 0.7.
- Component 2 (Z = 2): mixture weight P(Z = 2) = 0.3.
- The component densities evaluated at the observation are P(X = 4.5 | Z = 1) = 0.352 and P(X = 4.5 | Z = 2) = 0.043.
Applying Bayes' Theorem:#
- Compute the likelihoods:
  - P(X = 4.5 | Z = 1) = 0.352
  - P(X = 4.5 | Z = 2) = 0.043
- Compute the posterior probabilities:
  - Normalizing constant:
[
P(X) = P(X \mid Z = 1) \cdot P(Z = 1) + P(X \mid Z = 2) \cdot P(Z = 2)
]
[
P(X) = (0.352 \cdot 0.7) + (0.043 \cdot 0.3) = 0.2464 + 0.0129 = 0.2593
]
  - Posterior for Z = 1:
[
P(Z = 1 \mid X = 4.5) = \frac{P(X \mid Z = 1) \cdot P(Z = 1)}{P(X)} = \frac{0.352 \cdot 0.7}{0.2593} \approx 0.951
]
  - Posterior for Z = 2:
[
P(Z = 2 \mid X = 4.5) = \frac{P(X \mid Z = 2) \cdot P(Z = 2)}{P(X)} = \frac{0.043 \cdot 0.3}{0.2593} \approx 0.049
]
Interpretation:#
The observation X = 4.5 almost certainly came from component 1 (posterior ≈ 0.951), which reflects both its higher prior weight and its much higher likelihood at that point.
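The same responsibility calculation in a few lines of Python (we plug in the density values quoted above rather than the underlying Gaussian parameters, which the example does not restate):

```python
# Responsibilities for X = 4.5, using the density values from the example.
priors = [0.7, 0.3]            # P(Z = 1), P(Z = 2)
likelihoods = [0.352, 0.043]   # P(X = 4.5 | Z = 1), P(X = 4.5 | Z = 2)

unnormalized = [lik * prior for lik, prior in zip(likelihoods, priors)]
p_x = sum(unnormalized)                             # P(X = 4.5) ≈ 0.2593
responsibilities = [u / p_x for u in unnormalized]

print(round(p_x, 4))                                # 0.2593
print([round(r, 3) for r in responsibilities])      # ≈ [0.95, 0.05], matching the result above
```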
In the given example, the likelihood P(X | Z) is obtained by multiplying the probabilities of the individual words in the document under each topic, as explained below.
Steps to Calculate the Likelihood:#
- Assumption of Independence:
- The model assumes that the probability of a document X (e.g., "apple banana orange") given a topic Z can be decomposed into the product of the probabilities of each individual word in the document:
[
P(X \mid Z) = P(\text{"apple"} \mid Z) \cdot P(\text{"banana"} \mid Z) \cdot P(\text{"orange"} \mid Z)
]
  - This is called the bag-of-words assumption, which disregards word order and considers only word frequencies.
- Word Probabilities:
  - The probabilities P(word | Z) are learned during model training. For example:
    - For topic Z = Fruits:
[
P(\text{"apple"} \mid \text{Fruits}) = 0.5, \quad P(\text{"banana"} \mid \text{Fruits}) = 0.4, \quad P(\text{"orange"} \mid \text{Fruits}) = 0.6
]
    - For topic Z = Technology:
[
P(\text{"apple"} \mid \text{Technology}) = 0.1, \quad P(\text{"banana"} \mid \text{Technology}) = 0.05, \quad P(\text{"orange"} \mid \text{Technology}) = 0.02
]
- Compute the Likelihood:
  - For topic Z = Fruits:
[
P(\text{"apple banana orange"} \mid \text{Fruits}) = P(\text{"apple"} \mid \text{Fruits}) \cdot P(\text{"banana"} \mid \text{Fruits}) \cdot P(\text{"orange"} \mid \text{Fruits})
]
[
= 0.5 \cdot 0.4 \cdot 0.6 = 0.12
]
  - For topic Z = Technology:
[
P(\text{"apple banana orange"} \mid \text{Technology}) = P(\text{"apple"} \mid \text{Technology}) \cdot P(\text{"banana"} \mid \text{Technology}) \cdot P(\text{"orange"} \mid \text{Technology})
]
[
= 0.1 \cdot 0.05 \cdot 0.02 = 0.0001
]
Why This Works:#
The likelihood P(X | Z) measures how well each topic's word distribution explains the observed document; topics whose characteristic words appear in the document receive a much higher likelihood.
This approach simplifies the computation but may lose some information about word relationships (e.g., word order).
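A minimal sketch of this bag-of-words likelihood in Python, using the illustrative word probabilities from this example (the helper name is our own):

```python
import math

# Illustrative word-topic probabilities from this example (assumed, not learned here).
word_probs = {
    "Fruits":     {"apple": 0.5, "banana": 0.4,  "orange": 0.6},
    "Technology": {"apple": 0.1, "banana": 0.05, "orange": 0.02},
}

def document_likelihood(words, topic):
    """P(document | topic) under the bag-of-words independence assumption."""
    return math.prod(word_probs[topic][word] for word in words)

doc = ["apple", "banana", "orange"]
print(document_likelihood(doc, "Fruits"))      # 0.12 (up to floating-point rounding)
print(document_likelihood(doc, "Technology"))  # 0.0001
```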
Yes, the word probabilities P(word | Z) are learned from data during training.
How Word Probabilities Are Learned#
- Data Preparation:
  - The model is provided with a large collection of documents.
  - Each document is associated with one or more topics (in supervised models), or the topics are inferred (in unsupervised models like LDA).
- Objective:
  - The goal is to compute the conditional probability of each word appearing in a document, given the topic (Z).
  - For example, for the topic "Fruits," the probability P("apple" | Fruits) represents how likely the word "apple" is to appear in documents dominated by the "Fruits" topic.
- Training Process:
  - Using algorithms like Expectation-Maximization (EM) or variational inference, the model iteratively:
    - Assigns topic probabilities to documents.
    - Updates the word distribution of each topic based on those assignments.
- Learned Probabilities:
  - After training, the word-topic probabilities are estimated from the frequency of words in documents related to a particular topic.
  - For example, if "apple" appears frequently in documents labeled or inferred as "Fruits," P("apple" | Fruits) will be high (see the counting sketch below).
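As a deliberately simplified illustration of that counting step (supervised relative frequencies on a hypothetical toy corpus, rather than EM or variational inference):

```python
from collections import Counter

# Hypothetical toy corpus with topic labels (for illustration only).
corpus = [
    ("Fruits",     "apple banana apple orange"),
    ("Fruits",     "banana orange apple"),
    ("Technology", "laptop phone apple software"),
]

# Count word occurrences per topic.
counts = {}
for topic, text in corpus:
    counts.setdefault(topic, Counter()).update(text.split())

# Relative-frequency estimate of P(word | topic).
word_probs = {
    topic: {word: n / sum(counter.values()) for word, n in counter.items()}
    for topic, counter in counts.items()
}

print(word_probs["Fruits"]["apple"])   # 3 of the 7 Fruits words are "apple" ≈ 0.429
```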
Why Use These Probabilities?#
- Model Simplicity:
  - P(word | Z) is a compact way to represent the characteristics of a topic. Each topic is essentially a distribution over words.
- Generative Assumption:
  - Generative models assume that the words in a document are sampled from the topic's word distribution. For example, if Z = Fruits, the words are sampled according to P(word | Fruits), as sketched below.
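A tiny sketch of that generative assumption: given the topic, words are drawn from its word distribution (the weights are the illustrative Fruits values from this example; random.choices rescales them internally, so they need not sum to 1):

```python
import random

words = ["apple", "banana", "orange"]
weights = [0.5, 0.4, 0.6]   # illustrative P(word | Fruits) values

random.seed(0)
# Generative assumption: given Z = Fruits, each word is drawn from P(word | Fruits).
document = random.choices(words, weights=weights, k=5)
print(" ".join(document))   # a 5-word "document" sampled from the Fruits topic
```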
Are the Given Probabilities Assumed in the Example?#
For the example provided, the probabilities (e.g., P("apple" | Fruits) = 0.5) are assumed for the sake of illustration. In practice:
- These probabilities are learned during model training from a corpus of documents.
- Once learned, they are fixed and used to compute the likelihood for new documents.
Summary#
- Assumed in Example: The probabilities are given as illustrative values for simplicity.
- Learned in Practice: These probabilities are calculated during training based on observed word frequencies in topic-specific documents.