Bayes Basics

Bayes' Theorem is a fundamental concept in probability theory that describes how to update the probability of a hypothesis based on new evidence. It relates conditional probabilities and provides a mathematical framework for reasoning about uncertainty.

Formula:#

The general formula for Bayes' Theorem is:
[
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
]

Where:
- P(A | B): The posterior probability of A given B (the updated probability of A after observing B).
- P(B | A): The likelihood, or the probability of B given that A is true.
- P(A): The prior probability of A (initial belief about A).
- P(B): The marginal probability of B, or the total probability of observing B.

Key Idea:#

Bayes' Theorem allows you to update the probability of an event A happening based on new evidence B.


Derivation:#

Bayes' Theorem is derived from the definition of conditional probability:
[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
]
And:
[
P(B \mid A) = \frac{P(A \cap B)}{P(A)}
]
Rearranging the expression for P(B | A), we get:
[
P(A \cap B) = P(B \mid A) \cdot P(A)
]
Substituting this into the expression for P(A | B), we get:
[
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
]
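
The identity is easy to check numerically. The sketch below is a minimal Python check; the joint-probability values are made up purely for illustration. It computes P(A | B) directly from the joint distribution and confirms that Bayes' Theorem gives the same answer.

```python
# Minimal numerical check of the derivation, using a made-up joint
# distribution over two binary events A and B (values are assumptions
# chosen only so the four probabilities sum to 1).
p_joint = {
    (True, True): 0.12,    # P(A and B)
    (True, False): 0.18,   # P(A and not B)
    (False, True): 0.28,   # P(not A and B)
    (False, False): 0.42,  # P(not A and not B)
}

p_a = sum(p for (a, _), p in p_joint.items() if a)   # P(A) = 0.30
p_b = sum(p for (_, b), p in p_joint.items() if b)   # P(B) = 0.40
p_a_and_b = p_joint[(True, True)]                    # P(A ∩ B) = 0.12

p_a_given_b = p_a_and_b / p_b    # definition of conditional probability
p_b_given_a = p_a_and_b / p_a

# Bayes' Theorem should reproduce P(A | B) from P(B | A), P(A), P(B).
assert abs(p_a_given_b - (p_b_given_a * p_a / p_b)) < 1e-12
print(p_a_given_b)  # 0.3
```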


Example:#

Suppose a medical test for a disease is 99% accurate:
- If a person has the disease, the test is positive 99% of the time (P(Positive | Disease) = 0.99).
- If a person does not have the disease, the test is negative 95% of the time (P(Negative | No Disease) = 0.95).
- The disease occurs in 1% of the population (P(Disease) = 0.01).

You take the test, and it’s positive. What’s the probability you have the disease (P(Disease | Positive))?

Solution:#

Using Bayes' Theorem:
[
P(\text{Disease} \mid \text{Positive}) = \frac{P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})}
]

First, calculate P(Positive) using the law of total probability:
[
P(\text{Positive}) = P(\text{Positive} \mid \text{Disease}) \cdot P(\text{Disease}) + P(\text{Positive} \mid \text{No Disease}) \cdot P(\text{No Disease})
]
[
P(\text{Positive}) = (0.99 \cdot 0.01) + (0.05 \cdot 0.99) = 0.0099 + 0.0495 = 0.0594
]

Now, substitute into Bayes' Theorem:
[
P(\text{Disease} \mid \text{Positive}) = \frac{0.99 \cdot 0.01}{0.0594} = \frac{0.0099}{0.0594} \approx 0.1667
]

So, even with a positive test result, the probability of actually having the disease is about 16.67%, mainly because the disease is rare.
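
The same calculation, written as a short Python sketch (the variable names are my own, chosen for readability):

```python
# Medical-test example: posterior probability of disease given a positive test.
p_disease = 0.01               # prior: P(Disease)
p_pos_given_disease = 0.99     # sensitivity: P(Positive | Disease)
p_neg_given_no_disease = 0.95  # specificity: P(Negative | No Disease)

p_pos_given_no_disease = 1 - p_neg_given_no_disease  # false-positive rate, 0.05

# Law of total probability: P(Positive)
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_no_disease * (1 - p_disease))

# Bayes' Theorem: P(Disease | Positive)
p_disease_given_positive = p_pos_given_disease * p_disease / p_positive
print(round(p_positive, 4), round(p_disease_given_positive, 4))  # 0.0594 0.1667
```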


Can we say P(B | A) is the posterior?#

No, P(B | A) is not referred to as the "posterior probability." Instead:

  • P(A | B) is called the posterior probability because it represents the updated probability of A after observing B (the evidence).
  • P(B | A) is called the likelihood, which represents the probability of observing B given that A is true.

To summarize in the context of Bayes' theorem:
[
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
]
- P(A): Prior probability (your initial belief about A).
- P(B | A): Likelihood (how likely the evidence B is if A is true).
- P(A | B): Posterior probability (updated belief about A after observing B).
- P(B): Marginal probability of B, the normalizing factor.

Why P(B | A) is not the posterior#

The posterior P(A | B) directly answers the question: "Given the evidence B, what is the updated probability of A?" In contrast, P(B | A) is a measure of how consistent the evidence B is with the assumption that A is true; it does not update or refine the belief about A.


So, with the help of the prior and the likelihood, we update the posterior.


Yes, exactly! Using the prior and the likelihood, we compute the posterior probability through Bayes' Theorem.

Key Components:#

  1. Prior (P(A)):
    • Represents your initial belief about the probability of event A before observing any evidence.
    • Example: The probability of a person having a disease, based on general population statistics.

  2. Likelihood (P(B | A)):
    • Measures how consistent the evidence B is with the hypothesis A.
    • Example: The probability of a positive test result if the person has the disease.

  3. Posterior (P(A | B)):
    • The updated probability of A, given the observed evidence B.
    • It combines prior knowledge and the likelihood of the evidence.
    • Example: The updated probability of having the disease after getting a positive test result.

  4. Marginal Probability (P(B)):
    • A normalizing constant ensuring the posterior probabilities sum to 1.
    • It is the total probability of observing the evidence B, considering all possible causes.

Bayes' Theorem in Action:#

[
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
]
- Prior (P(A)): What you believe about A before evidence B.
- Likelihood (P(B | A)): How well A explains B.
- Together, they update the posterior (P(A | B)).

Intuitive Understanding:#

  • Prior: "Before seeing any data, this is my belief about A."
  • Likelihood: "Given A, how likely is the evidence B?"
  • Posterior: "After seeing B, this is my new belief about A."

Example:#

Let’s revisit the medical test example:
- P(Disease) = 0.01 (Prior: 1% of the population has the disease).
- P(Positive | Disease) = 0.99 (Likelihood: if diseased, the test is positive 99% of the time).
- Using Bayes' Theorem, we update the belief about having the disease given a positive result (P(Disease | Positive)).

The posterior reflects our updated belief about the disease's probability after accounting for the test's reliability and initial assumptions.
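
As a small generalization of that calculation, here is a hedged Python sketch of a reusable update over a set of mutually exclusive hypotheses; the function name `bayes_update` and the dictionary layout are my own choices, not a standard API.

```python
def bayes_update(priors, likelihoods):
    """Compute posteriors P(hypothesis | evidence) from priors and likelihoods.

    priors, likelihoods: dicts keyed by hypothesis name; the hypotheses are
    assumed to be mutually exclusive and exhaustive.
    """
    # Marginal probability of the evidence via the law of total probability.
    p_evidence = sum(priors[h] * likelihoods[h] for h in priors)
    # Posterior for each hypothesis: prior * likelihood / evidence.
    return {h: priors[h] * likelihoods[h] / p_evidence for h in priors}

# The medical-test example again, now with explicit hypotheses.
posterior = bayes_update(
    priors={"disease": 0.01, "no_disease": 0.99},
    likelihoods={"disease": 0.99, "no_disease": 0.05},  # P(Positive | hypothesis)
)
print(posterior)  # {'disease': ~0.1667, 'no_disease': ~0.8333}
```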


Bayes' Theorem fits naturally into the framework of generative models, as these models aim to describe how data is generated, often involving latent variables or hidden structures.


Generative Models and Bayes' Theorem:#

A generative model seeks to model the joint probability distribution P(X,Z), where:
- X: Observed data (e.g., images, text, or sounds).
- Z: Latent variables or hidden factors (e.g., topics in text, noise in data).

Using Bayes' Theorem, the generative model can infer the hidden variables or causes (Z) from the observed data (X) by computing the posterior P(Z | X).


How Bayes' Theorem Fits:#

  1. Generative Process:
    • A generative model defines P(X, Z) = P(X | Z) · P(Z), where:
      • P(Z): Prior distribution over the latent variables Z.
      • P(X | Z): Likelihood, the conditional distribution modeling how Z generates X.

  2. Inference with Bayes' Theorem:
    • To infer Z given X (i.e., to compute P(Z | X)), we apply Bayes' Theorem:
    [
    P(Z \mid X) = \frac{P(X \mid Z) \cdot P(Z)}{P(X)}
    ]
    • Here:
      • Prior (P(Z)): Encodes the model's assumptions about Z before observing X.
      • Likelihood (P(X | Z)): Captures how the latent variable Z generates the observed data X.
      • Posterior (P(Z | X)): Updates beliefs about Z after observing X.

  3. Marginal Likelihood (P(X)):
    • Normalizes the posterior so that it integrates to 1.
    • Calculated as:
    [
    P(X) = \int P(X \mid Z) \, P(Z) \, dZ
    ]
    • In complex models, this integral can be computationally expensive and is often approximated (a rough Monte Carlo sketch follows this list).
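
To illustrate why the marginal likelihood is the hard part, the sketch below approximates P(X) = ∫ P(X | Z) P(Z) dZ by simple Monte Carlo sampling from the prior. The toy model (standard normal prior, Gaussian likelihood with standard deviation 0.5) is an assumption chosen to match the VAE example later in this note.

```python
import math
import random

def normal_pdf(x, mean, std):
    # Density of a univariate Gaussian N(mean, std^2) evaluated at x.
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def marginal_likelihood(x, n_samples=100_000):
    # P(x) ≈ (1/N) * sum_i P(x | Z_i), with Z_i drawn from the prior P(Z) = N(0, 1).
    samples = (random.gauss(0.0, 1.0) for _ in range(n_samples))
    return sum(normal_pdf(x, z, 0.5) for z in samples) / n_samples

print(marginal_likelihood(3.0))  # ≈ 0.0097, up to Monte Carlo noise
```

Because X = Z + noise in this toy model, the exact marginal is N(0, 1.25), so the estimate should land near 0.0097; in realistic generative models no such closed form exists, which is why approximations such as variational inference are used.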

Example in Generative Models:#

1. Latent Dirichlet Allocation (LDA):#

  • In LDA (a generative model for topic modeling), the observed data X are words in documents, and Z are the latent topics.
  • Bayes' Theorem is used to infer the posterior distribution over topics, P(Z | X), given the observed words.

2. Variational Autoencoders (VAEs):#

  • A VAE is a deep generative model where:
    • P(Z): Prior over latent variables.
    • P(X | Z): Likelihood of reconstructing the data X given Z.
  • The posterior P(Z | X) is approximated using a variational distribution because direct computation is intractable.

3. GANs (Generative Adversarial Networks):#

  • While GANs are not explicitly Bayesian, they involve learning P(X) indirectly by generating samples from an implicit P(X | Z), with Z drawn from a simple prior P(Z).

Why Bayes' Theorem is Important for Generative Models:#

  1. Inference: Helps compute posteriors P(Z | X), linking observed data to latent causes.
  2. Uncertainty: Allows quantification of uncertainty in predictions or latent variables.
  3. Learning: Bayesian principles underpin many generative learning methods, including variational inference and expectation-maximization.

In summary, Bayes' Theorem provides the theoretical foundation for making inferences in generative models, connecting data X with latent variables Z in a principled manner.


Let's walk through numerical examples illustrating how Bayes' Theorem is applied in generative models. These examples demonstrate inferring latent variables (Z) from observed data (X).


1. Topic Modeling in Latent Dirichlet Allocation (LDA)#

Problem:#

You have a document X="apple banana orange", and you're inferring the topic distribution Z. Suppose there are two possible topics:
- Topic 1: Fruits
- Topic 2: Technology

Generative Model Components:#

  • Prior (P(Z)):
    [
    P(Z = \text{Fruits}) = 0.6, \quad P(Z = \text{Technology}) = 0.4
    ]
  • Likelihood (P(X | Z)): word probabilities for each topic give
    • If Z = Fruits: P("apple banana orange" | Fruits) = 0.5 × 0.4 × 0.6 = 0.12
    • If Z = Technology: P("apple banana orange" | Technology) = 0.1 × 0.05 × 0.02 = 0.0001

Applying Bayes' Theorem:#

We want to compute P(Z | X) for both topics.

  1. Compute P(X) (normalizing factor):
    [
    P(X) = P(X \mid \text{Fruits}) \cdot P(\text{Fruits}) + P(X \mid \text{Technology}) \cdot P(\text{Technology})
    ]
    [
    P(X) = (0.12 \times 0.6) + (0.0001 \times 0.4) = 0.072 + 0.00004 = 0.07204
    ]

  2. Compute posterior probabilities:
    • For Z = Fruits:
    [
    P(\text{Fruits} \mid X) = \frac{P(X \mid \text{Fruits}) \cdot P(\text{Fruits})}{P(X)} = \frac{0.12 \times 0.6}{0.07204} \approx 0.999
    ]
    • For Z = Technology:
    [
    P(\text{Technology} \mid X) = \frac{P(X \mid \text{Technology}) \cdot P(\text{Technology})}{P(X)} = \frac{0.0001 \times 0.4}{0.07204} \approx 0.001
    ]

Interpretation:#

The document X = "apple banana orange" is overwhelmingly more likely to belong to the Fruits topic (P(Fruits | X) ≈ 0.999).
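
The same topic-posterior calculation as a Python sketch, using the illustrative word probabilities and priors from the example above:

```python
import math

# Per-topic word probabilities and topic priors (illustrative values only).
word_probs = {
    "Fruits":     {"apple": 0.5, "banana": 0.4, "orange": 0.6},
    "Technology": {"apple": 0.1, "banana": 0.05, "orange": 0.02},
}
priors = {"Fruits": 0.6, "Technology": 0.4}
document = ["apple", "banana", "orange"]

# Bag-of-words likelihood: product of per-word probabilities given the topic.
likelihoods = {
    topic: math.prod(word_probs[topic][w] for w in document) for topic in priors
}

p_x = sum(priors[t] * likelihoods[t] for t in priors)               # P(X)
posteriors = {t: priors[t] * likelihoods[t] / p_x for t in priors}  # P(Z | X)
print(posteriors)  # {'Fruits': ~0.9994, 'Technology': ~0.0006}
```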


2. Variational Autoencoders (VAEs)#

Problem:#

You observe a datapoint X=3.0, and you're inferring the latent variable Z in a VAE. The model assumes:
- Z ~ N(0, 1) (prior).
- X | Z ~ N(Z, 0.5²) (likelihood).

Generative Model Components:#

  1. Prior (P(Z)):
    [
    P(Z) = \frac{1}{\sqrt{2\pi}} e^{-Z^2 / 2}
    ]
  2. Likelihood (P(X | Z)):
    [
    P(X \mid Z) = \frac{1}{\sqrt{2\pi \cdot 0.5^2}} e^{-(X - Z)^2 / (2 \cdot 0.5^2)}
    ]
  3. Evidence (P(X)):
    Marginalize over Z: P(X) = ∫ P(X | Z) P(Z) dZ. (In practice, this is approximated using variational inference.)

Applying Bayes' Theorem:#

We want P(Z | X = 3.0).

  1. Posterior (P(Z | X)):
    [
    P(Z \mid X) \propto P(X \mid Z) \cdot P(Z)
    ]

  2. Compute unnormalized posterior:
    Substitute likelihood and prior into the formula:
    [
    P(Z \mid X = 3.0) \propto \left(\frac{1}{\sqrt{2\pi \cdot 0.5^2}} e^{-(3.0 - Z)^2 / (2 \cdot 0.5^2)}\right) \cdot \left(\frac{1}{\sqrt{2\pi}} e^{-Z^2 / 2}\right)
    ]

  3. Approximation:
    Because P(X) is challenging to compute exactly, VAEs approximate P(Z | X) using a learned variational distribution q(Z | X).

Interpretation:#

The posterior P(Z | X) tells us which latent values Z are most likely to have generated the observation X = 3.0.
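
For this particular toy model the posterior can actually be written in closed form, because a Gaussian prior combined with a Gaussian likelihood in Z is conjugate; the sketch below computes the resulting posterior mean and variance. Treat it as an illustration of the toy case only: real VAEs need the variational approximation q(Z | X) because their likelihood is parameterized by a neural network.

```python
# Closed-form posterior for the toy model above: Gaussian prior + Gaussian
# likelihood in Z means P(Z | X) is itself Gaussian.
x = 3.0
prior_mean, prior_var = 0.0, 1.0  # Z ~ N(0, 1)
noise_var = 0.5 ** 2              # X | Z ~ N(Z, 0.5^2)

# Precision-weighted combination of prior and likelihood.
post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
post_mean = post_var * (prior_mean / prior_var + x / noise_var)

print(post_mean, post_var)  # 2.4, 0.2
```

The posterior mean 2.4 sits between the prior mean 0 and the observation 3.0, pulled toward the data because the likelihood is more precise than the prior.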


3. Gaussian Mixture Models (GMMs)#

Problem:#

You observe a data point X=4.5 and need to infer which component Z of a GMM generated it. Suppose:
- Z ∈ {1, 2} (two Gaussian components).
- P(Z = 1) = 0.7, P(Z = 2) = 0.3 (prior).
- X | Z = 1 ~ N(5, 1²), X | Z = 2 ~ N(3, 1²) (likelihood).

Applying Bayes' Theorem:#

  1. Compute likelihoods:
    • For Z = 1:
    [
    P(X = 4.5 \mid Z = 1) = \frac{1}{\sqrt{2\pi \cdot 1^2}} e^{-(4.5 - 5)^2 / 2} \approx 0.352
    ]
    • For Z = 2:
    [
    P(X = 4.5 \mid Z = 2) = \frac{1}{\sqrt{2\pi \cdot 1^2}} e^{-(4.5 - 3)^2 / 2} \approx 0.130
    ]

  2. Compute the normalizing constant:
    [
    P(X) = P(X \mid Z = 1) \cdot P(Z = 1) + P(X \mid Z = 2) \cdot P(Z = 2)
    ]
    [
    P(X) = (0.352 \cdot 0.7) + (0.130 \cdot 0.3) = 0.2464 + 0.0390 = 0.2854
    ]

  3. Compute posterior probabilities:
    • Posterior for Z = 1:
    [
    P(Z = 1 \mid X = 4.5) = \frac{P(X \mid Z = 1) \cdot P(Z = 1)}{P(X)} = \frac{0.352 \cdot 0.7}{0.2854} \approx 0.863
    ]
    • Posterior for Z = 2:
    [
    P(Z = 2 \mid X = 4.5) = \frac{P(X \mid Z = 2) \cdot P(Z = 2)}{P(X)} = \frac{0.130 \cdot 0.3}{0.2854} \approx 0.137
    ]

Interpretation:#

The observation X = 4.5 is most likely to have been generated by component Z = 1, with a posterior probability of approximately 86%.
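
The same responsibility calculation as a Python sketch (the Gaussian density helper is written out by hand so the block stays self-contained):

```python
import math

def normal_pdf(x, mean, std):
    # Density of a univariate Gaussian N(mean, std^2) evaluated at x.
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

x = 4.5
priors = {1: 0.7, 2: 0.3}
components = {1: (5.0, 1.0), 2: (3.0, 1.0)}  # (mean, std) per component

likelihoods = {z: normal_pdf(x, *components[z]) for z in priors}    # P(x | Z=z)
p_x = sum(priors[z] * likelihoods[z] for z in priors)               # P(x)
posteriors = {z: priors[z] * likelihoods[z] / p_x for z in priors}  # P(Z=z | x)

print(likelihoods)  # {1: ~0.352, 2: ~0.130}
print(posteriors)   # {1: ~0.863, 2: ~0.137}
```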


In the given example, the likelihood P(X | Z) is calculated from the probabilities of the individual words in the document, assuming the words are independent of each other given the topic Z. This is a simplifying assumption often used in models like Latent Dirichlet Allocation (LDA).


Steps to Calculate the Likelihood:#

  1. Assumption of Independence:
    • The model assumes that the probability of a document X (e.g., "apple banana orange") given a topic Z can be decomposed into the product of the probabilities of each individual word in the document:
    [
    P(X \mid Z) = P(\text{"apple"} \mid Z) \cdot P(\text{"banana"} \mid Z) \cdot P(\text{"orange"} \mid Z)
    ]
    • This is called the bag-of-words assumption, which disregards word order and considers only word frequencies.

  2. Word Probabilities:
    • The probabilities P(word | Z) are learned during model training. For example:
      • For topic Z = Fruits:
      [
      P(\text{"apple"} \mid \text{Fruits}) = 0.5, \quad P(\text{"banana"} \mid \text{Fruits}) = 0.4, \quad P(\text{"orange"} \mid \text{Fruits}) = 0.6
      ]
      • For topic Z = Technology:
      [
      P(\text{"apple"} \mid \text{Technology}) = 0.1, \quad P(\text{"banana"} \mid \text{Technology}) = 0.05, \quad P(\text{"orange"} \mid \text{Technology}) = 0.02
      ]

  3. Compute the Likelihood:
    • For topic Z = Fruits:
    [
    P(\text{"apple banana orange"} \mid \text{Fruits}) = P(\text{"apple"} \mid \text{Fruits}) \cdot P(\text{"banana"} \mid \text{Fruits}) \cdot P(\text{"orange"} \mid \text{Fruits})
    ]
    [
    = 0.5 \cdot 0.4 \cdot 0.6 = 0.12
    ]
    • For topic Z = Technology:
    [
    P(\text{"apple banana orange"} \mid \text{Technology}) = P(\text{"apple"} \mid \text{Technology}) \cdot P(\text{"banana"} \mid \text{Technology}) \cdot P(\text{"orange"} \mid \text{Technology})
    ]
    [
    = 0.1 \cdot 0.05 \cdot 0.02 = 0.0001
    ]

Why This Works:#

The likelihood P(X | Z) is a measure of how well the topic Z explains the document X. By assuming independence between words and knowing the word probabilities P(word | Z) for each topic, we can calculate how probable it is to observe the given document X under the topic Z.

This approach simplifies the computation but may lose some information about word relationships (e.g., word order).
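
One practical note: multiplying many per-word probabilities underflows quickly for long documents, so implementations usually sum log-probabilities instead. A minimal sketch, reusing the illustrative Fruits word probabilities from above:

```python
import math

# Bag-of-words log-likelihood: sum of log P(word | topic) over the tokens.
# Summing logs is numerically safer than multiplying many small probabilities.
word_probs_fruits = {"apple": 0.5, "banana": 0.4, "orange": 0.6}
tokens = ["apple", "banana", "orange"]

log_likelihood = sum(math.log(word_probs_fruits[w]) for w in tokens)
print(log_likelihood, math.exp(log_likelihood))  # ≈ -2.12, 0.12
```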


Yes, the word probabilities P(word | Z) are not arbitrary assumptions but are learned during the training phase of the model. Here's how this works in detail:


How Word Probabilities P(word | Z) Are Learned#

  1. Data Preparation:
    • The model is provided with a large collection of documents.
    • Each document is associated with one or more topics (in supervised models), or the topics are inferred (in unsupervised models like LDA).

  2. Objective:
    • The goal is to compute the conditional probability of each word appearing in a document, given the topic Z.
    • For example, for the topic "Fruits," the probability P("apple" | Fruits) represents how likely the word "apple" is to appear in documents dominated by the "Fruits" topic.

  3. Training Process:
    • Using algorithms like Expectation-Maximization (EM) or variational inference, the model iteratively:
      1. Assigns topic probabilities to documents.
      2. Updates the word distribution for each topic based on those assignments.

  4. Learned Probabilities:
    • After training, the word-topic probabilities are estimated from the frequency of words in documents related to a particular topic.
    • For example, if "apple" appears frequently in documents labeled or inferred as "Fruits," P("apple" | Fruits) will be high (a toy counting sketch follows this list).
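
As a toy illustration of the counting step (real LDA training is more involved; this is just a maximum-likelihood estimate with add-one smoothing over documents assumed to belong to a single topic, and the tiny corpus and vocabulary are made up):

```python
from collections import Counter

# Toy estimate of P(word | topic) from documents assumed to belong to "Fruits".
# Add-one (Laplace) smoothing keeps unseen words from getting probability zero.
fruit_docs = [
    ["apple", "banana", "apple"],
    ["orange", "apple", "banana"],
]
vocab = {"apple", "banana", "orange", "laptop"}

counts = Counter(word for doc in fruit_docs for word in doc)
total = sum(counts.values())

p_word_given_fruits = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
print(p_word_given_fruits)  # e.g. P("apple" | Fruits) = (3 + 1) / (6 + 4) = 0.4
```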

Why Use These Probabilities?#

  1. Model Simplicity:
    • P(word | Z) is a compact way to represent the characteristics of a topic. Each topic is essentially a distribution over words.

  2. Generative Assumption:
    • Generative models assume that the words in a document are sampled from the topic's word distribution. For example:
      • If Z = "Fruits", the words in a document are sampled based on P(word | Fruits).

Are the Given Probabilities Assumed in the Example?#

For the example provided, the probabilities (P("apple" | Fruits) = 0.5, P("banana" | Fruits) = 0.4, P("orange" | Fruits) = 0.6) are example values chosen to demonstrate the process. In a real-world scenario:
- These probabilities are learned during model training from a corpus of documents.
- Once learned, they are fixed and used to compute the likelihood for new documents.


Summary#

  • Assumed in Example: The probabilities are given as illustrative values for simplicity.
  • Learned in Practice: These probabilities are calculated during training based on observed word frequencies in topic-specific documents.