Probability distributions

Notes on probability distributions

Last updated: 1 Mar 2020

Bernoulli

Say in an experiment (e.g. a coin toss) you observe one of two outcomes, $x=\{0,1\}$. Set the probability of seeing "1" (usually the positive outcome) to $0.8$. It follows that the probability of seeing "0" is $1-0.8=0.2$. So

$$ P(X=1) = 0.8 $$

and

$$ P(X=0) = 0.2 $$

can be rewritten as

$$ P(X=x) = \left\{\begin{array}{ll}{0.8} & {x=1} \\ {0.2} & {x=0} \end{array}\right. $$

which can be rewritten as

$$ P(X=x) = (0.8)^x (0.2)^{1-x} $$

Definition

$$ P(X=x) = p^x(1-p)^{1-x} $$

where $p$ is the probability of the positive outcome.
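As a quick sanity check, here is a minimal sketch of this PMF in Python (standard library only; the function name and numbers are just illustrative):

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p^x * (1 - p)^(1 - x) for x in {0, 1}."""
    assert x in (0, 1)
    return p**x * (1 - p)**(1 - x)

print(bernoulli_pmf(1, 0.8))  # 0.8, the probability of the positive outcome
print(bernoulli_pmf(0, 0.8))  # 0.2
```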

What happens if you have $n$ trials where only the last is a success? What happens if you have $n$ trials, $k$ of which are successes? What happens if you have $K$ different outcomes?

Uses

Categorical

Also known as generalised Bernoulli distribution or multinoulli distribution.

This probability distribution describes the possible results of a random variable that takes one of $K$ possible categories. Moreover, the probability of each category is separately specified. An example would be that of a die.

Definition

There are 3 ways to define such a distribution:

$$ p(x=i) = p_i $$

or

$$ p(x) = p_1^{1_{x=1}} p_2^{1_{x=2}} \cdots p_k^{1_{x=k}} $$

or

$$ p(x) = 1_{x=1} p_1 + 1_{x=2} p_2 + \cdots + 1_{x=k} p_k $$

where $1_{x=i}$ is the indicator function, equal to 1 when $x=i$ and 0 otherwise.
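As an illustration (assuming NumPy is available; the seed and sample size are arbitrary), here is a sketch that samples from a categorical distribution over the $K=6$ faces of a fair die and checks the empirical frequencies against the specified $p_i$:

```python
import numpy as np

p = np.array([1/6] * 6)          # p_i for each of the K = 6 faces of a fair die
rng = np.random.default_rng(0)
faces = rng.choice(np.arange(1, 7), size=100_000, p=p)  # categorical draws

# empirical frequencies should be close to the specified p_i
for i, p_i in enumerate(p, start=1):
    print(i, p_i, np.mean(faces == i))
```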

Uses

Geometric

Intuition

Say you need to conduct an experiment. This experiment involves repeated attempts (called trials) until we get what we want. In each trial, there are only two outcomes, $\{0,1\}$. The probability of observing 1 (usually set to what we want) is $p$. It follows that the probability of observing 0 is $1-p$. What is the probability of finally getting 1 in one trial?

$$ p $$

And in 2 trials (getting 0 then 1)?

$$ (1-p)p $$

And in $10$ trials (getting 0 nine times then 1)?

$$ (1-p)^9p $$

Definition

And in $k$ trials (getting 0 $k-1$ times then 1)?

$$ P(X=k)= (1-p)^{k-1}p $$
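A small sketch (assuming SciPy is available; the values of $p$ and $k$ are arbitrary) comparing this formula with scipy.stats.geom, which uses the same "number of trials up to and including the first success" convention:

```python
from scipy.stats import geom

def geometric_pmf(k, p):
    """P(X = k) = (1 - p)^(k - 1) * p for k = 1, 2, ..."""
    return (1 - p)**(k - 1) * p

p = 0.3
for k in [1, 2, 10]:
    print(k, geometric_pmf(k, p), geom.pmf(k, p))  # the two columns should agree
```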

Uses

Binomial

Story

Say there is a biased coin where seeing heads has a probability 0.8. Flipping this coin 4 times and noting if it's heads or tails, I get

$$ \{H, H, H, T\} $$

exactly in that order. That's 3 heads. What's the probability of getting the above?

$$ 0.8 \times 0.8 \times 0.8 \times 0.2 = (0.8)^3 (0.2)^1 = 0.1024$$

Now what if I flipped 4 times, what's the probability of seeing 3 heads? It can't be $0.1024$ because that's just the probability of seeing $\{H,H,H,T\}$ exactly in that order. When I flip the coin 4 times, there are definitely other combinations that give 3 heads like $\{T, H, H, H\}$.

Let's list down all the possible combinations there can be from flipping the coin 4 times and getting 3 heads:

$$ \{H,H,H,T\}, \{H,H,T,H\}, $$

$$ \{H,T,H,H\}, \{T,H,H,H\} $$

That's 4. Thankfully, we have a calculator that can do ${4 \choose 3} = 4$. Now let's list down the probabilities of each combination (they're all the same):

| Combination | Probability |
| --- | --- |
| $\{H,H,H,T\}$ | $(0.8)^3 (0.2)^1$ |
| $\{H,H,T,H\}$ | $(0.8)^3 (0.2)^1$ |
| $\{H,T,H,H\}$ | $(0.8)^3 (0.2)^1$ |
| $\{T,H,H,H\}$ | $(0.8)^3 (0.2)^1$ |

Adding these probabilities up gives us

$$ {4 \choose 3} (0.8)^3 (0.2)^1 = 4 (0.8)^3 (0.2)^1 = 0.4096 $$

It is said that the number of heads you would see in 4 coin flips can be modelled by a Binomial distribution. That is to say, you would be able to tell the probability of seeing 0, 1, 2, 3 or 4 heads.

Also note that, inherently, a coin flip can only produce one of two outcomes (hence "bi"). "nomial" comes from the Greek word "nomos", which means parts.

Definition

$$ P(X=k)= {n \choose k} p^k (1-p)^{n-k} $$

where $n$ is the number of trials, $k$ the number of successes, and $p$ the probability of success in each trial.
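As a sanity check on the coin story above, here is a short sketch (assuming SciPy is available) that recomputes the probability of 3 heads in 4 flips with $p = 0.8$:

```python
from math import comb
from scipy.stats import binom

n, k, p = 4, 3, 0.8
by_hand = comb(n, k) * p**k * (1 - p)**(n - k)
print(by_hand)               # 0.4096, as computed in the story above
print(binom.pmf(k, n, p))    # the same value from SciPy
```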

What happens if you have $(k+r)$ trials, $k$ of which are successes and $r$ are failures?

What happens if you are allowed a billion (or gazillion) trials but the probability of success in each trial is very, very small?

Poisson

Intuition

The Binomial distribution models the number of successes (with probability $p$) from $n$ independent trials. So far, the term "trial" has been used to refer to acts like flipping a biased coin, giving a medication to a person who might react adversely with probability 0.9 and so on. What if we project the concept of "trials" in Binomial distribution onto time and space? What would this mean?

As an example, suppose you play a slap game with an opponent: You are given 10 trials and at every trial, you will slap the opponent's hand, for which you hit with probability 60% or miss with 40%. The number of hits can be modelled using a Binomial distribution.

Coming back to the idea of time projection, we can imagine every second as a trial. Thus, you can look at the game this way: You are given 10 seconds and at every second, you will slap the opponent's hand, for which you hit with probability 60% or miss with 40%. The number of hits is still modelled as a Binomial distribution.

Now what if I told you that in this time frame of 10s, you can slap the opponent without having to do it at every tick of the second-hand? That would be more thrilling, wouldn't it 😈?

But how would you have to rephrase the problem such that the rules of the game still apply?

Firstly, note that the number of trials in that 10s now becomes extremely large. We're not measuring in seconds anymore. We're measuring in milliseconds. Or even microseconds. Heck, it's the concept of time we're talking about. It's not even discrete! We're talking about infinitely many trials, i.e. when $n$ approaches $\infty$.

Secondly, note that in the Binomial setting, we have 10 trials and 60% success as the parameters of the distribution. Does this mean that in the above setting (Poisson setting, no points for this one), we have $\infty$ trials and still a 60% successful slap for every trial 😱? How can we port these rules to the infinite-trials setting such that the game is still fair? The answer is to take a simple average. Taking $10 \times 60\%$ tells us that in the Binomial setting there will be 6 successful slaps on average. This means that in the Poisson setting, there should still be 6 successful slaps on average despite the infinitely many trials.

But wait, what happened to the success probability, you might ask? This number has to go very small. It can't be 60%. That's not fair, because you are changing the rules of the game. It has to be small. And I mean infinitesimally small. So small that when I take that large $n$ to multiply with this small $p$, I get 6. You don't have to care about the number of trials or the probability of success anymore. You just have to care about the average number, 6, which encapsulates these two numbers.

And so this becomes the Poisson distribution. And here's a re-writeup of the game: You are given 10s to slap the opponent's hand, and on average 6 of your slaps are successful hits.

As it turns out, letting $n$ approach $\infty$ reduces the Binomial PMF to a nicer-looking form (see below).

'Trials' in the Binomial distribution are usually attributed to man-made trials. In the Poisson distribution, 'trials' are driven by time and space, things that man cannot control.

Law of rare events / Poisson limit theorem

The Poisson distribution is an approximation to the Binomial distribution when we are allowed to have as many trials, $n$, as we want, i.e. when $n$ in the Binomial PMF goes to $\infty$:

$$ \lim _{n \rightarrow \infty} P(X=k) \\ =\lim _{n \rightarrow \infty} {n \choose k} p^{k}(1-p)^{n-k} $$

Reparametrise $\lambda = np$. This represents the average number of times the event occurs, and it indirectly specifies a time period within which the event can occur. Consequently, $p = \frac{\lambda}{n}$.

$$ =\lim _{n \rightarrow \infty} {n \choose k} \left(\frac{\lambda}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k} \\ =\lim _{n \rightarrow \infty} \frac{n !}{k !(n-k) !} \left(\frac{\lambda}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k} $$

Let's shift $\lambda$'s and $k$'s to the left.

$$ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !} \cdot \frac{n !}{(n-k) !} \left(\frac{1}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k} $$

Expand $n!$

$$ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !} \cdot \frac{n(n-1)(n-2) ... (n-(k-1))(n-k)(n-(k+1))...(2)(1)}{n^k(n-k)!} \left(1-\frac{\lambda}{n}\right)^{n-k} $$

Cancel $(n-k)!$ from the numerator and denominator.

$$ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !}\left(\frac{n}{n} \cdot \frac{n-1}{n} \cdot \frac{n-2}{n} \cdots \frac{n-(k-1)}{n}\right)\left(1-\frac{\lambda}{n}\right)^{n-k} $$

The second term converges to 1 as $n$ approaches $\infty$.

$$ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !} (1) \left(1-\frac{\lambda}{n}\right)^{n-k} \\ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !}\left(1-\frac{\lambda}{n}\right)^{n} \left(1-\frac{\lambda}{n}\right)^{-k} $$

The last term converges to 1 as $n$ approaches $\infty$.

$$ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !}\left(1-\frac{\lambda}{n}\right)^{n} \left(1\right) \\ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !}\left(1+\frac{-\lambda}{n}\right)^{n} $$

The last term converges to $e^{-\lambda}$ as $n$ approaches $\infty$.

$$ =e^{-\lambda} \frac{\lambda^{k}}{k !} $$

Definition

The probability of an event (whose average rate of occurrence is $\lambda$ per unit time) occurring $k$ times in a given period (one unit of time) is given by

$$ P(X=k) = e^{-\lambda} \frac{\lambda^{k}}{k !} $$
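A sketch (assuming SciPy is available; the numbers are just an example) that evaluates the Poisson PMF for $\lambda = 6$ and also shows the Binomial PMF with $p = \lambda/n$ approaching it as $n$ grows, mirroring the limit derived above:

```python
from scipy.stats import binom, poisson

lam, k = 6, 4                        # average of 6 hits; probability of exactly 4 hits
print(poisson.pmf(k, lam))           # e^{-6} * 6^4 / 4!

for n in [10, 100, 10_000]:          # Binomial(n, lam/n) converges to Poisson(lam)
    print(n, binom.pmf(k, n, lam / n))
```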

Proof of convergence from Binomial distribution

Inspired by this post.

Law of rare events or Poisson limit theorem states that the Poisson distribution may be used as an approximation to the binomial distribution.

Exponential

The exponential distribution is an extension of the Poisson distribution. The Poisson PMF measures the probability of seeing an event occur $k$ times in one time period. This means we can measure the probability that an event occurs at all, i.e. that it occurs more than 0 times, which translates to $k=1,2,\ldots$

$$ P(\text{event occurred}) = P(X=1) + P(X=2) + ... $$

or if the event does not occur at all:

$$ P(\text{event does not occur}) = P(X=0) $$

What happens if we want to know the probability of not observing an event within 3 units of time?

$$ \begin{aligned} & P(\text{0 events occur in the 1st time unit}) \times \\ & P(\text{0 events occur in the 2nd time unit}) \times \\ & P(\text{0 events occur in the 3rd time unit}) \\ = & e^{-\lambda} \frac{\lambda^{0}}{0 !} \times e^{-\lambda} \frac{\lambda^{0}}{0 !} \times e^{-\lambda} \frac{\lambda^{0}}{0 !} \\ = & e^{-3\lambda} \end{aligned} $$

The probabilities are multiplied because these events are independent of each other. We can also look at it differently and ask, "what's the probability of (first) observing an event after 3 units of time?"

$$ P(\text{event observed after 3 units of time}) = e^{-3\lambda} $$

which can be rephrased as: what's the probability that $T$, the time until the event is (first) observed, is more than 3?

$$ P(T > 3) = e^{-3\lambda} $$

And the probability of observing an event after $t$ units of time?

$$ P(T > t) = e^{-\lambda t} $$

And so this is the exponential distribution. Intuitions and derivation here.

Definition

$$ P(T \leq t) = 1 - e^{-\lambda t} $$
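A sketch (assuming SciPy is available; the values of $\lambda$ and $t$ are arbitrary) of the survival function $P(T > t) = e^{-\lambda t}$ and the CDF; note that SciPy parameterises the exponential with scale $= 1/\lambda$:

```python
import math
from scipy.stats import expon

lam, t = 0.5, 3.0
print(math.exp(-lam * t))          # P(T > 3) = e^{-3 * lambda}
print(expon.sf(t, scale=1/lam))    # the same survival probability via SciPy
print(expon.cdf(t, scale=1/lam))   # P(T <= 3) = 1 - e^{-3 * lambda}
```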

Memoryless property

This property of the exponential distribution says that if one does not experience a car accident in $10$ days, the probability of experiencing a car accident in $(10+2)$ days is the same as the probability of experiencing a car accident in $2$ days.

$$ P(T > 10 + 2 | T > 10) = P(T > 2) $$

And more formally

$$ P(T > a + b | T > a) = P(T > b) $$
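The property can be checked numerically with a few lines of Python (standard library only; the rate and times are just an example):

```python
import math

def survival(t, lam):
    """P(T > t) for an exponential with rate lam."""
    return math.exp(-lam * t)

lam, a, b = 0.1, 10, 2
conditional = survival(a + b, lam) / survival(a, lam)   # P(T > a + b | T > a)
print(conditional, survival(b, lam))                    # both equal e^{-lam * b}
```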

Uses

Uniform

For an experiment with 6 different outcomes $\{1,2,3,4,5,6\}$, each with equal probability, what is the probability of seeing any particular outcome?

$$ \frac{1}{6} $$

For an experiment with $n$ different outcomes, each with equal probability, what is the probability of observing any particular outcome?

$$ P(X=x) = \frac{1}{n} $$

Normal / Gaussian

Definition

$$X \sim \mathcal{N}(\mu,\sigma^2)$$

$$ f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma ^2}} $$

$$Z \sim \mathcal{N}(0,1)$$

$$ f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} $$
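A sketch (assuming SciPy is available; $\mu$, $\sigma$ and $x$ are arbitrary) that evaluates the density and standardises $X$ via $Z = (X - \mu)/\sigma$:

```python
from scipy.stats import norm

mu, sigma, x = 5.0, 2.0, 6.0
z = (x - mu) / sigma                      # standardise

print(norm.pdf(x, loc=mu, scale=sigma))   # f(x) for N(mu, sigma^2)
print(norm.pdf(z) / sigma)                # the same value via the standard normal
print(norm.cdf(z))                        # P(Z <= z) = P(X <= x)
```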

How is the normal distribution derived? 🤔

Uses

Chi-square

If $Z_1, \ldots, Z_k$ are independent standard normal random variables, i.e. $Z_i \sim \mathcal{N}(0,1)$, then

$$ \sum_{i=1}^{k} Z_i^2 \sim \chi^{2}(k) $$

Why? 🤔
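No derivation here, but a quick simulation (assuming NumPy and SciPy are available; $k$, the seed and the sample size are arbitrary) at least makes the claim plausible by comparing the sample mean and variance of the summed squares with those of $\chi^2(k)$, namely $k$ and $2k$:

```python
import numpy as np
from scipy.stats import chi2

k, n_samples = 5, 200_000
rng = np.random.default_rng(0)
z = rng.standard_normal((n_samples, k))   # n_samples draws of (Z_1, ..., Z_k)
s = (z**2).sum(axis=1)                    # sum of squared standard normals

print(s.mean(), chi2.mean(k))             # both close to k = 5
print(s.var(), chi2.var(k))               # both close to 2k = 10
```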

Definition

$$ f(x) = \frac{1}{2^{k / 2} \Gamma(k / 2)} x^{k / 2-1} e^{-x / 2} $$

How is this distribution derived? 🤔

Uses

Student's t

If $Z \sim \mathcal{N}(0,1)$ and $V \sim \chi^2(m)$ are independent, then

$$ \frac{Z}{\sqrt{V/m}} \sim t(m) $$
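Again, a simulation sketch (assuming NumPy and SciPy are available; $m$, the seed and the quantiles are arbitrary) that builds the ratio from independent $Z$ and $V$ and compares a few sample quantiles with those of $t(m)$:

```python
import numpy as np
from scipy.stats import t

m, n_samples = 7, 200_000
rng = np.random.default_rng(0)
z = rng.standard_normal(n_samples)
v = rng.chisquare(m, n_samples)            # V ~ chi^2(m), independent of Z
samples = z / np.sqrt(v / m)

for q in [0.1, 0.5, 0.9]:
    print(q, np.quantile(samples, q), t.ppf(q, m))   # should roughly agree
```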

Definition

$$ f(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi} \Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^{2}}{\nu}\right)^{-\frac{\nu+1}{2}} $$

How is this distribution derived? 🤔

F

If $V \sim \chi^2(m)$ and $W \sim \chi^2(n)$ are independent, then

$$ \frac{V/m}{W/n} \sim F(m,n) $$

Definition

$$ f(x) = \frac{\sqrt{\frac{\left(d_{1} x\right)^{d_{1}} d_{2}^{d_{2}}}{\left(d_{1} x+d_{2}\right)^{d_{1}+d_{2}}}}}{x B\left(\frac{d_{1}}{2}, \frac{d_{2}}{2}\right)} $$

where $d_1 = m$ and $d_2 = n$ are the degrees of freedom.

How is this distribution derived? 🤔

Uses

Gamma

Definition

$$ f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} $$

for $x > 0$, where $\alpha$ is the shape parameter and $\beta$ the rate parameter.

How is this distribution derived? 🤔

Beta

Definition

$$ f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\mathrm{B}(\alpha, \beta)} $$

where

$$ \mathrm{B}(\alpha, \beta)=\frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)} $$

How is this distribution derived? 🤔

Multinomial

Definition

$$ P(X_1=x_1, \ldots, X_k=x_k)= \frac{n !}{x_{1} ! \cdots x_{k} !} p_{1}^{x_{1}} \cdots p_{k}^{x_{k}} $$

where $x_1 + \cdots + x_k = n$ and $p_1 + \cdots + p_k = 1$.
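A small sketch (assuming SciPy is available) for a fair die rolled $n = 12$ times, asking for the probability of seeing each face exactly twice:

```python
from scipy.stats import multinomial

n = 12
p = [1/6] * 6              # p_1, ..., p_6 for a fair die (they sum to 1)
x = [2, 2, 2, 2, 2, 2]     # each face exactly twice; the counts sum to n

print(multinomial.pmf(x, n=n, p=p))   # n!/(x_1! ... x_k!) * p_1^x_1 ... p_k^x_k
```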

Negative binomial

Definition

$$ P(X=k)= {k+r-1 \choose k} p^k (1-p)^r $$

where $X$ is the number of successes (each with probability $p$) seen before the $r$-th failure.

Distribution

A distribution is a frequency count for certain values.

Eg. The distribution of grades (where grades can be Grade A, Grade B and Grade C) in a class tells us how many students received each grade.

Probability distribution

A probability distribution gives the probabilities of all possible values.

Eg. The probability distribution of grades (where grades can be Grade A, Grade B and Grade C) in a class tells us the probability that a randomly chosen student received each grade.

These probabilities must add up to 1.

Probability mass function (PMF)

A probability mass function is a 'formula' for obtaining all the probabilities of the possible values.

Gamma function

For any positive integer $n$,

$$ \Gamma(n)=(n-1)! $$
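A quick check with the standard library (math.gamma returns a float, math.factorial an integer):

```python
import math

for n in range(1, 6):
    print(n, math.gamma(n), math.factorial(n - 1))   # Gamma(n) == (n - 1)!
```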

Beta function

For any real number $x,y>0$,

$$ \mathrm{B}(x, y)=\int_{0}^{1} t^{x-1}(1-t)^{y-1} dt $$

Resources

https://statdist.ksmzn.com/