Notes on probability distributions
Last updated: 1 Mar 2020
Say in an experiment (e.g. a coin toss) you observe one of two outcomes, $x \in \{0,1\}$. Set the probability of seeing "1" (usually the positive outcome) to $0.8$. It follows that the probability of seeing "0" is $1-0.8=0.2$. So
$$ P(X=1) = 0.8 $$
and
$$ P(X=0) = 0.2 $$
can be rewritten as
$$ P(X=x) = \left\{\begin{array}{ll}{0.8} & {x=1} \\ {0.2} & {x=0} \end{array}\right. $$
which can be rewritten as
$$ P(X=x) = (0.8)^x (0.2)^{1-x} $$
$$ P(X=x) = p^x(1-p)^{1-x} $$
where $p$ is the probability of the positive outcome.
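Here is a minimal sketch of this Bernoulli PMF, using the $p = 0.8$ value from the example above.

```python
# Bernoulli PMF: P(X=x) = p^x (1-p)^(1-x), with p = 0.8 as in the example.
def bernoulli_pmf(x: int, p: float) -> float:
    return p**x * (1 - p)**(1 - x)

print(bernoulli_pmf(1, 0.8))  # 0.8
print(bernoulli_pmf(0, 0.8))  # 0.2
```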
What happens if you have $n$ trials where only the last is a success? What happens if you have $n$ trials, $k$ of which are successes? What happens if you have $K$ different outcomes?
Also known as the generalised Bernoulli distribution or the multinoulli distribution.
This probability distribution describes the possible results of a random variable that takes on one of $K$ possible categories, where the probability of each category is separately specified. A die roll is an example.
There are 3 ways to define such a distribution:
$$ p(x=i) = p_i $$
or
$$ p(x) = p_1^{1_{x=1}} p_2^{1_{x=2}} \cdots p_k^{1_{x=k}} $$
or
$$ p(x) = 1_{x=1} p_1 + 1_{x=2} p_2 + \cdots + 1_{x=k} p_k $$
where $1_{x=i}$ is the indicator function.
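A minimal sketch of the categorical PMF for the die example; the six equal probabilities here are just an illustrative choice.

```python
import numpy as np

# Categorical PMF for a fair die: p(x = i) = p_i, here all p_i = 1/6.
probs = np.full(6, 1 / 6)

def categorical_pmf(i: int) -> float:
    return probs[i - 1]          # outcome i has probability p_i

print(categorical_pmf(3))        # 0.1666...
print(probs.sum())               # the p_i sum to 1
```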
Say you need to conduct an experiment that involves repeating trials until we get what we want. In each trial, there are only two outcomes, $\{0,1\}$. The probability of observing 1 (usually set to what we want) is $p$. It follows that the probability of observing 0 is $1-p$. What is the probability of finally getting 1 in one trial?
$$ p $$
And in 2 trials (getting 0 then 1)?
$$ (1-p)p $$
And in $10$ trials (getting 0 nine times then 1)?
$$ (1-p)^9p $$
And in $k$ trials (getting 0 $k-1$ times then 1)?
$$ P(X=k)= (1-p)^{k-1}p $$
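A minimal sketch of this geometric PMF, with an assumed $p = 0.8$ to reproduce the cases above.

```python
# Geometric PMF: P(X=k) = (1-p)^(k-1) p, i.e. the first success lands on trial k.
def geometric_pmf(k: int, p: float) -> float:
    return (1 - p)**(k - 1) * p

print(geometric_pmf(2, 0.8))    # (0.2)^1 * 0.8 = 0.16, the 2-trial case
print(geometric_pmf(10, 0.8))   # (0.2)^9 * 0.8, the 10-trial case
```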
Say there is a biased coin where seeing heads has probability $0.8$. Flipping this coin 4 times and noting whether each flip is heads or tails, I get
$$ \{H, H, H, T\} $$
exactly in that order. That's 3 heads. What's the probability of getting the above?
$$ 0.8 \times 0.8 \times 0.8 \times 0.2 = (0.8)^3 (0.2)^1 = 0.1024$$
Now what if I flipped 4 times, what's the probability of seeing 3 heads? It can't be $0.1024$ because that's just the probability of seeing $\{H,H,H,T\}$ exactly in that order. When I flip the coin 4 times, there are definitely other combinations that give 3 heads like $\{T, H, H, H\}$.
Let's list down all the possible combinations there can be from flipping the coin 4 times and getting 3 heads:
$$ \{H,H,H,T\}, \{H,H,T,H\}, $$
$$ \{H,T,H,H\}, \{T,H,H,H\} $$
That's 4. Thankfully, we have a calculator that can do ${4 \choose 3}$ = 4. Now let's list down the probabilities of each combination (they're all the same):
Combination | Probability |
---|---|
$\{H,H,H,T\}$ | $(0.8)^3 (0.2)^1$ |
$\{H,H,T,H\}$ | $(0.8)^3 (0.2)^1$ |
$\{H,T,H,H\}$ | $(0.8)^3 (0.2)^1$ |
$\{T,H,H,H\}$ | $(0.8)^3 (0.2)^1$ |
Adding these probabilities up gives us
$$ {4 \choose 3} (0.8)^3 (0.2)^1 = 4 (0.8)^3 (0.2)^1 = 0.4096 $$
The number of heads you see in 4 coin flips is said to be modelled by a Binomial distribution. More generally, for $n$ trials with success probability $p$, the probability of $k$ successes is:
$$ P(X=k)= {n \choose k} p^k (1-p)^{n-k} $$
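A minimal sketch verifying the coin example above with this Binomial PMF.

```python
from math import comb

# Binomial PMF: P(X=k) = C(n, k) p^k (1-p)^(n-k).
def binomial_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(binomial_pmf(3, 4, 0.8))   # 0.4096, matching the coin example above
```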
What happens if you have $(k+r)$ trials, $k$ of which are successes and $r$ are failures?
What happens if you are allowed a billion (or gazillion) trials but only 1 of which is a success and its probability is very, very small?
The Binomial distribution models the number of successes (with probability $p$) from $n$ independent trials. So far, the term "trial" has been used to refer to acts like flipping a biased coin, giving a medication to a person who might react adversely with probability 0.9 and so on. What if we project the concept of "trials" in Binomial distribution onto time and space? What would this mean?
As an example, suppose you play a slap game with an opponent: You are given 10 trials and at every trial, you will slap the opponent's hand, for which you hit with probability 60% or miss with 40%. The number of hits can be modelled using a Binomial distribution.
Coming back to the idea of time projection, we can imagine every second as a trial. Thus, you can look at the game this way: You are given 10 seconds and at every second, you will slap the opponent's hand, for which you hit with probability 60% or miss with 40%. The number of hits is still modelled as a Binomial distribution.
Now what if I told you that in this time frame of 10s, you can slap the opponent without having to do it at every tick of the second-hand? That would be more thrilling, wouldn't it 😈?
But how would you have to rephrase the problem such that the rules of the game still apply?
Firstly, note that the number of trials in that 10s now becomes extremely large. We're not measuring in seconds anymore. We're measuring in milliseconds. Or even microseconds. Heck, it's the concept of time we're talking about. It's not even discrete! We're talking about infinitely many trials, i.e. when $n$ approaches $\infty$.
Secondly, note that in the Binomial setting, we have 10 trials and 60% success as the parameters of the distribution. Does this mean that in the above setting (the Poisson setting, no points for this one), we have $\infty$ trials and still a 60% successful slap for every trial 😱? How can we port these rules to the infinite-trials setting such that the game is still fair? The answer is taking a simple average. Taking $10 \times 60\%$ tells us that in the Binomial setting there will be, on average, 6 successful slaps. This means that in the Poisson setting, there should still be 6 successful slaps on average despite the infinitely many trials.
But wait, what happened to the success probability, you might ask? This number has to go very small. It can't be 60%. That's not fair, because you are changing the rules of the game. It has to be small. And I mean infinitesimally small. So small that when I take that large $n$ to multiply with this small $p$, I get 6. You don't have to care about the number of trials or the probability of success anymore. You just have to care about the average number, 6, which encapsulates these two numbers.
And so this becomes the Poisson distribution. And here's a rephrasing of the game: You are given 10s to slap the opponent's hand, of which 6, on average, are successful hits.
As it turns out, letting $n$ go to $\infty$ reduces the Binomial PMF to a nicer-looking form (see below).
'Trials' in the Binomial distribution are usually attributed to man-made trials. In the Poisson distribution, 'trials' are driven by time and space, things that man cannot control.
The Poisson distribution is an approximation to the Binomial distribution when we are allowed to have as many trials, $n$, as we want, i.e. when $n$ in the Binomial PMF goes to $\infty$:
$$ \lim _{n \rightarrow \infty} P(X=k) \\ =\lim _{n \rightarrow \infty} {n \choose k} p^{k}(1-p)^{n-k} $$
Reparametrise with $\lambda = np$, the average number of times the event occurs within the specified time period. Consequently, $p = \frac{\lambda}{n}$.
$$ =\lim _{n \rightarrow \infty} {n \choose k} \left(\frac{\lambda}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k} \\ =\lim _{n \rightarrow \infty} \frac{n !}{k !(n-k) !} \left(\frac{\lambda}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k} $$
Let's shift $\lambda$'s and $k$'s to the left.
$$ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !} \cdot \frac{n !}{(n-k) !} \left(\frac{1}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k} $$
Expand $n!$
$$ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !} \cdot \frac{n(n-1)(n-2) ... (n-(k-1))(n-k)(n-(k+1))...(2)(1)}{n^k(n-k)!} \left(1-\frac{\lambda}{n}\right)^{n-k} $$
Cancel $(n-k)!$ from the numerator and denominator.
$$ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !}\left(\frac{n}{n} \cdot \frac{n-1}{n} \cdot \frac{n-2}{n} \cdots \frac{n-(k-1)}{n}\right)\left(1-\frac{\lambda}{n}\right)^{n-k} $$
The second term converges to 1 as $n$ approaches $\infty$.
$$ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !} (1) \left(1-\frac{\lambda}{n}\right)^{n-k} \\ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !}\left(1-\frac{\lambda}{n}\right)^{n} \left(1-\frac{\lambda}{n}\right)^{-k} $$
The last term converges to 1 as $n$ approaches $\infty$.
$$ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !}\left(1-\frac{\lambda}{n}\right)^{n} \left(1\right) \\ =\lim _{n \rightarrow \infty} \frac{\lambda^{k}}{k !}\left(1+\frac{-\lambda}{n}\right)^{n} $$
The last term converges to $e^{-\lambda}$ as $n$ approaches $\infty$.
$$ =e^{-\lambda} \frac{\lambda^{k}}{k !} $$
The probability of an event, occurring at rate $\lambda$ per unit time, being observed $k$ times in a given period (one unit of time) is given by
$$ P(X=k) = e^{-\lambda} \frac{\lambda^{k}}{k !} $$
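A rough numerical check of this limit: the Binomial PMF with $p = \lambda/n$ approaches the Poisson PMF as $n$ grows. $\lambda = 6$ follows the slap-game example; $k = 4$ is an arbitrary choice.

```python
from math import comb, exp, factorial

lam, k = 6, 4
poisson = exp(-lam) * lam**k / factorial(k)

# Binomial(n, lambda/n) should approach Poisson(lambda) as n grows.
for n in (10, 100, 10_000):
    p = lam / n
    print(n, comb(n, k) * p**k * (1 - p)**(n - k))
print("Poisson:", poisson)
```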
Inspired by this post.
The law of rare events, or Poisson limit theorem, states that the Poisson distribution may be used as an approximation to the Binomial distribution.
The exponential distribution is an extension of the Poisson distribution. The Poisson PMF measures the probability of seeing an event occur $k$ times in one time period. This means we can measure the probability that an event occurs at all, i.e. that it occurs more than 0 times, which translates to $k=1,2,\ldots$:
$$ P(\text{event occurred}) = P(X=1) + P(X=2) + ... $$
or if the event does not occur at all:
$$ P(\text{event does not occur}) = P(X=0) $$
What happens if we want to know the probability of not observing an event within 3 units of time?
$$ \begin{aligned} & P(\text{0 events occur in the 1st time unit}) \times \\ & P(\text{0 events occur in the 2nd time unit}) \times \\ & P(\text{0 events occur in the 3rd time unit}) \\ = & e^{-\lambda} \frac{\lambda^{0}}{0 !} \times e^{-\lambda} \frac{\lambda^{0}}{0 !} \times e^{-\lambda} \frac{\lambda^{0}}{0 !} \\ = & e^{-3\lambda} \end{aligned} $$
The probabilities are multiplied because these events are independent of each other. We can also look at it differently and ask, "what's the probability of (first) observing the event only after 3 units of time?"
$$ P(\text{event observed after 3 units of time}) = e^{-3\lambda} $$
which can be rephrased as: what's the probability that $T$, the time until the event is (first) observed, is more than 3?
$$ P(T > 3) = e^{-3\lambda} $$
And the probability of observing an event after $t$ units of time?
$$ P(T > t) = e^{-\lambda t} $$
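A minimal sketch of the reasoning above: multiplying the Poisson zero-count probability over $t$ unit intervals gives $e^{-\lambda t}$. The rate $\lambda = 2$ is an assumed value for illustration.

```python
from math import exp

lam, t = 2.0, 3
p_zero_per_unit = exp(-lam) * lam**0 / 1   # Poisson P(X=0) for one time unit
print(p_zero_per_unit**t)                  # no events in t consecutive units
print(exp(-lam * t))                       # e^{-lambda t}, the same number
```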
And so this is the exponential distribution. Intuitions and derivation here. The corresponding CDF is
$$ P(T \leq t) = 1 - e^{-\lambda t} $$
The memorylessness property of the exponential distribution says that if one does not experience a car accident in $10$ days, the probability of experiencing a car accident within the next $2$ days (i.e. within $10+2$ days in total) is the same as the probability of experiencing a car accident within $2$ days.
$$ P(T > 10 + 2 | T > 10) = P(T > 2) $$
And more formally
$$ P(T > a + b | T > a) = P(T > b) $$
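A quick simulation of this property, assuming a rate of $\lambda = 0.1$ accidents per day.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.1
t = rng.exponential(scale=1 / lam, size=1_000_000)

# P(T > 10 + 2 | T > 10) should match P(T > 2) = e^{-0.2}.
print((t > 12).sum() / (t > 10).sum())
print(np.exp(-lam * 2))
```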
For an experiment with 6 different outcomes $\{1,2,3,4,5,6\}$, each with equal probability, what is the probability of seeing any particular outcome?
$$ \frac{1}{6} $$
For an experiment with $n$ different outcomes, each with equal probability, what is the probability of observing any particular outcome?
$$ P(X=x) = \frac{1}{n} $$
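A one-line sketch of this PMF.

```python
# Discrete uniform PMF over n equally likely outcomes.
def uniform_pmf(n: int) -> float:
    return 1 / n

print(uniform_pmf(6))   # 1/6, the fair-die case above
```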
$$X \sim \mathcal{N}(\mu,\sigma^2)$$
$$ f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma ^2}} $$
$$Z \sim \mathcal{N}(0,1)$$
$$ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} $$
How is the normal distribution derived? 🤔
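A quick numerical check of the standard normal density above against scipy's implementation (scipy is assumed to be available).

```python
import numpy as np
from scipy.stats import norm

# Standard normal density: f(x) = exp(-x^2 / 2) / sqrt(2 pi).
def std_normal_pdf(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

x = 1.5
print(std_normal_pdf(x), norm.pdf(x))   # both ≈ 0.1295
```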
If $Z_i$'s are independent, then
$$ \sum_{i=1}^{k} Z_i^2 \sim \chi^{2}(k) $$
Why? 🤔
$$ f(x) = \frac{1}{2^{k / 2} \Gamma(k / 2)} x^{k / 2-1} e^{-x / 2} $$
How is this distribution derived? 🤔
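No derivation here, but a quick simulation consistent with the claim above: the sum of $k$ squared standard normals has mean $k$ and variance $2k$, as a $\chi^2(k)$ variable should.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
samples = (rng.standard_normal((100_000, k)) ** 2).sum(axis=1)
print(samples.mean(), samples.var())   # ≈ k and ≈ 2k, i.e. 5 and 10
```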
If $Z \sim \mathcal{N}(0,1)$ and $V \sim \chi^2(m)$ are independent, then
$$ \frac{Z}{\sqrt{V/m}} \sim t(m) $$
$$ f(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi} \Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^{2}}{\nu}\right)^{-\frac{\nu+1}{2}} $$
How is this distribution derived? 🤔
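Again no derivation, but a quick simulation of the construction above: $Z/\sqrt{V/m}$ has much heavier tails than the standard normal, as a $t(m)$ variable should.

```python
import numpy as np

rng = np.random.default_rng(0)
m, size = 3, 1_000_000
z = rng.standard_normal(size)
v = rng.chisquare(m, size=size)
t = z / np.sqrt(v / m)

print((np.abs(t) > 3).mean())                          # ≈ 0.058 for t(3)
print((np.abs(rng.standard_normal(size)) > 3).mean())  # ≈ 0.003 for N(0,1)
```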
If $V \sim \chi^2(m)$ and $W \sim \chi^2(n)$ are independent, then
$$ \frac{V/m}{W/n} \sim F(m,n) $$
$$ f(x) = \frac{\sqrt{\frac{\left(d_{1} x\right)^{d_{1}} d_{2}^{d_{2}}}{\left(d_{1} x+d_{2}\right)^{d_{1}+d_{2}}}}}{x B\left(\frac{d_{1}}{2}, \frac{d_{2}}{2}\right)} $$
How is this distribution derived? 🤔
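A quick simulation of the construction above: the ratio of scaled chi-squared variables behaves like samples drawn directly from $F(m, n)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, size = 4, 7, 1_000_000
ratio = (rng.chisquare(m, size) / m) / (rng.chisquare(n, size) / n)
direct = rng.f(m, n, size)
print(ratio.mean(), direct.mean())   # both ≈ n / (n - 2) = 1.4
```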
$$ f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} $$
How is this distribution derived? 🤔
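A quick check of this gamma density against scipy; note that scipy parametrises by shape $a = \alpha$ and scale $1/\beta$. The numbers here are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import gamma as gamma_fn
from scipy.stats import gamma as gamma_dist

alpha, beta, x = 3.0, 2.0, 1.5
manual = beta**alpha / gamma_fn(alpha) * x**(alpha - 1) * np.exp(-beta * x)
print(manual, gamma_dist.pdf(x, a=alpha, scale=1 / beta))   # both ≈ 0.448
```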
$$ f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\mathrm{B}(\alpha, \beta)} $$
where
$$ \mathrm{B}(\alpha, \beta)=\frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)} $$
How is this distribution derived? 🤔
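A quick check of this beta density against scipy, with arbitrary illustrative parameters.

```python
from scipy.special import beta as beta_fn
from scipy.stats import beta as beta_dist

a, b, x = 2.0, 5.0, 0.3
manual = x**(a - 1) * (1 - x)**(b - 1) / beta_fn(a, b)
print(manual, beta_dist.pdf(x, a, b))   # both ≈ 2.16
```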
$$ P(X_1=x_1, \ldots, X_k=x_k)= \frac{n !}{x_{1} ! \cdots x_{k} !} p_{1}^{x_{1}} \cdots p_{k}^{x_{k}} $$
$$ P(X=k)= {k+r-1 \choose k} p^k (1-p)^r $$
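A minimal sketch of the two PMFs above, with assumed illustrative numbers: a multinomial over 3 categories, and a negative binomial counting successes before the $r$-th failure.

```python
from math import comb, factorial

# Multinomial: n = 5 trials over 3 categories with counts (2, 2, 1).
xs, ps = (2, 2, 1), (0.5, 0.3, 0.2)
coef = factorial(sum(xs))
prob = 1.0
for x, p in zip(xs, ps):
    coef //= factorial(x)
    prob *= p**x
print(coef * prob)   # 30 * 0.0045 = 0.135

# Negative binomial: k = 3 successes (p = 0.8) before the r = 2nd failure.
k, r, p = 3, 2, 0.8
print(comb(k + r - 1, k) * p**k * (1 - p)**r)   # 0.08192
```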
A distribution is a frequency count for certain values.
Eg. The distribution of grades (where grades can be Grade A, Grade B and Grade C) in a class tells us how many students received each grade.
A probability distribution is the probabilities for all possible values.
Eg. The probability distribution of grades (where grades can be Grade A, Grade B and Grade C) in a class tells us the probability that a randomly picked student received each grade.
These probabilities must add up to 1.
A probability mass function is a 'formula' for obtaining all the probabilities of the possible values.
For any positive integer $n$,
$$ \Gamma(n)=(n-1)! $$
For any real numbers $x,y>0$,
$$ \mathrm{B}(x, y)=\int_{0}^{1} t^{x-1}(1-t)^{y-1} dt $$
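A quick check of these two identities using scipy's special functions (assumed available).

```python
from math import factorial
from scipy.special import beta, gamma

n = 6
print(gamma(n), factorial(n - 1))                      # both 120

x, y = 2.5, 4.0
print(beta(x, y), gamma(x) * gamma(y) / gamma(x + y))  # equal
```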