Notes on naive Bayes for categorical and continuous data
Last updated: 4 Oct 2019
Naive Bayes classifiers are a set of supervised learning algorithms that apply Bayes' theorem with a strong (hence "naive") assumption: the features are independent of one another given the value of the class variable.
There are different naive Bayes classifiers, like Gaussian Naive Bayes, Multinomial Naive Bayes and Bernoulli Naive Bayes. They differ mainly in the assumptions they make about the distribution of each feature given the value of the class variable.
For this post, we focus on two use cases of naive Bayes for classification: categorical data and continuous data. For the former, we assume a categorical distribution on the features; for the latter, a Gaussian distribution. For each use case, we will walk through the same steps: the data, the independence assumption, the parameter estimates and the prediction.
Now let's get to it!
We collected data from 10 engineers: what OS (macOS, Linux or Windows) and deep learning framework (TensorFlow, Keras or PyTorch) they use, and their favourite fast food (KFC or McD).
No. | y (fast food) | x1 (OS) | x2 (framework) |
---|---|---|---|
1 | KFC | macOS | TensorFlow |
2 | KFC | Linux | Keras |
3 | McD | Linux | TensorFlow |
4 | KFC | macOS | Keras |
5 | KFC | Linux | Keras |
6 | KFC | Windows | Keras |
7 | McD | macOS | PyTorch |
8 | McD | Windows | PyTorch |
9 | KFC | Linux | Keras |
10 | KFC | macOS | PyTorch |
? | ? | macOS | PyTorch |
The task is to predict if a person who uses macOS and PyTorch likes KFC or McD.
We assume that, given a particular value of $y$, the distribution of $x_1$ is independent of $x_2$.
Based on the 10 datapoints, we can estimate the following:
$$ p(y) = \left\{\begin{array}{ll} {7/10} & {\text{if } y=\text{KFC}} \\ {3/10} & {\text{if } y=\text{McD}} \\ \end{array}\right. $$
$$ p(x_1|y=\text{KFC}) = \left\{\begin{array}{ll} {3/7} & {\text{if } x_1=\text{macOS}} \\ {3/7} & {\text{if } x_1=\text{Linux}} \\ {1/7} & {\text{if } x_1=\text{Windows}} \\ \end{array}\right. $$
$$ p(x_2|y=\text{KFC}) = \left\{\begin{array}{ll} {1/7} & {\text{if } x_2=\text{TensorFlow}} \\ {5/7} & {\text{if } x_2=\text{Keras}} \\ {1/7} & {\text{if } x_2=\text{PyTorch}} \\ \end{array}\right. $$
$$ p(x_1|y=\text{McD}) = \left\{\begin{array}{ll} {1/3} & {\text{if } x_1=\text{macOS}} \\ {1/3} & {\text{if } x_1=\text{Linux}} \\ {1/3} & {\text{if } x_1=\text{Windows}} \\ \end{array}\right. $$
$$ p(x_2|y=\text{McD}) = \left\{\begin{array}{ll} {1/3} & {\text{if } x_2=\text{TensorFlow}} \\ {0/3} & {\text{if } x_2=\text{Keras}} \\ {2/3} & {\text{if } x_2=\text{PyTorch}} \\ \end{array}\right. $$
Take it for granted that these are maximum likelihood estimates (MLEs): each probability is simply the relative frequency observed in the data.
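To make the counting concrete, here is a minimal Python sketch (the variable names are our own, not from any library) that reproduces these tables from the 10 data points:

```python
from collections import Counter, defaultdict

# The 10 data points as (favourite food, OS, framework).
data = [
    ("KFC", "macOS", "TensorFlow"), ("KFC", "Linux", "Keras"),
    ("McD", "Linux", "TensorFlow"), ("KFC", "macOS", "Keras"),
    ("KFC", "Linux", "Keras"), ("KFC", "Windows", "Keras"),
    ("McD", "macOS", "PyTorch"), ("McD", "Windows", "PyTorch"),
    ("KFC", "Linux", "Keras"), ("KFC", "macOS", "PyTorch"),
]

# p(y): relative frequency of each class.
class_counts = Counter(y for y, _, _ in data)
priors = {y: n / len(data) for y, n in class_counts.items()}

# p(x1|y) and p(x2|y): relative frequency of each value within a class.
value_counts = defaultdict(Counter)  # (feature, class) -> value counts
for y, x1, x2 in data:
    value_counts[("x1", y)][x1] += 1
    value_counts[("x2", y)][x2] += 1

likelihoods = {
    (feat, y): {v: n / class_counts[y] for v, n in counts.items()}
    for (feat, y), counts in value_counts.items()
}

print(priors)                      # {'KFC': 0.7, 'McD': 0.3}
print(likelihoods[("x2", "KFC")])  # Keras: 5/7, TensorFlow: 1/7, PyTorch: 1/7
```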
We want to predict a co-worker's favourite fast food. Knowing that he uses macOS and PyTorch, we can find the probability that KFC is his favourite food:
$$ p(\text{eat KFC}|\text{use macOS \& PyTorch}) $$
$$ = p(y=\text{KFC}|x_1=\text{macOS}, x_2=\text{PyTorch}) $$
$$ = \frac{p(x_1=\text{macOS}, x_2=\text{PyTorch}|y=\text{KFC}) \cdot p(y=\text{KFC})} {p(x_1=\text{macOS}, x_2=\text{PyTorch})} $$
$$ = \frac{p(x_1=\text{macOS}, x_2=\text{PyTorch}|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=\text{macOS}, x_2=\text{PyTorch}|y=i) \cdot p(y=i)} $$
$$ = \frac{p(x_1=\text{macOS}|y=\text{KFC}) \cdot p(x_2=\text{PyTorch}|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=\text{macOS}|y=i) \cdot p(x_2=\text{PyTorch}|y=i) \cdot p(y=i)} $$
$$ = \frac{(\frac{3}{7})(\frac{1}{7})(\frac{7}{10})} {(\frac{3}{7})(\frac{1}{7})(\frac{7}{10}) + (\frac{1}{3})(\frac{2}{3})(\frac{3}{10})} $$
$$ \approx 0.39 $$
It is more probable that this co-worker of ours likes McD instead!
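As a quick sanity check, here is the same arithmetic in a few lines of Python:

```python
# Numerators of the naive Bayes posterior for each class.
num_kfc = (3/7) * (1/7) * (7/10)  # p(macOS|KFC) * p(PyTorch|KFC) * p(KFC)
num_mcd = (1/3) * (2/3) * (3/10)  # p(macOS|McD) * p(PyTorch|McD) * p(McD)
print(num_kfc / (num_kfc + num_mcd))  # 0.3913... -> McD wins with 0.61
```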
Unfortunately, scikit-learn (one of Python's most popular machine learning libraries) has no implementation for categorical naive Bayes 😭. See the GitHub issue here.
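A common workaround is to one-hot encode the features and fit a Bernoulli naive Bayes model. Note this is a sketch, not an exact substitute: `BernoulliNB` also models the absent categories as explicit zeros and applies Laplace smoothing by default, so its probabilities will not match our hand computation exactly.

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import OneHotEncoder

X = [["macOS", "TensorFlow"], ["Linux", "Keras"], ["Linux", "TensorFlow"],
     ["macOS", "Keras"], ["Linux", "Keras"], ["Windows", "Keras"],
     ["macOS", "PyTorch"], ["Windows", "PyTorch"], ["Linux", "Keras"],
     ["macOS", "PyTorch"]]
y = ["KFC", "KFC", "McD", "KFC", "KFC", "KFC", "McD", "McD", "KFC", "KFC"]

# One binary column per (feature, value) pair, then Bernoulli naive Bayes.
encoder = OneHotEncoder().fit(X)
clf = BernoulliNB().fit(encoder.transform(X), y)
print(clf.predict(encoder.transform([["macOS", "PyTorch"]])))
```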
We collected data from 10 engineers: their height (cm) and weight (kg), and their favourite fast food (KFC or McD).
No. | y (fast food) | x1 (height, cm) | x2 (weight, kg) |
---|---|---|---|
1 | KFC | 180 | 75 |
2 | KFC | 165 | 61 |
3 | McD | 167 | 62 |
4 | KFC | 178 | 63 |
5 | KFC | 174 | 69 |
6 | KFC | 166 | 60 |
7 | McD | 167 | 59 |
8 | McD | 165 | 60 |
9 | KFC | 173 | 68 |
10 | KFC | 178 | 71 |
? | ? | 177 | 72 |
The task is to predict if a person who is 177cm tall and weighs 72kg likes KFC or McD.
We again assume that, given a particular value of $y$, the distribution of $x_1$ is independent of $x_2$.
Based on the 10 datapoints, we can estimate:
$$ p(y) = \left\{\begin{array}{ll} {7/10} & {\text{if } y=\text{KFC}} \\ {3/10} & {\text{if } y=\text{McD}} \\ \end{array}\right. $$
Assuming each feature is Gaussian given the class, we estimate the class-conditional means and variances from the data:
$$ p(x_1|y=\text{KFC}) = \frac{1}{\sqrt{2\pi(35)}} \exp(-\frac{(x_1-173)^2}{2(35)}) $$
$$ p(x_2|y=\text{KFC}) = \frac{1}{\sqrt{2\pi(31)}} \exp(-\frac{(x_2-67)^2}{2(31)}) $$
$$ p(x_1|y=\text{McD}) = \frac{1}{\sqrt{2\pi(1.33)}} \exp(-\frac{(x_1-166)^2}{2(1.33)}) $$
$$ p(x_2|y=\text{McD}) = \frac{1}{\sqrt{2\pi(2.33)}} \exp(-\frac{(x_2-60)^2}{2(2.33)}) $$
Take it for granted that the means here are the sample means and the variances are the unbiased ($N-1$) sample variances; see the formula at the end of this post.
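Here is a small NumPy sketch that reproduces these parameters (note `ddof=1` for the unbiased variance):

```python
import numpy as np

# Heights (cm) and weights (kg), grouped by favourite fast food.
heights = {"KFC": [180, 165, 178, 174, 166, 173, 178], "McD": [167, 167, 165]}
weights = {"KFC": [75, 61, 63, 69, 60, 68, 71], "McD": [62, 59, 60]}

for y in ("KFC", "McD"):
    for feature, values in (("height", heights[y]), ("weight", weights[y])):
        x = np.asarray(values, dtype=float)
        # ddof=1 divides by N-1, i.e. the unbiased sample variance.
        print(y, feature, round(x.mean(), 1), round(x.var(ddof=1), 2))
# KFC height 173.4 35.29 / KFC weight 66.7 30.9
# McD height 166.3 1.33  / McD weight 60.3 2.33
```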
We want to find the probability that KFC is a co-worker's favourite food, knowing that he is 177 cm tall and weighs 72 kg.
$$ p(y=\text{KFC}|x_1=177, x_2=72) $$
$$ = \frac{p(x_1=177, x_2=72|y=\text{KFC}) \cdot p(y=\text{KFC})} {p(x_1=177, x_2=72)} $$
$$ = \frac{p(x_1=177, x_2=72|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=177, x_2=72|y=i) \cdot p(y=i)} $$
$$ = \frac{p(x_1=177|y=\text{KFC}) \cdot p(x_2=72|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=177|y=i) \cdot p(x_2=72|y=i) \cdot p(y=i)} $$
$$ = \frac{(0.0532)(0.0479)(\frac{7}{10})} {(0.0532)(0.0479)(\frac{7}{10}) + (6.1\times10^{-21})(9.9\times10^{-15})(\frac{3}{10})} $$
$$ \approx 1 $$
It's almost certain that this person will like KFC!
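This case scikit-learn does cover. Here is a quick cross-check with `GaussianNB` (which estimates variances slightly differently, dividing by $N$ rather than $N-1$, so its internal numbers differ a little from ours):

```python
from sklearn.naive_bayes import GaussianNB

X = [[180, 75], [165, 61], [167, 62], [178, 63], [174, 69],
     [166, 60], [167, 59], [165, 60], [173, 68], [178, 71]]
y = ["KFC", "KFC", "McD", "KFC", "KFC", "KFC", "McD", "McD", "KFC", "KFC"]

clf = GaussianNB().fit(X, y)
print(clf.predict([[177, 72]]))        # ['KFC']
print(clf.predict_proba([[177, 72]]))  # overwhelmingly in favour of KFC
```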
For reference, here are the formulas used throughout this post:
$$ P(A | B)=\frac{P(A \cap B)}{P(B)} $$
$$ P(A \cap B | C)=P(A | C) P(B | C) \quad \text{(when } A \text{ and } B \text{ are conditionally independent given } C\text{)} $$
$$ P(A | B)=\frac{P(B | A) P(A)}{P(B)} $$
The unbiased estimator for the sample variance is
$$ \hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \hat{\mu})^2 $$
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
https://scikit-learn.org/stable/modules/naive_bayes.html