Naive Bayes classifier

Notes on naive Bayes for categorical and continuous data

Last updated: 4 Oct 2019

Introduction

Naive Bayes classifiers are a set of supervised learning algorithms that apply Bayes' theorem under a strong (hence "naive") assumption: the features are conditionally independent of one another given the value of the class variable.

There are several naive Bayes classifiers, such as Gaussian Naive Bayes, Multinomial Naive Bayes and Bernoulli Naive Bayes. These classifiers differ mainly in the assumptions they make about the distribution of each feature given the value of the class variable.

For this post, we focus on two use cases of naive Bayes for classification: categorical data and continuous data. For the former, we will assume a categorical distribution on the features; for the latter, a Gaussian distribution. Each use case is broken down into the same sections: Data, Task, Assumptions, Pre-calculations, Inference and Code.

Now let's get on to it!

Naive Bayes for categorical data

Data

We collected data from 10 engineers: what OS (macOS, Linux or Windows) and deep learning framework (TensorFlow, Keras or PyTorch) they use, and their favourite fast food (KFC or McD).

No.   y (fast food)   x1 (OS)    x2 (framework)
1     KFC             macOS      TensorFlow
2     KFC             Linux      Keras
3     McD             Linux      TensorFlow
4     KFC             macOS      Keras
5     KFC             Linux      Keras
6     KFC             Windows    Keras
7     McD             macOS      PyTorch
8     McD             Windows    PyTorch
9     KFC             Linux      Keras
10    KFC             macOS      PyTorch
?     ?               macOS      PyTorch
Table 1

Task

The task is to predict if a person who uses macOS and PyTorch likes KFC or McD.

Assumptions

We assume that, given a particular value of $y$, the distribution of $x_1$ is independent of the distribution of $x_2$. This is to say that among those people who like KFC, their choice of OS is independent of their choice of deep learning framework, and vice versa (this is not true, we know, hence it's a naive assumption). The same goes for McD: among those who like McD, their choice of OS is independent of their choice of deep learning framework. We also assume that each of these conditional distributions is categorical: this is like throwing a (biased) die with 3 faces.
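This assumption is what lets the class posterior factorise into per-feature terms later on. As a quick restatement of the general naive Bayes decomposition for our two features (this is the identity we will plug numbers into in the Inference section):

$$ p(y|x_1, x_2) = \frac{p(x_1|y) \cdot p(x_2|y) \cdot p(y)}{\sum_{y'} p(x_1|y') \cdot p(x_2|y') \cdot p(y')} $$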

Pre-calculations

Based on the 10 datapoints, we can estimate the following:

$$ p(y) = \left\{\begin{array}{ll} {7/10} & {\text{if } y=\text{KFC}} \\ {3/10} & {\text{if } y=\text{McD}} \\ \end{array}\right. $$

$$ p(x_1|y=\text{KFC}) = \left\{\begin{array}{ll} {3/7} & {\text{if } x_1=\text{macOS}} \\ {3/7} & {\text{if } x_1=\text{Linux}} \\ {1/7} & {\text{if } x_1=\text{Windows}} \\ \end{array}\right. $$

$$ p(x_2|y=\text{KFC}) = \left\{\begin{array}{ll} {1/7} & {\text{if } x_2=\text{TensorFlow}} \\ {5/7} & {\text{if } x_2=\text{Keras}} \\ {1/7} & {\text{if } x_2=\text{PyTorch}} \\ \end{array}\right. $$

$$ p(x_1|y=\text{McD}) = \left\{\begin{array}{ll} {1/3} & {\text{if } x_1=\text{macOS}} \\ {1/3} & {\text{if } x_1=\text{Linux}} \\ {1/3} & {\text{if } x_1=\text{Windows}} \\ \end{array}\right. $$

$$ p(x_2|y=\text{McD}) = \left\{\begin{array}{ll} {1/3} & {\text{if } x_2=\text{TensorFlow}} \\ {0/3} & {\text{if } x_2=\text{Keras}} \\ {2/3} & {\text{if } x_2=\text{PyTorch}} \\ \end{array}\right. $$

Take it for granted that these estimates are simply the relative frequencies observed in the data, i.e. maximum likelihood estimates (a MAP estimate with a uniform prior gives the same values).
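As a minimal sketch, these relative frequencies can be computed from Table 1 with a few lines of Python (the variable names are ours):

from collections import Counter

# Table 1 as (y, x1, x2) triples
data = [("KFC", "macOS", "TensorFlow"), ("KFC", "Linux", "Keras"),
        ("McD", "Linux", "TensorFlow"), ("KFC", "macOS", "Keras"),
        ("KFC", "Linux", "Keras"), ("KFC", "Windows", "Keras"),
        ("McD", "macOS", "PyTorch"), ("McD", "Windows", "PyTorch"),
        ("KFC", "Linux", "Keras"), ("KFC", "macOS", "PyTorch")]

n = len(data)
class_counts = Counter(y for y, _, _ in data)

# p(y): relative frequency of each class
p_y = {c: cnt / n for c, cnt in class_counts.items()}
print(p_y)  # {'KFC': 0.7, 'McD': 0.3}

# p(x1 | y): relative frequency of each OS within each class (p(x2 | y) is analogous)
p_x1 = {c: Counter(x1 for y, x1, _ in data if y == c) for c in class_counts}
p_x1 = {c: {v: cnt / class_counts[c] for v, cnt in counts.items()}
        for c, counts in p_x1.items()}
print(p_x1["KFC"])  # approx {'macOS': 3/7, 'Linux': 3/7, 'Windows': 1/7}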

Inference

We want to predict a co-worker's favourite fast food. Knowing that he uses macOS and PyTorch, we can find the probability of KFC being his favourite food (the probability of him liking McD is just one minus this value):

$$ p(\text{eat KFC}|\text{use macOS \& PyTorch}) \\ $$

$$ = p(y=\text{KFC}|x_1=\text{macOS}, x_2=\text{PyTorch}) \\ $$

$$ = \frac{p(x_1=\text{macOS}, x_2=\text{PyTorch}|y=\text{KFC}) \cdot p(y=\text{KFC})} {p(x_1=\text{macOS}, x_2=\text{PyTorch})} $$

$$ = \frac{p(x_1=\text{macOS}, x_2=\text{PyTorch}|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=\text{macOS}, x_2=\text{PyTorch}|y=i) \cdot p(y=i)} $$

$$ = \frac{p(x_1=\text{macOS}|y=\text{KFC}) \cdot p(x_2=\text{PyTorch}|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=\text{macOS}|y=i) \cdot p(x_2=\text{PyTorch}|y=i) \cdot p(y=i)} $$

$$ = \frac{(\frac{3}{7})(\frac{1}{7})(\frac{7}{10})} {(\frac{3}{7})(\frac{1}{7})(\frac{7}{10}) + (\frac{1}{3})(\frac{2}{3})(\frac{3}{10})} $$

$$ = 0.39 $$

It is more probable that this co-worker of ours likes McD instead! That might be contrary to what you expected, since row 10 of Table 1 contains someone who uses macOS and PyTorch yet likes KFC. Remember, though, that we estimated the parameters of these categorical distributions from the empirical data. Firstly, there is a class imbalance in the dataset. Secondly, have a look at the denominator in the fraction just before the final probability of 0.39: $(\frac{3}{7})(\frac{1}{7})(\frac{7}{10}) + (\frac{1}{3})(\frac{2}{3})(\frac{3}{10})$. Our data tells us that 2 out of 3 (about 67%) of the people who like McD use PyTorch. That is much higher than the proportion of people who like KFC and use PyTorch (1 out of 7, about 14%). This 14% is the main contributor to the lower probability we see for KFC.
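As a quick sanity check on the 0.39 figure, here is the same arithmetic in Python (just plugging in the fractions above):

# unnormalised scores p(x1|y) * p(x2|y) * p(y) for each class
score_kfc = (3/7) * (1/7) * (7/10)  # macOS and PyTorch given KFC, times p(KFC)
score_mcd = (1/3) * (2/3) * (3/10)  # macOS and PyTorch given McD, times p(McD)
print(score_kfc / (score_kfc + score_mcd))  # 0.3913... -> KFC
print(score_mcd / (score_kfc + score_mcd))  # 0.6086... -> McD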

Code

Unfortunately, scikit-learn (one of Python's most popular machine learning libraries) has no implementation for categorical naive Bayes 😭. See the GitHub issue here.
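It is not hard to roll our own, though. Below is a minimal sketch of a categorical naive Bayes classifier in plain Python (the class and method names are ours, not from any library); on the data in Table 1 it reproduces the 0.39 / 0.61 split from the worked example:

class SimpleCategoricalNB:
    # Naive Bayes with categorical (relative-frequency) class-conditional likelihoods

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        n = len(y)
        # class priors p(y)
        self.priors_ = {c: sum(1 for t in y if t == c) / n for c in self.classes_}
        # per-feature conditional probabilities p(x_j = v | y = c)
        self.cond_ = []
        for j in range(len(X[0])):
            table = {}
            for c in self.classes_:
                col = [row[j] for row, t in zip(X, y) if t == c]
                table[c] = {v: col.count(v) / len(col) for v in set(col)}
            self.cond_.append(table)
        return self

    def predict_proba(self, x):
        # unnormalised score p(y) * prod_j p(x_j | y), then normalise over classes
        scores = {}
        for c in self.classes_:
            score = self.priors_[c]
            for j, v in enumerate(x):
                score *= self.cond_[j][c].get(v, 0.0)  # unseen value -> probability 0
            scores[c] = score
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}

X = [["macOS", "TensorFlow"], ["Linux", "Keras"], ["Linux", "TensorFlow"],
     ["macOS", "Keras"], ["Linux", "Keras"], ["Windows", "Keras"],
     ["macOS", "PyTorch"], ["Windows", "PyTorch"], ["Linux", "Keras"],
     ["macOS", "PyTorch"]]
y = ["KFC", "KFC", "McD", "KFC", "KFC", "KFC", "McD", "McD", "KFC", "KFC"]

clf = SimpleCategoricalNB().fit(X, y)
print(clf.predict_proba(["macOS", "PyTorch"]))  # {'KFC': 0.39..., 'McD': 0.60...}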

Naive Bayes for continuous data

Data

We collected data from 10 engineers: their height (cm) and weight (kg), and their favourite fast food (KFC or McD).

No.   y (fast food)   x1 (height, cm)   x2 (weight, kg)
1     KFC             180               75
2     KFC             165               61
3     McD             167               62
4     KFC             178               63
5     KFC             174               69
6     KFC             166               60
7     McD             167               59
8     McD             165               60
9     KFC             173               68
10    KFC             178               71
?     ?               177               72
Table 2

Task

The task is to predict if a person who is 177cm tall and weighs 72kg likes KFC or McD.

Assumptions

We assume that, given a particular value of $y$, the distribution of $x_1$ is independent of the distribution of $x_2$. This is to say that among those people who like KFC, their height is independent of their weight, and vice versa (this is not true, we know, hence it's a naive assumption). The same goes for McD: among those who like McD, their height is independent of their weight. We also assume that each of these conditional distributions is Gaussian, which is a decent assumption for natural phenomena such as height and weight. The type of naive Bayes classifier we are using for this task is therefore called Gaussian naive Bayes.

Pre-calculations

Based on the 10 datapoints, we can estimate:

$$ p(y) = \left\{\begin{array}{ll} {7/10} & {\text{if } y=\text{KFC}} \\ {3/10} & {\text{if } y=\text{McD}} \\ \end{array}\right. $$

We also estimate the class-conditional distributions of $x_1$ and $x_2$ from the data, using the sample mean and unbiased sample variance within each class:

$$ p(x_1|y=\text{KFC}) = \frac{1}{\sqrt{2\pi(35)}} \exp(-\frac{(x_1-173)^2}{2(35)}) $$

$$ p(x_2|y=\text{KFC}) = \frac{1}{\sqrt{2\pi(31)}} \exp(-\frac{(x_2-67)^2}{2(31)}) $$

$$ p(x_1|y=\text{McD}) = \frac{1}{\sqrt{2\pi(1.33)}} \exp(-\frac{(x_1-166)^2}{2(1.33)}) $$

$$ p(x_2|y=\text{McD}) = \frac{1}{\sqrt{2\pi(2.33)}} \exp(-\frac{(x_2-60)^2}{2(2.33)}) $$

Take it for granted that these are point estimates from the data: the sample mean and the unbiased sample variance of each feature within each class (see the section on sample variance below).
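These parameters can be reproduced quickly with NumPy's sample mean and unbiased variance (ddof=1); the array names below are ours:

import numpy as np

# [height (cm), weight (kg)] from Table 2, split by class
kfc = np.array([[180, 75], [165, 61], [178, 63], [174, 69],
                [166, 60], [173, 68], [178, 71]], dtype=float)
mcd = np.array([[167, 62], [167, 59], [165, 60]], dtype=float)

print(kfc.mean(axis=0), kfc.var(axis=0, ddof=1))  # ~[173.4, 66.7]  ~[35.3, 30.9]
print(mcd.mean(axis=0), mcd.var(axis=0, ddof=1))  # ~[166.3, 60.3]  ~[1.33, 2.33]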

Inference

We want to find the probability that a co-worker's favourite food is KFC, knowing that he is 177cm tall and weighs 72kg.

$$ p(y=\text{KFC}|x_1=177, x_2=72) $$

$$ = \frac{p(x_1=177, x_2=72|y=\text{KFC}) \cdot p(y=\text{KFC})} {p(x_1=177, x_2=72)} $$

$$ = \frac{p(x_1=177, x_2=72|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=177, x_2=72|y=i) \cdot p(y=i)} $$

$$ = \frac{p(x_1=177|y=\text{KFC}) \cdot p(x_2=72|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=177|y=i) \cdot p(x_2=72|y=i) \cdot p(y=i)} $$

$$ \approx \frac{(0.0532)(0.0400)(\frac{7}{10})} {(0.0532)(0.0400)(\frac{7}{10}) + (0)(0)(\frac{3}{10})} $$

$$ \approx 1 $$

It's almost certain that this person likes KFC! Compare the distributions of the $x$'s given KFC with those given McD and you'll see why we obtain this extreme result: the McD Gaussians are centred around 166cm and 60kg with tiny variances, so the likelihood of observing 177cm and 72kg under McD is vanishingly small.
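To see the numbers behind this, here is a by-hand check of the inference (evaluating the two Gaussian densities with scipy.stats.norm; note that norm takes the standard deviation, i.e. the square root of the variances estimated above):

from scipy.stats import norm

# p(x1|y) * p(x2|y) * p(y) for each class, using the estimated means and variances
like_kfc = norm.pdf(177, 173, 35 ** 0.5) * norm.pdf(72, 67, 31 ** 0.5) * 0.7
like_mcd = norm.pdf(177, 166, 1.33 ** 0.5) * norm.pdf(72, 60, 2.33 ** 0.5) * 0.3

print(like_kfc / (like_kfc + like_mcd))  # ~1.0: the McD likelihoods are vanishingly small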

Code

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Table 2: [height (cm), weight (kg)] and labels (0 = KFC, 1 = McD)
X = np.array([[180, 75], [165, 61], [167, 62], [178, 63], [174, 69],
              [166, 60], [167, 59], [165, 60], [173, 68], [178, 71]])
y = np.array([0, 0, 1, 0, 0, 0, 1, 1, 0, 0])

clf = GaussianNB()
clf.fit(X, y)
clf.predict([[177, 72]])  # array([0]) -> KFC
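Continuing from the snippet above, predict_proba gives the class probabilities rather than just the predicted label:

clf.predict_proba([[177, 72]])  # roughly array([[1.0, 0.0]]): [p(KFC), p(McD)]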

Questions

Some questions to ponder upon:

Conditional probability

$$ P(A | B)=\frac{P(A \cap B)}{P(B)} $$

Conditional independence

$$ P(A \cap B | C)=P(A | C) P(B | C) $$

Bayes' theorem

$$ P(A | B)=\frac{P(B | A) P(A)}{P(B)} $$

Sample variance

The unbiased estimator for sample variance is

$$ \frac{1}{N-1} \sum_{i=1}^N (x_i - \hat{\mu})^2 $$
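For instance, the variance of 35 used for the KFC heights above can be reproduced with NumPy's ddof=1 option (the array is copied from Table 2):

import numpy as np

kfc_heights = np.array([180, 165, 178, 174, 166, 173, 178])
print(np.var(kfc_heights, ddof=1))  # ~35.3, the unbiased sample variance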

References

https://en.wikipedia.org/wiki/Naive_Bayes_classifier

https://scikit-learn.org/stable/modules/naive_bayes.html