Naive Bayes classifier

Notes on naive Bayes for categorical and continuous data

Last updated: 4 Oct 2019

Introduction

Naive Bayes classifiers are a set of supervised learning algorithms that apply Bayes' theorem under a strong (hence "naive") assumption: the features are conditionally independent of one another given the value of the class variable.

There are several naive Bayes classifiers, such as Gaussian Naive Bayes, Multinomial Naive Bayes and Bernoulli Naive Bayes. These classifiers differ mainly in the assumptions they make about the distribution of each feature given the value of the class variable.

For this post, we focus on two use cases of naive Bayes for classification: categorical data and continuous data. For the former, we will assume a categorical distribution on the features; for the latter, a Gaussian distribution. Each use case is broken down into the same sections: Data, Task, Assumptions, Pre-calculations, Inference and Code.

Now let's get on to it!

Naive Bayes for categorical data

Data

We collected data from 10 engineers: what OS (macOS, Linux or Windows) and deep learning framework (TensorFlow, Keras or PyTorch) they use, and their favourite fast food (KFC or McD).

No.   y (fast food)   x1 (OS)    x2 (framework)
1     KFC             macOS      TensorFlow
2     KFC             Linux      Keras
3     McD             Linux      TensorFlow
4     KFC             macOS      Keras
5     KFC             Linux      Keras
6     KFC             Windows    Keras
7     McD             macOS      PyTorch
8     McD             Windows    PyTorch
9     KFC             Linux      Keras
10    KFC             macOS      PyTorch
?     ?               macOS      PyTorch
Table 1

Task

The task is to predict if a person who uses macOS and PyTorch likes KFC or McD.

Assumptions

We assume that, given a particular value of $y$, the distribution of $x_1$ is independent of the distribution of $x_2$. This is to say that among those people who like KFC, their choice of OS is independent of their choice of deep learning framework, and vice versa (this is not true, we know, hence it's a naive assumption). The same goes for McD: among those who like McD, their choice of OS is independent of their choice of deep learning framework. We also assume that each of these conditional distributions is categorical: this is like throwing a (biased) die with 3 faces.
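This assumption is what lets the class posterior factorise into per-feature terms later on. As a quick restatement of the general naive Bayes decomposition for our two features (this is the identity we will plug numbers into in the Inference section):

$$ p(y|x_1, x_2) = \frac{p(x_1|y) \cdot p(x_2|y) \cdot p(y)}{\sum_{y'} p(x_1|y') \cdot p(x_2|y') \cdot p(y')} $$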

Pre-calculations

Based on the 10 datapoints, we can estimate the following:

$$ p(y) = \left\{\begin{array}{ll} {7/10} & {\text{if } y=\text{KFC}} \\ {3/10} & {\text{if } y=\text{McD}} \\ \end{array}\right. $$

$$ p(x_1|y=\text{KFC}) = \left\{\begin{array}{ll} {3/7} & {\text{if } x_1=\text{macOS}} \\ {3/7} & {\text{if } x_1=\text{Linux}} \\ {1/7} & {\text{if } x_1=\text{Windows}} \\ \end{array}\right. $$

$$ p(x_2|y=\text{KFC}) = \left\{\begin{array}{ll} {1/7} & {\text{if } x_2=\text{TensorFlow}} \\ {5/7} & {\text{if } x_2=\text{Keras}} \\ {1/7} & {\text{if } x_2=\text{PyTorch}} \\ \end{array}\right. $$

$$ p(x_1|y=\text{McD}) = \left\{\begin{array}{ll} {1/3} & {\text{if } x_1=\text{macOS}} \\ {1/3} & {\text{if } x_1=\text{Linux}} \\ {1/3} & {\text{if } x_1=\text{Windows}} \\ \end{array}\right. $$

$$ p(x_2|y=\text{McD}) = \left\{\begin{array}{ll} {1/3} & {\text{if } x_2=\text{TensorFlow}} \\ {0/3} & {\text{if } x_2=\text{Keras}} \\ {2/3} & {\text{if } x_2=\text{PyTorch}} \\ \end{array}\right. $$

Take it for granted that these estimates are simply the relative frequencies observed in the data, i.e. maximum likelihood estimates (a MAP estimate with a uniform prior gives the same values).
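As a minimal sketch, these relative frequencies can be computed from Table 1 with a few lines of Python (the variable names are ours):

from collections import Counter

# Table 1 as (y, x1, x2) triples
data = [("KFC", "macOS", "TensorFlow"), ("KFC", "Linux", "Keras"),
        ("McD", "Linux", "TensorFlow"), ("KFC", "macOS", "Keras"),
        ("KFC", "Linux", "Keras"), ("KFC", "Windows", "Keras"),
        ("McD", "macOS", "PyTorch"), ("McD", "Windows", "PyTorch"),
        ("KFC", "Linux", "Keras"), ("KFC", "macOS", "PyTorch")]

n = len(data)
class_counts = Counter(y for y, _, _ in data)

# p(y): relative frequency of each class
p_y = {c: cnt / n for c, cnt in class_counts.items()}
print(p_y)  # {'KFC': 0.7, 'McD': 0.3}

# p(x1 | y): relative frequency of each OS within each class (p(x2 | y) is analogous)
p_x1 = {c: Counter(x1 for y, x1, _ in data if y == c) for c in class_counts}
p_x1 = {c: {v: cnt / class_counts[c] for v, cnt in counts.items()}
        for c, counts in p_x1.items()}
print(p_x1["KFC"])  # approx {'macOS': 3/7, 'Linux': 3/7, 'Windows': 1/7}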

Inference

We want to predict a co-worker's favourite fast food. Knowing that he uses macOS and PyTorch, we can find the probability of KFC being his favourite food (the probability of him liking McD is just one minus this value):

$$ p(\text{eat KFC}|\text{use macOS \& PyTorch}) \\ $$

$$ = p(y=\text{KFC}|x_1=\text{macOS}, x_2=\text{PyTorch}) \\ $$

$$ = \frac{p(x_1=\text{macOS}, x_2=\text{PyTorch}|y=\text{KFC}) \cdot p(y=\text{KFC})} {p(x_1=\text{macOS}, x_2=\text{PyTorch})} $$

$$ = \frac{p(x_1=\text{macOS}, x_2=\text{PyTorch}|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=\text{macOS}, x_2=\text{PyTorch}|y=i) \cdot p(y=i)} $$

$$ = \frac{p(x_1=\text{macOS}|y=\text{KFC}) \cdot p(x_2=\text{PyTorch}|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=\text{macOS}|y=i) \cdot p(x_2=\text{PyTorch}|y=i) \cdot p(y=i)} $$

$$ = \frac{(\frac{3}{7})(\frac{1}{7})(\frac{7}{10})} {(\frac{3}{7})(\frac{1}{7})(\frac{7}{10}) + (\frac{1}{3})(\frac{2}{3})(\frac{3}{10})} $$

$$ = 0.39 $$

It is more probable that this co-worker of ours likes McD instead! That might be contrary to what you expected, since row 10 of Table 1 contains someone who uses macOS and PyTorch yet likes KFC. Remember, though, that we estimated the parameters of these categorical distributions from the empirical data. Firstly, there is a class imbalance in the dataset. Secondly, have a look at the denominator in the fraction just before the final probability of 0.39: $(\frac{3}{7})(\frac{1}{7})(\frac{7}{10}) + (\frac{1}{3})(\frac{2}{3})(\frac{3}{10})$. Our data tells us that 2 out of 3 (about 67%) of the people who like McD use PyTorch. That is much higher than the proportion of people who like KFC and use PyTorch (1 out of 7, about 14%). This 14% is the main contributor to the lower probability we see for KFC.
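As a quick sanity check on the 0.39 figure, here is the same arithmetic in Python (just plugging in the fractions above):

# unnormalised scores p(x1|y) * p(x2|y) * p(y) for each class
score_kfc = (3/7) * (1/7) * (7/10)  # macOS and PyTorch given KFC, times p(KFC)
score_mcd = (1/3) * (2/3) * (3/10)  # macOS and PyTorch given McD, times p(McD)
print(score_kfc / (score_kfc + score_mcd))  # 0.3913... -> KFC
print(score_mcd / (score_kfc + score_mcd))  # 0.6086... -> McD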

Code

Unfortunately, scikit-learn (one of Python's most popular machine learning libraries) has no implementation for categorical naive Bayes 😭. See the GitHub issue here.
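It is not hard to roll our own, though. Below is a minimal sketch of a categorical naive Bayes classifier in plain Python (the class and method names are ours, not from any library); on the data in Table 1 it reproduces the 0.39 / 0.61 split from the worked example:

class SimpleCategoricalNB:
    # Naive Bayes with categorical (relative-frequency) class-conditional likelihoods

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        n = len(y)
        # class priors p(y)
        self.priors_ = {c: sum(1 for t in y if t == c) / n for c in self.classes_}
        # per-feature conditional probabilities p(x_j = v | y = c)
        self.cond_ = []
        for j in range(len(X[0])):
            table = {}
            for c in self.classes_:
                col = [row[j] for row, t in zip(X, y) if t == c]
                table[c] = {v: col.count(v) / len(col) for v in set(col)}
            self.cond_.append(table)
        return self

    def predict_proba(self, x):
        # unnormalised score p(y) * prod_j p(x_j | y), then normalise over classes
        scores = {}
        for c in self.classes_:
            score = self.priors_[c]
            for j, v in enumerate(x):
                score *= self.cond_[j][c].get(v, 0.0)  # unseen value -> probability 0
            scores[c] = score
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}

X = [["macOS", "TensorFlow"], ["Linux", "Keras"], ["Linux", "TensorFlow"],
     ["macOS", "Keras"], ["Linux", "Keras"], ["Windows", "Keras"],
     ["macOS", "PyTorch"], ["Windows", "PyTorch"], ["Linux", "Keras"],
     ["macOS", "PyTorch"]]
y = ["KFC", "KFC", "McD", "KFC", "KFC", "KFC", "McD", "McD", "KFC", "KFC"]

clf = SimpleCategoricalNB().fit(X, y)
print(clf.predict_proba(["macOS", "PyTorch"]))  # {'KFC': 0.39..., 'McD': 0.60...}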

Naive Bayes for continuous data

Data

We collected data from 10 engineers: their height (cm) and weight (kg), and their favourite fast food (KFC or McD).

No.   y (fast food)   x1 (height, cm)   x2 (weight, kg)
1     KFC             180               75
2     KFC             165               61
3     McD             167               62
4     KFC             178               63
5     KFC             174               69
6     KFC             166               60
7     McD             167               59
8     McD             165               60
9     KFC             173               68
10    KFC             178               71
?     ?               177               72
Table 2

Task

The task is to predict if a person who is 177cm tall and weighs 72kg likes KFC or McD.

Assumptions

We assume that, given a particular value of $y$, the distribution of $x_1$ is independent of the distribution of $x_2$. This is to say that among those people who like KFC, their height is independent of their weight, and vice versa (this is not true, we know, hence it's a naive assumption). The same goes for McD: among those who like McD, their height is independent of their weight. We also assume that each of these conditional distributions is Gaussian, which is a decent assumption for natural phenomena such as height and weight. The type of naive Bayes classifier we are using for this task is therefore called Gaussian naive Bayes.

Pre-calculations

Based on the 10 datapoints, we can estimate:

$$ p(y) = \left\{\begin{array}{ll} {7/10} & {\text{if } y=\text{KFC}} \\ {3/10} & {\text{if } y=\text{McD}} \\ \end{array}\right. $$

We also estimate the class-conditional distributions of $x_1$ and $x_2$ from the data, using the sample mean and unbiased sample variance within each class:

$$ p(x_1|y=\text{KFC}) = \frac{1}{\sqrt{2\pi(35)}} \exp(-\frac{(x_1-173)^2}{2(35)}) $$

$$ p(x_2|y=\text{KFC}) = \frac{1}{\sqrt{2\pi(31)}} \exp(-\frac{(x_2-67)^2}{2(31)}) $$

$$ p(x_1|y=\text{McD}) = \frac{1}{\sqrt{2\pi(1.33)}} \exp(-\frac{(x_1-166)^2}{2(1.33)}) $$

$$ p(x_2|y=\text{McD}) = \frac{1}{\sqrt{2\pi(2.33)}} \exp(-\frac{(x_2-60)^2}{2(2.33)}) $$

Take it for granted that these are point estimates from the data: the sample mean and the unbiased sample variance of each feature within each class (see the section on sample variance below).
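These parameters can be reproduced quickly with NumPy's sample mean and unbiased variance (ddof=1); the array names below are ours:

import numpy as np

# [height (cm), weight (kg)] from Table 2, split by class
kfc = np.array([[180, 75], [165, 61], [178, 63], [174, 69],
                [166, 60], [173, 68], [178, 71]], dtype=float)
mcd = np.array([[167, 62], [167, 59], [165, 60]], dtype=float)

print(kfc.mean(axis=0), kfc.var(axis=0, ddof=1))  # ~[173.4, 66.7]  ~[35.3, 30.9]
print(mcd.mean(axis=0), mcd.var(axis=0, ddof=1))  # ~[166.3, 60.3]  ~[1.33, 2.33]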

Inference

We want to find the probability that a co-worker's favourite food is KFC, knowing that he is 177cm tall and weighs 72kg.

$$ p(y=\text{KFC}|x_1=177, x_2=72) $$

$$ = \frac{p(x_1=177, x_2=72|y=\text{KFC}) \cdot p(y=\text{KFC})} {p(x_1=177, x_2=72)} $$

$$ = \frac{p(x_1=177, x_2=72|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=177, x_2=72|y=i) \cdot p(y=i)} $$

$$ = \frac{p(x_1=177|y=\text{KFC}) \cdot p(x_2=72|y=\text{KFC}) \cdot p(y=\text{KFC})} {\sum_{i=\text{KFC,McD}} p(x_1=177|y=i) \cdot p(x_2=72|y=i) \cdot p(y=i)} $$

$$ \approx \frac{(0.0532)(0.0400)(\frac{7}{10})} {(0.0532)(0.0400)(\frac{7}{10}) + (0)(0)(\frac{3}{10})} $$

$$ \approx 1 $$

It's almost certain that this person likes KFC! Compare the distributions of the $x$'s given KFC with those given McD and you'll see why we obtain this extreme result: the McD Gaussians are centred around 166cm and 60kg with tiny variances, so the likelihood of observing 177cm and 72kg under McD is vanishingly small.
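To see the numbers behind this, here is a by-hand check of the inference (evaluating the two Gaussian densities with scipy.stats.norm; note that norm takes the standard deviation, i.e. the square root of the variances estimated above):

from scipy.stats import norm

# p(x1|y) * p(x2|y) * p(y) for each class, using the estimated means and variances
like_kfc = norm.pdf(177, 173, 35 ** 0.5) * norm.pdf(72, 67, 31 ** 0.5) * 0.7
like_mcd = norm.pdf(177, 166, 1.33 ** 0.5) * norm.pdf(72, 60, 2.33 ** 0.5) * 0.3

print(like_kfc / (like_kfc + like_mcd))  # ~1.0: the McD likelihoods are vanishingly small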

Code

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Table 2: [height (cm), weight (kg)] and labels (0 = KFC, 1 = McD)
X = np.array([[180, 75], [165, 61], [167, 62], [178, 63], [174, 69],
              [166, 60], [167, 59], [165, 60], [173, 68], [178, 71]])
y = np.array([0, 0, 1, 0, 0, 0, 1, 1, 0, 0])

clf = GaussianNB()
clf.fit(X, y)
clf.predict([[177, 72]])  # array([0]) -> KFC
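Continuing from the snippet above, predict_proba gives the class probabilities rather than just the predicted label:

clf.predict_proba([[177, 72]])  # roughly array([[1.0, 0.0]]): [p(KFC), p(McD)]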

Questions

Some questions to ponder upon:

Conditional probability

$$ P(A | B)=\frac{P(A \cap B)}{P(B)} $$

Conditional independence

$$ P(A \cap B | C)=P(A | C) P(B | C) $$

Bayes' theorem

$$ P(A | B)=\frac{P(B | A) P(A)}{P(B)} $$

Sample variance

The unbiased estimator for sample variance is

$$ \frac{1}{N-1} \sum_{i=1}^N (x_i - \hat{\mu})^2 $$
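For instance, the variance of 35 used for the KFC heights above can be reproduced with NumPy's ddof=1 option (the array is copied from Table 2):

import numpy as np

kfc_heights = np.array([180, 165, 178, 174, 166, 173, 178])
print(np.var(kfc_heights, ddof=1))  # ~35.3, the unbiased sample variance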

References

https://en.wikipedia.org/wiki/Naive_Bayes_classifier

https://scikit-learn.org/stable/modules/naive_bayes.html