In this post, we are going to try to understand how the multinomial naive Bayes classifier works and provide working examples with Python and scikit-learn.
What we’ll see:
- What the multinomial distribution is: as opposed to Gaussian naive Bayes classifiers, which rely on an assumed Gaussian distribution, multinomial naive Bayes classifiers rely on a multinomial distribution.
- The general approach to creating classifiers that rely on Bayes' theorem, together with the naive assumption that the input features are independent of each other given the target class.
- How a multinomial classifier is “fitted” by learning/estimating the multinomial probabilities for each class, using the smoothing trick to handle features with zero counts.
- How the probabilities of a new sample are computed, using the log-space trick to avoid underflow.
All images by author.
If you are already familiar with the multinomial distribution, you can move on to the next part.
The first important step in understanding the multinomial naive Bayes classifier is to understand what a multinomial distribution is.
In simple words, it represents the probabilities of the outcomes of an experiment that has a finite number of possible outcomes and is repeated N times: for example, rolling a die with 6 faces, say, 10 times and counting the number of times each face appears. Another example is counting the number of occurrences of each word of a vocabulary in a text.
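To make this concrete, here is a minimal sketch (not from the post itself) that draws counts from a multinomial distribution with NumPy, using the fair-die example above; the variable names are mine.

```python
import numpy as np

# Illustrative sketch: roll a fair 6-sided die 10 times and count how many
# times each face appears. One draw from a multinomial distribution is
# exactly such a vector of counts.
rng = np.random.default_rng(seed=0)

n_rolls = 10                            # the experiment is repeated N = 10 times
face_probabilities = np.full(6, 1 / 6)  # probabilities of the 6 outcomes, summing to 1

counts = rng.multinomial(n_rolls, face_probabilities)
print(counts)        # 6 counts, one per face
print(counts.sum())  # always equals n_rolls
```

Counting how many times each word of a vocabulary appears in a document works the same way, with one probability per word instead of one per face.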
You can also see the multinomial distribution as an extension of the binomial distribution: instead of tossing a coin with 2 possible outcomes (binomial), you roll a die with 6 outcomes (multinomial). As for the binomial distribution, the probabilities of all the possible outcomes must sum to 1. So we could have: