Unlike me and you, computers only work in binary numbers. So, they can’t see and understand an image. However, we can represent images using pixels. For a grayscale image, the smaller the pixel the darker it is. A pixel takes on values anywhere between 0 (black) and 255 (white), numbers in the middle are a spectrum of greys. This number range is equal to a byte in binary, which is ²⁸, this is the smallest working unit of most computers.
Below is an example image that I created in Python and its corresponding pixel values:
Using this concept, we can develop algorithms that can see patterns in these pixels to classify images. This is exactly what a Convolutional Neural Network (CNN) does.
Most images are not grayscale and have some color. They are typically represented using RGB, where we have three channels that are red, green, and blue. Each color is a two-dimensional pixel grid, which is then stacked on top of each other. So, the image input is then three-dimensional.
The code used to generate the plot is available on my GitHub:
Overview
The key part of CNNs is the convolution operation. I have a full article detailing how convolution works, but I will give a quick recap here for completeness. If you want deep understanding, then I highly recommend you check the previous post: