Fundamentals of Image Recognition: A Beginner's Approach | Written by Prerak Khandelwal



Prerak Khandelwal
Becoming Human: Artificial Intelligence Magazine

“Just as electricity transformed almost everything 100 years ago, today it’s hard to even think of an industry that I don’t think AI will transform within the next few years.” — Andrew Ng

A digital image is a grid-like representation of visual data: a set of pixel values indicating the brightness and color of each pixel. Image recognition is the process of taking an image as input, passing it through a neural network, and producing a class label as output. This class label comes from a set of predefined classes.


A convolutional neural network, also known as a CNN, is a class of neural networks that specializes in processing data with a grid-like topology, such as images. It works much like the way we perceive images: our brains analyze vast amounts of visual data, and each neuron has its own receptive field, connecting with other neurons so that the entire visual field is covered. Similarly, CNNs have many layers designed to detect simpler patterns first (lines, curves, etc.), followed by more complex patterns (faces, objects, etc.).

Representing an image as a grid of pixels

The input given to a CNN is an image. If the image is 32 pixels wide by 32 pixels high and contains 3 channels (R, G, B), the raw input holds 32 × 32 × 3 = 3,072 pixel values.

The image is thus a matrix of values (see Figure 1) that is given as input to the CNN and passed through multiple layers.
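To make the "grid of pixels" idea concrete, here is a minimal NumPy sketch of a hypothetical 32×32 RGB image; the array shape and values are illustrative, not from the article:

```python
import numpy as np

# A hypothetical 32x32 RGB image as a NumPy array.
# Each entry is a pixel intensity in [0, 255] for one of the R, G, B channels.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

print(image.shape)   # (32, 32, 3): height x width x channels
print(image.size)    # 3072 raw pixel values in total
print(image[0, 0])   # the R, G, B values of the top-left pixel
```

This is exactly the matrix-of-values view described above: the CNN's first layer receives these raw numbers, nothing more.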

Now let's learn about the different layers of CNN.


convolutional layer

This is the core building block of a CNN. A dot product is computed between two matrices: the kernel, a small matrix of learnable parameters (weights), and a same-sized patch of the image matrix given as input. The kernel has smaller dimensions than the image. It slides across the image matrix, and at each position the element-wise products are summed to produce one element of the resulting output matrix (see figure).

How convolutional layers work
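The sliding-kernel operation described above can be sketched in a few lines of NumPy. This is a simplified illustration (stride 1, no padding, single channel); the toy image and kernel values are made up for the example:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (valid padding, stride 1) and take
    the element-wise product-and-sum at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # toy 2x2 kernel

# Each output element is image[i, j] - image[i+1, j+1], which is -5
# everywhere for this particular image/kernel pair.
print(conv2d(image, kernel))
```

Note how a 4×4 input and a 2×2 kernel yield a 3×3 output: the output shrinks by (kernel size − 1) in each dimension under valid padding.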

Advantages of convolutional layers –

  1. Sparse interaction: achieved by making the kernel smaller than the input. An image may contain thousands or millions of pixels, but a small kernel can detect meaningful information in patches of tens or hundreds of pixels. This means fewer parameters, less memory, and more efficient computation.
  2. Parameter sharing: the same kernel, and therefore the same set of weights, is applied at every position of the input, so one set of parameters is learned rather than a separate set per location.
  3. Equivariant representation: because the kernel slides over the input in a fixed way, if you shift the input, the output shifts in the same way.

pooling layer

This reduces the spatial size of the representation by replacing nearby outputs with a summary statistic, which in turn reduces the amount of computation. There are several pooling functions, including the average of a rectangular frame, the L2 norm of a rectangular frame, and a weighted average based on distance from the center pixel. The most common is max pooling, which takes the maximum value of the elements in each frame.

pooling
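Max pooling is easy to see with a small worked example. The sketch below applies 2×2 non-overlapping max pooling to a made-up 4×4 feature map:

```python
import numpy as np

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the maximum of each
    size x size frame, halving each spatial dimension for size=2."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]          # trim to a multiple of size
    x = x.reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))                    # max within each frame

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [8, 2, 9, 7],
              [1, 0, 3, 4]])
print(max_pool(x))
# [[6 5]
#  [8 9]]
```

Each 2×2 frame collapses to its largest value, so the 4×4 map becomes 2×2: a quarter of the values, and a quarter of the downstream computation.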

fully connected layer

A fully connected (FC) layer consists of neurons, weights, and biases. "Fully connected" here means that each neuron in the FC layer is connected to every neuron in the next layer. FC layers usually sit just before the output, at a late stage in the CNN architecture. The feature maps from the previous layer are flattened into a vector and fed to the FC layer; the flattened vector then passes through one or more further layers where weighted sums are computed. This is where the classification begins.

Image recognition and classification
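The flatten-then-fully-connect step can be sketched as follows. The feature-map sizes, weight shapes, and class count here are hypothetical, chosen only to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature maps from a previous layer: 8 maps of 5x5.
features = rng.standard_normal((8, 5, 5))

# Flatten into a single vector before the fully connected layer.
flat = features.reshape(-1)                 # 8 * 5 * 5 = 200 values

# One FC layer: every input connects to every one of 10 output neurons.
W = rng.standard_normal((10, flat.size))    # weights: 10 neurons x 200 inputs
b = np.zeros(10)                            # one bias per neuron
logits = W @ flat + b                       # one raw score per class

print(flat.shape, logits.shape)             # (200,) (10,)
```

The weight matrix makes "fully connected" literal: every one of the 200 flattened inputs has its own weight into every one of the 10 output neurons, 2,000 weights in all.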

Dropout

Typically, overfitting can occur on the training dataset when all features are connected to the FC layer. Overfitting occurs when a model performs too well on training data, which hurts its performance on new data. To overcome this problem, dropout layers remove a random subset of neurons from the neural network during training, reducing the effective size of the model. Passing a rate of 0.35 randomly removes 35% of the neurons.
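A minimal sketch of dropout as it is commonly implemented ("inverted" dropout, where surviving activations are rescaled so the expected value is unchanged); the rate of 0.35 matches the example in the text:

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero out roughly `rate` of the activations at
    training time, rescaling the survivors by 1 / (1 - rate)."""
    keep = rng.random(x.shape) >= rate
    return np.where(keep, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, 0.35, rng)

print((y == 0).mean())   # fraction dropped: close to 0.35
```

At inference time dropout is disabled and the full network is used; the rescaling during training is what makes that switch consistent.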

Activation function

Finally, one of the most important components of a CNN model is the activation function. Activation functions let the network learn and approximate complex, nonlinear relationships between variables. Simply put, they decide which information should be propagated forward through the network and which should not.

They add nonlinearity to the network. Several activation functions are commonly used, such as ReLU, softmax, tanh, and sigmoid, each with its preferred uses. For binary classification, the sigmoid function is recommended; for multiclass classification, softmax is commonly used.

Some features are explained below.

  1. ReLU — rectified linear unit. ReLU is an element-wise operation, applied value by value across the feature map: all negative values are replaced with zero, and positive values pass through unchanged.
  2. Sigmoid — converts its input to a value between 0.0 and 1.0. Large positive inputs saturate toward 1.0, and likewise large negative inputs saturate toward 0.0.
  3. Hyperbolic tangent (tanh) — a similarly S-shaped nonlinear activation function that outputs values between -1.0 and 1.0.
  4. Softmax — calculates relative probabilities across classes, and can be seen as a generalization of the sigmoid function to multiple classes.
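The four functions above are short enough to write out directly. This NumPy sketch implements each one; the sample inputs are arbitrary:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)           # negatives become 0, positives pass

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes into (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes into (-1, 1)

def softmax(x):
    e = np.exp(x - x.max())           # subtract max for numerical stability
    return e / e.sum()                # relative probabilities summing to 1

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))            # [0. 0. 3.]
print(sigmoid(0.0))       # 0.5
print(softmax(x))         # three probabilities that sum to 1
```

Note the `x - x.max()` shift inside softmax: it changes nothing mathematically but prevents `exp` from overflowing on large inputs.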
Predict classes by combining different layers
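Putting the pieces together, a single forward pass through the layers discussed above (convolution, ReLU, max pooling, flatten, fully connected, softmax) can be sketched end to end. All shapes, weights, and the 4-class output here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, k):
    """Valid-padding, stride-1 convolution of a 2D image with kernel k."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    return np.array([[np.sum(img[i:i + kh, j:j + kw] * k) for j in range(ow)]
                     for i in range(oh)])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

image = rng.standard_normal((8, 8))        # toy single-channel image
kernel = rng.standard_normal((3, 3))       # one learnable 3x3 kernel

x = np.maximum(0, conv2d(image, kernel))   # convolution + ReLU -> 6x6
x = x.reshape(3, 2, 3, 2).max(axis=(1, 3)) # 2x2 max pooling     -> 3x3
x = x.reshape(-1)                          # flatten             -> 9 values
W = rng.standard_normal((4, 9))            # FC layer: 4 hypothetical classes
probs = softmax(W @ x)                     # class probabilities

print(probs.shape)                         # (4,)
```

In a trained network, the kernel and FC weights would be learned by backpropagation rather than drawn at random; the predicted class is simply the index of the largest probability.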

What is the difference between neural networks and CNNs?

A simple neural network flattens the original image into a list of values and accepts that as its input, so information between adjacent pixels may not be preserved. In contrast, a CNN uses convolutional layers that preserve the spatial relationships between neighboring pixels.

Image recognition using deep learning has a wide range of applications, including improving augmented reality games and applications, supporting education systems, optimizing medical images, predicting consumer behavior, and giving vision to machines. This covered the basics; from here you can dive deeper into the world of using images as a means to improve technology and lifestyle.



