Deep learning model with machine vision system for recognizing food types during food consumption



Data collection

Collecting food samples is essential for creating effective management systems that affect health and the environment. Addressing obesity requires accurate data and intelligent systems to ensure a proper food supply that reflects the diverse culinary practices of various civilizations17,18.

This study constructed a diverse dataset of popular foods from across Iran, a country known for its rich culinary culture. Selecting representative dishes from 32 provinces proved extremely difficult because of the vast array of options available. The selected foods reflect public interest and opinion, cultural significance, prevalence in restaurants, hospitals, and university dining halls, and insights from nutritionists. A total of 32 foods were prepared, representing the different ingredients and preparation styles of all provinces. Because recipes can vary significantly between regions, samples of each dish were collected from various sources, including home cooks, restaurants, and dining halls. Sixteen major food classes were selected first, and an additional 16 were included to increase the likelihood that the trained model can recognize food during consumption. After this selection process, images and videos of the prepared dishes were captured under different conditions, including different angles, backgrounds, and lighting. This comprehensive dataset, reflecting a variety of recipes and settings, is designed to train deep learning algorithms effectively and to improve the prospects of real-time food recognition in real-world applications. Figure 2 presents several examples of selected foods extracted from the videos, illustrating the various backgrounds, angles, and distances used in this study and providing information on food morphology.

Figure 2

Source: Created by the author.

Examples of the collected food images.

The foods were divided into three main meals: breakfast, lunch, and dinner. Each meal contained several food classes, and each class was built around one main dish, with drinks and appetizers also appearing in the image. Figure 3 shows the three main meal categories and the selected subset of 32 classes, along with the food names.

Figure 3

Dataset composition: the inner circle displays the three main meal categories, and the outer circle details the main food categories.

To detect foods correctly during consumption, images of each product were first captured at the time of consumption. Videos of each product were recorded under different conditions during consumption so that the system could be analyzed more thoroughly and used more easily. First, 16 food classes were selected and several videos were recorded for each. Frames were extracted from the videos at 0.5-second intervals using the Python programming language. For these 16 classes, 12,000 images with dimensions of 1280 x 680 pixels were acquired, covering the start to the end of consumption. To enlarge the dataset, additional videos were recorded, from which 24,000 images with dimensions of 1280 x 680 pixels were acquired for 32 classes of consumable foods. Figure 4 shows the number of acquired images in each class as bar charts.
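A minimal sketch of this frame-extraction step, assuming OpenCV is available: the 0.5-second sampling interval and 1280 x 680 output size follow the description above, while the function name, file names, and paths are hypothetical placeholders rather than the study's actual scripts.

```python
# Sketch: sample frames from a consumption video at fixed time intervals.
import cv2
import os

def extract_frames(video_path, out_dir, interval_s=0.5, size=(1280, 680)):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30        # fall back if FPS is missing
    step = max(int(round(fps * interval_s)), 1)  # frames between samples
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.resize(frame, size)
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example usage (hypothetical file names):
# extract_frames("meal_video_01.mp4", "frames/class_01")
```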

Figure 4

Number of images for each category.

Imaging System

The imaging system designed to identify consumed foods consists of a camera mounted on a monopod and two belts for secure attachment. A mobile phone with a suitable camera is fixed to the monopod, which is strapped to the person performing the sampling. The monopod is positioned on the left side of the body and aligns the phone's camera roughly parallel to the shoulder. The camera angle is set at approximately 160° relative to the person; this angle varies as the individual eats, so body movements create new viewing angles in each video frame. A variety of light sources, such as fluorescent and LED lamps, were used to accommodate different environments and lighting conditions. Figure 5 shows the arrangement of the camera and associated equipment.

Figure 5

Camera and equipment placement for food imaging.

Camera

Several camera models were used for sampling in the food identification system, including the Samsung S23 and S23 Ultra and the Huawei P50, P40, P30, and P30 Lite, each offering different video and image recording quality. In addition, several Canon and Nikon video and still cameras with 4K resolution, capturing 10 and 6.5 frames per second, were used to provide a variety of devices for this purpose.

Computers and Processing

Videos and images captured with the imaging system were transferred from the mobile phones and cameras, via USB 3.0 ports, to a laptop with the following specifications: AMD Ryzen 5 3500U CPU, AMD Radeon Vega 8 GPU, 8 GB RAM, and a 128 GB SSD. In this machine vision system, the computer serves as the primary platform for image analysis and processing, running a set of programs within the limits of its processing power. Data preprocessing involved steps such as removing missing data and improving image quality. Once prepared, the data were uploaded to Google Drive and made accessible through Google Cloud for dataset analysis. This study used Python version 3.12 and Google Cloud to implement the deep learning algorithms. Because these algorithms require substantial processing power for the large amount of data generated, they rely on efficient processing systems, including support from Google Colab.
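As a rough illustration of this workflow, the snippet below mounts Google Drive inside a Google Colab notebook and points at the uploaded dataset; the folder name is a hypothetical placeholder, not the actual path used in the study.

```python
# Sketch: access the prepared dataset from Google Drive in Google Colab.
from google.colab import drive
import os

drive.mount('/content/drive')                     # authorize Drive access

DATA_DIR = '/content/drive/MyDrive/food_dataset'  # hypothetical dataset folder
print(sorted(os.listdir(DATA_DIR))[:5])           # peek at a few class folders
```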

Data augmentation

Data augmentation is a technique for enriching a dataset by varying the size, intensity, color, and lighting of the images used by deep learning algorithms. By extending the diversity of the training data beyond the original dataset, it is particularly useful for improving the efficiency of deep learning algorithms, preventing overfitting, and improving the generalizability of the model19. To enhance object recognition, the image is first rotated by 10-15 degrees; next, it is translated left and right while preserving the object; it is then sheared to a new angle; it is then zoomed in or out so the model trains at different scales; finally, contrast and brightness are adjusted to simulate various lighting conditions. These five data augmentation steps increased the 16-class dataset from 12,000 to 60,000 images and the 32-class dataset from 24,000 to 120,000 images.
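As an illustration only, the five steps could be expressed with Keras' ImageDataGenerator roughly as follows; the parameter values and directory path are assumptions rather than the study's exact settings, and contrast adjustment would require an additional preprocessing function.

```python
# Sketch of the rotation, translation, shear, zoom, and brightness steps.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,            # rotate by up to ~10-15 degrees
    width_shift_range=0.1,        # translate left/right
    shear_range=10,               # shear to a new angle (degrees)
    zoom_range=0.2,               # zoom in/out for different scales
    brightness_range=(0.7, 1.3),  # vary lighting conditions
)

# Example: stream augmented batches from class subdirectories
# (hypothetical path), expanding the effective training set.
train_gen = augmenter.flow_from_directory(
    'dataset/train', target_size=(224, 224), batch_size=32)
```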

Deep Learning Architecture

A brief introduction to convolutional neural networks (CNNs)

Convolutional neural networks (CNNs) represent an important advance in deep learning and are used primarily for image classification20,21. They consist of convolutional, pooling, and fully connected layers; Figure 6 shows detailed information about this processing. Typically, the architecture involves preprocessing the input image through resizing and quality improvement before applying the convolutional, ReLU, and pooling layers. CNN training includes two stages: feedforward and backpropagation22. Initially, the image is multiplied pointwise by the neuron parameters, followed by a convolution operation. The network output is compared with the correct answer to calculate the error, which is used to adjust the parameters. This iterative process continues until training is complete23.

Figure 6

Source: Created by the author.

The structure of the CNN used to extract features and recognize specific objects.
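For readers unfamiliar with this layer sequence, the following is a minimal Keras sketch of the convolution, ReLU, pooling, and fully connected pipeline shown in Figure 6; the layer counts and sizes are illustrative only and do not correspond to the architectures evaluated in this study.

```python
# Sketch: a small CNN with convolution + ReLU, pooling, and dense layers.
from tensorflow.keras import layers, models

def build_cnn(num_classes=32, input_shape=(224, 224, 3)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation='relu'),          # convolution + ReLU
        layers.MaxPooling2D(),                            # pooling
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),             # fully connected
        layers.Dense(num_classes, activation='softmax'),  # class probabilities
    ])
    return model
```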

Hyperparameters

Hyperparameters of deep learning algorithms, such as batch size, optimizer (e.g., Adam and Lion), and learning rate, have a significant impact on model accuracy and loss. Adjusting these hyperparameters can improve model performance by increasing the convergence rate and the accuracy of the recognition system. The learning rate is an important hyperparameter in the training of deep neural networks and has a direct effect on underfitting and overfitting. Another important hyperparameter, the optimizer, adjusts the weights of the algorithm to minimize the loss and improve accuracy; optimizers help prevent overfitting, improve predictions on unseen data, and increase training speed. Various optimizers were used in this study24,25. Image size is also an important parameter in the data preprocessing of deep learning algorithms, with a significant impact on training efficiency, model performance, hardware constraints, and architectural requirements26. Finally, batch size refers to the number of images processed together by the generator or model; it helps manage large datasets and reduces GPU and CPU usage. Overall, memory usage and training time depend on both the batch size and the image size27,28.
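To make these terms concrete, the sketch below shows one way the batch size, image size, learning rate, and optimizer might be wired together in Keras; all values are placeholders rather than the tuned settings of this study, and it reuses the hypothetical build_cnn sketch shown earlier.

```python
# Hedged example of setting the hyperparameters discussed above.
import tensorflow as tf

IMG_SIZE = (224, 224)   # image size: affects memory use and training time
BATCH_SIZE = 32         # images processed together in one step
LEARNING_RATE = 1e-4    # too high risks divergence; too low slows convergence

model = build_cnn(num_classes=32, input_shape=IMG_SIZE + (3,))  # sketch above
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),  # or Lion
    loss='categorical_crossentropy',
    metrics=['accuracy'])

# Training would then stream batches of BATCH_SIZE resized images, e.g. from
# the augmenter.flow_from_directory(...) generator shown earlier.
```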

Implementing deep learning algorithms

After preparing the dataset and feeding the data to the model, the deep learning architecture must be adjusted according to the complexity and size of the dataset. As shown in Figure 7, under the image-capture conditions and processing approach used here (preprocessing on the laptop and training in Google Colab), the best architecture was refined to achieve peak accuracy. The deep learning architecture was implemented in three phases. First, nine common deep learning algorithms were selected and trained on the dataset to identify the best architecture; in this step, the models were trained first on the 16 food classes and then on the 32 classes. Information on the deep learning architectures is shown in Table 1. Next, fine-tuning was performed, which reduces computing time, since a model that is not fine-tuned does not show reliable performance on complex, large, and novel datasets. Finally, the hyperparameters of the best architecture were tuned to reduce the training loss and to improve model accuracy, performance, and response time.
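As one possible shape for the fine-tuning phase, the sketch below freezes a pretrained backbone, trains a new 32-class head, and then unfreezes the backbone at a lower learning rate; MobileNetV2 is used purely as a stand-in, since the architectures actually compared are those listed in Table 1.

```python
# Sketch: two-stage fine-tuning of a pretrained backbone (illustrative only).
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights='imagenet')
base.trainable = False                       # stage 1: train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(32, activation='softmax'),  # 32 food classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='categorical_crossentropy', metrics=['accuracy'])

# Stage 2: unfreeze the backbone and retrain at a much lower learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
```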

Figure 7

Source: Created by the author.

The route of preprocessing and recognition.

Table 1. Information on six general deep learning architectures.

Ethics approval

All methods in this study were conducted in accordance with the relevant ethical guidelines and regulatory frameworks to ensure the integrity and ethical soundness of the research. The experimental protocol was reviewed and approved by the Ethics Committee of the University of Tehran (approval number: 124/178439) and adhered to the principles outlined in the Declaration of Helsinki and institutional ethical policies.

Participant selection and consent

Participants were selected from a diverse range of ages and genders (16-60 years), considering the variability of food consumption behavior in front of the camera. This age range allowed a comprehensive analysis of different consumption rates, angles, and interactions with different food textures, ensuring a more robust dataset for model training. Before participation, informed consent was obtained from all individuals, and consent was secured from legal guardians where participants were minors. Participants were fully informed about the study objectives, procedures, and their right to withdraw at any stage without consequences.

Data privacy and confidentiality

To uphold strict data privacy and confidentiality standards, all collected data were fully anonymized and no personally identifiable information was stored. Video recordings and images were securely encrypted and stored on password-protected institutional servers, with access restricted to authorized research personnel only. Data collection and storage complied with General Data Protection Regulation (GDPR) standards and institutional ethical policies. Furthermore, upon completion of the study, all personal data were retained only in a manner that ensures privacy protection while maintaining the integrity of the study.


