Enable high-performance, low-power inference in your edge application

Why Edge AI?

There has been a paradigm shift in the AI market. Previously, AI processing was done primarily in the cloud: the endpoint device collected data from its sensors and sent it to the cloud for inference processing and decision making, and the results were sent back to the endpoint device. This approach required large bandwidth to move data to the cloud. International Data Corporation (IDC) estimates that 79.4ZB of data will be sent from IoT devices to the cloud in 2025.

There is an increasing movement toward AI inference on edge devices, which allows faster real-time response and increased data privacy and security while avoiding the latency and costs associated with cloud connectivity. It also reduces power consumption, making it well suited to battery-powered IoT applications. Edge AI therefore offers attractive benefits for new applications: autonomy, reduced latency, low power, lower bandwidth to the cloud, and lower security costs.

Figure 1. Moving from cloud inference to edge inference

MCUs are increasingly being used for edge AI. They offer a fully integrated solution that simplifies product design and reduces development and BOM costs, making them ideal for low-power, cost-sensitive applications that need good real-time response. High-performance MCUs with integrated hardware accelerators are now available that can handle the dot products and linear algebra operations required for neural network processing, such as fast, parallel matrix multiplication, convolution, and transposition. Neural network models optimized for small, resource-constrained MCUs, along with software libraries and ecosystem solutions, are also available.

Build power-efficient AI applications using the RA8P1 AI-accelerated MCU

The RA8P1 MCUs are Renesas' first AI-accelerated single- and dual-core MCUs, offering increased AI/ML, DSP, and scalar performance with low power consumption. They combine a high-performance Arm® Cortex®-M85 CPU core with a Cortex-M33 core and the Arm Ethos™-U55 neural processing unit (NPU), making them ideal for edge AI and IoT applications. Built on TSMC's advanced 22nm Ultra-Low Leakage (22ULL) process, the RA8P1 MCU delivers over 7300 CoreMarks of raw performance and 256 GOPS of AI performance while addressing the low power consumption needs of AI applications.

Along with large memories and a rich set of peripherals, these devices enable demanding voice, vision AI, and real-time analytics applications directly on the device itself. The dual-core RA8P1 MCU enables high throughput, efficient task splitting between the two cores, and improved real-time performance. In addition, advanced security, immutable memory, and TrustZone® are built in to enable truly secure AI applications.

Embedded in the RA8P1, the Ethos-U55 NPU is a dedicated processor optimized for the core operations of neural network models, such as matrix multiplication and convolution, and performs them more efficiently and at lower power than the CPU core. The Ethos-U55 is optimized for the low-precision (8-bit integer) arithmetic used in AI models, reducing complexity, memory usage, and power consumption without reducing inference accuracy.
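As a rough illustration of the 8-bit integer scheme mentioned above, the sketch below shows affine quantization in plain Python. The scale and zero-point values are made-up assumptions for illustration; real conversion toolchains derive them from the model's value ranges during quantization.

```python
# Minimal sketch of INT8 affine quantization (illustrative only).
# scale and zero_point here are assumed values, not toolchain output.

def quantize(values, scale, zero_point):
    """Map floats to int8 via q = round(v / scale) + zero_point, clipped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Recover approximate floats: v ~= (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]

weights = [0.51, -1.27, 0.0, 0.98]   # made-up float weights
scale, zero_point = 0.01, 0          # assumed symmetric quantization
q = quantize(weights, scale, zero_point)          # int8 values the NPU consumes
approx = dequantize(q, scale, zero_point)         # small rounding error vs. originals
```

The 8-bit representation cuts weight storage to a quarter of FP32 and lets the NPU use cheap integer multiply-accumulate units, which is where most of the efficiency gain comes from.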

Renesas has demonstrated this uplift by running inference on an RA8P1 MCU across several common models, showing a significant performance gain with the Ethos-U55 NPU compared to the CPU core.

Figure 2. Significant improvements in AI performance using the Ethos-U55 NPU compared to the CPU core

Models used:

  • Image classification – ResNet8, MobileNet V2, MobileNet V3
  • Keyword spotting – DS-CNN
  • Visual wake word – MobileNet V1
  • Object detection – YOLO-Fastest, YOLOv8n
  • Anomaly detection – AD_Medium

Use the RUHMI framework to enable faster application development

The RA8P1 AI solution features the highly configurable Robust Unified Heterogeneous Model Integration (RUHMI) framework, which provides AI developers with the tools they need for fast and efficient AI development. RUHMI is Renesas' first comprehensive AI framework spanning MCUs and MPUs; integrated into the e² studio IDE, it generates and deploys highly optimized neural network models in a framework-agnostic way. RUHMI enables model optimization, quantization, graph compilation, and conversion to MCU-friendly formats, with native support for commonly used ML frameworks including TensorFlow Lite, PyTorch, and ONNX. Ready-to-use example applications and models optimized for the RA8P1 are also provided.

Figure 3. AI workflow using the Renesas RUHMI framework

Typical AI workflow with the RUHMI framework:

  • Model optimization and compilation (offline) – Pre-trained AI models are imported from commonly used frameworks such as TensorFlow Lite, PyTorch, and ONNX. Using the RUHMI optimization and conversion tools, the model is first quantized and optimized into an INT8 intermediate form. This process includes graph partitioning, operator assignment between the NPU and CPU, and compilation into an MCU-friendly format (typically *.c/*.h files).
  • Data input and preprocessing – Raw input data (images from a camera, audio from a microphone) is captured by the RA8P1 MCU and preprocessed by the high-performance Cortex-M85 core for input to the AI model.
  • Execution on the NPU – The CPU core sends the processed input data and the command stream of the compiled AI model to the Ethos-U55 NPU for execution. The NPU reads the command stream and processes each layer of the neural network using the input data and model weights (usually stored in local memory).
  • Output and post-processing – Once the NPU has processed all layers of the neural network, it returns the inference results to the main CPU, which performs any necessary post-processing and actions.
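The graph-partitioning idea in the first step can be sketched as follows. The set of NPU-supported operators here is a simplified assumption for illustration, not the Ethos-U55's actual operator list; the point is that consecutive supported operators are grouped into NPU segments while unsupported ones fall back to the CPU.

```python
# Illustrative sketch of operator partitioning between NPU and CPU.
# NPU_SUPPORTED is an assumed, simplified operator set.

NPU_SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED", "ADD", "RELU"}

def partition(graph):
    """Split an ordered list of op names into contiguous (target, ops) segments."""
    segments = []
    for op in graph:
        target = "NPU" if op in NPU_SUPPORTED else "CPU"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)       # extend the current segment
        else:
            segments.append((target, [op]))  # start a new segment
    return segments

model_ops = ["CONV_2D", "RELU", "RESIZE_BILINEAR", "FULLY_CONNECTED", "SOFTMAX"]
segments = partition(model_ops)
# NPU runs CONV_2D+RELU and FULLY_CONNECTED; RESIZE_BILINEAR and
# SOFTMAX fall back to the Cortex-M85 in this example.
```

Fewer, longer NPU segments mean fewer CPU/NPU handoffs, which is why the compiler groups adjacent supported operators rather than dispatching them one at a time.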

AI applications enabled by RA8P1

With high inference performance, low power consumption, and real-time processing capabilities, the RA8P1 MCU is ideal for AI applications across a wide range of market segments. Below are some key applications enabled by the RA8P1:

  • Audio AI – Keyword spotting, speech recognition, noise reduction, speaker identification
  • Vision AI – Object detection, image classification, gesture recognition, face recognition, image analysis, driver/vehicle monitoring
  • Real-time analytics – Anomaly detection, vibration analysis, and predictive maintenance
  • Multimodal applications – Smart HMIs with voice and vision capabilities, surveillance cameras that use voice and vision to detect events, and robots with visual and auditory inputs for environmental sensing and interaction

In the next sections, we will look at how the RA8P1 can help simplify AI implementation with two application examples.

Application Example 1: Image Classification on the RA8P1

Figure 4. Image Classification System Block Diagram

The diagram above shows the implementation of an image classification application. The RA8P1 integrates the CPU cores, NPU, memory, and peripherals needed to build this vision AI application on a single chip. The application analyzes input images and assigns them predefined labels or categories. The neural network model is trained on a vast dataset of images (each image labeled with its category) and deployed to the RA8P1 MCU. For inference, a new input image is fed into the model and passed through the layers of the trained network. The output layer provides a probability distribution across all categories, and the category with the highest probability is assigned as the image's label. This output (image label and confidence) can be sent to a display or to the cloud. The implementation uses the Ethos-U55 to improve inference speed 33 times compared to using the CPU core alone.
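The final classification step described above, turning the output layer's raw scores into a probability distribution and picking the top category, can be sketched in plain Python. The labels and logit values below are made up for illustration.

```python
import math

# Sketch of classifier post-processing: softmax over output logits,
# then argmax to pick the label. Labels and logits are hypothetical.

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cat", "dog", "bird"]              # assumed category set
logits = [2.0, 4.5, 0.3]                     # made-up output-layer values
probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
# labels[best] is the assigned label; probs[best] is its confidence.
```

On an int8-quantized model the NPU produces integer outputs that are dequantized first; the softmax/argmax step itself often runs on the CPU core as post-processing, consistent with the workflow described earlier.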

Figure 5. Image classification on the RA8P1 and performance comparison, NPU vs CPU

Image classification can be used in a wide variety of applications.

  • Security – Identifying weapons, recognizing people, detecting anomalies
  • Retail – Creating product catalogs by category and managing inventory
  • Agriculture – Identifying crop diseases and classifying plants
  • Smart city – Identifying traffic signals/signs and pedestrians
  • Smart appliances – Identifying objects in a refrigerator

Application Example 2: Driver Monitoring System on the RA8P1

This application illustrates the Nota AI Driver Monitoring System (DMS), an in-cabin safety solution for enhancing traffic safety throughout vehicle travel. Using the RA8P1, the Nota AI DMS detects driver risks such as an unregistered driver, driver drowsiness, cell phone use, and smoking.

The RA8P1's higher performance increases the inference speed of the four models used in this application (face detection, facial landmarks, eye landmarks, and phone detection) by 4 to 24 times.

DMS finds applications in dashboard cameras, vehicle travel data recorders, and driver surveillance systems.

Figure 6. Driver monitoring system on the RA8P1 and performance comparison, NPU vs CPU

Both of these vision AI applications make the best use of the RA8P1 MCU's resources:

  • Efficient input image acquisition via image sensor
    • The RA8P1 includes a dedicated MIPI CSI-2 interface with an image scaling unit for capturing raw image input data, as well as a 16-bit CEU parallel camera interface.
  • High-performance inference processing using the Ethos-U55 NPU
    • The Ethos-U55 AI accelerator on the RA8P1 MCU offloads the CPU core and handles complex AI models more efficiently and with lower power consumption than the CPU core. It receives processed images from the MIPI CSI-2 or parallel CEU interface.
    • Pre-trained AI models (e.g., image classification models like MobileNet V1) are optimized for the RA8P1 using the RUHMI tools and loaded onto the NPU.
    • The Ethos-U55 NPU performs the actual AI inference at very high speed (up to 256 GOPS) and with high power efficiency.
  • Fast application processing using the Arm Cortex-M85 and Cortex-M33
    • The high-performance 1GHz Cortex-M85 core with the Arm Helium™ vector extension can be used for pre-processing of input image or audio data and post-processing of inference results. Operators not supported by the Ethos-U55 can also run on the Cortex-M85 core in fallback mode, accelerated by the CMSIS-NN library. The core also executes application code.
    • The 250MHz Cortex-M33 core can be used for low-power wake-up and housekeeping tasks.
  • Efficient storage for images, model weights, and activations using on-chip memory and memory interfaces
    • Large on-chip memories (1MB MRAM and 2MB SRAM) are important for storing AI model weights, images, and intermediate activations. The integrated embedded MRAM offers advantages over flash, such as faster writes and higher endurance and retention.
    • The MCU also supports larger models through high-throughput external memory interfaces: OSPI with XIP (execute-in-place) and 32-bit SDRAM.
  • Advanced graphics peripherals for LCD panels
    • The graphics LCD controller (with parallel RGB or MIPI DSI interfaces) and the 2D drawing engine can be used to process and render images and inference results on LCD displays.
  • Flexible connectivity options
    • Several connectivity options are available to send inference results, images, or alerts/notifications to a local device or to the cloud for storage or analysis.
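As a minimal illustration of the kind of pre-processing the CPU core performs before handing data to the NPU, the sketch below shifts raw uint8 camera pixels into the int8 range a quantized model input typically expects. The -128 offset assumes an input zero-point of -128, which is a common convention for uint8-image to int8-tensor conversion, not a documented RA8P1 detail.

```python
# Hedged sketch of image pre-processing for a quantized model input.
# Assumes input zero-point -128 (common, but model-specific in practice).

def preprocess(pixels):
    """Shift uint8 pixel values [0, 255] into the int8 range [-128, 127]."""
    return [p - 128 for p in pixels]

raw = [0, 64, 128, 255]       # made-up pixel values from a camera frame
model_input = preprocess(raw) # int8-range values ready for the quantized model
```

On the actual device this step would run on the Cortex-M85, where Helium vector instructions can process many pixels per cycle; the loop above only shows the arithmetic, not the vectorization.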

Edge AI applications benefit greatly from AI-accelerated MCUs, which enable use cases where real-time performance, low power, and security are critical. Adding NPUs to low-power MCUs has been a transformative change in the AI solution landscape. The new RA8P1 MCU significantly reduces latency, enables data privacy, and minimizes power consumption, making it ideal for battery-powered applications. Development is supported end to end by Renesas' comprehensive RUHMI framework, which allows developers to efficiently optimize and deploy AI models on RA8P1 hardware.

For more information, please visit www.renesas.com/ra8p1


