Fulcrum optimizes concurrent DNN training and inference on edge accelerators, maximizing throughput across thousands of power modes

Growing concerns about data privacy, coupled with rising demand for on-device artificial intelligence, are driving the need for efficient deep learning on edge devices. Prashanthi SK, Saisamarth Taluri, and Pranav Gupta, together with colleagues, address this challenge with Fulcrum, a new system for optimizing concurrent deep learning training and inference on edge accelerators. Their research recognizes that existing hardware often lacks the flexibility to support both tasks efficiently, so resources and power consumption must be managed carefully. Fulcrum interleaves training and inference and dynamically adjusts power modes and batch sizes to maximize training throughput while staying within power and latency limits, and it does so with minimal costly, time-consuming profiling. This advance could help edge devices serve a wide range of applications, from self-driving cars to personalized healthcare.

Deep Learning Resource Management and Power Optimization

This study focuses on efficient management of hardware resources for deep learning tasks, covering both training and inference. The overall goal is to find the configuration of CPU core count, CPU and GPU frequencies, and memory frequency that meets performance requirements while keeping power consumption within limits. This is a classic optimization challenge, especially in dynamic environments where workloads and resource availability change constantly. For training workloads, the researchers developed a gradient descent-based multidimensional search (GMD) algorithm. GMD explores possible configurations by repeatedly adjusting resource settings, guided by the trade-off between performance and power.

Starting from an initial configuration, GMD moves in the direction that improves performance while pruning the search space to focus on the most promising candidates. For inference workloads, the team proposed Active Learning-based Sampling (ALS). This method uses machine learning to predict performance and power consumption for different configurations, and it selects for profiling the configurations most likely to improve the accuracy of the predictive model. A neural network is trained to predict performance and power, and the algorithm repeatedly samples configurations, profiles them, and updates the network. This allows efficient exploration of the configuration space and accurate prediction of good settings.
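The sample, profile, update loop can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the authors' implementation: the neural network is swapped for a 1-nearest-neighbour lookup, the selection rule is plain farthest-point sampling, and `profile`, `predict`, `active_sample`, and the frequency values are all hypothetical.

```python
import math
import random

def profile(mode):
    """Synthetic stand-in for profiling a mode; returns (time_ms, power_w)."""
    gpu_ghz, mem_ghz = mode
    time_ms = 400.0 / gpu_ghz + 40.0 / mem_ghz
    power_w = 2.0 + 6.0 * gpu_ghz + 2.0 * mem_ghz
    return time_ms, power_w

def predict(mode, profiled):
    """1-nearest-neighbour surrogate (assumption: replaces the neural network)."""
    nearest = min(profiled, key=lambda m: math.dist(m, mode))
    return profiled[nearest]

def active_sample(candidates, profile_fn, n_init, n_total, seed=0):
    """Profile a random seed set, then keep adding the least-covered mode."""
    rng = random.Random(seed)
    profiled = {m: profile_fn(m) for m in rng.sample(candidates, n_init)}
    n_total = min(n_total, len(candidates))
    while len(profiled) < n_total:
        unprofiled = [m for m in candidates if m not in profiled]
        # Pick the mode farthest from everything profiled so far: the
        # surrogate is least trustworthy there.
        pick = max(unprofiled,
                   key=lambda m: min(math.dist(m, p) for p in profiled))
        # Storing the measurement is the "update" step for this surrogate.
        profiled[pick] = profile_fn(pick)
    return profiled

# Hypothetical (GPU GHz, memory GHz) candidates.
CANDIDATES = [(g, m) for g in (0.3, 0.6, 0.9, 1.3) for m in (0.7, 2.1, 3.2)]
profiled = active_sample(CANDIDATES, profile, n_init=3, n_total=8)
estimate = predict((0.9, 2.1), profiled)  # (time_ms, power_w) estimate
```

A real system would replace `predict` with a trained regressor and could choose samples by predicted uncertainty rather than distance in the knob space.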

Intelligent time-slicing for edge neural networks

This research addresses the challenge of running deep neural network training and inference concurrently on edge devices such as the Nvidia Jetson platform. These devices often lack native support for concurrent GPU use and expose a large, complex set of power modes. To optimize performance under power and latency constraints, the researchers developed an intelligent time-slicing approach and formulated an optimization problem that interleaves training and inference mini-batches while maximizing training throughput. A key objective of the work is to reach a good configuration with as little costly profiling as possible.
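A little arithmetic shows how interleaving trades inference latency against training throughput. The sketch below is not the paper's formulation. It assumes a simplified model in which requests arrive at a constant rate, one inference mini-batch runs each time it fills up, and training mini-batches use whatever time is left in that window. All function and parameter names are hypothetical.

```python
import math

def training_batches_per_window(arrival_rate, infer_batch_size,
                                infer_time_s, train_time_s):
    """Training mini-batches that fit between two inference mini-batches.

    Assumed model: an inference mini-batch of `infer_batch_size` requests
    fills every infer_batch_size / arrival_rate seconds.
    Returns None when inference alone cannot keep up with arrivals.
    """
    window_s = infer_batch_size / arrival_rate
    slack_s = window_s - infer_time_s
    if slack_s < 0:
        return None
    return math.floor(slack_s / train_time_s)

def worst_case_latency_s(arrival_rate, infer_batch_size, infer_time_s):
    """Latency of the first request in a batch: wait for the batch to fill,
    then wait for the batch to execute (same simplified model)."""
    return (infer_batch_size - 1) / arrival_rate + infer_time_s

# Example: 40 requests/s, batches of 16, 100 ms inference, 120 ms training.
n_train = training_batches_per_window(40, 16, 0.1, 0.12)
latency = worst_case_latency_s(40, 16, 0.1)
```

In this model a larger inference mini-batch leaves more room for training but makes the earliest request in each batch wait longer, which is why batch size and power mode have to be chosen together.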

To solve this optimization problem, the team proposed two strategies: Active Learning-based Sampling (ALS) and Gradient Descent-based Multidimensional Search (GMD). GMD explores the solution space quickly, profiling 10-15 power modes and reaching a solution within 5-10 minutes for each problem configuration. ALS, in contrast, profiles between 50 and 150 power modes, which takes around 1.5 hours, but its results can generalize to other problem configurations with different power budgets, latency limits, and arrival rates. Both strategies are integrated into a scheduler called Fulcrum that runs the workloads.

GMD operates within a 4D solution space defined by CPU, GPU, and memory frequencies and the CPU core count, which together determine the power mode. The method starts by profiling an initial power mode, then uses what it learns to prune the search space and repeatedly select and profile subsequent power modes. The researchers show that GMD uses domain knowledge to guide the search direction and avoid accidentally pruning viable candidates. They also investigated the relationship between GPU frequency and training time and found it to be nonlinear: as frequency increases, training time drops sharply before plateauing, while power consumption rises steadily. These insights informed the design of GMD and allow it to navigate the complex interaction between performance and power consumption efficiently.
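The flavour of a guided search over this 4D space can be sketched as a greedy climb. This is a simplified stand-in, not the GMD algorithm itself. It assumes that raising any knob never lowers power and never raises training time, which lets it permanently prune a knob once a single step up breaks the power budget. The knob values, the `profile` cost model, and `greedy_search` are hypothetical.

```python
from itertools import product

# Hypothetical knob values: CPU cores, CPU GHz, GPU GHz, memory GHz.
KNOBS = [
    [2, 4, 6, 8, 12],
    [0.5, 1.0, 1.5, 2.0],
    [0.3, 0.6, 0.9, 1.3],
    [0.7, 2.1, 3.2],
]

def profile(mode):
    """Synthetic stand-in for profiling one power mode on a device."""
    cores, cpu_f, gpu_f, mem_f = mode
    time_ms = 400.0 / gpu_f + 60.0 / (cores * cpu_f) + 40.0 / mem_f
    power_w = 2.0 + 0.5 * cores * cpu_f + 6.0 * gpu_f + 2.0 * mem_f
    return time_ms, power_w

def greedy_search(knobs, budget_w, profile_fn):
    """Climb from the lowest power mode, one knob step at a time."""
    profiled = {}

    def measure(idx):
        mode = tuple(values[i] for values, i in zip(knobs, idx))
        if mode not in profiled:
            profiled[mode] = profile_fn(mode)
        return mode, profiled[mode]

    idx = [0] * len(knobs)
    mode, (time_ms, power_w) = measure(idx)
    if power_w > budget_w:
        return None, profiled  # even the lowest mode breaks the budget

    blocked = set()  # knobs pruned: one more step would exceed the budget
    while True:
        best = None
        for d in range(len(knobs)):
            if d in blocked or idx[d] + 1 >= len(knobs[d]):
                continue
            cand = idx.copy()
            cand[d] += 1
            _, (c_time, c_power) = measure(cand)
            if c_power > budget_w:
                blocked.add(d)  # valid only because power is monotonic
                continue
            # Time saved per extra watt: a discrete gradient estimate.
            gain = (time_ms - c_time) / max(c_power - power_w, 1e-9)
            if best is None or gain > best[0]:
                best = (gain, cand, c_time, c_power)
        if best is None:
            return mode, profiled
        _, idx, time_ms, power_w = best
        mode, _ = measure(idx)

best_mode, profiled = greedy_search(KNOBS, budget_w=15.0, profile_fn=profile)
total_modes = len(list(product(*KNOBS)))
print(best_mode, f"{len(profiled)} of {total_modes} modes profiled")
```

The climb stops at a mode where no single knob can be raised without exceeding the budget. That is a local optimum under the stated assumptions, not a guarantee of the global best.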

Simultaneous DNN Training and Inference on Edge Devices

This study presents a new approach to running deep neural network training and inference concurrently on edge devices, addressing the limitations of current systems when workloads share GPU resources and must navigate a wide range of power modes. The work focuses on time-slicing these concurrent workloads to maximize performance while adhering to strict power and latency constraints and minimizing the need for extensive profiling. The core of the system is an optimization problem that interleaves training and inference mini-batches and dynamically adjusts the device's power mode and the inference mini-batch size. To solve this complex optimization, the researchers proposed two strategies: GMD, a gradient descent-based search that finds a good power mode with minimal profiling, and ALS, an active learning technique that identifies power modes that can be reused across problem configurations while minimizing profiling costs.

The experiments show that both ALS and GMD outperform simpler and more complex baseline methods, finding a valid solution in 97% of cases and delivering solutions within 0.5% of the optimal throughput. ALS uses active learning to reduce the number of power modes that need profiling. The technique builds an initial neural network model from a small set of randomly selected power modes, then repeatedly selects additional modes according to how much they are expected to diversify the observed power and time values. The resulting system builds a partial Pareto front of the performance-power trade-off directly from the profiled data, which keeps prediction errors out of the optimization step.
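Building a Pareto front from profiled measurements, and then picking a mode under a power budget, takes only a few lines. The sketch below is a generic illustration with made-up `(power_w, time_ms)` samples and hypothetical function names, not code from the system described here.

```python
def pareto_front(samples):
    """Keep the (power_w, time_ms) points that no other point dominates.

    After sorting by power, a point survives only if it is strictly faster
    than every point that uses the same or less power.
    """
    front = []
    best_time = float("inf")
    for power_w, time_ms in sorted(samples):
        if time_ms < best_time:
            front.append((power_w, time_ms))
            best_time = time_ms
    return front

def fastest_within_budget(front, budget_w):
    """Return the fastest point on the front within the budget, or None."""
    feasible = [pt for pt in front if pt[0] <= budget_w]
    return min(feasible, key=lambda pt: pt[1]) if feasible else None

# Made-up profiled samples: (power in watts, time per mini-batch in ms).
samples = [(10, 50), (12, 45), (11, 60), (15, 30), (14, 45), (20, 30)]
front = pareto_front(samples)
choice = fastest_within_budget(front, budget_w=13)
```

Because every point on the front was actually measured, the chosen mode's power and time are observed values rather than model predictions.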

Jetson's Power and Task Optimization

In summary, the study presents a new approach to managing concurrent deep neural network training and inference on edge devices, particularly the Nvidia Jetson platform. Recognizing the limitations of existing systems when workloads share GPU resources and must navigate a vast number of power modes, the team developed an intelligent time-slicing method that optimizes performance while complying with strict power and latency constraints. At the core of the work is an optimization problem that interleaves training and inference mini-batches and dynamically adjusts both the device's power mode and the inference mini-batch size. To solve it, the researchers designed two optimization strategies, GMD and ALS. GMD searches efficiently for a good power mode with minimal profiling, while ALS uses active learning to identify suitable power modes that can be reused, which keeps profiling costs low.
