Fragmented computational graphs currently limit the performance of PyTorch 2 despite recent advances in just-in-time compilation. GraphMend, a new high-level compiler, automatically analyzes and transforms source code to eliminate these graph breaks, which arise from dynamic control flow and unsupported Python features and create costly performance bottlenecks. The team demonstrated that GraphMend removes all fixable graph breaks in several popular models, cutting latency by up to 75% and improving throughput by up to 8% on modern GPUs. This work is a meaningful step toward simplifying the development of high-performance machine learning models in the PyTorch ecosystem, offering both ease of use and speed.
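To make the kind of break concrete: a data-dependent `if` in a model's forward pass forces a JIT compiler to split the graph, while an equivalent branch-free select can stay inside one graph. The sketch below illustrates the rewrite pattern with plain Python floats rather than tensors; the function names are hypothetical, and in PyTorch the select form would typically be `torch.where`.

```python
# Hypothetical illustration of a graph-break-prone pattern and its rewrite.
# In real PyTorch code, branching on a tensor's value forces a graph break;
# the select form keeps the whole computation in one graph.

def gated_scale_branchy(x, threshold=0.0):
    # Data-dependent control flow: the branch taken depends on the input value.
    if x > threshold:
        return x * 2.0
    return x * 0.5

def gated_scale_select(x, threshold=0.0):
    # Branch-free rewrite: compute both sides, select with a 0/1 mask.
    # The PyTorch analogue is torch.where(x > threshold, x * 2.0, x * 0.5).
    mask = float(x > threshold)
    return mask * (x * 2.0) + (1.0 - mask) * (x * 0.5)

# Both forms agree on every input, but only the second is trace-friendly.
for v in (-2.0, 0.0, 3.5):
    assert gated_scale_branchy(v) == gated_scale_select(v)
```

The rewrite trades a branch for computing both sides, which is usually cheap on a GPU and is what makes the function representable as a single uninterrupted graph.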
Automatic CUDA graph optimization for PyTorch models
This study introduces a system that optimizes PyTorch programs by effectively utilizing CUDA graphs. The core idea is to automatically convert PyTorch models to CUDA graphs, reducing CPU launch overhead and improving GPU utilization. PyTorch supports CUDA graphs, but achieving good performance requires careful manual work to build and replay the graphs. The system bridges this gap by automating the process and addressing challenges related to dynamic shapes and control flow. Key contributions include automatic graph transformation, which removes the need for manual graph construction, and techniques for handling dynamic shapes by recompiling graphs as needed.
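The capture-and-replay idea behind CUDA graphs can be sketched in plain Python: record a fixed sequence of kernel launches once, then re-execute the recording without per-launch host dispatch. This is a toy analogue for illustration, not the actual PyTorch `torch.cuda.graph` API; all names here are invented.

```python
class ToyGraph:
    """Toy analogue of CUDA graph capture/replay: record a fixed sequence
    of operations once, then re-execute it with a single replay() call."""

    def __init__(self):
        self._recording = []            # captured (op, args) pairs

    def capture(self, op, *args):
        # During capture, operations are recorded instead of being
        # dispatched one by one from the host on every iteration.
        self._recording.append((op, args))

    def replay(self):
        # Replay runs the whole recorded sequence; on a GPU this would be
        # one graph launch instead of many individual kernel launches.
        return [op(*args) for op, args in self._recording]

g = ToyGraph()
g.capture(lambda a, b: a + b, 2, 3)     # stand-in for a kernel launch
g.capture(lambda a, b: a * b, 4, 5)
results = g.replay()                    # the recording runs as one unit
```

The point of the analogy: once the sequence is frozen, the per-iteration Python/CPU work collapses to a single replay call, which is where the CPU-overhead savings described above come from.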
The system also optimizes control flow within CUDA graphs, reducing the overhead of branching and conditional execution. Experiments show significant performance improvements, with speedups of up to 2x across a variety of models and datasets; the largest gains appear in models with high CPU overhead. The system integrates with the PyTorch Profiler so developers can easily identify and optimize performance bottlenecks.
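The dynamic-shape handling described above, recompiling when an unseen input shape arrives, can be sketched as a shape-keyed cache of compiled graphs. This is a minimal sketch under assumed names (`compile_for_shape` and `make_shape_cached` are hypothetical), using list length as a stand-in for a tensor shape.

```python
def make_shape_cached(compile_for_shape):
    """Wrap a compiler so each distinct input shape is compiled once,
    then served from a cache -- a common way to handle dynamic shapes."""
    cache = {}

    def run(xs):
        shape = len(xs)                 # stand-in for a tensor's shape
        if shape not in cache:          # unseen shape: compile a new graph
            cache[shape] = compile_for_shape(shape)
        return cache[shape](xs)         # cached shapes replay directly

    run.cache = cache                   # exposed so callers can inspect it
    return run

# Hypothetical "compiler": builds a function specialized to one length.
def compile_for_shape(n):
    return lambda xs: sum(xs) / n       # e.g. a mean specialized to size n

mean = make_shape_cached(compile_for_shape)
mean([1.0, 2.0, 3.0])                   # compiles for shape 3
mean([4.0, 5.0, 6.0])                   # cache hit: no recompilation
mean([1.0, 2.0])                        # new shape 2: compiles again
```

Real systems bound the cache or pad shapes into buckets to avoid compiling a new graph for every length, but the cache-then-replay structure is the same.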
GraphMend compiles PyTorch programs without fragmentation
Researchers have developed GraphMend, a high-level compiler that eliminates fragmentation in PyTorch 2 programs, dramatically improving both performance and ease of use. Existing just-in-time compilation pipelines often encounter FX graph breaks caused by dynamic control flow and unsupported Python constructs, forcing inefficient switches between eager and graph modes. GraphMend addresses this limitation by analyzing and transforming the source code ahead of time, enabling larger, uninterrupted FX graphs to be compiled without manual code adjustments. The system is built on top of the Jac compilation framework and implements two key code transformations, specifically targeting dynamic control flow and Python I/O operations.
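One way a compiler can neutralize Python I/O inside a hot region is to record the side effect during the compiled region and flush it afterward. The sketch below illustrates that deferral idea with plain Python; it is an illustration of the general technique, not GraphMend's actual transformation, and the names are invented.

```python
def forward_original(x):
    y = x * 2.0
    print(f"y={y}")              # Python side effect: forces a graph break here
    return y + 1.0

def forward_rewritten(x, pending):
    # Rewritten form: the side effect is recorded rather than executed,
    # so the whole function can stay inside a single compiled region.
    y = x * 2.0
    pending.append(f"y={y}")
    return y + 1.0

pending = []
out = forward_rewritten(3.0, pending)
for msg in pending:              # flushed after the graph region completes
    print(msg)
```

The numerical result is unchanged; only the timing of the print moves, from mid-graph (where it splits the computation) to after the region (where it costs nothing).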
Experiments across eight Hugging Face models demonstrate GraphMend's effectiveness: it completely removes all fixable graph breaks in six models and reduces the break count in another. This transformation yields significant performance improvements, cutting cold-start forward latency by up to 75% and steady-state latency by up to 25% on NVIDIA RTX 3090 and A40 GPUs. The team also measured end-to-end throughput gains of up to 8%, indicating improved data processing efficiency.
GraphMend eliminates fragmentation in PyTorch 2 compilation
Researchers have developed GraphMend, a high-level compiler that eliminates fragmentation in PyTorch 2 programs, dramatically improving performance and ease of use. This work addresses an important limitation of the PyTorch 2 compilation pipeline: dynamic control flow and unsupported Python constructs split a model into multiple FX graphs, forcing frequent switches between CPU and GPU execution. GraphMend analyzes and transforms the source code before it runs, proactively rewriting the code to avoid these breaks. The system works within the Jac compilation framework and uses abstract syntax trees and control flow graphs to identify and eliminate patterns that cause fragmentation. Experiments show that GraphMend removes almost all fixable graph breaks, enabling larger, more efficient computational graphs. This yields significant performance improvements, with latency reductions of up to 75% and throughput gains of up to 8% on modern graphics processing units.
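The AST-based analysis mentioned above can be illustrated with Python's standard `ast` module: walk a function's syntax tree and flag constructs that commonly cause graph breaks. This is a simplified sketch (a real detector would also need dataflow and control-flow-graph information to tell data-dependent branches from static ones); the function and variable names are assumptions.

```python
import ast

# A model-like function given as source text, so it can be parsed directly.
SAMPLE_FORWARD = """
def forward(x):
    if x > 0:                # data-dependent branch
        x = x * 2.0
    print("debug:", x)       # Python I/O side effect
    return x
"""

def find_break_candidates(source):
    """Flag AST nodes that commonly cause FX graph breaks:
    `if` statements (possible data-dependent control flow) and
    print calls (Python I/O side effects)."""
    candidates = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.If):
            candidates.append(("control_flow", node.lineno))
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id == "print"):
            candidates.append(("python_io", node.lineno))
    return candidates

kinds = [kind for kind, _ in find_break_candidates(SAMPLE_FORWARD)]
# kinds contains one "control_flow" hit and one "python_io" hit
```

Once such sites are located, a source-to-source rewriter can target each pattern with a transformation (e.g. a branch-free select, or deferred I/O) before the JIT ever sees the code.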
GraphMend eliminates compilation fragmentation for speed
Researchers have developed GraphMend, a technique that improves the performance of PyTorch 2 programs by eliminating fragmentation during compilation. Current systems often suffer graph breaks in the compilation process caused by dynamic control flow and ordinary Python input/output operations, forcing slow switches between eager and compiled execution. GraphMend addresses this limitation by analyzing and transforming the source code before it runs, rewriting these problematic constructs into forms compatible with uninterrupted graph compilation. This study demonstrates the effectiveness of high-level source code transformations that complement existing just-in-time compilation techniques, providing a path to both better usability and higher performance in deep learning frameworks.