Deploying high-performance AI models for Windows applications on NVIDIA RTX AI PCs

Today, Microsoft is making Windows ML available to developers. Windows ML enables C#, C++, and Python developers to run AI models optimally across PC hardware, including CPUs, NPUs, and GPUs. On NVIDIA RTX GPUs, it uses the NVIDIA TensorRT for RTX Execution Provider (EP), which leverages the GPU's Tensor Cores and architectural advances such as FP8 and FP4 to provide the fastest AI inference performance on Windows-based RTX AI PCs.

“Windows ML unlocks full TensorRT acceleration for GeForce RTX and RTX Pro GPUs and delivers excellent AI performance in Windows 11.” “We generally look forward to being able to build and deploy powerful AI experiences at scale.”

Overview of Windows ML and TensorRT for RTX EP

Video 1. Deploying high-performance AI models in Windows applications on NVIDIA RTX AI PCs

Windows ML is built on top of the ONNX Runtime APIs for inferencing. It extends the ONNX Runtime APIs to handle dynamic initialization and dependency management of execution providers across the CPU, NPU, and GPU hardware on the PC. In addition, Windows ML automatically downloads the required execution provider on demand, so app developers do not need to manage dependencies and packages across multiple hardware vendors.

Figure 1. Windows ML stack diagram, showing the layers from the application down to the execution provider.

The NVIDIA TensorRT for RTX Execution Provider (EP) offers several advantages to Windows ML developers using ONNX Runtime:

  • Runs ONNX models with low-latency inference and 50% faster throughput compared to previous DirectML implementations on NVIDIA RTX GPUs, as shown in Figure 2.
  • Integrates directly with Windows ML through its flexible EP architecture and its integration with ONNX Runtime (ORT).
  • Just-in-time compilation for streamlined deployment on end-user devices. Learn more about the compilation process within TensorRT for RTX. ONNX Runtime supports this compilation process through EP context models.
  • Leverages architectural advances such as FP8 and FP4 on the Tensor Cores.
  • Lightweight package of just under 200 MB.
  • Support for a variety of model architectures, from LLMs (with the ONNX Runtime GenAI SDK extension) to diffusion models, CNNs, and more.

Learn more about TensorRT for RTX.

Figure 2. Generation throughput speedup for several models on Windows ML compared to DirectML. Data measured on an NVIDIA RTX 5090 GPU.

Selecting an execution provider

The 1.23.0 release of ONNX Runtime, included with Windows ML, provides vendor-independent and execution-provider-independent APIs for device selection. This dramatically reduces the amount of application logic required to take advantage of the optimal execution provider on each hardware vendor's platform. The code excerpts below show how to do this effectively and get maximum performance on NVIDIA GPUs.

// C++
#include <onnxruntime_cxx_api.h>
#include <vector>

// Register desired execution provider libraries of various vendors
auto env = Ort::Env(ORT_LOGGING_LEVEL_WARNING);
env.RegisterExecutionProviderLibrary("nv_tensorrt_rtx", L"onnxruntime_providers_nv_tensorrt_rtx.dll");

// Option 1: Rely on ONNX Runtime Execution policy
Ort::SessionOptions sessions_options;
sessions_options.SetEpSelectionPolicy(OrtExecutionProviderDevicePolicy_PREFER_GPU);

// Option 2: Iterate over EpDevices to perform manual device selection
// (select_ep_devices is not defined in this excerpt; it stands for application-specific selection logic)
std::vector<Ort::ConstEpDevice> ep_devices = env.GetEpDevices();
std::vector<Ort::ConstEpDevice> selected_devices = select_ep_devices(ep_devices);

Ort::SessionOptions session_options;
Ort::KeyValuePairs ep_options;
session_options.AppendExecutionProvider_V2(env, selected_devices, ep_options);

# Python
import onnxruntime as ort

# Register desired execution provider libraries of various vendors
ort.register_execution_provider_library("NvTensorRTRTXExecutionProvider", "onnxruntime_providers_nv_tensorrt_rtx.dll")

# Option 1: Rely on ONNX Runtime Execution policy
session_options = ort.SessionOptions()
session_options.set_provider_selection_policy(ort.OrtExecutionProviderDevicePolicy.PREFER_GPU)

# Option 2: Iterate over EpDevices to perform manual device selection
# (select_ep_devices is not defined in this excerpt; it stands for application-specific selection logic)
ep_devices = ort.get_ep_devices()
ep_device = select_ep_devices(ep_devices)

provider_options = {}
session_options.add_provider_for_devices([ep_device], provider_options)
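
The select_ep_devices helper used in Option 2 is not defined in the excerpt. One possible shape for it, as a minimal sketch: pick the first device exposed by the TensorRT for RTX EP and return None when it is absent. The sketch assumes each entry returned by the device enumeration exposes an ep_name attribute, and it uses stand-in objects so that it runs without ONNX Runtime or a GPU. The real selection logic in an application may well consider vendor, device type, or other metadata.

```python
from types import SimpleNamespace

# EP name as used in the Python registration call in the excerpt above.
TRT_RTX_EP_NAME = "NvTensorRTRTXExecutionProvider"


def select_ep_devices(ep_devices, preferred_ep=TRT_RTX_EP_NAME):
    """Return the first device exposed by `preferred_ep`, or None if absent.

    Hypothetical helper: assumes every entry has an `ep_name` attribute.
    """
    for ep_device in ep_devices:
        if ep_device.ep_name == preferred_ep:
            return ep_device
    return None


# Stand-in objects (not real ONNX Runtime devices) to exercise the helper.
fake_devices = [
    SimpleNamespace(ep_name="CPUExecutionProvider"),
    SimpleNamespace(ep_name=TRT_RTX_EP_NAME),
]
chosen = select_ep_devices(fake_devices)
```

Returning None instead of raising lets the caller fall back to a selection policy such as PREFER_GPU when the preferred EP is not registered on the machine.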

Precompiled runtimes offering quick load times

Model runtimes can now be precompiled using EP context ONNX files within ONNX Runtime. Each execution provider can use this to optimize entire subgraphs of an ONNX model and provide an EP-specific implementation. The result can be serialized to disk to enable quick load times with Windows ML, which is often faster than the traditional operator-based approach used by DirectML.
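
As a rough illustration of how EP context generation is typically switched on, the sketch below collects the ONNX Runtime session configuration keys for EP context (ep.context_enable, ep.context_file_path, ep.context_embed_mode) and applies them through add_session_config_entry. The output file name and the apply_ep_context_config helper are hypothetical, and a small stand-in class replaces onnxruntime.SessionOptions so the snippet runs on its own. Treat it as a sketch under these assumptions, not as the exact flow Windows ML uses.

```python
# Session config entries for EP context generation (values are strings).
EP_CONTEXT_CONFIG = {
    "ep.context_enable": "1",                  # emit an EP context model
    "ep.context_file_path": "model_ctx.onnx",  # hypothetical output path
    "ep.context_embed_mode": "0",              # keep EP binary data in a separate file
}


def apply_ep_context_config(session_options, config):
    """Copy every key/value pair onto a SessionOptions-like object."""
    for key, value in config.items():
        session_options.add_session_config_entry(key, value)
    return session_options


class FakeSessionOptions:
    """Stand-in for onnxruntime.SessionOptions, used only for illustration."""

    def __init__(self):
        self.entries = {}

    def add_session_config_entry(self, key, value):
        self.entries[key] = value


options = apply_ep_context_config(FakeSessionOptions(), EP_CONTEXT_CONFIG)
```

In a real application the same helper would receive an onnxruntime.SessionOptions instance, and the first session created with it would write the context model that later sessions load directly.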

As Figure 3 shows, the TensorRT for RTX EP takes time to compile, but once the optimizations are serialized, loading the model and running inference are faster. In addition, the runtime cache feature of the TensorRT for RTX EP ensures that the kernels generated during the compilation phase are serialized and stored in a directory, so they do not need to be recompiled for subsequent inference runs.

Figure 3. Load times for the DeepSeek-R1-Distill-Qwen-7B model with the ONNX model only, with EP context files, and with both EP context files and the runtime cache. Lower is better.

Minimal data transfer overhead with the ONNX Runtime Device API and Windows ML

The new ONNX Runtime Device API, also available in Windows ML, enumerates the available devices for each execution provider. Using this new concept, developers can allocate device-specific tensors without specifying additional EP-dependent types.

By leveraging CopyTensors and IOBinding, this API allows developers to perform EP-independent, GPU-accelerated inference with minimal runtime data transfer overhead.

The Stable Diffusion 3.5 Medium model serves as an example of using the ONNX Runtime Device API. Figure 4 shows the time required for a single iteration of this model's diffusion loop, with and without device IO binding.

Figure 4. Stable Diffusion 3.5 Medium running with and without device IO binding on an AMD Ryzen 7 7800X3D CPU and an RTX 5090 GPU connected via PCIe 5. Lower times are better.

Using Nsight Systems, we visualized the performance overhead caused by repeated copies between host and device when IO binding is not used.

Figure 5. Nsight Systems timeline showing the overhead created by additional synchronous PCI traffic.

Before every inference run, the input tensor is copied to the device. This copy is highlighted in green in our profile, and the device-to-host copy of the output takes about the same time. In addition, ONNX Runtime uses pageable memory by default, for which the device-to-host copy is an implicit synchronization, even though ONNX Runtime uses the cudaMemcpyAsync API.

On the other hand, when the input and output tensors are bound with IO binding, the host-to-device copy of the input happens only once, before the multi-model inference pipeline. The same applies to the device-to-host copy of the output, after which the CPU is synchronized with the GPU again. The asynchronous Nsight trace above shows multiple inference runs in the loop without any copy or synchronization operations in between, which also frees up CPU resources during that time. This results in a one-time device copy time of 4.2 ms and a one-time host copy time of 1.3 ms, for a total copy time of 5.5 ms regardless of the number of iterations in the inference loop. For reference, this approach reduces the copy time of a 30-iteration loop by about 75 times.
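
The 75x figure can be unpacked with a little arithmetic. The sketch below uses only the numbers quoted above. The copy cost without IO binding is not stated in the text, so it is back-computed here from the quoted reduction and should be read as an implied estimate rather than a measurement.

```python
# Figures quoted in the text (milliseconds).
device_copy_ms = 4.2    # one-time copy with IO binding
host_copy_ms = 1.3      # one-time copy with IO binding
iterations = 30
quoted_reduction = 75   # approximate, as stated in the text

# With IO binding the copies happen once, independent of the loop length.
bound_total_ms = device_copy_ms + host_copy_ms

# Implied (not measured) copy cost without IO binding.
unbound_total_ms = bound_total_ms * quoted_reduction
unbound_per_iteration_ms = unbound_total_ms / iterations

print(f"with IO binding:    {bound_total_ms:.1f} ms total")
print(f"without IO binding: ~{unbound_total_ms:.1f} ms total, "
      f"~{unbound_per_iteration_ms:.2f} ms per iteration")
```

Because the bound cost is fixed while the unbound cost grows with every iteration, the gap widens further for longer diffusion loops.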

TensorRT for RTX-specific optimizations

The TensorRT for RTX EP offers custom options for further performance optimization. The most important ones are listed below.

  • CUDA graphs: Enabled by setting enable_cuda_graph, which captures all CUDA kernels launched by TensorRT in a graph and thereby reduces launch overhead on the CPU. This matters when the TensorRT graph launches many small kernels, so that the GPU can execute them faster than the CPU can submit them. This method yields a performance gain of approximately 30% on LLMs and is useful for many model types, including traditional AI models and CNN architectures.
Figure 6. Throughput speedup with CUDA graphs enabled compared to CUDA graphs disabled in the ONNX Runtime API. Data measured on an NVIDIA RTX 5090 GPU with several LLMs.
  • Runtime cache: nv_runtime_cache_path points to a directory where compiled kernels are cached. Combined with EP context nodes, this provides quick load times.
  • Dynamic shapes: Override the known dynamic shape ranges by setting the three options profile_{min|max|opt}_shapes. Alternatively, fix the input shapes of the model by specifying static shapes with AddFreeDimensionOverrideByName. This feature is currently experimental.
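
The options above can be gathered into a single provider-options dictionary before it is handed to the EP. The following is a minimal sketch: the option names are taken as written in the list above, while the build_trt_rtx_options helper, the "1"/"0" flag values, the cache directory, the input name, and the "name:d0xd1" shape encoding are all assumptions made for illustration and should be checked against the EP documentation.

```python
def build_trt_rtx_options(cache_dir, shape_ranges=None, cuda_graph=True):
    """Assemble a provider-options dict from the option names listed above.

    Hypothetical helper. `shape_ranges` maps "min" / "max" / "opt" to a
    dict of {input_name: shape tuple}.
    """
    options = {
        "enable_cuda_graph": "1" if cuda_graph else "0",  # assumed flag values
        "nv_runtime_cache_path": str(cache_dir),
    }
    for kind, shapes in (shape_ranges or {}).items():
        if kind not in ("min", "max", "opt"):
            raise ValueError(f"unknown profile kind: {kind}")
        # Assumed encoding: "name:d0xd1x...", comma-separated per input.
        options[f"profile_{kind}_shapes"] = ",".join(
            f"{name}:{'x'.join(str(d) for d in dims)}"
            for name, dims in shapes.items()
        )
    return options


provider_options = build_trt_rtx_options(
    "trt_rtx_cache",  # hypothetical cache directory
    shape_ranges={
        "min": {"input_ids": (1, 1)},
        "opt": {"input_ids": (1, 128)},
        "max": {"input_ids": (1, 2048)},
    },
)
```

The resulting dictionary has the same role as the empty provider_options passed to add_provider_for_devices in the device selection excerpt earlier in the post.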

Summary

We are excited to work with Microsoft to bring Windows ML and the TensorRT for RTX EP to Windows application developers for maximum performance on NVIDIA RTX GPUs. Top Windows application developers such as Topaz Labs and Wondershare Filmora are currently working on integrating Windows ML and the TensorRT for RTX EP.

Get started with Windows ML, the ONNX Runtime APIs, and the TensorRT for RTX EP using the following resources:

Stay tuned for future improvements, and get up to speed with the new APIs demonstrated by the samples. If you have a feature request, please open an issue on GitHub and let us know.

Acknowledgments

We would like to thank Gaurav Garg, Kumar Anshuman, Umang Bhatt and Vishal Agarawal for their contributions to the blog.


