Today, Microsoft is making Windows ML available to developers. Windows ML enables C#, C++, and Python developers to optimally run AI models across PC hardware, including CPUs, NPUs, and GPUs. On NVIDIA RTX GPUs, it uses the NVIDIA TensorRT for RTX Execution Provider (EP), which leverages the GPU's Tensor Cores and architecture advances such as FP8 and FP4 to provide the fastest AI inference performance on Windows-based RTX AI PCs.
“Windows ML unlocks full TensorRT acceleration for GeForce RTX and RTX PRO GPUs, delivering excellent AI performance on Windows 11.” “We look forward to being able to build and deploy powerful AI experiences at scale.”
Overview of Windows ML and TensorRT for RTX EP
Windows ML is built on top of the ONNX Runtime APIs for inference. It extends the ONNX Runtime APIs to handle dynamic initialization and dependency management of execution providers across the CPU, NPU, and GPU hardware on the PC. Additionally, Windows ML automatically downloads the required execution providers on demand, reducing the need for app developers to manage dependencies and packages across multiple hardware vendors.


The NVIDIA TensorRT for RTX Execution Provider (EP) offers several advantages to Windows ML developers using ONNX Runtime:
- Runs ONNX models with low-latency inference and 50% faster throughput compared to prior DirectML implementations on NVIDIA RTX GPUs, as shown in the figure below.
- Integrates directly with Windows ML through its flexible EP architecture and its integration with ONNX Runtime (ORT).
- Just-in-time compilation on end-user devices for streamlined deployment. Learn more about the compilation process within TensorRT for RTX. This compilation process is supported in ONNX Runtime as EP context models.
- Leverages architecture advances such as FP8 and FP4 on the Tensor Cores.
- Lightweight package of just under 200 MB.
- Support for a variety of model architectures, from LLMs (with the ONNX Runtime GenAI SDK extension) to diffusion, CNN, and more.
Learn more about TensorRT for RTX.


Selecting an execution provider
The 1.23.0 release of ONNX Runtime, included with Windows ML, provides vendor-independent and execution-provider-independent APIs for device selection. This dramatically reduces the amount of application logic required to take advantage of the optimal execution provider on each hardware vendor's platform. The code excerpts below, first in C++ and then in Python, show how to do this effectively and obtain maximum performance on NVIDIA GPUs.
// Register desired execution provider libraries of various vendors
auto env = Ort::Env(ORT_LOGGING_LEVEL_WARNING);
env.RegisterExecutionProviderLibrary("nv_tensorrt_rtx", L"onnxruntime_providers_nv_tensorrt_rtx.dll");
// Option 1: Rely on ONNX Runtime Execution policy
Ort::SessionOptions sessions_options;
sessions_options.SetEpSelectionPolicy(OrtExecutionProviderDevicePolicy_PREFER_GPU);
// Option 2: Iterate over EpDevices to perform manual device selection
// (select_ep_devices is an application-defined helper)
std::vector<Ort::ConstEpDevice> ep_devices = env.GetEpDevices();
std::vector<Ort::ConstEpDevice> selected_devices = select_ep_devices(ep_devices);
Ort::SessionOptions session_options;
Ort::KeyValuePairs ep_options;
session_options.AppendExecutionProvider_V2(env, selected_devices, ep_options);
# Register desired execution provider libraries of various vendors
ort.register_execution_provider_library("NvTensorRTRTXExecutionProvider", "onnxruntime_providers_nv_tensorrt_rtx.dll")
# Option 1: Rely on ONNX Runtime Execution policy
session_options = ort.SessionOptions()
session_options.set_provider_selection_policy(ort.OrtExecutionProviderDevicePolicy.PREFER_GPU)
# Option 2: Iterate over EpDevices to perform manual device selection
# (select_ep_devices is an application-defined helper)
ep_devices = ort.get_ep_devices()
ep_device = select_ep_devices(ep_devices)
provider_options = {}
session_options.add_provider_for_devices([ep_device], provider_options)
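The select_ep_devices helper used in the excerpts above is left to the application. The following is a minimal Python sketch of one possible selection policy: prefer the device offered by the TensorRT for RTX EP and otherwise fall back to the first enumerated device. The function body, the fallback rule, and the reliance on an ep_name attribute on each enumerated device are assumptions made for illustration; stand-in objects replace real devices so the sketch runs without a GPU.

```python
from types import SimpleNamespace

# EP name assumed to match the string used in the Python excerpt above.
PREFERRED_EP = "NvTensorRTRTXExecutionProvider"


def select_ep_devices(ep_devices, preferred_ep=PREFERRED_EP):
    """Return the first device offered by preferred_ep, else the first device, else None.

    Assumes every entry exposes an `ep_name` attribute (an assumption of this sketch).
    """
    for device in ep_devices:
        if getattr(device, "ep_name", None) == preferred_ep:
            return device
    return ep_devices[0] if ep_devices else None


# Hypothetical stand-ins for the objects returned by device enumeration.
fake_devices = [
    SimpleNamespace(ep_name="CPUExecutionProvider"),
    SimpleNamespace(ep_name="NvTensorRTRTXExecutionProvider"),
]
chosen = select_ep_devices(fake_devices)
print(chosen.ep_name)
```

Returning a single device keeps the helper compatible with the Python excerpt, which wraps the result in a one-element list before handing it to the session options.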
Precompiled runtimes offering quick load times
Model runtimes can now be precompiled using EP context ONNX files within ONNX Runtime. Each execution provider can use this to optimize entire subgraphs of an ONNX model and provide an EP-specific implementation. The result can be serialized to disk to enable quick load times with Windows ML, which is often faster than the traditional operator-based approach used by DirectML.
The chart below shows that TensorRT for RTX EP takes time to compile, but loads and runs inference on the model faster because the optimizations are already serialized. Additionally, the runtime cache feature within TensorRT for RTX EP ensures that the kernels generated during the compile phase are serialized and stored in a directory, so they do not need to be recompiled for subsequent inferences.
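As a rough illustration of how an EP context model can be requested, the sketch below builds the ONNX Runtime session configuration entries that enable EP context generation. The keys ep.context_enable, ep.context_file_path, and ep.context_embed_mode are ONNX Runtime session config keys; the helper name, the output file name, and the choice of this mechanism over a dedicated compile API are assumptions, since the post does not say which path it uses.

```python
from pathlib import Path


def ep_context_config(context_model_path, embed_binary=False):
    """Build session config entries asking ONNX Runtime to emit an EP context model.

    Hypothetical helper: it only assembles string key/value pairs.
    """
    return {
        "ep.context_enable": "1",
        "ep.context_file_path": str(Path(context_model_path)),
        # "1" embeds the compiled blob in the ONNX file, "0" stores it alongside.
        "ep.context_embed_mode": "1" if embed_binary else "0",
    }


config = ep_context_config("model_ctx.onnx")  # file name is illustrative only
for key, value in config.items():
    print(f"{key}={value}")

# With ONNX Runtime installed, the entries would be applied roughly like this:
#   session_options = ort.SessionOptions()
#   for key, value in config.items():
#       session_options.add_session_config_entry(key, value)
```

On later runs, the application would load the generated context model instead of the original ONNX file so that the compile step is skipped.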


Minimal data transfer overhead with the ONNX Runtime Device API and Windows ML
The new ONNX Runtime Device API, also available in Windows ML, enumerates the available devices for each execution provider. Using this new concept, developers can allocate device-specific tensors without requiring additional EP-dependent type specifications.
By leveraging CopyTensors and IOBinding, this API allows developers to perform EP-independent, GPU-accelerated inference with minimal runtime data transfer overhead.
Figure 5 shows a Stable Diffusion 3.5 model that uses the ONNX Runtime Device API. Figure 4 below shows the time required for a single iteration of the same model's diffusion loop, with and without device IO binding.


Using Nsight Systems, we visualized the performance overhead caused by repeated copies between host and device when IO binding is not used.


Before every inference run, a copy of the input tensor is performed, highlighted in green in our profile, and the device-to-host copy of the output takes about the same time. Additionally, ONNX Runtime uses pageable memory by default, for which the device-to-host copy is an implicit synchronization even though ONNX Runtime uses the cudaMemcpyAsync API.
On the other hand, when the input and output tensors are bound with IO binding, the host-to-device copy of the input happens only once, before the multi-model inference pipeline. The same applies to the device-to-host copy of the output, after which the CPU is synchronized with the GPU again. The asynchronous Nsight trace above shows multiple inference runs in the loop without any copy or synchronization operations in between, which frees up CPU resources in the meantime. This results in a copy time of 4.2 ms plus a one-time host copy time of 1.3 ms, for a total copy time of 5.5 ms regardless of the number of iterations in the inference loop. For reference, this approach reduces the copy time of a 30-iteration loop by a factor of about 75.
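The copy-time figures above can be related with a short back-of-the-envelope calculation. The sketch below uses only the numbers quoted in the text (4.2 ms, 1.3 ms, 30 iterations, roughly 75x); the unbound totals it prints are implied by those numbers, not independently measured values.

```python
# Figures quoted in the text.
copy_ms = 4.2            # copy time with IO binding
host_copy_ms = 1.3       # one-time host copy with IO binding
iterations = 30          # length of the inference loop used for reference
reported_reduction = 75  # approximate reduction factor quoted in the text

# With IO binding the copy cost is paid once, independent of the loop length.
bound_total_ms = copy_ms + host_copy_ms

# Implied (not measured) copy cost without IO binding, derived from the ratio.
implied_unbound_total_ms = bound_total_ms * reported_reduction
implied_unbound_per_iter_ms = implied_unbound_total_ms / iterations

print(f"with IO binding: {bound_total_ms:.1f} ms total")
print(
    f"implied without IO binding: {implied_unbound_total_ms:.1f} ms total, "
    f"{implied_unbound_per_iter_ms:.2f} ms per iteration"
)
```

Because the bound cost is constant while the unbound cost grows with every iteration, the benefit of IO binding increases with the length of the loop.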
TensorRT for RTX-specific optimizations
The TensorRT for RTX EP offers custom options to further optimize performance. The most important optimizations are listed below.
- CUDA graphs: Enabled by setting enable_cuda_graph, which captures all CUDA kernels launched by TensorRT inside a graph and thereby reduces launch overhead on the CPU. This is important when the TensorRT graph launches many small kernels that the GPU can execute faster than the CPU can submit them. This method yields a performance gain of approximately 30% on LLMs and is useful for many model types, including traditional AI models and CNN architectures.


- Runtime cache: nv_runtime_cache_path points to a directory where compiled kernels can be cached, providing quick load times when combined with EP context nodes.
- Dynamic shapes: Override the known dynamic shape ranges by setting the three options profile_{min|max|opt}_shapes. Alternatively, fix the input shapes of the model by specifying static shapes with AddFreeDimensionOverrideByName. Currently, this feature is in experimental mode.
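The options listed above are passed to the EP as string key-value pairs. The sketch below assembles such a dictionary in Python. The helper name, the "1" encoding for the boolean flag, the name:dim1xdim2 shape format, and the example input name and cache directory are assumptions for illustration. The key names follow the text above, so check their exact spelling against the TensorRT for RTX EP documentation before relying on them.

```python
def build_rtx_ep_options(enable_cuda_graph=False, runtime_cache_dir=None, shape_profiles=None):
    """Assemble a provider-options dict for the TensorRT for RTX EP (hypothetical helper).

    shape_profiles maps "min" / "max" / "opt" to {input_name: (dim, dim, ...)}.
    """
    options = {}
    if enable_cuda_graph:
        options["enable_cuda_graph"] = "1"  # value encoding is an assumption
    if runtime_cache_dir:
        options["nv_runtime_cache_path"] = str(runtime_cache_dir)
    for kind, shapes in (shape_profiles or {}).items():
        if kind not in ("min", "max", "opt"):
            raise ValueError(f"unknown shape profile kind: {kind}")
        # Assumed encoding: "input_a:1x3x224x224,input_b:1x77"
        options[f"profile_{kind}_shapes"] = ",".join(
            f"{name}:{'x'.join(str(d) for d in dims)}" for name, dims in shapes.items()
        )
    return options


provider_options = build_rtx_ep_options(
    enable_cuda_graph=True,
    runtime_cache_dir="rtx_cache",  # illustrative directory name
    shape_profiles={
        "min": {"sample": (1, 4, 64, 64)},
        "opt": {"sample": (1, 4, 128, 128)},
        "max": {"sample": (2, 4, 128, 128)},
    },
)
for key, value in provider_options.items():
    print(f"{key}={value}")
```

The resulting dictionary could take the place of the empty provider_options in the Python device-selection excerpt earlier in the post.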
Summary
We are excited to work with Microsoft to bring Windows ML and the TensorRT for RTX EP to Windows application developers for maximum performance on NVIDIA RTX GPUs. Top Windows application developers such as Topaz Labs and Wondershare Filmora are currently working on integrating Windows ML and the TensorRT for RTX EP.
Get started with Windows ML, the ONNX Runtime APIs, and the TensorRT for RTX EP using the following resources:
Stay tuned for future improvements, and get up to speed with the new APIs demonstrated by the samples. If you have a feature request, please open an issue on GitHub and let us know.
Acknowledgments
We would like to thank Gaurav Garg, Kumar Anshuman, Umang Bhatt and Vishal Agarawal for their contributions to the blog.
