Reinforcement learning (RL) in robotics is often associated with large GPU clusters, distributed infrastructure, and x86-based development environments. Training humanoid robots in high-fidelity simulation is typically treated as a resource-intensive workflow that belongs in a data center.
What if that workflow could run on a single workstation?
This blog post describes a complete robotics pipeline built with Isaac Sim and Isaac Lab on NVIDIA DGX Spark with the Grace–Blackwell (GB10) superchip: the software stack compiles natively on Arm, simulations run massively in parallel, and a humanoid robot learns to walk over rough terrain.
Build a native robotics stack on Arm
The workflow begins by building Isaac Sim and Isaac Lab directly from source on the Grace CPU (aarch64).
The full stack compiles natively on Arm, with no cross-compilation and no x86 build host. The toolchain includes:
- GCC 11
- CUDA 13
- Git LFS
- Omniverse simulation components
This produces the following:
- Native aarch64 binaries
- Full CUDA acceleration
- Tight integration between the Grace CPU and the Blackwell GPU
- No architecture translation layer
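A quick way to confirm both properties is a short sanity check from inside the Isaac Lab Python environment. This is a minimal sketch, assuming PyTorch is available (it ships with the Isaac Lab stack); it only inspects the architecture and the visible CUDA device:

```python
# Quick sanity check that the stack is truly native aarch64 and CUDA-enabled.
import platform

import torch

print(platform.machine())             # expect "aarch64" on Grace
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # expect the GB10's Blackwell GPU
```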
This is an important result for Arm developers. High-fidelity robot simulation has traditionally been an x86-centric workflow; this build demonstrates that the complete Isaac Sim and Isaac Lab toolchain can run natively on Arm.
More importantly, this native workflow removes the friction of cross-compilation and environment switching. The same Arm-based development model can support every stage of robotic AI development: workstation experimentation, large-scale training, and ultimately edge deployment. This cloud-to-edge consistency is what makes Arm attractive for real-world AI systems.
DGX Spark combines:
- The Grace CPU for orchestration, compilation, and data preparation
- The Blackwell GPU for physics simulation and neural-network acceleration
- NVLink-C2C unified memory, which reduces the overhead of data movement between CPU and GPU
These components work together to form a tightly integrated robot development platform.
From simulation to scalable RL training
After validating Isaac Sim, the next step is reinforcement learning using Isaac Lab.
The task selected for this experiment is Isaac-Velocity-Rough-H1-v0.
In this environment, a Unitree H1 humanoid robot with 19 actuated joints learns to:
- Follow a commanded forward velocity
- Keep its balance
- Walk across rough, procedurally generated terrain
The training pipeline uses RSL-RL, a lightweight reinforcement learning library designed for robot locomotion workloads, with Proximal Policy Optimization (PPO) as the training algorithm.
To maximize GPU utilization and emphasize platform parallelism, training is launched in headless mode with num_envs=512:
```bash
export LD_PRELOAD="$LD_PRELOAD:/lib/aarch64-linux-gnu/libgomp.so.1"

./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py \
  --task=Isaac-Velocity-Rough-H1-v0 \
  --headless \
  --num_envs 512
```

With 512 parallel environments, the system sustains approximately 65,000 simulation steps per second on a single desktop-class machine.
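As a back-of-the-envelope check, that figure is consistent with the environment count and typical rollout timing. All three inputs below are illustrative assumptions (24 is a common RSL-RL rollout horizon), not values from a specific training log:

```python
# Back-of-the-envelope throughput check with assumed, illustrative inputs.
num_envs = 512               # parallel environments, as launched above
steps_per_env_per_iter = 24  # rollout length per environment per iteration
iter_wall_time_s = 0.19      # hypothetical wall-clock time per iteration

env_steps_per_sec = num_envs * steps_per_env_per_iter / iter_wall_time_s
print(f"~{env_steps_per_sec:,.0f} env steps/s")  # ~64,700 with these inputs
```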
This level of performance is achieved through several architectural advantages.
- Physics simulation runs directly on the Blackwell GPU, enabling massive parallelization of joint dynamics across hundreds of environments.
- PPO policy updates operate on batched tensors, so the reinforcement learning optimization scales efficiently with the parallel simulations (a minimal sketch follows this list).
- The Grace CPU handles orchestration and system control, keeping the simulation and training pipeline fully utilized without the CPU becoming a bottleneck.
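To make the batched-update point concrete, here is a minimal PyTorch sketch of one PPO clipped-surrogate step over a batch of 512 environments. This is not RSL-RL's implementation: the action size matches the H1's 19 actuated joints, but the observation size, network, and hyperparameters are assumptions, and the real library adds value and entropy losses, minibatching, and GAE.

```python
import torch

# Minimal sketch of a PPO clipped-surrogate update on batched tensors.
device = "cuda" if torch.cuda.is_available() else "cpu"
num_envs, obs_dim, act_dim = 512, 69, 19  # obs_dim is an assumption

policy = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 256),
    torch.nn.ELU(),
    torch.nn.Linear(256, act_dim),
).to(device)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One rollout batch straight from the simulator; every tensor already
# lives in GPU memory, so the update involves no host round trips.
obs = torch.randn(num_envs, obs_dim, device=device)
actions = torch.randn(num_envs, act_dim, device=device)
old_log_prob = torch.randn(num_envs, device=device)
advantages = torch.randn(num_envs, device=device)

# Clipped surrogate objective: L = -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]
dist = torch.distributions.Normal(policy(obs), 1.0)
log_prob = dist.log_prob(actions).sum(dim=-1)
ratio = torch.exp(log_prob - old_log_prob)
clipped = torch.clamp(ratio, 1.0 - 0.2, 1.0 + 0.2)  # eps = 0.2
loss = -torch.min(ratio * advantages, clipped * advantages).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```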
A key factor in achieving this throughput is the NVLink-C2C interconnect, which provides a high-bandwidth unified memory space between the Grace CPU and the Blackwell GPU.
In many reinforcement learning systems, physics simulation runs on the GPU while the training logic runs on the CPU, forcing tensors to move back and forth across the PCIe bus. This constant data transfer adds latency and limits overall throughput.
With NVLink-C2C, DGX Spark enables zero-copy data exchange between CPU and GPU. Physics simulation runs across hundreds of environments on the Blackwell GPU, and the PPO algorithm accesses the same memory space for its policy updates without the traditional host-device transfer overhead.
The result is a tightly integrated training loop in which simulation and learning operate on the same unified memory system. Traditionally, this level of throughput required a distributed GPU cluster; with DGX Spark, it runs on a single Arm-based workstation.
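In PyTorch terms, the difference between the two designs looks roughly like the snippet below. It is illustrative rather than a benchmark, and the tensor shapes are the assumed ones from the sketch above:

```python
import torch

# Illustrative contrast (not a benchmark): a split sim/training stack shuttles
# rollout data through host memory, while a unified stack keeps it on device.
device = "cuda" if torch.cuda.is_available() else "cpu"
obs = torch.randn(512, 69, device=device)  # rollout batch, sized as above

# Typical pattern when simulation and training live in separate processes:
host_obs = obs.cpu()            # device-to-host copy
obs_again = host_obs.to(device) # host-to-device copy before the update

# Pattern used here: tensors produced by the simulator are consumed
# directly by the PPO update, never leaving unified memory.
normalized = (obs - obs.mean(dim=0)) / (obs.std(dim=0) + 1e-8)
```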
Watch a humanoid learn how to walk
One of the advantages of robot RL in simulation is that learning progress can be observed directly.
I captured evaluation clips at two checkpoints from the num_envs=512 run.
Iteration 50: Initial exploration
At iteration 50:
- Most robots fall quickly
- Joint movements are noisy and unstable
- No clear walking pattern emerges
- Velocity commands are not yet tracked
At this stage, the policy is still exploring; PPO has not yet discovered a viable movement strategy.

Iteration 1350: Stable locomotion
By iteration 1350, the humanoid:
- Walks forward consistently
- Maintains balance even on uneven terrain
- Recovers from small stumbles
- Tracks velocity commands more accurately

The changes in foot placement, postural control, and gait regularity are clearly visible: random exploration gradually becomes structured locomotion.
Training metrics reinforce the same story.
- Mean reward increases over time.
- Episode length grows as the robots stay upright for longer.
- The value function loss stabilizes.
- Action noise decreases as the policy converges.
- Termination penalties shrink as falls become less frequent.
This agreement between visible behavior and logged metrics makes the training process both interpretable and measurable.
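Because RSL-RL writes these metrics as TensorBoard scalars, they can also be inspected offline. Here is a minimal sketch, assuming a recent RSL-RL version and a hypothetical log directory; the scalar tag names can vary between versions:

```python
# Read the training curves back from the RSL-RL TensorBoard logs.
# The run directory and scalar tags below are assumptions; adjust to your run.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("logs/rsl_rl/h1_rough/<run_dir>")  # hypothetical path
acc.Reload()

for tag in ("Train/mean_reward", "Train/mean_episode_length", "Loss/value_function"):
    scalars = acc.Scalars(tag)
    print(f"{tag}: first={scalars[0].value:.3f}, last={scalars[-1].value:.3f}")
```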
Why this matters to Arm developers
This project is more than a robotics demo: it shows how Arm-native systems can support the entire lifecycle of modern physical AI workloads.

| Conventional robot RL | DGX Spark workflow |
|---|---|
| Multi-node GPU cluster | Single Arm-based workstation |
| Separate simulation and training systems | Integrated simulation + training stack |
| Repeated CPU-GPU data transfer over PCIe | NVLink-C2C unified memory |
| x86-centric toolchain | Native Arm robotics workflow |
For Arm developers, the implications are clear.
- Arm supports more than just inference workloads.
- Robotics simulation and training can be built natively on Arm.
- Workstation-class systems support serious RL experimentation.
- The same architectural foundation carries from development through deployment.
The significance is platform-level: the point is not that one humanoid robot was trained, but that Arm can support a complete robotics workflow, from source build through simulation, training, evaluation, and, eventually, deployment.
Final thoughts and continued exploration
In this workflow, we:
- Built Isaac Sim and Isaac Lab natively on Arm.
- Created a high-fidelity robotics simulation environment.
- Trained a humanoid robot with 19 degrees of freedom on rough terrain.
- Achieved approximately 65,000 simulation steps per second.
- Observed policy convergence from unstable exploration to stable locomotion.
All of this was completed on a single DGX Spark system. Rethinking what’s possible for robot reinforcement learning on a single workstation is the idea; the next step is turning that idea into a reproducible workflow.
To reproduce the workflow described in this blog post, see the accompanying learning path, Train humanoid movement policies using Isaac Lab on DGX Spark.
This step-by-step guide shows you how to:
- Build Isaac Sim natively on Arm.
- Run the robot simulation.
- Train humanoid movement policies in Isaac Lab.
Explore the Isaac Lab documentation to try out additional environments, tasks, and reinforcement learning workflows.
