Microchannel fabrication
The microfluidic channels used in this study were produced by standard soft lithography with PDMS. Each device was cast from a master mould patterned lithographically with SU-8 negative photoresist on a 4-inch silicon wafer, which was then placed inside a Petri dish. The thermocurable PDMS prepolymer was prepared by mixing the base with the curing agent at a weight ratio of 10:1. After degassing under vacuum, the prepolymer was cast onto the mould and crosslinked by thermal curing for 2 h at 85 °C. The cured PDMS was then cut and peeled off the channel mould, and the inlet and outlet ports were punched with a 0.75-mm punch. The puncher was mounted at an angle of 60° so that fluid would not enter the channel at an angle that might break the plasma-treatment bond between the PDMS layers, which could cause leakage and malfunction of the device. A second PDMS layer was bonded onto the PDMS channel by plasma treatment for 1 min, followed by curing at 85 °C for 2 h. A PZT was attached to the PDMS channel wall orthogonal to the aneurysm cavity. Flow through the channel was driven by a pulsatile or continuous-flow pump through tubes attached to the inlet and outlet. To avoid an impedance mismatch between the outer channel wall and the surrounding air, the entire system was placed in a water container.
Imaging pipeline
The imaging process began with an inverted microscope, which streamed live images to our processing pipeline (Fig. 2a). We segmented the initial image into channels and obstacles using the Segment Anything Model56, chosen for its ability to accurately differentiate complex visual elements. Following segmentation, we refined and cleaned the image with a morphological closing operation and adaptive thresholding to identify the microrobots, which appeared black under the microscope. We then applied detection and tracking algorithms to identify the agent (microrobot), calculate its centre, draw a bounding box around it and initialize the channel and spatial reliability tracker (CSRT), which was selected for its robust tracking in dynamic and cluttered environments. When tracking was lost, the system quickly re-detected the agent and reinitialized tracking, minimizing computation and enhancing real-time feedback. In the processed images, microrobots are marked in blue and target locations in red. Positive rewards were assigned when the microrobot progressed towards the target, whereas movement away from it incurred a penalty.
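The detection-and-tracking stage can be sketched as follows. This is a minimal illustration assuming an OpenCV build with the contrib tracking module; the detection thresholds are placeholders and the upstream SAM-based segmentation is omitted, so the helper operates directly on the camera frame.

# Minimal sketch of the detection/re-detection and CSRT tracking stage
# (assumes opencv-contrib-python; SAM-based segmentation is applied upstream).
import cv2

tracker = None  # CSRT tracker, (re)initialized whenever tracking is lost

def detect_microrobot(gray):
    """Detect the dark microrobot and return its bounding box, or None."""
    # Adaptive thresholding: microrobots appear black under the microscope
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 5)
    # Morphological closing removes small holes and speckle
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return cv2.boundingRect(max(contours, key=cv2.contourArea))  # (x, y, w, h)

def track_step(frame):
    """Return the centre of the microrobot for one camera frame."""
    global tracker
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    box = None
    if tracker is not None:
        ok, box = tracker.update(frame)
        if not ok:
            tracker, box = None, None  # tracking lost: fall back to re-detection
    if tracker is None:
        box = detect_microrobot(gray)
        if box is None:
            return None
        tracker = cv2.TrackerCSRT_create()
        tracker.init(frame, tuple(int(v) for v in box))
    x, y, w, h = (int(v) for v in box)
    cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)  # agent marked in blue
    return (x + w // 2, y + h // 2)  # centre used by the RL state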
Reward function
This simulated setting enabled us to refine and iterate our reward functions and control strategies efficiently, without the continuous need for live experimental adjustments. This approach streamlines the development process, facilitating more precise and effective advances in microrobot control. The reward function is designed to incentivize the microrobot to efficiently reach designated target points while navigating around obstacles and taking into account various shapes and layouts of the channels. Formally, the reward R at time step t is defined by the following criteria:
$${R}_{t}=\begin{cases}\alpha, & \text{if target reached,}\\ -\beta, & \text{if a collision occurs,}\\ -\gamma f\left({d}_{t}\right), & \text{otherwise,}\end{cases}$$
where α, β and γ are coefficients that weight the importance of each component in the reward function. The term dt denotes the Euclidean distance to the target point at time step t, and f(d) = 1/(d + ε) is a real, monotonic function that translates this distance into a penalty (or reward), where ε is a small positive constant used to avoid division by zero. This function was chosen specifically to relate the reward inversely to the distance, thereby encouraging the microrobot to minimize it (Supplementary Note 2).
After extensive experimentation, we identified the optimal settings for our system: α = 10, β = 2 and γ = 0.1. Our simulation results confirm that MBRL effectively learns advanced navigation tactics through interactions with the environment, and can thereby master complex navigational strategies in intricate settings such as vascular systems, mazes and racetracks.
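As an illustration only, the base reward can be written as the following sketch; the Boolean target and collision tests stand in for the image-based checks in our environment, and the coefficients are those reported above.

# Sketch of the base reward using the coefficients reported above; the Boolean
# flags stand in for the image-based target and collision checks.
ALPHA, BETA, GAMMA = 10.0, 2.0, 0.1
EPS = 1e-6  # small constant to avoid division by zero

def base_reward(d_t, target_reached, collision):
    if target_reached:
        return ALPHA
    if collision:
        return -BETA
    return -GAMMA * (1.0 / (d_t + EPS))  # distance-based term, f(d) = 1/(d + EPS)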
The adapted reward function for the flow environment, \(f(d_t,\mathbf{X}_t,\mathbf{A}_t)\), is defined as follows:
$$f\left({d}_{t},\mathbf{X}_{t},\mathbf{A}_{t}\right)=\begin{cases}-\mu, & \text{if }\mathbf{X}_{t}\text{ is on the wall and }\mathbf{A}_{t}\text{ is in the direction of the wall,}\\ -\kappa, & \text{if }\mathbf{X}_{t}\text{ is central in the channel,}\\ \dfrac{1}{{d}_{t}+\epsilon }-\lambda, & \text{otherwise.}\end{cases}$$
The components of the reward function are defined as follows. A step penalty λ is applied at every step to encourage the microrobot to reach the target quickly. The wall-sliding penalty −μ is imposed when the microrobot is in contact with a wall and the chosen action pushes it into that wall, which permits sliding along the wall while discouraging pushing against it. The inverse-distance reward 1/(d + ε) provides a continuous incentive for the microrobot to move closer to the target, with stronger gradients as the distance decreases. In addition, we introduced a centring penalty −κ for when the microrobot is located too centrally in the channel. Together, these terms incentivize the microrobot to navigate close to the channel walls, where drag forces are substantially reduced owing to the no-slip condition.
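A corresponding sketch of the adapted distance term is given below; the penalty values and the Boolean geometry checks (wall contact, pushing into the wall, channel centring) are hypothetical placeholders for the checks implemented in the flow environment.

# Sketch of the flow-adapted term f(d_t, X_t, A_t); MU, KAPPA and LAMBDA stand
# for the wall-sliding, centring and step penalties.
MU, KAPPA, LAMBDA = 1.0, 0.5, 0.01  # illustrative values only
EPS = 1e-6

def flow_reward_term(d_t, on_wall, pushing_into_wall, in_channel_centre):
    if on_wall and pushing_into_wall:
        return -MU      # discourage pushing against the wall (sliding is allowed)
    if in_channel_centre:
        return -KAPPA   # keep the agent near the walls, where drag is lower
    return 1.0 / (d_t + EPS) - LAMBDA  # inverse-distance reward minus step penalty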
Training ratio
We investigated a critical parameter known as the training ratio, which denotes the number of steps trained in imagination (within the world model) for each step taken in the physical environment. This approach capitalizes on the world model's ability to simulate numerous hypothetical scenarios, reducing the need for extensive physical interactions. The key advantage of a higher training ratio is its potential to enhance the efficiency of the learning process: the agent learns from imagined experiences, which are both quicker and less costly to generate than physical interactions. Ideally, a higher training ratio reduces the number of environmental interactions required to achieve convergence.
We experimented with various training ratios to assess their impact on learning efficiency and performance. For example, a training ratio of 10:1 means that for every experimental step, the agent performs ten steps in the dreamed environment. This strategy enables the agent to accumulate more experience and optimize its policy without the time and resource constraints associated with physical training. Conversely, a lower training ratio, such as 1:1, entails that the agent performs an equal number of physical and simulated steps, which slows down the learning process but provides more accurate feedback from the physical environment.
Our experiments demonstrated that higher training ratios, such as 1,000:1, dramatically reduced the number of interactions with the physical environment required to achieve convergence. The results indicate that higher ratios led to faster convergence, whereas lower ratios often failed to converge. To maximize the benefits of high training ratios, we developed a parallel script that runs physical-environment interactions and world-model training on separate threads, as sketched below. This results in an adaptive training ratio that adjusts dynamically with the agent's performance in the physical environment.
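The parallel scheme can be sketched as follows; env, policy, world_model and replay_buffer are placeholders for the experimental environment interface and the world-model trainer, so the effective training ratio simply becomes however many imagined updates complete between two real steps.

# Simplified sketch of running physical-environment interaction and world-model
# training on separate threads; the effective training ratio adapts to however
# many imagined updates fit between two real environment steps.
import threading
import queue

experience = queue.Queue()   # (s, a, r, s') transitions from the physical set-up
stop = threading.Event()

def environment_thread(env, policy):
    """Collect real interactions and push each transition to the shared queue."""
    obs = env.reset()
    while not stop.is_set():
        action = policy(obs)
        next_obs, reward, done = env.step(action)   # placeholder 3-tuple interface
        experience.put((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs

def world_model_thread(world_model, replay_buffer):
    """Train continuously in imagination, as fast as the hardware allows."""
    while not stop.is_set():
        while not experience.empty():               # drain new real transitions
            replay_buffer.append(experience.get())
        if replay_buffer:
            world_model.train_step(replay_buffer)   # one imagined training update

# threading.Thread(target=environment_thread, args=(env, policy), daemon=True).start()
# threading.Thread(target=world_model_thread, args=(wm, buffer), daemon=True).start()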
RL implementation
We formalized the problem as a Markov decision process comprising the state space, action space, reward function and transition dynamics. The state, action and reward triplet at time t, (St, At, Rt), together with the transition dynamics T, enables the RL agent to learn optimal policies for microrobot control through continuous interaction with the environment. The state St incorporates visual information captured by cameras, including the current position extracted from the image coordinates \(({x}_{t}^\mathrm{a},{y}_{t}^\mathrm{a})\) and the target location coordinates \(({x}_{t}^\mathrm{t},{y}_{t}^\mathrm{t})\), which represent the desired target position that the microrobot aims to reach:
$$\mathbf{S}_t=\left\{I,{x}_{t}^\mathrm{a},{y}_{t}^\mathrm{a},{x}_{t}^\mathrm{t},{y}_{t}^\mathrm{t}\right\},$$
where I encapsulates the processed camera feed at time t. We used a convolutional neural network to extract meaningful features from the images, such as the size, shape and interactions of the microrobots: I = CNN(Imaget).
The action space At defines the set of all possible actions the control system can execute at any given time. In our setting, these actions correspond to the drive settings of the PZTs:
$$\mathbf{A}_t=\left[\left(\;{f}_{1},{A}_{1}\right),\left(\;{f}_{2},{A}_{2}\right),\ldots ,\left(\;{f}_{n},{A}_{n}\right)\right],$$
where f is the frequency of the ultrasonic travelling wave, A is the peak-to-peak voltage amplitude and n is the number of transducers.
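In a Gymnasium-style interface, this action space could be expressed as in the sketch below; the number of transducers and the frequency and voltage bounds are purely illustrative and are not the values used in our experiments.

# Illustrative Gymnasium-style action space: one (frequency, peak-to-peak
# amplitude) pair per PZT. All bounds below are placeholders.
import numpy as np
from gymnasium import spaces

n_transducers = 4                      # hypothetical number of PZTs
f_low, f_high = 50e3, 500e3            # frequency bounds in Hz (illustrative)
a_low, a_high = 0.0, 20.0              # peak-to-peak voltage bounds (illustrative)

action_space = spaces.Box(
    low=np.tile(np.array([f_low, a_low], dtype=np.float32), (n_transducers, 1)),
    high=np.tile(np.array([f_high, a_high], dtype=np.float32), (n_transducers, 1)),
    dtype=np.float32,
)  # shape (n_transducers, 2): one (f_i, A_i) pair per transducer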
The transition dynamics T(St, At) describe how the state of the system changes in response to an action. This function is unknown to the RL algorithm and must be inferred through interactions with the environment. In our setting, the transition dynamics represent the physical changes in the system state resulting from an activated PZT:
$$\mathbf{S}_{t+1}=\mathbf{S}_t+\Delta t\times\text{dynamics}(\mathbf{S}_t,\mathbf{A}_t),$$
where Δt is the time step and dynamics(St, At) is a function modelling the physics of microrobot motion under ultrasound stimulation, which was inferred from the differences between consecutive images (states).
World model learning
The world model processes the state St into a latent state zt using an encoder–decoder architecture. This model predicts future latent states and rewards based on the current latent state and actions, and it is trained continually on new samples (St, At, Rt). The key components are as follows (a schematic code sketch is given after the list):
- Encoder–decoder architecture: this architecture compresses high-dimensional observations into a compact latent space for prediction and control. The encoder qϕ maps an observation ot, together with the deterministic state ht, to a latent state zt, where ϕ is a parameter vector shared between the encoder and all other world model components:
$$\mathbf{z}_t \sim{q}_{\phi }(\mathbf{z}_t\mid\mathbf{h}_{t},{o}_{t}).$$
The decoder (D) reconstructs the observation from the latent state: \({\hat{o}}_{t}=D(\mathbf{z}_{t})\).
- Dynamics network: this network predicts the future states of the microrobots based on their current state and actions, following the principle of a recurrent neural network. It maintains a deterministic state ht that is predicted by the recurrent neural network from the previous deterministic state ht−1, the previous latent state zt−1 and the previous action at−1:
$$\mathbf{h}_t ={f}_{\phi }(\mathbf{h}_{t-1},\mathbf{z}_{t-1},\mathbf{a}_{t-1}).$$
- Reward predictor: this component predicts the rewards associated with different actions, aiding the agent in optimizing its behaviour. The reward predictor pϕ estimates the reward rt based on the deterministic state ht and the latent state zt:
$${\hat{r}}_{t} \sim{p}_{\phi }({\hat{r}}_{t}\mid\mathbf{h}_t,\mathbf{z}_t).$$
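A schematic code sketch of these components is given below. The layer sizes, latent dimension and the use of a GRU cell and a fully connected encoder (in place of the convolutional network described above) are illustrative assumptions rather than the architecture actually used in this work.

# Schematic sketch of the world-model components (encoder/decoder, recurrent
# dynamics, prior and reward head). Dimensions and layers are illustrative only.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, obs_dim=64 * 64, latent_dim=32, hidden_dim=256, action_dim=8):
        super().__init__()
        # Encoder q_phi: (o_t, h_t) -> latent state z_t (mean and log-std)
        self.encoder = nn.Sequential(nn.Linear(obs_dim + hidden_dim, 256), nn.ELU(),
                                     nn.Linear(256, 2 * latent_dim))
        # Decoder D: latent state z_t -> reconstructed observation
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ELU(),
                                     nn.Linear(256, obs_dim))
        # Dynamics network f_phi: (h_{t-1}, z_{t-1}, a_{t-1}) -> h_t
        self.dynamics = nn.GRUCell(latent_dim + action_dim, hidden_dim)
        # Prior over z_t given h_t, used when rolling out in imagination
        self.prior_head = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ELU(),
                                        nn.Linear(256, 2 * latent_dim))
        # Reward predictor p_phi: (h_t, z_t) -> predicted reward
        self.reward_head = nn.Sequential(nn.Linear(hidden_dim + latent_dim, 256),
                                         nn.ELU(), nn.Linear(256, 1))

    @staticmethod
    def _sample(stats):
        mean, log_std = stats.chunk(2, dim=-1)
        return mean + log_std.exp() * torch.randn_like(mean)

    def encode(self, obs, h):
        return self._sample(self.encoder(torch.cat([obs, h], dim=-1)))     # z_t

    def decode(self, z):
        return self.decoder(z)                                             # reconstructed o_t

    def prior(self, h):
        return self._sample(self.prior_head(h))                            # imagined z_t

    def dynamics_step(self, h_prev, z_prev, a_prev):
        return self.dynamics(torch.cat([z_prev, a_prev], dim=-1), h_prev)  # h_t

    def predict_reward(self, h, z):
        return self.reward_head(torch.cat([h, z], dim=-1))                 # predicted r_t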
Latent imagination and policy optimization
The agent generates future trajectories within the latent space and uses these imagined trajectories to train the policy and value networks. This reduces the need for physical interactions and makes learning more efficient. The main steps are as follows:
(1) Trajectory sampling: generate possible future trajectories by simulating the environment with the transition model, ht = fϕ(ht−1, zt−1, at−1). The imagined trajectories start at true model states st drawn from the agent's replay buffer and are then rolled forward in imagination by the transition model. These trajectories are generated much faster than environment interactions, and their number is controlled by a parameter called the training ratio. We developed a multi-threaded approach in which the latent model runs continuously in a separate process, without a fixed ratio to the physical-environment interactions.
(2) Trajectory evaluation: assess the quality of each trajectory based on the accumulated rewards predicted by the reward model. The reward predictor \({\hat{r}}_{t}\sim{p}_{\phi }({\hat{r}}_{t}\mid\mathbf{h}_{t},\mathbf{z}_{t})\) estimates the reward of each state.
(3) Policy and value network training: the actor–critic component is trained to maximize the expected imagined reward \(E\left(\sum_{t=0}^{\infty }\gamma^{t}{r}_{t}\right)\) with respect to a specific policy. The evaluated trajectories are used to update the policy and value networks, which dictate the agent's actions in the physical environment.
This training loop leverages the predicted latent states and rewards, substantially enhancing sample efficiency by reducing the dependence on real-world interactions and relying on a very compact latent representation.
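A schematic rollout in imagination, building on the world-model sketch above, could look as follows; policy is a placeholder for the actor network, and a plain discounted sum is used here instead of the λ-returns typically used in practice.

# Schematic latent-imagination rollout; `policy` stands in for the actor network
# and the return is a plain discounted sum of predicted rewards.
import torch

def imagined_return(world_model, policy, h, z, horizon=15, gamma=0.99):
    """Roll the world model forward in latent space and accumulate rewards."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(torch.cat([h, z], dim=-1))    # action from the latent state
        h = world_model.dynamics_step(h, z, a)   # h_t = f_phi(h_{t-1}, z_{t-1}, a_{t-1})
        z = world_model.prior(h)                 # imagined latent state z_t
        r = world_model.predict_reward(h, z)     # predicted reward
        total = total + discount * r
        discount *= gamma
    return total  # maximized by the actor; the critic is regressed towards such returns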
Algorithm 1
Microrobot MBRL training
Require: Configuration, frames, CSRT and segmented mask
Ensure: Environment set-up, reward calculation and state update
1: Initialize environment with configuration parameters # Set environment
2: Initialize RL state s0 # Initialize RL state
3: Downsize image to 64 × 64 px # Reduce image size
4: while Episodes < Total_Episodes do # Main training loop
5: frame ← get_camera_frame # Capture frame
6: cleaned_frame ← segment_frame # Segment frame
7: bubble_size ← detect_cluster # Detect cluster size
8: truncated, terminated ← False, False # Initialize flags
9: if bubble_size > area_threshold then # Check bubble size
10: Track microrobot with CSRT # Track microrobot
11: agent_position ← get_agent_pos # Get agent position
12: if agent_position ≈ target_position then # Near target
13: r ← Target reward, terminated ← True # Assign reward, end episode
14: else if agent_position in Channel_walls then # Collision detected
15: r ← Collision penalty, terminated ← True # Assign penalty, end episode
16: end if
17: else
18: r ← Distance_based reward # Distance reward
19: end if
20: if steps > threshold then # Check step limit
21: truncated ← True # Mark truncated
22: end if
23: Reset Collisions if necessary # Reset if collisions
24: Deactivate PZT, adjust position and recheck collisions # Execute and assess action
25: Apply action to the PZTs # Execute action
26: Observe environment and compute reward # Observe outcome
27: Check termination, return (obs, reward, done) # Return step results
28: end while
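For orientation, the main loop of Algorithm 1 maps onto a Gymnasium-style step function roughly as sketched below; the helper methods mirror the pseudocode names and are placeholders for the hardware, segmentation and tracking interfaces, and the fallback when no cluster is detected is an assumption.

# Rough Gymnasium-style rendering of one step of Algorithm 1; helper methods
# mirror the pseudocode and stand in for hardware and image-processing code.
import numpy as np

def step(self, action):
    self.apply_action(action)                       # drive the PZT(s)
    frame = self.get_camera_frame()                 # capture frame
    cleaned = self.segment_frame(frame)             # SAM-based segmentation
    bubble_size = self.detect_cluster(cleaned)      # microbubble cluster size
    terminated, truncated = False, False
    if bubble_size > self.area_threshold:
        pos = self.track_with_csrt(frame)           # CSRT tracking of the agent
        if np.linalg.norm(pos - self.target_position) < self.target_tolerance:
            reward, terminated = self.target_reward, True       # target reached
        elif self.in_channel_walls(pos):
            reward, terminated = -self.collision_penalty, True  # collision
        else:
            reward = self.distance_based_reward(pos)            # distance shaping
    else:
        reward = self.distance_based_reward(self.last_position) # no cluster detected
    self.steps += 1
    if self.steps > self.step_threshold:            # step limit reached
        truncated = True
    obs = self.downsize(cleaned)                    # 64 x 64 px observation
    return obs, reward, terminated, truncated, {}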
Algorithm 2
Microrobot flow environment training and simulation
Require: Config, direction and amplitude
Ensure: Environment set-up, reward calculation and state update
1: Initialize environment # Set up environment parameters
2: Set reward_centre and flow_direction from config
3: if reward_centre or flow_direction then
4: Initialize flow # Initialize flow
5: end if
6: reward ← 0 # Initialize reward for the step
7: if is_valid_move (direction, amplitude) then # Check if the move is valid
8: move_agent (direction, amplitude) # Move the agent
9: else if check_collision() then # Check for collision
10: if is_valid_move (direction, amplitude/2) then # Try moving with reduced amplitude
11: move_agent (direction, amplitude/2)
12: else
13: reward ← reward_collision # Apply collision penalty
14: update_radius() # Update the radius after collision
15: end if
16: else
17: move_agent (direction, small_amplitude) # Move with a small amplitude
18: end if
19: if flow_active then # Check if flow is active
20: if is_in_centre() then # Check if agent is in the centre
21: reward ← reward + reward_centre # Add centre reward to total
22: if is_valid_move (direction, amplitude/1.5) then # Try to move against the flow
23: move_agent (flow_direction, amplitude)
24: else
25: update_radius () # Update radius if move is not valid
26: end if
27: end if
28: end if
29: Update step counters and check termination
30: increment_step_counter () # Increase the step counter
31: if reached_target () then # Check if target is reached
32: reward ← reward_target_reached # Add target reached reward
33: mark_as_terminated () # Mark episode as terminated
34: else if radius_too_small () then # Check if radius is too small
35: reward ← reward_termination # Add termination reward
36: reset_radius() # Reset radius for new episode
37: else
38: reward ← calculate_distance_reward () # Calculate reward based on distance
39: end if
40: Return observations, reward, done and info # Return step results
41: return get_observations(), reward, is_done(), get_info()