Model-based reinforcement learning for ultrasound-driven autonomous microrobots



Microchannel fabrication

The microfluidic channels used in the study were produced through standard soft lithography with PDMS. Each device was fabricated from a master mould lithographically patterned with SU-8 negative photoresist on a 4-inch silicon wafer, which was later placed inside a Petri dish. The thermocurable PDMS prepolymer was prepared by mixing the base with the curing agent at a weight ratio of 10:1. After degassing under vacuum, the prepolymer was cast onto the mould and crosslinked by thermal curing at 85 °C for 2 h. The cured PDMS was then cut and peeled from the channel mould. Inlet and outlet ports were punched with a 0.75-mm punch mounted at an angle of 60°; this angle prevented fluid from entering the channel in a direction that could break the plasma-treatment bond between the PDMS layers, which would cause leakage and malfunctioning of the environment. Another PDMS layer was bonded onto the PDMS channel by plasma treatment for 1 min, followed by curing at 85 °C for 2 h. A PZT was attached to the PDMS channel wall orthogonal to the aneurysm cavity. The channel flow was circulated using a pulsatile or continuous flow pump through tubes attached to the inlet and outlet. To avoid an acoustic impedance mismatch between the far side of the channel wall and the air, the entire system was placed in a water container.

Imaging pipeline

The imaging process began with an inverted microscope, which transmitted live images to our processing pipeline (Fig. 2a). We segmented the initial image into channels and obstacles using the segment anything model56, chosen for its ability to accurately differentiate complex visual elements. Following segmentation, we refined and cleaned the image with a morphological closing operation and adaptive thresholding to identify microrobots, which appeared black under the microscope. We then applied detection and tracking algorithms to identify the agent (microrobot), calculate its centre, draw a bounding box around it and initialize the channel and spatial reliability tracker (CSRT), which was selected for its robust tracking in dynamic and cluttered environments. When tracking was lost, the system quickly re-detected the microrobot and reinitialized tracking, thus minimizing computation and enhancing real-time feedback. In the processed images, microrobots are marked in blue and the target locations in red. Positive rewards were assigned when the microrobot progressed towards the target, whereas movement away incurred a negative penalty.
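The detection-and-tracking step can be sketched as follows, assuming OpenCV with the contrib modules; the camera interface (get_camera_frame), the segmentation mask (channel_mask) and the area threshold are illustrative placeholders rather than the exact implementation.

import cv2
import numpy as np

def detect_microrobot(frame, mask, min_area=50):
    # Adaptive thresholding isolates the microrobot, which appears black under the microscope.
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(grey, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 10)
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))
    binary = cv2.bitwise_and(binary, mask)          # keep detections inside the channel only
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = [c for c in contours if cv2.contourArea(c) > min_area]
    if not contours:
        return None
    return cv2.boundingRect(max(contours, key=cv2.contourArea))   # (x, y, w, h) around the agent

tracker, tracking = None, False
while True:
    frame = get_camera_frame()                      # placeholder for the live microscope feed
    if not tracking:
        box = detect_microrobot(frame, channel_mask)
        if box is not None:
            tracker = cv2.TrackerCSRT_create()      # CSRT: robust in dynamic, cluttered scenes
            tracker.init(frame, box)
            tracking = True
    else:
        tracking, box = tracker.update(frame)       # if tracking is lost, re-detect on the next pass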

Reward function

This simulated setting enabled us to efficiently refine and iterate on our reward functions and control strategies without the continuous need for live experimental adjustments. This approach streamlines the development process, facilitating more precise and effective advances in microrobot control. The reward function is designed to incentivize the microrobot to reach designated target points efficiently while navigating around obstacles and accounting for the various shapes and layouts of the channels. Formally, the reward R at time step t is defined by the following criteria:

$${R}_{t}=\begin{cases}\alpha, & \text{if target reached,}\\ -\beta, & \text{if a collision occurs,}\\ -\gamma f\left({d}_{t}\right), & {\rm{otherwise,}}\end{cases}$$

where α, β and γ are coefficients that weight the importance of each component in the reward function. The term \(d_t\) denotes the Euclidean distance to the target point at time step t, and \(f(d_t)=1/(d_t+\epsilon)\) is a monotonically decreasing function that translates the distance into a reward (or penalty), where \(\epsilon\) is a small positive constant used to avoid division by zero. This function was specifically chosen to inversely relate the reward to the distance, thereby encouraging the microrobot to minimize this distance (Supplementary Note 2).

After extensive experimentation, we identified the optimal settings for our system: α = 10, β = 2 and γ = 0.1. Our simulation results confirm that MBRL effectively learns advanced navigation tactics through interactions within the environment, and can thereby master complex navigational strategies in intricate settings such as vascular systems, mazes and racetracks.
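As a concrete illustration, the piecewise reward above can be written as a short Python function with the coefficients reported here; the target and collision tests stand in for the environment's own geometry checks.

def reward(distance, reached_target, collided,
           alpha=10.0, beta=2.0, gamma=0.1, eps=1e-6):
    # Piecewise reward R_t, mirroring the expression above.
    if reached_target:
        return alpha                                # bonus for reaching the target
    if collided:
        return -beta                                # penalty for a collision
    return -gamma * (1.0 / (distance + eps))        # distance term, f(d_t) = 1/(d_t + eps)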

The adapted reward function for the flow environment, \(f({d}_{t},\mathbf{X}_t,\mathbf{A}_t)\), is defined as follows:

$$f\left({d}_{t},\mathbf{X}_t,\mathbf{A}_t\right)=\begin{cases}-\mu, & \text{if }\mathbf{X}_t\text{ is on the wall and }\mathbf{A}_t\text{ is in the direction of the wall,}\\ -\kappa, & \text{if }\mathbf{X}_t\text{ is central in the channel,}\\ \dfrac{1}{{d}_{t}+\epsilon }-\lambda, & \text{otherwise.}\end{cases}$$

The components of the reward function are initialized as follows. A step penalty λ is applied at each step to encourage the microrobot to reach the target quickly. The wall-sliding penalty −μ is imposed when the microrobot is in contact with a wall and the action taken is in the direction of the wall, allowing sliding along the wall but discouraging pushing against it. The inverse distance reward \(1/(d_t+\epsilon)\) provides a continuous incentive for the microrobot to move closer to the target, with stronger gradients as the distance decreases. Moreover, we introduced a centring penalty −κ for when the microrobot is too centrally located in the channel, which encourages it to stay near the walls where the drag forces are lower. These adjustments incentivized the microrobots to navigate closer to the channel walls, where drag forces are substantially reduced owing to the no-slip condition.
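A corresponding sketch of the flow-adapted reward is given below; the wall-contact, action-direction and centring tests are placeholders for the environment's geometric checks, and the values of μ, κ and λ are illustrative rather than the tuned coefficients.

def flow_reward(distance, on_wall, action_towards_wall, in_channel_centre,
                mu=1.0, kappa=0.5, lam=0.05, eps=1e-6):
    # Flow-environment reward f(d_t, X_t, A_t), mirroring the expression above.
    if on_wall and action_towards_wall:
        return -mu                       # wall-sliding penalty: do not push into the wall
    if in_channel_centre:
        return -kappa                    # centring penalty: favour the low-drag region near the walls
    return 1.0 / (distance + eps) - lam  # inverse-distance reward minus the step penalty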

Training ratio

We investigated a critical parameter known as the training ratio, which denotes the number of steps trained in imagination (within the world model) for each step taken in the physical environment. This approach capitalizes on the world model's ability to simulate numerous hypothetical scenarios, thus reducing the need for extensive physical interactions. The key advantage of a higher training ratio is its potential to enhance the efficiency of the learning process: it enables the agent to learn from imagined experiences, which are both quicker and less costly than physical interactions. Ideally, using a higher training ratio reduces the number of environmental interactions required to achieve convergence.

We experimented with various training ratios to assess their impact on learning efficiency and performance. For example, a training ratio of 10:1 means that for every experimental step, the agent performs ten steps in the dreamed environment. This strategy enables the agent to accumulate more experience and optimize its policy without the time and resource constraints associated with physical training. Conversely, a lower training ratio, such as 1:1, entails that the agent performs an equal number of physical and simulated steps, which slows down the learning process but provides more accurate feedback from the physical environment.

Our experiments demonstrated that higher training ratios, such as 1,000:1, dramatically reduced the number of interactions with the physical environment required to achieve convergence. The results indicate that higher ratios led to faster convergence, whereas lower ratios often failed to reach convergence. To maximize the benefits of high training ratios, we developed a parallel script to run physical environment interactions and world model training on separate threads. This resulted in an adaptive training ratio that was dynamically adjusted with the agent’s performance in the physical environment.
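This decoupled set-up can be sketched as two threads sharing a replay buffer: one collects physical environment steps at the pace set by the hardware, while the other trains the world model and policy in imagination as fast as it can, so the effective training ratio adapts to how quickly real interactions arrive. The env and agent objects and their methods below are assumed interfaces, not the exact laboratory code.

import threading
import time

replay_buffer = []                                   # shared storage of (s, a, r, s', done) samples
buffer_lock = threading.Lock()
stop_event = threading.Event()

def environment_thread(env, agent, max_steps=10_000):
    # Collect real interactions at the pace dictated by the physical set-up.
    obs, _ = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)                      # hypothetical policy interface
        next_obs, r, terminated, truncated, _ = env.step(action)
        with buffer_lock:
            replay_buffer.append((obs, action, r, next_obs, terminated))
        obs = env.reset()[0] if (terminated or truncated) else next_obs
    stop_event.set()

def imagination_thread(agent, batch_size=64):
    # Train the world model and policy continuously on imagined rollouts.
    while not stop_event.is_set():
        with buffer_lock:
            ready = len(replay_buffer) >= batch_size
        if ready:
            agent.train_world_model(replay_buffer, batch_size)   # hypothetical trainer methods
            agent.train_policy_in_imagination()
        else:
            time.sleep(0.01)                         # wait until enough real data has been collected

threads = [threading.Thread(target=environment_thread, args=(env, agent)),
           threading.Thread(target=imagination_thread, args=(agent,))]
for t in threads:
    t.start()
for t in threads:
    t.join()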

RL implementation

We formalized the problem as a Markov decision process that includes the state space, action space, reward function and transition dynamics. The state, action and reward triplet at time t (St, At, Rt) and the transition dynamics (T) enable the RL agent to learn optimal policies for microrobot control through continuous interaction with the environment. The state space St incorporates visual information captured by cameras, including the current position extracted from the image coordinates \(({x}_{t}^\mathrm{a},{y}_{t}^\mathrm{a})\) and the target location coordinates \(({x}_{t}^\mathrm{t},{y}_{t}^\mathrm{t})\), which represent the spatial location of the desired target position that the microrobot aims to reach:

$$\mathbf{S}_t=\left\{I,{x}_{t}^\mathrm{a},{y}_{t}^\mathrm{a},{x}_{t}^\mathrm{t},{y}_{t}^\mathrm{t}\right\},$$

where I encapsulates the processed camera feed at time t. We used a convolutional neural network to extract meaningful features from the images, such as the size, shape and interactions of the microrobots: I = CNN(Imaget).
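For instance, a compact convolutional encoder for the downsized 64 × 64 frames (Algorithm 1, line 3) might look as follows in PyTorch; the layer sizes are illustrative and not the exact architecture used.

import torch.nn as nn

# Convolutional feature extractor I = CNN(Image_t) for 64 x 64 single-channel frames.
cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=4, stride=2), nn.ELU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ELU(),
    nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ELU(),
    nn.Conv2d(128, 256, kernel_size=4, stride=2), nn.ELU(),
    nn.Flatten(),                                    # flattened feature vector used as I in S_t
)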

The action space At defines the set of all possible actions the control system can execute at any given time. In our set-up, these actions correspond to the settings of the PZTs:

$$\mathbf{A}_t=\left[\left(\;{f}_{1},{A}_{1}\right),\left(\;{f}_{2},{A}_{2}\right),\ldots ,\left(\;{f}_{n},{A}_{n}\right)\right],$$

where f is the frequency of the ultrasonic travelling wave, A is the amplitude of the peak-to-peak voltage and n is the number of transducers.

The transition dynamics T(St, At) describe how the state of the system changes in response to an action. This function is unknown to the RL algorithm and must be inferred through interactions with the environment. In our settings, the transition dynamics represent the physical changes in the system state resulting from an activated PZT:

$$\mathbf{S}_{t+1}=\mathbf{S}_t+\Delta t\times\text{dynamics}(\mathbf{S}_t,\mathbf{A}_t),$$

where Δt is the time step, and dynamics(St, At) is a function modelling the physics of microrobot motion under ultrasound stimulation, which was extracted from the differences in the images (state).
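Put together, this Markov decision process can be expressed as a Gymnasium-style environment; the normalized action bounds, the transducer driver call (set_pzt) and the observation helpers are assumptions for illustration, and the reward call reuses the sketch from the reward-function section above.

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class MicrorobotEnv(gym.Env):
    # S_t = (image, agent x/y, target x/y); A_t = (frequency, amplitude) per transducer.

    def __init__(self, n_transducers=4, img_size=64):
        # Actions are normalized to [0, 1] and rescaled by the (hypothetical) PZT driver.
        self.action_space = spaces.Box(low=0.0, high=1.0, shape=(n_transducers, 2), dtype=np.float32)
        self.observation_space = spaces.Dict({
            "image": spaces.Box(0, 255, (img_size, img_size, 1), dtype=np.uint8),
            "agent": spaces.Box(0.0, 1.0, (2,), dtype=np.float32),
            "target": spaces.Box(0.0, 1.0, (2,), dtype=np.float32),
        })

    def _observe(self):
        # Placeholders for the imaging pipeline described above.
        frame = get_camera_frame()
        return {"image": frame, "agent": get_agent_pos(frame), "target": get_target_pos()}

    def step(self, action):
        set_pzt(action)                              # apply (f, A) to each transducer
        obs = self._observe()                        # the physical system provides the transition dynamics
        d = float(np.linalg.norm(obs["agent"] - obs["target"]))
        r = reward(d, reached_target=d < 0.02, collided=in_channel_wall(obs["agent"]))
        return obs, r, d < 0.02, False, {}

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self._observe(), {}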

World model learning

The world model processes the state St into a latent state Zt using an encoder–decoder architecture. This model predicts future latent states and rewards based on the current latent state and actions, and it trains continually on new samples (St, At, Rt). The key components, illustrated by the sketch that follows this list, include:

  • Encoder–decoder architecture: This architecture compresses high-dimensional observations into a compact latent space for prediction and control. The encoder qϕ maps an observation ot to a latent state Zt, where ϕ is a parameter vector shared between the encoder and all other world model components:

    $$\mathbf{z}_t \sim{q}_{\phi }(\mathbf{z}_t\mid\mathbf{h}_{t},{o}_{t}).$$

    The decoder (D) reconstructs the observation from the latent state: \({\hat{o}}_{t}=D(\mathbf{z}_{t})\).

  • Dynamics network: This network predicts the future states of the microrobots based on their current state and actions, following the principle of a recurrent neural network. It maintains a deterministic state ht that the recurrent network predicts from the previous action at−1, the previous deterministic state ht−1 and the previous embedded state zt−1:

    $$\mathbf{h}_t ={f}_{\phi }(\mathbf{h}_{t-1},\mathbf{z}_{t-1},\mathbf{a}_{t-1}).$$

  • Reward predictor: This component predicts the rewards associated with different actions, aiding the agent in optimizing its behaviour. The reward predictor R estimates the reward rt based on the latent state zt and action at:

$${\hat{r}}_{t} \sim{p}_{\phi }({\hat{r}}_{t}\mid\mathbf{h}_t,\mathbf{z}_t).$$
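A simplified sketch of these three components, written in PyTorch with a Gaussian latent and fully connected layers in place of the actual architecture, is given below; the layer sizes and the prior head used for imagination are assumptions for illustration.

import torch
import torch.nn as nn

class WorldModel(nn.Module):
    # Encoder q_phi, deterministic dynamics f_phi and reward predictor p_phi (simplified sketch).

    def __init__(self, obs_dim, act_dim, latent_dim=32, hidden_dim=256):
        super().__init__()
        # Posterior encoder: z_t ~ q_phi(z_t | h_t, o_t); outputs mean and log-std of a Gaussian.
        self.encoder = nn.Sequential(nn.Linear(obs_dim + hidden_dim, 256), nn.ELU(),
                                     nn.Linear(256, 2 * latent_dim))
        # Prior over z_t, used when imagining trajectories without real observations.
        self.prior = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ELU(),
                                   nn.Linear(256, 2 * latent_dim))
        # Decoder D reconstructs the observation from the latent state.
        self.decoder = nn.Sequential(nn.Linear(latent_dim + hidden_dim, 256), nn.ELU(),
                                     nn.Linear(256, obs_dim))
        # Recurrent dynamics: h_t = f_phi(h_{t-1}, z_{t-1}, a_{t-1}).
        self.dynamics = nn.GRUCell(latent_dim + act_dim, hidden_dim)
        # Reward predictor: r_hat_t ~ p_phi(r_hat_t | h_t, z_t).
        self.reward_head = nn.Sequential(nn.Linear(latent_dim + hidden_dim, 256), nn.ELU(),
                                         nn.Linear(256, 1))

    def _sample(self, stats):
        mean, log_std = stats.chunk(2, -1)
        return mean + log_std.exp() * torch.randn_like(mean)

    def encode(self, obs, h):
        return self._sample(self.encoder(torch.cat([obs, h], -1)))

    def encode_prior(self, h):
        return self._sample(self.prior(h))

    def decode(self, z, h):
        return self.decoder(torch.cat([z, h], -1))

    def step(self, h, z, a):
        return self.dynamics(torch.cat([z, a], -1), h)

    def predict_reward(self, h, z):
        return self.reward_head(torch.cat([z, h], -1))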

Latent imagination and policy optimization

The agent generates future trajectories within the latent space and uses these imagined trajectories to train the policy and value networks. This reduces the need for physical interactions and makes learning more efficient. The main steps are as follows:

  (1) Trajectory sampling: Generate possible future trajectories by simulating the environment using the transition model (ht = fφ(ht−1, zt−1, at−1)). The imagined trajectories start at the true model states st drawn from the agent’s replay buffer and are then carried forward in imagination by the transition model. These trajectories are generated much faster than environment interactions, and their number per real step is controlled by a parameter called the training ratio. We developed a multi-threaded approach in which the latent model runs continuously on a separate process without a fixed ratio to the physical environment interactions.

  (2) Trajectory evaluation: Assess the quality of each trajectory based on the accumulated rewards predicted by the reward model. The reward predictor (\(\hat{{r}_{t}}\sim{p}_{{{\phi }}}\left(\hat{{r}_{t}}\mid\mathbf{h}_{t},\mathbf{z}_{t}\right)\)) estimates the rewards of each state.

  (3) Policy and value network training: The actor–critic component is trained to maximize the expected imagined reward \(\left(E\left(\sum_{t=0}^{\infty }\gamma^{t}{r}_{t}\right)\right)\) with respect to a specific policy. The evaluated trajectories are used to update the policy and value networks, which dictate the agent’s actions in the physical environment.

This training loop leverages the predicted latent states and rewards, substantially enhancing sample efficiency by reducing the dependence on real-world interactions and relying on a very compact latent representation.
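A compressed view of this loop is sketched below: starting from latent states of replayed real steps, the world model above rolls trajectories forward (step 1), the reward head scores them (step 2) and the actor is updated on the discounted imagined return (step 3). The actor interface and the REINFORCE-style surrogate used here are simplifications, and the value network is omitted for brevity.

import torch

def imagine_and_train(world_model, actor, actor_opt, start_h, start_z,
                      horizon=15, discount=0.99):
    # One actor update on imagined latent trajectories (simplified sketch).
    h, z = start_h, start_z                          # latent states of real steps from the replay buffer
    rewards, log_probs = [], []
    for _ in range(horizon):
        action, log_prob = actor(h, z)               # placeholder policy interface
        h = world_model.step(h, z, action)           # (1) trajectory sampling in the latent space
        z = world_model.encode_prior(h)              # prior latent: no real observations in imagination
        rewards.append(world_model.predict_reward(h, z))   # (2) trajectory evaluation
        log_probs.append(log_prob)
    # (3) policy update on the discounted imagined return E[sum_t gamma^t r_t]
    returns, objective = torch.zeros_like(rewards[-1]), 0.0
    for t in reversed(range(horizon)):
        returns = rewards[t] + discount * returns
        objective = objective + log_probs[t] * returns.detach()
    loss = -objective.mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()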

Algorithm 1

Microrobot MBRL training

Require: Configuration, frames, CSRT and segmented mask

Ensure: Environment set-up, reward calculation and state update

 1: Initialize environment with configuration parameters  # Set environment

 2: Initialize RL state s0  # Initialize RL state

 3: Downsize image to 64 × 64 px # Reduce image size

 4: while Episodes < Total_Episodes do # Main training loop

 5:  frame ← get_camera_frame   # Capture frame

 6:  cleaned_frame ← segment_frame  # Segment frame

 7:  bubble_size ← detect_cluster # Detect cluster size

 8:  truncated, terminated ← False, False  # Initialize flags

 9:  if bubble_size > area_threshold then # Check bubble size

 10:   Track microrobot with CSRT # Track microrobot

 11:   agent_position ← get_agent_pos # Get agent position

 12:   if agent_position ≈ target_position then # Near target

 13:   r ← Target reward, terminated ← True # Assign reward, end episode

 14:   else if agent_position in Channel_walls then # Collision detected

 15:   r ← Collision penalty, terminated ← True # Assign penalty, end episode

 16:   end if

 17:  else

 18:   r ← Distance_based reward # Distance reward

 19:  end if

 20:  if steps > threshold then # Check step limit

 21:   truncated ← True  # Mark truncated

 22:  end if

 23:  Reset Collisions if necessary # Reset if collisions

 24:  Deactivate PZT, adjust position and recheck collisions # Execute and assess action

 25:  Apply action, compute reward

 26:  Execute action, observe environment and compute reward

 27:  Check termination, return (obs, reward, done) # Check and return results

 28: end while

Algorithm 2

Microrobot flow environment training and simulation

Require: Config, direction and amplitude

Ensure: Environment set-up, reward calculation and state update

 1: Initialize environment # Set up environment parameters

 2: Set reward_centre and flow_direction from config

 3: if reward_centre or flow_direction then

 4:  Initialize flow  # Initialize flow

 5: end if

 6: reward ← 0  # Initialize reward for the step

 7: if is_valid_move (direction, amplitude) then  # Check if the move is valid

 8:  move_agent (direction, amplitude)   # Move the agent

 9: else if check_collision() then  # Check for collision

 10:  if is_valid_move (direction, amplitude/2) then # Try moving with reduced amplitude

 11:   move_agent (direction, amplitude/2)

 12:  else

 13:   reward ← reward_collision  # Apply collision penalty

 14:   update_radius() # Update the radius after collision

 15:  end if

 16: else

 17:  move_agent (direction, small_amplitude) # Move with a small amplitude

 18: end if

 19: if flow_active then  # Check if flow is active

 20:  if is_in_centre() then # Check if agent is in the centre

 21:   reward ← reward + reward_centre  # Add centre reward to total

 22:   if is_valid_move (direction, amplitude/1.5) then # Try to move against the flow

 23:   move_agent (flow_direction, amplitude)

 24:   else

 25:   update_radius () # Update radius if move is not valid

 26:   end if

 27:  end if

 28: end if

 29: Update step counters and check termination

 30: increment_step_counter ()  # Increase the step counter

 31: if reached_target () then  # Check if target is reached

 32:  reward ← reward_target_reached  # Add target reached reward

 33:  mark_as_terminated ()  # Mark episode as terminated

 34: else if radius_too_small () then  # Check if radius is too small

 35:  reward ← reward_termination  # Add termination reward

 36:  reset_radius()  # Reset radius for new episode

 37: else

 38:  reward ← calculate_distance_reward () # Calculate reward based on distance

 39: end if

 40: Return observations, reward, done and info  # Return step results

 41: return get_observations(), reward, is_done(), get_info()


