Microchannel fabrication
The microfluidic channels used in this study were produced by standard soft lithography with PDMS. Each device was cast from a master mould patterned lithographically with SU-8 negative photoresist on a 4-inch silicon wafer, which was then placed inside a Petri dish. The thermocurable PDMS prepolymer was prepared by mixing the base with the curing agent at a weight ratio of 10:1. After degassing under vacuum, the prepolymer was cast onto the mould and crosslinked by thermal curing for 2 h at 85 °C. The cured PDMS was then cut and peeled off the channel mould, and the inlet and outlet ports were punched with a 0.75-mm punch. The puncher was mounted at an angle of 60° so that fluid would not enter the channel at an angle that might break the plasma-treatment bond between the PDMS layers, which could cause leakage and malfunction of the device. A second PDMS layer was bonded onto the PDMS channel by plasma treatment for 1 min, followed by curing at 85 °C for 2 h. A PZT was attached to the PDMS channel wall orthogonal to the aneurysm cavity. Flow through the channel was driven by a pulsatile or continuous-flow pump through tubes attached to the inlet and outlet. To avoid an impedance mismatch between the outer channel wall and the surrounding air, the entire system was placed in a water container.
Imaging pipeline
The imaging process began with an inverted microscope, which streamed live images to our processing pipeline (Fig. 2a). We segmented the initial image into channels and obstacles using the Segment Anything Model56, chosen for its ability to accurately differentiate complex visual elements. Following segmentation, we refined and cleaned the image with a morphological closing operation and adaptive thresholding to identify the microrobots, which appeared black under the microscope. We then applied detection and tracking algorithms to identify the agent (microrobot), calculate its centre, draw a bounding box around it and initialize the channel and spatial reliability tracker (CSRT), which was selected for its robust tracking in dynamic and cluttered environments. When tracking was lost, the system quickly re-detected the agent and reinitialized tracking, minimizing computation and enhancing real-time feedback. In the processed images, microrobots are marked in blue and target locations in red. Positive rewards were assigned when the microrobot progressed towards the target, whereas movement away from it incurred a penalty.
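The detection-and-tracking stage can be sketched as follows. This is a minimal illustration assuming an OpenCV build with the contrib tracking module; the detection thresholds are placeholders and the upstream SAM-based segmentation is omitted, so the helper operates directly on the camera frame.

# Minimal sketch of the detection/re-detection and CSRT tracking stage
# (assumes opencv-contrib-python; SAM-based segmentation is applied upstream).
import cv2

tracker = None  # CSRT tracker, (re)initialized whenever tracking is lost

def detect_microrobot(gray):
    """Detect the dark microrobot and return its bounding box, or None."""
    # Adaptive thresholding: microrobots appear black under the microscope
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 5)
    # Morphological closing removes small holes and speckle
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return cv2.boundingRect(max(contours, key=cv2.contourArea))  # (x, y, w, h)

def track_step(frame):
    """Return the centre of the microrobot for one camera frame."""
    global tracker
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    box = None
    if tracker is not None:
        ok, box = tracker.update(frame)
        if not ok:
            tracker, box = None, None  # tracking lost: fall back to re-detection
    if tracker is None:
        box = detect_microrobot(gray)
        if box is None:
            return None
        tracker = cv2.TrackerCSRT_create()
        tracker.init(frame, tuple(int(v) for v in box))
    x, y, w, h = (int(v) for v in box)
    cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)  # agent marked in blue
    return (x + w // 2, y + h // 2)  # centre used by the RL state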
Reward function
This simulated setting enabled us to refine and iterate our reward functions and control strategies efficiently, without the continuous need for live experimental adjustments. This approach streamlines the development process, facilitating more precise and effective advances in microrobot control. The reward function is designed to incentivize the microrobot to efficiently reach designated target points while navigating around obstacles and taking into account various shapes and layouts of the channels. Formally, the reward R at time step t is defined by the following criteria:
$${R}_{t}=\begin{cases}\alpha, & \text{if target reached,}\\ -\beta, & \text{if a collision occurs,}\\ -\gamma f\left({d}_{t}\right), & \text{otherwise,}\end{cases}$$
where α, β and γ are coefficients that weight the importance of each component in the reward function. The term dt denotes the Euclidean distance to the target point at time step t, and f(d) = 1/(d + ε) is a real, monotonic function that translates this distance into a penalty (or reward), where ε is a small positive constant used to avoid division by zero. This function was chosen specifically to relate the reward inversely to the distance, thereby encouraging the microrobot to minimize it (Supplementary Note 2).
After extensive experimentation, we identified the optimal settings for our system: α = 10, β = 2 and γ = 0.1. Our simulation results confirm that MBRL effectively learns advanced navigation tactics through interactions with the environment, and can thereby master complex navigational strategies in intricate settings such as vascular systems, mazes and racetracks.
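As an illustration only, the base reward can be written as the following sketch; the Boolean target and collision tests stand in for the image-based checks in our environment, and the coefficients are those reported above.

# Sketch of the base reward using the coefficients reported above; the Boolean
# flags stand in for the image-based target and collision checks.
ALPHA, BETA, GAMMA = 10.0, 2.0, 0.1
EPS = 1e-6  # small constant to avoid division by zero

def base_reward(d_t, target_reached, collision):
    if target_reached:
        return ALPHA
    if collision:
        return -BETA
    return -GAMMA * (1.0 / (d_t + EPS))  # distance-based term, f(d) = 1/(d + EPS)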
The adapted reward function for the flow environment, \(f(d_t,\mathbf{X}_t,\mathbf{A}_t)\), is defined as follows:
$$f\left({d}_{t},\mathbf{X}_{t},\mathbf{A}_{t}\right)=\begin{cases}-\mu, & \text{if }\mathbf{X}_{t}\text{ is on the wall and }\mathbf{A}_{t}\text{ is in the direction of the wall,}\\ -\kappa, & \text{if }\mathbf{X}_{t}\text{ is central in the channel,}\\ \dfrac{1}{{d}_{t}+\epsilon }-\lambda, & \text{otherwise.}\end{cases}$$
The components of the reward function are defined as follows. A step penalty λ is applied at every step to encourage the microrobot to reach the target quickly. The wall-sliding penalty −μ is imposed when the microrobot is in contact with a wall and the chosen action pushes it into that wall, which permits sliding along the wall while discouraging pushing against it. The inverse-distance reward 1/(d + ε) provides a continuous incentive for the microrobot to move closer to the target, with stronger gradients as the distance decreases. In addition, we introduced a centring penalty −κ for when the microrobot is located too centrally in the channel. Together, these terms incentivize the microrobot to navigate close to the channel walls, where drag forces are substantially reduced owing to the no-slip condition.
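A corresponding sketch of the adapted distance term is given below; the penalty values and the Boolean geometry checks (wall contact, pushing into the wall, channel centring) are hypothetical placeholders for the checks implemented in the flow environment.

# Sketch of the flow-adapted term f(d_t, X_t, A_t); MU, KAPPA and LAMBDA stand
# for the wall-sliding, centring and step penalties.
MU, KAPPA, LAMBDA = 1.0, 0.5, 0.01  # illustrative values only
EPS = 1e-6

def flow_reward_term(d_t, on_wall, pushing_into_wall, in_channel_centre):
    if on_wall and pushing_into_wall:
        return -MU      # discourage pushing against the wall (sliding is allowed)
    if in_channel_centre:
        return -KAPPA   # keep the agent near the walls, where drag is lower
    return 1.0 / (d_t + EPS) - LAMBDA  # inverse-distance reward minus step penalty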
Training ratio
We investigated a critical parameter known as the training ratio, which denotes the number of steps trained in imagination (within the world model) for each step taken in the physical environment. This approach capitalizes on the world model's ability to simulate numerous hypothetical scenarios, reducing the need for extensive physical interactions. The key advantage of a higher training ratio is its potential to enhance the efficiency of the learning process: the agent learns from imagined experiences, which are both quicker and less costly to generate than physical interactions. Ideally, a higher training ratio reduces the number of environmental interactions required to achieve convergence.
We experimented with various training ratios to assess their impact on learning efficiency and performance. For example, a training ratio of 10:1 means that for every experimental step, the agent performs ten steps in the dreamed environment. This strategy enables the agent to accumulate more experience and optimize its policy without the time and resource constraints associated with physical training. Conversely, a lower training ratio, such as 1:1, entails that the agent performs an equal number of physical and simulated steps, which slows down the learning process but provides more accurate feedback from the physical environment.
Our experiments demonstrated that higher training ratios, such as 1,000:1, dramatically reduced the number of interactions with the physical environment required to achieve convergence. The results indicate that higher ratios led to faster convergence, whereas lower ratios often failed to converge. To maximize the benefits of high training ratios, we developed a parallel script that runs physical-environment interactions and world-model training on separate threads, as sketched below. This results in an adaptive training ratio that adjusts dynamically with the agent's performance in the physical environment.
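The parallel scheme can be sketched as follows; env, policy, world_model and replay_buffer are placeholders for the experimental environment interface and the world-model trainer, so the effective training ratio simply becomes however many imagined updates complete between two real steps.

# Simplified sketch of running physical-environment interaction and world-model
# training on separate threads; the effective training ratio adapts to however
# many imagined updates fit between two real environment steps.
import threading
import queue

experience = queue.Queue()   # (s, a, r, s') transitions from the physical set-up
stop = threading.Event()

def environment_thread(env, policy):
    """Collect real interactions and push each transition to the shared queue."""
    obs = env.reset()
    while not stop.is_set():
        action = policy(obs)
        next_obs, reward, done = env.step(action)   # placeholder 3-tuple interface
        experience.put((obs, action, reward, next_obs))
        obs = env.reset() if done else next_obs

def world_model_thread(world_model, replay_buffer):
    """Train continuously in imagination, as fast as the hardware allows."""
    while not stop.is_set():
        while not experience.empty():               # drain new real transitions
            replay_buffer.append(experience.get())
        if replay_buffer:
            world_model.train_step(replay_buffer)   # one imagined training update

# threading.Thread(target=environment_thread, args=(env, policy), daemon=True).start()
# threading.Thread(target=world_model_thread, args=(wm, buffer), daemon=True).start()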
RL implementation
We formalized the problem as a Markov decision process comprising the state space, action space, reward function and transition dynamics. The state, action and reward triplet at time t, (St, At, Rt), together with the transition dynamics T, enables the RL agent to learn optimal policies for microrobot control through continuous interaction with the environment. The state St incorporates visual information captured by cameras, including the current position extracted from the image coordinates \(({x}_{t}^\mathrm{a},{y}_{t}^\mathrm{a})\) and the target location coordinates \(({x}_{t}^\mathrm{t},{y}_{t}^\mathrm{t})\), which represent the desired target position that the microrobot aims to reach:
$$\mathbf{S}_t=\left\{I,{x}_{t}^\mathrm{a},{y}_{t}^\mathrm{a},{x}_{t}^\mathrm{t},{y}_{t}^\mathrm{t}\right\},$$
where I encapsulates the processed camera feed at time t. We used a convolutional neural network to extract meaningful features from the images, such as the size, shape and interactions of the microrobots: I = CNN(Imaget).
The action space At defines the set of all possible actions the control system can execute at any given time. In our setting, these actions correspond to the drive settings of the PZTs:
$$\mathbf{A}_t=\left[\left(\;{f}_{1},{A}_{1}\right),\left(\;{f}_{2},{A}_{2}\right),\ldots ,\left(\;{f}_{n},{A}_{n}\right)\right],$$
where f is the frequency of the ultrasonic travelling wave, A is the peak-to-peak voltage amplitude and n is the number of transducers.
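In a Gymnasium-style interface, this action space could be expressed as in the sketch below; the number of transducers and the frequency and voltage bounds are purely illustrative and are not the values used in our experiments.

# Illustrative Gymnasium-style action space: one (frequency, peak-to-peak
# amplitude) pair per PZT. All bounds below are placeholders.
import numpy as np
from gymnasium import spaces

n_transducers = 4                      # hypothetical number of PZTs
f_low, f_high = 50e3, 500e3            # frequency bounds in Hz (illustrative)
a_low, a_high = 0.0, 20.0              # peak-to-peak voltage bounds (illustrative)

action_space = spaces.Box(
    low=np.tile(np.array([f_low, a_low], dtype=np.float32), (n_transducers, 1)),
    high=np.tile(np.array([f_high, a_high], dtype=np.float32), (n_transducers, 1)),
    dtype=np.float32,
)  # shape (n_transducers, 2): one (f_i, A_i) pair per transducer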
The transition dynamics T(St, At) describe how the state of the system changes in response to an action. This function is unknown to the RL algorithm and must be inferred through interactions with the environment. In our setting, the transition dynamics represent the physical changes in the system state resulting from an activated PZT:
$$\mathbf{S}_{t+1}=\mathbf{S}_t+\Delta t\times\text{dynamics}(\mathbf{S}_t,\mathbf{A}_t),$$
where Δt is the time step and dynamics(St, At) is a function modelling the physics of microrobot motion under ultrasound stimulation, which was inferred from the differences between consecutive images (states).
World model learning
The world model processes the state St into a latent state zt using an encoder–decoder architecture. This model predicts future latent states and rewards based on the current latent state and actions, and it is trained continually on new samples (St, At, Rt). The key components are as follows (a schematic code sketch is given after the list):
- Encoder–decoder architecture: this architecture compresses high-dimensional observations into a compact latent space for prediction and control. The encoder qϕ maps an observation ot, together with the deterministic state ht, to a latent state zt, where ϕ is a parameter vector shared between the encoder and all other world model components:
$$\mathbf{z}_t \sim{q}_{\phi }(\mathbf{z}_t\mid\mathbf{h}_{t},{o}_{t}).$$
The decoder (D) reconstructs the observation from the latent state: \({\hat{o}}_{t}=D(\mathbf{z}_{t})\).
- Dynamics network: this network predicts the future states of the microrobots based on their current state and actions, following the principle of a recurrent neural network. It maintains a deterministic state ht that is predicted by the recurrent neural network from the previous deterministic state ht−1, the previous latent state zt−1 and the previous action at−1:
$$\mathbf{h}_t ={f}_{\phi }(\mathbf{h}_{t-1},\mathbf{z}_{t-1},\mathbf{a}_{t-1}).$$
- Reward predictor: this component predicts the rewards associated with different actions, aiding the agent in optimizing its behaviour. The reward predictor pϕ estimates the reward rt based on the deterministic state ht and the latent state zt:
$${\hat{r}}_{t} \sim{p}_{\phi }({\hat{r}}_{t}\mid\mathbf{h}_t,\mathbf{z}_t).$$
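A schematic code sketch of these components is given below. The layer sizes, latent dimension and the use of a GRU cell and a fully connected encoder (in place of the convolutional network described above) are illustrative assumptions rather than the architecture actually used in this work.

# Schematic sketch of the world-model components (encoder/decoder, recurrent
# dynamics, prior and reward head). Dimensions and layers are illustrative only.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, obs_dim=64 * 64, latent_dim=32, hidden_dim=256, action_dim=8):
        super().__init__()
        # Encoder q_phi: (o_t, h_t) -> latent state z_t (mean and log-std)
        self.encoder = nn.Sequential(nn.Linear(obs_dim + hidden_dim, 256), nn.ELU(),
                                     nn.Linear(256, 2 * latent_dim))
        # Decoder D: latent state z_t -> reconstructed observation
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ELU(),
                                     nn.Linear(256, obs_dim))
        # Dynamics network f_phi: (h_{t-1}, z_{t-1}, a_{t-1}) -> h_t
        self.dynamics = nn.GRUCell(latent_dim + action_dim, hidden_dim)
        # Prior over z_t given h_t, used when rolling out in imagination
        self.prior_head = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ELU(),
                                        nn.Linear(256, 2 * latent_dim))
        # Reward predictor p_phi: (h_t, z_t) -> predicted reward
        self.reward_head = nn.Sequential(nn.Linear(hidden_dim + latent_dim, 256),
                                         nn.ELU(), nn.Linear(256, 1))

    @staticmethod
    def _sample(stats):
        mean, log_std = stats.chunk(2, dim=-1)
        return mean + log_std.exp() * torch.randn_like(mean)

    def encode(self, obs, h):
        return self._sample(self.encoder(torch.cat([obs, h], dim=-1)))     # z_t

    def decode(self, z):
        return self.decoder(z)                                             # reconstructed o_t

    def prior(self, h):
        return self._sample(self.prior_head(h))                            # imagined z_t

    def dynamics_step(self, h_prev, z_prev, a_prev):
        return self.dynamics(torch.cat([z_prev, a_prev], dim=-1), h_prev)  # h_t

    def predict_reward(self, h, z):
        return self.reward_head(torch.cat([h, z], dim=-1))                 # predicted r_t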
Latent imagination and policy optimization
The agent generates future trajectories within the latent space and uses these imagined trajectories to train the policy and value networks. This reduces the need for physical interactions and makes learning more efficient. The main steps are as follows:
(1) Trajectory sampling: generate possible future trajectories by simulating the environment with the transition model, ht = fϕ(ht−1, zt−1, at−1). The imagined trajectories start at true model states st drawn from the agent's replay buffer and are then rolled forward in imagination by the transition model. These trajectories are generated much faster than environment interactions, and their number is controlled by a parameter called the training ratio. We developed a multi-threaded approach in which the latent model runs continuously in a separate process, without a fixed ratio to the physical-environment interactions.
(2) Trajectory evaluation: assess the quality of each trajectory based on the accumulated rewards predicted by the reward model. The reward predictor \({\hat{r}}_{t}\sim{p}_{\phi }({\hat{r}}_{t}\mid\mathbf{h}_{t},\mathbf{z}_{t})\) estimates the reward of each state.
(3) Policy and value network training: the actor–critic component is trained to maximize the expected imagined reward \(E\left(\sum_{t=0}^{\infty }\gamma^{t}{r}_{t}\right)\) with respect to a specific policy. The evaluated trajectories are used to update the policy and value networks, which dictate the agent's actions in the physical environment.
This training loop leverages the predicted latent states and rewards, substantially enhancing sample efficiency by reducing the dependence on real-world interactions and relying on a very compact latent representation.
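A schematic rollout in imagination, building on the world-model sketch above, could look as follows; policy is a placeholder for the actor network, and a plain discounted sum is used here instead of the λ-returns typically used in practice.

# Schematic latent-imagination rollout; `policy` stands in for the actor network
# and the return is a plain discounted sum of predicted rewards.
import torch

def imagined_return(world_model, policy, h, z, horizon=15, gamma=0.99):
    """Roll the world model forward in latent space and accumulate rewards."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(torch.cat([h, z], dim=-1))    # action from the latent state
        h = world_model.dynamics_step(h, z, a)   # h_t = f_phi(h_{t-1}, z_{t-1}, a_{t-1})
        z = world_model.prior(h)                 # imagined latent state z_t
        r = world_model.predict_reward(h, z)     # predicted reward
        total = total + discount * r
        discount *= gamma
    return total  # maximized by the actor; the critic is regressed towards such returns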
Algorithm 1
Microrobot MBRL training
Require: Configuration, frames, CSRT and segmented mask
Ensure: Environment set-up, reward calculation and state update
1: Initialize environment with configuration parameters # Set environment
2: Initialize RL state s0 # Initialize RL state
3: Downsize image to 64 × 64 px # Reduce image size
4: while Episodes < Total_Episodes do # Main training loop
5: frame ← get_camera_frame # Capture frame
6: cleaned_frame ← segment_frame # Segment frame
7: bubble_size ← detect_cluster # Detect cluster size
8: truncated, terminated ← False, False # Initialize flags
9: if bubble_size > area_threshold then # Check bubble size
10: Track microrobot with CSRT # Track microrobot
11: agent_position ← get_agent_pos # Get agent position
12: if agent_position ≈ target_position then # Near target
13: r ← Target reward, terminated ← True # Assign reward, end episode
14: else if agent_position in Channel_walls then # Collision detected
15: r ← Collision penalty, terminated ← True # Assign penalty, end episode
16: end if
17: else
18: r ← Distance_based reward # Distance reward
19: end if
20: if steps > threshold then # Check step limit
21: truncated ← True # Mark truncated
22: end if
23: Reset Collisions if necessary # Reset if collisions
24: Deactivate PZT, adjust position and recheck collisions # Execute and assess action
25: Apply action to the PZTs # Execute action
26: Observe environment and compute reward # Observe outcome
27: Check termination, return (obs, reward, done) # Return step results
28: end while
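For orientation, the main loop of Algorithm 1 maps onto a Gymnasium-style step function roughly as sketched below; the helper methods mirror the pseudocode names and are placeholders for the hardware, segmentation and tracking interfaces, and the fallback when no cluster is detected is an assumption.

# Rough Gymnasium-style rendering of one step of Algorithm 1; helper methods
# mirror the pseudocode and stand in for hardware and image-processing code.
import numpy as np

def step(self, action):
    self.apply_action(action)                       # drive the PZT(s)
    frame = self.get_camera_frame()                 # capture frame
    cleaned = self.segment_frame(frame)             # SAM-based segmentation
    bubble_size = self.detect_cluster(cleaned)      # microbubble cluster size
    terminated, truncated = False, False
    if bubble_size > self.area_threshold:
        pos = self.track_with_csrt(frame)           # CSRT tracking of the agent
        if np.linalg.norm(pos - self.target_position) < self.target_tolerance:
            reward, terminated = self.target_reward, True       # target reached
        elif self.in_channel_walls(pos):
            reward, terminated = -self.collision_penalty, True  # collision
        else:
            reward = self.distance_based_reward(pos)            # distance shaping
    else:
        reward = self.distance_based_reward(self.last_position) # no cluster detected
    self.steps += 1
    if self.steps > self.step_threshold:            # step limit reached
        truncated = True
    obs = self.downsize(cleaned)                    # 64 x 64 px observation
    return obs, reward, terminated, truncated, {}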
Algorithm 2
Microrobot flow environment training and simulation
Require: Config, direction and amplitude
Ensure: Environment set-up, reward calculation and state update
1: Initialize environment # Set up environment parameters
2: Set reward_centre and flow_direction from config
3: if reward_centre or flow_direction then
4: Initialize flow # Initialize flow
5: end if
6: reward ← 0 # Initialize reward for the step
7: if is_valid_move (direction, amplitude) then # Check if the move is valid
8: move_agent (direction, amplitude) # Move the agent
9: else if check_collision() then # Check for collision
10: if is_valid_move (direction, amplitude/2) then # Try moving with reduced amplitude
11: move_agent (direction, amplitude/2)
12: else
13: reward ← reward_collision # Apply collision penalty
14: update_radius() # Update the radius after collision
15: end if
16: else
17: move_agent (direction, small_amplitude) # Move with a small amplitude
18: end if
19: if flow_active then # Check if flow is active
20: if is_in_centre() then # Check if agent is in the centre
21: reward ← reward + reward_centre # Add centre reward to total
22: if is_valid_move (direction, amplitude/1.5) then # Try to move against the flow
23: move_agent (flow_direction, amplitude)
24: else
25: update_radius () # Update radius if move is not valid
26: end if
27: end if
28: end if
29: Update step counters and check termination
30: increment_step_counter () # Increase the step counter
31: if reached_target () then # Check if target is reached
32: reward ← reward_target_reached # Add target reached reward
33: mark_as_terminated () # Mark episode as terminated
34: else if radius_too_small () then # Check if radius is too small
35: reward ← reward_termination # Add termination reward
36: reset_radius() # Reset radius for new episode
37: else
38: reward ← calculate_distance_reward () # Calculate reward based on distance
39: end if
40: Return observations, reward, done and info # Return step results
41: return get_observations(), reward, is_done(), get_info()