Simulation environment
A simulation environment describing a single-column bind–elute process (see Fig. 1) was implemented in Python using the Simulation Testbed for Liquid Chromatography software (STLC)13. The reinforcement learning environment was implemented using PyTorch and OpenAI Gym14. The physical system recreated in the simulation comprises a vessel containing unprocessed media, a vessel containing elution buffer, and a vessel containing wash buffer. Individual volumetric flow rates from these vessels can be set, and their outlets are combined in a common vessel, referred to as the flow controller. The column inlet is fed from this common vessel at the combined volumetric flow rate of the individual upstream vessels. At the column outlet, a valve directs the processed media into either a product collection vessel or a waste collection vessel, designed to contain the purified media and the waste, respectively.

Schematic of the simulation environment containing chromatography system and column simulations. Arrows indicate direction of flow.
To represent the physical system, the simulation combines a set of differential equations describing the flow between the various tanks, valves, and columns. The column itself is modeled using a mechanistic model known as the general rate model (GRM), which is composed of a set of partial differential equations based on one-dimensional convective and dispersive mass transfer15. The equations describe the component concentrations in the interstitial mobile phase, the stagnant particle phase, and the solid (stationary) phase as functions of space and time. Various parameters can be set to reflect the column-, particle-, and resin-specific characteristics. The set of equations is solved numerically using a technique called orthogonal collocation. This procedure discretizes the spatial domain of the equations and converts them into a system of ordinary differential equations. The column equations and general system equations are combined and solved in the temporal domain using a backward Euler time-stepping scheme provided by the STLC package. To speed up simulation times, we use a simplified abstraction of what we refer to as spent media or impurities: a symbolic lumped concentration representing the substances that one wishes to separate from the product in this process. The tracked compounds in our system are indexed by \(i = 0,\dots ,3\), where 0 represents the concentration of salt, 1 represents the protein, 2 represents the impurities, and 3 represents the wash buffer.
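For reference, this component indexing can be written down as a few Python constants (the names are ours and not part of the STLC package):

```python
# Component indexing used throughout the simulation (names are illustrative).
SALT, PROTEIN, IMPURITY, WASH_BUFFER = 0, 1, 2, 3
N_COMPONENTS = 4
```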
The relationship between the flow rates and liquid volumes in parts of the system is governed by the following equation:
$$\frac{d{V}_{i}}{dt}={F}_{i}$$
(1)
where \({V}_{i}\) represents the volume of liquid contained in vessel \(i\), for \(i=1,\dots ,{N}_{\mathrm{v}}\) with \({N}_{\mathrm{v}}\) vessels, and \({F}_{i}\) represents the flow rate for vessel \(i\). All vessels upstream of the column are combined into the column inlet flow, \({F}_{\mathrm{in}}={\sum }_{i=1}^{3}{F}_{i}\).
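As an illustration, Eq. (1) and the inlet summation can be advanced with a simple explicit update; the sketch below uses hypothetical names and is independent of the actual STLC implementation:

```python
import numpy as np

def step_vessels(volumes, flows, dt):
    """One explicit Euler step of Eq. (1): dV_i/dt = F_i.

    volumes -- vessel volumes V_i [m^3]
    flows   -- signed flow rates F_i for each vessel [m^3/s]
               (negative for a vessel that is being drained)
    dt      -- timestep [s]
    """
    return volumes + flows * dt

# Upstream vessels (feed, elution buffer, wash buffer) feed the column:
F_upstream = np.array([1.0e-6, 0.0, 0.0])   # delivered flow rates [m^3/s]
F_in = F_upstream.sum()                     # column inlet flow, F_in = sum_i F_i
V = step_vessels(np.array([1e-3, 1e-3, 1e-3]), -F_upstream, dt=1.0)
```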
To model the column, consider a spatial variable along the axis, \(z\in {\Omega }_{z}=[0,L]\), where \(L\) is the column length. Resin particles with radius \({r}_{\mathrm{p}}\) and radial axis, \(r\in {\Omega }_{r}=\left[0,{r}_{\mathrm{p}}\right],\) are uniformly distributed along the column axis. To advance the model in time we let \(t\in \left[0,{t}_{\mathrm{f}}\right],\) for some simulation time \({t}_{\mathrm{f}}\). The GRM used in this study describes convective mass transport and axial dispersion in the column volume using the following equation:
$$\frac{\partial {c}_{i}}{\partial t}=-u\frac{\partial {c}_{i}}{\partial z}+{D}_{\mathrm{ax},i}\frac{{\partial }^{2}{c}_{i}}{\partial {z}^{2}}-\frac{1-{\epsilon }_{\mathrm{c}}}{{\epsilon }_{\mathrm{c}}}\frac{3{k}_{\mathrm{f},i}}{{r}_{\mathrm{p}}} \left({c}_{i}-{c}_{\mathrm{p},i}\left(\cdot ,\cdot ,{r}_{\mathrm{p}}\right)\right)$$
(2)
The mobile phase concentration of component \({c}_{i}\left(t,z\right)\) is given for \(i=0,\dots ,{N}_{\mathrm{c}}\), for \({N}_{\mathrm{c}}\) components. The interstitial velocity \(u\) is related to the flow rate by \(u={F}_{\mathrm{in}}/({A}_{\mathrm{c}}{\epsilon }_{\mathrm{c}})\), where \({A}_{\mathrm{c}}\) is the column cross-sectional area and \({\epsilon }_{\mathrm{c}}\) the column porosity. Furthermore, \({D}_{\mathrm{ax},i}\) is the axial dispersion coefficient and \({k}_{\mathrm{f},i}\) is the film mass transfer coefficient. The final term represents the mass flux into the pore phase, with the pore phase concentration denoted by \({c}_{\mathrm{p},i}(t,z,r)\). The mass balance in the pore phase is given by
$$\frac{\partial {c}_{\mathrm{p},i}}{\partial t}={D}_{\mathrm{p},i} \left(\frac{{\partial }^{2}{c}_{\mathrm{p},i}}{\partial {r}^{2}}+\frac{2}{r}\frac{\partial {c}_{\mathrm{p},i}}{\partial r}\right)-\frac{1-{\epsilon }_{\mathrm{p}}}{{\epsilon }_{\mathrm{p}}}\frac{\partial {q}_{i}}{\partial t}$$
(3)
where \({D}_{\mathrm{p},i}\) is the pore diffusion coefficient and \({\epsilon }_{\mathrm{p}}\) represents the particle porosity. To represent binding interactions between solute components and the stationary phase, we use the mobile phase modulator Langmuir model16. The model allows modulation of component adsorption and desorption rates by the salt concentration. This dependency is evaluated for each individually modelled particle along the column axis. The interaction is governed by the following equation:
$$\frac{\partial {q}_{i}}{\partial t}={k}_{\mathrm{a},i}{e}^{\upgamma {c}_{\mathrm{p},0}}{c}_{\mathrm{p},i}{(q}_{\mathrm{max},i}-{q}_{i})-{k}_{\mathrm{d},i}{c}_{\mathrm{p},0}^{\upbeta }{q}_{i}$$
(4)
In the model, \({k}_{\mathrm{a},i}\) and \({k}_{\mathrm{d},i}\) denote the adsorption and desorption rate constants, respectively; \({q}_{\mathrm{max},i}\) represents the adsorption capacity; and \(\upgamma\) and \(\upbeta\) are modulation constants. Of these modulation constants, only β was used; it was set to 1 to obtain the desired interaction in our system model (a code sketch of this binding term is given after Eq. (8) below). The salt component in the system, \({c}_{\mathrm{p},0}, {q}_{0}\), is inert; thus \(\frac{\partial {q}_{0}}{\partial t}=0\). The upstream mass flux is transported into the column via the inlet boundary condition:
$$u{c}_{\mathrm{inj},i}\left(t\right)=u{c}_{i}\left(t,0\right)-{D}_{\mathrm{ax},i}\frac{\partial {c}_{i}}{\partial z}\left(t,0\right)$$
(5)
where \({c}_{\mathrm{inj},i}\) defines the concentration of the injected components. The outlet boundary condition is defined as follows:
$$\frac{\partial {c}_{i}}{\partial z}\left(t,L\right)= 0$$
(6)
Mass flux into the particles is represented by the following relation:
$${\epsilon }_{\mathrm{p}}{D}_{\mathrm{p},i}\frac{\partial {c}_{\mathrm{p},i}}{\partial r}\left(t,z,{r}_{\mathrm{p}}\right)={k}_{\mathrm{f},i} \left({c}_{i}(t,z)-{c}_{\mathrm{p},i}(t,z,{r}_{\mathrm{p}})\right)$$
(7)
Finally, spherical symmetry in the resin particles gives the following condition:
$$\frac{\partial {c}_{\mathrm{p},i}}{\partial r}\left(t,z,0\right)= 0$$
(8)
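Returning to the binding model, the right-hand side of Eq. (4) can be evaluated pointwise for all components at a given collocation point. The following is a minimal sketch under the parameterization described above (function and variable names are ours, not part of the STLC package):

```python
import numpy as np

def mpm_langmuir_rate(c_p, q, k_a, k_d, q_max, gamma=0.0, beta=1.0):
    """Right-hand side of Eq. (4) evaluated for all components at one point.

    c_p, q          -- pore-phase and adsorbed concentrations per component
    k_a, k_d, q_max -- adsorption/desorption rate constants and capacities
    gamma, beta     -- mobile phase modulator constants (beta = 1 in this study)
    Component 0 is the salt and is treated as inert, i.e. dq_0/dt = 0.
    """
    salt = c_p[0]
    dq = (k_a * np.exp(gamma * salt) * c_p * (q_max - q)
          - k_d * salt**beta * q)
    dq[0] = 0.0  # inert salt component
    return dq
```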
The equations are combined into a single semi-discrete system of equations, with the spatial derivatives discretized into \({N}_{z}\) and \(N_r\) elements for \(z\) and \(r\), respectively, using orthogonal collocation17. Each variable is then approximated by interpolating polynomials of order \({N}_{\mathrm{p}}\) in each element with \({C}^{1}\)-continuous boundaries. To solve the semi-discrete system, a fixed-point iteration using the backward Euler method was applied:
$${y}_{k+1}={y}_{k}+hf\left({y}_{k+1},{t}_{k+1}\right).$$
(9)
Here, \(h\) denotes the timestep length and \(k\) indexes the discretized time domain, \(k=1,\dots ,{N}_{t}\), for \({N}_{t}\) total timesteps. The semi-discrete system is merged into a single sparse matrix and solved using LU-factorization.
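The time-stepping of Eq. (9) can be sketched as follows; this is a minimal illustration, not the STLC implementation, and the linear variant assumes a (locally) linearized semi-discrete system:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def backward_euler_fixed_point(f, y_k, t_next, h, n_iter=50, tol=1e-10):
    """Solve y_{k+1} = y_k + h * f(y_{k+1}, t_{k+1}) (Eq. 9) by fixed-point iteration."""
    y_next = y_k.copy()                      # initial guess: previous solution
    for _ in range(n_iter):
        y_new = y_k + h * f(y_next, t_next)
        if np.linalg.norm(y_new - y_next) < tol:
            break
        y_next = y_new
    return y_new

def backward_euler_linear_step(A, y_k, h):
    """For a linearized system dy/dt = A y, Eq. (9) reduces to a sparse solve of
    (I - h A) y_{k+1} = y_k, performed here with an LU factorization."""
    n = A.shape[0]
    lu = spla.splu((sp.identity(n, format="csc") - h * A).tocsc())
    return lu.solve(y_k)
```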
The simulated system exposes the control surfaces of the modeled physical system. This allows an operator to set the volumetric flow rates from the upstream vessels using the flow controller. The simulator calculates the flow rate into the column and the average flow rate of the phases migrating through the column. The column inlet concentrations are derived from the initial vessel concentrations \({c}_{\mathrm{init},i}\) and the flow rates in the flow controller, \({c}_{\mathrm{in},i}=\frac{{F}_{i}{c}_{\mathrm{init},i}}{{F}_{\mathrm{in}}}\). The column simulation continuously calculates the concentration of the components in the column and, as the solution reaches the column outlet, the output valve allows the operator to direct the flow toward the product or waste vessels.
The simulation system observations are recorded in a time-indexed state vector. This vector maintains the current and historical records of component concentrations in the system vessels, component concentrations along the length of the chromatography column, component solution volumes in the system vessels, and the approximate component mass contained in the column and system vessels. The flow controller and output valve are controlled programmatically: the flow controller can set the flow for each component between zero and \(2.0\times {10}^{-6}\,{\mathrm{m}}^{3}\,{\mathrm{s}}^{-1}\), and the output valve can be set to direct the flow to either the waste or the product tank.
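This control interface maps naturally onto a Gym-style environment; the sketch below illustrates one possible encoding of the action and observation spaces (the class name, observation dimension, and continuous valve encoding are illustrative assumptions, not the exact implementation):

```python
import numpy as np
import gym
from gym import spaces

class ChromatographyEnvSkeleton(gym.Env):
    """Skeleton illustrating the action/observation encoding described in the text."""

    def __init__(self, n_observations=256):
        super().__init__()
        # Three volumetric flow rates (feed, elution, wash), each in [0, 2e-6] m^3/s,
        # plus the outlet valve encoded in [0, 1] (rounded to waste vs. product).
        self.action_space = spaces.Box(
            low=np.array([0.0, 0.0, 0.0, 0.0], dtype=np.float32),
            high=np.array([2e-6, 2e-6, 2e-6, 1.0], dtype=np.float32),
        )
        # Time-indexed state vector of concentrations, volumes, and component masses.
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(n_observations,), dtype=np.float32
        )
```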
Column variability
To model column variability, a set of test columns with varying performance was created. In these columns, the initialization parameters were adjusted to create unique simulation environments. Randomized parameters for the columns were generated by adding to each initial value the product of that value, a scaling factor, and a uniformly distributed error:
$$f\left(x\right)=x+x \cdot S \cdot X$$
(10)
Here, \(X\sim U\left(-1,1\right)\) is a uniformly distributed random variable, \(S\in (0,1]\) is a scaling variable, and \(x\) is some input parameter. In the experiment, the axial dispersion, radial dispersion, column porosity, and resin porosity were subjected to this treatment. The scaling term was kept the same for all test columns, thereby setting the maximum relative deviation for each parameter (relevant initial parameters in Table 1). Environmental variation was achieved by sampling \(X\) anew for each column and applying the resulting error term.
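The procedure of Eq. (10) amounts to a few lines of Python; the nominal values and scaling factor below are placeholders rather than the values used in the experiments (those are given in Table 1):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def randomize_parameter(x, scale):
    """Eq. (10): perturb a nominal parameter x by a relative error of at most `scale`."""
    X = rng.uniform(-1.0, 1.0)   # X ~ U(-1, 1)
    return x + x * scale * X

# Example: generate one test column (nominal values are placeholders, see Table 1).
nominal = {"axial_dispersion": 1e-7, "radial_dispersion": 1e-11,
           "column_porosity": 0.37, "particle_porosity": 0.75}
test_column = {name: randomize_parameter(value, scale=0.1)
               for name, value in nominal.items()}
```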
Deep reinforcement learning
Deep RL combines the classical field of reinforcement learning with deep artificial neural networks18 to optimize the behavior of an agent in an environment. The agent is optimized by maximizing a reward function, which incentivizes desired behavior and penalizes undesired behavior. During RL, the environment starts in an initial state St and the agent selects an action At; the environment is then updated based on the action and returns the next state St+1 together with the corresponding reward Rt+1 (see Fig. 2). The agent, expressed as an artificial neural network (ANN), is continuously updated by adjusting the ANN weights using the backpropagation algorithm. The interaction between the agent and environment continues until the environment reaches a terminal state or for a set number of iterations. For the agent to learn an optimal policy, one must run many such interactions with the environment from start to finish, often referred to as episodes or trajectories.

Illustration of reinforcement learning agents training by interacting with an environment. For each state St, the agent generates an action At, which produces a state and reward for time t + 1.
Here, we used a deep RL algorithm called Twin Delayed Deep Deterministic Policy Gradient (TD3)19, which is an extension of the widely used DDPG20. TD3 is an actor–critic method (visualized in Fig. 3), meaning that the agent has two components during training: an actor and a critic, both implemented as ANNs. The actor network takes a state St and predicts the next action At; this mapping is also referred to as the agent’s policy. The critic network, in turn, estimates the future accumulated reward of a state–action pair and is used to train the actor to produce actions that yield high rewards given the current state. TD3 is a temporal difference (TD) learning algorithm21, meaning that it bootstraps the estimate of future rewards using a value function instead of running the interaction with the environment to the end and gathering all future rewards. The value function is estimated by the critic network, which iteratively improves these estimates. To train the critic network, we estimate how well it performs in a bootstrapping fashion by calculating \(T{D}_{\mathrm{error}}\), defined as:
$$T{D}_{\mathrm{error}}= {R}_{t}+ \upgamma Q\left({S}_{t+1},{A}_{t+1},{w}_{\mathrm{Q}}\right)-Q\left({S}_{t},{A}_{t},{w}_{\mathrm{Q}}\right)$$
(11)
where Rt, St, and At are the reward, state, and action at the current timestep t, and St+1, At+1 are the state and action at timestep t + 1. Q is the critic network with weights \({w}_{\mathrm{Q}}\), used to estimate the future rewards, also referred to as the Q-value, for a state–action pair. The discount factor γ down-weights rewards from far-future actions, reflecting that they are more difficult to predict than near-future ones. \(T{D}_{\mathrm{error}}\) approaches zero as the networks improve because, given a perfect Q-value estimate, the Q-value (accumulated future rewards) at timestep t equals the reward at timestep t plus the discounted Q-value at timestep t + 1. Given:

Actor–critic agent interaction with an environment. The actor part of the network predicts subsequent actions while the critic estimates the reward gained by the action–state pair and is used to train the actor network.
$$Q\left({S}_{t},{A}_{t},{w}_{\mathrm{Q}}\right)= {R}_{t}+\upgamma Q\left({S}_{t+1},{A}_{t+1},{w}_{\mathrm{Q}}\right)$$
(12)
\(T{D}_{\mathrm{error}}\) is used as the loss (cost) function to train the critic with standard deep learning optimization techniques such as gradient descent. The actor is trained to produce the optimal policy using the estimated Q-value \(\widehat{R}\) from the critic network:
$$\widehat{R}=Q\left({S}_{t},{A}_{t},{w}_{\mathrm{Q}}\right)$$
(13)
where action \({A}_{t}\) is produced by the actor network’s policy \(\uppi\), based on state \({S}_{t}\) and weights \({w}_{\uppi }\), as
$${A}_{t}=\uppi \left({S}_{t},{w}_{\uppi }\right)$$
(14)
Being implemented as an ANN, the critic’s reward estimate is differentiable, meaning that it can be used to optimize the actor: gradient ascent on the estimated reward updates the actor’s network weights \({w}_{\uppi }\) to yield a better policy. In practice, this is achieved by freezing the network weights of the critic and applying gradient descent to the negative future reward \(\widehat{R}\) estimated for the current state–action pair. The actor’s loss function is then given by
$${L}_{\mathrm{actor}}=-Q\left({S}_{t},\uppi \left({S}_{t},{w}_{\uppi }\right),{w}_{\mathrm{Q}}\right)$$
(15)
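The critic and actor losses of Eqs. (11)–(15) translate directly into a few lines of PyTorch. The sketch below assumes `actor` and `critic` are torch modules mapping states to actions and state–action pairs to Q-values, and shows the core updates for a single critic on one batch of transitions; TD3 additionally uses twin critics, target networks, clipped target-policy noise, and delayed actor updates, which are omitted here for brevity:

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, actor, batch, gamma=0.99):
    """Mean squared TD error of Eqs. (11)/(12) on a batch of transitions."""
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next = actor(s_next)                          # Eq. (14)
        td_target = r + gamma * critic(s_next, a_next)  # Eq. (12)
    return F.mse_loss(critic(s, a), td_target)

def actor_loss(critic, actor, states):
    """Eq. (15): maximize the critic's estimate of the actor's actions."""
    return -critic(states, actor(states)).mean()

# One optimization step per network (each optimizer holds only that network's parameters):
# critic_optim.zero_grad(); critic_loss(critic, actor, batch).backward(); critic_optim.step()
# actor_optim.zero_grad();  actor_loss(critic, actor, s).backward();      actor_optim.step()
```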
By leveraging bootstrapping in TD-learning, only the state, action, reward, next state, and next action for a timestep t are needed to calculate all the losses to train an RL agent. This is often referred to as state–action–reward–state–action (SARSA), which means that when the agent interacts with the environment, it is possible to save SARSA tuples for each time point and use the experiences for training22.
When training the agent, we run the simulation (or real-world interaction) for a specified number of iterations or until a terminal condition is reached. During training, SARSA experiences are recorded and placed in a memory buffer; after a predefined number of steps in the environment, the agent samples the memory buffer a fixed number of times and trains its networks with those experiences. In practice, this means that the training examples drawn can originate from both current and past episodes. This makes training more stable, because the drawn experiences are less correlated with each other than chronologically consecutive ones23.
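A minimal replay buffer capturing this idea is sketched below; here each entry stores (state, action, reward, next state, done) and the next action is recomputed by the actor during training, which is one common variant:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay memory for off-policy training."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates training examples from episode order.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```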
In our experiments, we ran multiple environments in parallel and gathered their experiences in a single memory buffer. To allow the agent to explore new possible actions, random noise was added to the actions. The noise amplitude decayed over time to incentivize exploration early on and gradually shift toward a higher degree of exploitation of the environment. By combining parallel training with action noise, the same actor explores different actions in the same environment and state, which improves learning owing to its regularizing effect.
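A decaying noise schedule of this kind might look as follows; the exponential schedule and its constants are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

def noisy_action(action, step, initial_std=0.3, decay=1e-4, min_std=0.02):
    """Add Gaussian exploration noise whose amplitude decays with the training step."""
    std = max(min_std, initial_std * np.exp(-decay * step))
    noisy = action + np.random.normal(0.0, std, size=np.shape(action))
    return np.clip(noisy, 0.0, 1.0)  # keep actions inside the normalized action range
```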
Reward function
A critical aspect of RL is the definition of a suitable reward function. In this experiment, the total reward for each state was calculated, and the previously recorded reward was subtracted from it, giving a reward that reflects the change made on a step-by-step basis:
$${\Delta }_{R} ={R}_{t}-{R}_{t-1}$$
(16)
where \({R}_{t}\) is the reward at time t, and \({R}_{t-1}\) is the reward at time t − 1.
In our chromatography use case, the reward function must consider multiple factors. First, the product should be separated into the product tank and the waste into the waste tank (see Fig. 1). This is reflected in the following terms:
$${\Delta }_{\mathrm{mab}} =\mathrm{mab}_{\mathrm{prod}}-\mathrm{mab}_{\mathrm{waste}},$$
(17)
$${\Delta }_{\mathrm{sm}} =\mathrm{sm}_{\mathrm{waste}}-\mathrm{sm}_{\mathrm{prod}},$$
(18)
where \({\Delta }_{\mathrm{mab}}\) is the mass of mAb collected in the product tank \(\mathrm{mab}_{\mathrm{prod}}\) minus the mass that ends up in the waste tank \(\mathrm{mab}_{\mathrm{waste}}\). Conversely, \({\Delta }_{\mathrm{sm}}\) is the mass of waste in the waste tank \(\mathrm{sm}_{\mathrm{waste}}\) minus the mass of waste in the product tank \(\mathrm{sm}_{\mathrm{prod}}\). We then define the mass-reward function as follows:
$${R}_{\mathrm{mass}} = {(\alpha \cdot \Delta }_{\mathrm{mab}} +{\Delta }_{\mathrm{sm}} )$$
(19)
where \({\Delta }_{\mathrm{mab}}\) is multiplied by a real-valued scaling factor \(\alpha\) to reflect the ratio between the waste and mAb (\(\alpha =5\) in our experiments). Preliminary experiments showed that training with \(\alpha =1\) made the agent ignore mAb in favor of putting all the waste in the waste tank.
Further preliminary experiments showed that training the algorithm using Rmass resulted in an agent that maximized the mass difference to achieve a higher reward than perfect separation, but with very low mAb purity. Perfect separation requires that little additional feed (total amount of processed medium) enters the system after loading, whereas Rmass can be maximized simply by keeping the feed high to accumulate more product. To improve purity, we defined an additional reward term to incentivize a higher concentration of the resulting product:
$${R}_{\mathrm{concentration}}=\mathrm{max}\left( \frac{\mathrm{mab}_{\mathrm{prod}}}{\mathrm{mab}_{\mathrm{prod}}+\mathrm{sm}_{\mathrm{prod}}} , x\right)$$
(20)
using a minimum value of x = 0.1; the term increases as the purity of mAb in the product tank increases. This yields the following intermediary reward function:
$${R}_{1} = {R}_{\mathrm{mass}}\cdot {R}_{\mathrm{concentration}}$$
(21)
When optimizing \({R}_{1}\), the agent keeps all the product in the column and never washes it out, accumulating reward only from waste correctly placed in the waste tank and leaving the product tank empty. Therefore, we introduced a penalty for mAb left in the column as follows:
$${P}_{\mathrm{remaining}}= \left\{\begin{array}{ll}{\mathrm{mab}}_{\mathrm{col}} & \text{if }{\mathrm{mab}}_{\mathrm{col}}>{\mathrm{mab}}_{\mathrm{max}}\\ 0 & \text{otherwise}\end{array}\right.$$
(22)
where \(\mathrm{mab}_\mathrm{col}\) is the mAb in the column and \({\mathrm{mab}}_{\mathrm{max}}\) is a threshold value that determines whether too much mAb remains in the column at the end of the run. Further, we added another reward term to incentivize “clean” mAb in the product tank:
$${R}_{\mathrm{clean}}=\mathrm{max}\left(\mathrm{mab}_{\mathrm{prod}}-\mathrm{sm}_{\mathrm{prod}},0\right)$$
(23)
Combining the terms yields the second intermediary reward function:
$${R}_{2} =\beta \cdot \left({R}_{\mathrm{mass}}\cdot {R}_{\mathrm{concentration}}+{R}_{\mathrm{clean}} - {P}_{\mathrm{remaining}}\cdot {I}_{\mathrm{done}}\right)$$
(24)
where the indicator \({I}_{\mathrm{done}}\) is 1 if the simulation has finished and 0 otherwise, and \(\beta\) is a real-valued scaling factor (\(\beta ={10}^{6}\) in our experiments).
Using \({R}_{2}\), the agent consistently achieved a product purity of approximately 35% but struggled to improve beyond that. To incentivize further increases in purity, we added a reward term that increases linearly with purity alone, ignoring the amount of product, once the concentration exceeds 30%:
$${R}_{\mathrm{purity}} = \mathrm{max}\left(\frac{{R}_{\mathrm{concentration}}-0.3}{0.7}, 0\right)$$
(25)
This yields our final reward function:
$$R=\beta \cdot \left({R}_{\mathrm{mass}}\cdot {R}_{\mathrm{concentration}}+{R}_{\mathrm{clean}}+\gamma \cdot {R}_{\mathrm{purity}} - {P}_{\mathrm{remaining}}\cdot {I}_{\mathrm{done}}\right)$$
(26)
where γ is a real-valued scaling factor that adjusts the importance of purity at later stages during training (γ = 100 in the experiments).
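Collecting Eqs. (17)–(26), the complete reward can be assembled in a single function. The sketch below uses our own variable names; γ from Eq. (26) is named gamma_purity to avoid confusion with the discount factor, and the threshold mabmax appears as a placeholder because its value is not restated here:

```python
def total_reward(mab_prod, mab_waste, sm_prod, sm_waste, mab_col, done,
                 alpha=5.0, beta=1e6, gamma_purity=100.0, x_min=0.1, mab_max=0.0):
    """Final reward of Eq. (26) assembled from the terms in Eqs. (17)-(25).

    Masses are read from the simulation state; mab_max is a placeholder threshold.
    """
    d_mab = mab_prod - mab_waste                               # Eq. (17)
    d_sm = sm_waste - sm_prod                                  # Eq. (18)
    r_mass = alpha * d_mab + d_sm                              # Eq. (19)
    total_prod = mab_prod + sm_prod
    purity = mab_prod / total_prod if total_prod > 0 else 0.0
    r_conc = max(purity, x_min)                                # Eq. (20)
    p_remaining = mab_col if mab_col > mab_max else 0.0        # Eq. (22)
    r_clean = max(mab_prod - sm_prod, 0.0)                     # Eq. (23)
    r_purity = max((r_conc - 0.3) / 0.7, 0.0)                  # Eq. (25)
    return beta * (r_mass * r_conc + r_clean
                   + gamma_purity * r_purity
                   - p_remaining * float(done))                # Eq. (26)

# The per-step training signal is the difference of Eq. (16):
# delta_R = total_reward(<state at t>) - total_reward(<state at t-1>)
```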
