Augmenting the genetic code with machine learning can bridge the gap between lab and market

During my first semester of graduate school at Tufts University, I sat across from a young professor who encouraged me to join his lab to work on genetic code expansion (GCE) research. I had just finished my degree in biochemistry and had no idea what he was talking about. I ended up in another research lab.

Nine years later, all I can think about is GCE. It will become one of the most important technologies of the 21st century, enabling better medicines, industrial proteins, and new applications in biotechnology. This is perhaps the most exciting potential application of machine learning and artificial intelligence (AI) in the life sciences.

Expanding the genetic code is already reshaping medicine and industry

GCE is a process by which biological systems are engineered to incorporate “non-standard” amino acids beyond the 20 that nature typically uses to build proteins. Commercial applications are already emerging. Many popular GLP-1 drugs are manufactured using a type of GCE. Antibody-drug conjugates that enable a new generation of more targeted cancer treatments also rely on it. Beyond therapeutics, incorporating non-standard amino acids has been shown to increase the thermal stability of enzymes across multiple systems, including significantly extended half-lives and resistance to aggregation at high temperatures, with profound implications for industrial biotechnology.¹

GCE and related technologies for producing proteins and peptides using new chemistry are attracting significant investment and scientific attention around the world. Companies like Japan’s PeptiDream have built substantial businesses on GCE and screening platforms for product discovery. Unnatural Products, which uses chemistry to create peptides containing non-standard amino acids, announced a $45 million Series B investment after securing deals with several major pharmaceutical companies.

It is clear that GCE is moving from an academic novelty to an industrial platform, but the real challenge—and the real opportunity—lies in the distance between these two points.

Engineering complexity makes GCE an AI problem

Taking GCE from laboratory discovery to market-ready product is not easy. It requires solving some of the most complex multivariate engineering problems in biology, and that complexity is what makes it one of the most exciting applications of machine learning of the 21st century.

Every non-standard amino acid added to the genetic code requires engineering two bespoke biomolecules: a new tRNA synthetase and a new tRNA. These engineered components must work in conjunction with the rest of the cell’s existing protein production machinery. It is a highly interconnected system with little room for error but nearly unlimited room for design variations and chemical diversity.

Researchers are building platforms specifically designed to navigate this space. OrthoRep, used in the yeast display system, provides one approach to engineer tRNA synthetases.² On the cell-free side, a recent paper by researchers at the University of Tokyo used carefully designed tRNAs and optimized translation conditions to demonstrate expansion of the genetic code to incorporate up to 32 different amino acids (adding 12 non-standard amino acids while keeping all 20 standard amino acids).³ Looking further ahead, even more ambitious GCE implementations may be possible by manipulating ribosomes and other elements of the protein production machinery.
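To give a rough sense of how quickly the design space grows, the sketch below counts the linear sequences available at a few peptide lengths with 20 versus 32 amino acids. It is back-of-the-envelope arithmetic only: the function names are illustrative, and the count ignores everything that makes real sequences viable or not. For a 10-residue peptide, the expanded alphabet gives roughly 110 times as many possible sequences.

```python
STANDARD = 20   # standard amino acids
EXPANDED = 32   # 20 standard + 12 non-standard, as in the cell-free result above


def sequence_space(alphabet_size: int, length: int) -> int:
    """Number of distinct linear sequences of a given length."""
    return alphabet_size ** length


def fold_increase(length: int) -> float:
    """How many times larger the expanded sequence space is at this length."""
    return sequence_space(EXPANDED, length) / sequence_space(STANDARD, length)


if __name__ == "__main__":
    for n in (5, 10, 15):
        print(
            f"length {n}: "
            f"{sequence_space(STANDARD, n):.3e} standard, "
            f"{sequence_space(EXPANDED, n):.3e} expanded, "
            f"{fold_increase(n):.1f}x larger"
        )
```

The ratio is (32/20) raised to the sequence length, so the gap widens with every added residue, which is part of why exhaustive screening is not an option.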

Internal data flywheel for GCE and machine learning

These new platforms are being built to screen and design GCE machinery, and they will generate vast amounts of experimental data. That data is proprietary by nature.

Unlike sequence data stored in publicly accessible repositories, this data cannot be downloaded from external sources and cannot be replicated without building the same physical experimental infrastructure. Such datasets take years to develop and can only be produced through specific laboratory systems that require significant resources.

This is why these platforms are so strategically valuable for machine learning. AI learns from what works and what doesn’t. Proteins either function under the conditions in which they are manufactured, or they do not. Enzymes either maintain stability or break down. This tight coupling between prediction and physical reality is precisely the environment in which the gains from machine learning compound most effectively.

Take immunology, for example. Antibody-drug conjugates, a rapidly growing type of cancer therapy, rely on precise chemical conjugation of therapeutic payloads to antibodies. Current designs still rely heavily on trial and error to optimize this attachment. A platform that combines high-throughput GCE with machine learning analysis can screen thousands of non-canonical amino acid variants in parallel, test their performance against real biological targets, and continuously refine predictions based on the results. This has the potential to compress years of development into months or weeks while producing conjugates with properties that could not be designed using traditional approaches.
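The screen, test, and refine cycle described above can be sketched as a simple loop. This is a toy illustration under loud assumptions, not any company's actual pipeline: the site numbers, the `ncAA_*` names, the `simulated_assay` function (a made-up stand-in for a wet-lab measurement), and the additive `predict` model are all hypothetical.

```python
import random
from itertools import product

SITES = range(1, 21)                          # hypothetical conjugation sites
NCAAS = [f"ncAA_{i}" for i in range(1, 13)]   # hypothetical non-standard amino acids


def simulated_assay(site, ncaa, rng):
    """Made-up stand-in for a lab measurement of one variant."""
    site_effect = 1.0 - abs(site - 12) / 12       # invented: sites near 12 score best
    ncaa_effect = (NCAAS.index(ncaa) % 4) / 4     # invented per-amino-acid effect
    return site_effect + ncaa_effect + rng.gauss(0, 0.05)


def predict(variant, measured):
    """Additive model: mean score seen at this site plus mean score for this amino acid."""
    site, ncaa = variant
    overall = sum(measured.values()) / len(measured)
    site_scores = [s for (st, _), s in measured.items() if st == site]
    ncaa_scores = [s for (_, aa), s in measured.items() if aa == ncaa]
    site_mean = sum(site_scores) / len(site_scores) if site_scores else overall
    ncaa_mean = sum(ncaa_scores) / len(ncaa_scores) if ncaa_scores else overall
    return site_mean + ncaa_mean


def run_campaign(rounds=5, batch=12, seed=0):
    """Screen a random first batch, then repeatedly test the top-predicted untested variants."""
    rng = random.Random(seed)
    pool = list(product(SITES, NCAAS))
    measured = {}
    for variant in rng.sample(pool, batch):       # round 1: no model yet, sample at random
        measured[variant] = simulated_assay(*variant, rng)
    for _ in range(rounds - 1):                   # later rounds: model-guided selection
        untested = [v for v in pool if v not in measured]
        untested.sort(key=lambda v: predict(v, measured), reverse=True)
        for variant in untested[:batch]:
            measured[variant] = simulated_assay(*variant, rng)
    return measured


if __name__ == "__main__":
    results = run_campaign()
    best_variant, best_score = max(results.items(), key=lambda item: item[1])
    print(f"tested {len(results)} of {len(SITES) * len(NCAAS)} variants")
    print(f"best variant so far: {best_variant}, score {best_score:.2f}")
```

Each round adds measurements that only the physical assay can supply, and the model's next picks depend on them. A real system would use a far richer model and would balance exploring new variants against exploiting known good ones, which this greedy sketch does not do.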

In industrial biotechnology, enzymes designed for carbon capture face a different, but equally tough problem. They need to perform reliably under high temperatures, high pressures, and the toxic stress of industrial exhaust gas, conditions that rapidly degrade traditional proteins. Screening enzyme variants with different non-standard amino acids under these precise conditions generates unique data on how non-standard chemistry improves stability. That data doesn’t exist anywhere else because it can only be generated by building a physical system to test it.

Models refined on these in-house datasets accelerate lab-to-market timelines by reducing the trial-and-error testing cycles that currently extend development. They also enable the design of cheaper, more complex proteins that are specifically engineered to withstand the conditions of real-world industrial deployment.

AlphaFold for proteins containing non-standard amino acids

But optimizing GCE machinery is not the only important opportunity. The real advantage lies in building machine learning models that can predict protein structure and function when non-standard amino acids are incorporated. This will require in-house data that doesn’t yet exist at the scale required.

Let’s take a look at what made AlphaFold possible. Its success depended on two existing databases: the Protein Data Bank, which contains the precise three-dimensional structures of nearly 175,000 proteins, and UniProt, a sequence and functional database containing over 200 million entries. Decades of experimental work built these resources, and AlphaFold discovered patterns in that existing data.

However, for proteins incorporating non-standard amino acids, that foundation does not yet exist. In the expanded chemical space that GCE unlocks, there is no equivalent of the Protein Data Bank. Databases of non-standard amino acids and of proteins built for or with GCE would capture their effects on folding, whether they enhance or degrade stability, and the binding affinities they exhibit, providing the training data needed to rapidly advance GCE’s place in industrial and academic science.

But that database is only achievable by organizations that have the physical infrastructure to generate ground-truth data at scale in-house.

Once that foundation is built, downstream applications will follow: ultra-stable industrial enzymes that work under conditions that destroy conventional proteins, entirely new classes of drugs with mechanisms not available with the standard 20 amino acids, and therapeutics designed from the ground up to be biologically stable and therefore administered much less frequently.

GCE is the bridge from discovery to market

It is clear to me now that as a young graduate student I did not understand GCE. The potential for GCE to enable new biotech products and even entirely new product categories is enormous. So is its complexity, and that complexity makes it a very interesting use case for machine learning. There is a huge opportunity to use GCE to accelerate discovery and create new systems that enable manufacturing of products.

Organizations building experimentation platforms to generate internal data are not just evolving their own pipelines. They are laying the groundwork for one of the most exciting applications of machine learning in biotechnology: turning GCE into a bridge between scientific potential and real-world products with the performance that patients and other customers need. The data generated in the process makes that bridge possible.
