Deep learning powers gene function prediction



In a breakthrough that bridges genomics and artificial intelligence, researchers have unveiled a deep learning model that deciphers the hidden functional language encoded within cis-regulatory DNA sequences across diverse plant species. The study, published in Nature Plants, introduces PhytoBabel, a sophisticated neural network trained to extract semantic meaning from regulatory DNA fragments, whose functional roles are surprisingly conserved despite vast evolutionary divergence. This innovation promises to revolutionize gene function prediction in plant biology, dramatically enhancing our ability to annotate genes and discover new functional elements beyond traditional sequence similarity analysis.

Cis-regulatory elements control when, where, and how genes are turned on or off. These DNA regions are essential for precise temporal and spatial gene regulation, but they have historically been difficult to interpret functionally, primarily because they diverge rapidly over evolutionary time. Over approximately 160 million years, regulatory sequences can diverge so widely that sequence alignments no longer reveal meaningful similarity, which obscures their shared biological roles. PhytoBabel circumvents this bottleneck by learning to detect semantic similarities in regulatory DNA that go beyond raw sequence conservation.
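To make the alignment problem concrete, the sketch below computes percent identity, the kind of raw sequence measure that loses its signal once regulatory regions have diverged. It is only an illustration: the function name `percent_identity` and the toy sequences are assumptions made up for this example, not data or code from the study.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy, invented sequences (not real promoters), already aligned.
conserved = percent_identity("ACGTACGTAC", "ACGTACGTAT")
diverged = percent_identity("ACGTACGTAC", "TGCA-TGACA")
print(f"conserved pair: {conserved:.1f}% identity")
print(f"diverged pair:  {diverged:.1f}% identity")
```

A pair scoring near the level expected by chance gives an alignment-based method nothing to work with, even if both regions drive the same expression pattern.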

The researchers assembled orthologous pairs of cis-regulatory DNA from 15 flowering plant species into a dataset for training PhytoBabel. Despite the minimal nucleotide identity between these orthologous regulatory regions, the model learned to match sequences by their semantic content, i.e., the functional “meaning” encoded in these stretches of DNA, rather than by raw sequence. Through this approach, PhytoBabel captures complex regulatory grammars and contextual features inaccessible to traditional sequence alignment tools, uncovering deeply conserved biological functions encoded in non-coding genomes.
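The article does not describe PhytoBabel's training objective. One common way to learn from matched pairs is a contrastive loss, sketched below purely as an assumption: each sequence embedding should be closer to its ortholog than to the other sequences in the batch. The function names, the temperature value, and the two-dimensional toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, partners, temperature=0.1):
    """InfoNCE-style loss: partners[i] is the positive for anchors[i];
    every other partner in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in partners]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings standing in for encoder outputs (assumed, not real data).
species_a = [[1.0, 0.0], [0.0, 1.0]]
species_b = [[0.9, 0.1], [0.1, 0.9]]  # row i is the ortholog of species_a[i]

matched_loss = contrastive_loss(species_a, species_b)
shuffled_loss = contrastive_loss(species_a, species_b[::-1])
print(f"correct pairing:  {matched_loss:.4f}")
print(f"shuffled pairing: {shuffled_loss:.4f}")
```

Under this kind of objective the loss is low when orthologs sit close together in embedding space and high when they do not, so the encoder is pushed toward representing whatever the pairs share, regardless of nucleotide identity.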

One of the most notable findings is that PhytoBabel, although trained solely on evolutionarily paired regulatory sequences, implicitly encodes a rich layer of biological knowledge. This includes spatiotemporal gene expression patterns, conserved non-coding motifs important for gene regulation, and DNA fragments that retain biologically meaningful functions despite sequence divergence. The model also appears to internalize phylogenetic relationships between species, suggesting that it captures the evolutionary distances and regulatory structures shaped by plant divergence over millions of years.

Beyond a theoretical advance, PhytoBabel has immediate practical applications in plant reverse genetics, a field dedicated to assigning functions to genes based on sequence. Enabling functional predictions from regulatory DNA alone opens the door to identifying genes involved in important biological processes, even in species where experimental data are lacking. For example, the researchers used PhytoBabel to identify maize genes involved in somatic embryogenesis, an essential morphogenetic process, by detecting semantic similarities with well-characterized Arabidopsis regulatory elements. This cross-species functional inference provides a powerful method for discovering novel gene regulators in economically important crops.
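Cross-species inference of this kind can be pictured as a nearest-neighbor search in embedding space. The sketch below assumes each regulatory region has already been converted to a vector by some trained encoder. The gene identifiers, the vectors, and the function `rank_candidates` are hypothetical placeholders rather than anything reported in the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_embedding, candidate_embeddings, top_k=3):
    """Return the top_k (gene_id, score) pairs, most similar first."""
    scored = [
        (gene_id, cosine_similarity(query_embedding, embedding))
        for gene_id, embedding in candidate_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Placeholder embedding for a well-characterized Arabidopsis regulatory region.
arabidopsis_query = [0.8, 0.1, 0.6]

# Placeholder embeddings for candidate maize regulatory regions.
maize_candidates = {
    "maize_gene_A": [0.1, 0.9, 0.2],
    "maize_gene_B": [0.7, 0.2, 0.7],
    "maize_gene_C": [-0.5, 0.4, 0.1],
}

hits = rank_candidates(arabidopsis_query, maize_candidates, top_k=2)
for gene_id, score in hits:
    print(f"{gene_id}: similarity {score:.3f}")
```

In practice the top-ranked candidates would be hypotheses for experimental follow-up, not confirmed functional assignments.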

Traditional approaches to regulatory sequence analysis rely heavily on sequence conservation and motif scanning, and they often fail to capture subtle but important functional features encoded in non-coding regions. PhytoBabel’s deep learning framework uses a neural network architecture that can learn a hierarchical regulatory language, including combinatorial motif usage, epigenetic marks, and dynamic gene expression cues. This level of abstraction goes beyond the linear DNA code and allows the identification of functionally homologous regulatory regions that classical methods would overlook because of sequence differences.

The success of this study highlights the untapped potential of integrating AI-driven semantic analysis into genomics. PhytoBabel helps bridge the gap between genotype and phenotype by translating complex regulatory codes into a semantic space, revealing hidden regulatory mechanisms that drive gene function. This conceptual leap goes beyond sequence similarity toward a functional understanding akin to natural language processing, where meaning is retained despite changes in representation. Here, the changing representation is the diversity of regulatory DNA sequences.

Importantly, this model demonstrates transferability and generalizability, which are important properties for broad utility in plant science. Although PhytoBabel is trained only on a specific set of angiosperm regulatory pairs, it extrapolates the learned knowledge to predict functional similarity in previously unstudied sequences, providing a scalable tool for high-throughput gene function annotation. This ability is particularly valuable for orphan crops and wild plants that lack extensive genomic resources, accelerating the understanding of plant biology and facilitating crop improvement.

PhytoBabel’s architecture also sheds light on the evolutionary dynamics of gene regulation. By revealing the functional conservation of regulatory elements that have lost sequence similarity over the course of evolution, the study suggests that selective pressures maintain regulatory function through semantic content rather than strict nucleotide conservation. This insight emphasizes conservation of function over sequence and prompts a shift in how evolutionary conservation is defined and measured in regulatory genomics.

The research team emphasizes that their findings have implications beyond plant systems. The methodological framework of semantic matching may be applicable to animal and microbial regulatory genomics, where similar challenges exist in interpreting non-coding DNA. This potential cross-kingdom applicability points to a general principle: regulatory DNA acts like a semantic language, and deep learning models trained on evolutionary data can substantially improve our understanding of it.

Additionally, PhytoBabel’s ability to uncover evolutionarily unrelated but semantically similar regulatory sequences paves the way for the discovery of novel gene networks and pathways. Such discoveries could facilitate advances in synthetic biology by enabling the design of custom regulatory elements that mimic natural regulatory semantics, allowing precise control of gene expression in genetically engineered organisms. This capability holds promise for agriculture, biotechnology, and medicine.

The development of PhytoBabel represents a synthesis of computational innovation and biological insight, demonstrating how machine learning can penetrate the complexity of gene regulation. By adopting semantic similarity as a foundational concept, the model transcends the limitations of traditional bioinformatics approaches and ushers in a new era in which regulatory DNA sequences are not simply cataloged but functionally interpreted at unprecedented scale and depth.

As the genomics research landscape evolves, tools like PhytoBabel will become essential for dissecting the regulatory genome, a largely uncharted territory. Insights gained from such models can guide experimental design and inform targeted manipulation of gene expression to develop crops with greater climate resilience, higher yields, and improved nutritional quality. This synergy between AI and biology represents a transformative step toward fully exploiting the genetic blueprint encoded within every plant cell.

In conclusion, the emergence of PhytoBabel marks a pivotal moment in genomics, bridging vast evolutionary distances not simply by sequence but also by semantic function. This enables a new paradigm for gene function prediction rooted in deep learning-derived regulatory DNA understanding. As its applications expand, this technology will unravel the obscure regulatory dark matter of the genome and drive innovation in plant science and other fields for decades to come.

Research theme: cis-regulatory DNA sequences, gene function prediction, deep learning, plant genomics.

Article title: Deep learning-based semantic matching of cis-regulatory DNA sequences facilitates prediction of gene function.

Article references:
Li, T., Xu, H., Suo, M. et al. Deep learning-based semantic matching of cis-regulatory DNA sequences facilitates prediction of gene function. Nat. Plants (2026). https://doi.org/10.1038/s41477-026-02231-w


DOI: https://doi.org/10.1038/s41477-026-02231-w



