New AI model reconstructs human ancestry from DNA

On screen, a stretch of DNA looks static, just long strings of the letters A, T, C, and G. But embedded in those letters are small changes that show where lineages have split, recombined, and moved over time.

Researchers at the University of Oregon say they have built an artificial intelligence system that can read these mutation patterns, much like a language model reads text, and use them to estimate when two genes last shared a common ancestor. The tool, called cxt, is described in the Proceedings of the National Academy of Sciences and aims to reconstruct family history hidden within the genome, one of the most difficult tasks in population genetics.

Rather than predicting the next word in a sentence, the model predicts what the team calls the next coalescence: an estimate of when a pair of sequences last shared a common ancestor at each position along the chromosome.

Andrew Kahn, a computational biologist at the University of Oregon’s College of Arts and Sciences, said the study borrows the ideas behind generative AI and applies them to a field that has relied heavily on math-heavy statistical methods. “Advances in generative AI and the architectures behind it can be useful in many areas beyond chatbots,” Kahn said. “We’re borrowing strengths from the world of AI and applying them to this different, largely undeveloped context.”

Traditional methods remain the standard for this type of ancestral reconstruction. However, they can be slow, especially on large genomic datasets, and do not always handle incomplete data well.

Bones used for DNA extraction. (Credit: University of Oregon)

Kevin Kofman, lead author of the study and a former postdoctoral researcher at the University of Oregon, said the problem became a natural target for machine learning.

Teaching the model to read mutation patterns

The system is based on a modified GPT-2 architecture, an older language model design well known for text generation. Here, however, it was not trained on books or websites. It learned from simulations of genetic evolution across a variety of species, including primates, mosquitoes, rodents, and bacteria.

The choice is important because evolution cannot be rerun on demand in the lab. “We can’t repeat evolution, so one of our key workflows is developing simulations,” says Kofman. “Simulations mimic the evolutionary process and use the results as training data for deep learning models.”
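To make the idea of simulated training data concrete, here is a minimal sketch in Python. It is not the team's pipeline; it is a toy stand-in under strong assumptions: a constant population size, an independent TMRCA for each window (real chromosomes have correlated genealogies because of recombination), and illustrative parameter values. The function names `poisson` and `simulate_pair` are hypothetical.

```python
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) count by counting unit-rate arrivals before time lam."""
    count = 0
    t = rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_pair(n_windows, window_len, pop_size, mu, rng):
    """Return a list of (mutation_count, tmrca) examples for one pair of sequences.

    Assumption: each window has its own TMRCA, drawn from an exponential
    distribution with mean 2 * pop_size generations (the standard pairwise
    coalescent for a diploid population of constant size).
    """
    examples = []
    for _ in range(n_windows):
        tmrca = rng.expovariate(1.0 / (2.0 * pop_size))
        # Mutations accumulate on both lineages back to the ancestor,
        # so the expected number of differences is 2 * mu * L * T.
        expected_diffs = 2.0 * mu * window_len * tmrca
        examples.append((poisson(expected_diffs, rng), tmrca))
    return examples


if __name__ == "__main__":
    rng = random.Random(42)
    # Illustrative values only: 10 kb windows, N = 10,000, mu = 1e-8 per site per generation.
    for count, tmrca in simulate_pair(5, 10_000, 10_000, 1e-8, rng):
        print(f"differences={count:3d}  true TMRCA={tmrca:10.1f} generations")
```

Each example pairs an observable (the number of differences in a window) with the hidden quantity that produced it (the true TMRCA), which is the kind of labeled data a supervised model can learn from.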

In brief, the program scans windows of DNA for mutation density and related signals. Regions with many differences usually indicate a more distant common ancestor. Regions with fewer differences tend to indicate a more recent one. From these patterns, cxt estimates the pairwise time to the most recent common ancestor (TMRCA), a fundamental quantity used to infer genealogical history.
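The link between mutation density and ancestor age can be illustrated with a classical back-of-the-envelope estimator, which is not what cxt itself computes. If two sequences differ at d of L compared sites and the mutation rate is mu per site per generation, then d is expected to be about 2 * mu * L * T, so T is roughly d / (2 * mu * L). The sketch below assumes aligned sequences of equal length, a known constant mutation rate, and "N" as the missing-data symbol; the function name `windowed_tmrca` is hypothetical.

```python
def windowed_tmrca(seq_a, seq_b, window_len, mu):
    """Estimate pairwise TMRCA (in generations) for each non-overlapping window.

    Uses the moment estimator T = d / (2 * mu * L), where d is the number of
    differing sites among the L sites actually compared in the window.
    Sites where either sequence is 'N' (missing) are skipped. A window with
    no comparable sites yields None.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    estimates = []
    for start in range(0, len(seq_a), window_len):
        compared = 0
        diffs = 0
        for x, y in zip(seq_a[start:start + window_len],
                        seq_b[start:start + window_len]):
            if x == "N" or y == "N":
                continue  # missing data: skip this site
            compared += 1
            if x != y:
                diffs += 1
        if compared == 0:
            estimates.append(None)
        else:
            estimates.append(diffs / (2.0 * mu * compared))
    return estimates


if __name__ == "__main__":
    a = "ACGTACGTACNNNNN"
    b = "ACGTTCGTACNNNNN"
    # Toy mutation rate chosen so the numbers are easy to read.
    print(windowed_tmrca(a, b, window_len=5, mu=0.01))
```

A per-window estimate like this is very noisy when differences are few, which is one reason methods that pool information across neighboring windows, whether statistical or learned, tend to do better.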

The Oregon team describes the model as a kind of language system for population genetics, not because it reads DNA letters like other biological AI models, but because it learns to infer sequences of hidden evolutionary states from context. The study frames this as a translation problem, converting mutation patterns into coalescence times for entire chromosomes.

In tests, the model held up well against established approaches, notably Singer+Polegon and SMC++. In the constant population size scenario, the narrow cxt model produced a mean squared error of 0.2531. This was close to the 0.2470 for Singer+Polegon and lower than the 0.8685 for SMC++. In the more complex “sawtooth” demographic scenario, the broader version of cxt showed a marked improvement, with an error of 0.1796.

Andrew Kahn, an academic expert in population genetics and machine learning, is developing new tools to study evolutionary biology. (Credit: Charlie Litchfield)

The results surprised the group. “When you borrow technology from a completely different world and apply it to a new problem, you never know what will work,” Kahn says. “But this was a case where things worked out very well.”

Fast enough for entire chromosomes

One reason the team thinks this approach could be useful is speed.

The authors report that all pairwise coalescence curves for a sample of 50 haploid chromosomes could be estimated within 5 minutes on a single NVIDIA A100 GPU. They also found that cxt scales approximately linearly as the number of inference pairs increases, and that adding GPUs increases speed approximately linearly.
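To put those figures in perspective, 50 chromosomes form 50 × 49 / 2 = 1,225 distinct pairs. The short sketch below extrapolates from that reference point. It assumes perfectly linear scaling in both pair count and GPU count, which the authors describe only as approximate, and it treats the reported 5 minutes as an upper bound. The function names are hypothetical.

```python
from math import comb


def n_pairs(n_samples):
    """Number of distinct pairs among n_samples haploid chromosomes."""
    return comb(n_samples, 2)


def projected_minutes(n_samples, n_gpus=1, ref_samples=50, ref_minutes=5.0):
    """Linear extrapolation of runtime from a reference point.

    Assumes runtime is proportional to the number of pairs and inversely
    proportional to the number of GPUs. Real scaling is only approximately
    linear, so treat the result as a rough guide.
    """
    scale = n_pairs(n_samples) / n_pairs(ref_samples)
    return ref_minutes * scale / n_gpus


if __name__ == "__main__":
    for n, gpus in [(50, 1), (100, 1), (100, 4), (200, 8)]:
        print(f"{n:4d} samples, {gpus} GPU(s): {n_pairs(n):6d} pairs, "
              f"about {projected_minutes(n, gpus):.1f} min or less")
```

Because the number of pairs grows with the square of the sample size, doubling the sample roughly quadruples the work, which is why the ability to add GPUs matters for large datasets.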

This is very different from likelihood-based and Markov chain Monte Carlo approaches, which tend to be computationally intensive and sensitive to parameter selection.

Kofman said the time savings come from when the work is done. “Compared to classical inference approaches, AI tools do not need to reason about every mutation individually,” he said. “You just read the patterns. All the expensive statistical work is done upfront during training, avoiding bottlenecks.”

The model also handled incomplete DNA data better than some standard tools because missing-data patterns were incorporated during training and fine-tuning. This became especially important when researchers tested cxt on mosquito genomes, where spotty data and uneven sample sizes are common.

A broader version of the model generalized well to species later added to the stdpopsim catalog, including pigs, rats, porpoises, mice, and gorillas. Still, that flexibility had its limits. In some out-of-sample cases, Singer+Polegon achieved lower error, sometimes by a large margin. The authors say that cxt struggled most in settings with low mutation rates, high recombination rates, and poor signal-to-noise conditions.

Actual and predicted coalescence times for three inference approaches across two demographic scenarios: constant population size and varying “sawtooth” demographics. (Credit: PNAS)

There were other caveats as well. The system does not reconstruct the full genealogical tree topology, only pairwise coalescence times. Also, in structured population models, the simulated training samples often mixed individuals from different populations, so the training setup may have biased some within-population estimates.

From human history to malaria mosquitoes

To see how the system performs on real data, the team applied it to the human genome from the 1000 Genomes Project and the mosquito genome from the Ag1000G consortium.

In humans, cxt recovered a well-known pattern in the LCT region of chromosome 2, where lactase persistence is known to have risen in frequency under recent selection. The model found a marked drop in coalescence times there, with some estimates of just over 10,000 years, consistent with a wide range of haplotype age estimates.

In the HLA region of chromosome 6, the situation was reversed. There, the model estimated a deeper lineage structure that included several genes with TMRCAs on the order of tens of millions of years. This is consistent with long-standing evidence that balancing selection has preserved ancient diversity in this immune-related region.

Mosquito work and malaria vectors

The mosquito work may prove even more practical. Kahn studies malaria vectors, and one of the biggest problems in controlling them is insecticide resistance. “We’re now seeing insecticide resistance in all of these mosquito populations,” he says. “A big challenge in preventing the spread of malaria has been understanding the evolution of insecticide resistance. Now, using AI models, we can ask how long ago these resistance genes arose in the population and learn about the evolutionary history of this important malaria carrier.”

At the Rdl locus in Anopheles gambiae, cxt detected a reduction in coalescence time that varied by region. A localized sharp decline was seen in Ghana but none in Uganda, a pattern consistent with known geographic differences in resistance alleles.

The model also suggests that the most recent dates estimated for the locus (hundreds to thousands of years) likely predate modern pesticide use. The authors caution, however, that these dates may be pushed back by older variation and by the fact that the method averages across sites rather than dating the exact mutations that cause resistance.

Out-of-sample evaluation of the broad model in stdpopsim v0.3. Each panel shows the estimated marginal coalescence distribution (dashed line) against the true distribution (shaded). (Credit: PNAS)

The researchers also used cxt to examine the ancient In(2L)a inversion on mosquito chromosome 2L and found deeper coalescence times inside the inversion than outside, with especially old signals near the breakpoints.

Practical implications of the research

This study presents an alternative way to do population genetics, replacing hand-crafted likelihood formulas with simulation-trained machine learning. This could potentially allow researchers to process much larger genomic datasets, work with messier sequence data, and work faster when studying the evolution of humans, disease-carrying animals, or other species.

It is not intended to replace the best theory-based methods in all cases. Singer+Polegon still turns out to be more accurate in some scenarios, and cxt has clear limitations in unfamiliar parameter regimes. But the Oregon team argues that its speed, flexibility, and ability to adapt through fine-tuning make it useful for problems where traditional methods are too slow or too rigid.

Kahn and Kofman say the next step is to move beyond pairs of lineages and toward reconstructing a more complete family tree. That would bring the model closer to the full ancestral recombination graph that population geneticists ultimately hope to reconstruct.

The study results are available online in the journal PNAS.
