AI can unlock the complexity of cancer — if you build the data infrastructure first

  • Vast shared datasets have enabled modern AI to master complex tasks such as writing code and reasoning.
  • Fragmented data silos, however, are preventing machine learning from delivering life-saving breakthroughs in cancer immunotherapy.
  • With standardized infrastructure and global collaboration, AI-enabled research networks could accelerate drug discovery.

Large language models learned to write, code, and reason because they were trained on vast shared datasets, from Shakespeare to software repositories. Scale, standardization, and open access are what made modern AI possible.

Cancer research deserves similar treatment.

AI models can already detect patterns across billions of variables. When applied to medicine, these systems can predict which patients will respond to treatments, understand why treatments fail, and simulate drug combinations before they reach clinical trials. In immunotherapy, where outcomes depend on millions of dynamic interactions between immune cells and tumors, this type of pattern recognition can be transformative.

The science is ready; the data infrastructure is not. Today, most cancer research is still conducted lab by lab and dataset by dataset. Valuable information sits in silos: locked behind institutional firewalls, scattered across supplementary files, or stored in incompatible formats. Even when results are published, the underlying data are often incomplete (biased toward positive results) or irreproducible.

Machine learning systems only perform as well as the data they are trained on. Fragmented and inconsistent datasets produce fragmented and inconsistent insights. No matter how powerful the algorithms become, AI will not unlock the complexity of cancer care without shared standards and pooled data.
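
To make the cost of fragmentation concrete, here is a minimal sketch of what pooling even two labs' response data involves when each uses its own conventions. All field names, codes, and values below are hypothetical, invented purely for illustration:

```python
# Hypothetical illustration: two labs record the same kind of data under
# incompatible conventions, so pooling requires an explicit mapping.

LAB_A = {"patient": "A-017", "resp": "CR", "pdl1_pct": 45}                   # percent, RECIST-style code
LAB_B = {"subject_id": "B-203", "best_response": "complete", "pd_l1": 0.45}  # fraction, free text

def to_common_schema(record: dict) -> dict:
    """Map a lab-specific record onto one set of shared (hypothetical) field names."""
    if "resp" in record:                       # Lab A's convention
        return {
            "subject": record["patient"],
            "response": record["resp"],        # already a RECIST-style code
            "pd_l1_fraction": record["pdl1_pct"] / 100,
        }
    return {                                   # Lab B's convention
        "subject": record["subject_id"],
        "response": {"complete": "CR", "partial": "PR"}.get(record["best_response"]),
        "pd_l1_fraction": record["pd_l1"],
    }

pooled = [to_common_schema(r) for r in (LAB_A, LAB_B)]
print(pooled)  # two records, one vocabulary: now one model can train on both
```

Multiply that hand-written mapping across hundreds of labs, assays, and file formats, and the case for agreeing on the schema up front becomes obvious.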

If we want AI to accelerate treatment, we first need to build the right data foundation for it to train on.

Why shared data matters

This moment is uniquely consequential. On one hand, biology has entered a new era. Single-cell and spatial techniques now allow us to observe the immune system at extraordinary resolution: not just which cells are present, but where they sit in the tissue, how they interact, and how they evolve over time. We can measure cancer (and its treatment) as a living, dynamic system. On the other hand, AI architectures have matured to ingest exactly this type of multimodal data (genomic, spatial, longitudinal) at scales no human can process.
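
As a rough sketch of what "multimodal" means in practice, the record below attaches the three axes named above (genomic, spatial, longitudinal) to a single biopsy. The field names are illustrative assumptions, not any real standard:

```python
from dataclasses import dataclass, field

@dataclass
class CellMeasurement:
    cell_id: str
    cell_type: str          # e.g. "CD8_T" or "tumor"
    x_um: float             # spatial axis: position in the tissue section (microns)
    y_um: float
    expression: dict        # genomic axis: gene -> normalized single-cell count

@dataclass
class Sample:
    subject: str
    timepoint_days: int     # longitudinal axis: days since treatment start
    cells: list = field(default_factory=list)

# One subject observed before and during therapy: the same tissue,
# measured cell by cell, at two points in time.
baseline = Sample("S-001", timepoint_days=0)
on_treatment = Sample("S-001", timepoint_days=28)
baseline.cells.append(
    CellMeasurement("c1", "CD8_T", x_um=102.4, y_um=88.0,
                    expression={"GZMB": 3.1, "PDCD1": 1.7})
)
```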

For the first time, the tools of measurement and the tools of computation are matched to each other. But without coordinated infrastructure, we risk squandering an immense opportunity.

The consequences are not theoretical. Research that cannot be replicated wastes an estimated $28 billion annually in the United States alone, and the problem starts with access. When the Center for Open Science tried to replicate 193 experiments from some of the most influential cancer studies, it lacked the information to attempt most of them. Of the 50 experiments it managed to complete over eight years, fewer than half reproduced the original results. Data were locked behind paywalls, buried in file drawers, or simply never shared. Research in BMC Medicine found that only 16% of oncology data is publicly available, and less than 1% meets the standards that allow other researchers to actually reuse it.

In a field where speed is everything, this inefficiency is unacceptable. And now that AI has the potential to accelerate discovery, the data bottleneck has become our biggest obstacle.

Building the foundation: CRI Discovery Engine

To address this gap, the Cancer Research Institute recently launched the CRI Discovery Engine: not a proprietary database, but shared infrastructure for the whole field.

In collaboration with researchers at Stanford University School of Medicine, the University of Pennsylvania Perelman School of Medicine, and Memorial Sloan Kettering Cancer Center, and with technology partner 10x Genomics, we are standardizing how immunotherapy data is generated, structured, and shared. The goal is simple: to create large, harmonized, AI-ready datasets that any qualified researcher can use. Participating scientists are breaking down academic silos by contributing their own early discoveries to the database. After this initial stage, external researchers around the world will be able to add data, creating a living resource that keeps growing in value. The aim is a common language for cancer immunotherapy research that makes results reproducible, comparable, and accessible to AI.
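
This article does not publish the Discovery Engine's actual schema, so the following is only a generic sketch, with invented field names and codes, of the kind of validation gate a shared resource might apply before accepting a contribution:

```python
# Generic contribution gate for a shared dataset. The required fields
# and codes are hypothetical, not the CRI Discovery Engine's real schema.

REQUIRED_FIELDS = {"subject", "assay", "timepoint_days", "response", "raw_data_uri"}
ALLOWED_RESPONSES = {"CR", "PR", "SD", "PD"}      # RECIST-style response codes

def validate_contribution(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if record.get("response") not in ALLOWED_RESPONSES:
        problems.append(f"unrecognized response code: {record.get('response')!r}")
    return problems

record = {"subject": "S-001", "assay": "scRNA-seq", "timepoint_days": 28,
          "response": "PR", "raw_data_uri": "s3://example-bucket/s001_d28.h5"}
assert validate_contribution(record) == []        # a conforming record passes
```

The point is not these particular checks but where they run: once, at the door, so that every model trained downstream sees a single vocabulary.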

Importantly, this type of initiative only works if the incentives are aligned. It is natural for companies to protect their intellectual property. Individual laboratories compete for recognition and funding. But diseases like cancer do not respect institutional boundaries. Pre-competitive collaboration, where data infrastructure is shared even while treatments are competing, is essential.

Nonprofits and public-private partnerships can play an important role here: convening stakeholders, setting standards, and building assets that no single organization could justify building alone.

What happens next

The next breakthrough in cancer won't come from one lab or one algorithm. It will come from a network of scientists, clinicians, engineers, and policymakers working from the same foundation.

Imagine an AI model trained on harmonized data from thousands of cancer and treatment combinations. Researchers could test hypotheses in simulated experiments before running the real thing. Clinicians could identify patients who are likely to respond before treatment begins. A discovery made at one institution could quickly accelerate progress at another.

This is not a moonshot. It’s infrastructure. And like any infrastructure project, such as roads, power grids, or the internet, it requires coordination, standardization, and co-investment.

AI can help decipher the complexity of cancer. But algorithms alone won’t save lives. The real work is building a shared foundation that allows intelligence (both human and artificial) to learn together. If we get it right, we can compress decades of discoveries into a few years.

For patients, that time is not just a metric. It's survival.


