Proteins sustain life as we know it and play many important structural and functional roles throughout the body. However, these large molecules have cast a long shadow over a smaller subclass of proteins called microproteins. Microproteins have been lost in the 99% of DNA dismissed as “non-coding.” Yet despite being small and elusive, their impact may be as great as that of larger proteins.
Scientists at the Salk Institute are now exploring the mysterious dark side of the genome in search of microproteins. Using their new tool, ShortStop, researchers can probe genetic databases and identify stretches of genomic DNA that are likely to encode microproteins. Importantly, ShortStop also predicts which microproteins are most likely to be biologically relevant, saving time and money in the search for microproteins involved in health and disease.
ShortStop sheds new light on existing datasets and spotlights microproteins that were previously impossible to find. In fact, the Salk team has already used the tool to analyze a lung cancer dataset and find 210 entirely new microprotein candidates, including one standout validated microprotein, that may make good therapeutic targets in the future.
The findings were published in BMC Methods on July 31, 2025.
“Most of the proteins in our bodies are well known, but recent discoveries suggest that we have been missing thousands of small, hidden proteins, called microproteins, encoded by overlooked regions of our genome,” says Alan Saghatelian, a professor at Salk and senior author of the study. “For a long time, scientists only really studied the regions of DNA that encode large proteins and dismissed the rest as ‘junk DNA,’ but we are now learning that these other regions are in fact very important and that the microproteins they produce can play an important role in regulating health and disease.”
Microproteins in detail
Microproteins are difficult to detect and catalog, mainly because of their size. Compared with standard proteins, which range in length from hundreds to thousands of amino acids, microproteins usually contain fewer than 150 amino acids, making them difficult to detect with standard protein analysis methods. Therefore, instead of searching for microproteins themselves, scientists search large, publicly available datasets for the DNA sequences that encode them.
Scientists have learned that certain stretches of DNA, known as small open reading frames (smORFs), can contain instructions for making microproteins. Current experimental methods have already cataloged thousands of smORFs, but these tools remain time-consuming and expensive. Furthermore, their inability to separate potentially functional microproteins from non-functional ones has stalled discovery and characterization.
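As a rough illustration of what scanning DNA for small open reading frames (smORFs) involves, the Python sketch below walks the three forward reading frames of a sequence and reports every stretch that starts with ATG, ends at an in-frame stop codon, and encodes fewer than 150 amino acids. The function name find_smorfs, the ATG-only start rule, and the forward-strand-only scan are simplifying assumptions made for illustration; they are not the procedure used by the Salk team.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_smorfs(dna, max_codons=150, min_codons=2):
    """Return (start, end, sequence) for small ORFs on the forward strand.

    Hypothetical helper: an ORF begins at ATG and runs to the first
    in-frame stop codon. It is kept if it encodes at least min_codons
    and fewer than max_codons amino acids (the stop codon is not counted).
    """
    dna = dna.upper()
    hits = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] != "ATG":
                i += 3
                continue
            # Walk codon by codon until an in-frame stop codon appears.
            j = i
            while j + 3 <= len(dna) and dna[j:j + 3] not in STOP_CODONS:
                j += 3
            if j + 3 > len(dna):
                break  # no stop codon left in this frame
            n_codons = (j - i) // 3
            if min_codons <= n_codons < max_codons:
                hits.append((i, j + 3, dna[i:j + 3]))
            i = j + 3  # resume scanning after the stop codon
    return hits

candidates = find_smorfs("CCATGAAATTTTAGGG")
```

A real pipeline would also scan the reverse strand and typically works from assembled transcripts rather than raw strings, but the length cutoff and reading-frame logic are the same idea.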
How ShortStop works
Not all smORFs are translated into biologically meaningful microproteins. Existing methods cannot distinguish between smORFs that generate functional microproteins and those that generate non-functional ones. This means that scientists must test each microprotein individually to determine whether it is functional.
ShortStop fundamentally changes this workflow and optimizes smORF discovery by sorting microproteins into functional and non-functional categories. The key to ShortStop's two-class sorting is how it is trained as a machine learning system. Its training relies on a negative control dataset of computer-generated random smORFs. ShortStop compares discovered smORFs against these decoys to quickly determine whether a new smORF is likely to be functional or non-functional.
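The idea of training against computer-generated decoys can be sketched in a few lines of Python. The example below builds random decoy smORFs, then fits a nearest-centroid classifier on base-composition features to separate them from a handful of positive examples. Everything here is an assumption made for illustration: the feature set, the nearest-centroid model, the function names, and the toy positive sequences (which are invented stand-ins, not real microprotein genes). The article does not describe the features or model that ShortStop actually uses.

```python
import random

STOP_CODONS = ("TAA", "TAG", "TGA")
BASES = "ACGT"

def random_smorf(n_codons, rng):
    """Decoy smORF: ATG, then random non-stop codons, then one stop codon."""
    codons = ["ATG"]
    while len(codons) < n_codons:
        codon = "".join(rng.choice(BASES) for _ in range(3))
        if codon not in STOP_CODONS:
            codons.append(codon)
    codons.append(rng.choice(STOP_CODONS))
    return "".join(codons)

def features(seq):
    """Toy feature vector: fraction of each base (illustrative only)."""
    return [seq.count(base) / len(seq) for base in BASES]

def centroid(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(positives, decoys):
    """Two-class model: one centroid per class."""
    return {
        "functional-like": centroid([features(s) for s in positives]),
        "decoy-like": centroid([features(s) for s in decoys]),
    }

def classify(model, seq):
    """Assign the label whose centroid is closest to the sequence."""
    vector = features(seq)
    return min(model, key=lambda label: sq_dist(model[label], vector))

rng = random.Random(0)  # fixed seed so the decoy set is reproducible
decoys = [random_smorf(20, rng) for _ in range(200)]
# Invented GC-rich stand-ins for "known functional" examples.
positives = ["ATGGCCGCGGGCCCGTGA", "ATGCCGGGCGCGGCCTAA", "ATGGGCGCCCCGGCGTAG"]
model = train(positives, decoys)
label = classify(model, "ATGGCGGGCCCGGCCTGA")
```

The point of the decoys is that they give the classifier an explicit picture of what a sequence with no biological constraint looks like, so a new candidate can be judged by which group it resembles more.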
ShortStop cannot say definitively whether a smORF encodes a biologically relevant microprotein, but this two-class system significantly narrows the experimental pool. Now, researchers can automatically sort through datasets to reduce the time spent on failed experiments at the bench.
When researchers applied ShortStop to a previously published smORF dataset, it identified 8% as potentially functional microproteins and prioritized them for targeted follow-up. This accelerates microprotein characterization by filtering out sequences that are unlikely to have biological relevance. ShortStop can also identify microproteins overlooked by other methods, including one that was detected and validated in human cells and tissues.
“What makes ShortStop particularly powerful is that it works with popular data types, such as the RNA sequencing datasets that many labs already use,” says Brendan Miller, a postdoctoral researcher in Saghatelian's lab. “This means that microproteins can be searched at scale across healthy and diseased tissues, which unlocks new insights into human biology and opens new pathways to diagnose and treat diseases such as cancer and Alzheimer's.”
ShortStop finds a microprotein associated with lung cancer
Researchers have already used ShortStop to identify a microprotein that is upregulated in lung cancer tumors. They analyzed genetic data from human lung tumors and adjacent normal tissue to create a list of potentially functional smORFs. Among the smORFs ShortStop selected, one stood out. It is expressed more in tumor tissue than in normal tissue, suggesting that it may serve as a biomarker or functional microprotein for lung cancer.
The identification of this lung cancer-related microprotein demonstrates the value of ShortStop and machine learning for prioritizing candidates for future research and treatment development.
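One common way to express "more in tumor than in normal" is a log2 fold change between mean expression levels. The Python sketch below ranks candidates by that measure. The candidate IDs and expression values are made up purely for illustration, and the pseudocount of 1 is an assumed convention; none of this reproduces the analysis in the study.

```python
import math
from statistics import mean

def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 of mean tumor over mean normal expression.

    The pseudocount (an assumed convention) avoids division by zero
    when a candidate is not detected in one tissue.
    """
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))

# Invented expression values for three hypothetical candidates.
expression = {
    "smORF_A": {"tumor": [7.0, 7.0, 7.0], "normal": [1.0, 1.0, 1.0]},
    "smORF_B": {"tumor": [3.0, 3.0, 3.0], "normal": [3.0, 3.0, 3.0]},
    "smORF_C": {"tumor": [1.0, 1.0, 1.0], "normal": [3.0, 3.0, 3.0]},
}

scores = {
    name: log2_fold_change(values["tumor"], values["normal"])
    for name, values in expression.items()
}
# Candidates with the strongest tumor enrichment come first.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A positive score means higher expression in tumor tissue, zero means no difference, and a negative score means higher expression in normal tissue. A real analysis would add a statistical test across patients before calling a candidate upregulated.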
“There is a lot of data already in existence that we can now process with ShortStop to find new microproteins related to health and disease, from Alzheimer's to obesity,” says Saghatelian. “My team is good at creating methods, and with data from other Salk faculty members we can integrate these methods and accelerate the science.”
Reference: Miller B, De Souza EV, Pai VJ, et al. ShortStop: a machine learning framework for microprotein discovery. BMC Methods. 2025;2(1):16. doi:10.1186/s44330-025-00037-4
