Machine learning identifies cancer-causing mutations in CTCF binding sites

In a recent study published in the journal Nucleic Acids Research, researchers investigated whether machine learning can identify pan-cancer mutation hotspots in persistent CCCTC-binding factor (CTCF) binding sites (P-CTCF-BS).

Study: Machine learning identifies mutational hotspots in CTCF binding sites across cancers. Image credit: Nuttapong punna /

CTCF and Cancer

Mutations in CTCF binding sites (CTCF-BS) affect CTCF, a protein that regulates noncoding deoxyribonucleic acid (DNA) transcription and nuclear structure. Persistent CTCF-BS (P-CTCF-BS) resist CTCF knockdown and retain binding.

This subclass is distinguished by high binding strength, constitutive binding, and enrichment at chromatin loop anchors and topologically associating domain (TAD) boundaries. Mutations in CTCF binding sites can activate oncogenes, but these mutations remain largely unidentified.

About the Research

In this study, the researchers developed a computational tool, CTCF-In-Silico Investigation of PersisTEnt Binding (INSITE), that can predict the persistence of CTCF binding following knockdown in cancer cells.

CTCF-INSITE is a machine learning tool that evaluates both genetic and epigenetic features that explain CTCF binding persistence. The researchers generated persistence metrics for Encyclopedia of DNA Elements (ENCODE) CTCF ChIP-sequencing (ChIP-seq) data from different tissue types, then determined the mutational burden of P-CTCF binding sites using International Cancer Genome Consortium (ICGC) sequencing data from matched tumors. High-coverage whole genome sequencing (WGS) data for GM12878 from the National Center for Biotechnology Information (NCBI) and the Platinum Genome Initiative were also used for analysis.

The researchers used CTCF ChIP-seq data from IMR-90, MCF7, and LNCaP cell lines isolated from lung tissue, breast cancer, and prostate adenocarcinoma, respectively, to screen cohorts with fewer mutations per individual. After identifying and eliminating outliers using the interquartile range (IQR) method, 24 cohorts containing 3,218 patients were available for study.
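The IQR outlier step can be illustrated with a minimal sketch. It assumes the filtered quantity is the number of mutations per patient and uses the conventional 1.5 x IQR fences; the article states neither detail, and the patient IDs and counts below are made up for illustration.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(mutations_per_patient, k=1.5):
    """Keep only patients whose mutation count lies within the IQR fences."""
    lower, upper = iqr_bounds(list(mutations_per_patient.values()), k)
    return {
        patient: count
        for patient, count in mutations_per_patient.items()
        if lower <= count <= upper
    }


# Hypothetical counts: P6 is a hypermutated sample.
counts = {"P1": 1200, "P2": 1350, "P3": 1100, "P4": 1280, "P5": 1420, "P6": 98000}
kept = drop_outliers(counts)
print(sorted(kept))
```

The multiplier `k` is exposed as a parameter because the fence width used in the study is not reported here.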

Mutations from cohorts of the same cancer type were then combined, yielding 12 cancer types. For IMR-90, LNCaP, and MCF7 cells, genomic characteristics, chromatin interactions, binding affinities, replication timing, constitutive binding, and conservation scores were investigated.

Random forest modeling was used because it showed a superior success rate compared to linear regression models in predicting CTCF binding. The data were split into training and testing datasets in a 9:1 ratio.
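A 9:1 split can be sketched as below. This is only an illustration of the partitioning step, not the model itself; the function name, the fixed seed, and the placeholder site labels are assumptions, and whether the study shuffled or stratified the data is not stated.

```python
import random


def split_nine_to_one(records, seed=0):
    """Shuffle records reproducibly and return (train, test) in a 9:1 ratio."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = (len(shuffled) * 9) // 10  # first 90% goes to training
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers standing in for CTCF binding sites.
sites = [f"site_{i}" for i in range(1000)]
train, test = split_nine_to_one(sites)
print(len(train), len(test))
```

Holding out the test portion before any model fitting keeps the reported success rate from being inflated by sites the model has already seen.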

The researchers also performed a binding motif analysis to locate binding sites within ChIP-seq peaks between 200 and 2,000 base pairs (bp) long, and then calculated a motif score for each peak region.
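One common way to compute a motif score is to slide a log-odds position weight matrix along the peak sequence and keep the best-scoring window. The sketch below assumes that approach; the article does not say which scoring method the study used. The 4-bp matrix is a toy example, not the real CTCF motif, and the uniform background is also an assumption.

```python
import math

# Toy 4-bp position probability matrix (hypothetical, NOT the CTCF motif).
TOY_PPM = [
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds(ppm, background=BACKGROUND):
    """Convert per-position probabilities to log2 odds against the background."""
    return [{b: math.log2(p / background) for b, p in col.items()} for col in ppm]


def best_motif_hit(sequence, ppm):
    """Return (start, score) of the best forward-strand window, or None."""
    pwm = log_odds(ppm)
    width = len(pwm)
    sequence = sequence.upper()
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


peak = "TTTACGATGGG"
start, score = best_motif_hit(peak, TOY_PPM)
print(start, round(score, 3))
```

A fuller version would also scan the reverse complement, since a binding site can lie on either strand.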

Gene set enrichment analysis (GSEA) was used to determine the trinucleotide mutation context for all patients, and a fluorescence polarization DNA binding (FPDB) assay was used to compare the mutation burden between P-CTCF-BS and L-CTCF-BS. By aggregating these results, a background mutation rate for CTCF-BS was generated for all cancers.
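The trinucleotide context of a single-base substitution is the mutated base together with its two flanking bases. The sketch below labels a mutation using the widely used convention of reporting it relative to the pyrimidine (C or T) strand. The function names and the short reference sequence are hypothetical, and the study's actual pipeline is not described in this article.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def trinucleotide_context(reference, position, alt):
    """Label a substitution at a 0-based position, e.g. 'A[C>T]G'."""
    if position <= 0 or position >= len(reference) - 1:
        raise ValueError("position needs a flanking base on both sides")
    tri = reference[position - 1:position + 2].upper()
    alt = alt.upper()
    if tri[1] in "AG":  # purine reference base: report on the opposite strand
        tri = reverse_complement(tri)
        alt = reverse_complement(alt)
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


reference = "TTACGGA"  # made-up reference sequence
mutations = [(3, "T"), (4, "A")]  # (0-based position, alternate base)
spectrum = Counter(trinucleotide_context(reference, pos, alt) for pos, alt in mutations)
print(spectrum)
```

Counting these labels across all mutations in a patient gives a mutation spectrum, which is one way a context-aware background rate can be built.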

Research Results

Compared to all CTCF binding sites, P-CTCF binding sites showed significantly higher mutation rates in prostate and breast cancer. In all 12 cancer types examined, predicted P-CTCF binding sites showed significantly higher mutation burden. Mutations in P-CTCF binding sites predicted to have a functional effect on CTCF chromatin looping and binding showed significantly higher enrichment.

In vitro experiments confirmed that cancer mutations in the predicted disruptive P-CTCF binding sites reduced CTCF binding. Mutations were observed more frequently in P-CTCF binding sites than in L-CTCF binding sites across 12 different cancers. Mutations in P-CTCF binding sites were associated with loop disruptions, indicating that these mutations contribute to 3D genome dysregulation in cancer.

Binding affinity is critical for the persistence of P-CTCF-BS, especially at chromatin loop anchors, late replication timing regions, and TAD boundaries. Furthermore, binding sites that coincide with chromatin loops tend to persist.

The researchers identified significant allelic imbalance in binding at 91 sites, with mutations reducing binding affinity. Ultraviolet (UV)-induced gene downregulation was observed in breast cancer, while enrichment of epithelial-mesenchymal transition genes was observed in prostate cancer. Compared to L-CTCF binding sites, P-CTCF-BS had a higher mutation rate and were significantly enriched for disruptive mutations.
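Allelic imbalance at a heterozygous site is often assessed by asking whether ChIP-seq reads carrying the reference and alternate alleles depart from an even 50:50 split. The sketch below uses an exact two-sided binomial test for that purpose. This is an assumption, since the article does not say which statistical test the study applied, and the read counts are invented.

```python
from math import comb


def binomial_two_sided_p(ref_reads, alt_reads, p=0.5):
    """Exact two-sided binomial p-value for the observed reference read count."""
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum every outcome that is at most as likely as the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical site: 30 reads support the reference allele, 10 the mutant.
p_value = binomial_two_sided_p(30, 10)
print(f"{p_value:.4f}")
```

When many sites are tested at once, the resulting p-values would normally need a multiple-testing correction before any site is called significant.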


The findings identify a new subclass of cancer-specific CTCF-BS DNA mutations and provide important insight into the critical role these mutations play in the overall cancer genome architecture. Binding sites predicted by CTCF-INSITE showed significant enrichment of mutations across a range of cancer types. These mutations are likely functional because they disrupt chromatin loops and reduce binding, a conclusion supported by in vitro binding assays.

The enrichment of mutational signals at P-CTCF binding sites may support the study of mutational profiles in other cancer types. The predictive power of CTCF-INSITE thus provides promising candidate CTCF-BS that researchers should prioritize for experimental modification to gain a deeper understanding of cancer pathogenesis.
