Machine and Deep Learning Reveal Sequence Determinants Encoding Bivalent Histone Modifications

Genome-wide profiling of histone modifications in mouse embryonic stem cells

To systematically investigate the regulatory patterns of univalent and bivalent histone modifications, we compared the signal intensities and chromosomal distribution preferences of different histone marks. Among them, H3K4me3 displayed globally higher enrichment levels in mESCs compared to the two repressive modifications, H3K27me3 and H3K9me3. These marks were relatively evenly distributed across autosomes, while fewer peaks were observed on sex chromosomes, suggesting that the establishment of these modifications is largely stable across the genome (Fig. 1A). To further dissect the combinatorial patterns of H3K4me3 with either H3K27me3 or H3K9me3, we classified the peaks into monovalent, bivalent, and trivalent categories. The majority of peaks were marked by a single histone modification, while co-modified regions—those carrying two or more marks—constituted a smaller fraction of the total (Fig. 1B). Notably, regions simultaneously marked by H3K4me3, H3K27me3, and H3K9me3 (trivalent domains) accounted for approximately 1% of H3K4me3 peaks, 8% of H3K27me3 peaks, and 3% of H3K9me3 peaks.

**Fig. 1: Genome-wide distribution of H3K4me3, H3K27me3, and H3K9me3 in mESCs.**

We next investigated the underlying DNA sequence characteristics of monovalent versus bivalent peaks, focusing on peak width, GC content, and conservation. Peak length analysis revealed that most regions spanned 200 to 1000 bp, with bivalent peaks being slightly narrower than monovalent ones (Fig. 1C). Additionally, bivalent peaks exhibited significantly higher GC content and conservation scores, suggesting that these co-modified regions are preferentially located in conserved genomic elements (Fig. 1D, E).

Bivalent domains are typically found at promoter regions, as exemplified by the overlap of H3K4me3 and H3K27me3 enrichment at gene promoters (Fig. 1F, G). In contrast, H3K9me3 showed predominant enrichment within long terminal repeat (LTR) and long interspersed nuclear element (LINE) repetitive elements (Fig. 1F), consistent with its established role in retrotransposon silencing²⁹. We also evaluated the enrichment of repetitive DNA elements within H3K9me3/H3K4me3 bivalent domains. The results showed that H3K9me3-marked bivalent regions were significantly enriched in repetitive elements in mESCs, with H3K9me3/H3K4me3 domains particularly enriched for LINE/L1 elements (Fig. 1H, I).

Bivalent modifications coordinately regulate gene expression

To elucidate the effects of distinct histone modification (HM) patterns on gene expression, we quantitatively assessed the signal intensities of H3K4me3, H3K27me3, and H3K9me3 at trivalently marked genes and correlated these with corresponding transcriptional levels. Our analysis revealed that elevated H3K4me3 signals are strongly associated with high gene expression, whereas enrichment of H3K27me3 or H3K9me3 correlates with reduced expression. These findings suggest that within co-marked genomic regions, H3K4me3 likely functions as the principal activation mark, while H3K27me3 and H3K9me3 appear to contribute predominantly to repressive regulatory outcomes (Fig. 2A).

**Fig. 2: Gene expression patterns associated with bivalent marks.**

We further analyzed the expression patterns of genes associated with the two types of bivalent modifications (H3K4me3/H3K27me3 and H3K4me3/H3K9me3), and performed functional enrichment analyses. Genes marked by H3K4me3/H3K27me3 bivalency exhibited relatively low expression levels, falling between those marked by H3K4me3-only and H3K27me3-only regions, but skewed closer to the H3K27me3-only profile, indicating a general tendency toward transcriptional repression (Fig. 2B). We further categorized bivalent genes into high, medium, and low expression groups and observed a positive relationship between bivalent mark intensity and gene expression, with highly expressed genes showing the strongest bivalent signals (Fig. 2C). Consistently, a genome-wide correlation analysis revealed a weak but statistically significant positive association between bivalent signal intensity and gene expression (Spearman’s ρ = 0.186, p = 9.33 × 10⁻³⁴; Supplementary Fig. S1), in agreement with previous studies^30,31. KEGG pathway enrichment analysis of highly expressed bivalent genes revealed significant association with the “breast cancer” pathway, implying that some bivalent genes may contribute to oncogenesis. For example, the bivalent gene POU4F1 has been shown to be essential for promoting and sustaining basal-like breast cancer (BLBC), making it a potential therapeutic target³² Bivalent genes are enriched in majority of the signaling pathways related to the pluripotent maintenance and development, such as signaling pathways regulating pluripotency of stem cells, TGF-beta, Wnt, MAPK, FoxO and Hippo signaling pathways³³ (Fig. 2D).

Next, we attempted to define bivalent genes with different signal intensities by overlapping H3K4me3 and H3K9me3 Chromatin Immunoprecipitation sequencing (ChIP-seq) peaks. The overlapping regions were categorized into four groups based on signal strength: High-High, Mid-High, High-Mid, and Mid-Mid. These groups exhibited a linear distribution pattern, suggesting a certain degree of co-regulation between H3K4me3 and H3K9me3 (Fig. 2E). By integrating RNA-seq data, we found that genes located in regions with both high H3K4me3 and high H3K9me3 signals (High-High) exhibited relatively low expression levels (Fig. 2F). Bivalent domains are known to undergo dynamic transitions during cell fate specification. For instance, changes in bivalent marks are implicated in the differentiation of neural stem cells³⁴. Gene Ontology (GO) analysis of H3K4me3/H3K9me3-associated genes revealed enrichment in neurodevelopmental processes such as synapse assembly—a critical step during ESC differentiation into mature and functional neurons (Fig. 2G).

Interpretable model of bivalent histone modifications driven by sequence features

In embryonic stem cells, bivalent domains marked by H3K4me3 and H3K27me3 are preferentially enriched at promoters containing CpG islands³ and exhibit strong associations with specific DNA sequence features, such as CpG-rich elements and transcription factor binding sites³⁵. A previous study identified an enrichment of the TCCCC sequence motif at bivalent promoters in both mouse and human ESCs³⁶. Nevertheless, the sequence determinants that encode or direct the formation of bivalent histone modifications remain largely elusive. To explore whether bivalent chromatin regions harbor unique and predictive sequence features, we employed five supervised machine learning algorithms as well as deep learning models to classify bivalent versus monovalent regions. Specifically, overlapping bivalent peaks (200-1000 bp in length) were selected for sequence feature extraction. For machine learning models, we used the widely adopted k-mer approach, which has proven effective in capturing short sequence patterns relevant to chromatin state²¹. Insights from previous research were leveraged to guide the selection of parameter k, and a range of k values from 3 to 8 was adopted in this study^19,37,38. For deep learning models, sequences were encoded using one-hot encoding, enabling the models to automatically learn latent sequence patterns (Fig. 3).

**Fig. 3: Schematic representation of the interpretable model for bivalent modifications.**

We first constructed a three-class classification model to distinguish among H3K4me3-only, H3K9me3-only, and H3K4me3/H3K9me3 bivalent regions. Based on comparative analysis between Lasso regression and Random Forest for feature selection, we adopted Random Forest, which identified 797 informative 6-mer features (Supplementary Fig. S2). Evaluation across different k-mer sizes revealed that all models performed best at k = 6 (Fig. 4A), consistent with prior machine learning studies on genomic sequence prediction^18,20,39,40. Among all models tested, Support Vector Machine (SVM) yielded the highest classification performance, as indicated by ROC curves and overall accuracy (Fig. 4B, C). The confusion matrix of the SVM model confirmed strong classification capacity, particularly for H3K4me3-only regions (Fig. 4D). To exclude the possibility that the observed predictive performance arose from random correlations, we further evaluated a randomized null model in which sequence features were shuffled while class labels were preserved. Under this control, all models exhibited chance-level performance (AUROC ≈ 0.5), in contrast to the clear diagonal enrichment observed when using real genomic sequences (Supplementary Fig. S5), indicating that the predictive signals captured by the models are genuinely encoded in DNA sequence features. Using a similar strategy, we further distinguished H3K4me3/H3K27me3 bivalent regions from monovalent ones. Lasso regression was used to select 616 relevant sequence features. Optimal model parameters were determined via grid search with 5-fold cross-validation. Among all tested classifiers, the XGBoost-6mer model achieved the best predictive performance (Supplementary Fig. S3). Interestingly, H3K4me3/H3K9me3 bivalent regions were classified more accurately than H3K4me3/H3K27me3 regions, indicating that they may harbor more distinct sequence determinants.

**Fig. 4: Interpretable modeling of bivalent histone modifications based on DNA sequence features.**

Given the large number of sequence features identified by machine learning, model interpretation remains challenging due to feature complexity and potential redundancy. To address this, we performed HOMER enrichment analysis to prioritize core sequence motifs with potential biological relevance (Supplementary Fig. S6B). For the H3K4me3/H3K9me3 bivalent modifications, four key motifs were significantly enriched. We retrained classification models using only these core features and observed that all models maintained robust predictive performance, with AUC values reaching approximately 0.87. Among them, the Random Forest (RF) model consistently outperformed others, suggesting that these core sequence motifs are sufficient to capture essential information for chromatin state classification (Fig. 4E). SHAP (SHapley Additive exPlanations) analysis revealed that the motif “TCTGAA” exhibited the highest positive contribution to the prediction of bivalent regions, indicating its strong association with bivalent chromatin states. In contrast, the remaining three motifs predominantly contributed to the classification of H3K9me3-only regions. These findings suggest functional divergence among sequence motifs, where distinct short DNA sequences preferentially mark different histone modification states (Fig. 4F). These findings indicate that different short DNA motifs preferentially associate with distinct chromatin states.

Similarly, five enriched motifs were identified for H3K4me3/H3K27me3 bivalent domains. As expected, the performance was slightly reduced compared to models trained with the full feature set, yet the simplified model offers improved interpretability and highlights biologically relevant sequence patterns (Supplementary Fig. S4). Based on SHAP values, the motifs TCACAG, TTCAAA, and AGGGCT exhibited the highest feature importance, serving as key indicators for distinguishing H3K4me3/H3K27me3 bivalent regions from monovalent counterparts (Supplementary Fig. S4C). Collectively, these results highlight that specific short DNA sequence motifs not only carry predictive power for chromatin state classification but also provide mechanistic insights into the sequence determinants underlying bivalent histone modifications.

To gain deeper insights into the sequence determinants regulating bivalent chromatin regions, we constructed deep learning models to predict H3K4me3/H3K9me3 bivalency. We systematically evaluated the effect of different input sequence lengths (200 bp, 400 bp, 600 bp, 800 bp, and 1000 bp) on model performance. Overall, most models exhibited improved AUROC values with increasing sequence length. Specifically, the DanQ, CNN+Attention, and CNN+Transformer models achieved optimal performance when trained on sequences of 600 bp or 1000 bp (Fig. 4G), suggesting that these lengths are well-suited for this binary classification task. Among all evaluated models, the CNN+Attention architecture consistently demonstrated the best predictive performance, followed by the DanQ and CNN+Transformer models, based on comprehensive evaluation metrics (Fig. 4H, I). These results suggest that deep learning models are capable of capturing higher-order and context-dependent sequence features relevant to bivalent chromatin, with interpretable architectures further facilitating motif discovery.

Analysis of sequence regulatory grammar underlying bivalent histone modifications

To elucidate the regulatory grammar of distinct bivalent histone modifications, we performed a comparative analysis of sequence features associated with H3K4me3/H3K9me3 and H3K4me3/H3K27me3 regions. Histone modification level was reported to be closely related to the overall GC content of the corresponding region⁴¹, we categorized the identified motifs into three groups based on their GC content: GC-rich, AT-rich, and neutral motifs. Most motifs were GC-rich, while a smaller proportion were AT-rich (Fig. 5A). Previous studies have implied the role of particular DNA sequences (or motifs) in setting the boundary of histone modification domain or opening chromatin to allow remodeling enzymes to bind DNA⁴². We next examined the positional distribution of the selected motifs within bivalent sequences. Each bivalent region was divided into ten equal bins, with the outermost bins representing the edges and the central bins representing the center. We observed that motifs associated with H3K4me3/H3K27me3 bivalency were predominantly enriched at the edges of bivalent regions, whereas those linked to H3K4me3/H3K9me3 bivalency were largely classified as neutral. These results indicate that the two types of bivalency exhibit distinct motif positional preferences within bivalent domains (Fig. 5B; Supplementary Fig. S6A).

**Fig. 5: Sequence-based regulatory features associated with bivalent histone modifications.**

Next, we further investigate potential transcription factors (TFs) associated with these sequence features, we used the TomTom tool to compare all identified motifs with a mouse motif database. Pluripotency-related transcription factors, including STAT3, ESRRB, KLF4, OCT4, SOX2, and SMAD1, were significantly enriched in both types of bivalent regions (Supplementary Fig. S6), suggesting shared regulatory programs in embryonic stem cells. Notably, the key motif TCTGAA that distinguishes H3K4me3/H3K9me3 bivalent regions was primarily matched to the TF TCFCP2l1 (Fig. 5E). These findings suggest that the DNA sequences co-occupied by H3K4me3/H3K27me3 or H3K4me3/H3K9me3 are potentially regulated by pluripotency factors in embryonic stem cells. We further mapped the identified sequence features of H3K4me3/H3K9me3 bivalent modifications to specific peak regions, annotated the corresponding genes, and identified four bivalent genes containing all four key motifs (Fig. 5C, D). These peaks exhibited intermediate ChIP-seq signal intensities between H3K4me3-only and H3K9me3-only regions (Supplementary Fig. S6E), supporting the notion of a distinct chromatin state.

To interpret the learned sequence patterns, we applied a sequence alignment-based interpretability method (SABM) to H3K4me3/H3K9me3-labeled regions. This method identifies subsequences that activate convolutional neurons and stacks position weight matrices (PWMs), analogous to traditional estimation methods⁴³. We used information content (IC) to evaluate each model’s ability to identify functional sequence motifs, with the DanQ model achieving the highest IC values, indicating stronger motif definition in this subtype (Fig. 5F). Using the TOMTOM tool, we identified the transcription factors associated with motifs learned by different models. These transcription factors are capable of binding not only to the promoters of pluripotency-related genes in embryonic stem cells, but also to their enhancer regions, thereby maintaining the pluripotent state (Fig. 5G). Subsequently, we applied the FIMO tool to locate the specific positions of each motif (motif sites), and annotated the corresponding bivalent genes. KEGG pathway enrichment analysis of these bivalent genes revealed significant enrichment in the Neuroactive ligand–receptor interaction pathway, which is closely associated with nervous system development and stem cell differentiation (Fig. 5H).

In summary, while H3K4me3/H3K9me3 and H3K4me3/H3K27me3 bivalent regions share common pluripotency-related regulatory factors, they differ in base composition, motif positioning, TF specificity, and associated biological pathways. These findings suggest that distinct sequence grammars underlie the two classes of bivalent chromatin, reflecting their divergent regulatory roles in embryonic stem cells.

Source link

b"asta binance h"anvisningskod commented on Hiring platform Uplers ups the ante; claims to have created two pronged strategy for workforce : I don't think the title of your article matches th
创建个人账户 commented on WestMetric Defends Controversial On-Page SEO Services for the Era of AI: Your article helped me a lot, is there any more re
Registro commented on Security Architect | eFinancialCareers: Thanks for sharing. I read many of your blog posts
Anm"al dig f"or att fa 100 USDT commented on Best ChatGPT Tips and Tricks shared by ChatGPT Experts: Turbo-Charge Your AI Experience: Prompts included | by Michael King | Oct, 2023: Thanks for sharing. I read many of your blog posts
Elizabeth Nash commented on AI platform Hugging Face says hackers have stolen authentication tokens from Spaces: 🌍 Global crypto mining is now at your fingertips h

Machine and Deep Learning Reveal Sequence Determinants Encoding Bivalent Histone Modifications

Genome-wide profiling of histone modifications in mouse embryonic stem cells

Bivalent modifications coordinately regulate gene expression

Interpretable model of bivalent histone modifications driven by sequence features

Analysis of sequence regulatory grammar underlying bivalent histone modifications

RECENT POSTS

Bird song data from Merlin ID app supports global biodiversity projects | Birds

Largest study of undergraduate student use of AI reveals disparities in access and abuse

Donald Trump posts AI video of himself treating celebrities for ‘TDS’

Genome-wide profiling of histone modifications in mouse embryonic stem cells

Bivalent modifications coordinately regulate gene expression

Interpretable model of bivalent histone modifications driven by sequence features

Analysis of sequence regulatory grammar underlying bivalent histone modifications

Related Posts