Development of a deep learning model for cancer diagnosis by examining cell-free DNA end-motifs

Data collection

We collected a total of 4606 samples from five studies covering hepatocellular carcinoma (HCC), colorectal cancer (CRC), non-small cell lung cancer (NSCLC), and esophageal cancer (ESCA). The samples underwent different types of cfDNA sequencing, including whole genome sequencing (WGS), whole genome bisulfite sequencing (WGBS), targeted genomic bisulfite sequencing (TGBS), and 5-hydroxymethylcytosine sequencing (5hmCS).

The HCC-WGS dataset consists of plasma cfDNA samples from 74 HCC patients and 55 HCC-free controls that underwent WGS. It was downloaded from the European Bioinformatics Institute (accession number EGAS00001003409)3.

The HCC-WGBS dataset consists of plasma cfDNA samples from 34 HCC patients and 25 HCC-free controls that underwent WGBS. Raw sequencing data were requested and downloaded from the European Bioinformatics Institute (accession number EGAS00001003409)3.

The CRC-TGBS dataset consists of plasma cfDNA samples from 801 CRC patients and 1021 healthy controls that underwent TGBS. Raw sequencing data were downloaded from the Sequence Read Archive database (accession number PRJNA574555)4.

The HCC-TGBS dataset consists of plasma cfDNA samples from 1171 HCC patients and 959 healthy controls that underwent TGBS. Raw sequencing data were downloaded from the Sequence Read Archive database (accession number PRJNA360288)5.

The NSCLC-5hmCS dataset consists of plasma cfDNA samples from 66 NSCLC patients and 67 healthy controls that underwent 5hmCS. Raw sequencing data were downloaded from the Genome Sequence Archive database (accession number PRJCA000816)7.

The ESCA-5hmCS dataset consists of plasma cfDNA samples from 150 ESCA patients and 183 healthy controls that underwent 5hmCS. Raw sequencing data were downloaded from the Genome Sequence Archive database (accession number PRJCA000646)6.

The Cristiano dataset was generated by WGS of plasma cfDNA samples from 231 patients with cholangiocarcinoma (n = 25), breast cancer (n = 54), CRC (n = 27), duodenal cancer (n = 1), stomach cancer (n = 27), lung cancer (n = 35), ovarian cancer (n = 28), and pancreatic cancer (n = 34), and from healthy individuals (n = 246) (dbGaP, accession no. 34536)10.

The in-house NSCLC dataset consisted of plasma cfDNA samples from 21 NSCLC patients and 20 non-cancerous controls that underwent whole-exome sequencing.

Data Preprocessing

We counted the frequency of each of the 256 possible 4-mers at the 5′ end of sequenced cfDNA reads.
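The counting step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes reads are already available as plain strings written 5′ to 3′, and it skips reads that are shorter than four bases or that contain ambiguous bases such as N. The function name `count_end_motifs` is hypothetical.

```python
from itertools import product


def count_end_motifs(reads, k=4):
    """Return the relative frequency of every possible k-mer at the 5' end of the reads."""
    # Enumerate all 4**k motifs so that unseen motifs still get a frequency of 0.
    motifs = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(motifs, 0)
    for read in reads:
        motif = read[:k].upper()
        # Assumption: reads that are too short or contain non-ACGT bases are ignored.
        if motif in counts:
            counts[motif] += 1
    total = sum(counts.values())
    return {m: (counts[m] / total if total else 0.0) for m in motifs}


freqs = count_end_motifs(["CCCAGT", "CCCATT", "ACGTAA", "NNNNAA", "AC"])
```

In this toy call, only three of the five reads contribute, so the frequencies are computed over three valid end-motifs.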

The end-motifs were ranked by frequency to obtain the end-motif sequence. The input to EMIT was formulated as \(M=\{CLS, m_0, m_1, \ldots, m_i, \ldots, m_t, SEP;\ t<256\}\), where \(m_0, m_1, \ldots, m_t\) are end-motif tokens ordered by decreasing frequency, so that the frequency of \(m_0\) ≥ that of \(m_1\) ≥ … ≥ that of \(m_t\). t is a predefined value that was set to 128. CLS and SEP are two special tokens added to the beginning and end of the sequence, respectively. To increase the number of data points, we generated end-motif sequences from 10 million sequence reads randomly sampled without replacement.
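A minimal sketch of how such an input sequence could be assembled from a frequency table is given below. The token spellings `[CLS]` and `[SEP]`, the alphabetical tie-break between motifs with equal frequency, and the choice to keep exactly t motifs between the two special tokens are assumptions made for illustration; the function names are hypothetical.

```python
import random


def subsample_reads(reads, n, seed=None):
    """Draw n reads without replacement (the text uses n = 10 million)."""
    return random.Random(seed).sample(reads, n)


def build_input_sequence(freqs, t=128, cls_token="[CLS]", sep_token="[SEP]"):
    """Rank motifs by decreasing frequency and wrap the top t in special tokens."""
    # Assumption: ties in frequency are broken alphabetically to keep the order deterministic.
    ranked = sorted(freqs, key=lambda m: (-freqs[m], m))
    return [cls_token] + ranked[:t] + [sep_token]


toy_freqs = {"AAAA": 0.1, "CCCA": 0.5, "CCAG": 0.3, "TTTT": 0.1}
sequence = build_input_sequence(toy_freqs, t=3)
```

Repeating `subsample_reads` with different seeds before counting and ranking would yield several end-motif sequences per sample, which is one plausible reading of the augmentation described above.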

Architecture of End-Motif Inspection via Transformer (EMIT)

EMIT is a transformer18 that consists of an encoder module and a projection head. The encoder module contains an embedding layer and a self-attention module followed by a position-wise feedforward network. The projection head is a two-layer neural network.

The embedding layer maps the input end-motif tokens (parameterized by \(W_X\)) and their positions (parameterized by \(W_p\)) into a representation matrix. Specifically, the embedding layer maps the i-th motif \(m_i\) to a feature vector \(x_i\) of size d, generating a feature matrix \(X=\{x_0, x_1, \ldots, x_i, \ldots, x_t;\ t<4^k\}^{T}\) for the end-motif sequence M. In addition, the embedding layer maps the position of each end-motif in M to a d-dimensional feature vector, written as \(P=\{p_0, p_1, \ldots, p_t;\ t<4^k\}^{T}\). The transformer encoder takes the sum of X and P (i.e., X + P) as input and outputs the representation matrix.

The encoder layer is a multi-head self-attention module followed by a position-wise feedforward neural network (FFN). Layer normalization35 was applied before and after the FFN. Residual connections36 were added to improve the flow of information.

Self-attention performs a scaled dot product on Q, K, and V18:

$$SelfAttn(Q,K,V)=softmax\left(\frac{Q{K}^{{\rm{T}}}}{\sqrt{{d}_{k}}}\right)V$$

(1)

Q, K, and V are matrices projected from the output of the embedding layer. The scaling coefficient \(\sqrt{{d}_{k}}\) is used to mitigate extremely small gradients18.
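Equation (1) can be illustrated with a small dependency-free sketch. It operates on nested lists rather than tensors and is meant only to make the arithmetic concrete; the helper names are hypothetical, and the actual model relies on the attention implementation of its deep learning framework.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v):
    """Return (output, weights) where weights = softmax(Q K^T / sqrt(d_k))."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Two queries attending over two identical keys: each value receives equal weight.
out, attn = self_attention([[1.0, 0.0], [0.0, 1.0]],
                           [[1.0, 0.0], [1.0, 0.0]],
                           [[2.0, 0.0], [4.0, 2.0]])
```

Because both keys are identical in this toy input, every row of `attn` is uniform and each output row is the mean of the two value vectors.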

Multi-headed self-attention allows the model to jointly attend to information from different representation subspaces at different locations and is formulated as follows:

$$MultiHead(Q,K,V)=Concat(SelfAtt{n}_{1},\ldots ,SelfAtt{n}_{h})$$

(2)

where h denotes the number of heads.

The position-wise FFN is a fully connected neural layer. It consists of two linear transformations with a ReLU activation function in between, defined as:

$$FFN(x)=\,{\max }(0,x{W}_{1}+{b}_{1}){W}_{2}+{b}_{2}$$

(3)

where \({W}_{1}\) and \({W}_{2}\) are the weight matrices, and \({b}_{1}\) and \({b}_{2}\) are the biases.
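Equation (3) can be written out as a short dependency-free sketch. The weights in the toy call are arbitrary values chosen so the result is easy to check by hand; they are not taken from the trained model, and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ffn(x, w1, b1, w2, b2):
    """Position-wise feedforward network: max(0, x W1 + b1) W2 + b2."""
    # First linear transformation followed by ReLU.
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    # Second linear transformation.
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]


# One position with two features; the negative feature is zeroed by ReLU.
y = ffn([[1.0, -1.0]],
        [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
        [[1.0], [1.0]], [0.5])
```

The same weights are applied independently to every row of the input, which is what "position-wise" refers to.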

The output of the encoder is input to the projection head, which contains a two-layer neural network with layer normalization35 in between.

Various sizes of EMIT models

We trained three EMIT models (EMIT-2Mb, EMIT-8Mb, EMIT-32Mb) with different parameter sizes by varying the hidden size of the transformer as 384, 768, and 1536, and the number of attention heads as 6, 12, and 24, respectively. We used exponential cross entropy (ECE) as the main metric to compare these models.

Development of EMIT

EMIT takes as input a sequence of end-motif tokens of length 128. We randomly held out 5000 instances for evaluation. We followed the training scheme proposed by Devlin and colleagues22. The training data generator randomly selects 15% of the motifs in each input sequence and replaces the selected motifs with the [MASK] token. We trained these models for 40 epochs using the Adam optimizer with a batch size of 256, a weight decay of 0.01, and a learning rate of 1e−4. The learning rate was decayed towards zero following a cosine schedule after one warm-up epoch. EMIT was implemented with PyTorch (version 1.7.1) and Transformers (version 4.21.1) and run on an NVIDIA DGX A100 equipped with eight GPUs, each with 40 GB of memory.
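The masking step and the learning-rate schedule can be sketched as follows. Several details here are assumptions rather than facts from the text: the sketch always substitutes `[MASK]` for a selected motif (the full scheme of Devlin and colleagues also sometimes keeps or randomizes a selected token), special tokens are never masked, and the warm-up is linear. The function names and the step counts in the toy call are hypothetical.

```python
import math
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]",
                special=("[CLS]", "[SEP]"), seed=None):
    """Replace a random 15% of the non-special tokens with the mask token."""
    rng = random.Random(seed)
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    n_mask = max(1, round(mask_prob * len(candidates))) if candidates else 0
    chosen = set(rng.sample(candidates, n_mask))
    corrupted = [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in chosen}  # original tokens the model must recover
    return corrupted, labels


def cosine_lr(step, total_steps, warmup_steps, base_lr=1e-4):
    """Linear warm-up (assumed) followed by cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


tokens = ["[CLS]"] + [f"m{i}" for i in range(20)] + ["[SEP]"]
corrupted, labels = mask_tokens(tokens, seed=0)
schedule = [cosine_lr(s, total_steps=400, warmup_steps=10) for s in range(401)]
```

With 40 epochs and one warm-up epoch, `warmup_steps` would correspond to the number of optimizer steps in a single epoch and `total_steps` to forty times that number.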

End motif highlights

The attention between end-motifs i and j, \({\alpha }_{i,j}\), is defined as the softmax-normalized dot product between the query vector and the key vector.

$${\rm{A}}=SelfAttn(Q,K,V)=soft{max}\left(\frac{Q{K}^{{\rm{T}}}}{\sqrt{{d }_{k}}}\right)V$$

(4)

$${\alpha }_{i,j}=A(i,j)$$

(5)

According to Clark et al., the CLS token aggregates the representation of each end-motif to represent the input sequence22. Therefore, the attention between the j-th end-motif and the CLS token, \({\alpha }_{CLS,j}\), represents the influence of that end-motif on the representation of the input sequence. A higher value means that the end-motif is more important to the representation of the sequence. We used the Wilcoxon rank-sum test37 to evaluate the differences in end-motifs between cancer patients and cancer-free controls.

The cancer end-motif attention matrix was obtained by taking the cumulative sum of \({\alpha }_{i,j}\) over all cancer cases. The top 0.1% of interactions were retained for subsequent analysis. The end-motif attention network was visualized with Cytoscape38 (version 3.9.0).

Linear projection of representations from EMIT

Linear projection is essentially a linear classifier used to determine whether the input representations are linearly separable. It is widely used to determine whether certain features are encoded and represented within deep learning models20,39,40. We performed 5-fold cross-validation to examine the linear projection accuracy on the collected public datasets with sequence lengths of 64, 96, 128, 160, 200, and 256, respectively.

Motif Diversity Score (MDS)

We calculated the motif diversity score following the method of Jiang et al.3, defined as follows:

$${MDS}=\mathop{\sum }\limits_{i=1}^{256}-{P}_{i}\log ({P}_{i})/\log (256)$$

(6)

where \({P}_{i}\) is the frequency of the i-th motif.
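Equation (6) is a normalized Shannon entropy, which the following sketch computes directly. Motifs with zero frequency are skipped, following the usual convention that 0 · log 0 = 0; this convention and the function name are assumptions of the sketch.

```python
import math


def motif_diversity_score(freqs, n_motifs=256):
    """Normalized Shannon entropy of end-motif frequencies (0 = one motif only, 1 = uniform)."""
    # Terms with P_i = 0 contribute nothing, so they are skipped to avoid log(0).
    entropy = -sum(p * math.log(p) for p in freqs if p > 0)
    return entropy / math.log(n_motifs)


uniform_mds = motif_diversity_score([1 / 256] * 256)
single_mds = motif_diversity_score([1.0] + [0.0] * 255)
```

A sample in which all 256 motifs are equally frequent scores 1, and a sample dominated by a single motif scores 0.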

Baseline Method

As a baseline method for comparison, we considered the linear projection of the count matrix of 4-mer end-motifs.

kBET tests for evaluating batch effects

To assess the presence of batch effects, we used the k-nearest neighbor batch effect test (kBET)25. In the Cristiano dataset, we used the volume of plasma per sample as the batch variable. Suppose the dataset contains N samples in m batches, with \({n}_{j}\) samples in the j-th batch (\(1\le j\le m\)). The batch mixing frequency is expressed as \(f=({f}_{1},\cdots ,{f}_{m})\), where \({f}_{j}=\frac{{n}_{j}}{N}\). The number of the k nearest neighbors of the i-th sample that belong to batch j is \({n}_{ji}^{k}\). The \({\chi }^{2}\) statistic \({k}_{i}^{k}=\mathop{\sum }\limits_{j=1}^{m}\frac{{({n}_{ji}^{k}-{f}_{j}\cdot k)}^{2}}{{f}_{j}\cdot k}\) has m − 1 degrees of freedom. The P value is calculated as \({p}_{i}^{k}=1-{F}_{m-1}({k}_{i}^{k})\), where \({F}_{m-1}(x)\) represents the cumulative distribution function of the \({\chi }^{2}\) distribution with m − 1 degrees of freedom. The kBET acceptance rate is defined as the proportion of samples that accept the null hypothesis at the significance level α as follows:

$$kBET{\hbox{-}}rate=\frac{{\sum }_{i=1}^{N}I({p}_{i}^{k}\ge \alpha )}{N }\times 100 \%$$

(7)

The indicator function I(·) equals 1 if its condition holds and 0 otherwise. We used Pegasus (version 1.4.3) to calculate the kBET acceptance rate, setting k and α to 5 and 0.05, respectively.
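The statistic and acceptance rate defined above can be sketched in plain Python. This is an illustration of the formulas only and does not reproduce the Pegasus implementation or its API: it assumes the batch labels of each sample's k nearest neighbors have already been found, and it evaluates the χ² tail probability with a closed-form recurrence that holds for integer degrees of freedom. All function names are hypothetical.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper tail probability 1 - F_df(x) of a chi-squared distribution, integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # df = 2: exp(-x/2)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # df = 1: erfc(sqrt(x/2))
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), with z = x/2.
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def kbet_acceptance_rate(neighbor_batches, batch_freq, alpha=0.05):
    """Percentage of samples whose neighborhood batch mix is consistent with batch_freq."""
    m = len(batch_freq)
    accepted = 0
    for labels in neighbor_batches:
        k = len(labels)
        counts = Counter(labels)
        stat = sum((counts.get(b, 0) - f * k) ** 2 / (f * k) for b, f in batch_freq.items())
        accepted += chi2_sf(stat, m - 1) >= alpha
    return 100.0 * accepted / len(neighbor_batches)


# Two equally sized batches, k = 4: a well-mixed neighborhood and a single-batch one.
rate = kbet_acceptance_rate([["A", "A", "B", "B"], ["A", "A", "A", "A"]],
                            {"A": 0.5, "B": 0.5})
```

In the toy call, the first neighborhood matches the expected mix exactly and is accepted, while the second has a statistic of 4 with one degree of freedom (P ≈ 0.046) and is rejected at α = 0.05.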

Statistical analysis

The experiments were performed using Python (version 3.7.10), R (version 4.2.1), ggplot2 (version 3.3.6), and pROC (version 1.18.0). The area under the receiver operating characteristic curve (AUROC) was calculated using pROC. The 95% confidence interval for the AUROC was calculated using the DeLong method implemented in pROC. Accuracy, sensitivity, and specificity were calculated using the R package caret (version 6.0.78). The 95% confidence intervals for accuracy, sensitivity, and specificity were calculated using the Clopper-Pearson method41. The Benjamini-Hochberg method was used to adjust P values for multiple hypothesis testing where appropriate. Two-sided tests were used unless otherwise specified.
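For readers unfamiliar with the Benjamini-Hochberg adjustment, a minimal sketch is shown below. It follows the standard step-up definition (each P value is scaled by N divided by its rank, then made monotone from the largest rank downwards) and is not claimed to be the code used in the study; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

An adjusted value below the chosen threshold means the corresponding hypothesis is rejected while controlling the false discovery rate at that threshold.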


