Rapid diagnosis of celiac disease based on plasma Raman spectroscopy combined with deep learning


Principal component analysis and interpretability

Given the characteristics of the dataset and the requirements of the analysis task, we retained the principal components that together explain 85% of the total variance of the Raman spectral data. We used the explained variance ratio attribute to obtain the variance explained by each principal component. We standardized the spectral data using the z-score and applied PCA to visualize the data distribution (Fig. 3), which gives a clearer picture of how the data spread in terms of spectral standard deviation and principal components.

Figure 3
(a) PCA after Z-score standardization; (b) PCA without Z-score standardization.
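
The standardization and variance-retention steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: it uses NumPy's SVD rather than any particular PCA library, the helper names `zscore` and `pca_retain_variance` are hypothetical, and the random matrix stands in for real spectra.

```python
import numpy as np

def zscore(X):
    """Standardize each spectral feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X - mu) / sigma

def pca_retain_variance(X, target=0.85):
    """PCA via SVD, keeping the fewest components whose cumulative
    explained-variance ratio reaches `target`."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = S**2 / np.sum(S**2)              # explained variance ratio per component
    k = int(np.searchsorted(np.cumsum(ratio), target)) + 1
    k = min(k, len(S))
    scores = Xc @ Vt[:k].T                   # sample projections, shape (n_samples, k)
    return scores, Vt[:k], ratio[:k]

# Hypothetical stand-in data: 40 spectra x 300 spectral points
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 300))
Xz = zscore(X)
scores, components, ratio = pca_retain_variance(Xz, target=0.85)
print(scores.shape, round(float(ratio.sum()), 3))
```

The first two columns of `scores` are what a two-dimensional plot in the style of Fig. 3 would show; running the same function on `X` instead of `Xz` gives the non-standardized variant.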

Figure 4a illustrates the projection directions of the data in the principal component space. Each point represents a feature (here, possibly an original feature or a PCA component) rather than a sample, and its coordinates are its weights on the corresponding principal components. Figure 4a therefore helps us understand how the original features contribute to the principal components and how the components relate to one another. Figure 4b displays the projection of each sample in the principal component space. Each point represents a sample, and its coordinates are its projection values on the principal components. This plot shows how the samples are distributed along the principal component directions and how well the different categories separate.

Figure 4
(a) Projection direction of data in the principal component space; (b) Projection of each sample in the principal component space.

To visualize the variability of the entire dataset, we plotted its standard deviation (Fig. 5).

Figure 5
Standard deviation across spectral regions.

In spectral analysis, each wavelength corresponds to a feature. The spectral standard deviation in Fig. 5 reflects the dispersion of the data at each wavelength, so it shows how the variability changes across the spectral range. Together with PCA, this illustrates how the samples in the dataset spread. The contributions of each wavelength to the two principal components are shown in Fig. 6.

Figure 6
The contribution of each wavelength corresponding to the two principal components.
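
The quantities behind Figs. 4 to 6 can be computed in a few lines. The sketch below is illustrative only: the spectral axis, the synthetic spectra, and all variable names are assumptions, and PCA is again done with a plain SVD.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical axis and data: 30 spectra sampled at 50 spectral points
axis = np.linspace(400, 1800, 50)
spectra = rng.normal(size=(30, 50)) * np.linspace(0.5, 2.0, 50)

# Standard deviation at each spectral point (the curve in a Fig. 5 style plot)
spectral_std = spectra.std(axis=0)

# PCA by SVD on mean-centered data
centered = spectra - spectra.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)

# Loadings: weight of each spectral point on PC1 and PC2 (Fig. 4a / Fig. 6 style)
loadings = Vt[:2]                    # shape (2, n_points)
# Scores: projection of each sample on PC1 and PC2 (Fig. 4b style)
scores = centered @ loadings.T       # shape (n_samples, 2)

# Spectral points with the largest absolute weight on PC1
top_pc1 = axis[np.argsort(np.abs(loadings[0]))[::-1][:5]]
print(top_pc1)
```

Plotting `loadings[0]` and `loadings[1]` against `axis` gives a per-wavelength contribution curve, while a scatter of `scores` gives the sample view.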

These visualizations show how the data vary and distribute across different feature dimensions. However, advanced classification tools such as neural networks require more data features for training and classification, and the distribution of the data after PCA is not ideal. This is one of the reasons for choosing more advanced classification tools such as deep learning models.

Raman spectroscopy

The Raman spectra of plasma from patients with celiac disease (CD) are shown in Fig. 1, where the Raman characteristic peaks represent substances rich in lipids, proteins, nucleic acids, and amino acids in the tissue. Previous studies have indicated that changes in the Raman peaks of proteins and nucleic acids may be observed in the plasma of diseased individuals, reflecting abnormal expression of cellular nucleic acids and proteins18. Additionally, CD patients exhibit higher levels of high-sensitivity C-reactive protein in their plasma and, in terms of the lipoprotein spectrum, lower levels of high-density lipoprotein cholesterol (HDL-C)19. The serum of CD patients is characterized by lower levels of various metabolites (such as amino acids, lipids, ketones, and choline) (P < 0.01)20. Comparative experiments have revealed that, in terms of lipids, the main difference between celiac disease patients and the control group is a decrease in cholesterol and phospholipids in both high-density and low-density lipoprotein in the patients. These differences persist after treatment, and a lower level of cholesterol in very-low-density lipoprotein (VLDL) has also been observed21. Table 1 lists the major characteristic peaks of plasma in celiac disease, along with the assignment of each feature peak. As shown in Fig. 7, the plasma of patients with celiac disease exhibits Raman peaks at 1402 cm−1, 1477 cm−1, 1518 cm−1, 1545 cm−1, 1715 cm−1 and 1772 cm−1 that are higher than those in normal controls, whereas the peak at 1445 cm−1 is lower than in normal controls. Significant differences exist between celiac disease patients and healthy controls in terms of functionality, tissue structure, and surface features in plasma.
Specifically, the notable Raman peak difference at 1402 cm−1 reflects differences in the bending modes of methyl groups between the two groups, indicating potentially abnormal lipid metabolism in celiac disease patients, such as damage to adipose tissue due to malabsorption of fat22. As shown in Table 2, the Raman peak at 1477 cm−1 reflects calcium oxalate in the patient’s plasma and changes significantly compared to healthy plasma23. Celiac disease is an immune-related disease that may involve an abnormal immune response to proteins in the intestines. This may lead to the observed Raman peak differences in celiac disease patients, reflecting changes in protein structure or composition. In celiac patients, changes in lipid and protein composition are related to alterations in cell membrane structure and function caused by damage to the intestinal mucosa. Additionally, celiac disease is often accompanied by inflammation and the formation of immune complexes. These biological processes may change the intra- and extracellular environment, including the distribution and structure of lipids and proteins. The Raman peak difference at 1518 cm−1 is attributed to differences in cytosine content. In celiac patients, intestinal damage may affect nucleotides, including their concentration or structure. Changes in the expression level of phenylalanine are reflected in the Raman peak at 1545 cm−1, which indicates the metabolic status, redox balance, and regulation of some physiological functions. The differences in the C=O vibration at 1715 cm−1 and 1772 cm−1 are lipid-related, as celiac disease is a malabsorption disease. Significant differences in C=O vibration in celiac patients therefore imply abnormal lipid metabolism or changes in lipid composition, which are related to the absorption and metabolism of fat in the intestines22.

Figure 7
Average Raman spectra of celiac disease patients and healthy controls.

Table 2 The major Raman bands and their corresponding assignments35.

Model evaluation

Convolutional neural network (CNN) model evaluation

A convolutional neural network (CNN) is a deep feedforward neural network characterized by local connections and weight sharing. As one of the representative algorithms of deep learning, the CNN has significant advantages in complex machine learning problems such as image classification, computer vision, and natural language processing24,25,26,27, making it one of the most widely used models. In addition to the basic input and output layers, a CNN comprises convolutional layers, pooling layers, and fully connected layers28. A convolutional layer extracts features from the input data; a single layer may capture only low-level features, while stacked convolution operations iteratively build more complex features from them. The pooling layer then reduces the dimensionality of the features, achieving feature invariance. As shown in Fig. 8a, after multiple convolution and pooling operations, all local features are combined into global features in the fully connected layer. In this experiment, the CNN model mainly includes four Conv1D layers with 32, 64, 64, and 32 filters, as well as 2 neurons. A Dropout layer is added after each Dense layer to prevent overfitting.

Figure 8
Structure of (a) CNN; (b) MCNN; (c) ResNet; (d) DRSN.

The ROC curve of the CNN is shown in Fig. 9. Compared to machine learning models, CNN shows improvement in classification accuracy, specificity, and sensitivity. However, CNN still makes errors in recognizing a considerable number of samples.

Figure 9
ROC curve of CNN, MCNN, ResNet, DRSN, SVM, KNN.

Multi-scale convolutional neural network (MCNN) evaluation

MCNN is a simple yet effective multi-scale convolutional neural network that can map the input to its corresponding density map29. MCNN generalizes better across input information: by using filters of different sizes with different receptive fields, the network learns features at several scales, which makes them more adaptable to the perspective effect. The MCNN used in this experiment consists of Conv1D layers, LeakyReLU layers, pooling layers, and Conv1D layers. As shown in Fig. 8b, three convolutional layers are used, with 16, 32, and 64 filters and kernel sizes of 4, 8, and 16, respectively. The stride is 1, and “same” padding is used. MCNN outperforms the CNN model in accuracy, and its runtime is similar to that of the CNN. The ROC curve of MCNN is shown in Fig. 9. The confusion matrix shows that MCNN is more powerful in classifying positive samples, which is crucial for the diagnosis of celiac disease.
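
To make the multi-scale idea concrete, the sketch below applies 1D convolutions with kernel sizes 4, 8, and 16 (16, 32, and 64 filters, stride 1, “same” zero padding) to one spectrum and applies LeakyReLU. It is an assumption-laden illustration in plain NumPy: the kernels are random rather than trained, the three convolutions are wired as parallel branches on the same input (the actual wiring is given in Fig. 8b and may differ), pooling is omitted, and the function names are hypothetical.

```python
import numpy as np

def conv1d_same(x, kernels):
    """Single-channel 1D convolution (cross-correlation), stride 1, 'same' zero padding.
    x: (length,), kernels: (n_filters, kernel_size) -> (n_filters, length)."""
    k = kernels.shape[1]
    left = (k - 1) // 2
    right = k - 1 - left
    xp = np.pad(x, (left, right))
    windows = np.lib.stride_tricks.sliding_window_view(xp, k)  # (length, k)
    return kernels @ windows.T

def leaky_relu(z, alpha=0.01):
    """LeakyReLU: pass positives through, scale negatives by alpha."""
    return np.where(z > 0, z, alpha * z)

rng = np.random.default_rng(0)
spectrum = rng.normal(size=500)          # hypothetical spectrum with 500 points

branches = []
for n_filters, kernel_size in [(16, 4), (32, 8), (64, 16)]:
    kernels = rng.normal(scale=0.1, size=(n_filters, kernel_size))  # untrained weights
    branches.append(leaky_relu(conv1d_same(spectrum, kernels)))

# With 'same' padding every branch keeps the input length, so the
# feature maps can be stacked along the channel axis.
features = np.concatenate(branches, axis=0)
print(features.shape)
```

Larger kernels see a wider window of the spectrum per output point, so broad bands and narrow peaks are picked up by different branches.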

Evaluation of deep residual network (ResNet)

ResNet, a powerful deep neural network structure, has been widely applied to disease assessment tasks30. Its residual learning design makes the network easier to train and enables deeper feature exploration in images. In disease assessment, ResNet can learn complex features and patterns in medical images, thereby improving the accuracy and robustness of disease diagnosis. The ResNet used in this experiment consists of multiple convolutional blocks, each including a convolutional layer and a batch normalization layer. It also incorporates multiple residual connection blocks, each containing two convolutional blocks and, where needed, a convolutional layer on the shortcut connection. A Dropout layer is added after each residual connection block to prevent overfitting (Fig. 8c). The ResNet model can capture deep features in images and identify potential pathological information. Its direct connections between layers enable better information transmission, alleviating the vanishing gradient problem and reducing the risk of overfitting. However, its performance on celiac disease spectral data is not superior to that of the convolutional neural networks. The ROC curve of ResNet is shown in Fig. 9.

Evaluation of deep residual shrinkage network (DRSN)

The Deep Residual Shrinkage Network (DRSN) is a deep learning model that is particularly suited to noisy features. It effectively handles noise and redundant information in spectra, enhancing its learning and feature extraction capabilities for disease features31. Built upon ResNet, DRSN sets a threshold for each channel and incorporates two fully connected layers. As shown in Fig. 8d, the number of output neurons of the second fully connected layer equals the number of input feature map channels, and each output passes through a sigmoid activation. DRSN demonstrates significant advantages in handling spectral data32, as its residual block structure facilitates deeper exploration of disease features in plasma spectra. Additionally, the shrinkage mechanism effectively suppresses noise in spectral data, enhancing the model’s robustness.
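
The channel-wise shrinkage step can be sketched as below. This is a simplified NumPy illustration, not the authors' implementation: it assumes the commonly described DRSN design in which each channel's threshold is the sigmoid output scaled by that channel's mean absolute activation, followed by soft thresholding; batch normalization is omitted, the weights are random, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channelwise_shrinkage(x, W1, b1, W2, b2):
    """x: feature map of shape (channels, length).
    Returns the soft-thresholded map and the per-channel thresholds."""
    abs_mean = np.abs(x).mean(axis=1)               # global average of |x| per channel
    hidden = np.maximum(W1 @ abs_mean + b1, 0.0)    # first fully connected layer + ReLU
    alpha = sigmoid(W2 @ hidden + b2)               # second FC layer: one value in (0, 1) per channel
    tau = alpha * abs_mean                          # assumed threshold rule (see lead-in)
    shrunk = np.sign(x) * np.maximum(np.abs(x) - tau[:, None], 0.0)  # soft thresholding
    return shrunk, tau

rng = np.random.default_rng(0)
channels, length = 8, 100                           # hypothetical sizes
x = rng.normal(size=(channels, length))
W1 = rng.normal(scale=0.1, size=(channels, channels))
W2 = rng.normal(scale=0.1, size=(channels, channels))
b1 = np.zeros(channels)
b2 = np.zeros(channels)

shrunk, tau = channelwise_shrinkage(x, W1, b1, W2, b2)
out = x + shrunk                                    # residual (identity shortcut) connection
print(tau.round(3))
```

Soft thresholding sets activations with magnitude below the threshold to zero and shrinks the rest toward zero, which is how small, noise-like responses are suppressed while the residual shortcut preserves the original signal path.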

By training on celiac disease and healthy control plasma samples, DRSN can learn spectral features related to the disease, achieving precise extraction of potential biomarkers. The design of its network structure allows information to flow between different levels, enabling the model to better capture complex relationships in plasma spectra. Moreover, DRSN’s shrinkage mechanism helps reduce redundant information, improving the signal-to-noise ratio of spectral signals. The ROC curve of DRSN is shown in Fig. 9.

Classification results

Validation results for the CNN, MCNN, ResNet, and DRSN models show that the CNN and MCNN models perform well on the training and validation sets, with accuracies reaching 92.31% and 90.76%, respectively. However, the CNN’s specificity is suboptimal at only 85.71%. ResNet exhibits the poorest performance across all metrics, with an accuracy of only 80.23% and a specificity of only 68.57%. All metrics have associated 95% confidence intervals. In contrast, the DRSN model outperforms CNN, MCNN, and ResNet in accuracy, specificity, sensitivity, and precision. A crucial factor is DRSN’s enhanced generalization capability in the presence of noise. We also compared these models with SVM and KNN. To increase the credibility of the experimental results, this study calculated five evaluation metrics: the area under the receiver operating characteristic (ROC) curve (AUC), accuracy, sensitivity, specificity, and precision. Table 3 presents the evaluation metrics on the test sets for the four models after five-fold cross-validation.

Table 3 Raman spectral model classification results.
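
For reference, the five metrics named above follow from the confusion matrix and the ranking of the predicted scores. The sketch below uses only the Python standard library; the labels and scores are made-up toy values, not results from this study, and the function names are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and precision from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),
    }

def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 4 positive and 6 negative samples with hypothetical scores
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.35, 0.6, 0.75]
y_pred = [int(s >= 0.5) for s in y_score]   # 0.5 decision threshold (an assumption)

metrics = binary_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, y_score)
print(metrics)
```

In a five-fold cross-validation setting, these functions would be applied to each held-out fold and the per-fold values summarized.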


