Development of predictive models for breast cancer detection using radiomics-based mammography and machine learning | Egyptian Journal of Radiology and Nuclear Medicine

Machine Learning


One of the most important contributions of machine learning algorithms to breast cancer care is the early detection of breast lesions in mammographic images and the precise determination of whether they are benign or malignant without invasive procedures such as biopsy. In this study, we investigated the predictive performance of 10 common feature selection and machine learning classification models using radiomic features extracted from digital mammographic images in three classes: normal, benign, and malignant. Mammography images in CC and MLO views were collected from two different imaging centers. Differences in imaging protocols, image quality, and patient populations between centers can reduce bias, increase data diversity, and improve the generalizability of machine learning models. As a result, the model provides a more accurate diagnosis when faced with previously unseen data [40, 41].

The key findings of this study were examined in two aspects. In the first aspect, we investigated various feature selection approaches to determine the most relevant features, and PCA emerged as the most successful method. PCA is a simple, non-parametric method for extracting relevant information from complex datasets. It projects data from a high-dimensional space onto a low-dimensional space, reducing the dimensionality of the data while retaining the components that account for most of its variance. In mammographic images, these components capture patterns of tissue and structure that are important for the diagnosis of breast lesions. PCA can be useful in several ways. First, by keeping the principal components and discarding the minor ones, it reduces noise in the data and makes learning easier for the classifier. Second, because features of image data are often highly correlated, PCA converts correlated features into uncorrelated ones, which can improve model performance. Finally, the reduced feature space helps to limit overfitting and to improve the performance of the model when faced with new data.
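The decorrelation property described above can be checked on a toy example. The following is a minimal sketch, not the pipeline used in the study: it handles only the two-feature case, where the principal axes can be found in closed form, and the synthetic data and the names `feature_a` and `feature_b` are assumptions made for illustration.

```python
import math
import random


def covariance(u, v):
    """Sample covariance of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)


def pca_2d(xs, ys):
    """Centre two features and rotate them onto their principal axes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    sxx, syy, sxy = covariance(cx, cx), covariance(cy, cy), covariance(cx, cy)
    # Angle of the direction of maximum variance for a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(cx, cy)]
    pc2 = [-s * a + c * b for a, b in zip(cx, cy)]
    return pc1, pc2


# Synthetic, strongly correlated features (illustrative data only).
rng = random.Random(0)
feature_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
feature_b = [0.8 * a + 0.3 * rng.gauss(0.0, 1.0) for a in feature_a]

pc1, pc2 = pca_2d(feature_a, feature_b)
print("covariance before:", covariance(feature_a, feature_b))
print("covariance after: ", covariance(pc1, pc2))
print("variance of PC1 and PC2:", covariance(pc1, pc1), covariance(pc2, pc2))
```

The two input features are clearly correlated, while the two components are uncorrelated up to rounding error, and the first component carries most of the variance. With many radiomic features the same idea is applied through an eigendecomposition of the full covariance matrix, and only the leading components are kept.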

In the second aspect, multiple machine learning classifiers were used to detect breast lesions, of which the ET, RF, and GBT classifiers provided the best predictive performance. The ET classifier is an ensemble method that constructs multiple randomized decision trees during training. Each tree is built from the entire training dataset, with a random split selected at each node. This method improves generalizability by reducing variance and lowers the risk of overfitting by introducing randomness into model construction. RF similarly builds multiple decision trees, but differs by training each tree on a bootstrap sample of the data and considering a random subset of the features at each node split. This counteracts overfitting and increases the robustness of the model. GBT builds the ensemble sequentially: each new tree is trained to correct the errors made by the previous trees, and the model is optimized through gradient descent.
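The difference between the sampling schemes of RF and ET can be sketched in a few lines. This is an illustrative outline only, assuming the usual textbook descriptions of the two methods; the function names are hypothetical and no tree is actually grown here.

```python
import random


def bootstrap_rows(n_samples, rng):
    """RF-style training rows for one tree: sampled with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def all_rows(n_samples):
    """ET-style training rows for one tree: the entire training set."""
    return list(range(n_samples))


def candidate_features(n_features, max_features, rng):
    """Random subset of features considered at a single node split."""
    return rng.sample(range(n_features), max_features)


def random_threshold(values, rng):
    """ET-style cut point: drawn uniformly between the min and max of a feature."""
    return rng.uniform(min(values), max(values))


rng = random.Random(42)
rf_rows = bootstrap_rows(10, rng)         # may contain repeated row indices
et_rows = all_rows(10)                    # every row exactly once
features = candidate_features(8, 3, rng)  # 3 of 8 hypothetical features
threshold = random_threshold([0.2, 1.5, 0.9], rng)

print("RF rows:", rf_rows)
print("ET rows:", et_rows)
print("candidate features:", features)
print("random ET threshold:", threshold)
```

In the standard formulation of ET, a random cut point is drawn for each candidate feature and the best of these random splits is kept, whereas RF searches for the best cut point on each candidate feature.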

PCA+ET achieved the highest accuracy, AUC, and specificity, whereas the sensitivity of the RF+ET and LGR+ET models was acceptable. Following ET, PCA+RF showed high accuracy, AUC, and specificity. RF mitigates overfitting through random selection of features and the use of bootstrap samples. It also performs well on the imbalanced data that are often encountered in breast cancer diagnosis. GBT achieved strong results in accuracy, AUC, and sensitivity. This technique is based on the idea of boosting weak learners to create a strong learner. PCA+GBT uses sequential learning and gradient descent to minimize a loss function, which provides flexibility and improves the accuracy of the predictive model.
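The boosting idea can be illustrated with a deliberately small sketch. It is a simplification, not the GBT configuration used in the study: the weak learners are one-split regression stumps on a single feature, the loss is squared error (whose negative gradient is simply the residual), and the toy data and parameter values are assumptions.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (one threshold) on a 1-D feature."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the maximum so the right leaf is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Sequentially add stumps, each fitted to the residuals of the current model."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient equals the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


# Toy target that no single stump can fit: positive only in the middle range.
xs = [i / 9 for i in range(10)]
ys = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
predict, history = gradient_boost(xs, ys)
print("training MSE after round 1: ", history[0])
print("training MSE after round 20:", history[-1])
```

Each round lowers the training error, which shows how many weak stumps combine into a stronger model. For classification, practical implementations replace the squared error with a loss such as log loss, but the sequential residual-fitting structure is the same.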

Ensemble learning techniques outperform probabilistic models because of their ability to handle high-dimensional and complex data structures, such as radiomic features extracted from mammographic images. Probabilistic models such as naive Bayes and logistic regression rely on assumptions of feature independence and linear separability, respectively, and may not capture the complex relationships inherent in radiomic data. In contrast, ensemble methods combine multiple learners to address variance, bias, and overfitting, which increases flexibility in modeling nonlinear patterns. This ability to aggregate diverse decision boundaries makes ensemble approaches more resilient and accurate when processing the heterogeneous, high-dimensional datasets typical of breast cancer imaging studies.
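One way to see why aggregating learners helps is an idealized majority-vote calculation. The sketch below assumes that the individual classifiers are independent and equally accurate, which real trees in an ensemble are not, so it should be read as an upper-bound intuition rather than a description of the models in this study.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent voters is correct.

    Each voter is correct with probability p; n must be odd to avoid ties.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n // 2 + 1
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


for n in (1, 5, 11, 51):
    print(n, "voters:", round(majority_vote_accuracy(0.7, n), 4))
```

Under these assumptions, eleven voters that are each right 70% of the time reach a majority accuracy of about 0.92. Correlation between the learners reduces this gain, which is why RF and ET inject randomness to make their trees less alike.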

Many studies have been conducted to predict breast cancer using machine learning models. Previous studies have demonstrated that radiomic features facilitate the detection of lesions that are invisible to the human eye and allow findings to be quantified. Wang et al. [42] applied radiomics to detect suspicious BI-RADS 4 and 5 lesions in mammographic images and achieved sensitivity comparable to that of biopsy. Data were divided into benign and malignant categories. Feature selection was performed using Pearson correlation and Lasso methods, followed by classification using an SVM with a linear kernel. The authors emphasized the importance of combining data obtained from CC and MLO views to improve diagnostic accuracy, which is consistent with the approach taken in this study. The stability of the model was evaluated using 5-fold cross-validation, with the following results: an AUC of 0.915, a sensitivity of 98.7%, and a specificity of 36.7%. The current study achieved better results for both AUC and specificity. In the PCA+ET model, the AUC reached 0.996 and the specificity reached 0.985. Furthermore, the PCA+SVM model achieved an AUC of 0.948, and the LGR+SVM model showed a specificity of 0.960. In our study, the highest sensitivity, 0.963, was obtained with the PCA+GBT model; this is lower than the value reported by Wang et al. but still acceptable. Incorporating data from all three categories (normal, benign, malignant) improves the performance of the models for all three label types, which in turn affects the sensitivity and specificity results of the study.
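For reference, sensitivity and specificity can be computed from predicted and true labels as follows. This is a minimal sketch with made-up labels; it treats one class as positive and all others as negative (one-vs-rest), which is an assumption, since the text does not state how the three-class metrics were aggregated.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """One-vs-rest sensitivity and specificity for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # share of true positives that were detected
    specificity = tn / (tn + fp)  # share of true negatives that were cleared
    return sensitivity, specificity


# Illustrative labels only, not data from the study.
y_true = ["malignant"] * 4 + ["benign"] * 3 + ["normal"] * 3
y_pred = ["malignant", "malignant", "malignant", "benign",
          "benign", "malignant", "benign",
          "normal", "normal", "malignant"]

sens, spec = sensitivity_specificity(y_true, y_pred, positive="malignant")
print("sensitivity:", sens)  # 3 of 4 malignant cases detected -> 0.75
print("specificity:", spec)  # 4 of 6 non-malignant cases cleared
```

A model can trade one metric for the other, as the 98.7% sensitivity and 36.7% specificity reported by Wang et al. illustrate, so the two values are best read together.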

Mao et al. [43] applied four machine learning algorithms (SVM, LGR, KNN, and naive Bayes) to improve the accuracy of breast cancer diagnosis based on radiomic features extracted from mammographic images. Following 10-fold cross-validation, the LGR algorithm demonstrated the best performance in the test group, with an accuracy of 0.886, a specificity of 0.900, and a sensitivity of 0.867. In the present study, better results were obtained with the PCA+ET model, which achieved an accuracy of 0.960 and a specificity of 0.990. Additionally, the LGR+ET, Mi+ET, and Lasso+ET models achieved a sensitivity of 0.954. The highest sensitivities in our study were obtained with the PCA+GBT (0.963) and PC+GNB (0.957) models. Multiple feature selection methods were used to deliberately cover a wide range of selected features (29–96 features). This approach supplied the most relevant features to the model training algorithm and facilitated more accurate diagnostic results.
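The k-fold cross-validation procedure mentioned above can be outlined as follows. This is a generic sketch of index splitting only, with an assumed sample count; it does not reproduce the partitioning, stratification, or seeds of any of the cited studies.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=25, k=10))
for train, test in splits[:3]:
    print("train size:", len(train), "test indices:", sorted(test))
```

The model is fitted on each training part and scored on the matching test part, and the k scores are then averaged. For imbalanced classes, a stratified variant that preserves class proportions in every fold is usually preferred.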

Several studies have shown that, in addition to the radiomic features of mammography, including features derived from pathological testing, genetics, and patient demographic data improves the performance of machine learning models. For example, Rabiei et al. [44] employed 20 laboratory and patient demographic characteristics and 24 mammographic features to develop four machine learning models: RF, MLP, GBT, and GA. SMOTE was used to balance the dataset, an approach that was also adopted in the current study. Among these methods, RF showed the best diagnostic performance, reaching an accuracy of 80%, an AUC of 0.56, a sensitivity of 95%, and a specificity of 80%. GBT achieved an accuracy of 74%, an AUC of 0.59, a sensitivity of 82%, and a specificity of 86%. In our study, the ET, RF, and GBT models showed excellent diagnostic performance despite the lack of demographic and laboratory characteristics. Compared with the previous study, our RF model achieved higher accuracy (0.953), AUC (0.993), sensitivity (0.944), and specificity (0.977), and our GBT model achieved higher accuracy (0.938), AUC (0.988), sensitivity (0.963), and specificity (0.963).
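The core idea of SMOTE is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not the implementation used in either study; the function name `smote_like`, the toy points, and the parameter values are assumptions.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to its neighbour
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Four hypothetical minority-class samples with two features each.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
for point in new_points:
    print(point)
```

Because every synthetic point lies between two real minority samples, the new data stay inside the region already occupied by that class. Oversampling of this kind should be applied to the training folds only, so that synthetic points do not leak into the evaluation data.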

Identifying suitable feature selection methods is extremely important in the context of machine learning and breast cancer diagnosis. Introducing irrelevant features into the model generates noise and reduces accuracy. Al Tawil et al. [45] investigated the impact of a variety of feature selection techniques, including minimum redundancy maximum relevance (MRMR), PC, and Lasso, on various machine learning classification models, including SVM, LightGBM, RF, LGR, KNN, and naive Bayes. The WDBC dataset they used consists of 30 attributes. With 10-fold cross-validation, the MRMR+LightGBM, PC+LightGBM, and Lasso+LGR models used 15, 15, and 5 features and reached accuracies of 98%, 95%, and 96%, respectively. LightGBM, an ensemble technique, achieved the best performance. This is consistent with the results of the current study, which indicate that ensemble methods are the most effective in classifying mammography findings.
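A correlation-based filter is one of the simplest feature selection strategies. The sketch below shows one common variant, which drops a feature when it is highly correlated with a feature that has already been kept; it is not necessarily how the PC method was applied in the cited work or in this study, and the feature names, values, and threshold are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical radiomic features measured on five lesions.
features = {
    "area": [1.0, 2.0, 3.0, 4.0, 5.0],
    "perimeter": [2.1, 4.0, 6.2, 7.9, 10.1],  # nearly proportional to area
    "contrast": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(drop_correlated(features))  # ['area', 'contrast']
```

The redundant feature is removed and the two weakly related ones are kept. Methods such as MRMR extend this idea by also scoring how relevant each feature is to the class label.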

Further work is needed to address the limitations of this study. The limited number of patients included may have affected the training of the models and, in turn, the effectiveness and quality of their diagnoses. Additionally, the models may not generalize to other types of data. Only mammography images were examined in this study, and it may be beneficial to assess the performance of the best algorithms identified here on breast ultrasound and MRI. Furthermore, this study did not consider laboratory and demographic characteristics. Nevertheless, the study shows that radiomic features of the images can significantly enhance the diagnosis of breast cancer. Despite these limitations, it provides a valuable analytical comparison of the classification and feature selection methods employed in machine learning for breast cancer prediction and diagnosis, with the aim of promoting more effective and earlier treatment.


