Enhancing breast cancer diagnosis through machine learning algorithms

Data collection

This analytical investigation utilized data sourced from the Motamed Cancer Institute, a clinical research center specializing in breast cancer in Tehran, Iran.

Breast cancer dataset: Initially, we gathered 300 records from individuals who had been referred to the research center over the preceding 3 years (2021–2024). Each record contained 6 features (see Table 1), all marked according to the device amp, and was sorted into one of two groups indicating the presence or absence of breast cancer. Of these records, 75% were found to have cancer cells. The clinical meaning of each feature in the context of breast cancer is described below:

HER2 receptor status: HER2 stands for human epidermal growth factor receptor 2, a protein that plays an active role in the growth and division of certain breast cancers. As a receptor, it takes part in the signal transduction that drives cell growth and division. When HER2 is overexpressed or its gene is amplified, this signaling promotes uncontrolled growth and division, and thereby the initiation and progression of breast cancer. HER2-positive breast cancers account for 15% to 20% of all breast cancer cases and tend to grow and spread more aggressively than HER2-negative ones. Historically, such cancers had a very poor prognosis, which changed radically with the advent of targeted therapy.

Table 1 Her2 scoring system.

Over the past 20 years, the development of more precisely targeted therapeutics has radically improved the prognosis of HER2-positive breast cancers. These cancers are more aggressive, but the drugs used to treat them, trastuzumab (Herceptin), pertuzumab (Perjeta), ado-trastuzumab emtansine (Kadcyla), and lapatinib (Tykerb), are extremely effective. Adjuvant trastuzumab regimens have been found to reduce the risk of recurrence and improve overall survival to such an extent that the prognosis of patients with HER2+ breast cancer is, in certain settings, now at least as good as, or even better than, that of patients whose disease lacks HER2 overexpression. These HER2-targeted therapies, combined with chemotherapy and, on occasion, radiation and surgery, are transforming one of the formerly most ominous variants of breast cancer into a more manageable type under today’s treatment paradigms²⁵. In our data, HER2 status was classified into four levels, namely 0, 1, 2, and 3, based on the results recorded in each patient’s pathobiology test. Figure 3 shows the frequency of each group.

Ki-67 proliferation rate: Ki67 is considered both a predictive marker of response to therapy and a prognostic marker in breast cancer. The Ki67 index, the percentage of tumor cells that stain positive for Ki67, is an important indicator of tumor behavior and prognosis. High Ki67 levels usually accompany higher tumor grade, more aggressive behavior, and worse prognosis, whereas low levels are generally associated with less aggressive tumors and better treatment outcomes. The predictive value of Ki67 therefore helps clinicians direct treatment. Because they grow quickly, high-Ki67 tumors tend to be more susceptible to certain chemotherapies, while low-Ki67 tumors may be more responsive to hormone therapies or other targeted treatments²⁶.

Estrogen receptors (ERs): Estrogen receptors are central to the development of most hormone receptor-positive breast cancers. Because they express estrogen receptors, ER-positive breast tumors can respond to estrogen, the naturally occurring hormone that drives growth and proliferation. In ER-positive breast cancer, estrogen binds to its receptor and turns on the signaling pathways that promote tumor growth.

ER status is also an important prognostic factor: in general, patients with ER-positive tumors have a better prognosis than those with ER-negative tumors. This favorable prognosis is explained by factors including the greater sensitivity of these tumors to hormone treatments, lower tumor grade, and slower growth. ER status also guides treatment by serving as a predictive indicator of therapy response. For ER-positive tumors, hormonal therapies use tamoxifen to block estrogen signaling or aromatase inhibitors to lower estrogen levels. Patients with ER-positive tumors therefore tend to benefit from these drugs, which effectively inhibit tumor growth and reduce the likelihood of recurrence²⁷.

Progesterone receptor: Expression of the progesterone receptor (PR) is a major biomarker in breast cancer, especially in hormone receptor-positive types. Like its counterpart, the estrogen receptor, PR plays an important role in the hormonal response of breast cancer. PR-positive breast malignancies express progesterone receptors and therefore respond to progesterone signaling in much the same way that ER-positive tumors respond to estrogen. When progesterone binds its receptor, it can stimulate the growth and proliferation of PR-positive breast cancer cells.

PR status is also a predictive factor in breast cancer, although possibly less directly than ER status. In general, tumors that are both ER and PR positive carry a better prognosis than those that are ER positive but PR negative. PR-positive status generally correlates with lower tumor grade, slower tumor growth, and a greater likelihood of response to hormone therapy. As with ER, PR status indicates how responsive a tumor is to hormone therapy: PR-positive tumors are most likely to benefit from treatments such as tamoxifen or aromatase inhibitors, which act on the tumor through hormonal pathways. A PR-positive tumor cell depends on progesterone signaling for its proliferation, which makes it more susceptible to treatments that disrupt this pathway.

In summary, PR contributes further to hormone responsiveness, prognosis, and prediction of response to therapy in hormone receptor-positive breast cancer. Assessing PR in addition to ER status supports the best choice of therapy and optimal outcomes for patients with breast cancer^28,29. The frequency and distribution of each of these variables are displayed in the violin plot (Fig. 4).

Neoadjuvant therapy: This variable records whether the patient received chemotherapy before surgery. In our data, almost half of the patients had a history of chemotherapy (Fig. 5).

Summarizing and analyzing data characteristics using descriptive statistics:

Descriptive statistics refers to a collection of instruments and methods employed in statistics to characterize, quantify, and analyze the attributes of a data set. These features facilitate our understanding of the data and enable us to identify the significant patterns (Tables 2 and 3).

Table 2 Type of variable.

Table 3 Descriptive statistics.

The distribution of each variable is also plotted to aid understanding (Fig. 6).

Preprocessing

In the first stage of data preparation, triple-negative cancers were removed. Then, based on the CDP number, a new binary column was created in which a positive CDP reading was coded as one (indicating the presence of cancer cells) and a negative reading as zero (indicating the absence of cancer cells). Based on the performance of the device and its makers’ documentation, readings above 370 are considered positive.
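The labelling rule can be sketched in a few lines. This is a minimal illustration, assuming the CDP readings are available as plain numbers. The function name, constant name, and example readings are hypothetical; only the 370 cut-off comes from the text.

```python
# Cut-off taken from the text: readings above 370 count as positive.
CDP_THRESHOLD = 370


def label_cdp(readings, threshold=CDP_THRESHOLD):
    """Return 1 for readings strictly above the threshold, otherwise 0."""
    return [1 if value > threshold else 0 for value in readings]


# Made-up readings, for illustration only (not taken from the study data).
example_readings = [120, 370, 371, 512]
labels = label_cdp(example_readings)
print(labels)
```

A reading of exactly 370 is coded as zero here, because the text describes only readings above 370 as positive.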

Mahalanobis distance

The Mahalanobis distance, an important measure in statistics and data science, estimates the distance between a point and a set of data points. It is computed using the covariance matrix of the data, so it accounts for the dispersion of the data. The formula for the Mahalanobis distance is shown below:

$$D(x, y) = \sqrt{{\left(x-y\right)}^{T}{S}^{-1}\left(x-y\right)}$$

Here, ‘x’ is the point whose distance from the set is computed, ‘S’ is the covariance matrix of the data, and ‘y’ is the reference point that represents the data set, typically its mean. In data science, the Mahalanobis distance is widely used for finding and eliminating outliers: computing this distance for every data point shows whether a point deviates from the normal pattern of the data. A data point can be labeled as an outlier and excluded from the model if its distance from the center of the data distribution exceeds the maximum allowable threshold³⁰.
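As a rough illustration of outlier screening with this distance, the sketch below computes the distance of every row from the sample mean with NumPy. The synthetic data, the planted outlier, and the cut-off of 3.0 are assumptions made for the example; the threshold used in the study is not stated in the text.

```python
import numpy as np


def mahalanobis_distances(X):
    """Distance of every row of X from the sample mean, using the sample covariance."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # The pseudo-inverse guards against a singular covariance matrix.
    cov_inv = np.linalg.pinv(cov)
    diff = X - center
    # Row-wise quadratic form (x - y)^T S^{-1} (x - y).
    squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(squared, 0.0))


# Synthetic data with one planted outlier (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [8.0, -8.0]

d = mahalanobis_distances(X)
threshold = 3.0  # assumed cut-off, not taken from the study
outliers = np.where(d > threshold)[0]
print(outliers)
```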

PCA

Principal Component Analysis (PCA) is a statistical dimensionality-reduction procedure that applies an orthogonal transformation to a dataset, possibly with correlated variables, to produce a set of uncorrelated variables known as the principal components. The main aim of PCA is to retain the components that explain the most variance in the data, so that the data can be analyzed in fewer dimensions and interpreted more easily. First, the data matrix X, consisting of n observations and p variables, is standardized. This step is crucial because variables measured on different scales would otherwise contribute unequally; standardization ensures that each variable is given equal weight. The standardized data matrix Z is obtained by centering each variable (subtracting its mean) and scaling it (dividing by its standard deviation):

$$Z_{ij}=\frac{{x}_{ij}-{\mu }_{j}}{{\sigma }_{j}}$$

where μ_j is the mean of the j-th variable and σ_j is its standard deviation.

Next, the covariance matrix C of the standardized data is computed to capture the relationships between the variables. The covariance matrix, which is a p × p symmetric matrix, is defined as:

$$C = \frac{1}{n - 1} \times \left( {Z^{T} \times Z} \right)$$

In this matrix, each element C_jk represents the covariance between variables j and k. To identify the directions of maximum variance, PCA performs an eigenvalue decomposition of the covariance matrix, which involves solving the eigenvalue equation:

$$C \times v = \lambda \times v$$

Here, v represents an eigenvector and λ is the corresponding eigenvalue. The eigenvalues provide insight into the variance captured by each principal component, while the eigenvectors indicate the directions in the original feature space along which this variance occurs.

The eigenvalues are sorted in descending order, and the top k eigenvectors corresponding to the largest eigenvalues are selected to form a new basis for the data. If the eigenvalues are denoted as λ₁, λ₂, …, λ_k, and the associated eigenvectors as v₁, v₂, …, v_k, the first principal component PC₁ can be expressed as:

$$PC_{1} = Z \times v_{1}$$

The transformation of the original standardized data into the new subspace defined by the principal components is achieved through the matrix multiplication:

$$Y = Z \times V_{k}$$

where Y is the transformed data matrix and V_k is the matrix whose columns are the selected eigenvectors. This transformation reduces the dimensionality of the data while preserving the essential variance.
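The steps above (standardize, form the covariance matrix, decompose it, project) can be sketched with NumPy as follows. This is an illustrative sketch and not the code used in the study. It assumes the sample standard deviation (`ddof=1`) for scaling, and the synthetic data and the function name `pca` are hypothetical.

```python
import numpy as np


def pca(X, k):
    """Project X onto its top-k principal components via eigendecomposition."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Standardize: subtract the mean and divide by the (sample) standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance matrix of the standardized data, C = Z^T Z / (n - 1).
    C = (Z.T @ Z) / (n - 1)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    V_k = eigenvectors[:, order[:k]]
    Y = Z @ V_k
    return Y, eigenvalues, V_k


# Synthetic data: two strongly correlated columns and one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

Y, eigenvalues, V_k = pca(X, k=2)
explained = eigenvalues / eigenvalues.sum()
print(Y.shape, explained)
```

Because the data are standardized, the eigenvalues sum to the number of variables p, so `explained` gives the share of total variance captured by each component.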

PCA has numerous applications across various fields, such as image processing, bioinformatics, and finance, where it facilitates the identification of underlying patterns, reduces noise, and enables the exploration of high-dimensional datasets. By capturing the most critical information in fewer dimensions, PCA provides a powerful tool for researchers and analysts seeking to derive meaningful insights from complex data^31,32,33.

Figure 7 illustrates the distribution of the data analyzed in this study on a PCA plot prior to balancing.

SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is a widely used and effective approach for machine learning datasets that exhibit class imbalance. Class imbalance occurs when one class, typically the minority class, is under-represented relative to the others, which biases the learning model. SMOTE balances the class distribution by generating synthetic instances of the minority class. For each selected minority-class instance, new samples are produced along the line segments connecting it to any or all of its k nearest minority-class neighbours. Because the new samples are generated in feature space, they improve the representation of the minority class without introducing bias to the majority class³⁴.

SMOTE reduces issues related to imbalanced datasets by producing a larger and more balanced training set for machine learning methods. Generating synthetic data, rather than duplicating existing minority samples, helps mitigate overfitting to the minority class and improves the model’s generalization ability. However, this method may not be appropriate for all datasets, particularly when the minority and majority classes overlap heavily or when the sample space is inadequately specified. Moreover, while SMOTE may be useful in some situations, it is essential to assess its efficacy and combine it with additional methods to improve the model’s robustness and precision in handling imbalanced datasets³⁵.
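The interpolation idea behind SMOTE can be sketched in plain Python. This is a simplified illustration of the technique, not the implementation used in the study. The function name, the toy points, and the parameter defaults are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square (illustrative only).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=10, k=2)
print(new_points)
```

Each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours, so the new points stay inside the region already occupied by the minority class.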

In this study, the initial number of individuals in the minority group was 76, whereas the majority group comprised 224 individuals. Following the application of the SMOTE technique, utilized for class balancing, the instances in both groups became equal, each containing 179 individuals (Fig. 8).

Machine learning algorithm

In our project, we carried out predictive analysis using several supervised machine learning algorithms (Fig. 9). The algorithms used were:

SVM

Support Vector Machines (SVMs) are a powerful class of supervised learning algorithms used for classification tasks. The primary objective of SVMs is to find the optimal hyperplane that separates data points of different classes in a high-dimensional feature space. The hyperplane is defined by the equation:

$$w \cdot x + b = 0$$

where w is the weight vector (normal vector to the hyperplane), x is the input feature vector, and b is the bias term.

The goal of SVM is to maximize the margin between the hyperplane and the closest data points from each class, known as support vectors. The margin is defined as the distance from the hyperplane to the nearest data points of either class. The optimization problem can be mathematically expressed as follows:

Minimize:

$$\frac{1}{2}\left\| w \right\|^{2}$$

Subject to the constraints:

$$y_{i} \left( {w \cdot x_{i} + b} \right) \ge 1,\;{\text{for}}\;{\text{all}}\;i$$

where y_i is the class label of the i-th training sample x_i, taking values of +1 or −1. The term ||w||² is the squared norm of the weight vector; because the margin width equals 2/||w||, minimizing ||w||² maximizes the margin.

To solve this constrained optimization problem, we can introduce Lagrange multipliers α_i for each constraint. The Lagrangian function L is formulated as:

$$L\left( {w,b,\alpha } \right) = \frac{1}{2}\left\| w \right\|^{2} - \sum\limits_{i} {\alpha_{i} \left[ {y_{i} \left( {w \cdot x_{i} + b} \right) - 1} \right]}$$

where α_i ≥ 0. The dual form of the optimization problem is obtained by maximizing the Lagrangian with respect to α while minimizing it with respect to w and b.

The dual problem can be formulated as:

Maximize:

$$\sum\limits_{i} {\alpha_{i}} - \frac{1}{2}\sum\limits_{i} {\sum\limits_{j} {\alpha_{i} \alpha_{j} y_{i} y_{j} \left( {x_{i} \cdot x_{j}} \right)}}$$

Subject to:

$$\sum\limits_{i} {\alpha_{i} y_{i}} = 0,\quad \alpha_{i} \ge 0$$

The dual formulation depends only on the Lagrange multipliers α and on dot products between data points, which is what allows the kernel trick to be applied³⁶.

SVMs can be extended to handle non-linear classification problems by using kernel functions. A kernel function K (x_i, x_j) computes the dot product in a transformed feature space without explicitly mapping the input data to that space. Common kernels include:

Linear Kernel: K (x_i, x_j) = x_i · x_j
Polynomial Kernel: K (x_i, x_j) = (x_i · x_j + c)ⁿ
Radial Basis Function (RBF) Kernel: K (x_i, x_j) = exp(−γ ||x_i − x_j||²)

Using kernels allows SVMs to create complex decision boundaries while maintaining computational efficiency³⁷.
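The three kernels listed above can be written directly from their definitions. The sketch below is illustrative. The function names and the default values for c, n, and γ are assumptions and are not settings reported in the study.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_kernel(a, b):
    return dot(a, b)


def polynomial_kernel(a, b, c=1.0, n=2):
    return (dot(a, b) + c) ** n


def rbf_kernel(a, b, gamma=0.5):
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v))
```

The RBF kernel equals 1 when the two points coincide and decays towards 0 as they move apart, with γ controlling how quickly.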

Once the optimal α values are determined, the decision function for a new data point x can be expressed as:

$$f\left( x \right) = {\text{sign}}\left( {\sum {\left( {\alpha_{i} y_{i} K \, \left( {x_{i} , \, x} \right)} \right)} + b} \right)$$
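To make the decision function concrete, the sketch below evaluates it for two hand-picked support vectors. The support vectors, labels, α values, and bias are invented for illustration and do not come from a trained model. A score of exactly zero is mapped to +1 here, which is a convention chosen for the example.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Evaluate f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    score = sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + b
    return 1 if score >= 0 else -1


# Hand-picked values for illustration only (not fitted to any data).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
bias = 0.0

near_negative = svm_decision((0.1, 0.1), support_vectors, labels, alphas, bias)
near_positive = svm_decision((1.9, 1.9), support_vectors, labels, alphas, bias)
print(near_negative, near_positive)
```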

The support vectors, which are the data points that lie closest to the decision boundary, play a critical role in defining the hyperplane. The SVM decision boundary is influenced solely by these support vectors, making SVMs both efficient and robust.

SVMs are known for their strong performance in high-dimensional spaces and their ability to generalize well to unseen data, particularly due to the maximization of the margin. However, they can be computationally intensive, especially with large datasets or when using non-linear kernels. Additionally, careful parameter tuning (e.g., choice of kernel and regularization parameters) is essential for optimal performance.

Despite these challenges, SVMs remain a powerful tool for classification tasks across various domains, including bioinformatics, image recognition, and text classification^38,39.

Random Forest

Random Forest is a widely used ensemble learning approach in supervised machine learning, employed for both regression and classification tasks. To improve accuracy and robustness, the method constructs several decision trees during training and combines their predictions. To introduce randomization and reduce overfitting, each decision tree in the forest is trained on a random subset of the features and a random sample of the training data. During prediction, the ensemble averages the trees’ predictions or takes a vote among them, producing a more accurate and stable model⁴⁰.

The Random Forest method can be described formally as follows. Given a training dataset (D) containing (N) samples and (M) features, the procedure starts by creating bootstrapped datasets (D_b), each including (N) samples randomly drawn with replacement from the original dataset. A decision tree (T_b) is grown for each bootstrapped dataset (D_b) by recursively splitting the data on attributes chosen from a random subset at each node. The final prediction is obtained by aggregating the predictions of all the trees: in regression tasks by averaging them, and in classification tasks by taking the mode (most common class). This ensemble method uses the combined output of many trees to reduce overfitting and improve performance beyond that of a single tree^41,42.
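The bootstrapping and majority-vote steps can be sketched as below. To keep the example short, each tree T_b is replaced by a one-split stump on a single randomly chosen feature, which is a deliberate simplification of a full decision tree. The toy data and all names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) samples with replacement (the dataset D_b)."""
    return [rng.choice(rows) for _ in rows]


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(sample, rng):
    """Stand-in for a tree T_b: one split on one randomly chosen feature."""
    feature = rng.randrange(len(sample[0][0]))
    threshold = sum(x[feature] for x, _ in sample) / len(sample)
    left = [y for x, y in sample if x[feature] <= threshold]
    right = [y for x, y in sample if x[feature] > threshold]
    fallback = majority([y for _, y in sample])
    left_label = majority(left) if left else fallback
    right_label = majority(right) if right else fallback
    return lambda x: left_label if x[feature] <= threshold else right_label


def fit_forest(rows, n_trees=25, seed=0):
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng), rng) for _ in range(n_trees)]


def predict(forest, x):
    """Classification: take the mode of the individual tree predictions."""
    return majority([tree(x) for tree in forest])


# Toy data: (features, label); small feature values belong to class 0.
rows = [((float(i), 2.0 * i), 0 if i < 5 else 1) for i in range(10)]
forest = fit_forest(rows)
print(predict(forest, (0.0, 0.0)), predict(forest, (9.0, 18.0)))
```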

Logistic regression

Logistic regression is a widely used statistical model for binary classification tasks, where the objective is to predict the probability that a given instance belongs to one of two classes, typically denoted as 1 (positive class) and 0 (negative class)⁴³.

This method is particularly effective in scenarios where the outcome is categorical and can be interpreted probabilistically.

At its core, logistic regression applies a logistic function, also known as the sigmoid function, to a linear combination of the input features. The mathematical formulation can be expressed as follows:

$$P\left( {y = 1|x} \right) = \sigma \left( z \right) = \frac{1}{1 + e^{-z}}$$

where:

P (y = 1 | x) represents the probability of the instance belonging to class 1 given the input features x.

z is defined as the linear combination of the input features and their corresponding coefficients (weights):

$$z = \beta_{0} + \beta_{1} \times x_{1} + \beta_{2} \times x_{2} + \dots + \beta_{n} \times x_{n}$$

In this equation:

β₀ is the intercept (bias term),

β₁, β₂, …, β_n are the coefficients associated with each input feature x₁, x₂, …, x_n.

The logistic function, represented by σ (z), maps any real-valued number into the range (0, 1), ensuring that the predicted probabilities are valid. The output of the logistic regression model can be interpreted as the probability that the instance belongs to the positive class⁴⁴.
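A minimal sketch of the sigmoid and the resulting probability is given below. The coefficient values are made up for illustration and are not estimates from the study.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, beta0, betas):
    """P(y = 1 | x) for intercept beta0 and coefficients betas."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)


# Illustrative coefficients only: z = -1 + 0.5 * 1 + 0.25 * 2 = 0.
p = predict_proba([1.0, 2.0], beta0=-1.0, betas=[0.5, 0.25])
print(p)
```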

Odds and log-odds

Logistic regression is closely related to the concept of odds and log-odds (logit). The odds of an event occurring is defined as the ratio of the probability of the event occurring to the probability of it not occurring:

$${\text{Odds}} = P\left( {y = 1|x} \right)/P\left( {y = 0|x} \right) = P\left( {y = 1|x} \right)/\left( {1 - P\left( {y = 1|x} \right)} \right)$$

Taking the natural logarithm of the odds gives the log-odds, also known as the logit:

$${\text{Logit}}\left( {P\left( {y = 1|x} \right)} \right) = \log \left( {P\left( {y = 1|x} \right)/\left( {1 - P\left( {y = 1|x} \right)} \right)} \right) = z$$

This relationship shows that the log-odds are linearly related to the input features:

$$\log \left( {P\left( {y = 1|x} \right)/\left( {1 - P\left( {y = 1|x} \right)} \right)} \right) = \beta_{0} + \beta_{1} \times x_{1} + \beta_{2} \times x_{2} + \dots + \beta_{n} \times x_{n}$$
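A small numeric check of these relationships, using an arbitrary probability of 0.75 chosen for illustration, might look like this:

```python
import math


def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds; the inverse of the logistic function (requires 0 < p < 1)."""
    return math.log(odds(p))


def inverse_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


p = 0.75
print(odds(p))                  # three-to-one odds
print(logit(p))                 # natural log of the odds
print(inverse_logit(logit(p)))  # recovers p up to rounding
```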

Estimation of coefficients

The coefficients β₀, β₁, …, β_n are estimated using training data through a method called maximum likelihood estimation (MLE). MLE finds the parameter values that maximize the likelihood of observing the given data under the model. The likelihood function for a logistic regression model can be expressed as:

$$L\left( \beta \right) = \prod\limits_{i = 1}^{N} {P\left( {y^{i} |x^{i} ;\beta } \right)}$$

where N is the number of observations, and yⁱ and xⁱ are the response variable and feature vector for the i-th observation, respectively. The coefficients are found by maximizing this likelihood function, often using iterative optimization algorithms such as gradient ascent or Newton–Raphson.

During the prediction phase, an instance is classified into class 1 if the predicted probability P (y = 1 | x) exceeds a predefined threshold (commonly set at 0.5). If the predicted probability is below this threshold, the instance is classified as belonging to class 0. This threshold can be adjusted based on the specific application requirements, such as balancing false positives and false negatives^44,45.
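The estimation and thresholding steps can be sketched with plain gradient ascent on the log-likelihood. This is an illustrative sketch, not the optimizer used in the study. The toy data, learning rate, and iteration count are assumptions.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Maximize the log-likelihood by gradient ascent; beta[0] is the intercept."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            p = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))
            error = yi - p  # gradient of the log-likelihood with respect to z
            grad[0] += error
            for j, v in enumerate(xi):
                grad[j + 1] += error * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta


def classify(x, beta, threshold=0.5):
    p = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
    return 1 if p > threshold else 0


# Toy, non-separable data (illustrative only).
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]

beta = fit_logistic(X, y)
print(beta, classify([0.0], beta), classify([5.0], beta))
```

Lowering `threshold` trades more false positives for fewer false negatives, which may be preferable in a screening setting.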

Decision tree

In machine learning, decision trees are an effective tool for both regression and classification problems. A decision tree divides the input space recursively into smaller regions according to the features of the data. To maximize the purity of the resulting subsets, the algorithm chooses at each stage the feature that best divides the data into homogeneous subsets. This procedure continues until a predetermined stopping point is reached, such as a maximum depth, or until further splitting doesn’t materially improve the model’s performance^46,47.

The mathematical goal of the decision tree method is to find, at each node, the split that yields the greatest purity or information gain. The Gini impurity is a common measure of how impure a node is: it estimates how likely a randomly chosen sample would be to be mislabelled if its label were drawn from the node’s class distribution. Entropy is another measure, quantifying the uncertainty or disorder in a set of samples. At each node, the algorithm picks the split that removes the most impurity or gains the most information⁴⁸.

The decision tree algorithm can be represented by the following formula:

$${\text{Split}}\;{\text{feature}} = \mathop {\arg \max }\limits_{{\text{feature}}} \left( {{\text{Impurity}}\;{\text{measure}}\left( x \right) - \sum\limits_{i = 1}^{k} {\frac{{\left| {x_{i} } \right|}}{\left| x \right|}} \,{\text{Impurity}}\;{\text{measure}}\left( {x_{i} } \right)} \right)$$

where x is the set of samples at the current node, x_i are the subsets resulting from splitting x based on a particular feature, and k is the number of subsets.

Until a stopping requirement is satisfied, the algorithm repeats this process, creating a tree structure where each leaf node denotes a class label (in classification) or a predicted value (in regression)^37,49.
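The split criterion can be sketched with Gini impurity as the impurity measure. The example below treats features as categorical and uses a four-row toy dataset. Both are simplifying assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, subsets):
    """Impurity(x) minus the size-weighted impurity of the subsets x_i."""
    n = len(parent)
    weighted = sum(len(s) / n * gini(s) for s in subsets)
    return gini(parent) - weighted


def best_feature(rows, labels):
    """Return (feature index, impurity decrease) of the best categorical split."""
    best, best_gain = None, -1.0
    for f in range(len(rows[0])):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[f], []).append(label)
        gain = impurity_decrease(labels, list(groups.values()))
        if gain > best_gain:
            best, best_gain = f, gain
    return best, best_gain


# Toy data: feature 0 separates the classes perfectly, feature 1 not at all.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
print(best_feature(rows, labels))
```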

Evaluation:

To assess the effectiveness of the models, we employed three evaluation metrics: accuracy, recall, and precision. Each metric is defined below.

$$\text{Accuracy} = \frac{TP+TN}{\text{Total Number of Predictions}}$$

$$\text{Precision} = \frac{TP}{TP+FP}$$

$$\text{Recall} = \frac{TP}{TP+FN}$$

TP: True Positive
FP: False Positive
TN: True Negative
FN: False Negative
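The three metrics can be computed from the confusion counts as sketched below. The label vectors are invented for illustration and are not results from the study. Returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "precision": safe_ratio(tp, tp + fp),
        "recall": safe_ratio(tp, tp + fn),
    }


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
metrics = evaluate(y_true, y_pred)
print(metrics)
```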

Ethics approval and consent to participate

All methods in this study were conducted in accordance with relevant guidelines and regulations and were approved by the National Ethics Committee of Iran (approval code: IR.TUMS.VCR.REC.1397.355). Informed consent was obtained from all participants (or their legal guardians), and the objectives and methods of the study were fully explained to them before the study began.
