Breast cancer can be detected in peripheral blood using machine learning via T-cell receptor repertoire

This study investigated the relationship between T-cell receptor (TCR) repertoire in peripheral blood and tumor conditions by comparing blood samples from breast cancer (BC) patients with healthy donors (HDs). Using machine learning techniques, a repertoire-based classification model has been developed that allows accurate identification of previously invisible clinical status of samples. The main purpose of the model is to distinguish blood samples from BC patients and HD (Figure 1).

**Figure 1: Overview of workflows for sample collection, processing, and machine learning analysis.**

Model development follows the methodological approach outlined in the Methods section, and comprehensive results are recorded in Supplementary Table 1 of the Dome Report. All findings are based on data that have undergone appropriate subsampling procedures, as detailed in the Methods section.

Quality control assessment of TCR repertoire data

Several important repertoire characteristics were examined to assess the quality of the data and ensure comparability between groups. The percentage of unproductive sequences (0.09 + – 0.0004) is consistent with previously published datasets from the lab, indicating reliable sequences and annotations. HeatMaps was used to visualize the use patterns of V and J genes. This showed a uniform distribution in all samples with no observable bias between the BC and HD groups (Supplementary Fig. 1A). Furthermore, analysis of the length distribution of CDR3 revealed identical profiles in both cohorts, as shown in the bar plot (Supplementary Fig. 1b). These quality control assessments support the overall reliability of the dataset and ensure that the input features used in downstream machine learning analyses are not confused by technical artifacts or group-specific sequence inconsistencies.

Chronotype quantification and diversity analysis

Initial analysis quantified unique chronotypes in all samples (Supplementary Fig. 1C) revealing statistically significant differences between the BC and HD groups. Specifically, groups include the number of unique sequences (p-value <0.003) and the rare chronotype (p-Value <0.001) (Supplementary Fig. 1d). However, no significant differences were detected in the most abundant chronotypes (Supplementary Fig. 1E).

Further studies investigated diversity indicators such as Gini Diversity, Inverse Simpson Index, and True Diversity (Supplementary Figures 1F, 1G, 1H). These analyses demonstrated statistically significant variation between the BC and HD groups across all three metrics (p-Value <0.005, 0.0007, 0.005, respectively). Despite these findings, the differences were not sufficient for accurate sample classification. This is the main goal. The diversity of the Gini-Simpson Index showed no significant differences between groups (Supplementary Fig. 1i, p-value <0.66).

Evaluation of HLA distribution as a potential confounder

Applying hlaguessr¹⁵ The dataset revealed no significant differences in predicted HLA allele distribution between the BC and HD groups. These findings suggest that HLA composition is unlikely to explain the patterns observed with public TCR chronotypes between the two groups. The estimated HLA profiles were visualized in heatmap (Supplementary Fig. 1J), showing comparable distributions across groups, supporting the conclusion that HLA types are not confounders of the classification framework.

A monitored machine learning approach for sample classification.

Traditional statistical approaches have proved to be insufficient to distinguish BC samples from HD samples, prompting the adoption of supervised machine learning techniques. The dataset consisted of 94 columns (four samples were excluded, see Methods) and 1,156,548 rows. Each column corresponds to a sample (49 HD samples and 45 BC samples) with rows depicting the individual TCR sequence frequencies. In particular, 1,014,799 of these sequences were observed only in a single sample (private clone) and left 141,749 sequences displayed in two or more samples (public clones). This model was trained on randomly selected subsamples of data (see Randomly selected columns, see Dome Models) and tested on previously invisible samples (20% of data) to validate the predictive capabilities of the model.

The training set was further divided into sub-training and validation sets. To facilitate understanding of the workflow, an overview of the data partitions and training processes is shown in Figure 2 and Supplementary Figure 2. The feature selection methodology used works by selecting TCRS for the model based on the public chronotype, including only the general ones across the sample. Starting with the most frequent clones of the top 600 in the sub-training set, we narrowed this down to 10 key features (clones) using a selection from the models (SFM). Using these 10 features, we trained eight different models and evaluated performance on a validation set to select the best performance model. To ensure robustness and reliability of the model selection process, we repeated the entire procedure and selected 100 times, each using a different split of training data. The results of these 100 iterations of the model selection process showed that all three most frequently selected models were boost models, Adaboost (ADAB), gradient boost machine (GBM), and XGBoost (XGB). These three models were selected in 44 of the 100 iterations, all using LR as the estimator for the SFM. Based on this consistency, I decided to proceed with the XGB model. Furthermore, we analyzed 10 features selected in each of these 44 iterations and found that only 16 unique features were repeatedly selected. As a result, we continued these 16 unique features for further analysis.

**Figure 2: A flow chart of an overview of the workflow for model development and evaluation.**

Based on previous experiences (other work in the lab), we added another step to feature selection. In this step, we refine our set of 16 features and narrow it down to the optimal subset of 10 features. The training data was split into sub-training and validation sets, and the SFM algorithm was applied using an LR estimator to reorganize the XGB model. This procedure was repeated for 10 different splits of training data. See Supplementary Table 2 for the 10 features selected in each split. For each of the 10 different data splits, the performance of the model was assessed by calculating the AUC scores of the validation set. This was followed by reorganizing the model across the training set (combining subtraining and validation) and assessing performance over multiple subsamplings of the test set. For each of the 10 splits, see Supplementary Table 3 for various performance metrics for the model for multiple subsampling of the test set. The best models were chosen based on their performance and given priority to models demonstrating optimal matching of AUC scores in both validation and test sets. The winning model achieved an AUC of 1.0 in the validation set and an average AUC of 0.96 across multiple subsamplings of the test set (Feature Set #7) (Fig. 3A). The confusion matrix of the winning model for the test set is shown in Figure 3b.

**Figure 3: Results of classification and feature selection.**

For more information about (1) estimators, see the Methods section for (2) Bayesian optimization, (3) training procedures, (4) validation, and (5) hyperparameter tuning using (1) testing methods.

The dome description (Supplementary Table 1) details the specific hyperparameters selected by BO for the model.

Identification of T cell subtypes via combinations of V(d)j

The identification of different T-cell subtypes can be achieved through the analysis of their unique V(D)J combinations. This determines the particular TCR sequence.¹⁶. We used the feature selection method described above to identify 10 CDR3 sequences: Cadhqnyggsqgnlif, calqggseklvf, caghdyklsf, calsaarssntgklif, cvvsdrgstlgrlyf, caatdswgklqf, caveetsgsrltf, caatqggseklvf. Among these, we determined that the CVVSDRGSTLGRLYF sequence is the product of V10 and J18 corresponding to the INKT cell, while the CavrdsnyQliW sequence is the product of V1-2 and J33 corresponding to the MAIT cell.

Database analysis of clone frequencies for selected sequences

Further testing of clone set revealed this in vdjdb¹⁷cavrdsnyqliw has 16 unique entries, caatdswgklqf has 5, calqggseklvf has 3, caaseygnklvf, caseyygnklvf, and caveetsgsrltf each have 2, and the remaining sequences have one or none (Supplementary Table 4).

With the MCPAS database¹⁸TCRS showed 391 entries for CVVSDRGSTLGRLYF, 279 entries for CavrdsnyQliw, two entries for Calqggseklvf, one entry for Caatdswgklqf, and no entries for other sequences (Supplementary Table 5).

TCR embedding and visualization using CVC¹⁹

The UMAP plot of embedded CVC revealed a distinct cluster of TCR sequences, suggesting that the model captures meaningful structures of the repertoire. Of the ten chronotypes identified as important in classification analysis, nine are part of different clusters (Fig. 3C), indicating that these sequences occupy different regions of the learned embedding space.

Source link