Patient data and tumor specimens
We collected the data of four cohorts of patients with BRAFV600E/K-mutated melanoma treated with MAPKi: (1) Blateau et al. (vemurafenib, vemurafenib and cobimetinib, dabrafenib, dabrafenib and trametinib; n = 53)6; (2) Catalanotti et al. (vemurafenib, vemurafenib and cobimetinib, dabrafenib, dabrafenib and trametinib; n = 66)8; (3) Louveau et al. (vemurafenib and cobimetinib, dabrafenib and trametinib; n = 20)5; (4) Van Allen et al. (vemurafenib, dabrafenib; n = 45)27. Pretreatment tumors were from patients with stage IV or unresectable stage III BRAFV600E/K melanoma. 86% of patients received MAPKi as first-line therapy (Supplementary Table 8). Clinical responses to therapy were determined using RECIST criteria28. The use of data from these patient cohorts has been approved by the ethics committee of the University of Montpellier, France (favorable advice UM 2022-009bis). As we did not generate new data for this study, no additional ethics approval was required. For each patient, tumor tissues were collected, stored as formalin-fixed, paraffin- embedded (FFPE), or frozen samples, and processed. Somatic single-nucleotide variants and small insertions or deletions were considered as mutations. For the Van Allen et al. and Catalanotti et al. cohorts, sequencing was performed on all exons of the sequenced genes. For the Blateau et al. and Louveau et al. cohorts, sequencing was performed on some exons of the sequenced genes, with partial overlap between both cohorts6,29. In the training/test set (Van Allen et al. and Catalanotti et al. cohorts) on the one hand, and the validation set (Blateau et al. and Louveau et al.) on the other hand, only common genes were taken into account to maximize overlap and avoid missing values. Table 1 summarizes the patient demographic/clinical features, outcomes, and DNA sequencing methods/results. Data have been made available at the following link: https://www.ircm.fr/crcm_en/melanodb.html.
Genomic data pre-processing
We used FATHMM30, a software that predicts the functional consequences of coding and non-coding variants detected by genome sequencing. We used a binary classification to categorize genes based on the FATHMM results. If a somatic mutation detected in a gene was predicted to be a cancer-promoting/driver mutation by FATHMM, the reported value was 1, otherwise 0. We then selected the genes mutated in at least 5% of the patients (Fathmm dataset). Alternatively, we used Cscape31, that predicts the oncogenic status (i.e. disease-driver or neutral) of somatic point mutations in the coding and non-coding regions of the cancer genome. As done with FATHMM, we employed the obtained information to define a feature set con- sisting of binary indicators and selected the genes mutated in at least 5% of the patients (Cscape dataset). We also defined variants for which only high confidence predictions by Cscape were considered (CscapeHigh dataset).
To select genes involved in melanoma signaling pathways, we used the curated melanoma pathway from the KEGG database32,33,34 and the curated melanoma pathway model from the Virtual Melanoma Cell (VCELLS) project35. We only considered the 70 proteins of the KEGG melanoma pathway (KEGG-Mel dataset), and the 119 proteins of the VCELLS melanoma pathway (VCELLS-Mel dataset), both represented by HGNC gene symbols.
As an alternative to the previously described binary indicator approach, we constructed features by statistical overrepresentation analysis (hyper-geometric test) of mutated genes in all KEGG pathways (MRKEGG dataset) and in the KEGG melanoma (MRKEGG-Mel dataset) and VCELLS melanoma (MRVCELLS-Mel dataset) pathways36. We performed this analysis for each patient separately, and obtained one p-value per patient and per pathway. Because p-values span over several decades, we used the negative logarithm of this p-value as a feature value.
Survival analysis algorithms
Survival analysis algorithms make predictions of the survival time until an event of interest occurs, i.e. melanoma cancer progression, and estimate the survival probability at the estimated survival time. Ensemble models are combination of simple individuals models that together create a powerful new models.
Random survival forest (RSF) is an ensemble method for survival analysis that uses survival trees as its base model. It is an aggregation of multiple decision trees, each decision tree uses subsets of a dataset. At each branch of the tree, the algorithm randomly selects a subset of features to make a prediction. The trees keep splitting the data based on these random features selection, asking “yes” or “no” questions, and are train to find the best splits in the data until it reaches good predictions. Once the forest is built, a prediction combines the output of multiple decision trees. RSF is based on the bagging method in which random samples are repeatedly drawn from the input data to reduce the base model variance and avoid overfitting. However, an extra level of randomness is added to RSF while growing the trees. This random subset of features is used for splitting a node instead of using all features. This reduces the correlation among trees in the forest and improves the prediction performance. Unlike the decision trees used in popular Random Forest classifiers (described by Breiman), RSF employs survival trees and averages the ensemble cumulative hazard function of each tree for prediction37. Each node of the tree is split using the random subset of features to maximize the dissimilarity between child-nodes. The non-parametric Nelson-Aalen estimator is used to estimate the ensemble cumulative hazard function by taking the average of all cumulative hazard functions of each survival tree38,39,40.
Extra survival trees (EST), also known as extremely random survival forest, is a meta-estimator developed as an extension of the RSF method. This method fits several randomized survival trees, called extra-trees, and like RSF, uses averaging to improve the predictive accuracy and control overfitting. The extra level of randomness in extra survival trees comes from how splits are computed, which is different from the method used in RSF. Instead of taking the most discriminating thresholds at each node, thresholds for each feature are drawn randomly, and the best of these randomly generated thresholds is used for splitting the node. This approach tends to improve the model generalization by reducing the overall model bias and variance. The log-rank statistic is used as a splitting criterion for extra survival trees, like in RSF. In this study, we used the scikit-survival package that includes this method41.
Gradient boosting survival model (GBS) is an ensemble learning method that combines the predictions of multiple weak base learners to create a powerful overall model. The algorithm first creates a simple model (decision tree for example) to compute predictions from a given dataset. From predictions, we can observe the residuals between initial prediction and actual target values. Those differences help to quantify how far the initial prediction was from each target points. The next model is then specifically trained to improve predictions from upon the errors from the first model. Unlike random forest, GBS combines the base learners into a weighted ensemble model, giving extra weight to weak learners for correct predictions (refs. 39,42). This process repeats, and each new model adds a “boost” to improve the overall accuracy. In survival analysis, the loss function is derived from the Cox partial likelihood function and regression trees are used as base learners38. Specifically, the linear model in the partial likelihood function of the Cox model is replaced by an additive model, and the objective of the loss function is to maximize the log partial likelihood. Gradient boosting survival models are constructed sequentially in a greedy stagewise fashion.
Penalized models introduce regularization by adding penalty terms to the model equation. The Elastic Net penalty regularization method is particularly useful in scenarios with more features than observations. It combines the strengths of lasso (l1) and ridge (l2) penalties, by performing feature selection like lasso and by shrinking correlated features together as done in ridge regression43. The Cox proportional hazard (Cox PH) model estimates the effect of various features on the hazard function that describe the instantaneous risk of an event at a specific time t. In this case, Cox’s PH model and Elastic net penalty are combined to analyse survival data. The Elastic Net (EN-COX) acts as a regularizer by adding a penalties term to the Cox PH model’s objective function. The penalties prevent coefficients from becoming too large, essentially by shrinking them towards zero. In the Cox regression model, regularization is added to the log partial likelihood, resulting in the following equation:
$$\beta =\mathop{{\rm{argmax}}}\limits_{\beta }\left[l\left(\beta \right)-\lambda {P}_{a}\left(\beta \right)\right]$$
(1)
where l(β) is the log partial likelihood, λ ≥ 0 the regularization parameter and Pα(β) a penalty form. In the elastic net penalized Cox regression model, the penalty term is expressed as:
$$\lambda {P}_{a}\left(\beta \right)=\lambda \left(\alpha \mathop{\sum }\limits_{i=1}^{p}\left|{\beta }_{i}\right|+\frac{1}{2}\left(1-\alpha \right)\mathop{\sum }\limits_{i=1}^{p}{\beta }_{i}^{2}\right)$$
(2)
where p is the number of variables and α ∈ (0, 1] is the additional parameter that controls the amount of shrinkage that comes from the l1 and l2 penalties. Setting α = 1 corresponds to the lasso penalty.
Survival support vector machines (SVM) are supervised learning models initially used to maximize the margin between classes through features values and to find the separating hyperplane that minimizes misclassification between classes. In survival analysis, there is no separation between classes. Instead, the objective is to separate individuals based on their expected survival times. In our case SVMs are employed to learn a ranking based on all possible pairs of patients rather than a binary classification. More specifically, the SVM is trained to separate patients with shorter survival a lower rank from those with longer survival44. The observations on the margin of the decision hyperplane are known as support vectors. SVM can manage non-linear relationships between features and survival data using the kernel trick. Non-linear kernel functions map the input features into high-dimensional feature spaces where a linear ranking function functioned can be estimated. Survival analysis in with SVM can be tackled in two different ways: (a) by ranking samples according to their survival times, and (b) by using a regression approach in order to find a function that estimates the observed survival times as continuous outcomes.
Data pre-processing
To train the survival models, the input feature sets underwent various data pre-processing steps that included handling missing values, scaling numerical variables, and encoding categorical variables using one-hot encoding. The only missing values are for the clinical features LDH (40%), Best Overall Response (2%) and Type of drug (1%). For missing value imputation, we used Iterative Imputer with the Random Forest classifier as the imputation model. This strategy models each feature with missing values as a function of other features and uses the function to estimate values for imputation. As the features with missing values were categorical in nature, we opted for a Random Forest classifier. We fitted the imputer to the training dataset only and then applied it to transform the test dataset. Imputation was carried out in the outer loop of the nested cross-validation procedure, and a random seed was used to reproduce the same imputed values for all feature sets across repetitive trials. This helped to ensure the feature set comparability. The imputation was only performed on the training data and not on the test set. Hence, imputation performance did not impact the reported prediction performance.
As the clinical features included predominantly categorical variables, we used one-hot encoding to transform them into numerical variables. We scaled continuous variables to a range of [0, 1] using max-min normalization to ensure that the same scales were used for all input variables.
We employed a nested, stratified 5-fold cross-validation with an inner 3-fold cross-validation loop to tune hyperparameters, train and evaluate over the outer cross-validation loop the time-to-event models. The dataset was split into five outer folds, while balancing samples from different sources. In each outer fold, the pre- processed training set was further split into three inner folds for training and tuning hyper-parameters. We selected the model and hyperparameters with the highest concordance index across the inner folds to evaluate the outer test dataset. We repeated this procedure with ten random seeds for all survival models with all feature sets for reproducibility.
Evaluation of the model prediction performance
The concordance index was used as a metric to evaulate our models’ performance. The Concordance Index (C-index) proposed by Uno et al.45 is a widely used performance metric to compare survival models applied to right-censored data. The C-index value is defined by the proportion of concordant predictions and outcomes of all comparable pairs. It defines the number of correctly ordered pairs by models. The algorithm provided by scikit-survival doesn’t directly output a survival time as a prediction but rather a risk score. This risk score is a numerical value that indicates the likelihood of an event occurring sooner rather than later. The calculation of the risk score is specific to each used model.
For ensemble models such as the Random Survival Forest, the Gradient Boosting or the Extra survival trees, models build multiple decision trees and combine their predictions. The risk score is typically an average of the predictions from the individual trees or a weighted sum of the predictions of each decision tree in the case of Gradient Boosting.
For the Elastic-Net Regularized Cox model, the risk score is derived from the Cox model’s hazard ratio. It estimates the hazard ratio, which is the relative risk of an event occurring at a specific time point compared to a reference individual.
For the survival Support Vector Machine, the model learns hyperplanes that separate the observations based on their survival times. For predictions of new observations, the model calculates their distance from the decision boundary and this distance is used as a risk score.
Samples with a higher estimated risk score have a shorter actual survival time. Two samples are considered a comparable pair if they both experienced an event at two different times or if the one with a shorter observed survival time experienced an event, in which case the event-free subject “outlived” the other. Two samples are not comparable if they experienced events at the same time.
Other estimators have also been used to assess temporal performance. When predicting survival, the patients disease status is not fixed and can change over time. Consequently, performance measures, such as specificity and sensitivity, become time-dependent measures. We considered an estimator of the cumulative/dynamic area under the ROC curve (AUC) at specific time points to evaluate our models32,46. The function considers comparable pairs of instances [i.e. one instance experienced an event before time t (ti ≤ t) and the other at time ti > t, and calculates the AUC at discrete time points]. The time-dependent AUC for any specific survival time t can be calculated as follows:
$${AUC}\left(t\right)=P\left({\hat{t}}_{i} \,<\, {\hat{t}}_{j}\left|\right.{t}_{i} \,<\, t,\,{t}_{j} \,>\, t\right)=\frac{1}{{num}\left(t\right)}\sum _{i:{t}_{i} \,<\, t}\sum _{j:{t}_{j} \,>\, t}I\left({\hat{t}}_{i} \,<\, {\hat{t}}_{j}\right)$$
(3)
where t ∈ T is the set of all possible survival times, ti and tˆi represent the observed time and the predicted value respectively, num(t) is the number of comparable pairs at time t and I(.) is the indicator function39. Then, we can then determin how well a model can distinguish subjects who fail (i.e. disease progression) before given time (ti ≤ t) from subjects who fail after this time ti > t.
In survival analysis, the time-dependent Brier score (BS) is also a crucial metric for assessing the model calibration over time because it takes into account dynamic changes in survival probabilities. It quantifies the concordance between predicted and observed outcomes at various time points, offering a comprehensive evaluation of model’s calibration47. We consider that for each patient, at t = 0, we have information on a vector X of patient specific covariates. This is used to predict a time-to-event variable T. Let us suppose that the complete data (Ti, Xi), i = 1, …, n is available for n patients. We also consider that T is censored and we observe T˜i = min(Ti, Ci), δi = I(Ti ≤ Ci), where Ci are the (hypothetical) times under observation. Under the common assumption of random censorship (T and X are independent on C) the empirical Brier score is defined
as:
$$\begin{array}{l}{BS}\left(t\right)=\,\frac{1}{n}\,\mathop{\sum }\limits_{i=1}^{n}\left\{I\left({\widetilde{T}}_{i}\le {\rm{t}},\,{\delta }_{i}=1\right)\frac{{\left(0-\widetilde{\pi }\left(t|{X}_{i}\right)\right)}^{2}}{\hat{G}\left({\widetilde{T}}_{i}\right)}\right.\\\qquad\qquad\left.+I\,\left({\widetilde{T}}_{i} > t\right)\frac{{\left(1-\widetilde{\pi }\left(t|{X}_{i}\right)\right)}^{2}}{\hat{G}\left(t\right)}\right\}\end{array}$$
(4)
where πˆ(t | X) is the predicted probability of remaining event-free up to time point t given the feature vector X and Gˆ(t) is the probability of censoring weight, estimated by the Kaplan-Meier estimator.
Additionally, the integrated Brier score provides a concise measure of calibration across the entire time period:
$${IBS}=\,\frac{1}{{t}_{\max }}\mathop{\int}\limits_{0}^{{{t}_{\max}}}{BS}\left(t\right){dt}$$
(5)
We used the Wilcoxon signed-rank test to compare the performance of the survival models. This non- parametric test compares paired differences in the C-index, derived from the same sample, using a significance level of 0.05. We ensured reproducibility by using a consistent random state, resulting in the same training and test datasets in identical fold conditions for each model. Consequently, we could pair the C-indexes of two models. We then compared the performance of the top-performing models and feature sets. We also used a one-sided test, at 5% significance, to evaluate whether the median of the better-performing model was higher, effectively testing whether the highest-ranked model performance was significantly better. Finally, this test was also used to compare paired computed accuracies at 5 and 6 month with Clinical-only and MRVCELLS-Mel datasets. The test was performed on all combined relapse probability thresholds between 0.5 and 0.9, using a significance level of 0.05.
