--- title: "Variable importance in survival analysis: maximizing C-index" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Variable importance in survival analysis: maximizing C-index} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(survML) library(survival) library(dplyr) set.seed(72924) ``` ## Introduction Our notion of intrinsic variable importance uses the idea of the *oracle prediction function*. For a given predictiveness measure, the oracle prediction function is the function of the features that achieves optimal predictiveness. This optimality is defined with respect to a class of possible prediction functions. Generally, we consider this class to be more or less unrestricted; as a result, the oracle prediction function can take a complex form rather than being restricted to, say, an additive function of the features. For some predictiveness measures, such as the time-varying AUC, it is fairly straightforward to characterize the oracle prediction function (see the Appendix of the [overview vignette](https://cwolock.github.io/survML/articles/v2_variable_importance.html) for examples). A notable exception is the concordance index or C-index, a popular predictiveness measure in survival analysis. For a given prediction function $f$, the C-index measures the probability that, for a randomly selected pair of individuals, the individual who first experiences the event of interest has the higher value of $f$. Using $X$ to denote the feature vector and $T$ the outcome, the C-index $V$ of a prediction function $f$ under distribution $P_0$ can be written as \begin{align*} V(f, P_0) = P_0\left\{f(X_1) > f(X_2) \mid T_1 < T_2 \right\}. \end{align*} In general, it seems that the optimizer $f_0$ of the C-index is not available in closed form. For this reason, we take a direct optimization approach. ## Boosting the C-index We first note that the C-index can be written as \begin{align*} V(f, P_0) = P_0\left\{f(X_1) > f(X_2) \mid T_1 < T_2 \right\} = \frac{P_0\left\{f(X_1) > f(X_2), T_1 < T_2 \right\}}{P_0\left\{T_1 < T_2 \right\}}. \end{align*} Because the denominator does not involve $f$, we focus on maximizing the numerator, which we can write as $E_{P_0}\left[I\{f(X_1) > f(X_2)\}I(T_1 < T_2)\right]$. Maximization of this objective function with respect to $f$ is difficult because of the indicator function $I\{f(X_1) > f(X_2)\}$. Both Chen et al. (2013) and Mayr and Schmid (2014) have proposed optimizing a smooth approximation of the C-index, where the aforementioned indicator function is replaced by the sigmoid function $h_\sigma(s) = \{1 + \exp(s/\sigma)\}$, where $\sigma$ is a tuning parameter determining the smoothness of the approximation. Because $P_0$ is unknown, and because $T$ is subject to right censoring, we must optimize an *estimate* of the smoothed C-index. In `survML`, we implement optimization of a doubly-robust estimate of the smoothed C-index using gradient boosting. For details, see Wolock et al. (2025). We rely on the `mboost` gradient boosting `R` package, which allows gradient boosting for various types of base learners $f$, including trees, additive models, and linear models. The boosting tuning parameters `mstop` (number of boosting iterations) and `nu` (learning rate), along with the smoothness parameter $\sigma$ are all selected via cross-validation. In our experience, we have found that the boosting procedure tends to be relatively insensitive to the choice of $\sigma$. As with most gradient boosting algorithms, it is generally advisable to set a small learning rate `nu` and use cross-validation to select the number of iterations `mstop`; however, the computational cost of a large `mstop` can be quite large. Another computational consideration is the calculation of the smoothed C-index estimate and its gradient, both of which involve a double sum. We allow the user to subsample observations for the boosting procedure, using the `subsample_n` parameter, which can greatly decrease computation time without a substantial loss in performance. ## Example: Predicting recurrence-free survival time in cancer patients As in the variable importance [overview vignette](https://cwolock.github.io/survML/articles/v2_variable_importance.html), we consider the importance of various features for predicting recurrence-free survival time using the `gbsg` dataset from the `survival` package. For illustration, we look at the importance of tumor-level features (size, nodes, estrogen receptor, progesterone receptor, and grade) relative to the full feature vector. ```{r} data(cancer) ### variables of interest # rfstime - recurrence-free survival # status - censoring indicator # hormon - hormonal therapy treatment indicator # age - in years # meno - 1 = premenopause, 2 = post # size - tumor size in mm # grade - factor 1,2,3 # nodes - number of positive nodes # pgr - progesterone receptor in fmol # er - estrogen receptor in fmol # create dummy variables and clean data gbsg$tumgrad2 <- ifelse(gbsg$grade == 2, 1, 0) gbsg$tumgrad3 <- ifelse(gbsg$grade == 3, 1, 0) gbsg <- gbsg %>% na.omit() %>% select(-c(pid, grade)) time <- gbsg$rfstime event <- gbsg$status X <- gbsg %>% select(-c(rfstime, status)) # remove outcome # find column indices of features/feature groups X_names <- names(X) tum_index <- which(X_names %in% c("size", "nodes", "pgr", "er", "tumgrad2", "tumgrad3")) ``` We again use the `vim()` function. Compared to estimating time-varying AUC importance, as in the [overview vignette](https://cwolock.github.io/survML/articles/v2_variable_importance.html), there are several key differences to keep in mind. First, the C-index predictiveness measure does not involve a `landmark_time`, so this argument is no longer relevant. However, in order to ensure that the C-index is identified under right censoring, we must choose a `restriction_time`. In essence, this is the value $\tau$ for which events that occur after $\tau$ are not considered in computing the C-index. There is not a data-driven way to select the `restriction_time`, but simulations have shown that values of `restriction_time` for which ~10% of individuals are still at-risk for experiencing an event perform reasonably well. We choose 2000 days for the `restriction_time` in this example. Second, the procedure for estimating nuisance parameters is slightly different for the C-index, since the large and small oracle prediction functions are no longer estimated using `SuperLearner()`, but rather the gradient boosting procedure described above. The control parameters relevant to gradient boosting are instead: * `V`: Number of folds for cross-validation selection of tuning parameters. * `tuning`: Whether or not to tune the boosting parameters, or simply use the (single) user-provided value of each parameter without tuning. * `subsample_n`: Size of subsample for computation of boosting objective function and gradient. Subsampling proportions of 1/4, 1/3, and 1/2 all perform well in simulations, with somewhat larger bias and variance for smaller subsample proportions at smaller overall sample sizes. * `boosting_params`: Named list of parameters for the actual gradient boosting procedure, including `mstop`, `nu`, `sigma`, and `learner` (which base learner from `mboost` to use; options are `glm`, `gam`, and `tree`). Note that we provide multiple values of `mstop` and set `tuning = TRUE` --- this will trigger a cross-validation procedure (in this case, with `V = 2` folds) to select the estimated optimal `mstop` among the two provided values. ```{r} restriction_time <- 2000 output <- vim(type = "C-index", time = time, event = event, X = X, restriction_time = 2000, large_feature_vector = 1:ncol(X), small_feature_vector = (1:ncol(X))[-as.numeric(tum_index)], conditional_surv_generator_control = list(SL.library = c("SL.mean", "SL.glm"), V = 2, bin_size = 0.5), large_oracle_generator_control = list(V = 2, tuning = TRUE, subsample_n = 300, boosting_params = list(mstop = c(100, 200), nu = 0.1, sigma = 0.1, learner = "glm")), small_oracle_generator_control = list(V = 2, tuning = TRUE, subsample_n = 300, boosting_params = list(mstop = c(100, 200), nu = 0.1, sigma = 0.1, learner = "glm")), approx_times = sort(unique(stats::quantile(time[event == 1 & time <= 2000], probs = seq(0, 1, by = 0.025)))), cf_fold_num = 2, sample_split = FALSE, scale_est = TRUE) output$result ``` ## References The survival variable importance methodology is described in Charles J. Wolock, Peter B. Gilbert, Noah Simon and Marco Carone. ["Assessing variable importance in survival analysis using machine learning."](https://doi.org/10.1093/biomet/asae061) *Biometrika* (2025). Other references: Yifei Chen, Zhenyu Jia, Dan Mercola and Xiaohui Xie. ["A Gradient Boosting Algorithm for Survival Analysis via Direct Optimization of Concordance Index."](https://doi.org/10.1155/2013/873595) *Computational and Mathematical Methods in Medicine* (2013). Andreas Mayr and Matthias Schmid. ["Boosting the Concordance Index for Survival Data – A Unified Framework To Derive and Evaluate Biomarker Combinations."](https://doi.org/10.1371/journal.pone.0084483) *PLoS One* (2014).