| Title: | Tools for Flexible Survival Analysis Using Machine Learning |
|---|---|
| Description: | Statistical tools for analyzing time-to-event data using machine learning. Implements survival stacking for conditional survival estimation, standardized survival function estimation for current status data, and methods for algorithm-agnostic variable importance. |
| Authors: | Charles Wolock [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-3527-1102>), Avi Kenny [ctb] (ORCID: <https://orcid.org/0000-0002-9465-7307>) |
| Maintainer: | Charles Wolock <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.2.0.9000 |
| Built: | 2026-06-06 07:03:28 UTC |
| Source: | https://github.com/cwolock/survml |
Aggregate multiseed VIM results
aggregate_vim(result_list, agg_method, ci_grid, n_eff, alpha = 0.05)aggregate_vim(result_list, agg_method, ci_grid, n_eff, alpha = 0.05)
result_list |
List of result data frames return by the |
agg_method |
P-value aggregation method use to combine results from different seeds. Current options are |
ci_grid |
Grid of VIM values over which to construct a confidence interval by inverting a hypothesis test. The aggregation works by constructing
hypothesis tests (at level |
n_eff |
The effective sample size. Without sample-splitting, this is simply the sample size. With sample-splitting, this is the sample size divided by two (i.e., the size of each of the two halves of the data). |
alpha |
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05. |
Named list with the following elements:
agg_result |
Data frame giving results aggregated over seeds. |
agg_method |
P-value aggregation method used. |
n_seed |
Number of iterations (seeds) used to perform the VIM estimation procedure. |
vim_objects |
A list of |
Using doubly-robust gradient boosting to generate estimates of the prediction function that maximizes the C-index
boost_c_index( time, event, X, newX, S_hat, G_hat, V, approx_times, tuning, produce_fit = TRUE, subsample_n, boosting_params )boost_c_index( time, event, X, newX, S_hat, G_hat, V, approx_times, tuning, produce_fit = TRUE, subsample_n, boosting_params )
time |
|
event |
|
X |
|
newX |
|
S_hat |
|
G_hat |
|
V |
Number of cross-validation folds for selection of tuning parameters |
approx_times |
Numeric vector of length J2 giving times at which to
approximate C-index integral. Note that the last time in |
tuning |
Logical, whether or not to use cross-validation to select tuning parameters |
produce_fit |
Logical, whether to produce a fitted prediction function using the selected optimal parameters. |
subsample_n |
Number of samples to use for boosting procedure. Using a subsample of the full sample can greatly reduce runtime |
boosting_params |
Named list of parameter values for the boosting procedure. Elements of this list include |
Vector of predictions corresponding to newX
# This is a small simulation example set.seed(123) n <- 250 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 time <- pmin(T, C) event <- as.numeric(T <= C) # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") # Note that we do not use times beyond the 90th percentile of observed follow-up times approx_times <- c(0, unique(quantile(time, probs = seq(0, 0.9, by = 0.01)))) # estimate conditional survival functions at approx_times fit <- stackG(time = time, event = event, X = X, newX = X, newtimes = approx_times, direction = "prospective", bin_size = 0.1, time_basis = "continuous", surv_form = "PI", learner = "SuperLearner", time_grid_approx = approx_times, SL_control = list(SL.library = SL.library, V = 3)) # use boosting to estimate optimal (according to C-index) prediction function boosted_preds <- boost_c_index(time = time, event = event, X = X, newX = X, approx_times = approx_times, S_hat = fit$S_T_preds, G_hat = fit$S_C_preds, V = 3, tuning = TRUE, produce_fit = TRUE, subsample_n = 200, boosting_params = list(mstop = c(100, 200), nu = 0.1, sigma = 0.1, learner = "glm")) boosted_preds# This is a small simulation example set.seed(123) n <- 250 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 time <- pmin(T, C) event <- as.numeric(T <= C) # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") # Note that we do not use times beyond the 90th percentile of observed follow-up times approx_times <- c(0, unique(quantile(time, probs = seq(0, 0.9, by = 0.01)))) # estimate conditional survival functions at approx_times fit <- stackG(time = time, event = event, X = X, newX = X, newtimes = approx_times, direction = "prospective", bin_size = 0.1, time_basis = "continuous", surv_form = "PI", learner = "SuperLearner", time_grid_approx = approx_times, SL_control = list(SL.library = SL.library, V = 3)) # use boosting to estimate optimal (according to C-index) prediction function boosted_preds <- boost_c_index(time = time, event = event, X = X, newX = X, approx_times = approx_times, S_hat = fit$S_T_preds, G_hat = fit$S_C_preds, V = 3, tuning = TRUE, produce_fit = TRUE, subsample_n = 200, boosting_params = list(mstop = c(100, 200), nu = 0.1, sigma = 0.1, learner = "glm")) boosted_preds
Generate cross-fitted oracle prediction function estimates
crossfit_oracle_preds( time, event, X, folds, nuisance_preds, pred_generator, ... )crossfit_oracle_preds( time, event, X, folds, nuisance_preds, pred_generator, ... )
time |
|
event |
|
X |
|
folds |
|
nuisance_preds |
Named list of conditional event and censoring survival functions that will be used to estimate the oracle prediction function. |
pred_generator |
Function to be used to estimate oracle prediction function. |
... |
Additional arguments to be passed to |
Named list of cross-fitted oracle prediction estimates
Generate cross-fitted conditional survival predictions
crossfit_surv_preds(time, event, X, newtimes, folds, pred_generator, ...)crossfit_surv_preds(time, event, X, newtimes, folds, pred_generator, ...)
time |
|
event |
|
X |
|
newtimes |
Numeric vector of times on which to estimate the conditional survival functions |
folds |
|
pred_generator |
Function to be used to estimate conditional survival function. |
... |
Additional arguments to be passed to |
Named list of cross-fitted conditional survival predictions
Estimate a survival function under current status sampling
currstatCIR( time, event, X, SL_control = list(SL.library = c("SL.mean", "SL.glm"), V = 3), HAL_control = list(n_bins = c(5), grid_type = c("equal_mass"), V = 3), deriv_method = "m-spline", sample_split = FALSE, m = 5, eval_region = NULL, n_eval_pts = 101, eval_grid = NULL, alpha = 0.05, sensitivity_analysis = FALSE, copula_control = list(taus = c(-0.1, -0.05, 0.05, 0.1)) )currstatCIR( time, event, X, SL_control = list(SL.library = c("SL.mean", "SL.glm"), V = 3), HAL_control = list(n_bins = c(5), grid_type = c("equal_mass"), V = 3), deriv_method = "m-spline", sample_split = FALSE, m = 5, eval_region = NULL, n_eval_pts = 101, eval_grid = NULL, alpha = 0.05, sensitivity_analysis = FALSE, copula_control = list(taus = c(-0.1, -0.05, 0.05, 0.1)) )
time |
|
event |
|
X |
|
SL_control |
List of |
HAL_control |
List of |
deriv_method |
Method for computing derivative. Options are |
sample_split |
Logical indicating whether to perform inference using sample splitting |
m |
Number of sample-splitting folds, defaults to 5. |
eval_region |
Left and right endpoints of region over which to estimate the survival function. Alternatively, provide |
n_eval_pts |
Number of points in grid on which to evaluate survival function.
The points will be evenly spaced, on the quantile scale, between the endpoints of |
eval_grid |
Grid of time points at which to estimate the survival function. This is an alternative
in place of specifying |
alpha |
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05 |
sensitivity_analysis |
Logical, whether to perform a copula-based sensitivity analysis. Defaults to |
copula_control |
A named list of control parameters for the copula-based sensitivity analysis. This should be a named list. |
List of data frames giving results. If not performing a sensitivity analysis, a single data frame is returned; if performing a sensitivity analysis, a separate data frame will be returned for each value of the copula association parameter. The results data frames have columns:
t |
Time at which survival function is estimated |
S_hat_est |
Survival function estimate |
S_hat_cil |
Lower bound of confidence interval |
S_hat_ciu |
Upper bound of confidence interval |
Wolock C.J., et al. (2025). "Investigating symptom duration using current status data: a case study of post-acute COVID-19 syndrome."
## Not run: # This is a small simulation example set.seed(123) n <- 300 x <- cbind(2*rbinom(n, size = 1, prob = 0.5)-1, 2*rbinom(n, size = 1, prob = 0.5)-1) t <- rweibull(n, shape = 0.75, scale = exp(0.4*x[,1] - 0.2*x[,2])) y <- rweibull(n, shape = 0.75, scale = exp(0.4*x[,1] - 0.2*x[,2])) # round y to nearest quantile of y, just so there aren't so many unique values quants <- quantile(y, probs = seq(0, 1, by = 0.05), type = 1) for (i in 1:length(y)){ y[i] <- quants[which.min(abs(y[i] - quants))] } delta <- as.numeric(t <= y) dat <- data.frame(y = y, delta = delta, x1 = x[,1], x2 = x[,2]) dat$delta[dat$y > 1.8] <- NA dat$y[dat$y > 1.8] <- NA eval_region <- c(0.05, 1.5) res <- survML::currstatCIR(time = dat$y, event = dat$delta, X = dat[,3:4], SL_control = list(SL.library = c("SL.mean", "SL.glm"), V = 3), HAL_control = list(n_bins = c(5), grid_type = c("equal_mass"), V = 3), sensitivity_analysis = FALSE, eval_region = eval_region)$primary_results xvals = res$t yvals = res$S_hat_est fn=stepfun(xvals, c(yvals[1], yvals)) plot.function(fn, from=min(xvals), to=max(xvals)) ## End(Not run)## Not run: # This is a small simulation example set.seed(123) n <- 300 x <- cbind(2*rbinom(n, size = 1, prob = 0.5)-1, 2*rbinom(n, size = 1, prob = 0.5)-1) t <- rweibull(n, shape = 0.75, scale = exp(0.4*x[,1] - 0.2*x[,2])) y <- rweibull(n, shape = 0.75, scale = exp(0.4*x[,1] - 0.2*x[,2])) # round y to nearest quantile of y, just so there aren't so many unique values quants <- quantile(y, probs = seq(0, 1, by = 0.05), type = 1) for (i in 1:length(y)){ y[i] <- quants[which.min(abs(y[i] - quants))] } delta <- as.numeric(t <= y) dat <- data.frame(y = y, delta = delta, x1 = x[,1], x2 = x[,2]) dat$delta[dat$y > 1.8] <- NA dat$y[dat$y > 1.8] <- NA eval_region <- c(0.05, 1.5) res <- survML::currstatCIR(time = dat$y, event = dat$delta, X = dat[,3:4], SL_control = list(SL.library = c("SL.mean", "SL.glm"), V = 3), HAL_control = list(n_bins = c(5), grid_type = c("equal_mass"), V = 3), sensitivity_analysis = FALSE, eval_region = eval_region)$primary_results xvals = res$t yvals = res$S_hat_est fn=stepfun(xvals, c(yvals[1], yvals)) plot.function(fn, from=min(xvals), to=max(xvals)) ## End(Not run)
Generate estimates of conditional survival probability or conditional restrcited mean survival time using doubly-robust pseudo-outcome regression with SuperLearner
DR_pseudo_outcome_regression( time, event, X, newX, approx_times, S_hat, G_hat, newtimes, outcome, SL.library, V )DR_pseudo_outcome_regression( time, event, X, newX, approx_times, S_hat, G_hat, newtimes, outcome, SL.library, V )
time |
|
event |
|
X |
|
newX |
|
approx_times |
Numeric vector of length J2 giving times at which to approximate integral appearing in the pseudo-outcomes |
S_hat |
|
G_hat |
|
newtimes |
Numeric vector of times at which to generate oracle prediction function estimates. For outcome |
outcome |
Outcome type, either |
SL.library |
Super Learner library |
V |
Number of cross-validation folds, to be passed to |
Matrix of predictions corresponding to newX and newtimes.
# This is a small simulation example set.seed(123) n <- 250 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 time <- pmin(T, C) event <- as.numeric(T <= C) # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") approx_times <- c(0, sort(unique(time))) # estimate conditional survival functions at approx_times fit <- stackG(time = time, event = event, X = X, newX = X, newtimes = approx_times, direction = "prospective", bin_size = 0.1, time_basis = "continuous", surv_form = "PI", learner = "SuperLearner", time_grid_approx = approx_times, SL_control = list(SL.library = SL.library, V = 3)) # use DR pseudo-outcome regression to (robustly) estimate survival at t = 5 DR_preds <- DR_pseudo_outcome_regression(time = time, event = event, X = X, newX = X, newtimes = 5, approx_times = approx_times, S_hat = fit$S_T_preds, G_hat = fit$S_C_preds, outcome = "survival_probability", SL.library = SL.library, V = 3) DR_preds# This is a small simulation example set.seed(123) n <- 250 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 time <- pmin(T, C) event <- as.numeric(T <= C) # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") approx_times <- c(0, sort(unique(time))) # estimate conditional survival functions at approx_times fit <- stackG(time = time, event = event, X = X, newX = X, newtimes = approx_times, direction = "prospective", bin_size = 0.1, time_basis = "continuous", surv_form = "PI", learner = "SuperLearner", time_grid_approx = approx_times, SL_control = list(SL.library = SL.library, V = 3)) # use DR pseudo-outcome regression to (robustly) estimate survival at t = 5 DR_preds <- DR_pseudo_outcome_regression(time = time, event = event, X = X, newX = X, newtimes = 5, approx_times = approx_times, S_hat = fit$S_T_preds, G_hat = fit$S_C_preds, outcome = "survival_probability", SL.library = SL.library, V = 3) DR_preds
Generate cross-fitting and sample-splitting folds
generate_folds(n, V, sample_split, cf_folds = NULL)generate_folds(n, V, sample_split, cf_folds = NULL)
n |
Total sample size |
V |
Number of cross-fitting folds to use |
sample_split |
Logical, whether or not sample-splitting is being used |
cf_folds |
Cross-fitting folds, if already provided by user. |
Named list of cross-fitting and sample-splitting folds
Estimate conditional survival function nuisance parameters using survival stacking
generate_nuisance_predictions_stackG( time, event, X, X_holdout, newtimes, SL.library = c("SL.mean", "SL.glm", "SL.earth", "SL.gam", "SL.ranger"), V = 5, bin_size = 0.05, approx_times )generate_nuisance_predictions_stackG( time, event, X, X_holdout, newtimes, SL.library = c("SL.mean", "SL.glm", "SL.earth", "SL.gam", "SL.ranger"), V = 5, bin_size = 0.05, approx_times )
time |
|
event |
|
X |
|
X_holdout |
|
newtimes |
|
SL.library |
Super Learner library |
V |
Number of cross-validation folds, to be passed to |
bin_size |
Size of time bin on which to discretize for estimation
of cumulative probability functions. Can be a number between 0 and 1,
indicating the size of quantile grid (e.g. |
approx_times |
Numeric vector of times at which to approximate product integral or cumulative hazard interval. See stackG documentation. |
A list containing elements S_hat (conditional event survival function, corresponding to X_holdout and newtimes),
S_hat_train (conditional event survival function, corresponding to X and newtimes),
G_hat (conditional censoring survival function, corresponding to X_holdout and newtimes),
and G_hat_train (conditional censoring survival function, corresponding to X and newtimes)
Estimate oracle prediction function using DR gradient boosting
generate_oracle_predictions_boost( time, event, X, X_holdout, nuisance_preds, restriction_time, approx_times, V = 5, indx, tuning = FALSE, subsample_n = length(time), boosting_params = list(mstop = c(100), nu = c(0.1), sigma = c(0.01), learner = c("glm")) )generate_oracle_predictions_boost( time, event, X, X_holdout, nuisance_preds, restriction_time, approx_times, V = 5, indx, tuning = FALSE, subsample_n = length(time), boosting_params = list(mstop = c(100), nu = c(0.1), sigma = c(0.01), learner = c("glm")) )
time |
|
event |
|
X |
|
X_holdout |
|
nuisance_preds |
Named list of conditional survival function predictions with elements |
restriction_time |
Maximum follow-up time for calculation of the C-index.
Essentially, this time should be chosen such that the conditional survival function is identified at
this time for all covariate values |
approx_times |
Numeric vector of length J2 giving times at which to approximate C-index integral. |
V |
Number of cross-validation folds for selection of tuning parameters |
indx |
Numeric index of column(s) of |
tuning |
Logical, whether or not to use cross-validation to select tuning parameters |
subsample_n |
Number of samples to use for boosting procedure. Using a subsample of the full sample can greatly reduce runtime |
boosting_params |
Named list of parameter values for the boosting procedure. Elements of this list include |
A list containing elements f0_hat and f0_hat_train, the estimated oracle prediction functions for
X_holdout and X, respectively.
Estimate full oracle prediction function using DR pseudo-outcome regression
generate_oracle_predictions_DR( time, event, X, X_holdout, nuisance_preds, outcome, landmark_times, restriction_time, approx_times, SL.library = c("SL.mean", "SL.glm", "SL.earth", "SL.gam", "SL.ranger"), V = 5, indx )generate_oracle_predictions_DR( time, event, X, X_holdout, nuisance_preds, outcome, landmark_times, restriction_time, approx_times, SL.library = c("SL.mean", "SL.glm", "SL.earth", "SL.gam", "SL.ranger"), V = 5, indx )
time |
|
event |
|
X |
|
X_holdout |
|
nuisance_preds |
Named list of conditional survival function predictions with elements |
outcome |
Outcome type, either |
landmark_times |
Numeric vector of length J1 giving
landmark times at which to estimate VIM ( |
restriction_time |
Maximum follow-up time for calculation of |
approx_times |
Numeric vector of length J2 giving times at which to approximate integral appearing in the pseudo-outcomes |
SL.library |
Super Learner library |
V |
Number of cross-validation folds, to be passed to |
indx |
Numeric index of column(s) of |
A list containing elements f0_hat and f0_hat_train, the estimated oracle prediction functions for
X_holdout and X, respectively.
When the oracle prediction function is a conditional mean, the small oracle prediction function can be estimated by regressing the large oracle prediction function on the small feature vector. This function performs this regression using Super Learner.
generate_oracle_predictions_SL( time, event, X, X_holdout, nuisance_preds, outcome, landmark_times, restriction_time, approx_times, SL.library = c("SL.mean", "SL.glm", "SL.earth", "SL.gam", "SL.ranger"), V = 5, indx )generate_oracle_predictions_SL( time, event, X, X_holdout, nuisance_preds, outcome, landmark_times, restriction_time, approx_times, SL.library = c("SL.mean", "SL.glm", "SL.earth", "SL.gam", "SL.ranger"), V = 5, indx )
time |
|
event |
|
X |
|
X_holdout |
|
nuisance_preds |
Named list of conditional survival function predictions with elements |
outcome |
Outcome type, either |
landmark_times |
Numeric vector of length J1 giving
landmark times at which to estimate VIM ( |
restriction_time |
Maximum follow-up time for calculation of |
approx_times |
Numeric vector of length J2 giving times at which to approximate integral appearing in the pseudo-outcomes |
SL.library |
Super Learner library |
V |
Number of cross-validation folds, to be passed to |
indx |
Numeric index of column(s) of |
A list containing elements f0_hat and f0_hat_train, the estimated small oracle prediction functions for
X_holdout and X, respectively.
Repeat the VIM estimation procedure multiple times and aggregate the results, mitigating the additional randomness introduced by sample-splitting and cross-fitting.
multiseed_vim( n_seed, agg_method = "compound_bg", ci_grid, type, time, event, X, landmark_times = stats::quantile(time[event == 1], probs = c(0.25, 0.5, 0.75)), restriction_time = max(time[event == 1]), approx_times = NULL, large_feature_vector, small_feature_vector, conditional_surv_generator = NULL, conditional_surv_generator_control = NULL, large_oracle_generator = NULL, large_oracle_generator_control = NULL, small_oracle_generator = NULL, small_oracle_generator_control = NULL, cf_fold_num = 5, sample_split = TRUE, scale_est = FALSE, alpha = 0.05, verbose = FALSE )multiseed_vim( n_seed, agg_method = "compound_bg", ci_grid, type, time, event, X, landmark_times = stats::quantile(time[event == 1], probs = c(0.25, 0.5, 0.75)), restriction_time = max(time[event == 1]), approx_times = NULL, large_feature_vector, small_feature_vector, conditional_surv_generator = NULL, conditional_surv_generator_control = NULL, large_oracle_generator = NULL, large_oracle_generator_control = NULL, small_oracle_generator = NULL, small_oracle_generator_control = NULL, cf_fold_num = 5, sample_split = TRUE, scale_est = FALSE, alpha = 0.05, verbose = FALSE )
n_seed |
Number of iterations (seeds) to perform the VIM estimation procedure. These will be aggregated into a single result. |
agg_method |
P-value aggregation method use to combine results from different seeds.
Current options are |
ci_grid |
Grid of VIM values over which to construct a confidence interval by
inverting a hypothesis test. The aggregation works by constructing
hypothesis tests (at level |
type |
Type of VIM to compute. Options include |
time |
|
event |
|
X |
|
landmark_times |
Numeric vector of length J1 giving
landmark times at which to estimate VIM ( |
restriction_time |
Maximum follow-up time for calculation of |
approx_times |
Numeric vector of length J2 giving times at which to approximate integrals. Defaults to a grid of 100 timepoints, evenly spaced on the quantile scale of the distribution of observed event times. |
large_feature_vector |
Numeric vector giving indices of features to include in the 'large' prediction model. |
small_feature_vector |
Numeric vector giving indices of features to include in the 'small' prediction model. Must be a
subset of |
conditional_surv_generator |
A function to estimate the conditional survival functions of the event and censoring variables. Must take arguments
( |
conditional_surv_generator_control |
A list of arguments to pass to |
large_oracle_generator |
A function to estimate the oracle prediction function using |
large_oracle_generator_control |
A list of arguments to pass to |
small_oracle_generator |
A function to estimate the oracle prediction function using |
small_oracle_generator_control |
A list of arguments to pass to |
cf_fold_num |
The number of cross-fitting folds, if not providing |
sample_split |
Logical indicating whether or not to sample split. Sample-splitting is required for valid hypothesis testing of null importance and is generally recommended. Defaults to |
scale_est |
Logical, whether or not to force the VIM estimate to be nonnegative. |
alpha |
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05. |
verbose |
Whether to print progress messages. |
Using a larger value of n_seed will result in more stable results, at a greater computational cost.
Named list with the following elements:
agg_result |
Data frame giving results aggregated over seeds. |
agg_method |
P-value aggregation method used. |
n_seed |
Number of iterations (seeds) used to perform the VIM estimation procedure. |
vim_objects |
A list of |
Vovk V. and Wang R. (2020). "Combining p-values via averaging."
Wolock C.J., Gilbert P.B., Simon N., and Carone, M. (2025). "Assessing variable importance in survival analysis using machine learning."
# This is a small simulation example set.seed(123) n <- 100 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 time <- pmin(T, C) event <- as.numeric(T <= C) # landmark times for AUC landmark_times <- c(3) output <- multiseed_vim(n_seed = 2, agg_method = "compound_bg", ci_grid = seq(0, 1, by = 0.01), type = "AUC", time = time, event = event, X = X, landmark_times = landmark_times, large_feature_vector = 1:2, small_feature_vector = 2, conditional_surv_generator_control = list(SL.library = c("SL.mean", "SL.glm")), large_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")), small_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")), cf_fold_num = 2, sample_split = TRUE, scale_est = TRUE) print(output$result)# This is a small simulation example set.seed(123) n <- 100 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 time <- pmin(T, C) event <- as.numeric(T <= C) # landmark times for AUC landmark_times <- c(3) output <- multiseed_vim(n_seed = 2, agg_method = "compound_bg", ci_grid = seq(0, 1, by = 0.01), type = "AUC", time = time, event = event, X = X, landmark_times = landmark_times, large_feature_vector = 1:2, small_feature_vector = 2, conditional_surv_generator_control = list(SL.library = c("SL.mean", "SL.glm")), large_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")), small_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")), cf_fold_num = 2, sample_split = TRUE, scale_est = TRUE) print(output$result)
Obtain predicted conditional survival and cumulative hazard functions from a global survival stacking object
## S3 method for class 'stackG' predict( object, newX, newtimes, surv_form = object$surv_form, time_grid_approx = object$time_grid_approx, ... )## S3 method for class 'stackG' predict( object, newX, newtimes, surv_form = object$surv_form, time_grid_approx = object$time_grid_approx, ... )
object |
Object of class |
newX |
|
newtimes |
|
surv_form |
Mapping from hazard estimate to survival estimate.
Can be either |
time_grid_approx |
Numeric vector of times at which to
approximate product integral or cumulative hazard interval. Defaults to the value
saved in |
... |
Further arguments passed to or from other methods. |
A named list with the following components:
S_T_preds |
An |
S_C_preds |
An |
Lambda_T_preds |
An |
Lambda_C_preds |
An |
time_grid_approx |
The approximation grid for the product integral or cumulative hazard integral, (user-specified). |
surv_form |
Exponential or product-integral form (user-specified). |
# This is a small simulation example set.seed(123) n <- 250 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) S0 <- function(t, x){ pexp(t, rate = exp(-2 + x[,1] - x[,2] + .5 * x[,1] * x[,2]), lower.tail = FALSE) } T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) G0 <- function(t, x) { as.numeric(t < 15) *.9*pexp(t, rate = exp(-2 -.5*x[,1]-.25*x[,2]+.5*x[,1]*x[,2]), lower.tail=FALSE) } C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 entry <- runif(n, 0, 15) time <- pmin(T, C) event <- as.numeric(T <= C) sampled <- which(time >= entry) X <- X[sampled,] time <- time[sampled] event <- event[sampled] entry <- entry[sampled] # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") fit <- stackG(time = time, event = event, entry = entry, X = X, newX = X, newtimes = seq(0, 15, .1), direction = "prospective", bin_size = 0.1, time_basis = "continuous", time_grid_approx = sort(unique(time)), surv_form = "exp", learner = "SuperLearner", SL_control = list(SL.library = SL.library, V = 5)) preds <- predict(object = fit, newX = X, newtimes = seq(0, 15, 0.1)) plot(preds$S_T_preds[1,], S0(t = seq(0, 15, .1), X[1,])) abline(0,1,col='red')# This is a small simulation example set.seed(123) n <- 250 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) S0 <- function(t, x){ pexp(t, rate = exp(-2 + x[,1] - x[,2] + .5 * x[,1] * x[,2]), lower.tail = FALSE) } T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) G0 <- function(t, x) { as.numeric(t < 15) *.9*pexp(t, rate = exp(-2 -.5*x[,1]-.25*x[,2]+.5*x[,1]*x[,2]), lower.tail=FALSE) } C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 entry <- runif(n, 0, 15) time <- pmin(T, C) event <- as.numeric(T <= C) sampled <- which(time >= entry) X <- X[sampled,] time <- time[sampled] event <- event[sampled] entry <- entry[sampled] # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") fit <- stackG(time = time, event = event, entry = entry, X = X, newX = X, newtimes = seq(0, 15, .1), direction = "prospective", bin_size = 0.1, time_basis = "continuous", time_grid_approx = sort(unique(time)), surv_form = "exp", learner = "SuperLearner", SL_control = list(SL.library = SL.library, V = 5)) preds <- predict(object = fit, newX = X, newtimes = seq(0, 15, 0.1)) plot(preds$S_T_preds[1,], S0(t = seq(0, 15, .1), X[1,])) abline(0,1,col='red')
Obtain predicted conditional survival function from a local survival stacking object
## S3 method for class 'stackL' predict(object, newX, newtimes, ...)## S3 method for class 'stackL' predict(object, newX, newtimes, ...)
object |
Object of class |
newX |
|
newtimes |
|
... |
Further arguments passed to or from other methods. |
A named list with the following components:
S_T_preds |
An |
# This is a small simulation example set.seed(123) n <- 500 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) S0 <- function(t, x){ pexp(t, rate = exp(-2 + x[,1] - x[,2] + .5 * x[,1] * x[,2]), lower.tail = FALSE) } T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) G0 <- function(t, x) { as.numeric(t < 15) *.9*pexp(t, rate = exp(-2 -.5*x[,1]-.25*x[,2]+.5*x[,1]*x[,2]), lower.tail=FALSE) } C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 entry <- runif(n, 0, 15) time <- pmin(T, C) event <- as.numeric(T <= C) sampled <- which(time >= entry) X <- X[sampled,] time <- time[sampled] event <- event[sampled] entry <- entry[sampled] # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") fit <- stackL(time = time, event = event, entry = entry, X = X, newX = X, newtimes = seq(0, 15, .1), direction = "prospective", bin_size = 0.1, time_basis = "continuous", SL_control = list(SL.library = SL.library, V = 5)) preds <- predict(object = fit, newX = X, newtimes = seq(0, 15, 0.1)) plot(preds$S_T_preds[1,], S0(t = seq(0, 15, .1), X[1,])) abline(0,1,col='red')# This is a small simulation example set.seed(123) n <- 500 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) S0 <- function(t, x){ pexp(t, rate = exp(-2 + x[,1] - x[,2] + .5 * x[,1] * x[,2]), lower.tail = FALSE) } T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) G0 <- function(t, x) { as.numeric(t < 15) *.9*pexp(t, rate = exp(-2 -.5*x[,1]-.25*x[,2]+.5*x[,1]*x[,2]), lower.tail=FALSE) } C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 entry <- runif(n, 0, 15) time <- pmin(T, C) event <- as.numeric(T <= C) sampled <- which(time >= entry) X <- X[sampled,] time <- time[sampled] event <- event[sampled] entry <- entry[sampled] # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") fit <- stackL(time = time, event = event, entry = entry, X = X, newX = X, newtimes = seq(0, 15, .1), direction = "prospective", bin_size = 0.1, time_basis = "continuous", SL_control = list(SL.library = SL.library, V = 5)) preds <- predict(object = fit, newX = X, newtimes = seq(0, 15, 0.1)) plot(preds$S_T_preds[1,], S0(t = seq(0, 15, .1), X[1,])) abline(0,1,col='red')
Estimate a conditional survival function using global survival stacking
stackG( time, event = rep(1, length(time)), entry = NULL, X, newX = NULL, newtimes = NULL, direction = "prospective", time_grid_fit = NULL, bin_size = NULL, time_basis, time_grid_approx = sort(unique(time)), surv_form = "PI", learner = "SuperLearner", SL_control = list(SL.library = c("SL.mean"), V = 10, method = "method.NNLS", stratifyCV = FALSE), tau = NULL )stackG( time, event = rep(1, length(time)), entry = NULL, X, newX = NULL, newtimes = NULL, direction = "prospective", time_grid_fit = NULL, bin_size = NULL, time_basis, time_grid_approx = sort(unique(time)), surv_form = "PI", learner = "SuperLearner", SL_control = list(SL.library = c("SL.mean"), V = 10, method = "method.NNLS", stratifyCV = FALSE), tau = NULL )
time |
|
event |
|
entry |
Study entry variable, if applicable. Defaults to |
X |
|
newX |
|
newtimes |
|
direction |
Whether the data come from a prospective or retrospective study.
This determines whether the data are treated as subject to left truncation and
right censoring ( |
time_grid_fit |
Named list of numeric vectors of times of times on which to discretize
for estimation of cumulative probability functions. This is an alternative to
|
bin_size |
Size of time bin on which to discretize for estimation
of cumulative probability functions. Can be a number between 0 and 1,
indicating the size of quantile grid (e.g. |
time_basis |
How to treat time for training the binary
classifier. Options are |
time_grid_approx |
Numeric vector of times at which to
approximate product integral or cumulative hazard interval.
Defaults to |
surv_form |
Mapping from hazard estimate to survival estimate.
Can be either |
learner |
Which binary regression algorithm to use. Currently, only
|
SL_control |
Named list of parameters controlling the Super Learner fitting
process. These parameters are passed directly to the |
tau |
The maximum time of interest in a study, used for
retrospective conditional survival estimation. Rather than dealing
with right truncation separately than left truncation, it is simpler to
estimate the survival function of |
A named list of class stackG, with the following components:
S_T_preds |
An |
S_C_preds |
An |
Lambda_T_preds |
An |
Lambda_C_preds |
An |
time_grid_approx |
The approximation grid for the product integral or cumulative hazard integral, (user-specified). |
direction |
Whether the data come from a prospective or retrospective study (user-specified). |
tau |
The maximum time of interest in a study, used for retrospective conditional survival estimation (user-specified). |
surv_form |
Exponential or product-integral form (user-specified). |
time_basis |
Whether time is included in the regression as |
SL_control |
Named list of parameters controlling the Super Learner fitting process (user-specified). |
fits |
A named list of fitted regression objects corresponding to the constituent regressions needed for
global survival stacking. Includes |
Wolock C.J., Gilbert P.B., Simon N., and Carone, M. (2024). "A framework for leveraging machine learning tools to estimate personalized survival curves."
predict.stackG for stackG prediction method.
# This is a small simulation example set.seed(123) n <- 250 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) S0 <- function(t, x){ pexp(t, rate = exp(-2 + x[,1] - x[,2] + .5 * x[,1] * x[,2]), lower.tail = FALSE) } T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) G0 <- function(t, x) { as.numeric(t < 15) *.9*pexp(t, rate = exp(-2 -.5*x[,1]-.25*x[,2]+.5*x[,1]*x[,2]), lower.tail=FALSE) } C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 entry <- runif(n, 0, 15) time <- pmin(T, C) event <- as.numeric(T <= C) sampled <- which(time >= entry) X <- X[sampled,] time <- time[sampled] event <- event[sampled] entry <- entry[sampled] # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") fit <- stackG(time = time, event = event, entry = entry, X = X, newX = X, newtimes = seq(0, 15, .1), direction = "prospective", bin_size = 0.1, time_basis = "continuous", time_grid_approx = sort(unique(time)), surv_form = "exp", learner = "SuperLearner", SL_control = list(SL.library = SL.library, V = 5)) plot(fit$S_T_preds[1,], S0(t = seq(0, 15, .1), X[1,])) abline(0,1,col='red')# This is a small simulation example set.seed(123) n <- 250 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) S0 <- function(t, x){ pexp(t, rate = exp(-2 + x[,1] - x[,2] + .5 * x[,1] * x[,2]), lower.tail = FALSE) } T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) G0 <- function(t, x) { as.numeric(t < 15) *.9*pexp(t, rate = exp(-2 -.5*x[,1]-.25*x[,2]+.5*x[,1]*x[,2]), lower.tail=FALSE) } C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 entry <- runif(n, 0, 15) time <- pmin(T, C) event <- as.numeric(T <= C) sampled <- which(time >= entry) X <- X[sampled,] time <- time[sampled] event <- event[sampled] entry <- entry[sampled] # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") fit <- stackG(time = time, event = event, entry = entry, X = X, newX = X, newtimes = seq(0, 15, .1), direction = "prospective", bin_size = 0.1, time_basis = "continuous", time_grid_approx = sort(unique(time)), surv_form = "exp", learner = "SuperLearner", SL_control = list(SL.library = SL.library, V = 5)) plot(fit$S_T_preds[1,], S0(t = seq(0, 15, .1), X[1,])) abline(0,1,col='red')
Estimate a conditional survival function via local survival stacking
stackL( time, event = rep(1, length(time)), entry = NULL, X, newX, newtimes, direction = "prospective", bin_size = NULL, time_basis = "continuous", learner = "SuperLearner", SL_control = list(SL.library = c("SL.mean"), V = 10, method = "method.NNLS", stratifyCV = FALSE), tau = NULL )stackL( time, event = rep(1, length(time)), entry = NULL, X, newX, newtimes, direction = "prospective", bin_size = NULL, time_basis = "continuous", learner = "SuperLearner", SL_control = list(SL.library = c("SL.mean"), V = 10, method = "method.NNLS", stratifyCV = FALSE), tau = NULL )
time |
|
event |
|
entry |
Study entry variable, if applicable. Defaults to |
X |
|
newX |
|
newtimes |
|
direction |
Whether the data come from a prospective or retrospective study.
This determines whether the data are treated as subject to left truncation and
right censoring ( |
bin_size |
Size of bins for the discretization of time. A value between 0 and 1 indicating the size of observed event time quantiles on which to grid times (e.g. 0.02 creates a grid of 50 times evenly spaced on the quantile scaled). If NULL, defaults to every observed event time. |
time_basis |
How to treat time for training the binary
classifier. Options are |
learner |
Which binary regression algorithm to use. Currently, only
|
SL_control |
Named list of parameters controlling the Super Learner fitting
process. These parameters are passed directly to the |
tau |
The maximum time of interest in a study, used for
retrospective conditional survival estimation. Rather than dealing
with right truncation separately than left truncation, it is simpler to
estimate the survival function of |
A named list of class stackL.
S_T_preds |
An |
fit |
The Super Learner fit for binary classification on the stacked dataset. |
Polley E.C. and van der Laan M.J. (2011). "Super Learning for Right-Censored Data" in Targeted Learning.
Craig E., Zhong C., and Tibshirani R. (2021). "Survival stacking: casting survival analysis as a classification problem."
predict.stackL for stackL prediction method.
# This is a small simulation example set.seed(123) n <- 500 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) S0 <- function(t, x){ pexp(t, rate = exp(-2 + x[,1] - x[,2] + .5 * x[,1] * x[,2]), lower.tail = FALSE) } T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) G0 <- function(t, x) { as.numeric(t < 15) *.9*pexp(t, rate = exp(-2 -.5*x[,1]-.25*x[,2]+.5*x[,1]*x[,2]), lower.tail=FALSE) } C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 entry <- runif(n, 0, 15) time <- pmin(T, C) event <- as.numeric(T <= C) sampled <- which(time >= entry) X <- X[sampled,] time <- time[sampled] event <- event[sampled] entry <- entry[sampled] # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") fit <- stackL(time = time, event = event, entry = entry, X = X, newX = X, newtimes = seq(0, 15, .1), direction = "prospective", bin_size = 0.1, time_basis = "continuous", SL_control = list(SL.library = SL.library, V = 5)) plot(fit$S_T_preds[1,], S0(t = seq(0, 15, .1), X[1,])) abline(0,1,col='red')# This is a small simulation example set.seed(123) n <- 500 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) S0 <- function(t, x){ pexp(t, rate = exp(-2 + x[,1] - x[,2] + .5 * x[,1] * x[,2]), lower.tail = FALSE) } T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) G0 <- function(t, x) { as.numeric(t < 15) *.9*pexp(t, rate = exp(-2 -.5*x[,1]-.25*x[,2]+.5*x[,1]*x[,2]), lower.tail=FALSE) } C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 entry <- runif(n, 0, 15) time <- pmin(T, C) event <- as.numeric(T <= C) sampled <- which(time >= entry) X <- X[sampled,] time <- time[sampled] event <- event[sampled] entry <- entry[sampled] # Note that this a very small Super Learner library, for computational purposes. SL.library <- c("SL.mean", "SL.glm") fit <- stackL(time = time, event = event, entry = entry, X = X, newX = X, newtimes = seq(0, 15, .1), direction = "prospective", bin_size = 0.1, time_basis = "continuous", SL_control = list(SL.library = SL.library, V = 5)) plot(fit$S_T_preds[1,], S0(t = seq(0, 15, .1), X[1,])) abline(0,1,col='red')
Compute estimates of and confidence intervals for nonparametric variable importance based on the difference predictiveness obtained with and without the feature of interest. Designed for use with time-to-event outcomes subject to right censoring that may be informed by measured covariates.
vim( type, time, event, X, landmark_times = stats::quantile(time[event == 1], probs = c(0.25, 0.5, 0.75)), restriction_time = max(time[event == 1]), approx_times = NULL, large_feature_vector, small_feature_vector, conditional_surv_generator = NULL, conditional_surv_generator_control = NULL, large_oracle_generator = NULL, large_oracle_generator_control = NULL, small_oracle_generator = NULL, small_oracle_generator_control = NULL, conditional_surv_preds = NULL, large_oracle_preds = NULL, small_oracle_preds = NULL, cf_folds = NULL, cf_fold_num = 5, sample_split = TRUE, ss_folds = NULL, robust = TRUE, scale_est = FALSE, alpha = 0.05, verbose = FALSE )vim( type, time, event, X, landmark_times = stats::quantile(time[event == 1], probs = c(0.25, 0.5, 0.75)), restriction_time = max(time[event == 1]), approx_times = NULL, large_feature_vector, small_feature_vector, conditional_surv_generator = NULL, conditional_surv_generator_control = NULL, large_oracle_generator = NULL, large_oracle_generator_control = NULL, small_oracle_generator = NULL, small_oracle_generator_control = NULL, conditional_surv_preds = NULL, large_oracle_preds = NULL, small_oracle_preds = NULL, cf_folds = NULL, cf_fold_num = 5, sample_split = TRUE, ss_folds = NULL, robust = TRUE, scale_est = FALSE, alpha = 0.05, verbose = FALSE )
type |
Type of VIM to compute. Options include |
time |
|
event |
|
X |
|
landmark_times |
Numeric vector of length J1 giving
landmark times at which to estimate VIM ( |
restriction_time |
Maximum follow-up time for calculation of |
approx_times |
Numeric vector of length J2 giving times at which to approximate integrals. Defaults to a grid of 100 timepoints, evenly spaced on the quantile scale of the distribution of observed event times. |
large_feature_vector |
Numeric vector giving indices of features to include in the 'large' prediction model. |
small_feature_vector |
Numeric vector giving indices of features to include in the 'small' prediction model. Must be a
subset of |
conditional_surv_generator |
A function to estimate the conditional survival functions of the event and censoring variables. Must take arguments
( |
conditional_surv_generator_control |
A list of arguments to pass to |
large_oracle_generator |
A function to estimate the oracle prediction function using |
large_oracle_generator_control |
A list of arguments to pass to |
small_oracle_generator |
A function to estimate the oracle prediction function using |
small_oracle_generator_control |
A list of arguments to pass to |
conditional_surv_preds |
User-provided estimates of the conditional survival functions of the event and censoring
variables given the full covariate vector (if not using the |
large_oracle_preds |
User-provided estimates of the oracle prediction function using |
small_oracle_preds |
User-provided estimates of the oracle prediction function using |
cf_folds |
Numeric vector of length |
cf_fold_num |
The number of cross-fitting folds, if not providing |
sample_split |
Logical indicating whether or not to sample split. Sample-splitting is required for valid hypothesis testing of null importance and is generally recommended. Defaults to |
ss_folds |
Numeric vector of length |
robust |
Logical, whether or not to use the doubly-robust debiasing approach. This option
is meant for illustration purposes only — it should be left as |
scale_est |
Logical, whether or not to force the VIM estimate to be nonnegative. |
alpha |
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05. |
verbose |
Whether to print progress messages. |
For nuisance estimation, it is generally advisable to use the pre-built nuisance generator functions provided by survML. See the ”Variable importance in survival analysis” vignette, or the package website for an illustration.
Named list with the following elements:
result |
Data frame giving results. See the documentation of the individual |
folds |
A named list giving the cross-fitting fold IDs ( |
approx_times |
A vector of times used to approximate integrals appearing in the form of the VIM estimator. |
conditional_surv_preds |
A named list containing the estimated conditional event and censoring survival functions. |
large_oracle_preds |
A named list containing the estimated large oracle prediction function. |
small_oracle_preds |
A named list containing the estimated small oracle prediction function. |
Wolock C.J., Gilbert P.B., Simon N., and Carone, M. (2025). "Assessing variable importance in survival analysis using machine learning."
vim_accuracy vim_AUC vim_brier vim_cindex vim_rsquared vim_survival_time_mse
# This is a small simulation example set.seed(123) n <- 100 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 time <- pmin(T, C) event <- as.numeric(T <= C) # landmark times for AUC landmark_times <- c(3) output <- vim(type = "AUC", time = time, event = event, X = X, landmark_times = landmark_times, large_feature_vector = 1:2, small_feature_vector = 2, conditional_surv_generator_control = list(SL.library = c("SL.mean", "SL.glm")), large_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")), small_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")), cf_fold_num = 2, sample_split = FALSE, scale_est = TRUE) print(output$result)# This is a small simulation example set.seed(123) n <- 100 X <- data.frame(X1 = rnorm(n), X2 = rbinom(n, size = 1, prob = 0.5)) T <- rexp(n, rate = exp(-2 + X[,1] - X[,2] + .5 * X[,1] * X[,2])) C <- rexp(n, exp(-2 -.5 * X[,1] - .25 * X[,2] + .5 * X[,1] * X[,2])) C[C > 15] <- 15 time <- pmin(T, C) event <- as.numeric(T <= C) # landmark times for AUC landmark_times <- c(3) output <- vim(type = "AUC", time = time, event = event, X = X, landmark_times = landmark_times, large_feature_vector = 1:2, small_feature_vector = 2, conditional_surv_generator_control = list(SL.library = c("SL.mean", "SL.glm")), large_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")), small_oracle_generator_control = list(SL.library = c("SL.mean", "SL.glm")), cf_fold_num = 2, sample_split = FALSE, scale_est = TRUE) print(output$result)
Estimate classification accuracy VIM
vim_accuracy( time, event, approx_times, landmark_times, f_hat, fs_hat, S_hat, G_hat, cf_folds, sample_split, ss_folds, scale_est = FALSE, alpha = 0.05 )vim_accuracy( time, event, approx_times, landmark_times, f_hat, fs_hat, S_hat, G_hat, cf_folds, sample_split, ss_folds, scale_est = FALSE, alpha = 0.05 )
time |
|
event |
|
approx_times |
Numeric vector of length J1 giving times at which to approximate integrals. |
landmark_times |
Numeric vector of length J2 giving times at which to estimate accuracy |
f_hat |
Full oracle predictions (n x J1 matrix) |
fs_hat |
Residual oracle predictions (n x J1 matrix) |
S_hat |
Estimates of conditional event time survival function (n x J2 matrix) |
G_hat |
Estimate of conditional censoring time survival function (n x J2 matrix) |
cf_folds |
Numeric vector of length n giving cross-fitting folds |
sample_split |
Logical indicating whether or not to sample split |
ss_folds |
Numeric vector of length n giving sample-splitting folds |
scale_est |
Logical, whether or not to force the VIM estimate to be nonnegative |
alpha |
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05 |
A data frame giving results, with the following columns:
landmark_time |
Time at which AUC is evaluated. |
est |
VIM point estimate. |
var_est |
Estimated variance of the VIM estimate. |
cil |
Lower bound of the VIM confidence interval. |
ciu |
Upper bound of the VIM confidence interval. |
cil_1sided |
Lower bound of a one-sided confidence interval. |
p |
p-value corresponding to a hypothesis test of null importance. |
large_predictiveness |
Estimated predictiveness of the large oracle prediction function. |
small_predictiveness |
Estimated predictiveness of the small oracle prediction function. |
vim |
VIM type. |
large_feature_vector |
Group of features available for the large oracle prediction function. |
small_feature_vector |
Group of features available for the small oracle prediction function. |
Estimate AUC VIM
vim_AUC( time, event, approx_times, landmark_times, f_hat, fs_hat, S_hat, G_hat, cf_folds, sample_split, ss_folds, robust = TRUE, scale_est = FALSE, alpha = 0.05 )vim_AUC( time, event, approx_times, landmark_times, f_hat, fs_hat, S_hat, G_hat, cf_folds, sample_split, ss_folds, robust = TRUE, scale_est = FALSE, alpha = 0.05 )
time |
|
event |
|
approx_times |
Numeric vector of length J1 giving times at which to approximate integrals. |
landmark_times |
Numeric vector of length J2 giving times at which to estimate AUC |
f_hat |
Full oracle predictions (n x J1 matrix) |
fs_hat |
Residual oracle predictions (n x J1 matrix) |
S_hat |
Estimates of conditional event time survival function (n x J2 matrix) |
G_hat |
Estimate of conditional censoring time survival function (n x J2 matrix) |
cf_folds |
Numeric vector of length n giving cross-fitting folds |
sample_split |
Logical indicating whether or not to sample split |
ss_folds |
Numeric vector of length n giving sample-splitting folds |
robust |
Logical, whether or not to use the doubly-robust debiasing approach. This option
is meant for illustration purposes only — it should be left as |
scale_est |
Logical, whether or not to force the VIM estimate to be nonnegative |
alpha |
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05 |
A data frame giving results, with the following columns:
landmark_time |
Time at which AUC is evaluated. |
est |
VIM point estimate. |
var_est |
Estimated variance of the VIM estimate. |
cil |
Lower bound of the VIM confidence interval. |
ciu |
Upper bound of the VIM confidence interval. |
cil_1sided |
Lower bound of a one-sided confidence interval. |
p |
p-value corresponding to a hypothesis test of null importance. |
large_predictiveness |
Estimated predictiveness of the large oracle prediction function. |
small_predictiveness |
Estimated predictiveness of the small oracle prediction function. |
vim |
VIM type. |
large_feature_vector |
Group of features available for the large oracle prediction function. |
small_feature_vector |
Group of features available for the small oracle prediction function. |
vim for example usage
Estimate Brier score VIM
vim_brier( time, event, approx_times, landmark_times, f_hat, fs_hat, S_hat, G_hat, cf_folds, ss_folds, sample_split, scale_est = FALSE, alpha = 0.05 )vim_brier( time, event, approx_times, landmark_times, f_hat, fs_hat, S_hat, G_hat, cf_folds, ss_folds, sample_split, scale_est = FALSE, alpha = 0.05 )
time |
|
event |
|
approx_times |
Numeric vector of length J1 giving times at which to approximate integrals. |
landmark_times |
Numeric vector of length J2 giving times at which to estimate Brier score |
f_hat |
Full oracle predictions (n x J1 matrix) |
fs_hat |
Residual oracle predictions (n x J1 matrix) |
S_hat |
Estimates of conditional event time survival function (n x J2 matrix) |
G_hat |
Estimate of conditional censoring time survival function (n x J2 matrix) |
cf_folds |
Numeric vector of length n giving cross-fitting folds |
ss_folds |
Numeric vector of length n giving sample-splitting folds |
sample_split |
Logical indicating whether or not to sample split |
scale_est |
Logical, whether or not to force the VIM estimate to be nonnegative |
alpha |
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05 |
A data frame giving results, with the following columns:
landmark_time |
Time at which AUC is evaluated. |
est |
VIM point estimate. |
var_est |
Estimated variance of the VIM estimate. |
cil |
Lower bound of the VIM confidence interval. |
ciu |
Upper bound of the VIM confidence interval. |
cil_1sided |
Lower bound of a one-sided confidence interval. |
p |
p-value corresponding to a hypothesis test of null importance. |
large_predictiveness |
Estimated predictiveness of the large oracle prediction function. |
small_predictiveness |
Estimated predictiveness of the small oracle prediction function. |
vim |
VIM type. |
large_feature_vector |
Group of features available for the large oracle prediction function. |
small_feature_vector |
Group of features available for the small oracle prediction function. |
vim for example usage
Estimate concordance index VIM
vim_cindex( time, event, approx_times, restriction_time, f_hat, fs_hat, S_hat, G_hat, cf_folds, sample_split, ss_folds, scale_est = FALSE, alpha = 0.05 )vim_cindex( time, event, approx_times, restriction_time, f_hat, fs_hat, S_hat, G_hat, cf_folds, sample_split, ss_folds, scale_est = FALSE, alpha = 0.05 )
time |
|
event |
|
approx_times |
Numeric vector of length J1 giving times at which to approximate integrals. |
restriction_time |
Restriction time (upper bound for event times to be compared in computing the C-index) |
f_hat |
Full oracle predictions (n x J1 matrix) |
fs_hat |
Residual oracle predictions (n x J1 matrix) |
S_hat |
Estimates of conditional event time survival function (n x J2 matrix) |
G_hat |
Estimate of conditional censoring time survival function (n x J2 matrix) |
cf_folds |
Numeric vector of length n giving cross-fitting folds |
sample_split |
Logical indicating whether or not to sample split |
ss_folds |
Numeric vector of length n giving sample-splitting folds |
scale_est |
Logical, whether or not to force the VIM estimate to be nonnegative |
alpha |
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05 |
A data frame giving results, with the following columns:
restriction_time |
Restriction time (upper bound for event times to be compared in computing the C-index). |
est |
VIM point estimate. |
var_est |
Estimated variance of the VIM estimate. |
cil |
Lower bound of the VIM confidence interval. |
ciu |
Upper bound of the VIM confidence interval. |
cil_1sided |
Lower bound of a one-sided confidence interval. |
p |
p-value corresponding to a hypothesis test of null importance. |
large_predictiveness |
Estimated predictiveness of the large oracle prediction function. |
small_predictiveness |
Estimated predictiveness of the small oracle prediction function. |
vim |
VIM type. |
large_feature_vector |
Group of features available for the large oracle prediction function. |
small_feature_vector |
Group of features available for the small oracle prediction function. |
vim for example usage
Estimate R-squared (proportion of explained variance) VIM based on event occurrence by a landmark time
vim_rsquared( time, event, approx_times, landmark_times, f_hat, fs_hat, S_hat, G_hat, cf_folds, ss_folds, sample_split, scale_est = FALSE, alpha = 0.05 )vim_rsquared( time, event, approx_times, landmark_times, f_hat, fs_hat, S_hat, G_hat, cf_folds, ss_folds, sample_split, scale_est = FALSE, alpha = 0.05 )
time |
|
event |
|
approx_times |
Numeric vector of length J1 giving times at which to approximate integrals. |
landmark_times |
Numeric vector of length J2 giving times at which to estimate Brier score |
f_hat |
Full oracle predictions (n x J1 matrix) |
fs_hat |
Residual oracle predictions (n x J1 matrix) |
S_hat |
Estimates of conditional event time survival function (n x J2 matrix) |
G_hat |
Estimate of conditional censoring time survival function (n x J2 matrix) |
cf_folds |
Numeric vector of length n giving cross-fitting folds |
ss_folds |
Numeric vector of length n giving sample-splitting folds |
sample_split |
Logical indicating whether or not to sample split |
scale_est |
Logical, whether or not to force the VIM estimate to be nonnegative |
alpha |
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05 |
A data frame giving results, with the following columns:
landmark_time |
Time at which AUC is evaluated. |
est |
VIM point estimate. |
var_est |
Estimated variance of the VIM estimate. |
cil |
Lower bound of the VIM confidence interval. |
ciu |
Upper bound of the VIM confidence interval. |
cil_1sided |
Lower bound of a one-sided confidence interval. |
p |
p-value corresponding to a hypothesis test of null importance. |
large_predictiveness |
Estimated predictiveness of the large oracle prediction function. |
small_predictiveness |
Estimated predictiveness of the small oracle prediction function. |
vim |
VIM type. |
large_feature_vector |
Group of features available for the large oracle prediction function. |
small_feature_vector |
Group of features available for the small oracle prediction function. |
vim for example usage
Estimate restricted predicted survival time MSE VIM
vim_survival_time_mse( time, event, approx_times, restriction_time, f_hat, fs_hat, S_hat, G_hat, cf_folds, sample_split, ss_folds, scale_est = FALSE, alpha = 0.05 )vim_survival_time_mse( time, event, approx_times, restriction_time, f_hat, fs_hat, S_hat, G_hat, cf_folds, sample_split, ss_folds, scale_est = FALSE, alpha = 0.05 )
time |
|
event |
|
approx_times |
Numeric vector of length J1 giving times at which to approximate integrals. |
restriction_time |
restriction time |
f_hat |
Full oracle predictions (n x J1 matrix) |
fs_hat |
Residual oracle predictions (n x J1 matrix) |
S_hat |
Estimates of conditional event time survival function (n x J2 matrix) |
G_hat |
Estimate of conditional censoring time survival function (n x J2 matrix) |
cf_folds |
Numeric vector of length n giving cross-fitting folds |
sample_split |
Logical indicating whether or not to sample split |
ss_folds |
Numeric vector of length n giving sample-splitting folds |
scale_est |
Logical, whether or not to force the VIM estimate to be nonnegative |
alpha |
The level at which to compute confidence intervals and hypothesis tests. Defaults to 0.05 |
A data frame giving results, with the following columns:
restriction_time |
Restriction time (upper bound for event times to be compared in computing the restricted survival time). |
est |
VIM point estimate. |
var_est |
Estimated variance of the VIM estimate. |
cil |
Lower bound of the VIM confidence interval. |
ciu |
Upper bound of the VIM confidence interval. |
cil_1sided |
Lower bound of a one-sided confidence interval. |
p |
p-value corresponding to a hypothesis test of null importance. |
large_predictiveness |
Estimated predictiveness of the large oracle prediction function. |
small_predictiveness |
Estimated predictiveness of the small oracle prediction function. |
vim |
VIM type. |
large_feature_vector |
Group of features available for the large oracle prediction function. |
small_feature_vector |
Group of features available for the small oracle prediction function. |
vim for example usage