This function carries out robust subgroup analysis and variable selection simultaneously. It relies on the C++ core solver and supports several types of loss functions and penalties.

RSAVS_Path(
  y_vec,
  x_mat,
  l_type = "L1",
  l_param = NULL,
  p1_type = "S",
  p1_param = c(2, 3.7),
  p2_type = "S",
  p2_param = c(2, 3.7),
  lam1_vec = NULL,
  lam2_vec = NULL,
  lam1_sort = TRUE,
  lam2_sort = TRUE,
  min_lam1_ratio = 0.03,
  min_lam2_ratio = 0.03,
  lam1_max_ncvguard = 100,
  lam1_len = 50,
  lam2_len = 40,
  const_r123 = NULL,
  const_abc = rep(1, 3),
  initial_values = NULL,
  phi = 1,
  tol = 0.001,
  max_iter = 10,
  cd_max_iter = 1,
  cd_tol = 0.001,
  subgroup_benchmark = FALSE,
  update_mu = NULL,
  loss_track = FALSE,
  diff_update = TRUE,
  omp_zsw = c(1, 4, 1),
  eigen_pnum = 4,
  s_v2 = TRUE,
  dry_run = FALSE
)

Arguments

y_vec

numerical vector of the response. n = length(y_vec) is the number of observations.

x_mat

numerical matrix of covariates. Each row corresponds to one observation, and p = ncol(x_mat) is the number of covariates.

l_type

character string, type of loss function.

  • "L1": l-1 loss(absolute value loss)

  • "L2": l-2 loss(squared error loss)

  • "Huber": Huber loss. Its parameter is given in l_param.

Default value is "L1".

l_param

numeric vector containing the necessary parameters of the corresponding loss function (e.g. the constant of the Huber loss). The default value is NULL.
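For example, a minimal sketch of a Huber loss fit, using y_vec and x_mat as in the Examples below; the interpretation of l_param as the Huber constant and the value 1.345 are only illustrative:

res_huber <- RSAVS_Path(y_vec = y_vec, x_mat = x_mat,
                        l_type = "Huber", l_param = 1.345)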

p1_type, p2_type

character strings indicating the penalty types for subgroup identification (p1_type) and variable selection (p2_type).

  • "S": SCAD

  • "M": MCP

  • "L": Lasso

Default values for both parameters are "S".

p1_param, p2_param

numerical vectors providing necessary parameters for the corresponding penalties.

  • For Lasso, lam = p_param[1]

  • For SCAD and MCP, lam = p_param[1], gamma = p_param[2]

Default values for both parameters are c(2, 3.7). Note: this function searches the whole lam1_vec * lam2_vec grid for the best solution, so the lambdas provided in these parameters serve only as placeholders and are ignored and overwritten in the actual computation.
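For instance, a sketch of a call with MCP for subgroup identification and Lasso for variable selection (the gamma value 3 is only illustrative, and the lambda entries are placeholders that the grid search overwrites):

res_mcp <- RSAVS_Path(y_vec = y_vec, x_mat = x_mat,
                      p1_type = "M", p1_param = c(2, 3),
                      p2_type = "L", p2_param = 2)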

lam1_vec, lam2_vec

numerical vectors of customized lambda values. lam1_vec is preferably supplied in increasing (small-to-big) order.

lam1_sort, lam2_sort

boolean, whether to force sorting of the provided lam1_vec and lam2_vec. By default, lam1_vec is sorted in increasing order while lam2_vec is sorted in decreasing order.

min_lam1_ratio, min_lam2_ratio

the ratio between the minimal and maximal lambda, i.e. (minimal lambda) / (maximal lambda). The default value is 0.03.
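As a sketch, customized lambda sequences can be supplied instead of the auto-generated ones (the ranges and lengths below are purely illustrative):

my_lam1 <- exp(seq(log(0.01), log(0.5), length.out = 50))     # increasing order
my_lam2 <- exp(seq(log(0.5), log(0.015), length.out = 40))    # decreasing order
res_custom <- RSAVS_Path(y_vec = y_vec, x_mat = x_mat,
                         lam1_vec = my_lam1, lam2_vec = my_lam2)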

lam1_max_ncvguard

a safeguard constant for lam1_max when the penalty is nonconvex, such as SCAD or MCP.

lam1_len, lam2_len

integers, lengths of the auto-generated lambda vectors.

const_r123

a length-3 numerical vector providing the scalars needed in the augmented Lagrangian part of the ADMM algorithm.

const_abc

a length-3 numeric vector providing the scalars that adjust the weights of the regression function, the penalty for subgroup identification and the penalty for variable selection in the overall objective function. Defaults to c(1, 1, 1).

initial_values

list of vectors providing initial values for the algorithm.

phi

numerical value, a parameter needed for the mBIC criterion.

tol

numerical, convergence tolerance for the algorithm.

max_iter

integer, maximal number of iterations of the algorithm.

cd_max_iter

integer, maximal number of iterations during the coordinate descent update of mu and beta. If set to 0, the analytical solution (instead of the coordinate descent algorithm) is used to update mu and beta.

cd_tol

numerical, convergence tolerance for the coordinate descent part when updating mu and beta.

subgroup_benchmark

bool, whether this call should be taken as a benchmark of subgroup identification. If TRUE, the penalty for variable selection will be suppressed to a minimal value.

update_mu

list of parameters for updating mu_vec from the algorithm into a meaningful subgroup structure. Defaults to NULL, meaning no update is performed. The update of mu_vec is carried out by RSAVS_Determine_Mu, and the necessary entries in update_mu are:

  • UseS: a bool variable, whether the s_vec should be used to provide subgroup structure information.

  • klim: a length-3 integer vector, giving the range of the number of clusters to consider.

  • usepam: a bool variable, whether to use pam for clustering.

  • round_digits: a non-negative integer, the number of digits used for rounding when merging mu_vec.

Please refer to RSAVS_Determine_Mu for more details about how this update works.
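A sketch of such a list (the klim and round_digits values here are only illustrative; consult RSAVS_Determine_Mu for the exact meaning of each entry):

res_mu <- RSAVS_Path(y_vec = y_vec, x_mat = x_mat,
                     update_mu = list(UseS = TRUE,
                                      klim = c(2, 6, 1),
                                      usepam = TRUE,
                                      round_digits = 2))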

loss_track

boolean, whether to track the value of the objective function (loss value) at each iteration.

diff_update

boolean, whether to update the difference between iterations. If set to FALSE, the algorithm will still stop when it reaches max_iter.

omp_zsw

a length-3 integer vector, defaults to c(1, 4, 1). It specifies how many parallel threads are used during the updates of z, s and w, respectively.

eigen_pnum

integer, the number of Eigen threads used for matrix computation. Defaults to 4.

s_v2

boolean, whether to use the updated and faster version of the computation of s and q2.

dry_run

boolean, whether this is a so-called 'dry run'. A dry run does not invoke the actual core solver, but only prepares and returns the necessary initial values, additional variables and solution plane for the core solver.
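As a sketch, a dry run can be used to inspect the prepared quantities before running the full solution path (the returned components follow the description above):

prep <- RSAVS_Path(y_vec = y_vec, x_mat = x_mat, dry_run = TRUE)
str(prep, max.level = 1)    # inspect the prepared components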

Examples

# a toy example
# first we generate data
n <- 200    # number of observations
q <- 5    # number of active covariates
p <- 50    # number of total covariates
k <- 2    # number of subgroups

# k subgroup effect values, centered at 0
group_center <- seq(from = 0, to = 2 * (k - 1), by = 2) - (k - 1)
# covariate effect vector
beta_true <- c(rep(1, q), rep(0, p - q))
# subgroup effect vector    
alpha_true <- sample(group_center, size = n, replace = TRUE)    
x_mat <- matrix(rnorm(n * p), nrow = n, ncol = p)    # covariate matrix
err_vec <- rnorm(n, sd = 0.1)    # error term
y_vec <- alpha_true + x_mat %*% beta_true + err_vec    # response vector

# a simple analysis using default loss and penalties
res <- RSAVS_Path(y_vec = y_vec, x_mat = x_mat, 
                 lam1_len = 50, lam2_len = 40, 
                 phi = 5)
#> `const_r123` is missing. Use default settings!
#> lam1_vec is missing, use default values...
#> lam2_vec is missing, use default values...
#> `initial_values` is missing, use default settings!
#> prepare intermediate variables needed by the algorithm
#> generate pairwise difference matrix
#> compute `beta_lhs`
#> compute `mu_lhs`
#> additional variables prepared!
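
# A further sketch: benchmark the subgroup identification part only.
# With `subgroup_benchmark = TRUE` the variable selection penalty is
# suppressed to a minimal value.
res_benchmark <- RSAVS_Path(y_vec = y_vec, x_mat = x_mat,
                            lam1_len = 50, lam2_len = 40,
                            phi = 5, subgroup_benchmark = TRUE)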