Cross-validation for solution path of Logistic FAR. — Logistic_FARMM_CV

Logistic_FARMM_CV_path finds the solution path of logistic functional additive regression with a log-contrast constraint via Logistic_FAR_Path, and uses cross-validation to assess the goodness of the estimates along the solution path.

Usage

Logistic_FARMM_CV_path(
  y_vec,
  x_mat,
  h,
  kn,
  p,
  rand_eff_df,
  p_type,
  p_param,
  lambda_seq,
  lambda_length,
  min_lambda_ratio = 0.01,
  mu2,
  a = 1,
  bj_vec = rep(1/sqrt(kn), p),
  cj_vec = rep(1, p),
  rj_vec = 1e-05,
  weight_vec = 1,
  logit_weight_vec = 1,
  weight_already_combine = FALSE,
  relax_vec,
  delta_init,
  eta_stack_init,
  mu_1_init,
  tol,
  max_iter,
  nfold = 5,
  fold_seed,
  post_selection = TRUE,
  post_a = 1,
  verbose = 0
)

Arguments

y_vec: response vector, 0 for control, 1 for case. n = length(y_vec) is the number of observations.
x_mat: covariate matrix with dim(x_mat) = (n, h + p * kn), consisting of two parts. The first h columns hold the demographic covariates (and can include an intercept term). The remaining columns hold the p functional covariates, each represented by a set of basis functions that yields kn columns.
h, kn, p: dimension information for the dataset (x_mat).
rand_eff_df: data.frame of random-effect related data. It must contain at least one column named "subj_vec_fct", which indicates the subject level. If this is the only column in rand_eff_df, a constant random effect is applied. Any other columns are all added additively to the random effect as slope terms. The number of rows of rand_eff_df is the same as length(y_vec).
p_type: a character variable indicating the type of the penalty.
p_param: numerical vector of parameters for the penalty function. p_param[1] stores the lambda value and will be provided by lambda_seq.
lambda_seq: a non-negative sequence of lambda values along which the solution path is searched. It is RECOMMENDED not to supply this parameter and to let the function determine it from the given data.
lambda_length: length of the lambda sequence when computing lambda_seq. If lambda_seq is provided, then of course lambda_length = length(lambda_seq).
mu2: quadratic term in the ADMM algorithm
a, bj_vec, cj_vec, rj_vec: parameters for the algorithm. See Algorithm_Details.pdf for more information.
weight_vec: weight vector for each subject. The final weight for each subject will be adjusted also by logit_weight_vec. And the summation of the final weight vector is normalized to n, the sample size.
logit_weight_vec: weight vector for each subject when computing the integral in the logit values. Each entry should be positive and no more than 1. This is a naive method for adjusting for early stop during the interval.
weight_already_combine: boolean, indicating whether weight_vec is already combined with logit_weight_vec for each subject.
relax_vec: not used.
delta_init, eta_stack_init, mu_1_init: initial values for the algorithm.
tol, max_iter: convergence tolerance and maximum number of iterations of the algorithm.
nfold: integer, number of folds
fold_seed: if supplied, use this seed to generate the partitions for cross-validation. Can be useful for reproducible runs.
post_selection: boolean, whether the function should also compute cross-validation results based on the post-selection estimates.
post_a: a for the post selection estimation.
verbose: integer, indicating the level of information printed during computation. Currently supported:
  always: some information if something goes wrong, e.g. when no penalty function is matched
  1: information about the start and stop of the iteration
  2: how the loss value changes during each iteration
min_lambda_ratio: min(lambda_seq) / max(lambda_seq). This function uses this parameter to determine the minimal value of lambda_seq. If p > n, it is recommended to set this no smaller than 0.01 (sometimes even 0.05); otherwise you can set it to 0.001 or even smaller.
svd_thresh: not used.
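
The input layout described by the arguments above can be sketched in base R. This is only an illustration on simulated data, not the package's own code, and it rests on several assumptions: the kn columns of each functional covariate are taken to be contiguous, subj_vec_fct is taken to be a factor, the final weight is taken to be the product of weight_vec and logit_weight_vec, and the lambda grid is taken to be log-spaced from a made-up lambda_max. The helper func_cols and the column visit_time are hypothetical names.

```r
# Simulated inputs matching the documented shapes (all values are made up).
set.seed(1)
n  <- 12   # number of observations
h  <- 2    # demographic columns (here: intercept + one covariate)
kn <- 4    # basis functions per functional covariate
p  <- 3    # number of functional covariates

y_vec <- rbinom(n, size = 1, prob = 0.5)   # 0 = control, 1 = case

# First h columns: demographic part; remaining p * kn columns: basis part.
demo_part <- cbind(intercept = 1, age = rnorm(n))
func_part <- matrix(rnorm(n * p * kn), nrow = n)
x_mat <- cbind(demo_part, func_part)

# Hypothetical helper: columns of the j-th functional covariate,
# ASSUMING each covariate occupies kn contiguous columns.
func_cols <- function(j) h + (j - 1) * kn + seq_len(kn)

# Random-effect data: subject indicator plus one slope column.
# ASSUMPTION: subj_vec_fct is a factor; visit_time is a hypothetical column.
rand_eff_df <- data.frame(
  subj_vec_fct = factor(rep(1:4, each = 3)),
  visit_time   = rep(0:2, times = 4)
)

# ASSUMPTION: the final weight is the product of the two weight vectors,
# rescaled so that it sums to n as stated for weight_vec.
weight_vec       <- rep(1, n)
logit_weight_vec <- runif(n, min = 0.5, max = 1)
final_w <- weight_vec * logit_weight_vec
final_w <- final_w * n / sum(final_w)

# ASSUMPTION: a decreasing, log-spaced grid; lambda_max is a made-up value,
# whereas the function derives its own maximum from the data.
lambda_max       <- 2
min_lambda_ratio <- 0.01
lambda_length    <- 10
lambda_seq <- exp(seq(log(lambda_max),
                      log(lambda_max * min_lambda_ratio),
                      length.out = lambda_length))
```

With these values, func_cols(2) picks out the second block of kn columns, and min(lambda_seq) / max(lambda_seq) equals min_lambda_ratio by construction.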

Value

A list containing the solution path of delta, eta_stack, and mu1, together with computation information such as convergence status, iteration numbers, and the lambda sequence of this solution path. Cross-validation information is also returned, such as the fold ID for each observation, the log-likelihood results on each test set, and the index with the highest average log-likelihood on the test sets. If post_selection = TRUE, the same results based on the post-selection estimates are also returned.
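
The cross-validation quantities mentioned above (fold IDs, test-set log-likelihoods, and the index with the highest average) can be illustrated with a minimal base R sketch. It does not reproduce the structure of the returned list: the object names fold_id, test_loglik, and best_idx are hypothetical, the fold-by-lambda matrix layout is an assumption, and the log-likelihood values are synthetic numbers constructed so that the fourth lambda is best.

```r
# Reproducible, balanced fold assignment; the seed plays the role of fold_seed.
set.seed(2024)
n        <- 20
nfold    <- 5
n_lambda <- 6
fold_id  <- sample(rep(seq_len(nfold), length.out = n))

# ASSUMED layout: one row per fold, one column per lambda value.
# Synthetic log-likelihoods that peak at the 4th lambda in every fold.
test_loglik <- outer(
  seq_len(nfold), seq_len(n_lambda),
  function(fold, lam) -10 - (lam - 4)^2 - 0.1 * fold
)

# Average over folds, then take the lambda index with the highest average.
avg_loglik <- colMeans(test_loglik)
best_idx   <- which.max(avg_loglik)
```

Here every fold contains n / nfold = 4 observations, and best_idx is 4 because the synthetic values were built to peak there.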

Note

Although this function returns the index of the lambda with the highest average log-likelihood on the test sets, it is recommended to use the stand-alone *_pick functions in this package, such as CV_Pick, to find an optimal lambda, since those functions give more flexibility.