Skip to contents

Introduction

In this note we talk about some properties of the oracle estimator. This LogisticFAR package deals with high-dimensional logistic regression problem with groupwise-covariate selection feature and sum-to-zero constraint. For the oracle estimator, the important covariates are pre-given, so there’s no need for the variable selection part. We just need to perform the logistic regression with sum-to-zero constrant.

First we generate a toy example:

set.seed(1024)
n <- 200
kn <- 2
p <- 3
demox <- matrix(rnorm(n), nrow = n)
fx <- matrix(rnorm(n * p * kn), nrow = n)
beta0 <- c(1,    # intercept
           1,    # demographical covariates
           1, -2,    # group covaritates 1
           -2, 1,    # group covariates 2
           1, 1    # group covariates 3
           )
logit0 <- cbind(1, demox, fx) %*% beta0
pi0 <- exp(logit0) / (1 + exp(logit0))
y <- rbinom(n, size = 1, prob = pi0)

Here we generate a sample of size 200, with 1 demographical variables and 3 groups of covariates, each group contains 2 covariates. Here we omit other non-important group covariates since they will not be included in the oracle estimator. Note that the group covariate coefficient vector satisfying

j=1p𝛈j=𝟎kn. \sum\limits_{j = 1}^p\bm{\eta}_j = \bm{0}_{kn}.

Oracle Estimator

Fit the model via glm

Since the low-dimensional model structure is given, we can safely pick a reference level and fit a non-constraint model via glm

logitx <- cbind(demox, fx[, c(1, 3)] - fx[, 5], fx[, c(2, 4)] - fx[, 6])
colnames(logitx) <- c("demo", 
                      "x1_ref5", "x2_ref6", 
                      "x3_ref5", "x4_ref6")
glmfit <- glm(y ~ logitx, family = binomial())
glmfit$coefficients
#>   (Intercept)    logitxdemo logitxx1_ref5 logitxx2_ref6 logitxx3_ref5 
#>     0.5655721     1.0069234     0.7934938    -1.6096452    -1.6637660 
#> logitxx4_ref6 
#>     0.8454335

Here we use the 3rd group as the reference level and the full covariate effect vector would be

glmest <- c(glmfit$coefficients, 
            -(glmfit$coefficients["logitxx1_ref5"] + glmfit$coefficients["logitxx3_ref5"]), 
            -(glmfit$coefficients["logitxx2_ref6"] + glmfit$coefficients["logitxx4_ref6"]))
names(glmest) <- c("intercept", "demo", 
                   paste("x", 1 : 6, sep = ""))
glmest
#>  intercept       demo         x1         x2         x3         x4         x5 
#>  0.5655721  1.0069234  0.7934938 -1.6096452 -1.6637660  0.8454335  0.8702722 
#>         x6 
#>  0.7642117

Fit the model via LogisticFAR::Logistic_FAR_Path_Further_Improve

The LogisticFAR::Logistic_FAR_Path_Further_Improve used for post-selection estimation is essentially an unpenalized estiamtor, which is suitable for oracle estimation as long as we give the right model structure.

h <- 2    # intercept term is included
far_x <- cbind(1, demox, fx)
delta_init <- rep(0, 2)
eta_init <- rep(0.001, p * kn)
farfit <- LogisticFAR::Logistic_FAR_Path_Further_Improve(far_x, y, h = h, k_n = kn, p = p, 
                                                         delta_vec_init = delta_init, 
                                                         eta_stack_init = eta_init, 
                                                         mu1_vec_init = rep(0, kn), 
                                                         a = 1, 
                                                         weight_vec = 1, 
                                                         mu2 = 5, 
                                                         lam = 0.001, tol = 10 ^ (-10))
farest <- c(farfit$delta_vec, farfit$eta_stack_vec)
names(farest) <- c("intercept", "demo", 
                   paste("x", 1 : 6, sep = ""))
farest
#>  intercept       demo         x1         x2         x3         x4         x5 
#>  0.5653995  1.0066646  0.7932918 -1.6633419 -1.6091950  0.8452099  0.8159059 
#>         x6 
#>  0.8181323

In LogisticFAR::Logistic_FAR_Path_Further_Improve, the objective function takes the form

1aloglik(δ,η,weight_vec)+lam*(δTδ+ηTη)+μ1j=1pηj+μ22j=1pηj22 \frac{-1}{a}\mathrm{loglik}\left( \delta, \eta, weight\_vec \right) + lam * \left( \delta^T\delta + \eta^T\eta \right) + \mu_1\sum\limits_{j = 1}^p\eta_j + \frac{\mu_2}{2}\left\| \sum\limits_{j = 1}^p\eta_j \right\|_2^2

Therefore a and mu2 controls the balance between MLE and sum-to-zero constraint, lam is a robust ridge-type regularizer and weight_vec is a weight for each observation. Varying these parameters will lead to different estimators than the glm one.

Check the langrangian condition

By the method of subgradient(or equivalently the KKT condition), we know that the oracel estimator η̂jor\hat{\eta}_j^{or} satisfies

1aXjT(π(η̂or)y)+lor=𝟎kn,j=1,,p. \frac{1}{a}X_j^T\left( \pi\left(\hat{\eta}^{or}\right) - y \right) + l^{or} = \bm{0}_{kn} ,\quad\forall j = 1, \cdots, p.

We can verify this condition for the oracle estimator.

glm fit

pi_vec_glm <- glmfit$fitted.values
t(fx) %*% (pi_vec_glm - y)
#>            [,1]
#> [1,] -0.1975975
#> [2,]  4.1625945
#> [3,] -0.1975975
#> [4,]  4.1625945
#> [5,] -0.1975975
#> [6,]  4.1625945

LogisticFAR fit

pi_vec_far <- exp(far_x %*% c(farfit$delta_vec, farfit$eta_stack_vec)) / (1 + exp(far_x %*% c(farfit$delta_vec, farfit$eta_stack_vec)))
t(fx) %*% (pi_vec_far - y)
#>            [,1]
#> [1,] -0.1984088
#> [2,]  4.1651383
#> [3,] -0.1950729
#> [4,]  4.1614994
#> [5,] -0.1984187
#> [6,]  4.1617573

As we can see, they all satisfy this condition(up to some numerical error). The oracle estimator from glm based on referencing a group is more accurate than LogisticFAR fit since there is a robust regularizer in the LogisticFAR objective function also the tol and mu2 should also be adjusted for a more accurate estimator.

NOTE: If we plug-in the real underlying parameters, then we get

t(fx) %*% (pi0 - y)
#>            [,1]
#> [1,]  1.1648309
#> [2,]  2.8939424
#> [3,] -2.2444846
#> [4,]  4.3283496
#> [5,] -0.2705459
#> [6,]  4.7280568