Robust analysis of cancer heterogeneity for high-dimensional data

Composition of PAM50 subtypes for the five identified subgroups

Abstract

Cancer heterogeneity plays an important role in the understanding of tumor etiology, progression, and response to treatment. To accommodate heterogeneity, cancer subgroup analysis has been extensively conducted. However, most of the existing studies share the limitation that they cannot accommodate heavy-tailed or contaminated outcomes and also high dimensional covariates, both of which are not uncommon in biomedical research. In this study, we propose a robust subgroup identification approach based on M-estimators together with concave and pairwise fusion penalties, which advances from existing studies by effectively accommodating high-dimensional data containing some outliers. The penalties are applied on both latent heterogeneity factors and covariates, where the estimation is expected to achieve subgroup identification and variable selection simultaneously, with the number of subgroups being apriori unknown. We innovatively develop an algorithm based on parallel computing strategy, with a significant advantage of capable of processing large-scale data. The convergence property of the proposed algorithm, oracle property of the penalized M-estimators, and selection consistency of the proposed BIC criterion are carefully established. Simulation and analysis of TCGA breast cancer data demonstrate that the proposed approach is promising to efficiently identify underlying subgroups in high-dimensional data.

Publication
Robust analysis of cancer heterogeneity for high-dimensional data

An R package RSAVS is provided for this paper. Click the Code button above to check its Github repo.

Chao Cheng
Chao Cheng
Statistician

My research interests include applied statistics and machine learning.

Related