A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology

Nima Hejazi, Philippe Boileau, Mark van der Laan, Alan Hubbard

March 2023


Abstract

The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational biology and allied sciences. While the dimensionality of such datasets continues to grow, so too does the complexity of biomarker identification from exposure patterns in health studies measuring baseline confounders; moreover, doing so while avoiding model misspecification remains an issue only partially addressed. Efficient estimators capable of incorporating flexible, data adaptive regression techniques in estimating relevant components of the data-generating distribution provide an avenue for avoiding model misspecification; however, in the context of high-dimensional problems that require the simultaneous estimation of numerous parameters, standard variance estimators have proven unstable, resulting in unreliable Type-I error control under standard multiple testing corrections. We present a general approach for applying empirical Bayes shrinkage to variance estimators of a family of efficient, asymptotically linear estimators of population intervention causal effects arising from comparing counterfactual contrasts of an exposure variable. Our generalization of shrinkage-based variance estimators increases inferential stability in high-dimensional settings, facilitating the application of these estimators for deriving nonparametric variable importance measures in high-dimensional biological datasets with modest sample sizes. The result is a data adaptive approach for robustly uncovering stable associations in high-dimensional data with limited sample sizes. Our generalized variance estimator is evaluated against alternative variance estimators in a numerical experiment, and an open source R package for the Bioconductor project, biotmle, is introduced. Identification of biomarkers with the proposed methodology is demonstrated in an analysis of high-dimensional DNA methylation data from an observational study on the epigenetic effects of tobacco smoking.
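
As a rough illustration of the shrinkage idea described in the abstract, the Python sketch below moderates per-biomarker variance estimates computed from estimated influence function values. It is not the biotmle implementation. The function names, the array layout, and the hyperparameters `prior_var` and `prior_df` are assumptions made for this example; in practice an empirical Bayes procedure would estimate the prior from the full set of biomarkers, whereas here the values are simply supplied by hand. The shrinkage formula is the standard moderated-variance form, a degrees-of-freedom-weighted average of the prior variance and each sample variance.

```python
import numpy as np


def moderated_variances(if_values, prior_var, prior_df):
    """Shrink per-biomarker variance estimates toward a common prior value.

    if_values: array of shape (n_biomarkers, n_samples); row j holds the
        estimated influence function of biomarker j's estimator, evaluated
        on each sample (an assumed layout for this sketch).
    prior_var, prior_df: hypothetical, user-supplied prior variance and
        prior degrees of freedom (not estimated from the data here).
    """
    if_values = np.asarray(if_values, dtype=float)
    n_samples = if_values.shape[1]
    resid_df = n_samples - 1
    # Unmoderated sample variance of each biomarker's influence function.
    sample_var = if_values.var(axis=1, ddof=1)
    # Weighted average of prior and sample variance, weighted by degrees
    # of freedom: small samples lean more heavily on the prior.
    shrunk_var = (prior_df * prior_var + resid_df * sample_var) / (
        prior_df + resid_df
    )
    return sample_var, shrunk_var


def moderated_t(estimates, if_values, prior_var, prior_df):
    """Moderated t-type statistic: estimate over its shrunken standard error."""
    if_values = np.asarray(if_values, dtype=float)
    n_samples = if_values.shape[1]
    _, shrunk_var = moderated_variances(if_values, prior_var, prior_df)
    # An asymptotically linear estimator has variance var(IF) / n.
    std_err = np.sqrt(shrunk_var / n_samples)
    return np.asarray(estimates, dtype=float) / std_err


# Toy input: two biomarkers, three samples each (made-up numbers).
toy_if = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
toy_sample_var, toy_shrunk_var = moderated_variances(
    toy_if, prior_var=2.0, prior_df=2.0
)
toy_t = moderated_t([0.5, 1.0], toy_if, prior_var=2.0, prior_df=2.0)
```

Each shrunken variance lies between the corresponding sample variance and the prior variance, so unusually small or large variance estimates are pulled toward a common value. This is the mechanism by which shrinkage stabilizes test statistics when the sample size is modest relative to the number of biomarkers.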

Type

Journal article

Publication

In Statistical Methods in Medical Research

Nima Hejazi

Assistant Professor of Biostatistics

My research lies at the intersection of causal inference and machine learning, developing flexible methodology for statistical inference tailored to modern experiments and observational studies in the biomedical and public health sciences.