Combining Causal Inference and Machine Learning for Model-agnostic Discovery in High-dimensional Biology


The widespread availability of high-dimensional data has catalyzed biological pattern discovery. Today, the simultaneous measurement of thousands to millions of biological characteristics (in, e.g., genomics, metabolomics, proteomics) is commonplace in many experimental settings, making the screening of so many characteristics at once a central problem in computational biology and allied sciences. The information that may be gleaned from such studies promises substantial progress, yet population-level biomedical and public health sciences must often operate without access to the great precision offered by modern biological techniques for molecular-cellular level manipulations (as used in, e.g., chemical biology). Here, statistical innovations bridge the gap: they are used to dissect mechanistic processes and to mitigate the inferential obstacles imposed by potential confounding in observational (non-randomized) studies. Unfortunately, most off-the-shelf statistical techniques rely on restrictive assumptions that invite bias due to model misspecification (when the biological process under study fails to obey assumed mathematical conveniences). Model-agnostic statistical inference, drawing on causal inference and semiparametric efficiency theory, provides an avenue for avoiding restrictive modeling assumptions while obtaining robust statistical inference on scientifically relevant target parameters. We outline this framework briefly and introduce a model-agnostic approach to biomarker discovery. The proposed approach readily accommodates statistical parameters informed by causal inference and leverages state-of-the-art machine learning to construct flexible and robust estimators (mitigating model misspecification bias) while using variance moderation to deliver stable, conservative inference in high-dimensional settings.
The approach is implemented in the open-source biotmle R/Bioconductor package. This talk is based on joint work with Alan Hubbard, Mark van der Laan, and Philippe Boileau, and on a recently published manuscript.
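To make the two ingredients named in the abstract concrete, the following is a minimal sketch in Python, not the biotmle API or the authors' implementation. It assumes a binary exposure and a single binary confounder, estimates the nuisance functions with stratum means (a simple stand-in for flexible machine learning), forms an augmented inverse probability weighted (AIPW) estimate of the exposure effect on each biomarker, and then shrinks each biomarker's influence-function variance toward the across-biomarker average before computing test statistics. All function names, the simulated data, and the fixed prior_df value are hypothetical; empirical Bayes moderation as in limma estimates the prior from the data rather than fixing it.

```python
import random
from statistics import mean, variance


def aipw_effect(y, a, w):
    """AIPW estimate of the average exposure effect on one biomarker.

    Nuisance functions are stratum means over a binary covariate w
    (an illustrative stand-in for machine learning estimators).
    Returns the point estimate and the centered influence-function values.
    """
    n = len(y)
    strata = sorted(set(w))
    # Propensity score g(w) = P(A = 1 | W = w)
    g = {s: mean(a[i] for i in range(n) if w[i] == s) for s in strata}
    # Outcome regression Q(a, w) = E[Y | A = a, W = w]
    q = {(t, s): mean(y[i] for i in range(n) if a[i] == t and w[i] == s)
         for t in (0, 1) for s in strata}
    scores = []
    for i in range(n):
        q1, q0, gi = q[(1, w[i])], q[(0, w[i])], g[w[i]]
        resid = y[i] - (q1 if a[i] == 1 else q0)
        weight = a[i] / gi - (1 - a[i]) / (1 - gi)
        scores.append(q1 - q0 + weight * resid)
    psi = mean(scores)
    return psi, [s - psi for s in scores]


def moderated_statistics(effects, if_values, prior_df=4.0):
    """Test statistics with variances shrunk toward the across-biomarker mean.

    prior_df is a fixed, hypothetical choice made for this sketch.
    """
    n = len(if_values[0])
    df = n - 1
    raw_var = [variance(ic) for ic in if_values]
    prior_var = mean(raw_var)
    stats = []
    for psi, s2 in zip(effects, raw_var):
        shrunk = (prior_df * prior_var + df * s2) / (prior_df + df)
        stats.append(psi / (shrunk / n) ** 0.5)
    return stats


def simulate(n=400, n_biomarkers=20, seed=1):
    """Simulated data: only biomarker 0 is affected by the exposure."""
    rng = random.Random(seed)
    w = [rng.randint(0, 1) for _ in range(n)]
    # Exposure depends on w, so w confounds the exposure-biomarker relationship
    a = [1 if rng.random() < 0.3 + 0.4 * wi else 0 for wi in w]
    biomarkers = []
    for j in range(n_biomarkers):
        effect = 1.0 if j == 0 else 0.0
        biomarkers.append([effect * ai + 0.5 * wi + rng.gauss(0, 1)
                           for ai, wi in zip(a, w)])
    return biomarkers, a, w


biomarkers, a, w = simulate()
fits = [aipw_effect(y, a, w) for y in biomarkers]
effects = [psi for psi, _ in fits]
stats = moderated_statistics(effects, [ic for _, ic in fits])
top = max(range(len(stats)), key=lambda j: abs(stats[j]))
print("top-ranked biomarker:", top, "estimated effect:", round(effects[top], 2))
```

With only 20 biomarkers and 400 samples the shrinkage changes little; variance moderation matters most when the sample size is small relative to the number of biomarkers, where individual variance estimates are unstable.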

Mon, Mar 27, 2023 1:00 PM
B3D (Biostatistics, Biomedical Informatics, and Big Data) Seminar, Harvard T.H. Chan School of Public Health
Boston, Massachusetts, USA
Nima Hejazi
Assistant Professor of Biostatistics

My research lies at the intersection of causal inference and machine learning, developing flexible methodology for statistical inference tailored to modern experiments and observational studies in the biomedical and public health sciences.