I will soon start as an NSF postdoctoral research fellow at Weill Cornell Medicine, working with Iván Díaz and collaborating with David Benkeser. I am completing my PhD in biostatistics at UC Berkeley, under the guidance of Mark van der Laan and Alan Hubbard. During my graduate studies, I served as a founding core developer of the tlverse project, the software ecosystem for targeted learning, and enjoyed collaborations with the Bill & Melinda Gates Foundation, Fred Hutchinson Cancer Research Center, Kaiser Permanente Division of Research, Pandora, and Netflix.
My research interests sit at the intersection of nonparametric causal inference and machine learning, particularly in the development of statistical procedures tailored for efficient estimation and robust inference in flexible statistical models. Broadly, I am motivated by methodological issues arising from high-dimensional inference, loss-based estimation, semiparametric theory, and complex study designs, usually inspired by applications in computational biology, epidemiology, and vaccine trials. I am also interested in high-performance statistical computing, research software engineering, and open source software for applied statistics.
PhD in Biostatistics (designated emphasis in Computational & Genomic Biology), 2021
University of California, Berkeley
MA in Biostatistics, 2017
University of California, Berkeley
BA with a triple major in Molecular & Cell Biology (emphasis in Neurobiology), Psychology, and Public Health, 2015
University of California, Berkeley
Causal mediation analysis has historically been limited in two important regards: (i) a focus has traditionally been placed on binary treatments and static interventions, and (ii) direct and indirect effect decompositions have been pursued that are only identifiable in the absence of intermediate confounders affected by treatment. We present a theoretical study of an (in)direct effect decomposition of the population intervention effect, defined by stochastic interventions jointly applied to the treatment and mediators. In contrast to existing proposals, our causal effects can be evaluated regardless of whether a treatment is categorical or continuous and remain well-defined even in the presence of intermediate confounders affected by treatment. Our (in)direct effects are identifiable without a restrictive assumption on cross-world counterfactual independencies, allowing for substantive conclusions drawn from them to be validated in randomized controlled trials. Beyond the novel effects introduced, we provide a careful study of nonparametric efficiency theory relevant for the construction of flexible, multiply robust estimators of our (in)direct effects, all the while avoiding undue restrictions induced by assuming parametric models of nuisance parameter functionals. To complement our nonparametric estimation strategy, we introduce inferential techniques for constructing confidence intervals and hypothesis tests, and discuss open source software implementing the proposed methodology.
The advent and subsequent widespread availability of preventive vaccines has altered the course of public health over the past century. Despite this success, effective vaccines to prevent many high-burden diseases, including HIV, have been slow to develop. Vaccine development can be aided by the identification of immune response markers that serve as effective surrogates for clinically significant infection or disease endpoints. However, measuring immune response is often costly, which has motivated the usage of two-phase sampling for immune response sampling in clinical trials of preventive vaccines. In such trials, measurement of immunological markers is performed on a subset of trial participants, where enrollment in this second phase is potentially contingent on the observed study outcome and other participant-level information. We propose nonparametric methodology for efficiently estimating a counterfactual parameter that quantifies the impact of a given immune response marker on the subsequent probability of infection. Along the way, we fill in a theoretical gap pertaining to the asymptotic behavior of nonparametric efficient estimators in the context of two-phase sampling, including a multiple robustness property enjoyed by our estimators. Techniques for constructing confidence intervals and hypothesis tests are presented, and an open source software implementation of the methodology, the txshift R package, is introduced. We illustrate the proposed techniques using data from a recent preventive HIV vaccine efficacy trial.
Mediation analysis in causal inference has traditionally focused on binary treatment regimes and deterministic interventions, as well as a decomposition of the average treatment effect in terms of direct and indirect effects. In this paper we present an analogous decomposition of the population intervention effect, defined through stochastic interventions. Population intervention effects provide a generalized framework in which a variety of interesting causal contrasts can be defined, including effects for continuous and categorical exposures. We show that identification of direct and indirect effects for the population intervention effect requires weaker assumptions than its average treatment effect counterpart. In particular, identification of direct effects is guaranteed in experiments that randomize the treatment and the mediator. We discuss various estimators of the direct and indirect effects, including substitution, re-weighted, and efficient estimators based on flexible regression techniques. Our efficient estimator is asymptotically linear under a condition requiring $n^{\frac{1}{4}}$-consistency of certain regression functions. We perform a simulation study in which we assess the finite-sample properties of our proposed estimators. We present the results of an illustrative study where we assess the effect of participation in a sports team on BMI among children, using mediators such as exercise habits, daily consumption of snacks, and overweight status.
(see CV for a full list)
I will not be teaching during the 2021-2022 academic year. Check back later.
Public Health 290: Biomedical Big Data Capstone Seminar (Targeted Learning in Practice), as graduate student instructor with Prof. Mark van der Laan; Spring 2021 at the University of California, Berkeley.
Public Health 240B / Statistics 245B: Survival Analysis and Causality, as graduate student instructor with Prof. Mark van der Laan; Fall 2020 at the University of California, Berkeley.
Public Health 290: Biomedical Big Data Capstone Seminar, as graduate student instructor with Prof. Alan Hubbard; Spring 2020 at the University of California, Berkeley.
Public Health 242C / Statistics 247C: Longitudinal Data Analysis, as graduate student instructor with Prof. Alan Hubbard; Fall 2019 at the University of California, Berkeley.
Public Health 290: Targeted Learning in Biomedical Big Data, as graduate student instructor with Prof. Mark van der Laan; Spring 2018 at the University of California, Berkeley.
Nothing on tap, for now. Check back later.
Targeted Learning in the tlverse: Causal Inference Meets Ensemble Machine Learning at the Society for Epidemiologic Research Meeting; 2021 June; co-taught with Mark van der Laan, Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
Causal Mediation: Modern Methods for Path Analysis at the Society for Epidemiologic Research Meeting; 2021 May; co-taught with Iván Díaz and Kara Rudolph.
Targeted Learning in the tlverse: Causal Inference Meets Ensemble Machine Learning at the Eastern North American Region of the International Biometric Society (ENAR) Spring Meeting; 2021 March; co-taught with Mark van der Laan, Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
The tlverse Software Ecosystem for Targeted Learning at the Conference on Statistical Practice; 2020 February; co-taught with Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
The tlverse Software Ecosystem for Causal Inference at the Atlantic Causal Inference Conference; 2019 May; co-taught with Mark van der Laan, Alan Hubbard, Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
I am a member of Software Carpentry and Data Carpentry, through which I work on curriculum development, maintenance of lesson materials, and workshop delivery.
Software Carpentry: Shell, Git, and R at the Berkeley Institute for Data Science; 2019 January; co-taught with Scott Peterson and Nelle Varoquaux.
Course materials here | GitHub repository here
Software Carpentry: Shell, Git, and Python at the Berkeley Institute for Data Science; 2018 July; co-taught with Kunal Marwaha.
Course materials here | GitHub repository here
Data Carpentry: Genomics at Lawrence Berkeley National Laboratory; 2018 May; co-taught with Adam Orr.
Course materials here | GitHub repository here
Collected collateral damage from doing statistics research, hopefully useful to others.
tlverse
The tlverse is an ecosystem of R packages for Targeted Learning, of which I am a co-founder and core developer. A few of the tlverse packages to which I’ve made significant contributions include:
sl3: An R package providing a modern implementation of the Super Learner ensemble modeling algorithm that simultaneously exposes a flexible grammar for composing arbitrary pipelines for machine learning tasks. Joint work with Jeremy Coyle, Ivana Malenica, Rachael Phillips, and Oleg Sofrygin.
[Docs] | [GitHub]
origami: An R package exposing a generalized framework for applying a great variety of cross-validation schemes to arbitrary estimation functions. Joint work with Jeremy Coyle, Ivana Malenica, and Rachael Phillips.
[Docs] | [GitHub] | [CRAN] | [Paper]
hal9001: An R package providing an efficient implementation of the Highly Adaptive Lasso (HAL), a nonparametric regression estimator achieving near-parametric convergence rates under relatively mild assumptions. Joint work with Jeremy Coyle and Mark van der Laan.
[Docs] | [GitHub] | [CRAN] | [Paper]
tmle3shift: An R package for targeted maximum likelihood estimation of the causal effects of modified treatment policies on continuous-valued exposures, incorporating working marginal structural models for summarization of effect estimates. Joint work with Jeremy Coyle and Mark van der Laan.
[Docs] | [GitHub]
A significant focus of my research program centers on the intersection of causal inference and statistical machine learning. I’ve (co-)developed R packages for a range of problems: causal mediation analysis, evaluating stochastic interventions under two-phase sampling, conditional density estimation, and survival analysis.
medshift: An R package for estimating the population intervention (in)direct effects based on stochastic interventions. Classical and efficient estimators are supported for the effects of incremental propensity score interventions and modified treatment policies. Joint work with Iván Díaz.
[Docs] | [GitHub]
medoutcon: An R package for efficient estimation of interventional (in)direct effects subject to intermediate confounding, including one-step and targeted minimum loss estimators. Joint work with Iván Díaz and Kara Rudolph.
[Docs] | [GitHub]
txshift: An R package for efficient estimation of and inference on causal effects of stochastic interventions on continuous-valued exposures. Robust estimation and efficient inference under two-phase sampling are supported. Joint work with David Benkeser.
[Docs] | [GitHub] | [CRAN] | [Paper]
haldensify: An R package for nonparametric conditional density estimation based on the highly adaptive lasso, designed for estimating the generalized propensity score. Joint work with David Benkeser and Mark van der Laan.
[Docs] | [GitHub] | [CRAN]
survtmle: An R package for the construction of targeted maximum likelihood estimates of marginal cumulative incidence in right-censored survival settings with and without competing risks, including estimation procedures that respect bounds. Joint work with David Benkeser.
[Docs] | [GitHub] | [CRAN]
A parallel thread of my research concerns the development of novel statistical methodologies for application in high-dimensional and computational biology settings. Consequently, I have (co-)developed several R packages extending the Bioconductor Project.
biotmle: An R package for the model-free discovery of biomarkers from biological expression data, introducing a generalization of moderated statistics for variance stabilization of semiparametric estimators. Joint work with Alan Hubbard and Mark van der Laan.
[Docs] | [GitHub] | [Bioconductor] | [Paper]
scPCA: An R package for sparse contrastive principal component analysis, facilitating the recovery of stable and low-dimensional patterns from high-dimensional biological data while removing technical artifacts by making use of control samples. Joint work with Philippe Boileau and Sandrine Dudoit.
[GitHub] | [Bioconductor] | [Paper]
methyvim: An R package for genome-wide assessment of differential methylation based on estimation of variable importance measures at the level of CpG sites. Joint work with Mark van der Laan.
[Docs] | [GitHub] | [Bioconductor]
adaptest: An R package for multiple hypothesis testing with data-adaptive target parameters in high-dimensional settings using Targeted Learning. Joint work with Weixin Cai and Alan Hubbard.
[GitHub] | [Bioconductor] | [Paper]
cvCovEst: An R package for asymptotically optimal, cross-validated, loss-based selection of covariance matrix estimators, particularly tailored for use in high-dimensional settings. Joint work with Philippe Boileau, Brian Collica, Mark van der Laan, and Sandrine Dudoit.
[Docs] | [GitHub] | [CRAN]
nima: An R package housing my personal R toolbox, written to support statistical computing for research.
[Docs] | [GitHub] | [CRAN]