I am a PhD candidate in biostatistics, with a designated emphasis in computational and genomic biology, working with Mark van der Laan and Alan Hubbard. I am a founding core developer of the tlverse, the software ecosystem for Targeted Learning, and a workshop instructor with Software Carpentry. At UC Berkeley, I am affiliated with the Center for Computational Biology and the NIH Biomedical Big Data initiative. I have also served in biostatistical collaborations with the Bill & Melinda Gates Foundation and the Kaiser Permanente Division of Research.
My research primarily concerns the development of robust and efficient statistical techniques at the intersection of machine learning and causal inference, with the aim of facilitating data-adaptive estimation and robust nonparametric inference for complex target estimands using data from observational or randomized studies. Broadly, my interests span nonparametric estimation, semiparametric theory, high-dimensional inference, targeted learning, statistical computing, and computational biology. Of late, my methodological work has touched upon stochastic treatment regimes, causal mediation analysis, robust inference under two-phase sampling, and data-adaptive conditional density estimation. I am also keenly interested in the design of open-source software and the use of automated testing to promote reproducibility in applied statistical and scientific practice.
PhD in Biostatistics (designated emphasis in Computational and Genomic Biology), 2017-present
University of California, Berkeley
MA in Biostatistics, 2017
University of California, Berkeley
BA with a triple major in Molecular and Cell Biology (emphasis in Neurobiology), Psychology, and Public Health, 2015
University of California, Berkeley
Development of dimensionality reduction methods using contrastivity and sparsification.
Defining novel mediation effects and extensions using stochastic interventions.
Extensions and applications of causal inference based on stochastic interventions in complex settings.
Moderated variance estimators for use with semiparametric data-adaptive estimators in high-dimensional biology.
Identification of differentially methylated positions and regions based on targeted learning.
The things that keep me from working.
Assorted notes on graduate school.
(see CV for a full list)
Public Health 242C & Statistics 247C: Longitudinal Data Analysis (Fall 2019), as graduate student instructor with Prof. Alan Hubbard
Public Health 290: Targeted Learning in Biomedical Big Data (Spring 2018), as graduate student instructor with Prof. Mark van der Laan
Course materials here | GitHub repositories here
I am an active member of Software Carpentry and Data Carpentry, through which I engage in curriculum development, maintenance of lesson materials, and workshop delivery.
Software Carpentry: Shell, Git, and R at the Berkeley Institute for Data Science; 2019 Jan. 17-18; co-taught with S. Peterson and N. Varoquaux.
Course materials here | GitHub repository here
Software Carpentry: Shell, Git, and Python at the Berkeley Institute for Data Science; 2018 Jul. 16-17; co-taught with K. Marwaha.
Course materials here | GitHub repository here
Data Carpentry: Genomics at Lawrence Berkeley National Laboratory; 2018 May 3-4; co-taught with A. Orr.
Course materials here | GitHub repository here
Collected collateral damage from doing statistics research, hopefully useful to others.
tlverse
The tlverse is an ecosystem of R packages for Targeted Learning, of which I am a co-founder and core developer. A few of the tlverse packages to which I have made significant contributions include:
sl3: An R package providing a modern implementation of the Super Learner ensemble modeling algorithm that simultaneously exposes a grammar for composing arbitrary machine learning pipelines. Joint work with Jeremy Coyle, Ivana Malenica, and Oleg Sofrygin.
[Docs] | [GitHub]
origami: An R package providing a general framework for the application of various cross-validation schemes to arbitrary functions, facilitating the extension of cross-validation to a diversity of applications. Joint work with Jeremy Coyle.
[Docs] | [GitHub] | [CRAN]
hal9001: An R package providing an efficient implementation of the Highly Adaptive Lasso (HAL), a nonparametric regression estimator with optimality guarantees useful in semiparametric inference. Joint work with Jeremy Coyle and Mark van der Laan.
[Docs] | [GitHub]
tmle3shift: An R package providing a targeted maximum likelihood estimator of effects of stochastic interventions, with summarization of effects via working marginal structural models. Joint work with Jeremy Coyle and Mark van der Laan.
[Docs] | [GitHub]
A significant focus of my research program lies at the intersection of causal inference and statistical machine learning. I’ve (co-)developed R packages for settings ranging from causal mediation analysis and the assessment of stochastic intervention effects with two-phase sampling to nonparametric conditional density estimation and survival analysis.
medshift: An R package for estimating the population intervention (in)direct effects based on stochastic interventions, including both classical and efficient estimators for incremental propensity score intervention (causal) effects. Joint work with Iván Díaz.
txshift: An R package for efficient estimation of and inference on the causal effects of stochastic interventions, accommodating robust estimation and semiparametric-efficient inference in the presence of two-phase sampling. Joint work with David Benkeser.
[Docs] | [GitHub]
haldensify: An R package for nonparametric conditional density estimation using techniques based on the highly adaptive lasso, designed specifically for estimation of the generalized propensity score. Joint work with David Benkeser and Mark van der Laan.
[Docs] | [GitHub]
survtmle: An R package for the construction of targeted maximum likelihood estimates of marginal cumulative incidence in survival settings with and without competing risks, including estimators that respect bounds. Joint work with David Benkeser.
[Docs] | [GitHub] | [CRAN]
A parallel thread of my research concerns the development of novel statistical methodologies for application in high-dimensional and computational biology settings. Consequently, I have (co-)developed several R packages extending the Bioconductor Project.
biotmle: An R package for the model-free discovery of biomarkers from biological expression data, introducing a generalization of moderated statistics for variance stabilization of semiparametric estimators. Joint work with Alan Hubbard and Mark van der Laan.
[Docs] | [GitHub] | [Bioconductor]
scPCA: An R package for sparse contrastive principal component analysis, facilitating the recovery of stable and low-dimensional expression patterns while removing technical artifacts based on control samples. Joint work with Phil Boileau and Sandrine Dudoit.
[GitHub] | [Bioconductor]
methyvim: An R package for genome-wide assessment of differential methylation based on estimation of variable importance measures at the level of CpG sites. Joint work with Mark van der Laan.
[Docs] | [GitHub] | [Bioconductor]
adaptest: An R package for multiple hypothesis testing with data-adaptive target parameters in high-dimensional settings using Targeted Learning. Joint work with Weixin Cai and Alan Hubbard.
[GitHub] | [Bioconductor]