I am a PhD candidate in biostatistics, with a designated emphasis in computational and genomic biology, working with Mark van der Laan and Alan Hubbard. I am a founding core developer of the tlverse, the software ecosystem for Targeted Learning, and a workshop instructor with Software Carpentry. At UC Berkeley, I am affiliated with the Center for Computational Biology and the NIH Biomedical Big Data initiative. I have also served in biostatistical collaborations with the Bill & Melinda Gates Foundation and the Kaiser Permanente Division of Research.
My research interests primarily concern the development of robust and efficient statistical techniques at the intersection of machine learning and causal inference, aiming to facilitate data-adaptive estimation and robust nonparametric inference for complex target estimands with data from observational or randomized studies. Broadly, my interests span nonparametric estimation, semiparametric theory, high-dimensional inference, targeted learning, statistical computing, and computational biology. Of late, my methodological work has touched upon stochastic treatment regimes, causal mediation analysis, robust inference under two-phase sampling, and data-adaptive conditional density estimation. I am also keenly interested in the design of open-source software and the use of automated testing to promote reproducibility in applied statistical and scientific practice.
PhD in Biostatistics (designated emphasis in Computational and Genomic Biology), 2017-present
University of California, Berkeley
MA in Biostatistics, 2017
University of California, Berkeley
BA with a triple major in Molecular and Cell Biology (emphasis in Neurobiology), Psychology, and Public Health, 2015
University of California, Berkeley
Development of dimensionality reduction methods using contrastivity and sparsification.
Defining novel mediation effects and extensions using stochastic interventions.
Extensions and applications of causal inference based on stochastic interventions in complex settings.
Moderated variance estimators for use with semiparametric data-adaptive estimators in high-dimensional biology.
Identification of differentially methylated positions and regions based on targeted learning.
The things that keep me from working.
Assorted notes on graduate school.
(see CV for a full list)
scPCA is a toolbox for sparse contrastive principal component analysis of high-dimensional biological data. scPCA combines the …
Public Health 242C & Statistics 247C: Longitudinal Data Analysis (Fall 2019), as graduate student instructor with Prof. Alan Hubbard
Public Health 290: Targeted Learning in Biomedical Big Data (Spring 2018), as graduate student instructor with Prof. Mark van der Laan
Course materials here | GitHub repositories here
The tlverse software ecosystem for targeted learning at the Conference on Statistical Practice; 2020 February 20; co-taught with A. Hubbard, J. Coyle, I. Malenica, R. Phillips.
Course materials here | GitHub repository here
The tlverse software ecosystem for causal inference at the Atlantic Causal Inference Conference; 2019 May 22; co-taught with M. van der Laan, A. Hubbard, J. Coyle, I. Malenica, R. Phillips.
Course materials here | GitHub repository here
I am an active member of Software Carpentry and Data Carpentry, through which I engage in curriculum development, maintenance of lesson materials, and workshop delivery.
Software Carpentry: Shell, Git, and R at the Berkeley Institute for Data Science; 2019 Jan. 17-18; co-taught with S. Peterson, N. Varoquaux.
Course materials here | GitHub repository here
Software Carpentry: Shell, Git, and Python at the Berkeley Institute for Data Science; 2018 Jul. 16-17; co-taught with K. Marwaha.
Course materials here | GitHub repository here
Data Carpentry: Genomics at Lawrence Berkeley National Laboratory; 2018 May 3-4; co-taught with A. Orr.
Course materials here | GitHub repository here
Collected collateral damage from doing statistics research, hopefully useful to others.
tlverse
The tlverse is an ecosystem of R packages for Targeted Learning, of which I am a co-founder and core developer. A few of the tlverse packages to which I have made significant contributions include:
sl3: An R package providing a modern implementation of the Super Learner ensemble modeling algorithm that simultaneously exposes a grammar for composing arbitrary pipelines for machine learning. Joint work with Jeremy Coyle, Ivana Malenica, and Oleg Sofrygin.
[Docs] | [GitHub]
origami: An R package exposing a generalized framework for the application of a variety of cross-validation schemes to arbitrary functions, facilitating the extension of cross-validation to (and its use in) a diversity of applications. Joint work with Jeremy Coyle.
[Docs] | [GitHub] | [CRAN]
hal9001: An R package providing an efficient implementation of the Highly Adaptive Lasso (HAL), a nonparametric regression estimator with fast convergence guarantees under mild assumptions. Joint work with Jeremy Coyle and Mark van der Laan.
[Docs] | [GitHub] | [CRAN]
tmle3shift: An R package providing targeted maximum likelihood estimators for the causal effects of modified treatment policies on continuous-valued exposures. Incorporates nonparametric working marginal structural models for summarization of effect estimates. Joint work with Jeremy Coyle and Mark van der Laan.
[Docs] | [GitHub]
A significant focus of my research program lies at the intersection of causal inference and statistical machine learning. I’ve (co-)developed R packages for settings ranging from causal mediation analysis and the assessment of stochastic intervention effects with two-phase sampling to nonparametric conditional density estimation and survival analysis.
medshift: An R package for estimating population intervention (in)direct effects based on stochastic interventions. Classical and efficient estimators are supported for the effects of incremental propensity score interventions and modified treatment policies. Joint work with Iván Díaz.
txshift: An R package for efficient estimation of and inference on causal effects of stochastic interventions on continuous-valued exposures. Robust estimation and efficient inference under two-phase sampling are supported. Joint work with David Benkeser.
[Docs] | [GitHub]
haldensify: An R package for nonparametric conditional density estimation using techniques based on the highly adaptive lasso, designed primarily for estimation of the generalized propensity score. Joint work with David Benkeser and Mark van der Laan.
[Docs] | [GitHub] | [CRAN]
survtmle: An R package for the construction of targeted maximum likelihood estimates of marginal cumulative incidence in survival settings with and without competing risks, including estimation procedures that respect bounds. Joint work with David Benkeser.
[Docs] | [GitHub] | [CRAN]
A parallel thread of my research concerns the development of novel statistical methodologies for application in high-dimensional and computational biology settings. Consequently, I have (co-)developed several R packages extending the Bioconductor Project.
biotmle: An R package for the model-free discovery of biomarkers from biological expression data, introducing a generalization of moderated statistics for variance stabilization of semiparametric estimators. Joint work with Alan Hubbard and Mark van der Laan.
[Docs] | [GitHub] | [Bioconductor]
scPCA: An R package for sparse contrastive principal component analysis, facilitating the recovery of stable and low-dimensional patterns from high-dimensional biological data while removing technical artifacts by making use of control samples. Joint work with Philippe Boileau and Sandrine Dudoit.
[GitHub] | [Bioconductor]
methyvim: An R package for genome-wide assessment of differential methylation based on estimation of variable importance measures at the level of CpG sites. Joint work with Mark van der Laan.
[Docs] | [GitHub] | [Bioconductor]
adaptest: An R package for multiple hypothesis testing with data-adaptive target parameters in high-dimensional settings using Targeted Learning. Joint work with Weixin Cai and Alan Hubbard.
[GitHub] | [Bioconductor]