txshift R package, is introduced. We illustrate the proposed techniques using data from a recent preventive HIV vaccine efficacy trial.
I am a PhD candidate in biostatistics, with a designated emphasis in computational and genomic biology, working with Mark van der Laan and Alan Hubbard. I am a founding core developer of the tlverse, the software ecosystem for Targeted Learning, and a workshop instructor with Software Carpentry. At UC Berkeley, I am affiliated with the Center for Computational Biology and the NIH Biomedical Big Data initiative. I have also served in biostatistical collaborations with the Bill & Melinda Gates Foundation, the Kaiser Permanente Division of Research, and the Fred Hutchinson Cancer Research Center.
My research interests primarily concern the development of robust and efficient statistical techniques at the intersection of machine learning and causal inference, aiming to facilitate data adaptive estimation and robust nonparametric inference for complex target estimands with data from observational or randomized studies. Broadly, my interests span causal inference, non/semi-parametric estimation, high-dimensional inference, targeted loss-based estimation, statistical computing, and computational biology. Of late, my methodological work has touched upon causal mediation analysis, stochastic treatment regimes, robust inference under two-phase sampling, and the construction of efficient estimators using undersmoothing. I am also keenly interested in the design of open source software and the use of automated testing to promote reproducibility in applied statistical and scientific practice.
PhD in Biostatistics (designated emphasis in Computational and Genomic Biology), 2017-present
University of California, Berkeley
MA in Biostatistics, 2017
University of California, Berkeley
BA with a triple major in Molecular and Cell Biology (emphasis in Neurobiology), Psychology, and Public Health, 2015
University of California, Berkeley
(see CV for a full list)
Public Health 290: Biomedical Big Data Capstone Seminar (Spring 2020), as graduate student instructor w/ Prof. Alan Hubbard
Public Health 242C & Statistics 247C: Longitudinal Data Analysis (Fall 2019), as graduate student instructor w/ Prof. Alan Hubbard
Public Health 290: Targeted Learning in Biomedical Big Data (Spring 2018), as graduate student instructor w/ Prof. Mark van der Laan
Course materials here | GitHub repositories here
The tlverse software ecosystem for targeted learning at the Conference on Statistical Practice; 2020 February; co-taught w/ Alan Hubbard, Jeremy Coyle, Ivana Malenica, Rachael Phillips
Course materials here | GitHub repository here
The tlverse software ecosystem for causal inference at the Atlantic Causal Inference Conference; 2019 May; co-taught w/ Mark van der Laan, Alan Hubbard, Jeremy Coyle, Ivana Malenica, Rachael Phillips
Course materials here | GitHub repository here
I am an active member of Software Carpentry and Data Carpentry, through which I work on curriculum development, maintenance of lesson materials, and workshop delivery.
Software Carpentry: Shell, Git, and R at the Berkeley Institute for Data Science; 2019 January; co-taught w/ Scott Peterson, Nelle Varoquaux
Course materials here | GitHub repository here
Software Carpentry: Shell, Git, and Python at the Berkeley Institute for Data Science; 2018 July; co-taught w/ Kunal Marwaha
Course materials here | GitHub repository here
Data Carpentry: Genomics at Lawrence Berkeley National Laboratory; 2018 May; co-taught w/ Adam Orr
Course materials here | GitHub repository here
Collected collateral damage from doing statistics research, hopefully useful to others.
tlverse
The tlverse is an ecosystem of R packages for Targeted Learning, of which I am a co-founder and core developer. A few of the tlverse packages to which I have made significant contributions include
sl3: An R package providing a modern implementation of the Super Learner ensemble modeling algorithm that simultaneously exposes a grammar for composing arbitrary pipelines for machine learning. Joint work with Jeremy Coyle, Ivana Malenica, and Oleg Sofrygin.
[Docs] | [GitHub]
origami: An R package exposing a generalized framework for the application of a variety of cross-validation schemes to arbitrary functions, facilitating the extension of cross-validation to (and its use in) a diversity of applications. Joint work with Jeremy Coyle.
[Docs] | [GitHub] | [CRAN]
hal9001: An R package providing an efficient implementation of the Highly Adaptive Lasso (HAL), a nonparametric regression estimator with fast convergence guarantees under mild assumptions. Joint work with Jeremy Coyle and Mark van der Laan.
[Docs] | [GitHub] | [CRAN]
tmle3shift: An R package providing targeted maximum likelihood estimators for the causal effects of modified treatment policies on continuous-valued exposures. Incorporates nonparametric working marginal structural models for summarization of effect estimates. Joint work with Jeremy Coyle and Mark van der Laan.
[Docs] | [GitHub]
A significant focus of my research program lies at the intersection of causal inference and statistical machine learning. I’ve (co-)developed R packages for settings ranging from causal mediation analysis and the assessment of stochastic intervention effects with two-phase sampling to nonparametric conditional density estimation and survival analysis.
medshift: An R package for estimating population intervention (in)direct effects based on stochastic interventions. Classical and efficient estimators are supported for the effects of incremental propensity score interventions and modified treatment policies. Joint work with Iván Díaz.
[Docs] | [GitHub]
medoutcon: An R package for the efficient estimation of stochastic interventional (in)direct effects identifiable under intermediate confounding, including one-step and targeted minimum loss estimators. Joint work with Iván Díaz and Kara Rudolph.
[Docs] | [GitHub]
txshift: An R package for efficient estimation of and inference on causal effects of stochastic interventions on continuous-valued exposures. Robust estimation and efficient inference under two-phase sampling are supported. Joint work with David Benkeser.
[Docs] | [GitHub]
haldensify: An R package for nonparametric conditional density estimation using techniques based on the highly adaptive lasso, designed primarily for estimation of the generalized propensity score. Joint work with David Benkeser and Mark van der Laan.
[Docs] | [GitHub] | [CRAN]
survtmle: An R package for the construction of targeted maximum likelihood estimates of marginal cumulative incidence in survival settings with and without competing risks, including estimation procedures that respect bounds. Joint work with David Benkeser.
[Docs] | [GitHub] | [CRAN]
A parallel thread of my research concerns the development of novel statistical methodologies for application in high-dimensional and computational biology settings. Consequently, I have (co-)developed several R packages extending the Bioconductor Project.
biotmle: An R package for the model-free discovery of biomarkers from biological expression data, introducing a generalization of moderated statistics for variance stabilization of semiparametric estimators. Joint work with Alan Hubbard and Mark van der Laan.
[Docs] | [GitHub] | [Bioconductor]
scPCA: An R package for sparse contrastive principal component analysis, facilitating the recovery of stable and low-dimensional patterns from high-dimensional biological data while removing technical artifacts by making use of control samples. Joint work with Philippe Boileau and Sandrine Dudoit.
[GitHub] | [Bioconductor]
methyvim: An R package for genome-wide assessment of differential methylation based on estimation of variable importance measures at the level of CpG sites. Joint work with Mark van der Laan.
[Docs] | [GitHub] | [Bioconductor]
adaptest: An R package for multiple hypothesis testing with data adaptive target parameters in high-dimensional settings using Targeted Learning. Joint work with Weixin Cai and Alan Hubbard.
[GitHub] | [Bioconductor]
Efficient estimation of functional target parameters based on the highly adaptive lasso minimum loss estimator (HAL-MLE).
Cross-validated selection; robust, sparsified estimation; and dimension reduction for high-dimensional (contrastive) covariance matrices.
Defining novel, more flexible causal effects for mediation analysis, primarily using the formalism of stochastic interventions.
Estimating the causal effects of stochastic treatment regimes, including conditional density estimation and two-phase sampling corrections.
Introducing empirical Bayes variance moderation for data adaptive variable importance in high-dimensional biology applications.
The things that keep me from working.
Assorted notes on graduate school.