We develop and share open-source software for statistics, causal inference, and machine learning. Most of it is hosted on our group’s GitHub organization page: https://github.com/nshlab. Read more about the role of open-source software in our group’s research program here.
Consistent with our commitment to open science, we work exclusively in open-source programming languages, most often:
- R, for statistical computing and graphics: https://www.r-project.org/
- Julia, for fast numerical computing: https://julialang.org/
Below is a list of open-source software packages that members of the group have co-developed or otherwise significantly contributed to.
CausalTables.jl
A common interface for manipulating tabular data for causal inference in Julia. It provides utility functions to clean and manipulate datasets for downstream statistical tasks, as well as simulation capabilities that simplify extracting true conditional distributions in closed form after data have been generated.
haldensify
Nonparametric conditional density estimation based on the Highly Adaptive Lasso (HAL) algorithm, designed for estimation of the generalized propensity score (for continuous exposures).
hal9001
An efficient implementation of the Highly Adaptive Lasso (HAL) procedure, a nonparametric regression estimator achieving favorable convergence rates under global structural assumptions. Part of the TLverse project.
medoutcon
Efficient cross-fitted estimation of both natural and interventional direct and indirect effects subject to intermediate confounding, including one-step and targeted minimum loss estimators.
txshift
Efficient estimation of the causal effects of additive modified treatment policies for continuous exposures, including corrections for efficient inference under two-phase sampling designs.
sl3
An implementation of the Super Learner ensemble modeling algorithm that exposes a flexible grammar for composing arbitrary pipelines for machine learning tasks. Part of the TLverse project.
origami
A generalized framework for applying a great variety of cross-validation schemes to arbitrary estimation functions. Part of the TLverse project.
tmle3shift
Targeted maximum likelihood estimation of the causal effects of modified treatment policies for continuous exposures, incorporating working marginal structural models to summarize effect estimates. Part of the TLverse project.
sherlock
Causal machine learning and semi-parametric estimation to discover population segments based on treatment effect heterogeneity. Flexible techniques for defining segment-specific treatment rules and efficient estimators of the causal effects of these dynamic treatment regimes are implemented. Built during time spent at Netflix Research.
cvCovEst
Asymptotically optimal, cross-validated, loss-based selection of covariance matrix estimators, tailored for use in high-dimensional settings.
scPCA
Sparse contrastive principal component analysis, facilitating the recovery of stable and low-dimensional patterns from high-dimensional biological data while removing technical artifacts by making use of control samples.
biotmle
Model-agnostic discovery of biomarkers from biological sequencing and expression data, introducing a hypothesis testing strategy with variance moderation to stabilize semi-parametric estimators in small-sample settings.
medshift
Estimation of population intervention direct and indirect effects based on stochastic interventions. Classical and efficient estimators are supported for the effects of incremental propensity score interventions and modified treatment policies.
survtmle
Targeted maximum likelihood estimates of marginal cumulative incidence in right-censored survival settings with and without competing risks, including estimation procedures that respect bounds.
LtAtStructuR
Restructures a collection of time-stamped measurements (e.g., electronic health record data) into a standard long-format analytic dataset suitable for evaluating the effects of multiple time-point interventions in the presence of time-dependent confounding or selection bias.