We develop and share open-source software for statistics, causal inference, and machine learning. Most of it is hosted on our group’s GitHub organization page at https://github.com/nshlab. Read more about the role of open-source software in our group’s research program here.

Consistent with our commitment to open science, we exclusively work in open-source programming languages. Most often, these include


Below is a list of open-source software packages that members of the group have co-developed or otherwise significantly contributed to.

CausalTables.jl

A common interface for manipulating tabular data for causal inference in Julia. It provides utility functions to clean and manipulate datasets for downstream statistical tasks, along with simulation capabilities that simplify extracting the true conditional distributions in closed form once data have been generated.

Website Repo Paper

haldensify

Nonparametric conditional density estimation based on the highly adaptive lasso algorithm, designed for estimating the generalized propensity score for continuous exposures.

Website Repo Package Paper

hal9001

An efficient implementation of the Highly Adaptive Lasso (HAL) procedure, a nonparametric regression estimator achieving favorable convergence rates under global structural assumptions. Part of the TLverse project.

Website Repo Package Paper

medoutcon

Efficient cross-fitted estimation of both natural and interventional direct and indirect effects subject to intermediate confounding, including one-step and targeted minimum loss estimators.

Website Repo Paper

txshift

Efficient estimation of the causal effects of additive modified treatment policies for continuous exposures, including corrections for efficient inference under two-phase sampling designs.

Website Repo Package Paper

sl3

An implementation of the Super Learner ensemble modeling algorithm that exposes a flexible grammar for composing arbitrary pipelines for machine learning tasks. Part of the TLverse project.

Website Repo

origami

A generalized framework for applying a wide variety of cross-validation schemes to arbitrary estimation functions. Part of the TLverse project.

Website Repo Package Paper
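To illustrate the idea of applying a cross-validation scheme to an arbitrary estimation function, here is a minimal sketch of V-fold cross-validation in plain Python. It is not origami's API: the names `split_folds`, `cv_risk`, `fit_mean`, and `squared_error` are hypothetical and chosen only for this example, and the toy data are made up.

```python
import random
from statistics import mean


def split_folds(n, v, seed=0):
    """Split indices 0..n-1 into v disjoint validation sets of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::v] for k in range(v)]


def cv_risk(data, v, fit, loss, seed=0):
    """Average validation loss over v folds.

    fit(train) must return a fitted object; loss(fitted, valid) must return
    a float. Both are supplied by the caller, so any estimator can be used.
    """
    fold_losses = []
    for valid_idx in split_folds(len(data), v, seed):
        held_out = set(valid_idx)
        train = [x for i, x in enumerate(data) if i not in held_out]
        valid = [data[i] for i in valid_idx]
        fold_losses.append(loss(fit(train), valid))
    return mean(fold_losses)


# Hypothetical estimation function: predict every point by the training mean.
def fit_mean(train):
    return mean(train)


# Hypothetical loss: mean squared error on the validation set.
def squared_error(fitted, valid):
    return mean((x - fitted) ** 2 for x in valid)


toy_data = [float(i) for i in range(20)]
risk = cv_risk(toy_data, v=5, fit=fit_mean, loss=squared_error)
print(f"5-fold cross-validated risk: {risk:.3f}")
```

Because `fit` and `loss` are passed in as functions, the same loop works for any estimator. Changing how `split_folds` builds its validation sets would give a different cross-validation scheme.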

tmle3shift

Targeted maximum likelihood estimation of the causal effects of modified treatment policies for continuous exposures, incorporating working marginal structural models to summarize effect estimates. Part of the TLverse project.

Website Repo

sherlock

Causal machine learning and semi-parametric estimation to discover population segments based on treatment effect heterogeneity. The package implements flexible techniques for defining segment-specific treatment rules, together with efficient estimators of the causal effects of these dynamic treatment regimes. Built during time spent at Netflix Research.

Website Repo Paper

cvCovEst

Asymptotically optimal, cross-validated, loss-based selection of covariance matrix estimators, tailored for use in high-dimensional settings.

Website Repo Package Paper

scPCA

Sparse contrastive principal component analysis, facilitating the recovery of stable and low-dimensional patterns from high-dimensional biological data while removing technical artifacts by making use of control samples.

Repo Package Paper

biotmle

Model-agnostic discovery of biomarkers from biological sequencing and expression data, introducing a hypothesis testing strategy with variance moderation to stabilize semi-parametric estimators in small-sample settings.

Website Repo Package Paper

medshift

Estimation of population intervention direct and indirect effects based on stochastic interventions. Classical and efficient estimators are supported for the effects of incremental propensity score interventions and modified treatment policies.

Website Repo

survtmle

Targeted maximum likelihood estimates of marginal cumulative incidence in right-censored survival settings with and without competing risks, including estimation procedures that respect bounds.

Repo

LtAtStructuR

Restructures a collection of time-stamped measurements (e.g., electronic health record data) into a standard long-format analytic dataset suitable for evaluating the effect of multiple time-point interventions in the presence of time-dependent confounding or selection bias.

Repo