open-source software

Published

24 June 2025

An early call to embrace computing and programming in statistical science was issued by John Tukey, who saw such activities as critical to the next generation of developments in statistical data analysis (note: this was in the 1960s). Despite Tukey’s prominence and timely concerns, only a small fraction of the total academic effort in statistics research has been concentrated on developing, extending, or improving standards for statistical programming, computing, and graphics, or on the complementary paradigms of literate programming and literate computing. Cross-talk between statistics and allied quantitative sciences led to further concern about the state of research reproducibility. For example, drawing on lessons learned from their development of Wavelab, a software toolkit for computational wavelet analysis, a few prominent statisticians bemoaned the lack of availability of scientific software, noting that the “release of software underlying scientific publication is the exception rather than the rule,” while simultaneously pointing out that, with respect to scientific publishing, “the actual scholarship is the complete software development environment and the complete set of instructions which generated the figures” (emphasis added). Fortunately, the emergence and fast-paced growth of the interdisciplinary field of data science have been marked by a renewed and fervent interest in open-source software and reproducible research, including standards for statistical software.

Today, intricate statistical procedures are routinely deployed to rigorously evaluate scientific claims. Open-source statistical software is poised to play a central role in helping to enhance transparency in the scientific discovery process. For such practice to thrive, reliable and user-friendly software must become the norm. Five key characteristics for such software include:

Clear, easily accessible, highly detailed documentation of all code-derived interfaces and objects, whether developer-oriented or user-facing.
Rigorous and focused testing to assess programmatic procedures (e.g., functions, classes, methods) and data structures (e.g., the ubiquitous data frame).
In-depth examples using literate programming documents that blend executable code with prose (e.g., Quarto) or literate computing notebooks that promote the interactive development of computation informed by data (e.g., Jupyter).
Open source development, embodying an ongoing, continuous, public peer review of the research product.
Automated monitoring of software quality through continuous integration services, ensuring accessibility across diverse computer systems and architectures.
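
As a minimal sketch of the first two characteristics (documentation and focused testing), consider the small example below. It is written in Python purely for illustration; the function `weighted_mean` and its test class are hypothetical and are not drawn from any of the packages listed on this page. The docstring documents the interface, and each unit test checks one narrowly defined behavior.

```python
import math
import unittest


def weighted_mean(values, weights):
    """Return the weighted mean of `values` (hypothetical illustrative function).

    Parameters
    ----------
    values : sequence of float
        Observations to average.
    weights : sequence of float
        Non-negative weights, one per observation, with a positive sum.

    Raises
    ------
    ValueError
        If the lengths differ, any weight is negative, or the weights
        do not have a positive sum.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = math.fsum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return math.fsum(v * w for v, w in zip(values, weights)) / total


class TestWeightedMean(unittest.TestCase):
    """Each test targets a single, clearly stated property."""

    def test_equal_weights_match_arithmetic_mean(self):
        self.assertAlmostEqual(weighted_mean([1.0, 2.0, 3.0], [1, 1, 1]), 2.0)

    def test_zero_weight_drops_observation(self):
        self.assertAlmostEqual(weighted_mean([1.0, 100.0], [1, 0]), 1.0)

    def test_mismatched_lengths_raise(self):
        with self.assertRaises(ValueError):
            weighted_mean([1.0, 2.0], [1.0])
```

Saved to a file, the tests can be run locally with `python -m unittest`. A continuous integration service would typically run the same command on every change, which is how the fifth characteristic builds on the second.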

Equipped with these characteristics, open-source software for statistics can empower the scientific community—and possibly even the public at large—to directly access published results of scientific investigations. Practices for reproducible research in the statistical sciences have been the subject of much discussion; yet, the core aspects of software development continue to be viewed as ancillary activities in most of academia. Recalling an anecdote about Wavelab, without continued investment in the development and promotion of open-source software standards, “a year [will continue to be] a long time in this business.”

A thread of our work aims to make such occurrences rarer. To that end, we routinely develop open-source software packages and workflows to accompany our contributions to both novel statistical methodology and applied science investigations, releasing the unrestricted source code for most projects on GitHub. Each software package includes online documentation, a suite of unit tests, and automated quality control via continuous integration checking. Browse a non-exhaustive list of the open-source software packages that members of the lab have developed. Consistent with our view that open-source software development is a core research activity, we publish our work in this area (see below).

selected publications

Some representative publications appear below. Please consult Google Scholar or a recent CV for a comprehensive list. Below, the names of lab members appear in bold, including those of trainees and of the PI; the names of frequent collaborators are italicized and underlined.

Balkus, S.V. and Hejazi, N.S. (2025) CausalTables.jl: Simulating and storing data for statistical causal inference in Julia. Journal of Open Source Software, 10, 7580.

Benkeser, D. and Hejazi, N.S. (2023) Doubly-robust inference in R using drtmle. Observational Studies, 9, 43–78.

Boileau, P., Hejazi, N.S., Collica, B., van der Laan, M.J. and Dudoit, S. (2021) cvCovEst: Cross-validated covariance matrix estimator selection and evaluation in R. Journal of Open Source Software, 6, 3273.

Boileau, P., Hejazi, N.S. and Dudoit, S. (2020) scPCA: A toolbox for sparse contrastive principal component analysis in R. Journal of Open Source Software, 5, 2079.

Coyle, J.R. and Hejazi, N.S. (2018) origami: A generalized framework for cross-validation in R. Journal of Open Source Software, 3, 512.

Hejazi, N.S. and Benkeser, D.C. (2020a) txshift: Efficient estimation of the causal effects of stochastic interventions in R. Journal of Open Source Software, 5, 2447.

Hejazi, N.S., Cai, W. and Hubbard, A.E. (2017) biotmle: Targeted Learning for biomarker discovery. Journal of Open Source Software, 2, 295.

Hejazi, N.S., Coyle, J.R. and van der Laan, M.J. (2020b) hal9001: Scalable highly adaptive lasso regression in R. Journal of Open Source Software, 5, 2526.

Hejazi, N.S., Rudolph, K.E. and Díaz, I. (2022a) medoutcon: Nonparametric efficient causal mediation analysis with machine learning in R. Journal of Open Source Software, 7, 3979.

Hejazi, N.S., van der Laan, M.J. and Benkeser, D. (2022b) haldensify: Highly adaptive lasso conditional density estimation in R. Journal of Open Source Software, 7, 4522.