open-source software

An early call to embrace computing and programming in statistics was issued by John Tukey, who saw such activities as critical to the next generation of developments in statistical data analysis. Despite Tukey’s prominence and the timeliness of his concerns, only a small fraction of the total academic effort in statistics research is concentrated on the development and improvement of standards for statistical programming, computing, and graphics or on the complementary paradigms of literate programming and literate computing. Cross-talk between statistics and allied quantitative sciences led to further concern about the state of research reproducibility. For example, reflecting on lessons learned from their development of Wavelab, a software toolkit for wavelet analysis, several prominent statisticians bemoaned the lack of availability of scientific software, noting that the “release of software underlying scientific publication is the exception rather than the rule” while simultaneously observing that, with respect to scientific publishing, “the actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.” Fortunately, the decades leading to the emergence of the interdisciplinary field of data science have been marked by a renewed and fervent interest in open-source software development and reproducible research.

In an era in which intricate statistical procedures are routinely deployed to rigorously evaluate scientific claims, open-source statistical software is poised to play a central role in enhancing transparency in the scientific discovery process. For such practice to thrive, reliable, user-friendly software must become the norm, rather than the exception. Such software is characterized by at least five essential properties.

Equipped with these characteristics, open-source software for statistics can empower the scientific community, and possibly even the public at large, to directly access the published results of scientific investigations. Practices for reproducible research in the statistical sciences have been the subject of much discussion; yet, in much of academia, the core aspects of software development are still viewed as ancillary activities. To recall an anecdote about Wavelab: without continued investment in the development and promotion of open-source software standards, “a year [will continue to be] a long time in this business.”

A thread of our work aims to make such occurrences rarer. To that end, we routinely develop open-source software packages to accompany both our methodological and applied science research, releasing the unrestricted source code for each package on GitHub. Each software package includes online documentation, a suite of unit tests, and automated quality control via continuous integration checking. Browse a non-exhaustive list of the open-source software packages that members of our group have developed here.

selected publications

Some representative publications appear below. Please consult Google Scholar or a recent CV for a comprehensive list. Below, the names of group members appear in bold, including those of trainees and of the PI; the names of frequent collaborators are italicized and underlined.

Balkus, S.V. and Hejazi, N.S. (2025) CausalTables.jl: Simulating and storing data for statistical causal inference in Julia. Journal of Open Source Software, 10, 7580.
Benkeser, D. and Hejazi, N.S. (2023) Doubly-robust inference in R using drtmle. Observational Studies, 9, 43–78.
Boileau, P., Hejazi, N.S., Collica, B., van der Laan, M.J. and Dudoit, S. (2021) cvCovEst: Cross-validated covariance matrix estimator selection and evaluation in R. Journal of Open Source Software, 6, 3273.
Boileau, P., Hejazi, N.S. and Dudoit, S. (2020) scPCA: A toolbox for sparse contrastive principal component analysis in R. Journal of Open Source Software, 5, 2079.
Coyle, J.R. and Hejazi, N.S. (2018) origami: A generalized framework for cross-validation in R. Journal of Open Source Software, 3, 512.
Hejazi, N.S. and Benkeser, D.C. (2020a) txshift: Efficient estimation of the causal effects of stochastic interventions in R. Journal of Open Source Software, 5, 2447.
Hejazi, N.S., Cai, W. and Hubbard, A.E. (2017) biotmle: Targeted Learning for biomarker discovery. Journal of Open Source Software, 2, 295.
Hejazi, N.S., Coyle, J.R. and van der Laan, M.J. (2020b) hal9001: Scalable highly adaptive lasso regression in R. Journal of Open Source Software, 5, 2526.
Hejazi, N.S., Rudolph, K.E. and Díaz, I. (2022a) medoutcon: Nonparametric efficient causal mediation analysis with machine learning in R. Journal of Open Source Software, 7, 3979.
Hejazi, N.S., van der Laan, M.J. and Benkeser, D. (2022b) haldensify: Highly adaptive lasso conditional density estimation in R. Journal of Open Source Software, 7, 4522.