open-source software
An early call to embrace computing and programming in statistics was issued by John Tukey, who saw such activities as critical to the next generation of developments in statistical data analysis. Despite Tukey’s prominence and timely concerns, only a small fraction of the total academic effort in statistics research is concentrated on the development and improvement of standards for statistical programming, computing, and graphics or on the complementary paradigms of literate programming and literate computing. Cross-talk between statistics and allied quantitative sciences led to further concern about the state of research reproducibility. For example, reflecting on lessons learned from their development of Wavelab, a software toolkit for wavelet analysis, several prominent statisticians bemoaned the lack of availability of scientific software, noting that the “release of software underlying scientific publication is the exception rather than the rule” while simultaneously observing that, with respect to scientific publishing, “the actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.” Fortunately, the decades leading to the emergence of the interdisciplinary field of data science have been marked by a renewed and fervent interest in open-source software development and reproducible research.
In an era in which intricate statistical procedures are routinely deployed to rigorously evaluate scientific claims, open-source statistical software is poised to play a central role in enhancing transparency in the scientific discovery process. For such practice to thrive, reliable, user-friendly software must become the norm, rather than the exception. Such software is characterized by at least five essential properties.
- Clear, easily accessible, highly detailed documentation of all code-derived interfaces and objects, whether developer-oriented or user-facing.
- Rigorous and focused testing to assess programmatic procedures (e.g., functions, classes, methods) and data structures (e.g., the ubiquitous data frame).
- In-depth examples using literate programming documents that blend executable code with prose (e.g., Quarto) or literate computing notebooks that promote the interactive development of computation informed by data (e.g., Jupyter notebooks).
- Open-source development, embodying an ongoing, continuous, public peer review of the research product.
- Automated monitoring of software quality through continuous integration services, ensuring accessibility across diverse computer systems and architectures.
Equipped with these characteristics, open-source software for statistics can empower the scientific community, and possibly even the public at large, to directly access the published results of scientific investigations. Practices for reproducible research in the statistical sciences have been the subject of much discussion; yet, in much of academia, the core aspects of software development are still viewed as ancillary activities. To recall an anecdote about Wavelab: without continued investment in the development and promotion of open-source software standards, “a year [will continue to be] a long time in this business.”
A thread of our work aims to make such occurrences rarer. To that end, we routinely develop open-source software packages to accompany both our methodological and applied science research, releasing the unrestricted source code for each package on GitHub. Each software package includes online documentation, a suite of unit tests, and automated quality control via continuous integration checking. Browse a non-exhaustive list of the open-source software packages that members of our group have developed here.
selected publications
Some representative publications appear below. Please consult Google Scholar or a recent CV for a comprehensive list. Below, the names of group members appear in bold, including those of trainees and of the PI; the names of frequent collaborators are italicized and underlined.