Bashing irreproducibility with shournal

Tycho Kirchner
Konstantin Riege
Steve Hoffmann

This article has been Reviewed by the following groups

Listed in

Evaluated articles (Arcadia Science)

Abstract

The Linux shell is arguably one of the most important computational tools across various scientific disciplines. Its high flexibility makes it the platform of choice for many file operations and smaller scripting tasks. Also, many stand-alone programs are called from the Linux shell - typically followed by multiple command-line parameters. However, in larger analysis projects, keeping track of the work quickly becomes challenging, as a typical shell workflow involves the iterative execution of commands with many parameters, modification of scripts, and editing of configuration files. Too often, researchers find themselves in the uncomfortable position that a computational result generated a few weeks ago can no longer be reproduced, despite having taken great care documenting the work manually. On the other hand, there is a lack of tools able to record the researcher’s shell activity automatically with reasonably low runtime- and storage overhead. To close this critical gap, we developed shournal, a program that tightly integrates with the Linux shell and automatically records every shell command along with the files it reads or writes. Besides logging command- and file metadata, such as working directory, file path, and checksums, shournal can be configured to archive scripts or configuration files that are not regularly under version control via git or svn.

Arcadia Science
Aug 22, 2023

Dear taylor.reiter,

Thanks a ton for your comments and suggestions! We will revise our manuscript and submit a new version soon.

In the meantime, we have tested shournal on an AWS EC2 instance. We were able to install shournal after adding the universe repository (sudo add-apt-repository universe); apparently, this repo is not enabled by default on AWS instances. We have added this information to our GitHub page.

We would be super grateful if you could give us feedback on whether this fix solves your problem!

Cheers, Steve

Read the original source
Arcadia Science
Aug 11, 2023

In this manuscript, the authors present shournal, a tool to help with tracing shell commands that have been run on Linux computers. shournal sits in a space between iterative computational experiment and codifying those steps in a workflow. I'm excited by the concept of radical repeatability that lightweight tools like shournal could usher in.

I was unable to install shournal from the instructions on the GitHub page, so this review does not cover feedback on the tool itself. I was eager to try out the Snakemake integration and was sad not to be able to. I tried to install on an AWS EC2 instance (Ubuntu, t2.micro, using the latest release of shournal).

From a high-level usability and adoption perspective, I think two things currently decrease the likelihood of shournal's broad adoption. First, the fact that there is no Mac or Windows distribution decreases shournal's audience. Second, the fact that shournal may be ineffective on HPCs further limits the audience (both by shournalk and by the event history not tracing over multiple machines). These limitations do not decrease its conceptual addition to the field, but will decrease the likelihood of adoption.

Culturally, I think there are pros and cons to shournal. On the pro side, I think having more tools in the reproducibility arsenal is a positive thing. Shournal meets scientists where they're at as they determine the best scripts to run on their data. However, I worry that reliance on shournal could lead to sub-par documentation for computational experiments. If researchers are in the habit of recording their commands with notes, reliance on shournal may change this process, removing helpful metadata from command recordings. It is difficult to know how a tool like shournal could change the overall working habits of e.g. bioinformaticians, but it would be interesting to conduct a study on how adoption of shournal improves or detracts from reproducibility and documentation. (To be clear, I am definitely NOT suggesting that that be done as part of this paper! But I think shournal could encourage a sea change in computing documentation, so it would be interesting from a metascience perspective to understand the benefits and drawbacks of those changes, and then how shournal could eventually be modified to reduce the drawbacks.)

One of the limitations suggested in the supplement is that "provenance of binary executables is not tracked." Would it be possible to parse the help messages of binary executables or look at the stdout for version numbers or other tells of the software? This is far less elegant than shournalk's current approach, but I wanted to supply it as a brainstorming idea in case the authors find it useful to iterate from. Alternatively, could the checksum of the binary executable be tracked?
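The checksum idea raised above could be sketched roughly as follows. This is only an illustration in Python and is not part of shournal or a description of how shournal works. The function names `sha256_of_file` and `executable_fingerprint` are hypothetical, and SHA-256 is an assumed choice of hash. The sketch resolves a command name on the PATH, follows symlinks to the real binary, and records its path, size, and checksum.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large binaries are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def executable_fingerprint(command: str) -> Optional[dict]:
    """Return path, size, and checksum of the executable behind `command`.

    Hypothetical helper for illustration: returns None if the command
    cannot be found on the PATH (or is not an executable file).
    """
    located = shutil.which(command)
    if located is None:
        return None
    resolved = Path(located).resolve()  # follow symlinks to the real binary
    return {
        "command": command,
        "path": str(resolved),
        "size_bytes": resolved.stat().st_size,
        "sha256": sha256_of_file(resolved),
    }


if __name__ == "__main__":
    # Example: fingerprint whichever `ls` is first on the PATH (None if absent).
    print(executable_fingerprint("ls"))
```

A record like this would change whenever the binary is updated or replaced, which is the kind of "tell" the version-string parsing idea is also after. It would not by itself capture shared libraries or interpreter versions.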

Lastly, I left comments inline on the manuscript itself, but I also wanted to note that the first paragraph of the supplement provides important background knowledge that I think would be better served in the introduction of the paper if there is space to include it.

Read the original source
Arcadia Science
Aug 11, 2023

https://github.com/tycho-kirchner/ shournal

I think this URL has a typo in it

Read the original source
Arcadia Science
Aug 11, 2023

Second, tracing of file actions is limited to the comparatively rare close operation and lets the traced process return quickly by delegating further provenance collection to another thread

Clever -- so something like ls or cd would be totally ignored, but any program that actually looks at the data of a file will register, right?

Read the original source
Arcadia Science
Aug 11, 2023

Typical workflows

This is slightly confusing wording for me -- does this mean typical shournal runs? For me, workflow is conflated with workflow engines, which I typically assume have a fairly substantial overhead (e.g. a Snakemake workflow with a DAG of over 1 million processes can take ~16 GB of RAM to run).

Read the original source
Arcadia Science
Aug 11, 2023

run permanently

What does this mean? It seems like shournal still needs to be activated at each shell session. Are you recommending that this be achieved by adding activation to the bash profile/bashrc, or is this only a statement that shournal was engineered to have a low overhead such that it could run indefinitely, even simultaneously with processes that demand high CPU, I/O, and/or RAM?

Read the original source
Arcadia Science
Aug 11, 2023

conceptually more extensive design goals

Do you have space to expand on this concept, even with another half sentence or sentence, on how the goals of shournal differ from the previously mentioned tools? I'm a huge Nextflow & Snakemake user and a big proponent of repeatable and reproducible computation, but I don't have the background to contextualize this comment, which I think draws away from the potential impact of shournal in this space.

Read the original source
Arcadia Science
Aug 11, 2023

Ruiz, Richard

I think this citation may be broken

Read the original source
Arcadia Science
Aug 11, 2023

making later re-execution easier

By virtue of being recorded so a scientist can go back and re-trace their steps, or through some other mechanism?

Read the original source
Arcadia Science
Aug 11, 2023

(d) shournal’s tracing performance in various scenarios as relative runtime overhead. Boxes for both, kernel module- (KMOD) and fanotify backend are displayed. For comparison, our measured tracing overhead of Burrito, SPADE and the ptrace-based strace is shown as well.

Is the unit for the y axis seconds?

Read the original source
Arcadia Science
Aug 11, 2023

even if the original files have been modified or deleted

For clarification, this is only if the original scripts or configuration files have been deleted, not if the data files (like a FASTQ or something) have been removed?

Read the original source
Version published to 10.1101/2020.08.03.232843v2 on bioRxiv
Aug 4, 2023
Version published to 10.1101/2020.08.03.232843v1 on bioRxiv
Aug 5, 2020
