Data and Code Management With Bash, GitHub, and Nextflow
Preface
This is a limited introduction to a number of tools I needed to learn quickly after my PhD program. I originally called this a short intro, but it grew as I found that even basic tasks required more information than online documentation led me to believe - I even removed a section on Docker (maybe for later, but Docker actually has good tutorials out there).
Also, I’m writing this as if you have no experience with Bash, GitHub, or Nextflow - these tools are not taught in most Epidemiology programs, leaving researchers to figure them out on their own time or rely on having budgetary space for a bioinformatician. These notes will get you set up quickly and hopefully remove any confusion or intimidation. I’m also going to assume that if you are reading this, you’ve heard of at least one of these tools by name.
My primary use case for these tools is the All of Us workbench. The secure workspace was created with a strong focus on Jupyter notebooks, and while I’ve used them in the past, they are my least preferred interface. With GitHub, though, I can write code on my local machine in RStudio if I want, then push/pull the code into the Jupyter environment in All of Us. I don’t have to spend money on time in the workbench, and I can work in my most comfortable environment.
Finally, after writing most of the post, it became clear that there was actually a lot in it. You can skip to sections that are most relevant for you using the navbar!
I - Bash
Bash is how one uses a command line/terminal, specifically in Unix-like operating systems, including Linux distributions and macOS.
I strongly suggest that you use Bash (the Unix "Bourne Again SHell"). Some of these tools have point-and-click interfaces (like GitHub Desktop - which you can use in place of Git Bash if it covers your needs!), but you can do everything here in your Bash terminal, and you may not always have access to point-and-click interfaces when using a high performance cluster (HPC) or secured workspace (like the NIH’s All of Us workbench). Bash commands also don’t take too long to figure out.
I use Git Bash; however, all Bash is Bash. The terminal in RStudio is Bash, your Unix command line is Bash, and so is any tool built on Unix that has a command line/terminal. That’s why Bash is helpful to know - you can use it in most workspaces or clusters, and it is incredibly versatile.
Install Git Bash
Navigate to the git website and download the version that is appropriate for your computer.
Once installed, you can search for it and open it - it looks like a terminal window. Now you’re GUI-free.
Initial Setup Bash Commands
#initially you will need to tell git your name and email, so that it is set up correctly:
#set username:
$ git config --global user.name "Your Name"
#set your email address - used for committing
$ git config --global user.email "example@email.edu"
Basic Commands in Bash
Below are some basic commands that can help you navigate using Bash - there is also a more thorough cheatsheet here. Definitely play around with these a bit and get accustomed.
One thing of note is that you cannot use the shortcuts CTRL + C or CTRL + V to copy/paste; however, you can right-click and use the copy/paste commands from the pop-up menu.
#activate "help": this shows ALL the commands available
$ help
#create a directory
$ mkdir directory_name
#explain a command (command "cat" will be explained)
$ cat --help
#list everything in the active directory/folder
$ ls
#which can show even hidden files (like a github repository) with:
$ ls -a
#change directory
$ cd pathway/to/your.file
#if a directory is inside the current directory, you can use the dot (.) as shorthand for "here" to move down
$ cd ./dir/subdir
#move up to parent directory - the one that contains the current directory
$ cd ..
#can also use tilde for your user home directory as a shorthand. This command is similar to typing 'C:/Users/username/Documents'
$ cd ~/Documents
# remove/delete a file
$ rm pathway/to/your.file
#remove a WHOLE DIRECTORY
$ rm -r folder/pathway/
#move a file (assumes you are in the file's directory)
$ mv file_to_move.txt pathway/to/new_location/
#create an output file with results of a command
$ echo "hello world" > output.txt
#this puts the output 'hello world' into a new file called 'output.txt' in the active directory
# runs a script file (executes its commands in the current shell)
$ source file_name
# runs an R script in the background using &
$ Rscript my_script.R &
#logout and quit (closes the terminal)
$ exit
Filenames and “Cases”
You may quickly find that it is problematic or frustrating to work with files that have spaces in their names. Best practice is generally to avoid spaces in names so that you can type them in Bash without issue, and I agree (technically, you can often get by with quotation marks).
The two most common ways to do this are "snake case" and "camel case":
Snake case uses all lowercase characters, separates words with an underscore ("_"), and is generally easier to read.
- e.g.: this_is_a_snake_case_file
Camel case does not use underscores, but marks the start of each word with a capital letter; it is generally easier to type.
- e.g.: ThisIsACamelCaseFile
Vim & Editing Files in Bash
Vim is technically its own tool, but it comes with Git Bash and can be used to edit files directly in the Git Bash terminal. Below are basic commands for navigating in Vim - normal-mode commands like i and u take effect immediately, while the colon commands (:w, :q) need ENTER.
#to open a file using Vim:
vim file_name.txt
#to edit the file - enter INSERT mode:
i
#then hit ESC to exit insert mode back to normal mode
#undo changes
u
#to save changes to the file (w for write, like write to disk):
:w
#exit Vim editor (q for quit)
:q
#save and exit at once:
:wq
II - Git & GitHub
Git and GitHub are separate tools that, combined, can extensively change the way you work with code and how you share it.
Technically, Git alone will manage everything locally for you, as we did above. However, GitHub is how you create a code repository for sharing reproducible research, collaborating on and managing analyses across workspaces and environments, and version control - and it can be interfaced with entirely through Git Bash as well.
One of the most interesting things about Git is that it monitors changes in code line by line, not simply by whether a script changed or not, the way saving a file does. This allows Git to update any script on a line-by-line basis - if you change one line and your collaborator changes another, Git will merge the edits so that both lines are updated in the script - and any lines you didn’t touch will not be changed.
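You can see these line-level changes for yourself at any time with git diff (the file name below is just an example):
#show what changed, line by line, in a tracked file
$ git diff analysis.R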
Git
First, initialize and set up a local repository on your computer. Download Git Bash here, open the Git Bash terminal once it is installed, and then we can set things up.
Use the Git Bash terminal to navigate to the project folder where you want to set up your repository, and then you can use the following commands.
#initialize the local repository
git init
#set up your name and ID for the repository
git config --global user.name "Your Name"
git config --global user.email "youremail@example.com"
Once that is set up, you can use your repository on your local machine. However, the real value comes when pairing it with GitHub.
GitHub
GitHub is an amazing resource - you get a dedicated remote space to store and share code (and you can use Codespaces to work in VS Code). All you need to do is head over to their page and sign up! Any person or local repository securely connected to the GitHub remote repository can then find the code there, pull it to their local repository, or push their edits to the shared remote repository.
After creating an account and setting up a new repository, GitHub will prompt you to create a README.md file. If you didn’t know, the .md file type is "markdown." You can do that now or later; the Markdown section below should be helpful.
Keys and Connecting
In order to connect to GitHub, you have to share a secure shell (SSH) public key with GitHub. This is sort of like a security token that lets you share information securely and conveniently without having to enter a password every time. SSH keys come in two parts - a public key and a private key. You share your public key, and never your private key.
Generating an SSH key is quite easy - just navigate to the directory where your repository is and enter the following code in Bash.
# generate a key set with your email as comment
$ ssh-keygen -t rsa -b 4096 -C "youremail@example.com"
Once you do that, it will ask you to name the key and choose a password, if you want. If you do not want to use a password, just hit enter twice.
In some cases, you may need to start the SSH client after that - just to be thorough I would also run the following:
#initialize SSH client
$ eval "$(ssh-agent -s)"
#add private key to client
$ ssh-add ~/.ssh/id_rsa
#view and check the public key - the name may change based on your prior input
$ cat ~/.ssh/id_rsa.pub
I use the All of Us workbench in my example, and there’s a chance you might also try these tools there (or in a similar secured workspace). I will save you a few hours and a massive headache by mentioning that SSH does not automatically load the correct identity in the AoU workspace terminal. You may need the following commands to start SSH and add your key so that it knows which key to use with GitHub:
$ eval "$(ssh-agent -s)"
$ ssh-add ./key_pathway/ssh_key
$ ssh-add -l
In fact, this seems to be needed often in AoU - so I wrote a script called ‘initialize.txt’ that runs these commands and saved it to my persistent disk. You can write any Bash commands in a text file (without the ‘$’ that I use to denote Bash terminal commands) and then run that file to execute all of them at once. In essence, I run $ source initialize.txt, which executes all the security initialization steps through that one .txt file.
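For reference, my initialize.txt is essentially just those same commands collected in one file (a sketch - your key name and pathway will differ):
#initialize.txt - run with: source initialize.txt
eval "$(ssh-agent -s)"
ssh-add ./key_pathway/ssh_key
ssh-add -l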
There are also ‘aliases’ which you can use to do something similar, but that’s a different conversation for a later date.
At this point, you can connect your local repository to GitHub. If you haven’t already created an account and set up a repository there, now is the time.
Once you have a repository on GitHub.com, you can add the public SSH key to your account so that GitHub can connect securely. The keys page is found in account > settings > SSH and GPG keys. There will be a blue button at the top that says “New SSH Key” - click on that and you will be able to copy and paste your public key into GitHub (use $ cat keyname.pub to view it). GitHub even tells you how to connect them.
Once you have that correctly set up, the following commands will be helpful.
#this tests your connection to your github account:
$ ssh -T git@github.com
#connect your local repository to the GitHub repository (the address is probably displayed on GitHub)
$ git remote add origin git@github.com:your_username/repository_name.git
#you can verify the connection between your local and github repository with:
$ git remote -v
Markdown
Markdown uses pound signs (#) to mark text as a heading, where one (#) is the title/top-level heading, two (##) is the next level down, three (###) the level below that, and so on. In Quarto code blocks, a pound sign is unfortunately also used for commenting out text - so when you see two comments back to back in the following block, read that as one comment followed by Markdown syntax - e.g., “#create title” is my comment explaining what I’m doing, and “# title of document/README” is what I am literally typing into my document in Vim.
#create readme file in bash
$ touch README.md
#edit it in Vim: - note that if you only type README in this command, it creates a new file without an extension called "README" - this is an alternative way/hack to create a file
vim README.md
#then:
i #no need for ENTER
#create title
# title of document/README
#create subheading
## (or ###) Purpose
"enter in a description of purpose here"
#add as much other information as is necessary. Usually an explanation of inputs, outputs, what you are doing, and your goals is suggested - imagine sharing your repository with a colleague who wants to know what this is all about.
#below are some helpful inputs for markdown:
*italicized text* #this italicizes words
**bold text** #this bolds words
***bold & italicized text*** #this bolds and italicizes text
#the following is a bulleted list:
- Item 1
- Item 2
- Item 3
#here's a numbered list:
1. First item
2. Second item
3. Third item
#here is a link insertion: "Link text" can be any name or text you want the user to click on
[Link text](https://example.com)
#here is a table:
| name of thing | variable 1| variable 2 |
|----------|----------|----------|
| thing one | value 1 | value 2 |
| thing two | value 1 | value 2 |
#here is how you can enter in a code section:
`short code insertion` #inserts a little code inline
```r #inserts a code block
var3 <- var2*mean(var1)
```
#and finally, this makes a horizontal line
----
General Use of Git
The usual way you will interface with Git/Github is to track file changes, commit those changes, and then push or pull the changes to or from the main repository.
First, you must use add on each file whose changes you want to include, every time you change it. If you do not, the changes will not be staged for your commit. This prevents files from being unintentionally committed to the repository.
# add file to commit:
$ git add /file/name
#add all files you have changed:
$ git add -u
#start over with adding (maybe you accidentally added something you didn't want)
$ git reset
If you aren’t sure if you added your file, or if there were changes relative to the main repository, you can check the status.
#this will show you what files are staged (in green) and which are not (in red)
$ git status
Commit
Next, you create a commit - it’s just a packaging of your changes that you want to ship to the main repository. It’s what you will actually add to the code you are working on.
This is done with the following code, where “commit message” is a brief explanation of what you did, to keep track of the changes you made. The message is required - Git will insist on one.
#create a commit:
$ git commit -m "commit message"
Push & Pull
If you have your changes staged for sharing (committed), you can push them to the main repository, and the lines you changed will replace the corresponding lines in the previous version of the code. If changes also occurred in the main repository that you want on your local machine, you can pull them from GitHub.
#push your committed changes:
$ git push
#note: the very first push of a new branch may need the upstream set:
$ git push -u origin main
#pull changes from the repo to your local files:
$ git pull
#isn't this easy?
Restore and Reset
Occasionally, some things might get criss-crossed, one branch may get ahead of the others and accidentally separate off, or there may be some conflict that Git can’t seem to resolve. You can discard changes you made to your code using restore, and if you already staged those changes, you can resort to the more drastic hard reset, which sets your files back to the way they were at the previous commit.
# restore a file to the previous committed version:
$ git restore file_name
# hard reset: wipes both unstaged and staged changes
$ git reset --hard
For the most part, that’s really it when it comes to Git and GitHub. There are of course other things you can do, and there may be occasional hiccups, but this should provide a quick start - and you can hopefully fill in the gaps with the wealth of vague knowledge on the internet.
III - Workflow Management
There are numerous reasons why you might want to set up multiple R scripts and run them all together - either in parallel or in sequence. There are also numerous workflow tools that facilitate parallelizing processes and managing scripts, with the ability to pass output from one script to the next and manage shared files and resources. The two I have found most useful are Nextflow and R itself (sometimes with a little help from the ‘targets’ R package).
Using R is much more straightforward than Nextflow, and I find that Nextflow can overcomplicate things quickly. So here I will focus on basic forms of management - R on its own and with the help of the targets package - and introduce a basic example of Nextflow as well. If you don’t need to pass output from one script as input to another, I wouldn’t even consider Nextflow. Nextflow also requires you to understand Bash commands, so if you aren’t keen on those, you probably won’t be keen on Nextflow.
If you are a SAS user, you will probably need to use SAS Enterprise, which is a different topic and is also much more well documented by SAS and SAS users groups, so I am going to consider that outside the scope of this post.
R Master Scripts
The most basic way to manage analyses is with a master script, which you can write in R just as you would your analysis. You can simply use the R functions source() or system() to run other R scripts, and if your scripts already manage input/output and packages well enough, that might take care of most of your tasks. You can also set up specific directories and libraries for managing a workflow.
In this example, there are hypothetically the folders data, results, and scripts, which contain exactly what their names imply. If you are exploring a process or the outcome of your script does not matter, you can use .rds files as intermediate steps to run tests faster; but if you are running any analysis of consequence, I suggest using .csv files so that you can examine the intermediate steps more easily. I also add a progress bar - not necessary, but I like it.
The most important part of the master script example below is the dynamic code in the run_if_missing function, which only runs a step if its output file does not exist. This master script will never re-run the process that produces a file already in your results folder, and will never overwrite your results. If you want to replace results, you must delete them from your results folder. This prevents mistakes and conserves resources, keeping steps modular and ensuring only the scripts that are necessary get run.
# Set up progress bar
steps <- 4
pb <- txtProgressBar(min = 0, max = steps, style = 3)
step_num <- 0

# Helper to run a step only if needed
run_if_missing <- function(output_file, cmd) {
  if (!file.exists(output_file)) {
    message("Running: ", cmd)
    exit_code <- system(cmd)
    if (exit_code == 0) {
      message("✅ Step complete: ", output_file)
    } else {
      message("❌ Step failed (exit code ", exit_code, "): ", output_file)
    }
  } else {
    message("⏩ Skipping step (already exists): ", output_file)
  }
  # Update progress bar
  step_num <<- step_num + 1
  setTxtProgressBar(pb, step_num)
}

# Define all paths
raw_csv <- "data/raw_data.csv"
loaded_csv <- "results/loaded_data.csv"
cleaned_csv <- "results/cleaned_data.csv"
model_summary_csv <- "results/model_summary.csv"
plot_png <- "results/plot.png"

# Step 1: Load data
run_if_missing(
  loaded_csv,
  paste("Rscript scripts/01_load_data.R", raw_csv, loaded_csv)
)

# Step 2: Clean data
run_if_missing(
  cleaned_csv,
  paste("Rscript scripts/02_clean_data.R", loaded_csv, cleaned_csv)
)

# Step 3: Fit model
run_if_missing(
  model_summary_csv,
  paste("Rscript scripts/03_fit_model.R", cleaned_csv, model_summary_csv)
)

# Step 4: Plot results
run_if_missing(
  plot_png,
  paste("Rscript scripts/04_plot_results.R", cleaned_csv, plot_png)
)

# Done
close(pb)
message("All steps complete.")
How Targets in R works
If you want a dedicated package to handle this sort of workflow in R with more features and ability to plan, the {targets} package is probably what you want. This does something similar, but also provides dependency graphs, automatically handles staging and re-running so you don’t need helper functions, parallelizes tasks, etc.
Targets does change your work slightly - you will have to write your scripts as functions and follow the format that targets expects. However, you will very likely have more ability to plan workflows once you become used to the different formatting.
The master script example above could be rewritten as the following _targets.R file:
# _targets.R
library(targets)

# Optional but helpful for CSV I/O
tar_option_set(
  packages = "readr" # add others as needed
)

# Define the pipeline
list(
  tar_target(
    raw_data,
    "data/raw_data.csv",
    format = "file" # treat as file dependency
  ),
  tar_target(
    cleaned_data_file,
    {
      source("scripts/01_clean_data.R") # should define `clean_data()`
      write_csv(clean_data(read_csv(raw_data)), "results/cleaned_data.csv")
      "results/cleaned_data.csv"
    },
    format = "file"
  ),
  tar_target(
    model_summary_file,
    {
      source("scripts/02_fit_model.R") # should define `fit_model()`
      data <- read_csv(cleaned_data_file)
      write_csv(fit_model(data), "results/model_summary.csv")
      "results/model_summary.csv"
    },
    format = "file"
  ),
  tar_target(
    plots_file,
    {
      source("scripts/03_plot_results.R") # should define `make_plots()`
      data <- read_csv(model_summary_file)
      make_plots(data, output_file = "results/plots.png")
      "results/plots.png"
    },
    format = "file"
  )
)
However, note that the scripts are no longer really scripts - they now define functions, and are called as functions. If writing everything as a function is a bit confusing, there is always the simple R master script option.
What you gain from having all your scripts be functions is the ability to check your flows, re-run anything by calling it on any dataset without re-writing your script, and automatic staging and modularity. It does most of the things you would want in a master script, without having to write all the master-script-specific helper functions.
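As a quick sketch of day-to-day use (the target name passed to tar_read() is the one defined above):
#run the pipeline - only outdated targets are rebuilt
tar_make()
#visualize the dependency graph (requires the visNetwork package)
tar_visnetwork()
#read a finished target back into your R session
tar_read(model_summary_file)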
For more information, visit the targets user manual. It is short, but helpful.
How Nextflow works (environment setup)
Workflow
In Nextflow, things can be pretty complex, but a basic workflow can be accomplished with the following test.nf file.
// define a process - this publishes its output to a directory I created called 'output'
process hello {
    publishDir './output', mode: 'copy'

    output:
    path 'hello.txt'

    script:
    """
    echo 'Hello world!' > hello.txt
    """
}

// define another process - this is also sent to 'output'
process reply {
    publishDir './output', mode: 'copy'

    output:
    path 'reply.txt'

    script:
    """
    echo 'Goodnight moon!' > reply.txt
    """
}

// define a workflow that runs our processes
workflow {
    hello()
    reply()
}
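Assuming Nextflow is installed (covered in the All of Us example below), you can then run the file from Bash:
#run the workflow file
$ nextflow run test.nf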
Nextflow Work Directories (output and logs)
When you run a Nextflow process, it creates a work directory where it places logs and output. You can of course control the output directory, examine logs, change the config, etc.
#clean up after yourself: removes the created work directory when you're done
$ nextflow clean -f
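If you want to see what ran before you clean up, the following may help (a quick sketch):
#list previous runs with their names and status
$ nextflow log
#peek at the hashed process subdirectories holding logs and outputs
$ ls work/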
Working with directories outside of the work directory can be a bit difficult - especially if you are trying to follow the official Nextflow documentation at time of writing. Apparently they are updating the syntax, and some information is not up-to-date.
Currently, the way to read files into Nextflow processes is to use a channel; the file() method is deprecated. You cannot use the ‘~’ (tilde) shortcut either - Nextflow will not recognize your file if you include anything other than the exact, full file path in your path channel.
Pay close attention to the use of the channel in the All of Us example below, and I advise that you do not follow examples elsewhere on the web or from an LLM such as ChatGPT at this time. Even some examples on the official website are misleading, at time of writing.
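As a minimal sketch of the channel pattern (the path and process name here are hypothetical placeholders - note the exact, full path):
//create a channel from a full file path and hand it to a process defined elsewhere
workflow {
    data_ch = Channel.fromPath('/home/jupyter/workspaces/myproject/data/raw_data.csv')
    clean_data(data_ch)
}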
Using R in Nextflow (on a cluster)
In practice, R scripts require extra steps in Nextflow, especially when working on a secure cluster or in another environment you do not directly control. There are issues with both package usage and script execution that should be mentioned - mainly the following two.
1. Packages are problematic - loading packages the normal way can cause issues, particularly on a platform like All of Us, where scripts need to install packages in order to run. Having each script install the same package creates ‘race conditions,’ in which scripts try to write to and access the same locked files while installing.
2. R scripts are not executable as-is, which matters for processes you want to run repeatedly in workflows. An R script may need to be made executable with a shebang so that it can be fed into commands without issue - i.e., you may not want to use the ‘Rscript’ Bash command, and instead call the file in a single command to fit Nextflow syntax.
The solution is not exactly straightforward for issue 1. If it is a script that you are running multiple instances of at the same time (as Nextflow normally does), you need to have the packages already installed in your environment and accessible via a bin. If you just need one script’s packages, you can use a Conda environment.yml file or a conda directive in your process (with appropriate nextflow.config adjustments - a sketch follows below), and the scripts can usually just include the R function ‘install.packages()’ as well. However, a bin may be required if you are using All of Us or a similar platform, as I do in my example - I install my packages there as a ‘local’ library and then call them in my R scripts.
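For example, a conda directive on a single process might look like this (a sketch - the package list is a placeholder, and conda support must be enabled in your nextflow.config):
//give one process its own conda environment with R and packages
process fit_model {
    conda 'r-base=4.3 r-dplyr'

    script:
    """
    fit_model.R
    """
}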
The solution for issue 2 is more straightforward, and simply requires a shebang. A shebang tells your computer how to execute a file - essentially marking your R script as an R script so that you don’t have to type ‘Rscript’ in your Bash command.
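Concretely, the shebang goes on the very first line of the R script, and then you mark the file as executable in Bash (my_script.R is a placeholder name):
#the first line of my_script.R becomes the shebang:
#!/usr/bin/env Rscript

#then, in Bash, mark the file as executable:
$ chmod +x my_script.R
#now the script can be called directly as 'my_script.R' instead of 'Rscript my_script.R'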
Scheduling Locally
Example in All of Us
The All of Us Workbench is one possible application at this time. It is the platform I have used for research with Nextflow, so I will draw an example from that research. If the platform continues to exist in the future, this will hopefully help others conduct research there.
Nextflow in AoU uses Google Cloud (the Life Sciences API)
I will use the terminal available in the workbench - the Cloud Analysis Terminal. The tutorial in the workbench already provides info on using Python/Jupyter to run and work with Nextflow; I just want to use Bash and not complicate the process further when I work.
First, we need to set up a few things to run a process with Nextflow:
#put Nextflow in your workspace:
$ curl -s https://get.nextflow.io | bash

#initialize and see help
$ nextflow help

#config files - use google utilities to navigate your actual buckets (python and rstudio buckets get separated out and can't play nice together, apparently)
#examine buckets
$ gsutil ls
#navigate to your intended bucket
$ gsutil ls gs://your-bucket-name/

#write your nextflow.config file
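For reference, a minimal nextflow.config for the Life Sciences executor might look like the following (a sketch - the project ID, region, and bucket name are placeholders you must replace):
//nextflow.config - Google Life Sciences executor sketch
process.executor = 'google-lifesciences'
google.project = 'your-gcp-project-id'
google.region = 'us-central1'
workDir = 'gs://your-bucket-name/nextflow-work'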
Required Dynamic Coding in R Scripts
#The following allows data to be passed into R through Bash
args <- commandArgs(trailingOnly = TRUE)
datasetFile <- args[1]

# Use the file path in your read.csv()
cleaning_step_4 <- read.csv(datasetFile)
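Putting the pieces together, a process might feed a CSV into that script like this (a hypothetical sketch - it assumes clean_data.R carries a shebang and sits in the project's bin/ folder, which Nextflow automatically adds to the PATH):
//hypothetical process: pass a CSV into an executable R script
process clean_data {
    publishDir './output', mode: 'copy'

    input:
    path dataset

    output:
    path 'cleaned_data.csv'

    script:
    """
    clean_data.R ${dataset}
    """
}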
Currently, I haven’t found a way to load R packages simultaneously across multiple scripts in Nextflow. Nextflow attempts to run scripts at the same time, in parallel, in a separate work subdirectory for each process. This means the R scripts attempt to install packages at the same time for different directories, which is not possible because the package library is locked during installation - this causes conflicts and breaks the process.
Conclusion
This is all I felt I should include for now - but I may come back and edit this in the future if I find something useful and apropos. These notes should help with a jump start into some of these skills; I found that with use, these tools became much smoother to work with.