Building Basic Workflows
Goal
The goal of this guide is to take you through the process of structuring and populating a very basic workflow. Work your way through this recipe step by step, and you will end up with a workflow that takes raw read data as input, creates quality assurance reports alongside trimmed reads, and generates de-novo assemblies from isolate read data.
The recipe assumes some prior knowledge of executing bash commands, writing very simple scripts, passing positional arguments as script input, and using virtual Conda environments to install and run bioinformatic tools. By the end, you will have the hang of how to start your very own basic bioinformatic workflow.
Day 1
The first steps
The natural first step of building workflows is to install a virtual environment with a piece of bioinformatic software and test commands for executing that software.
Part 0 - Setup
- Prepare folders for containing experiment data
  ```shell
  mkdir -p ~/MyExperiment/Dataset
  cd ~/MyExperiment/Dataset
  wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/040/ERR14229040/ERR14229040_1.fastq.gz
  wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/040/ERR14229040/ERR14229040_2.fastq.gz
  wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/089/ERR14229089/ERR14229089_1.fastq.gz
  wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/089/ERR14229089/ERR14229089_2.fastq.gz
  ```
- Prepare folders for our workflow files
  ```shell
  mkdir ~/MyFirstWorkflow
  cd ~/MyFirstWorkflow
  ```
From here on out, we assume that you are always at `~/MyFirstWorkflow` when running commands and when creating files.
Part 1 - Running a Bioinformatic tool
As a very first step, let's try to use bioinformatic software for assessing raw read quality; FastQC is a great tool for this purpose.
Conda installation of bioinformatic tool
FastQC is very easy to install with micromamba, especially when using environment files. We start by making an environment file for installing FastQC. Since a few bonus assignments at the end of this document rely on other software, let's include a few extra bioinformatic tools in the same environment.
Exercise 1
- Make `QC.yaml` by running `nano QC.yaml`

  ```yaml
  name: QC
  channels:
    - conda-forge
    - bioconda
  dependencies:
    - fastqc
    - quast
    - multiqc
  ```
- Install the environment from the `QC.yaml` file with `micromamba create`

Solution

```shell
micromamba create -f QC.yaml
```
Execute bioinformatic command
- Run FastQC on any given sample
  - First, make a folder for the FastQC output: `~/MyExperiment/Results/FastQC/`
  - Next, run `fastqc` on both read mates of the selected sample; make sure to divert output to the `~/MyExperiment/Results/FastQC/` folder.

    ```shell
    mkdir -p ~/MyExperiment/Results/FastQC
    micromamba run -n QC fastqc ~/MyExperiment/Dataset/ERR14229040_1.fastq.gz ~/MyExperiment/Dataset/ERR14229040_2.fastq.gz -o ~/MyExperiment/Results/FastQC
    ```
  - Inspect the output folder
    - What are the names of the output files we could be interested in? (Are they unique or generic file names?)
Part 2 - Converting command into scripts
Scripts are basically simple text files whose contents, when executed, are interpreted as bash commands. Their advantage is that the commands are stored and can therefore be reused in the future with minimal testing.
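As a quick, self-contained sketch of that idea (the file name `hello.sh` is just an example, not part of the workflow), this is all a script is:

```shell
# Write a command into a text file - this is the whole "script"
cat > hello.sh << 'EOF'
#!/bin/bash
echo "Hello from a script"
EOF

# Re-running it later repeats the stored command exactly
bash hello.sh
# prints: Hello from a script
```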
Basic script
Exercise 2
- Convert the bioinformatic command into a script
  - Open an empty file called `FastQC.sh`
  - Add `#!/bin/bash` as the very first line; this is the shebang and tells Unix what language is used for execution
  - Copy both the `mkdir` command and the `fastqc` command into `FastQC.sh`
  - Create the three variables `read1`, `read2`, & `output_dir` after the shebang but before the commands
  - Designate file paths for the reads and output folder into their respective variables
  - Replace the file paths with the corresponding variables in the `fastqc` command
  - Save the changes and test whether it works by executing the script with bash
Solution
```shell
#!/bin/bash

# Input
read1=~/MyExperiment/Dataset/ERR14229040_1.fastq.gz
read2=~/MyExperiment/Dataset/ERR14229040_2.fastq.gz

# Output
output_dir=~/MyExperiment/Results/FastQC

mkdir -p $output_dir

micromamba run -n QC fastqc $read1 $read2 -o $output_dir
```
Dynamic script
Currently, our script can run FastQC on only one specific sample. To make the script more useful, we should replace the hard-coded paths with positional arguments.
Introducing positional arguments makes the script a bit harder to execute from bash, since the user must specify input and output; on the other hand, the script can now run on other samples and write to other output folders without further modification.
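To illustrate the mechanism with a minimal sketch (the script name `show_args.sh` and the argument values are hypothetical), bash fills `$1`, `$2`, `$3`, ... with whatever the user types after the script name:

```shell
cat > show_args.sh << 'EOF'
#!/bin/bash
# $1 and $2 hold the first and second argument given on the command line
first=$1
second=$2
echo "first=$first second=$second"
EOF

bash show_args.sh reads_1.fastq.gz Results
# prints: first=reads_1.fastq.gz second=Results
```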
Exercise 3
- Introduce positional arguments to the `FastQC.sh` script
  - Reopen the file and replace the `read1`, `read2`, & `output_dir` hard-coded values with the positional arguments `$1`, `$2`, & `$3`
  - Make an `outdir` variable which points to `$output_dir/FastQC`
  - Save the changes and re-execute the script with bash. Remember to provide the input files and output directory in the bash command.
Solution
```shell
bash FastQC.sh ~/MyExperiment/Dataset/ERR14229040_1.fastq.gz ~/MyExperiment/Dataset/ERR14229040_2.fastq.gz ~/MyExperiment/Results
```

```shell
#!/bin/bash

# Input
read1=$1
read2=$2

# Output
output_dir=$3

# Variables
outdir=$output_dir/FastQC

mkdir -p $outdir

micromamba run -n QC fastqc $read1 $read2 -o $outdir
```
Part 3 - Make another script
Now that we have a handler for creating quality reports for raw reads, let's create another script for generating assemblies. We will use Shovill, as it contains handlers for polishing and trimming reads before assembly, while it also polishes the genome after assembly.
Installation
Exercise 4
- Generate a yaml file for Shovill called `Assembly.yaml`

  ```yaml
  name: Assembly
  channels:
    - conda-forge
    - bioconda
  dependencies:
    - shovill
  ```
- Install the environment with `micromamba create`

Solution

```shell
micromamba create -f Assembly.yaml
```
Execution
Exercise 5
Shovill is a bit harder to use, so we need to inspect its help page.
- Execute `shovill` on any given sample (preferably the smallest sample)
  - Inspect the help page; note down how to point to Read1, Read2, and the output directory, and how to enable read adapter trimming

    ```shell
    micromamba run -n Assembly shovill --help
    ```

  - Execute `shovill` and make sure to direct output to a `~/MyExperiment/Results/Shovill/` folder; make sure to add the `--trim` option
  - Inspect the output folder.
    - What is the name of the output assembly we need? (Is it a unique or generic file name?)
Solution
```shell
micromamba run -n Assembly shovill --R1 ~/MyExperiment/Dataset/ERR14229089_1.fastq.gz --R2 ~/MyExperiment/Dataset/ERR14229089_2.fastq.gz --outdir ~/MyExperiment/Results/Shovill/ --trim
```
Convert to dynamic script
Exercise 6
We have learned that Shovill uses generic file names, which means that if we run the same script for multiple samples, the outputs would overwrite each other. Therefore we have to ensure that the output is written to a unique folder. One straightforward approach is to include the sample name as part of the output folder path.
```shell
# Assuming read1 is defined and its file name ends with the _1.fastq.gz suffix,
# the sample name can be extracted with the following code:
sample=$(basename $read1 _1.fastq.gz)
```
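To see what `basename` actually does here (the path below is a file from our dataset; `basename` works on the name alone, so the file does not even need to exist): the directory part is stripped, and an optional second argument removes a trailing suffix:

```shell
# Full path -> file name only
basename ~/MyExperiment/Dataset/ERR14229040_1.fastq.gz
# prints: ERR14229040_1.fastq.gz

# Full path + suffix -> bare sample name
basename ~/MyExperiment/Dataset/ERR14229040_1.fastq.gz _1.fastq.gz
# prints: ERR14229040
```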
- Create a script called `Shovill.sh` which executes `shovill`
  - Copy the `shovill` command into a new script file called `Shovill.sh`
  - Create variables for `read1`, `read2`, & `output_dir`, designated as positional arguments
  - Define a new `sample` variable, using the code chunk specified above.
  - Define an `outdir` variable, pointing to `$output_dir/Shovill/$sample`
  - Replace the hard-coded paths with the `read1`, `read2`, & `outdir` variables
  - Save and test the script; remember to specify input and output.
Solution
```shell
#!/bin/bash

# Input
read1=$1
read2=$2

# Output
output_dir=$3

# Variables
sample=$(basename $read1 _1.fastq.gz)
outdir=$output_dir/Shovill/$sample

micromamba run -n Assembly shovill --R1 $read1 --R2 $read2 --outdir $outdir
```
Day 2
Part 4 - Building a basic workflow
Exercise 7
By now, our folder should contain the following files:
```
~/MyFirstWorkflow
├── FastQC.sh
├── QC.yaml
├── Shovill.sh
└── Assembly.yaml
```

Now that we have a few components, we can start to structure a basic workflow. Let's start by organising the existing files.
- Make two folders inside `~/MyFirstWorkflow`, one called `Envs/`, the other called `Modules/`
- Move all `.sh` files into the `Modules/` folder and move all `.yaml` files into the `Envs/` folder
By now your folder structure should look something like this:
```
~/MyFirstWorkflow
├── Modules
│   ├── FastQC.sh
│   └── Shovill.sh
└── Envs
    ├── QC.yaml
    └── Assembly.yaml
```

Exercise 8
Let's try to create a single script for executing the modules, so that we can easily add additional steps in the future.
- Create a single script for executing all workflow commands
  - Add an empty file called `workflow.sh` inside the `~/MyFirstWorkflow/` folder.
  - Add `read1`, `read2`, and `outdir` variables for handling the positional arguments `$1`, `$2`, & `$3`
  - Copy the following line into the script: `bash Modules/FastQC.sh`
  - Provide the `read1`, `read2`, and `outdir` variables as input for the bash call
  - Save and execute the `workflow.sh` script; remember to provide positional arguments
  - Add the assembly module by pasting `bash Modules/Shovill.sh` with the respective positional arguments at the end of the `workflow.sh` file.
Solution
```shell
#!/bin/bash

read1=$1
read2=$2
outdir=$3

bash Modules/FastQC.sh $read1 $read2 $outdir
bash Modules/Shovill.sh $read1 $read2 $outdir
```
By now your folder structure should look like this
```
~/MyFirstWorkflow
├── workflow.sh
├── Modules
│   ├── FastQC.sh
│   └── Shovill.sh
└── Envs
    ├── QC.yaml
    └── Assembly.yaml
```

Part 5 - Expanding your workflow
With our basic framework up and running, let's try to attach another module to ensure that we can expand on our workflow.
Let's try to attach a typing tool: ResFinder, which screens samples for resistance genes. Since it requires a database for setup, this will be a bit more complex to build. But first, let's start with the environment:
Installation
Exercise 9
- Install ResFinder in an environment called `CGEfinders`
  - Create an environment file called `CGEfinders.yaml` inside the `Envs/` folder
  - Copy-paste the contents

    ```yaml
    name: CGEfinders
    channels:
      - conda-forge
      - bioconda
    dependencies:
      - resfinder
      - plasmidfinder
    ```

  - Create the environment
  - Test that it works by executing `run_resfinder.py --help` from the `CGEfinders` environment
Solution
```shell
micromamba create -f CGEfinders.yaml
micromamba run -n CGEfinders run_resfinder.py --help
```
Setup database
Exercise 10
The hardest part here is installing and setting up the database. While this could be explained step by step, we'll save time and just copy-paste a script for the purpose:
- Make a script for setting up the ResFinder DB
  - Make a new script called `install_databases.sh` in the `Modules/` folder
  - Copy-paste the following code into the file

    ```shell
    #!/bin/bash

    # Define default value
    db_dir="databases"

    # Check if user provided input
    if [[ $1 ]]; then
        db_dir=$1
    fi

    # Define current location
    here=$(pwd)

    # Create database folder and navigate
    mkdir -p $db_dir
    cd $db_dir

    # Check and set up ResFinder database
    if [ -d "resfinder_db" ] && [ "$(ls -A resfinder_db)" ]; then
        echo "ResFinder database detected."
    else
        echo "Setting up ResFinder database..."
        git clone https://bitbucket.org/genomicepidemiology/resfinder_db/
        cd resfinder_db
        # Execute resfinder-db install using python modules from its own environment
        micromamba run -n CGEfinders python3 INSTALL.py
        # Return to previous location
        cd $here
    fi

    echo "Done!"
    ```
- Save and execute
Execution
Exercise 11
- Execute ResFinder and check whether it works
  - Execute `run_resfinder.py --help`
  - Figure out which options are needed for determining
    - fastq input files
    - output directory
    - resistance database folder
    - resistance database kma folder (called `db_res_kma`)
    - the option for searching for acquired resistance genes
  - Inspect the output.
- Create a script called `ResFinder.sh` which executes `run_resfinder.py`
  - Open an empty file called `ResFinder.sh` inside the `Modules/` folder.
  - Make variables for handling positional arguments; call them `read1`, `read2`, `output_dir`, & `db_dir`
  - Copy-paste the `sample` variable code chunk from the Assembly exercise (ResFinder would otherwise overwrite its own output)
  - Make a variable called `outdir` which points to `$output_dir/ResFinder/$sample`
  - Write a `run_resfinder.py` command with the required options, using `micromamba run`
Solution
```shell
#!/bin/bash

read1=$1
read2=$2
output_dir=$3
database_dir=$4

# Variables
sample=$(basename $read1 _1.fastq.gz)
outdir=$output_dir/ResFinder/$sample
db_dir=$database_dir/resfinder_db

micromamba run -n CGEfinders run_resfinder.py -ifq $read1 $read2 -o $outdir -db_res $db_dir -db_res_kma $db_dir --acquired
```
Exercise 12
- Attach the latest module to the workflow
  - Under the section with positional arguments in `workflow.sh`, add the following: `db_dir=$4`
  - Execute the `install_databases.sh` module just after the Shovill module and add the `$db_dir` variable to the call: `bash Modules/install_databases.sh $db_dir`
  - Execute the `ResFinder.sh` module after the `install_databases.sh` module; remember to add all variables
  - Save and execute the script
Solution
```shell
#!/bin/bash

read1=$1
read2=$2
outdir=$3
db_dir=$4

bash Modules/FastQC.sh $read1 $read2 $outdir
bash Modules/Shovill.sh $read1 $read2 $outdir
bash Modules/install_databases.sh $db_dir
bash Modules/ResFinder.sh $read1 $read2 $outdir $db_dir
```
Part 6 - Running the workflow on entire directories
With everything up and running so far, it's time to automate the workflow further. Currently, samples need to be specified one by one; a helpful enhancement would be to point to an input directory instead and have the workflow screen that directory for input files. This of course means that we have to update the description, but it is worth the hassle.
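One building block the directory loop in this part relies on is bash suffix removal, `${variable%pattern}`, which strips the shortest matching suffix from the variable's value. This is how a mate-2 file name can be derived from a mate-1 file name (the sample name below is one from our dataset):

```shell
read1=ERR14229040_1.fastq.gz

# Strip the mate-1 suffix, then append the mate-2 suffix
read2=${read1%_1.fastq.gz}_2.fastq.gz
echo $read2
# prints: ERR14229040_2.fastq.gz
```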
Exercise 13
- Point to an input directory rather than two read files as input
  - In the `workflow.sh` file, replace the `read1=$1` and `read2=$2` lines with `read_dir=$1`
  - Fix `outdir` to point to the second positional argument `$2` (was the third, `$3`)
  - Fix `db_dir` to point to the third positional argument `$3` (was the fourth, `$4`)
  - Add the following code chunk just before the first module is called

    ```shell
    read1_files=$(find $read_dir -maxdepth 1 -name "*_1.fastq.gz" | sort)
    for read1 in $read1_files; do
      # Determine read mate 2
      read2=${read1%_1.fastq.gz}_2.fastq.gz
      sample=$(basename $read1 _1.fastq.gz)
    ```

  - Add two spaces of indentation before every module call (as such: `  bash Modules/...`)
  - Enter `done` as the final line of the workflow; the skeleton looks like this:

    ```shell
    read1_files=$(find $read_dir -maxdepth 1 -type f -name "*_1.fastq.gz" | sort)
    for read1 in $read1_files; do
      # Determine read mate 2
      read2=${read1%_1.fastq.gz}_2.fastq.gz
      sample=$(basename $read1 _1.fastq.gz)

      bash Modules/moduleA.sh $read1 $read2 $outdir
      bash Modules/moduleB.sh $read1 $read2 $outdir
      ...
    done
    ```

Solution
```shell
#!/bin/bash

read_dir=$1
outdir=$2
db_dir=$3

read1_files=$(find $read_dir -maxdepth 1 -type f -name "*_1.fastq.gz" | sort)

for read1 in $read1_files; do
  # Determine read mate 2 and sample name
  read2=${read1%_1.fastq.gz}_2.fastq.gz
  sample=$(basename $read1 _1.fastq.gz)

  # Running FastQC
  bash Modules/FastQC.sh $read1 $read2 $outdir

  # Running Shovill
  bash Modules/Shovill.sh $read1 $read2 $outdir

  # Installing databases
  bash Modules/install_databases.sh $db_dir

  # Running ResFinder
  bash Modules/ResFinder.sh $read1 $read2 $outdir $db_dir
done
```
Part 7 - Making the workflow more user friendly
Currently, each module takes unnamed positional arguments for determining input and output. To make the workflow more useful for others, it would be a great idea to define named arguments and add a usage description.
Exercise 14
Let's start with the description
- Add the following to the `workflow.sh` script just after the `#!/bin/bash` line

  ```shell
  # Define usage function
  usage() {
    echo "Usage:"
    echo "  $(basename "$0") [-r|--read_dir Raw read directory] [-o|--outdir Output directory] [-d|--db_dir Database directory]" >&2
    exit 1
  }
  ```

  This is a function which prints out details on how we wish others to use our workflow.
- Just beneath the usage function, add the following code

  ```shell
  # Parse options
  while [[ $# -gt 0 ]]; do
    # Making named positional arguments
    case "$1" in
      -r|--read_dir)
        read_dir="$2"
        shift
        ;;
      -o|--outdir)
        outdir="$2"
        shift
        ;;
      -d|--db_dir)
        db_dir="$2"
        shift
        ;;
      *)
        usage
        ;;
    esac
    shift
  done
  ```

  This will look through all user input and check whether there are any instances of `-r`/`--read_dir`, `-o`/`--outdir`, or `-d`/`--db_dir`. Every time one of these is identified in the user input, the very next input is used as the value of the corresponding argument.
- Finally, we have to add a section that STOPS the script if any of the above arguments is missing; add the following code:

  ```shell
  # Check if required arguments are provided
  if [[ -z $read_dir || -z $outdir || -z $db_dir ]]; then
    usage
  fi
  ```

  This checks whether the `$read_dir`, `$outdir`, and `$db_dir` variables were defined; if any is missing, the code calls the `usage` function, which prints the usage and exits.
- Execute the workflow to ensure that it actually works.
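To see the whole argument-handling pattern in action without running any tools, here is a self-contained stub (the file name `stub_workflow.sh` and the final `echo` line are illustrative; your real `workflow.sh` runs the modules instead):

```shell
cat > stub_workflow.sh << 'EOF'
#!/bin/bash
usage() {
  echo "Usage: $(basename "$0") -r READ_DIR -o OUTDIR -d DB_DIR" >&2
  exit 1
}

# Same case/shift pattern as in workflow.sh
while [[ $# -gt 0 ]]; do
  case "$1" in
    -r|--read_dir) read_dir="$2"; shift ;;
    -o|--outdir)   outdir="$2";   shift ;;
    -d|--db_dir)   db_dir="$2";   shift ;;
    *) usage ;;
  esac
  shift
done

# Stop if any required argument is missing
if [[ -z $read_dir || -z $outdir || -z $db_dir ]]; then
  usage
fi

echo "read_dir=$read_dir outdir=$outdir db_dir=$db_dir"
EOF

bash stub_workflow.sh -r Dataset -o Results -d databases
# prints: read_dir=Dataset outdir=Results db_dir=databases
```

Calling the stub with a missing or unknown option prints the usage message and exits with status 1, which is exactly the behaviour we want from `workflow.sh`.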
Bonus Exercises
Part 8 - Improving quality assurance
We have generated de-novo assemblies, yet we do not generate quality assessment reports for these assemblies. Additionally, with FastQC, reports are generated on a per-read-file basis; while valuable for inspecting and interpreting the quality of the sequencing data, it's far from optimal to skim through every report one by one. This bonus section aims at providing you with inspiration for setting up additional tools that generate quality reports for assemblies, as well as for aggregating all quality reports into a single report.
Exercise A
Quast is a tool for assessing the quality of a given de-novo assembly; like FastQC, it works by simply pointing to the assembly file itself and to an output folder. Luckily, the `QC.yaml` file already contains Quast, so no further environment setup is required.
- Implement Quast into the existing workflow
  - Inspect the Usage: section of the help page of `quast.py`, using micromamba to execute the command
  - Define the assembly file as a variable within the `workflow.sh` file
    - Inspect the output of any given assembly (e.g. `ls ~/MyExperiment/Results/Shovill/ERR14229040`)
    - Open `workflow.sh` and navigate to the section where the `read1`, `read2` & `outdir` variables are defined.
    - Add the following line after the positional arguments: `sample=$(basename $read1 _1.fastq.gz)`

      ℹ️ The `sample` variable is now defined inside both `workflow.sh` and `Modules/Shovill.sh`; as a later clean-up step, one idea would be to make it another positional argument inside `Modules/Shovill.sh` and thereby replace the previous definition.

    - Inside `workflow.sh`, create a variable called `assembly` and point it to the assembly file generated by `Modules/Shovill.sh`. (Remember to use the `$outdir` and `$sample` variables to point into the `Shovill/` folder)
- Save and exit
  - Make a new script file: `Modules/Quast.sh`
    - Set variables `assembly`, `output_dir`, & `sample` as positional arguments
    - Define a variable `outdir` to point to `$output_dir/Quast/$sample`
    - Call the `quast.py` command pointing to the variables
    - Save and test
  - Call `Quast.sh` with bash inside `workflow.sh`
Exercise B
MultiQC is a tool which can recognise and aggregate results from a variety of bioinformatic tools, including Quast and FastQC. Here we'll learn to partially integrate MultiQC into the workflow; however, since Shovill doesn't provide output files named after their input samples, MultiQC will be unable to distinguish the Quast reports from one another.
To enable this distinction, you should use the `$assembly` and `$sample` variables to rename the output file according to its sample name, and then ensure that the output from Quast is likewise named according to sample names.
Again, lucky for us, MultiQC was likewise included in the `QC.yaml` file and is therefore already installed.
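The renaming suggested above can be sketched as follows. This demo uses a stand-in file in a temporary folder, and it assumes (check your own Shovill output) that the generic assembly name is `contigs.fa`:

```shell
# Stand-in for a real Shovill result (ERR14229040 is an example sample)
outdir=$(mktemp -d)
sample=ERR14229040
mkdir -p "$outdir/Shovill/$sample"
echo ">contig1" > "$outdir/Shovill/$sample/contigs.fa"

# Copy the generic file to a per-sample name that MultiQC can tell apart
assembly=$outdir/Shovill/$sample/contigs.fa
cp "$assembly" "$outdir/Shovill/$sample/$sample.fasta"

ls "$outdir/Shovill/$sample"
```

In `workflow.sh` itself, `$outdir` and `$sample` are already defined, so only the `assembly=` and `cp` lines are needed there.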
- Integrate MultiQC into the `workflow.sh` file after the Quast call
  - Inspect usage from the help page of `multiqc` by invoking it through micromamba
  - Make a new script: `Modules/MultiQC.sh`
    - Make a single positional argument called `output_dir`
    - Inside the script, make a new variable `outdir` which points to `$output_dir/MultiQC`
    - Call `multiqc` using `output_dir` as input and make sure to point to `outdir` as output
  - Call `MultiQC.sh` with bash inside `workflow.sh`
Part 9 - Improving the typers (Advanced)
Exercise C
Let's improve the typers by adding yet another module to the workflow.
- Utilize the existing `install_databases.sh` script to additionally add installation instructions for PlasmidFinder
- Make a `Modules/Plasmidfinder.sh` script which calls `plasmidfinder.py` from the `CGEfinders` environment
- Implement the `Plasmidfinder.sh` script into `workflow.sh`