Building Basic Workflows
Goal
The goal of this guide is to take you through the process of structuring and populating a very basic workflow. Work your way through this recipe step by step, and you will end up with a workflow that takes raw read data as input, creates quality assurance reports alongside trimmed reads, and generates de-novo assemblies from isolate read data.
The recipe assumes some prior knowledge of executing bash commands, writing very simple scripts, passing positional arguments as script input, and using virtual Conda environments to install and run bioinformatic tools. By the end, you will have the hang of how to start your very own basic bioinformatic workflow.
Day 1
The first steps
The natural first step of building workflows is to install a virtual environment with a piece of bioinformatic software and test commands for executing that software.
Part 0 - Setup
- Prepare folders for containing experiment data
  ```shell
  mkdir -p ~/MyExperiment/Dataset
  cd ~/MyExperiment/Dataset
  wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/040/ERR14229040/ERR14229040_1.fastq.gz
  wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/040/ERR14229040/ERR14229040_2.fastq.gz
  wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/089/ERR14229089/ERR14229089_1.fastq.gz
  wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/089/ERR14229089/ERR14229089_2.fastq.gz
  ```
- Prepare folders for our workflow files
  ```shell
  mkdir ~/MyFirstWorkflow
  cd ~/MyFirstWorkflow
  ```
From here on out, we assume that you are always at `~/MyFirstWorkflow` when running commands and when creating files.
Part 1 - Running a Bioinformatic tool
As a very first step, let's try to use bioinformatic software for assessing raw read quality; FastQC is a great tool for this purpose.
Conda installation of bioinformatic tool
FastQC is very easy to install with micromamba, especially when using environment files. We start by making an environment file for installing FastQC. Since a few bonus assignments at the end of this document rely on other software, let's include a few extra bioinformatic tools in the same environment.
Exercise 1
- Make `QC.yaml` by running `nano QC.yaml`

  ```yaml
  name: QC
  channels:
    - conda-forge
    - bioconda
  dependencies:
    - fastqc
    - quast
    - multiqc
  ```
- Install the environment from the `QC.yaml` file with `micromamba create`

Solution

```shell
micromamba create -f QC.yaml
```
Execute bioinformatic command
- Run FastQC on any given sample
  - First, make a folder for the FastQC output: `~/MyExperiment/Results/FastQC/`
  - Next, run `fastqc` on both read mates of the selected sample; make sure to divert output to the `~/MyExperiment/Results/FastQC/` folder.

    ```shell
    mkdir -p ~/MyExperiment/Results/FastQC
    micromamba run -n QC fastqc ~/MyExperiment/Dataset/ERR14229040_1.fastq.gz ~/MyExperiment/Dataset/ERR14229040_2.fastq.gz -o ~/MyExperiment/Results/FastQC
    ```
  - Inspect the output folder
    - What are the names of the output files we could be interested in? (Are they unique or generic file names?)
Part 2 - Converting command into scripts
Scripts are basically simple text files whose contents, when executed, are interpreted as bash commands. Their advantage is that the commands are stored and can therefore be reused in the future with minimal testing.
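As a quick, self-contained sketch of that idea (the file name `hello.sh` is just an example, not part of the workflow), this is all a script is:

```shell
# Write a command into a text file - this is the whole "script"
cat > hello.sh << 'EOF'
#!/bin/bash
echo "Hello from a script"
EOF

# Re-running it later repeats the stored command exactly
bash hello.sh
# prints: Hello from a script
```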
Basic script
Exercise 2
- Convert the bioinformatic command into a script
  - Open an empty file called `FastQC.sh`
  - Add `#!/bin/bash` as the very first line; this is the shebang and tells Unix what language is used for execution
  - Copy both the `mkdir` command and the `fastqc` command into `FastQC.sh`
  - Create the three variables `read1`, `read2`, & `output_dir` after the shebang but before the commands
  - Designate file paths for the reads and output folder into their respective variables
  - Replace the file paths with the corresponding variables in the `fastqc` command
  - Save the changes and test whether it works by executing the script with bash
Solution
```shell
#!/bin/bash

# Input
read1=~/MyExperiment/Dataset/ERR14229040_1.fastq.gz
read2=~/MyExperiment/Dataset/ERR14229040_2.fastq.gz

# Output
output_dir=~/MyExperiment/Results/FastQC

mkdir -p $output_dir

micromamba run -n QC fastqc $read1 $read2 -o $output_dir
```
Dynamic script
Currently, our script can run FastQC on only one specific sample. To make the script more useful, we should replace the hard-coded paths with positional arguments.
Introducing positional arguments makes the script a bit harder to execute from bash, since the user must specify input and output; on the other hand, the script can now run on other samples and write to other output folders without further modification.
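To illustrate the mechanism with a minimal sketch (the script name `show_args.sh` and the argument values are hypothetical), bash fills `$1`, `$2`, `$3`, ... with whatever the user types after the script name:

```shell
cat > show_args.sh << 'EOF'
#!/bin/bash
# $1 and $2 hold the first and second argument given on the command line
first=$1
second=$2
echo "first=$first second=$second"
EOF

bash show_args.sh reads_1.fastq.gz Results
# prints: first=reads_1.fastq.gz second=Results
```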
Exercise 3
- Introduce positional arguments to the `FastQC.sh` script
  - Reopen the file and replace the `read1`, `read2`, & `output_dir` hard-coded values with the positional arguments `$1`, `$2`, & `$3`
  - Make an `outdir` variable which points to `$output_dir/FastQC`
  - Save the changes and re-execute the script with bash. Remember to provide the input files and output directory in the bash command.
Solution
```shell
bash FastQC.sh ~/MyExperiment/Dataset/ERR14229040_1.fastq.gz ~/MyExperiment/Dataset/ERR14229040_2.fastq.gz ~/MyExperiment/Results
```

```shell
#!/bin/bash

# Input
read1=$1
read2=$2

# Output
output_dir=$3

# Variables
outdir=$output_dir/FastQC

mkdir -p $outdir

micromamba run -n QC fastqc $read1 $read2 -o $outdir
```
Part 3 - Make another script
Now that we have a handler for creating quality reports for raw reads, let's create another script for generating assemblies. We will use Shovill, as it contains handlers for polishing and trimming reads before assembly, while it also polishes the genome after assembly.
Installation
Exercise 4
- Generate a yaml file for Shovill called `Assembly.yaml`

  ```yaml
  name: Assembly
  channels:
    - conda-forge
    - bioconda
  dependencies:
    - shovill
  ```
- Install the environment with `micromamba create`

Solution

```shell
micromamba create -f Assembly.yaml
```
Execution
Exercise 5
Shovill is a bit harder to use, so we need to inspect its help page.
- Execute `shovill` on any given sample (preferably the smallest sample)
  - Inspect the help page; note down how to point to Read1, Read2, and the output directory, and how to enable read adapter trimming

    ```shell
    micromamba run -n Assembly shovill --help
    ```

  - Execute `shovill` and make sure to direct output to a `~/MyExperiment/Results/Shovill/` folder; make sure to add the `--trim` option
  - Inspect the output folder.
    - What is the name of the output assembly we need? (Is it a unique or generic file name?)
Solution
```shell
micromamba run -n Assembly shovill --R1 ~/MyExperiment/Dataset/ERR14229089_1.fastq.gz --R2 ~/MyExperiment/Dataset/ERR14229089_2.fastq.gz --outdir ~/MyExperiment/Results/Shovill/ --trim
```
Convert to dynamic script
Exercise 6
We have learned that Shovill uses generic file names, which means that if we run the same script for multiple samples, the outputs would overwrite each other. Therefore we have to ensure that the output is written to a unique folder. One straightforward approach is to include the sample name as part of the output folder path.
```shell
# Assuming read1 is defined and its file name ends with the _1.fastq.gz suffix,
# the sample name can be extracted with the following code:
sample=$(basename $read1 _1.fastq.gz)
```
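To see what `basename` actually does here (the path below is a file from our dataset; `basename` works on the name alone, so the file does not even need to exist): the directory part is stripped, and an optional second argument removes a trailing suffix:

```shell
# Full path -> file name only
basename ~/MyExperiment/Dataset/ERR14229040_1.fastq.gz
# prints: ERR14229040_1.fastq.gz

# Full path + suffix -> bare sample name
basename ~/MyExperiment/Dataset/ERR14229040_1.fastq.gz _1.fastq.gz
# prints: ERR14229040
```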
- Create a script called `Shovill.sh` which executes `shovill`
  - Copy the `shovill` command into a new script file called `Shovill.sh`
  - Create variables for `read1`, `read2`, & `output_dir`, designated as positional arguments
  - Define a new `sample` variable, using the code chunk specified above.
  - Define an `outdir` variable, pointing to `$output_dir/Shovill/$sample`
  - Replace the hard-coded paths with the `read1`, `read2`, & `outdir` variables
  - Save and test the script; remember to specify input and output.
Solution
```shell
#!/bin/bash

# Input
read1=$1
read2=$2

# Output
output_dir=$3

# Variables
sample=$(basename $read1 _1.fastq.gz)
outdir=$output_dir/Shovill/$sample

micromamba run -n Assembly shovill --R1 $read1 --R2 $read2 --outdir $outdir
```
Day 2
Part 4 - Building a basic workflow
Exercise 7
By now, our folder should contain the following files:
```
~/MyFirstWorkflow
├── FastQC.sh
├── QC.yaml
├── Shovill.sh
└── Assembly.yaml
```

Now that we have a few components, we can start to structure a basic workflow. Let's start by organising the existing files.
- Make two folders inside `~/MyFirstWorkflow`, one called `Envs/`, the other called `Modules/`
- Move all `.sh` files into the `Modules/` folder and move all `.yaml` files into the `Envs/` folder
By now your folder structure should look something like this:
```
~/MyFirstWorkflow
├── Modules
│   ├── FastQC.sh
│   └── Shovill.sh
└── Envs
    ├── QC.yaml
    └── Assembly.yaml
```

Exercise 8
Let's try to create a single script for executing the modules, so that we can easily add additional steps in the future.
- Create a single script for executing all workflow commands
  - Add an empty file called `workflow.sh` inside the `~/MyFirstWorkflow/` folder.
  - Add `read1`, `read2`, and `outdir` variables for handling the positional arguments `$1`, `$2`, & `$3`
  - Copy the following line into the script: `bash Modules/FastQC.sh`
  - Provide the `read1`, `read2`, and `outdir` variables as input for the bash call
  - Save and execute the `workflow.sh` script; remember to provide positional arguments
  - Add the assembly module by pasting `bash Modules/Shovill.sh` with the respective positional arguments at the end of the `workflow.sh` file.
Solution
```shell
#!/bin/bash

read1=$1
read2=$2
outdir=$3

bash Modules/FastQC.sh $read1 $read2 $outdir
bash Modules/Shovill.sh $read1 $read2 $outdir
```
By now your folder structure should look like this
```
~/MyFirstWorkflow
├── workflow.sh
├── Modules
│   ├── FastQC.sh
│   └── Shovill.sh
└── Envs
    ├── QC.yaml
    └── Assembly.yaml
```

Part 5 - Expanding your workflow
With our basic framework up and running, let's try to attach another module to ensure that we can expand on our workflow.
Let's try to attach a typing tool: ResFinder, which screens samples for resistance genes. Since it requires a database for setup, this will be a bit more complex to build. But first, let's start with the environment:
Installation
Exercise 9
- Install ResFinder in an environment called `CGEfinders`
  - Create an environment file called `CGEfinders.yaml` inside the `Envs/` folder
  - Copy-paste the contents

    ```yaml
    name: CGEfinders
    channels:
      - conda-forge
      - bioconda
    dependencies:
      - resfinder
      - plasmidfinder
    ```

  - Create the environment
  - Test that it works by executing `run_resfinder.py --help` from the `CGEfinders` environment
Solution
```shell
micromamba create -f CGEfinders.yaml
micromamba run -n CGEfinders run_resfinder.py --help
```
Setup database
Exercise 10
The hardest part here is installing and setting up the database. While this could be explained step by step, we'll save time and just copy-paste a script for the purpose:
- Make a script for setting up the ResFinder DB
  - Make a new script called `install_databases.sh` in the `Modules/` folder
  - Copy-paste the following code into the file

    ```shell
    #!/bin/bash

    # Define default value
    db_dir="databases"

    # Check if user provided input
    if [[ $1 ]]; then
        db_dir=$1
    fi

    # Define current location
    here=$(pwd)

    # Create database folder and navigate
    mkdir -p $db_dir
    cd $db_dir

    # Check and set up ResFinder database
    if [ -d "resfinder_db" ] && [ "$(ls -A resfinder_db)" ]; then
        echo "ResFinder database detected."
    else
        echo "Setting up ResFinder database..."
        git clone https://bitbucket.org/genomicepidemiology/resfinder_db/
        cd resfinder_db
        # Execute resfinder-db install using python modules from its own environment
        micromamba run -n CGEfinders python3 INSTALL.py
        # Return to previous location
        cd $here
    fi

    echo "Done!"
    ```
- Save and execute
Execution
Exercise 11
- Execute ResFinder and check whether it works
  - Execute `run_resfinder.py --help`
  - Figure out which options are needed for determining
    - fastq input files
    - output directory
    - resistance database folder
    - resistance database kma folder (called `db_res_kma`)
    - the option for searching for acquired resistance genes
  - Inspect the output.
- Create a script called `ResFinder.sh` which executes `run_resfinder.py`
  - Open an empty file called `ResFinder.sh` inside the `Modules/` folder.
  - Make variables for handling positional arguments; call them `read1`, `read2`, `output_dir`, & `db_dir`
  - Copy-paste the `sample` variable code chunk from the Assembly exercise (ResFinder would otherwise overwrite its own output)
  - Make a variable called `outdir` which points to `$output_dir/ResFinder/$sample`
  - Write a `run_resfinder.py` command with the required options, using `micromamba run`
Solution
```shell
#!/bin/bash

read1=$1
read2=$2
output_dir=$3
database_dir=$4

# Variables
sample=$(basename $read1 _1.fastq.gz)
outdir=$output_dir/ResFinder/$sample
db_dir=$database_dir/resfinder_db

micromamba run -n CGEfinders run_resfinder.py -ifq $read1 $read2 -o $outdir -db_res $db_dir -db_res_kma $db_dir --acquired
```
Exercise 12
- Attach the latest module to the workflow
  - Under the section with positional arguments in `workflow.sh`, add the following: `db_dir=$4`
  - Execute the `install_databases.sh` module just after the Shovill module and add the `$db_dir` variable to the call: `bash Modules/install_databases.sh $db_dir`
  - Execute the `ResFinder.sh` module after the `install_databases.sh` module; remember to add all variables
  - Save and execute the script
Solution
```shell
#!/bin/bash

read1=$1
read2=$2
outdir=$3
db_dir=$4

bash Modules/FastQC.sh $read1 $read2 $outdir
bash Modules/Shovill.sh $read1 $read2 $outdir
bash Modules/install_databases.sh $db_dir
bash Modules/ResFinder.sh $read1 $read2 $outdir $db_dir
```
Part 6 - Running the workflow on entire directories
With everything up and running so far, it's time to automate the workflow further. Currently, samples need to be specified one by one; a helpful enhancement would be to point to an input directory instead and have the workflow screen that directory for input files. This of course means that we have to update the description, but it is worth the hassle.
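One building block the directory loop in this part relies on is bash suffix removal, `${variable%pattern}`, which strips the shortest matching suffix from the variable's value. This is how a mate-2 file name can be derived from a mate-1 file name (the sample name below is one from our dataset):

```shell
read1=ERR14229040_1.fastq.gz

# Strip the mate-1 suffix, then append the mate-2 suffix
read2=${read1%_1.fastq.gz}_2.fastq.gz
echo $read2
# prints: ERR14229040_2.fastq.gz
```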
Exercise 13
- Point to an input directory rather than two read files as input
  - In the `workflow.sh` file, replace the `read1=$1` and `read2=$2` lines with `read_dir=$1`
  - Fix `outdir` to point to the second positional argument `$2` (was the third, `$3`)
  - Fix `db_dir` to point to the third positional argument `$3` (was the fourth, `$4`)
  - Add the following code chunk just before the first module is called

    ```shell
    read1_files=$(find $read_dir -maxdepth 1 -name "*_1.fastq.gz" | sort)
    for read1 in $read1_files; do
      # Determine read mate 2
      read2=${read1%_1.fastq.gz}_2.fastq.gz
      sample=$(basename $read1 _1.fastq.gz)
    ```

  - Add two spaces of indentation before every module call (as such: `  bash Modules/...`)
  - Enter `done` as the final line of the workflow; the skeleton looks like this:

    ```shell
    read1_files=$(find $read_dir -maxdepth 1 -type f -name "*_1.fastq.gz" | sort)
    for read1 in $read1_files; do
      # Determine read mate 2
      read2=${read1%_1.fastq.gz}_2.fastq.gz
      sample=$(basename $read1 _1.fastq.gz)

      bash Modules/moduleA.sh $read1 $read2 $outdir
      bash Modules/moduleB.sh $read1 $read2 $outdir
      ...
    done
    ```

Solution
```shell
#!/bin/bash

read_dir=$1
outdir=$2
db_dir=$3

read1_files=$(find $read_dir -maxdepth 1 -type f -name "*_1.fastq.gz" | sort)

for read1 in $read1_files; do
  # Determine read mate 2 and sample name
  read2=${read1%_1.fastq.gz}_2.fastq.gz
  sample=$(basename $read1 _1.fastq.gz)

  # Running FastQC
  bash Modules/FastQC.sh $read1 $read2 $outdir

  # Running Shovill
  bash Modules/Shovill.sh $read1 $read2 $outdir

  # Installing databases
  bash Modules/install_databases.sh $db_dir

  # Running ResFinder
  bash Modules/ResFinder.sh $read1 $read2 $outdir $db_dir
done
```
Part 7 - Making the workflow more user friendly
Currently, each module takes unnamed positional arguments for determining input and output. To make the workflow more useful for others, it would be a great idea to define named arguments and add a usage description.
Exercise 14
Let's start with the description
- Add the following to the `workflow.sh` script just after the `#!/bin/bash` line

  ```shell
  # Define usage function
  usage() {
    echo "Usage:"
    echo "  $(basename "$0") [-r|--read_dir Raw read directory] [-o|--outdir Output directory] [-d|--db_dir Database directory]" >&2
    exit 1
  }
  ```

  This is a function which prints out details on how we wish others to use our workflow.
- Just beneath the usage function, add the following code

  ```shell
  # Parse options
  while [[ $# -gt 0 ]]; do
    # Making named positional arguments
    case "$1" in
      -r|--read_dir)
        read_dir="$2"
        shift
        ;;
      -o|--outdir)
        outdir="$2"
        shift
        ;;
      -d|--db_dir)
        db_dir="$2"
        shift
        ;;
      *)
        usage
        ;;
    esac
    shift
  done
  ```

  This will look through all user input and check whether there are any instances of `-r`/`--read_dir`, `-o`/`--outdir`, or `-d`/`--db_dir`. Every time one of these is identified in the user input, the very next input is used as the value of the corresponding argument.
- Finally, we have to add a section that STOPS the script if any of the above arguments is missing; add the following code:

  ```shell
  # Check if required arguments are provided
  if [[ -z $read_dir || -z $outdir || -z $db_dir ]]; then
    usage
  fi
  ```

  This checks whether the `$read_dir`, `$outdir`, and `$db_dir` variables were defined; if any is missing, the code calls the `usage` function, which prints the usage and exits.
- Execute the workflow to ensure that it actually works.
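To see the whole argument-handling pattern in action without running any tools, here is a self-contained stub (the file name `stub_workflow.sh` and the final `echo` line are illustrative; your real `workflow.sh` runs the modules instead):

```shell
cat > stub_workflow.sh << 'EOF'
#!/bin/bash
usage() {
  echo "Usage: $(basename "$0") -r READ_DIR -o OUTDIR -d DB_DIR" >&2
  exit 1
}

# Same case/shift pattern as in workflow.sh
while [[ $# -gt 0 ]]; do
  case "$1" in
    -r|--read_dir) read_dir="$2"; shift ;;
    -o|--outdir)   outdir="$2";   shift ;;
    -d|--db_dir)   db_dir="$2";   shift ;;
    *) usage ;;
  esac
  shift
done

# Stop if any required argument is missing
if [[ -z $read_dir || -z $outdir || -z $db_dir ]]; then
  usage
fi

echo "read_dir=$read_dir outdir=$outdir db_dir=$db_dir"
EOF

bash stub_workflow.sh -r Dataset -o Results -d databases
# prints: read_dir=Dataset outdir=Results db_dir=databases
```

Calling the stub with a missing or unknown option prints the usage message and exits with status 1, which is exactly the behaviour we want from `workflow.sh`.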
Bonus Exercises
Part 8 - Improving quality assurance
We have generated de-novo assemblies, yet we do not generate quality assessment reports for these assemblies. Additionally, with FastQC, reports are generated on a per-read-file basis; while valuable for inspecting and interpreting the quality of the sequencing data, it's far from optimal to skim through every report one by one. This bonus section aims at providing you with inspiration for setting up additional tools that generate quality reports for assemblies, as well as for aggregating all quality reports into a single report.
Exercise A
Quast is a tool for assessing the quality of a given de-novo assembly; like FastQC, it works by simply pointing to the assembly file itself and to an output folder. Luckily, the `QC.yaml` file already contains Quast, so no further environment setup is required.
- Implement Quast into the existing workflow
  - Inspect the Usage: section of the help page of `quast.py`, using micromamba to execute the command
  - Define the assembly file as a variable within the `workflow.sh` file
    - Inspect the output of any given assembly (e.g. `ls ~/MyExperiment/Results/Shovill/ERR14229040`)
    - Open `workflow.sh` and navigate to the section where the `read1`, `read2` & `outdir` variables are defined.
    - Add the following line after the positional arguments: `sample=$(basename $read1 _1.fastq.gz)`

      ℹ️ The `sample` variable is now defined inside both `workflow.sh` and `Modules/Shovill.sh`; as a later clean-up step, one idea would be to make it another positional argument inside `Modules/Shovill.sh` and thereby replace the previous definition.

    - Inside `workflow.sh`, create a variable called `assembly` and point it to the assembly file generated by `Modules/Shovill.sh`. (Remember to use the `$outdir` and `$sample` variables to point into the `Shovill/` folder)
- Save and exit
  - Make a new script file: `Modules/Quast.sh`
    - Set variables `assembly`, `output_dir`, & `sample` as positional arguments
    - Define a variable `outdir` to point to `$output_dir/Quast/$sample`
    - Call the `quast.py` command pointing to the variables
    - Save and test
  - Call `Quast.sh` with bash inside `workflow.sh`
Exercise B
MultiQC is a tool which can recognise and aggregate results from a variety of bioinformatic tools, including Quast and FastQC. Here we'll learn to partially integrate MultiQC into the workflow; however, since Shovill doesn't provide output files named after their input samples, MultiQC will be unable to distinguish the Quast reports from one another.
To enable this distinction, you should use the `$assembly` and `$sample` variables to rename the output file according to its sample name, and then ensure that the output from Quast is likewise named according to sample names.
Again, lucky for us, MultiQC was likewise included in the `QC.yaml` file and is therefore already installed.
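The renaming suggested above can be sketched as follows. This demo uses a stand-in file in a temporary folder, and it assumes (check your own Shovill output) that the generic assembly name is `contigs.fa`:

```shell
# Stand-in for a real Shovill result (ERR14229040 is an example sample)
outdir=$(mktemp -d)
sample=ERR14229040
mkdir -p "$outdir/Shovill/$sample"
echo ">contig1" > "$outdir/Shovill/$sample/contigs.fa"

# Copy the generic file to a per-sample name that MultiQC can tell apart
assembly=$outdir/Shovill/$sample/contigs.fa
cp "$assembly" "$outdir/Shovill/$sample/$sample.fasta"

ls "$outdir/Shovill/$sample"
```

In `workflow.sh` itself, `$outdir` and `$sample` are already defined, so only the `assembly=` and `cp` lines are needed there.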
- Integrate MultiQC into the `workflow.sh` file after the Quast call
  - Inspect usage from the help page of `multiqc` by invoking it through micromamba
  - Make a new script: `Modules/MultiQC.sh`
    - Make a single positional argument called `output_dir`
    - Inside the script, make a new variable `outdir` which points to `$output_dir/MultiQC`
    - Call `multiqc` using `output_dir` as input and make sure to point to `outdir` as output
  - Call `MultiQC.sh` with bash inside `workflow.sh`
Part 9 - Improving the typers (Advanced)
Exercise C
Let's improve the typers by adding yet another module to the workflow.
- Utilize the existing `install_databases.sh` script to additionally add installation instructions for PlasmidFinder
- Make a `Modules/Plasmidfinder.sh` script which calls `plasmidfinder.py` from the `CGEfinders` environment
- Implement the `Plasmidfinder.sh` script into `workflow.sh`