Pipeline development

⛔

This exercise assumes the following:
- A native Unix environment / terminal
- A functional miniconda / minimamba installation
- Ability to install environments from .yaml files
- A few bacterial assemblies
- All files are stored within the home directory for the student user. Feel free to change locations according to your own needs!

Authors

These exercises were authored and tested by Povilas Matusevicius and Kasper Thystrup Karstensen.

Our very first pipeline

A brief disclaimer

Scripting and pipelines are a bit like cooking. You write one or more recipes (scripts), which the chef (bash) then follows. The ingredients (files and file paths) are usually listed at the top, and the cooking utensils (commands and software calls) are introduced later, once you need them. Unlike cooking, however, scripts are interpreted literally down to the very last comma, so the chef will not use their experience to guide the process. This means that any errors in the recipe will be followed, and missing steps will be left out. For example, if the recipe tells the chef to put the turkey in the oven and crank the temperature up to 200 °C for 20 minutes, this ensures neither that the turkey is taken out of the oven nor that the oven is turned off, unless the recipe says so.

This can (and will) lead to a lot of frustration, as it’s hard to ensure that all details are correct.

❗

This is quite alright. In fact, it’s impossible to learn coding without a lot of trial and error!

Pipelines start with a script…

To make a pipeline we must provide at least one script that can be followed. In this first exercise you will be guided through the process of writing a bash script.

A bash script is a simple text file that contains one or more lines of code, which are executed in order from top to bottom.

A very important aspect of writing bash scripts is keeping the code clean and concise for the future reader and developer, who, assuming long job contracts, could very often be yourself. One way to make the code more reader-friendly is to assign file paths and file names to variables.

Setting up a characterization pipeline

Let’s make a script which uses some of the characterization tools that were introduced on Day 7.

To demonstrate, we will start by setting up VirulenceFinder, PlasmidFinder, and AMRFinderPlus.

Prerequisites

For this exercise we will use the BTG_finders environment. In addition we will use the following files and file paths.

Sample: /home/student/BTG/Bacteria_Illumina/skesa_assemblies/Ec001.illumina.fasta

Path for output folder: /home/student/day8_pipelines/output/

Path to VirulenceFinder database: /home/student/BTG/dbs/virulencefinder_db/

Path to PlasmidFinder database: /home/student/BTG/dbs/plasmidfinder_db/

Tasks: Variables

First, we will make variables which can make writing and interpreting the script easier.

Activate the required environment using conda activate BTG_finders

Generate a new folder in the home directory, call it day8_pipelines. The full path to the folder should be /home/student/day8_pipelines.

Navigate into the folder using cd.

Make a new file called finders_pipeline.sh within the folder using nano finders_pipeline.sh

Copy, paste, and fill out the missing information for the vfdb_path and the pfdb_path in the following code chunk:

#!/bin/bash

# Assign values to the variables
sample="/home/student/BTG/Bacteria_Illumina/skesa_assemblies/Ec001.illumina.fasta"
vfdb_path=""
pfdb_path=""
output_folder="/home/student/day8_pipelines/output/"

Task: Execution

The very first line of this script is called a shebang, and it’s comparable to file extensions in Windows. In Windows, if you take a Word document and rename its file extension from .docx to .exe, Windows will assume that it’s a program and will try to run it once you double-click it. However, it will fail, as it’s in fact a document file and not a program.

In Unix, the shebang tells the system which program is required to run the script. If no shebang line is added, you have to state manually which program should execute the script.

Scripts are easy to execute: you just point to the file with a preceding dot and forward slash (./).

Can you name the program which is used to execute this script?

Try to execute the script using ./finders_pipeline.sh. Did any errors show up?

Right now the file’s permissions only allow reading and writing. As a safety mechanism, Unix does not let you execute a file until you change its permissions to allow it.

Allow execution of the file by running chmod u+x finders_pipeline.sh
1. Explanation: u means for current user only, + means add permissions, x means execution permission.

Try to execute the script again. Did any errors show up this time?
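The permission step can also be tried on a throwaway script before touching the real one. The following is a minimal sketch, assuming a temporary folder from mktemp and a hypothetical file name hello.sh; neither is part of the exercise.

```shell
#!/bin/bash
# Hypothetical demo: work in a throwaway folder, not in day8_pipelines
demo_dir=$(mktemp -d)
demo_script="$demo_dir/hello.sh"

# Write a two-line script: a shebang plus one command
printf '%s\n' '#!/bin/bash' 'echo "Hello from the script"' > "$demo_script"

# A newly created file is normally not executable
if [ -x "$demo_script" ]; then
  before="executable"
else
  before="not executable"
fi
echo "Before chmod: $before"

# Add execution permission for the current user only
chmod u+x "$demo_script"

# The first line (the shebang) names the program that runs the script
shebang=$(head -n 1 "$demo_script")
echo "Shebang line: $shebang"

# Now the script can be called directly by its path
result=$("$demo_script")
echo "Script output: $result"

# Clean up the throwaway folder
rm -r "$demo_dir"
```

Running ls -l on a script before and after chmod shows the same change as an added x in the permission column.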

Task: Telling the chef what to do…

Now that the ingredients list has been set up, let’s start making the script usable.

Currently, we are pointing to an output file path which does not exist. This is like telling the chef to put the dishes on an imaginary table: not very helpful!

A great start is therefore to ensure that the folder is created early on.

In the finders_pipeline.sh file, directly after the variables section, add the following lines:

# Create output folders
mkdir -p "$output_folder"/vf
mkdir -p "$output_folder"/pf
mkdir -p "$output_folder"/af
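
The -p flag is what makes these lines safe to run every time the script is executed: it creates missing parent folders and does not complain when the folder already exists. A minimal sketch of that behavior, using a hypothetical temporary folder instead of the real output path:

```shell
#!/bin/bash
# Hypothetical demo folder, created in a temporary location
base=$(mktemp -d)
demo_output="$base/output"

# -p creates the missing parent folder ("output") along with the target ("vf")
mkdir -p "$demo_output"/vf

# Running the same command again is not an error when the folder exists
mkdir -p "$demo_output"/vf
second_status=$?
echo "Exit status of the second mkdir -p: $second_status"

# Without -p, mkdir refuses to create a folder whose parent is missing
if mkdir "$base/missing/pf" 2>/dev/null; then
  plain_result="created"
else
  plain_result="failed"
fi
echo "mkdir without -p: $plain_result"

# Clean up the throwaway folder
rm -r "$base"
```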

♟️

A word about the quoted variables. Quotes around "$variables" are not strictly required, but they are recommended. Say that we have an $output variable holding the prefix …/Ecc001 and we want to refer to the file …/Ecc001_results.txt. If you wrote $output_results.txt, the script would look for a variable called $output_results instead of $output. To prevent this, add quotes around the variable: "$output"_results.txt. Quotes also keep a path that contains spaces together as a single argument.

Now it’s time to add some lines of code which execute the programs that we want to use for characterization. Let’s start with VirulenceFinder and PlasmidFinder.
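
The difference is easy to see by printing the three spellings side by side. This is a small sketch with a hypothetical value for output; the curly-brace form is a common alternative way of marking where the variable name ends.

```shell
#!/bin/bash
# Hypothetical value, used only for this demonstration
output="/tmp/Ec001"

# Without quotes or braces, bash reads the name as $output_results,
# which is not set and therefore expands to nothing
wrong="$output_results.txt"

# Closing the quotes right after the variable name ends the name there
quoted="$output"_results.txt

# Curly braces mark the end of the name as well
braced="${output}_results.txt"

echo "wrong:  $wrong"
echo "quoted: $quoted"
echo "braced: $braced"
```

The first line prints only .txt, while the other two print /tmp/Ec001_results.txt.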

Add the following lines to your script.

# Start characterization with finders
virulencefinder.py -i "$sample" -p "$vfdb_path" -o "$output_folder"/vf -xq
plasmidfinder.py -i "$sample" -p "$pfdb_path" -o "$output_folder"/pf -xq

In another terminal, activate the BTG_finders environment, then consult the help page for one of the two finders and fill out the information on the following arguments:
1. -i
1. -p → Path to the databases
1. -o → Path to blast output
1. -x → Extended output
1. -q → Quiet mode (Hide messages)

By now your code should look something along the lines of the solution shown further down, minus the amrfinder line.

Running AMRFinder

By now we are well on our way to setting up a small pipeline for characterizing isolates.

We are still missing resistance genes, so a natural next step would be to add a ResFinder command. Instead, let’s be curious and use AMRFinder.

By calling the AMRFinder help page, we can inspect its usage. Here we have only included details for the two arguments we need:

❓

USAGE: amrfinder [--protein PROT_FASTA] [--nucleotide NUC_FASTA] [--gff GFF_FILE] [--database DATABASE_DIR] [--update] [--ident_min MIN_IDENT] [--coverage_min MIN_COV] [--organism ORGANISM] [--translation_table TRANSLATION_TABLE] [--plus] [--report_common] [--point_mut_all POINT_MUT_ALL_FILE] [--blast_bin BLAST_DIR] [--parm PARM] [--output OUTPUT_FILE] [--quiet] [--gpipe] [--threads THREADS] [--debug]

-n NUC_FASTA, --nucleotide NUC_FASTA | Nucleotide FASTA file to search
-o OUTPUT_FILE, --output OUTPUT_FILE | Write output to OUTPUT_FILE instead of STDOUT

Add the following amrfinder call to finders_pipeline.sh and fill in the missing variables by replacing $var1 and $var2 with the correct variable names (one of them being $output_folder).

amrfinder -n $var1 -o $var2/af/amrfinder_results.txt

Ready to take the pipeline for a spin? Save and exit nano, then… Let’s GO!

# Run this from terminal
./finders_pipeline.sh

Solution - Please minimize until you are done!

#!/bin/bash

# Assign values to the variables
sample="/home/student/BTG/Bacteria_Illumina/skesa_assemblies/Ec001.illumina.fasta"
vfdb_path="/home/student/BTG/dbs/virulencefinder_db/"
pfdb_path="/home/student/BTG/dbs/plasmidfinder_db/"
output_folder="/home/student/day8_pipelines/output/"

# Create output folders
mkdir -p "$output_folder"/vf
mkdir -p "$output_folder"/pf
mkdir -p "$output_folder"/af

# Start characterization with finders
virulencefinder.py -i "$sample" -p "$vfdb_path" -o "$output_folder"/vf -xq
plasmidfinder.py -i "$sample" -p "$pfdb_path" -o "$output_folder"/pf -xq
amrfinder -n "$sample" -o "$output_folder"/af/amrfinder_results.txt

Wrap up

Congratulations, you have made your very first pipeline. Provided you didn’t introduce any errors, it should run the same way each time you execute the script. This is really useful for reproducibility and to semi-automate your own workflow.

Now, the script is not very useful if you have samples other than Ec001.illumina.fasta, as you would have to change the sample variable in the script every time you wanted to run it on a different sample. Don’t worry: only very small changes are required to fix this, and we will take a look at them next.

Making the pipeline run on other samples

Positional arguments

One way to make the pipeline easier to use is to replace the hard-coded input with positional arguments. Positional arguments let a script read the extra arguments that the user writes when invoking the script.

Imagine we have a small executable bash script called simple.sh. It works like this:

#!/bin/bash

firstVar=$1
secondVar=$2

echo "The first variable is $firstVar. The second variable is $secondVar."

When you execute it:

./simple.sh Fish ImSecond
The first variable is Fish. The second variable is ImSecond.
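
A script that relies on positional arguments receives empty values when the user leaves them out. One common guard, sketched here with a hypothetical helper function called check_args, is to count the arguments with $# before continuing. The function is only for illustration and is not part of the exercise scripts.

```shell
#!/bin/bash
# check_args is a hypothetical helper, not part of the exercise scripts.
# Inside a function, $#, $1 and $2 refer to the function's own arguments,
# in the same way they refer to the script's arguments at the top level.
check_args() {
  if [ "$#" -ne 2 ]; then
    echo "Usage: simple.sh <first> <second>" >&2
    return 1
  fi
  echo "The first variable is $1. The second variable is $2."
}

# Two arguments: the message is printed
check_args Fish ImSecond

# One argument: the usage text is printed and the call fails
check_args Fish || echo "Too few arguments were given"
```

In a real script the same test would sit at the top, right after the shebang, so that the pipeline stops before any tool is started with an empty input.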

Copy the finders_pipeline.sh script into a new file called finders_positional_pipe.sh using:

cp finders_pipeline.sh finders_positional_pipe.sh

Open the new file (finders_positional_pipe.sh) with nano.

Change the variable definitions so that sample= takes the first positional argument ($1) and the output_folder= takes the second positional argument ($2)

Save and exit nano

Execute the script providing the following file (a new file!) and file path, as first and second arguments respectively.
1. Sample: /home/student/BTG/Bacteria_Illumina/skesa_assemblies/Ec002.illumina.fasta
1. Output_folder: /home/student/day8_pipelines/output/

./finders_positional_pipe.sh [sample] [output-folder]

Screening folder for samples

Another approach to enhance usability of your pipeline is to replace the input sample file with a sample folder, and then automatically screen this folder for relevant sample files.

Screening a folder for fasta files is a bit outside the scope of this course, so we will provide the necessary code.

First copy the finders_positional_pipe.sh script into a new file called finders_on_folder.sh using cp

Open the new file (finders_on_folder.sh) with nano

Rename the sample variable in the assignment to sample_dir. Leave the $sample references in the rest of the script unchanged.

Add the following lines to the script right after the variables and mkdir commands:

# Screen the sample_dir for fasta files
files=$(find "$sample_dir" -maxdepth 1 -type f -name "*.fasta" | sort)

Explanation
- -maxdepth 1 | limits the search to the folder itself, excluding subfolders
- -type f | limits the search to files only, not folders
- -name | defines the name pattern that a file must match to be found
- sort | a command which sorts all the output, in this instance from the find command
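
The effect of each option can be checked on a small test folder. The following sketch builds a hypothetical folder layout in a temporary location (the file names are made up for the demonstration) and runs the same find command on it.

```shell
#!/bin/bash
# Hypothetical folder layout, built in a temporary location for this demo
sample_dir=$(mktemp -d)
mkdir "$sample_dir/subfolder"
touch "$sample_dir/Ec002.fasta" "$sample_dir/Ec001.fasta"   # should be found
touch "$sample_dir/notes.txt"                               # wrong extension
touch "$sample_dir/subfolder/Ec003.fasta"                   # too deep for -maxdepth 1

# Same command as in the pipeline
files=$(find "$sample_dir" -maxdepth 1 -type f -name "*.fasta" | sort)

# Keep only the file names to make the output short
names=$(for f in $files; do basename "$f"; done)
echo "$names"

# Clean up the throwaway folder
rm -r "$sample_dir"
```

Only Ec001.fasta and Ec002.fasta are listed, in sorted order; the text file and the file in the subfolder are skipped.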

Solution - Inspect after finishing step 4!

#!/bin/bash

# Assign values to the variables
sample_dir=$1
vfdb_path="/home/student/BTG/dbs/virulencefinder_db/"
pfdb_path="/home/student/BTG/dbs/plasmidfinder_db/"
output_folder=$2

# Create output folders
mkdir -p "$output_folder"/vf
mkdir -p "$output_folder"/pf
mkdir -p "$output_folder"/af

# Screen the sample_dir for fasta files
files=$(find "$sample_dir" -maxdepth 1 -type f -name "*.fasta" | sort)

# Start characterization with finders
virulencefinder.py -i "$sample" -p "$vfdb_path" -o "$output_folder"/vf -xq
plasmidfinder.py -i "$sample" -p "$pfdb_path" -o "$output_folder"/pf -xq
amrfinder -n "$sample" -o "$output_folder"/af/amrfinder_results.txt

The script does not yet work, as the $sample variable is no longer defined, so let’s comment out the lines which do not work. Add a # in front of all the finder commands so they will be ignored:

# Start characterization with finders
#virulencefinder.py -i "$sample" -p "$vfdb_path" -o "$output_folder"/vf -xq
#plasmidfinder.py -i "$sample" -p "$pfdb_path" -o "$output_folder"/pf -xq
#amrfinder -n "$sample" -o "$output_folder"/af/amrfinder_results.txt

Now it’s time to figure out whether the fasta file screener works or not. Add the following lines directly after the screening line (files=$(find …)):

# Looping over each of the sample files individually
for sample in $files; do
  # Define the sample name from the file
  sample_name=$(basename "$sample" .fasta)

  # Print out a helpful message showing that this in fact works
  echo "The sample $sample_name is located here: $sample"
done

Save and exit nano, then execute the script by invoking:

./finders_on_folder.sh /home/student/BTG/Bacteria_Illumina/skesa_assemblies/ /home/student/day8_pipelines/output/

If things go well, the script should print the file name for every single sample file located within the sample_dir. The second argument is provided to satisfy the output_folder variable, which expects a second argument.

Reopen the script in nano, remove the # signs in front of the virulencefinder, plasmidfinder, and amrfinder commands, and move the finder commands into the for loop. Replace the echo line with the finder lines.

In its current state the script should automatically run the finders for each of the samples. However, as the final output files have the same names, they will be overwritten. To prevent this, we must split the results into individual, unique folders. Luckily, the for loop already prepares for this by defining the sample_name variable.

For each of the finder lines, change the output argument from -o "$output_folder"/Xf to -o "$output_folder"/Xf/"$sample_name", where Xf denotes vf for VirulenceFinder, pf for PlasmidFinder, and af for AMRFinder. For AMRFinder, keep /amrfinder_results.txt at the end of the path.

There is one more issue: the finders don’t create folders themselves, so a missing folder would give you an error. Inside the loop, before the finder commands, add mkdir -p "$output_folder"/Xf/"$sample_name" for each of the finders (vf, pf, and af).

Add a final victory message at the bottom of the file using echo, e.g. echo Jobs done!
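
To see why sample_name gives every sample its own location, the sketch below only prints the paths that would result, without reading or writing any files. The two sample paths are taken from the exercise; the af subfolder and the amrfinder_results.txt file name follow the earlier script.

```shell
#!/bin/bash
# Paths taken from the exercise; no files are read or written here
output_folder="/home/student/day8_pipelines/output"
samples="/home/student/BTG/Bacteria_Illumina/skesa_assemblies/Ec001.illumina.fasta
/home/student/BTG/Bacteria_Illumina/skesa_assemblies/Ec002.illumina.fasta"

for sample in $samples; do
  # basename drops the folder part and the trailing .fasta
  sample_name=$(basename "$sample" .fasta)

  # Each sample now maps to a different results path
  echo "$output_folder/af/$sample_name/amrfinder_results.txt"
done
```

The two printed paths end in af/Ec001.illumina/amrfinder_results.txt and af/Ec002.illumina/amrfinder_results.txt, so the results no longer overwrite each other.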

Solution - Final script

#!/bin/bash

# Assign values to the variables
sample_dir=$1
vfdb_path="/home/student/BTG/dbs/virulencefinder_db/"
pfdb_path="/home/student/BTG/dbs/plasmidfinder_db/"
output_folder=$2

# Screen the sample_dir for fasta files
files=$(find "$sample_dir" -maxdepth 1 -type f -name "*.fasta" | sort)

# Create output folders
mkdir -p "$output_folder"/vf
mkdir -p "$output_folder"/pf
mkdir -p "$output_folder"/af

# Looping over each of the sample files individually
for sample in $files; do
  # Define the sample name from the file
  sample_name=$(basename "$sample" .fasta)

  mkdir -p "$output_folder"/vf/"$sample_name"
  mkdir -p "$output_folder"/pf/"$sample_name"
  mkdir -p "$output_folder"/af/"$sample_name"

  # Start characterization with finders
  virulencefinder.py -i "$sample" -p "$vfdb_path" -o "$output_folder"/vf/"$sample_name" -xq
  plasmidfinder.py -i "$sample" -p "$pfdb_path" -o "$output_folder"/pf/"$sample_name" -xq
  amrfinder -n "$sample" -o "$output_folder"/af/"$sample_name"/amrfinder_results.txt
done

echo Jobs done!

CONGRATZ on your first pipeline!!!
