Pipeline development
All file paths in these exercises assume the student user. Feel free to change locations according to your own needs!
Authors
These exercises were authored and tested by Povilas Matusevicius and Kasper Thystrup Karstensen.
Our very first pipeline
A brief disclaimer
Scripting and pipelines are a bit like cooking. You write one or more recipes (scripts), which the chef (bash) then follows. The ingredients (files and file paths) are usually listed at the top, and the cooking utensils (commands and software calls) are introduced later, once you need them. Contrary to cooking, scripts are interpreted literally down to the very last comma, so the chef will not use their experience to guide the process. This means that any errors in the recipe will be followed and missing steps will be left out. E.g. if the chef is told to put the turkey in the oven and crank the temperature up to 200 degrees Celsius for 20 minutes, this ensures neither that the turkey will be taken out of the oven nor that the oven will be turned off, unless it is stated in the recipe.
This can (and will) lead to a lot of frustration, as it's hard to ensure that all details are correct.
Pipelines start with a script…
To make a pipeline we must provide at least one script which can be followed. In this first exercise you will be guided through the process of writing a bash script.
A bash script is a simple text file which contains one or more lines of code, which are executed in order from top to bottom.
A very important aspect of writing a bash script is to ensure that the code is clean and concise for the future reader and developer, which, assuming long job contracts, could very often be yourself. One way of making the code more reader friendly is to assign variables that contain file paths and files.
Setting up a characterization pipeline
Let’s make a script which utilizes some of the characterization tools which were introduced on Day 7.
To demonstrate, we will start out by setting up VirulenceFinder, PlasmidFinder, and AMRFinderPlus.
Prerequisites
For this exercise we will use the BTG_finders environment. In addition, we will use the following files and file paths.
- Sample: /home/student/BTG/Bacteria_Illumina/skesa_assemblies/Ec001.illumina.fasta
- Path for output folder: /home/student/day8_pipelines/output/
- Path to VirulenceFinder database: /home/student/BTG/dbs/virulencefinder_db/
- Path to PlasmidFinder database: /home/student/BTG/dbs/plasmidfinder_db/
Tasks: Variables
First, we will make variables which can make writing and interpreting the script easier.
- Activate the required environment using
conda activate BTG_finders
- Generate a new folder in the home directory and call it day8_pipelines. The full path to the folder should be /home/student/day8_pipelines.
- Navigate into the folder using cd.
- Make a new file called finders_pipeline.sh within the folder using nano finders_pipeline.sh
- Copy, paste, and fill out the missing information for the vfdb_path and the pfdb_path in the following code chunk:
#!/bin/bash
# Assign values to the variables
sample="/home/student/BTG/Bacteria_Illumina/skesa_assemblies/Ec001.illumina.fasta"
vfdb_path=""
pfdb_path=""
output_folder="/home/student/day8_pipelines/output/"
Task: Execution
The very first line of this script is called a shebang, and it’s comparable to file extensions in Windows. In Windows, if you take a Word document and rename its file extension from .docx to .exe, Windows will assume that it’s a program and will try to run it once you double-click it. However, it will fail, as it’s in fact a document file and not a program.
In Unix, the shebang tells the system which program is required to run the script. If no shebang line is added, you have to tell the system manually which program should execute the script.
Scripts are easy to execute: you just have to point to the file with a preceding dot and forward slash (./).
- Can you name the program which is used to execute this script?
- Try to execute the script using ./finders_pipeline.sh. Did any errors show up?
Right now the file's permissions only allow reading and writing. As a safety mechanism, files in Unix can’t be executed unless you change their permissions to allow it.
- Allow execution of the file by running
chmod u+x finders_pipeline.sh
- Explanation: u means the user who owns the file (here: you), + means add a permission, x means the execute permission.
- Try to execute the script again. Did any errors show up this time?
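The two ideas above (the shebang and the execute permission) can be tried out safely on a throwaway file. The following is a minimal sketch, assuming a temporary folder made with mktemp; the file name hello.sh and the variable names are made up for this illustration and are not part of the exercise.

```bash
#!/bin/bash
# Hypothetical demo file, created in a temporary folder so none of your own files are touched
demo_dir=$(mktemp -d)
demo_script="$demo_dir/hello.sh"

# Write a two-line script: a shebang followed by one command
printf '%s\n' '#!/bin/bash' 'echo "Hello from the demo script"' > "$demo_script"

# A readable script can be run by naming the interpreter explicitly,
# even when it has no execute permission
via_bash=$(bash "$demo_script")
echo "$via_bash"

# Check the execute permission before and after chmod
if [ -x "$demo_script" ]; then before="yes"; else before="no"; fi
chmod u+x "$demo_script"
if [ -x "$demo_script" ]; then after="yes"; else after="no"; fi
echo "Executable before chmod: $before, after chmod: $after"

# Clean up the temporary folder
rm -r "$demo_dir"
```

Running a script as bash script.sh ignores the shebang, because you have already named the interpreter yourself. Running it as ./script.sh relies on both the shebang and the execute permission.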
Task: Telling the chef what to do…
Now that the ingredients list has been set up, let’s start making the script usable.
Currently, we are pointing to an output file path which does not exist, like telling the chef to put the dishes on an imaginary table. Not very helpful!
A great start is to ensure that the folder is created early on.
- In the finders_pipeline.sh file, directly after the variables section, add the following lines:
# Create output folders
mkdir -p "$output_folder"/vf
mkdir -p "$output_folder"/pf
mkdir -p "$output_folder"/af
A note on the quotes: suppose you have an $output variable which you use to point to the file …/Ecc001_results.txt. If you wrote $output_results.txt, the script would look for a variable called $output_results instead of $output. To prevent this behavior, just add quotes: "$output"_results.txt
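The quoting behavior described above can be checked with a few lines of bash. This is a small sketch with a made-up value for output; the curly brace form ${output} is not used elsewhere in these exercises, but it is a standard bash alternative that marks where the variable name ends.

```bash
#!/bin/bash
# Hypothetical prefix, used only for this demonstration
output="/tmp/Ecc001"

# Bash reads $output_results as ONE variable name, which is unset (empty) here
wrong="$output_results.txt"

# Closing the quotes right after the variable name ends the name explicitly
quoted="$output"_results.txt

# Curly braces also mark where the variable name ends
braced="${output}_results.txt"

echo "Wrong:  $wrong"
echo "Quoted: $quoted"
echo "Braced: $braced"
```

The first line prints only .txt, because the unset variable expands to nothing, while the other two print /tmp/Ecc001_results.txt.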
Now it’s time to add some lines of code which execute some of the programs which we want to use for characterization. Let’s start with VirulenceFinder and PlasmidFinder.
- Add the following lines to your script.
# Start characterization with finders
virulencefinder.py -i $sample -p $vfdb_path -o "$output_folder"/vf -xq
plasmidfinder.py -i $sample -p $pfdb_path -o "$output_folder"/pf -xq
- In another terminal, activate the BTG_finders environment, then consult the help page for one of the two finders and fill out the information on the following arguments:
- -i →
- -p → Path to the databases
- -o → Path to blast output
- -x → Extended output
- -q → Quiet mode (Hide messages)
By now your code should look something along the lines of this
Running AMRFinder
By now we are well on our way to setting up a small pipeline for characterizing isolates.
We are still missing resistance genes, so a natural next step would be to add a ResFinder command. Let’s be curious and implement AMRFinder instead.
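By calling the AMRFinder help page, we can inspect its usage. Here we have only included details for the two arguments we need:
- -n → Input nucleotide FASTA file
- -o → Output file (the results are written to the terminal if omitted)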
- Add the amrfinder call below to finders_pipeline.sh and fill in the missing variables by replacing $var1 and $var2 with the correct variable names, one of them being $output_folder.
amrfinder -n $var1 -o $var2/af/amrfinder_results.txt
- Ready to take the pipeline for a spin? Save and exit nano, then… Let’s GO
# Run this from terminal
./finders_pipeline.sh
Solution - Please minimize until you are done!
#!/bin/bash
# Assign values to the variables
sample="/home/student/BTG/Bacteria_Illumina/skesa_assemblies/Ec001.illumina.fasta"
vfdb_path="/home/student/BTG/dbs/virulencefinder_db/"
pfdb_path="/home/student/BTG/dbs/plasmidfinder_db/"
output_folder="/home/student/day8_pipelines/output/"
# Create output folders
mkdir -p "$output_folder"/vf
mkdir -p "$output_folder"/pf
mkdir -p "$output_folder"/af
# Start characterization with finders
virulencefinder.py -i $sample -p $vfdb_path -o "$output_folder"/vf -xq
plasmidfinder.py -i $sample -p $pfdb_path -o "$output_folder"/pf -xq
amrfinder -n $sample -o "$output_folder"/af/amrfinder_results.txt
Wrap up
Congratulations, you have made your very first pipeline. Provided you didn’t introduce any errors, it should run the same way each time you execute the script. This is really useful for reproducibility and to semi-automate your own workflow.
Now, the script is not very useful if you have samples other than Ec001.illumina.fasta, as you would have to change the sample variable in the script every time you wanted to run it on a different sample. Don’t worry, only very small changes are required to fix this. We will take a look at them next.
Making the pipeline run on other samples
Positional arguments
One way to make the pipeline easier to use is to replace the hard-coded input with positional arguments. Positional arguments let a script read the extra arguments that the user writes when invoking the script.
Imagine we have a small executable bash script called simple.sh, which works like this:
#!/bin/bash
firstVar=$1
secondVar=$2
echo "The first variable is $firstVar. The second variable is $secondVar."
When you execute it:
./simple.sh Fish ImSecond
The first variable is Fish. The second variable is ImSecond.
- Copy the finders_pipeline.sh script into a new file called finders_positional_pipe.sh using:
cp finders_pipeline.sh finders_positional_pipe.sh
- Open the new file (finders_positional_pipe.sh) with nano.
- Change the variable definitions so that sample= takes the first positional argument ($1) and output_folder= takes the second positional argument ($2).
- Save and exit nano.
- Execute the script, providing the following file (a new file!) and file path as first and second arguments, respectively.
- Sample: /home/student/BTG/Bacteria_Illumina/skesa_assemblies/Ec002.illumina.fasta
- Output_folder: /home/student/day8_pipelines/output/
./finders_positional_pipe.sh [sample] [output-folder]
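A script that relies on positional arguments will run with empty variables if the user forgets an argument. One possible safeguard is to count the arguments with $# before doing anything else. The sketch below is an optional addition, not part of the exercise; check_args is a hypothetical helper name chosen for this illustration.

```bash
#!/bin/bash
# check_args is a hypothetical helper: it succeeds only when exactly two arguments are given
check_args() {
    if [ "$#" -ne 2 ]; then
        # Print a usage message to stderr and signal failure
        echo "Usage: ./finders_positional_pipe.sh [sample] [output-folder]" >&2
        return 1
    fi
    return 0
}

# Two arguments: the check passes
if check_args Ec002.illumina.fasta output/; then
    echo "Two arguments given, ready to run"
fi

# One argument: the check fails and the usage message is printed
if ! check_args Ec002.illumina.fasta; then
    echo "An argument is missing, stopping here"
fi
```

In a real script, the function could be called near the top as check_args "$@" || exit 1, where "$@" stands for all arguments the user provided.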
Screening folder for samples
Another approach to enhance the usability of your pipeline is to replace the input sample file with a sample folder, and then automatically screen this folder for relevant sample files.
Screening a folder for fasta files is a bit outside the scope of this course, so we will provide the necessary code.
- First, copy the finders_positional_pipe.sh script into a new file called finders_on_folder.sh using cp.
- Open the new file (finders_on_folder.sh) with nano.
- Rename the sample variable in the variable definitions to sample_dir. Remember to leave the other $sample variables in the remainder of the script as they are.
- Add the following lines to the script right after the variables and mkdir commands:
# Screen the sample_dir for fasta files
files=$(find "$sample_dir" -maxdepth 1 -type f -name "*.fasta" | sort)
- Explanation
- -maxdepth 1 | limits the search to the folder itself, excluding sub folders
- -type f | limits the search to files only, not folders
- -name | defines the name pattern that the file or folder has to match
- sort | a command which sorts all the output, in this instance from the find command
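To see what each of these options does, the same find call can be tried on a small test folder. The sketch below builds a hypothetical folder layout in a temporary location (all file and folder names are made up) and then runs the screening line from the exercise on it.

```bash
#!/bin/bash
# Build a small hypothetical folder layout in a temporary location
sample_dir=$(mktemp -d)
mkdir "$sample_dir/subfolder" "$sample_dir/folder.fasta"
touch "$sample_dir/b.fasta" "$sample_dir/a.fasta" "$sample_dir/notes.txt"
touch "$sample_dir/subfolder/hidden.fasta"

# Same call as in the exercise
files=$(find "$sample_dir" -maxdepth 1 -type f -name "*.fasta" | sort)

# Count the hits (one file per line) and show them
n_files=$(echo "$files" | grep -c '\.fasta$')
echo "$files"
echo "Found $n_files fasta files"

# Clean up the temporary folder
rm -r "$sample_dir"
```

Only a.fasta and b.fasta are reported, in that order. notes.txt does not match the name pattern, folder.fasta is a folder and is excluded by -type f, and hidden.fasta sits in a sub folder and is excluded by -maxdepth 1.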
Solution - Inspect after finishing step 4!
#!/bin/bash
# Assign values to the variables
sample_dir=$1
vfdb_path="/home/student/BTG/dbs/virulencefinder_db/"
pfdb_path="/home/student/BTG/dbs/plasmidfinder_db/"
output_folder=$2
# Create output folders
mkdir -p "$output_folder"/vf
mkdir -p "$output_folder"/pf
mkdir -p "$output_folder"/af
# Screen the sample_dir for fasta files
files=$(find "$sample_dir" -maxdepth 1 -type f -name "*.fasta" | sort)
# Start characterization with finders
virulencefinder.py -i $sample -p $vfdb_path -o "$output_folder"/vf -xq
plasmidfinder.py -i $sample -p $pfdb_path -o "$output_folder"/pf -xq
amrfinder -n $sample -o "$output_folder"/af/amrfinder_results.txt
- The script does not work yet, as the $sample variable is no longer defined, so let’s comment out the lines which do not work. Add a # in front of all the finder commands so they will be ignored:
# Start characterization with finders
#virulencefinder.py -i $sample -p $vfdb_path -o "$output_folder"/vf -xq
#plasmidfinder.py -i $sample -p $pfdb_path -o "$output_folder"/pf -xq
#amrfinder -n $sample -o "$output_folder"/af/amrfinder_results.txt
- Now it’s time to figure out whether the fasta file screener works or not. Add the following lines directly after the screening line (the one starting with files=$(find …):
# Looping over each of the sample files individually
for sample in $files; do
    # Define the sample name from the file
    sample_name=$(basename "$sample" .fasta)
    # Print out a helpful message to show that this in fact works
    echo "The sample $sample_name is located here: $sample"
done
- Save and exit nano, then execute the script by invoking:
./finders_on_folder.sh /home/student/BTG/Bacteria_Illumina/skesa_assemblies/ /home/student/day8_pipelines/output/
If things go well, the script should print the file name for every single sample file located within the sample_dir. The second argument is provided to satisfy the output_folder variable, which expects a second argument.
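The role of basename in the loop can be tried in isolation. In this small sketch, the two paths are hypothetical and do not need to exist, since basename only works on the text of the path.

```bash
#!/bin/bash
# Hypothetical paths; they do not need to exist for basename to work
files="/data/assemblies/Ec001.illumina.fasta /data/assemblies/Ec002.illumina.fasta"

names=""
for sample in $files; do
    # Strip the folder part and the .fasta suffix from the path
    sample_name=$(basename "$sample" .fasta)
    echo "$sample -> $sample_name"
    # Collect the names so they can be inspected after the loop
    names="$names $sample_name"
done
```

Note that for sample in $files splits the list on whitespace. This works for the file names used in this course, but a path containing a space would be split into two pieces.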
- Reopen the script in nano, remove the # signs in front of the virulencefinder, plasmidfinder, and amrfinder commands, and move the finder commands into the for loop. Replace the echo line with the finder lines.
In its current state the script should automatically run the finders for each of the samples. However, as the final output files have the same names, they will be overwritten. To prevent this, we must split the results into individual, unique folders. Luckily, this was thought of in the for loop by defining the sample_name variable.
- For each of the finder lines, change the output argument from -o "$output_folder"/Xf to -o "$output_folder"/Xf/"$sample_name", where Xf denotes vf for VirulenceFinder, pf for PlasmidFinder, and af for AMRFinder. For AMRFinder, keep /amrfinder_results.txt at the end of the path.
- There is one more issue: the finders don’t create folders themselves, so the script would give you an error. Inside the loop, before the finder commands, add mkdir -p "$output_folder"/Xf/"$sample_name" for each of the finders, where Xf again denotes vf for VirulenceFinder, pf for PlasmidFinder, and af for AMRFinder.
- Add a final victory message at the bottom of the file using echo, e.g. echo Jobs done!
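The folder layout that these steps produce can be previewed without running any of the finders. The following is a rough sketch, assuming a temporary stand-in for the output folder and two made-up sample names. The inner loop over vf, pf, and af is just a compact way of writing the three mkdir lines.

```bash
#!/bin/bash
# Temporary stand-in for the real output folder
output_folder=$(mktemp -d)

# Hypothetical sample names, as basename would produce them
for sample_name in Ec001.illumina Ec002.illumina; do
    # One sub folder per finder and per sample
    for finder in vf pf af; do
        mkdir -p "$output_folder/$finder/$sample_name"
    done
done

# mkdir -p does not complain if the folder already exists
mkdir -p "$output_folder/vf/Ec001.illumina"

# List and count the sample folders that were created
created=$(find "$output_folder" -mindepth 2 -maxdepth 2 -type d | sort)
n_created=$(echo "$created" | grep -c .)
echo "$created"
echo "Created $n_created sample folders"

# Clean up the temporary folder
rm -r "$output_folder"
```

Two samples and three finders give six folders, so each finder writes the results of each sample to its own location and nothing is overwritten.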
Solution - Final script
#!/bin/bash
# Assign values to the variables
sample_dir=$1
vfdb_path="/home/student/BTG/dbs/virulencefinder_db/"
pfdb_path="/home/student/BTG/dbs/plasmidfinder_db/"
output_folder=$2
# Screen the sample_dir for fasta files
files=$(find "$sample_dir" -maxdepth 1 -type f -name "*.fasta" | sort)
# Create output folders
mkdir -p "$output_folder"/vf
mkdir -p "$output_folder"/pf
mkdir -p "$output_folder"/af
# Looping over each of the sample files individually
for sample in $files; do
    # Define the sample name from the file
    sample_name=$(basename "$sample" .fasta)
    # Create one output folder per sample for each finder
    mkdir -p "$output_folder"/vf/"$sample_name"
    mkdir -p "$output_folder"/pf/"$sample_name"
    mkdir -p "$output_folder"/af/"$sample_name"
    # Start characterization with finders
    virulencefinder.py -i "$sample" -p "$vfdb_path" -o "$output_folder"/vf/"$sample_name" -xq
    plasmidfinder.py -i "$sample" -p "$pfdb_path" -o "$output_folder"/pf/"$sample_name" -xq
    amrfinder -n "$sample" -o "$output_folder"/af/"$sample_name"/amrfinder_results.txt
done
echo Jobs done!
CONGRATZ on your first pipeline!!!