Exercise: Generate a pairwise sequence alignment using the
Needleman-Wunsch algorithm
Exercise: Generate a multiple sequence alignment using muscle and mafft
Exercise: Get familiar with BLAST tools in various ways Exercise:
Generate a core genome SNP alignment using Snippy
All required files for the practicals are deposited in the github
repo github.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7.
To get started, clone this repo to your computer.
cd <your preferred location>
git clone https://github.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7.git
cd GenEpi-BioTrain_Virtual_Training_7To have the required tools installed on your computer, use
conda with the provided environment .yaml
files:
conda env create -f alignment.env.yaml
conda env create -f phylo.env.yaml
conda env create -f snippy.env.yamlImportant: Create a subfolder within the repo folder for each tool you are running on the command line, so the output of each tool is in its own folder.

Gap penalty = -2

Use the environment alignment
Align the sequences found in data/16S_data/16s_sequences.fasta using
mafft and muscle from command line
If you don't have them available yet, download the sequences using:
wget https://github.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/raw/main/data/16s_data/16s_sequences.fastaHave a look at the file
less 16s_sequences.fastaCheck MAFFT helper
mafft -hRun MAFFT to align the sequences in 16s_sequences.fasta
mafft 16s_sequences.fasta > 16s_sequences_mafft_alignment.fastaCheck MUSCLE helper
muscle -hRun MUSCLE to align the sequences in 16s_sequences.fasta
muscle -align 16s_sequences.fasta -output 16s_sequences_muscle_alignment.fastaFor this exercise we will use the Listeria assemblies found in assemblies.tar.gz
If you have not already downloaded and extracted these, either download the tar-file from EVA or from github using
wget https://github.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/raw/main/data/assemblies.tar.gzYou can extract the assemblies using
tar -xf assemblies.tar.gzThis will create a folder named “assemblies” with genome assemblies for 22 isolates. Have a look at the first one with
less assemblies/SRR27240806.fastaWe will use blast to identify the v3-v4 region of the 16s rRNA gene
in this assembly. For that we will need a reference sequence.
If you have not already downloaded this, do so using
wget https://github.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/raw/main/data/16s_data/16s_v3_v4_reference.fastahave a look at the file
less 16s_v3_v4_reference.fastaUse the environment alignment
First check out the blast help file
blastn -hNow we'll use blast to identify the 16s v3-v4 region in one of our isolates. For that we will use the v3-v4 reference sequence fasta file as query and the assembly fasta file as subject. Output the data to a file.
blastn -query 16s_v3_v4_reference.fasta -subject assemblies/SRR27240806.fasta -out SRR27240806_16s_blast.txtHave a look at the output
less SRR27240806_16s_blast.txtNow do the same blast, but with a more programmatic-friendly tab separated output
blastn -query 16s_v3_v4_reference.fasta -subject assemblies/SRR27240806.fasta -out SRR27240806_16s_blast.tsv -outfmt 6And have a look
less SRR27240806_16s_blast.tsvAgain with a programmatic friendly output, but this time we want to output specific information. Like the exact DNA sequence of the subject match and the length of the input sequence
blastn -query 16s_v3_v4_reference.fasta -subject assemblies/SRR27240806.fasta -out SRR27240806_16s_blast_2.tsv -outfmt "6 qseqid sseqid length qlen pident sseq"And have a look at the output again
less SRR27240806_16s_blast_2.tsvWe can also blast multiple query sequences at once. Let us try with all the different 16s sequences we aligned earlier.
blastn -query 16s_sequences.fasta -subject assemblies/SRR27240806.fasta -out SRR27240806_16s_all_blast.txtAnd have a look at the output again
less SRR27240806_16s_all_blast.txtBLAST is fast with a small data set like a single bacterial genome, but with large subject sequences it can get computationally heavy. By building a pre-indexed database that can be passed with the -db option, rather than parsing a subject sequence with the -subject option, you can speed up your blast search. Additionaly, it is easy to include data from multiple genomes in our database, so we can identify sequences in multiple isolates with a single blast search.
We want to create a blast database containing the genomes of all 22 isolates in the assemblies folder. To do this, we first have to combine (or in computer-speak "concatenate"), all our assemblies into a single fasta file.
To do this, use:
cat assemblies/*.fasta > Listeria_assemblies.fastaWe can now create a blast database from this file. First create a folder for your database:
mkdir blast_DBAnd then make the database using
makeblastdb -hash_index -in Listeria_assemblies.fasta -out blast_DB/Listeria_assemblies -dbtype nuclHave a look at the files in the "blast_DB" folder
ls -l blast_DBWe can now query the reference 16s sequence against all the assemblies at once
blastn -query 16s_v3_v4_reference.fasta -db blast_DB/Listeria_assemblies -out Listeria_16s_blast.txtAnd have a look at the output
less Listeria_16s_blast.txtAnd lastly, here is a neat little trick if you want to quickly compare how a gene differs across a set of isolates:
blastn -query 16s_v3_v4_reference.fasta -db blast_DB/Listeria_assemblies -qcov_hsp_perc 90 -perc_identity 80 -out Listeria_16s_blast.tsv -outfmt "6 sseqid sseq"Note that our output contains the name of the subject sequence in column 1, and the aligned part of the subject sequence in column 2.
Now run this command:
while IFS=$'\t' read -r col1 col2; do echo -e ">$col1\n$col2"; done < Listeria_16s_blast.tsv > Listeria_16s_sequences.fastaThis will read the tab separated file and output a fasta-formatted file with column 1 ans the headers and column 2 as the sequence.
Have a look at the file:
less Listeria_16s_sequences.fastaDo note that the sequences contain gaps because they have been individually pairwise aligned to the reference. This does not mean they are all aligned to each other! Since sequences may contain insertions relative to the reference which others do not.
But if you want to investigate the diversity within the gene, you can ofcourse do Multiple Sequence Alignment with f.ex. MAFFT or MUSCLE
Have a look at the file:
muscle -align Listeria_16s_sequences.fasta -output Listeria_16s_sequences_muscle.fastaUse a screen so you can have the job running in the background:
screencd back into your data folder
Use the environment env_snippy_v4.3.6
. activate env_snippy_v4.3.6Run Snippy on the 22 assembly files in assemblies/
mkdir snippy
cd snippy
for f in ../assemblies/*.fasta; do n=$(basename $f); n=${n/.fasta}; snippy --outdir $n --ctgs ${f} --reference ../assemblies/SRR27240806.fasta; doneWhen it's running, you can detach the screen using
Ctrl+a, followed by d to return to your
original terminal window.
To get back into the screen, use screen -r
Note: For running snippy on raw reads (which is much more reliable but takes a bit longer):
for f in ../reads/SRR272408*_1*; do n=$(basename $f); snippy --outdir ${n/_1*} --R1 $(dirname ${f})/${n} --R2 $(dirname ${f})/${n/_1*}_2.fastq.gz --reference ../assemblies/SRR27240806.fasta; done
When snippy is done, you need to create the core-genome using:
snippy-core --ref ../assemblies/SRR27240806.fasta SRR*and finally you can exit the screen:
exit