In this practical, you will learn to create a phylogenetic tree from an alignment and visualise it in different tools.
For this exercise, you will need the following standalone software:
Note: If you have done these steps for session 1, there is no need to redo them.
All required files for the practicals are deposited in the GitHub repo github.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7.
To get started, clone this repo to your computer.
cd <your preferred location>
git clone git@github.com:ssi-dk/GenEpi-BioTrain_Virtual_Training_7.git
cd GenEpi-BioTrain_Virtual_Training_7
To have the required tools installed on your computer, use conda with the provided environment .yaml files:
conda env create -f phylo.env.yaml
Important: Create a subfolder within the repo folder for each tool you are running on the command line, so the output of each tool is in its own folder.
In Part 1 of this practical, we will create phylogenetic trees using different methods.
Neighbor joining (NJ) is a bottom-up (agglomerative) clustering method for building phylogenetic trees, introduced by Naruya Saitou and Masatoshi Nei in 1987. Neighbor joining takes as input a distance matrix, which specifies the distance between each pair of taxa. The algorithm starts with a completely unresolved tree, whose topology corresponds to that of a star network, and iterates over several deterministic steps until the tree is completely resolved and all branch lengths are known.
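To make the iteration concrete, here is a minimal sketch of a single neighbor-joining step in plain Python. It is not how MEGA implements the method. The function name `nj_join_step`, the dictionary-based distance matrix and the five-taxon toy matrix are assumptions made for illustration only.

```python
from itertools import combinations


def nj_join_step(names, dist):
    """Perform one neighbor-joining step (illustrative sketch).

    names: list of taxon labels (at least three)
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    Returns (joined pair, branch lengths to the new node, new names, new dist).
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three taxa")

    def d(a, b):
        return 0.0 if a == b else dist[frozenset((a, b))]

    # Row sums: total distance from each taxon to all others.
    r = {a: sum(d(a, k) for k in names) for a in names}
    # Q criterion: join the pair minimising (n - 2) * d(i, j) - r_i - r_j.
    i, j = min(combinations(names, 2),
               key=lambda p: (n - 2) * d(*p) - r[p[0]] - r[p[1]])
    # Branch lengths from i and j to the new internal node u.
    len_i = d(i, j) / 2 + (r[i] - r[j]) / (2 * (n - 2))
    len_j = d(i, j) - len_i
    u = f"({i},{j})"
    # Reduced matrix: keep the untouched distances, add distances to u.
    rest = [k for k in names if k not in (i, j)]
    new_dist = {frozenset((a, b)): d(a, b) for a, b in combinations(rest, 2)}
    for k in rest:
        new_dist[frozenset((u, k))] = (d(i, k) + d(j, k) - d(i, j)) / 2
    return (i, j), (len_i, len_j), rest + [u], new_dist


# Toy distance matrix (hypothetical taxa a-e, not course data).
names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {frozenset((names[x], names[y])): rows[x][y]
        for x, y in combinations(range(len(names)), 2)}
pair, lengths, new_names, new_dist = nj_join_step(names, dist)
print(pair, lengths)
```

For this toy matrix the first step joins a and b, with branch lengths 2 and 3 to the new internal node. Repeating the step on the reduced matrix until only two nodes remain gives the fully resolved tree.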
Here we use the MEGA software to create an NJ tree.
If you don't have it available yet, download the required file 16s_sequences_mafft_alignment.fasta from EVA or via command line using:
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/data/16s_data/16s_sequences_mafft_alignment.fasta
- 16s_sequences_mafft_alignment.fasta file on the window. The file is available in the mafft folder from session 1, in the data/16s_data folder in your git repo, or wherever you have downloaded it from EVA or via wget.
- Analyze because the file is already aligned
- Nucleotide Sequences
- Yes when asked if these are protein-coding sequences because we are using the full 16S sequence.
- Standard
- NJ phylogeny
- yes for the current file
- OK

A tree showing the phylogenetic relationship appears. Read the caption and decide if you agree.
Questions:
We again use the MEGA software, this time to create a maximum parsimony tree.
- yes for the current file
- OK. Make sure you enter 100 bootstrap replicates in the “Test phylogeny” field.
- Topology only mode. Click the Topology only button in the menu bar to switch to the tree with actual branch lengths.

A tree showing the phylogenetic relationship appears. Read the caption and decide if you agree.
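The 100 bootstrap replicates requested above follow the usual nonparametric bootstrap idea: alignment columns are resampled with replacement, a tree is rebuilt from each resampled alignment, and the support for a clade is the fraction of replicates that contain it. The sketch below shows only the resampling step. It is an illustration with a hypothetical toy alignment and function name, not MEGA's internal code.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths)
    rng:       a random.Random instance, so replicates are reproducible
    Returns a new dict with the same taxa and the same alignment length.
    """
    length = len(next(iter(alignment.values())))
    # Draw `length` column indices with replacement.
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


# Hypothetical toy alignment (not course data).
toy = {"t1": "ACGT", "t2": "ACGA", "t3": "TCGA"}
rng = random.Random(1)
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
```

Each of the 100 replicates would then be passed to the tree-building method, and the resulting trees compared against the tree from the original alignment.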
Questions:
Now we switch to the command line and to the core genome SNPs from the L. monocytogenes dataset from session 1.
If you don't have it available yet, download the required file core.aln from EVA or via command line using:
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/data/core.aln
Load the phylo environment using the following command:
source activate phylo
And cd into the directory with the course data:
cd <your_repo_path>
To obtain a good alignment of SNPs, we need to take care of regions of putative recombination. We use gubbins to remove these regions from the SNP matrix.
To do so, we first need to strip odd characters from the matrix, because gubbins doesn’t like them. The sed command below deletes everything from the first :: to the end of each line:
mkdir gubbins
cd gubbins
sed -r 's/::.*//' ../data/core.aln > core_stripped.fasta
This creates a copy of the matrix with the odd characters stripped.
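To see what the sed expression does to a single line, the same substitution can be sketched in Python. The header `>sample_01::contig_7` is a made-up example. The actual headers in core.aln may look different.

```python
import re


def strip_after_double_colon(line):
    """Mimic sed -r 's/::.*//': drop the first '::' and everything after it."""
    return re.sub(r"::.*", "", line, count=1)


# Hypothetical lines, for illustration only.
header = ">sample_01::contig_7"
sequence = "ACGT-NACGT"

print(strip_after_double_colon(header))    # header is shortened
print(strip_after_double_colon(sequence))  # sequence lines are left unchanged
```

Lines without :: pass through untouched, so in this sketch only the sequence names are affected and the alignment itself is preserved.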
We can then run the gubbins command:
run_gubbins.py core_stripped.fasta -c 8
This will output a purged SNP matrix called core_stripped.filtered_polymorphic_sites.fasta, with fewer sites but the same number of taxa.
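One way to check the "fewer sites, same number of taxa" claim is to count the sequences and the alignment length before and after. The following is a minimal standard-library sketch. The function name and the toy FASTA strings are assumptions, and the parser only handles plain aligned FASTA.

```python
def alignment_shape(fasta_text):
    """Return (number of sequences, alignment length) for aligned FASTA text."""
    seqs = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Start a new record; the full header is used as the name.
            name = line[1:].strip()
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    lengths = {len("".join(parts)) for parts in seqs.values()}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length; not a valid alignment")
    return len(seqs), (lengths.pop() if lengths else 0)


# Toy data standing in for the matrix before and after filtering.
before = ">a\nACGT\n>b\nACGA\n"
after = ">a\nT\n>b\nA\n"
print(alignment_shape(before), alignment_shape(after))
```

On the real data, the text of core_stripped.fasta and of core_stripped.filtered_polymorphic_sites.fasta can be passed in, for example via `open(path).read()`.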
We use the very versatile software IQ-TREE to produce a high-quality maximum likelihood tree from the purged SNP matrix.
If you don't have it available yet, download the required file core_stripped.filtered_polymorphic_sites.fasta from EVA or via command line using:
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/data/core_stripped.filtered_polymorphic_sites.fasta
Run iqtree:
cd ..; mkdir iqtree; cd iqtree
iqtree -s ../gubbins/core_stripped.filtered_polymorphic_sites.fasta -m TEST+ASC -T AUTO --threads-max 8 -pre ML_iqtree -mem 8GB
This creates the output file ML_iqtree.treefile, which is a tree file in Newick format. To use it further, we need to make a copy with the extension .nwk:
cp ML_iqtree.treefile ML_iqtree.treefile.nwk
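A Newick file is plain text: nested parentheses describe the clades, and each name may be followed by a colon and a branch length. As a rough illustration, the sketch below pulls the tip labels out of a Newick string with a regular expression. It assumes unquoted labels without spaces or punctuation, and the example tree is made up.

```python
import re


def tip_labels(newick):
    """Return leaf names from a Newick string (unquoted labels assumed).

    A leaf name directly follows '(' or ','. Labels that follow ')' are
    internal node labels or support values and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


# Hypothetical three-taxon tree with one support value (95).
example = "(A:0.1,(B:0.2,C:0.3)95:0.05);"
print(tip_labels(example))
```

Applied to the contents of ML_iqtree.treefile.nwk, this should list the sample names in the tree, which is useful later when matching them to metadata.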
We use the very fast software FastTree to produce an approximate maximum likelihood tree from the purged SNP matrix.
cd ..; mkdir fasttree; cd fasttree
fasttree -nt -gtr ../gubbins/core_stripped.filtered_polymorphic_sites.fasta > core_fasttree.nwk
This creates the output file core_fasttree.nwk, which is a tree file in Newick format.
In this part, we will visualize the obtained tree using different methods.
Microreact is a tool for open data visualization and sharing for genomic epidemiology. It is freely available and is widely used in public health data analysis.
If you don't have them available yet, download the required files from EVA or via command line using:
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/data/ML_iqtree.treefile.nwk
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/metadata/metadata.tsv
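Microreact links tree tips to metadata rows through a key column, so before uploading it can help to confirm that every tip label has a matching row. The sketch below assumes the key column in metadata.tsv is called `id`. That name, the function name and the toy table are assumptions for illustration.

```python
import csv
import io


def ids_missing_from_metadata(tip_names, tsv_text, id_column="id"):
    """Return tip labels that have no row in the tab-separated metadata."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    known = {row[id_column] for row in reader}
    return sorted(set(tip_names) - known)


# Toy metadata; the Region values are made up.
toy_tsv = "id\tRegion\nReference\tNA\nSRR27240806\tNorth\n"
tips = ["Reference", "SRR27240806", "SRR27240812"]
print(ids_missing_from_metadata(tips, toy_tsv))
```

An empty list means every tip can be annotated. Any names that are printed would show up in Microreact without metadata.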
To get your tree visualized and annotated in Microreact, do the following:
- upload
- Reference, and samples SRR27240806, SRR27240812 and SRR27240820 as outgroup (use the right-click menu on their common ancestor branch)
- metadata.tsv (available in the metadata folder) to the tree by linking the tree tip labels (id) to the key column
- Region, KMA and SampleMaterial using the Metadata blocks button
- lat and long columns from the metadata for the coordinates.
- .png and the tree as a .svg file.

Questions:
iTOL is an online tool for visualizing phylogenies and related metadata. The tool is free to use, but a paid subscription for saving your annotations was introduced a few years ago.
The tool is frequently used for publication-ready phylogenetic trees.
If you don't have them available yet, download the required files from EVA or via command line using:
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/data/ML_iqtree.treefile.nwk
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/metadata/metadata.tsv
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/metadata/dataset_color_gradient_template.txt
wget https://raw.githubusercontent.com/ssi-dk/GenEpi-BioTrain_Virtual_Training_7/main/metadata/dataset_color_strip_template.txt
To get your tree visualized and annotated in iTOL do the following:
- Upload
- choose file
- Reference and samples SRR27240806, SRR27240812 and SRR27240820 as outgroup (use the submenu Tree structure)
- dataset_color_strip_template.txt and dataset_color_gradient_template.txt from the metadata folder to add annotations:
  - Datasets in the Control panel
  - Upload annotation files
  - upload
Note: More templates can be downloaded from https://itol.embl.de/help.cgi#annoTemplate
- .pdf

ETE3 is a Python toolkit for phylogenetic analysis and for visualizing phylogenetic trees.
Here we have prepared a basic script to plot our tree, called ete3_phylo.py (available in the scripts folder).
This script is dependent on the correct folder structure and file names, namely:
iqtree/ML_iqtree.treefile.nwk and metadata/metadata.tsv, as well as the script in the scripts folder.
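Because the script relies on these relative paths, a quick check from the repo folder can save a confusing error message. The sketch below only tests whether the three files named above exist. The function name is hypothetical and the check is not part of ete3_phylo.py.

```python
from pathlib import Path

# Relative paths taken from the text above.
REQUIRED = [
    "iqtree/ML_iqtree.treefile.nwk",
    "metadata/metadata.tsv",
    "scripts/ete3_phylo.py",
]


def missing_files(repo_root, required=REQUIRED):
    """Return the required relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the repo folder; an empty list means everything is in place.
    print(missing_files("."))
```

If the list is not empty, create or copy the missing files into place before running the script.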
To run the script, open a console and type the following commands:
python scripts/ete3_phylo.py
open mytree.png
Open the file mytree.png and compare it to the figures obtained in other tools.
Inspect the script and try to answer the following questions:
What does the my_layout() function do?
If you have some extra time, try to change some of the settings in the script.