Bioinformatic Data Analysis, Computer Practical 3

Goal:

to become familiar with phylogeny software
to demonstrate differences between species and gene trees
to demonstrate differences between DNA and protein sequence based phylogenetic trees
to learn how to interpret phylogenetic trees

Before starting the computer exercise, make sure you have read the following page:

Help page of Phylogeny tools in EBI

Remarks: We will perform phylogenetic analysis with the tools that are part of the Clustal package. Before starting this computer exercise, go through the Protocol on the Phylogeny (distributed as a separate sheet, but it can also be downloaded here). If you are having trouble with one of the questions, have a look at the Hints section. Write down your answers, either on paper or in digital form (e.g. a Word document). It is important to make notes! Write down for yourself (broadly) what you have done and what your results were, and try to formulate a bottom line (in other words: what have you learned from the exercise?). The last part of the SARS question is very similar to what you will need to do for your project work if you choose to work on the origins of HIV-1. If one of the webservers linked in the questions is offline or too slow, you might find alternative servers in the Links section.

Ubiquitin

Let us return to the ubiquitin molecule, which you studied in the last question of the previous computer exercise. Recall what you learned about the evolution of ubiquitin.

Construct multiple sequence alignments and phylogenies using Protein and DNA sequences of ubiquitin with Clustal Omega. Which tree gives more evolutionary information? Why? Hints

Where would you estimate the root of this tree to be? Reroot your tree using Itol. Hints

Check the species tree in the Tree of Life project. A zoomed-out tree can be found here; use this tree to view all species in a single tree. Compare this species tree with the tree you made using ubiquitin sequences. What is the difference? Hints

SARS

SARS genome and virus particle

In 2003, a near-worldwide pandemic of the Severe Acute Respiratory Syndrome (SARS) coronavirus caused more than 700 deaths. We have collected matrix proteins from 15 coronaviruses (note that only one of them is SARS) in Matrix proteins. The name of each sequence starts with the species the virus infects (e.g., bat, rat, human, etc.). CV stands for coronavirus, and if there is more than one strain of the same virus in this data set, the strain is indicated by the last part of the name. For example, Porcine_CV_VW572 stands for pig coronavirus strain VW572, and Bovine_CV stands for cow coronavirus.
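The naming convention described above can be read programmatically, which helps when you want to group the leaves of a tree by host. The following is a minimal sketch, assuming that every name has the form host, then _CV, then an optional strain; the function name parse_virus_name is hypothetical, and names in the actual data set that deviate from this pattern would need extra handling.

```python
def parse_virus_name(name):
    """Split a name like 'Porcine_CV_VW572' into (host, strain).

    Assumes the pattern <host>_CV[_<strain>]; strain is None when absent.
    """
    host, separator, rest = name.partition("_CV")
    if not separator:
        # No '_CV' marker found: treat the whole name as the host.
        return name, None
    strain = rest.lstrip("_") or None
    return host, strain


# The two example names given in the text above.
for sequence_name in ["Porcine_CV_VW572", "Bovine_CV"]:
    host, strain = parse_virus_name(sequence_name)
    print(sequence_name, "-> host:", host, "strain:", strain)
```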

Make a phylogenetic analysis of these viruses based on the matrix protein.

What does the tree suggest with regard to the origin of SARS?

Can you find articles in the literature that support your conclusion? Use Pubmed and Google Scholar.

Now let's look at another protein from the coronaviruses. Collect, for example, spike proteins from these coronaviruses by first finding one spike protein from a bat coronavirus at NCBI. Then do a BLAST search to find spike proteins from other coronaviruses. Use BLAST to generate a dataset that includes several different host species, as in the matrix protein dataset. Perform the phylogenetic analysis on this new dataset. For SARS, do you see the same relationship as with the matrix proteins? Hints

The Middle East Respiratory Syndrome (MERS) coronavirus is another coronavirus, which caused a major worldwide scare in 2012. Add spike protein sequences from the MERS coronavirus to your dataset. Perform the phylogenetic analysis on this new dataset. Can you discover in which animal the closest relative of the human MERS coronavirus is found?

Links

BLAST servers:

Alignment and Phylogeny webservers:

Tree of Life

Itol (to reroot your trees).

NCBI Entrez

Hints

Ubiquitin

Note that the protein sequences in this computer exercise are unaligned and in FASTA format, so they must be aligned before you can make a phylogenetic tree! To do this, first paste the sequences into the Clustal Omega page and align them. To make a phylogenetic tree, paste the aligned sequences into the submission form on the EBI Phylogeny page, including the first line (the line that says CLUSTAL 2.1 multiple sequence alignment). The aligned sequences can also be downloaded by clicking on the file links (top of the ClustalW result page). It would of course be better if the EBI server aligned the sequences automatically before making a tree, but unfortunately it does not. The phylogeny software will generate a tree even from unaligned sequences, so you have to remember to do the alignment first yourself. Remember that the guide tree on the alignment page is not a phylogenetic tree, but a tree built from pairwise distances with the UPGMA method. The guide tree is only a "guide" for generating the multiple alignment.
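Because the server does not check whether its input is aligned, a quick sanity check of your own can help. The sketch below assumes the sequences are available as FASTA text (the EBI form itself expects the Clustal output); the helper names read_fasta and looks_aligned and the toy sequences are hypothetical. Equal sequence lengths are a necessary condition for an alignment, not proof that the alignment is a good one.

```python
def read_fasta(text):
    """Parse FASTA text into a dict: first word of each header -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def looks_aligned(records):
    """Return True if all sequences (gaps included) have the same length."""
    lengths = {len(sequence) for sequence in records.values()}
    return len(lengths) == 1


# Hypothetical toy sequences, not real ubiquitin data.
unaligned = ">seqA\nMQIFVK\n>seqB\nMQIFV\n"
aligned = ">seqA\nMQIFVK\n>seqB\nMQIFV-\n"

print("unaligned input passes check:", looks_aligned(read_fasta(unaligned)))
print("aligned input passes check:", looks_aligned(read_fasta(aligned)))
```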
Scroll down the page to see the tree. This is a phylogram; to get a cladogram, press the Show as Cladogram Tree button.
From these two figures, can you tell the difference between a cladogram and a phylogram? Both are phylogenetic trees, but a phylogram not only indicates the relationships between the taxa; through its branch lengths it also conveys a sense of time or rate of evolution. This temporal aspect is missing from a cladogram.
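The difference is also visible in the Newick text that tree programs exchange: a phylogram carries a branch length after each colon, while a cladogram has only the branching pattern. As a rough illustration, the sketch below strips the branch lengths from a Newick string. The function name to_cladogram and the example tree with its branch lengths are made up for this illustration and are not results from the exercise.

```python
import re


def to_cladogram(newick):
    """Remove branch lengths (':0.123') from a Newick string, keeping the topology."""
    return re.sub(r":[0-9.eE+-]+", "", newick)


# Hypothetical tree with invented branch lengths.
phylogram = "((human:0.01,mouse:0.03):0.10,yeast:0.25);"
print(to_cladogram(phylogram))
```

Both strings describe the same relationships between the taxa; only the information about the amount of change along each branch is lost.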
Among the parameters, there is also a setting called CORRECT DIST. This is the same as the Kimura 2-parameter correction (see your reader), which corrects the evolutionary distances for multiple substitutions at the same site. See whether this setting has an effect on the phylograms you produce.
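To get a feeling for what such a correction does, the sketch below computes the standard Kimura 2-parameter distance for two aligned DNA sequences, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P is the proportion of sites with a transition and Q the proportion with a transversion. This is a textbook formula, not the server's own code; whether the server computes exactly this is an assumption, and the function name and the toy sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (equal length)")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the sequences are too divergent
    # for the correction to be defined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy sequences that differ by a single transition (C -> T) in 10 sites.
uncorrected = 1 / 10
corrected = k2p_distance("ACGTACGTAC", "ACGTACGTAT")
print("uncorrected:", uncorrected, "corrected:", round(corrected, 4))
```

The corrected distance is always at least as large as the uncorrected proportion of differing sites, and the gap grows as the sequences become more divergent.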
Remember that the trees you obtain are UNROOTED trees. To make a rooted tree, one has to use an outgroup, such as yeast ubiquitin in this case.
On the Tree of Life Web Project page, the search function works only with the Latin names of the organisms. Alternatively, start at the root of the tree and 'click your way up' to the species you have in this question. The added advantage is that you get some very nice pictures along the way. Keep in mind that your trees were unrooted; that is, the root can be anywhere.

SARS

To search for relevant articles, you can make use of the keywords: SARS and bats
With regard to the final exercise: there are various ways of doing this. Remember that you can search for specific organisms by adding "organism"[ORGN] to your search (e.g. "Murine Hepatitis virus"[ORGN], "sars coronavirus"[ORGN] or "coronavirus"[ORGN]), although you will see that it is not always easy to find viruses this way, as most do not have a scientific name. Once you have found one spike protein, the other ones are most easily found by performing a BLAST search. Limit your BLAST search to viruses only. When the BLAST search is finished, scroll down to the alignments and select the sequences you want to retrieve. At the end of the page you will see a "Get selected sequences" button. Once you have your sequences in a file in FASTA format, edit the header lines (the first line of each sequence) so that the first word indicates which virus this spike protein is from. You should use unique names. You need this renaming to be able to see in your phylogenetic tree which sequences belong to which viruses. When you have finished preparing this dataset, move on to the sequence alignment and phylogeny. If you are stuck with preparing this dataset, you can also use the Spikes file, which is ready-made for this exercise. If you correct distances while generating this tree, you will observe an improvement.
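Editing the header lines by hand works fine for a handful of sequences, but the renaming step can also be scripted. The sketch below is one possible way, assuming you supply the new names yourself in the same order as the sequences in the file. The function name rename_headers, the accession-like identifiers and the toy sequences are all invented for illustration.

```python
def rename_headers(fasta_text, new_names):
    """Put a short unique name first on each FASTA header, keeping the old header after it."""
    if len(set(new_names)) != len(new_names):
        raise ValueError("names must be unique")
    names = iter(new_names)
    output = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            try:
                name = next(names)
            except StopIteration:
                raise ValueError("more sequences than names")
            output.append(f">{name} {line[1:].strip()}")
        else:
            output.append(line)
    if next(names, None) is not None:
        raise ValueError("more names than sequences")
    return "\n".join(output) + "\n"


# Invented headers and sequences, for illustration only.
spikes = (
    ">XXX00001.1 spike glycoprotein [some bat coronavirus]\n"
    "MKLLVLVFAT\n"
    ">XXX00002.1 spike glycoprotein [some human coronavirus]\n"
    "MFIFLLFLTL\n"
)
renamed = rename_headers(spikes, ["Bat_CV", "Human_CV"])
print(renamed)
```

Because the new name is the first word of the header, it is what most alignment and tree programs will show as the leaf label.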