Bioinformatic Data Analysis, Computer Practical 1

Make sure that you use Firefox instead of Internet Explorer.

You can start Firefox by opening the Start menu and typing Firefox under "Run", or by using the link under Standard Applications.

Goals (learning objectives):

to become familiar with multiple alignment software (CLUSTALW) available on the Internet,
to learn the differences between protein and nucleotide alignments,
to study the effect of the chosen parameters (such as similarity matrices and gap penalties) on the alignments.

Remarks: If you are having trouble with the questions, have a look at the All Hints section. Write down your answers, either on paper or in digital form (e.g. a Word document). It is important to make notes! Write down for yourself (broadly) what you have done, what your results were, and try to formulate a bottom line (or in other words: what have you learned from the exercise?). If one of the webservers linked in the questions is offline or too slow, you might find alternative servers in the links section. Do not log out before you discuss your results with your TA.

Webpages of the NCBI and the EBI

Before starting the computer exercise, make sure that you have watched the following videos:

How to Retrieve Sequences for an Organism using NCBI
How to download sequences (watch the first 2.5 minutes)
Obtain genomic sequence around a gene using NCBI

and have read the following page:

Help page of Clustal Omega

This material will prepare you to work with the NCBI and EBI servers. Both servers offer a wide range of services and databases, and have quite extensive help or tutorial sections (tutorial section of the NCBI). If you run into problems during this Computer Exercise (or any of the following ones), please have a look at these pages.

Cytochrome

Cytochromes are mostly membrane-bound proteins that contain heme groups and carry out electron transport or catalyze reductive/oxidative reactions. In eukaryotes, cytochromes are found in the inner membrane of mitochondria and in the endoplasmic reticulum.

cytochrome b

The file CytBProt contains the amino acid sequences of the cytochrome B proteins from the mitochondrial genome of 16 vertebrate species. The sequences are labeled with species name. Take a look at these sequences. Which format is used to represent the sequences? Will very long gaps be necessary for aligning these sequences? Why?
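If you would like to check the sequence lengths yourself rather than by eye, the following is a minimal sketch of reading labeled sequence records, assuming the file is in FASTA format (a ">" label line followed by one or more sequence lines). The function name read_fasta and the two toy records are made up for illustration and are not taken from CytBProt.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (label, sequence) pairs from a FASTA-formatted text stream."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new label line closes the previous record, if there was one.
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        else:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


# Toy records, made up for illustration (not taken from CytBProt).
example = """>species_A
MTNIRKSHPL
IKIINHSFID
>species_B
MTNIRKTHPLMK
"""

records = list(read_fasta(StringIO(example)))
for label, seq in records:
    print(label, len(seq))
```

To use it on a real file, pass an open file object instead of the StringIO wrapper. Comparing the printed lengths is one way to start thinking about whether long gaps will be needed.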

Try Clustal Omega at EBI (or one of the other ClustalW servers given in the links section if the EBI server does not work) to make a multiple alignment of these protein sequences. Look at the alignment using colors. Can you identify conserved regions that are longer than 10 amino acids? Hints

The file CytBDNA contains the nucleotide sequences that encode the cytochrome B proteins from the same species. Make an alignment of the DNA sequences (DNA alignments take longer, so be patient). Remember to set the sequence type to DNA. Does the DNA alignment look as you expected it to, given what you saw in the protein alignment? What is unexpected (if anything)? For example, are there gaps within the sequences, and if so, how large are they? Why? What parameters can you change to correct a possible mistake? Look at the conservation of the alignment positions. Do you see a pattern in the conservation? Hints
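For the question about gaps within the sequences, a small script can list them for you. The sketch below assumes that you have copied aligned rows out of the result page as plain strings with "-" as the gap character. The function name internal_gap_runs and the two toy rows are hypothetical. It reports only gaps inside a sequence and skips gaps at the head and tail.

```python
import re


def internal_gap_runs(aligned_seq, gap="-"):
    """Return (start, length) for each gap run that is not at either end."""
    runs = []
    for match in re.finditer(re.escape(gap) + "+", aligned_seq):
        # Skip terminal gaps; this sketch only reports gaps inside the sequence.
        if match.start() == 0 or match.end() == len(aligned_seq):
            continue
        runs.append((match.start(), match.end() - match.start()))
    return runs


# Made-up aligned DNA rows for illustration only.
alignment = {
    "seq_1": "ATGACC---AACATT-CGA",
    "seq_2": "--GACCAATAACATTCCGA",
}
for name, row in alignment.items():
    for start, length in internal_gap_runs(row):
        note = "multiple of 3" if length % 3 == 0 else "not a multiple of 3"
        print(name, start, length, note)
```

The script also flags whether each gap length is a multiple of three, which is worth keeping in mind when you think about how nucleotides relate to proteins.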

What is the difference between DNA and protein alignment? How do you explain this?

Let us now return to the cytochrome B alignment and have a look at a cytochrome B protein sequence from another kingdom, for example from Arabidopsis thaliana. Find this sequence at NCBI (use the pull-down menu to search NCBI Protein, and take the protein sequence with accession number CAA47966.1). Realign your vertebrate sequences together with this plant sequence. Are the regions you previously identified as conserved still conserved? Examine the Entrez entry for CAA47966.1. Can you conclude anything about the functional properties of the conserved regions? Hints
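The practical expects you to find the sequence through the NCBI web interface. As an optional alternative, the sketch below builds a download link for the same record using the NCBI E-utilities efetch service. It only constructs the URL and does not contact the server. The helper name fasta_url is made up for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession, db="protein"):
    """Build an efetch URL that requests one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


print(fasta_url("CAA47966.1"))
```

Pasting the printed link into your browser should return the record as FASTA text, ready to be added to your vertebrate sequences.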

Hexokinases are enzymes that phosphorylate hexoses (mainly glucose). After phosphorylation, the sugar is ready to enter intracellular metabolic processes. This hexokinase file contains the amino acid sequences of hexokinases from human and dog. Perform the alignment. Look at the Results Summary and study the Percent Identity Matrix. What is the percent sequence identity between all the sequences (e.g., human hexokinase_1 and dog hexokinase_1, human hexokinase_2 and dog hexokinase_2, etc.)? Which pairs of sequences are most similar to each other?
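To see what a percent identity value means, here is a minimal sketch that computes it for every pair of aligned rows. It assumes one common convention: identical residues divided by the number of columns in which neither row has a gap. The server may use a different denominator, so small differences from the Percent Identity Matrix are possible. The function name and the toy rows are made up and are not real hexokinase sequences.

```python
from itertools import combinations


def percent_identity(row_a, row_b, gap="-"):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # assumption: columns with a gap are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned rows (not real hexokinase sequences).
alignment = {
    "toy_1": "MIAAQLLAYY",
    "toy_2": "MIAAQLLSYY",
    "toy_3": "MLA-QVLSYF",
}
for (name_a, row_a), (name_b, row_b) in combinations(alignment.items(), 2):
    print(f"{name_a} vs {name_b}: {percent_identity(row_a, row_b):.1f}%")
```

The pair with the highest value is the most similar pair under this definition.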

Based on these limited data sets, which evolves faster: hexokinases or cytochrome B?

Links

Clustal webservers:

All Hints

The most commonly used format for sequence files in ClustalW servers (and in many other bioinformatics servers) is the FASTA format. This format is described in the NCBI help pages.
To align sequences using a webserver, open the FASTA file in your browser or in Notepad, and paste the sequences into the sequence box.
The labels of most of the Clustal Omega options at the EBI website are links to the relevant bits of the Clustal Omega help pages.
Always use ClustalW's slow or full alignment algorithm, and not the fast one. Some servers default to the fast algorithm, so do not forget to change this.
You can compare the alignments using JalView, or if you use the EBI server, directly on the ClustalW Results page (scroll down to see them). Click the Show Colors button to color the amino acids according to their properties. This will also make it easier to compare the sequences.
To open more than 1 alignment at a time in JalView, you need to save the result and then open it again.
Remember that the differences between alignments are in the gaps! So focus on the gaps, heads and tails.
If a position is fully conserved, it is indicated with "*". Substitutions can fall into three categories: between very similar amino acids (":"), between relatively similar amino acids (".") and between non-similar ones (indicated without any symbol).
Remember that PAM or BLOSUM matrices cannot be used to align nucleotide sequences. (Do you know why?)
Whether or not ClustalW decides that it is 'good' to have a gap in an alignment depends of course on the gap penalty. For very low gap penalties, ClustalW may easily insert a gap to get a 'better' alignment, even if a deletion or insertion would be unlikely (when could this be the case? Think of the relation between nucleotides and proteins). For extremely high gap penalties almost all gaps will go away (except at the beginning and end of the sequence), even if a deletion or insertion could easily have taken place.
You can limit your search results at NCBI by specifying the database field you want to search. For example, cytochrome B AND arabidopsis[orgn] only returns hits in the organism Arabidopsis. You will find more on the syntax of NCBI Entrez queries here.
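The hint on gap penalties above can be made concrete with a small scoring sketch. It assumes a simplified scheme: +1 for a match, -1 for a mismatch, a gap opening penalty for the first position of a gap and a gap extension penalty for each further position. These values and the toy sequences are made up for illustration. They are not ClustalW's defaults, and ClustalW's actual scoring is more elaborate.

```python
def alignment_score(row_a, row_b, match=1.0, mismatch=-1.0,
                    gap_open=10.0, gap_extend=0.5, gap="-"):
    """Score two aligned rows with a simple affine gap penalty."""
    score = 0.0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            # First gap column pays gap_open, later ones pay gap_extend.
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# Two candidate alignments of the same pair of toy sequences.
ungapped = ("ACGTTGCA", "ACGTGCAA")
gapped = ("ACGTTGCA-", "ACGT-GCAA")

for penalty in (1.0, 10.0):
    s_ungapped = alignment_score(*ungapped, gap_open=penalty)
    s_gapped = alignment_score(*gapped, gap_open=penalty)
    best = "gapped" if s_gapped > s_ungapped else "ungapped"
    print(f"gap_open={penalty}: ungapped={s_ungapped}, gapped={s_gapped}, best={best}")
```

With the low opening penalty the gapped alignment scores higher, and with the high one the ungapped alignment wins, which is the trade-off described in the hint.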