Bioinformatic Data Analysis
  Computer Practical 2: Database Searches
 
Goals (learning objectives):
  • become familiar with using BLAST to search for homologs,
  • predict function based on homology,
  • demonstrate the differences between the BLAST programs (BLASTn, BLASTp, tBLASTn, etc.).


Remarks:
  • Before starting the computer exercise, make sure that you have watched the tutorial videos. These will prepare you to work with BLAST on the NCBI server.
  • There are many web servers available for BLAST searches; because of its stability and its capacity to handle many queries, we will use NCBI BLAST for the exercises. If the NCBI BLAST server is unavailable, or if one of the web servers linked in the questions is offline or too slow, use one of the alternative servers in the Links section.
  • If you are having trouble with one of the questions, have a look at the Hints section.
  • Write down your answers, either on paper or in digital form (e.g. a Word document). It is important to make notes! Write down for yourself (broadly) what you have done and what your results were, and try to formulate a bottom line (in other words: what have you learned from the exercise?).
 

 
Different flavors of BLAST
A research group identified a gene from patients with disturbed sleeping patterns:

Nucleotide sequence:
gggtgaacag ccgcacggga gtaggtacgc acctgacctc gctggcactg
ccgggcaagg cagagggtgt ggcgtcgctc accagccagt gcagctacag
cagcaccatc gtccatgtgg gagacaagaa gccgcagccg gagttagaga
tggtggaaga tgctgcgagt gggccagaat

1. Perform a BLASTn search in NCBI BLAST. Use the "Nucleotide Collection". What is the most likely hit? Identify the single nucleotide polymorphism(s) (SNPs) that this patient carries. Do these mutations cause a difference in the protein sequence that the patient expresses? Can you find this out using a different BLAST program? (See the Hints section.)
2. What is the difference between BLASTN and MEGABLAST results? To see this, focus on the distant homologs found by both searches (you can get the taxonomy/organism/lineage report by clicking the "Taxonomy Report" link at the top of the results page). Are you able to find a homolog in, for example, Xenopus in both cases? If not, why?
3. Search OMIM to see if the gene you identified can really cause sleep disorders. (See the Hints section.)
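
Once a BLASTn alignment shows mismatches between the patient sequence and the best hit, each mismatch can be checked for its effect on the encoded amino acid. The following is a minimal sketch of that check, not part of the exercise itself. It assumes an ungapped alignment of two equal-length sequences, a known reading frame (the `frame` parameter is an assumption you would take from the annotation of the hit), and the standard genetic code. The two example sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def snp_effects(reference, query, frame=0):
    """List mismatches between two equal-length, ungapped sequences.

    Returns tuples of (1-based position, reference codon, query codon,
    reference amino acid, query amino acid, is_synonymous).
    `frame` is the 0-based offset of the first complete codon.
    """
    reference, query = reference.upper(), query.upper()
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    effects = []
    for pos, (ref_base, qry_base) in enumerate(zip(reference, query)):
        if ref_base == qry_base:
            continue
        start = pos - ((pos - frame) % 3)  # first base of the codon containing pos
        if start < 0 or start + 3 > len(reference):
            continue  # mismatch lies in an incomplete codon at the sequence edge
        ref_codon = reference[start:start + 3]
        qry_codon = query[start:start + 3]
        ref_aa = CODON_TABLE.get(ref_codon, "X")  # X for codons with ambiguous bases
        qry_aa = CODON_TABLE.get(qry_codon, "X")
        effects.append((pos + 1, ref_codon, qry_codon, ref_aa, qry_aa, ref_aa == qry_aa))
    return effects


# Made-up example: one synonymous and one non-synonymous substitution.
for effect in snp_effects("ATGGCCAAAGGT", "ATGGCTAGAGGT"):
    print(effect)
```

A quicker route to the same answer in the exercise is to let a translating BLAST program do the comparison at the protein level, which is what the last part of question 1 asks about.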
 

 
Function prediction
The genome of the bacteriophage crAssphage was recently discovered using bioinformatic analysis, by analyzing DNA sequences isolated from human fecal samples. The crAssphage genome contains 80 protein-coding genes, but most of those proteins do not yet have a reliable functional annotation. At this point in time, we do not even know what bacterial host this phage infects.
1. Predict the function of the following sequence based on its homologs (using NCBI BLAST). (See the Hints section.)
>crAssphage_protein_25
MKRNISNTILTKDYIFSKVSQITIFSTYTGISVEDIQHCIDTGEFISSPFREDTHPSFGFRYDNRNKLKG
RDFAGYWWGDCIDAAATVLSEIVHKQIDISIKSQFLFVLKHIAYTFRNIIYGQDKDENNDYNIARAISNV
RNHKPIIELVTRPWNNLDAKYWGQFGVNLNFLNTHFVYPVDQFYINRSTNPIPKYFYDKDKTDLCYGYVL
GQDKRGIVNVKLYFPNRNKKTEVKFITNSNTIEGVINLELDNYDVIIITKSTKDRLSLECYLKSINHSIL
YGGSTLESKTIGVVNIPHETYKLRQIEYDWLRSKLNRNGFLISLMDNDRTGLMEAVILKNDYDIIPIIIP
KELGVKDFAELRSSYSTNVINELTQQVIKYIEENYGEETEFTWDTEESNTLPY

2. In which species did you find the best hits? Is this expected?
3. Imagine you have the DNA sequence of the crAssphage genome. How can you, using BLAST, identify where on the genome the above protein sequence is encoded?
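
To get a feel for what a protein-versus-nucleotide search has to do, the sketch below translates a DNA sequence in all six reading frames and looks for a peptide in each translation. It is only an illustration under stated assumptions: it uses the standard genetic code, it looks for an exact substring rather than computing a scored local alignment as BLAST does, and the short DNA string is made up (it is not crAssphage sequence). The function names are hypothetical helpers.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated sequences."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_peptide(dna, peptide):
    """Return (frame, index in the translation) for every frame containing the peptide."""
    hits = []
    for frame, protein in six_frame_translations(dna).items():
        index = protein.find(peptide)
        if index != -1:
            hits.append((frame, index))
    return hits


# Made-up DNA that encodes the first five residues of the query protein.
toy_dna = "CCATGAAACGTAACATTTAGCC"
print(find_peptide(toy_dna, "MKRNI"))
print(find_peptide(reverse_complement(toy_dna), "MKRNI"))
```

The second call shows why both strands matter: the same gene is found in a minus frame when the genome is given as the opposite strand.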
 

 
Ubiquitin
Ubiquitin is a regulatory protein that is ubiquitously expressed in eukaryotes. Ubiquitination (or ubiquitylation) refers to the post-translational modification of a protein by the covalent attachment (via an isopeptide bond) of one or more ubiquitin monomers. The most prominent function of ubiquitin is labeling proteins for proteasomal degradation. Besides this function, ubiquitination also controls the stability, function, and intracellular localization of a wide variety of proteins (source: Wikipedia).
1. Use BLAST to find out how conserved ubiquitin is. As a starting point, use this human ubiquitin sequence:

>human ubiquitin
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG

2. Using BLAST, find out whether ubiquitin is present only in eukaryotes.
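
BLAST summarises conservation for each hit as "Identities", the number of identical alignment columns divided by the alignment length. The sketch below reproduces that calculation for two already-aligned sequences. The variant sequence is a made-up illustration created by substituting three residues of the human sequence; it is not a BLAST result, and `substitute` is a hypothetical helper.

```python
def percent_identity(aligned_a, aligned_b):
    """Identical columns divided by alignment length, as a percentage.

    Both inputs must already be aligned (same length, '-' for gaps).
    Gap columns count towards the length but never as identities.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    identical = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * identical / len(aligned_a)


def substitute(seq, changes):
    """Return seq with residues replaced at the given 1-based positions."""
    residues = list(seq)
    for position, residue in changes.items():
        residues[position - 1] = residue
    return "".join(residues)


HUMAN_UBIQUITIN = (
    "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
)

# Hypothetical variant with three substitutions, for illustration only.
variant = substitute(HUMAN_UBIQUITIN, {19: "S", 24: "D", 28: "S"})
print(f"{percent_identity(HUMAN_UBIQUITIN, variant):.1f}% identical")
```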
 

 
Links

BLAST servers:

NCBI BLAST help

OMIM

NCBI Entrez

 

 
Hints

Blastp and Blastn

  • At the NCBI Blast start-page, look carefully at all the links before you click on something.
  • Choose the "nr" database for your BLAST search. In this way you are not limiting yourself to only human or mouse sequences.
  • Make sure you use the correct BLAST flavor; NCBI defaults to Megablast for nucleotide sequences, not blastn!
  • When you want to search for scientific articles, there are various specialised search engines you can use. For articles related to biology and medicine, the best one is considered to be NCBI's PubMed at www.pubmed.gov. To find scientific articles on all fields of science, you can also try Google Scholar: scholar.google.com.
  • To search OMIM, go to the NCBI and select OMIM from the pulldown menu on the top left, enter the gene or protein name you want to search in the text field and hit the Go button. In the results list you can click on the identifiers to go to the corresponding OMIM page.
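
One practical difference between Megablast and blastn is the length of the exact word match needed to start an alignment: the default word size is 28 for Megablast and 11 for blastn. The sketch below counts shared exact words of both sizes between a sequence and a diverged copy. It is a simplified illustration of the seeding step only, not of the full BLAST algorithm. The diverged sequence is made up by substituting every 15th base, and `mutate_every` is a hypothetical helper rather than a realistic model of evolution.

```python
def shared_words(seq_a, seq_b, word_size):
    """Count positions in seq_b where an exact word of seq_a occurs."""
    words_a = {seq_a[i:i + word_size] for i in range(len(seq_a) - word_size + 1)}
    return sum(
        1
        for i in range(len(seq_b) - word_size + 1)
        if seq_b[i:i + word_size] in words_a
    )


def mutate_every(seq, step):
    """Toy divergence: substitute every `step`-th base of a lowercase DNA string."""
    swap = {"a": "c", "c": "g", "g": "t", "t": "a"}
    return "".join(swap[b] if (i + 1) % step == 0 else b for i, b in enumerate(seq))


# First 50 nucleotides of the patient sequence from the exercise.
reference = "gggtgaacagccgcacgggagtaggtacgcacctgacctcgctggcactg"
# Made-up relative with a mismatch at every 15th position.
diverged = mutate_every(reference, 15)

for word_size in (11, 28):
    print(word_size, shared_words(reference, diverged, word_size))
```

With regularly spaced mismatches, short words still match while no long word survives, which is the kind of situation in which a search with a large word size misses a more distant homolog.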

Function Prediction

  • There are several ways you can try to predict the function of a sequence. A much-used method is to search for homologs with a known function, either in other species (orthologs) or in the same species (paralogs). Do you think BLAST is a good tool for this? What kind of hits would you look for when searching for homologs? Think about significance (E-value: the number of hits with a score at least this high that you would expect to find by chance in a database of this size) and annotation (description, references).
  • An easy way to reduce database size is to limit your BLAST search in a specific database to a subset of that database through the "Organism" or "Entrez Query" options (found under "Choose Search Set" on the BLAST query page). You can see the actual database size when you click on the "Search summary" link at the top of the results page.
  • You can turn on the low-complexity filter under "Algorithm parameters". If you click on the bit score of a BLAST hit, you will see the alignment of the query sequence (the one you provided) and the hit sequence (the one in the database, marked as Sbjct in the alignment). You might notice that some parts of the sequence are printed in greyish lowercase letters. These are so-called low-complexity regions, which are ignored when computing the alignment score of the hit. For more information, see your reader. Why is it useful to ignore such regions when computing the E-value?
  • Exact amino acid matches in the alignment are marked with the amino acid letter, and "pretty good" matches are marked with + (amino acid pairs with a positive score in BLOSUM or another scoring matrix, which probably have comparable properties). Remember the colours in the ClustalW alignments of the previous web exercises.
  • You can determine the exact size of a BLAST database by selecting "Use old BLAST report format" under formatting options on the NCBI BLAST output page.
  • TBLASTN compares a protein query against a nucleotide database that is translated in all six reading frames.
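
The relation between bit score, database size, and E-value mentioned in the hints above can be made concrete with the standard formula E = m × n × 2^(−S'), where S' is the bit score, m the query length, and n the total database length. The sketch below applies it to made-up numbers. It uses raw lengths, whereas BLAST uses corrected "effective" lengths, so it will not reproduce the E-values on a BLAST results page exactly.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one chance hit with this score, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Made-up numbers: a 76-residue query, bit score 50, and two database sizes.
for database_length in (10**9, 10**7):
    e = expected_hits(50, 76, database_length)
    print(f"{database_length:.0e} residues: E = {e:.2e}, P = {chance_probability(e):.2e}")
```

The same hit becomes 100 times more significant when the search set is 100 times smaller, and each extra bit of score halves the E-value. For small E, the probability P is almost equal to E, which is why the two are often used interchangeably.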

Ubiquitin

  • Click on the "Taxonomy reports" link at the top of the BLAST results page to see the taxonomy of your hits.
  • By default NCBI BLAST only shows the 100 best hits. You can change this behaviour by clicking "Algorithm Parameters" on the BLAST query page and selecting a higher number under "Max target sequences". Note that selecting numbers above 1000 tends to make BLAST rather slow.