Bioinformatic Pattern Analysis, Project

Goal: Learn to combine tools encountered in the previous Computer exercises to answer specific biological questions.

Remarks: Although a couple of hints are given at the bottom of this page, the questions are less straightforward than those in the previous Computer exercises, and answering them will require some trial and error. Do not hesitate to ask your assistant if you feel you are not making any progress (but please read the hints first!). Note that you can answer question 2 without fully answering question 1, so if you get stuck while your assistant is temporarily unavailable, you can continue with the other question.

Format: The report should be a minimum of 800 words and contain two figures that you have generated yourself. It should take the form of a short article, with introduction, results and discussion sections. Please do not copy-paste complete pages of BLAST output; a summary like "a blastp search of protein x against the nr database gives hits in species y and z" will do. If you are using web servers (e.g. for a BLAST search or a ClustalW alignment), don't forget to write down the settings you use if they differ from the page's default settings. Make the report a story, not just a list of answers. You can find a more elaborate guide here. If you work in a group of 2-3, it is enough to hand in a single report, but please clearly state the names of all group members. The deadline is on the Calendar page of this course.

Obesity

Obesity is a medical condition in which excess body fat has accumulated to the extent that it may have an adverse effect on health, leading to reduced life expectancy and/or increased health problems. Body mass index (BMI), a measurement which compares weight and height, defines people as overweight (pre-obese) if their BMI is between 25 and 30 kg/m², and obese when it is greater than 30 kg/m². With approximately 1.6 billion overweight individuals and at least 400 million clinically obese adults worldwide, obesity is one of the world's greatest health concerns. It is a disease which leads to associated health complications including diabetes, cardiovascular disease, musculoskeletal disorders, and cancer (visit the WHO site on obesity for detailed statistics). As a result, obesity is the focus of much scientific interest. Among several genes that have been studied extensively, two are significantly associated with obesity: the melanocortin 4 receptor gene (MC4R) and the alpha-ketoglutarate-dependent dioxygenase gene (FTO).

Study how important these two genes (MC4R and FTO) are for eukaryotic life forms. How early did they emerge and how widespread are they? First think about which answers you expect to find to these questions and list your arguments. Search the literature to find functional data (e.g. animal models in which these genes are knocked out) to support your conclusion. Are your conclusions the same as what you expected?

Now focus on one of the two genes and generate a data set so that you can tell the evolutionary story of this gene, i.e. how widespread it is, in which species it is found in one or more copies, and where you see gene duplications happening. Point out the parts of the gene where you expect that single nucleotide polymorphisms (SNPs) could disrupt protein function and therefore might result in a disease condition (in this case obesity). Search the databases to find SNPs in this gene that are associated with obesity and see if their locations fit the regions you have identified as likely positions that can cause obesity.
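One way to compare SNP positions with conserved regions is to score each column of a multiple sequence alignment and look up the column that corresponds to a given residue position. The following is a minimal sketch, not a prescribed method: the function names `column_conservation` and `protein_pos_to_column` are hypothetical, the scoring (fraction of sequences sharing the most common residue) is only one of several possible conservation measures, and the toy alignment is made up and does not contain real MC4R or FTO sequences.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, per alignment column, the fraction of sequences that share
    the most common residue. Gaps ('-') are never counted as the shared
    residue, so gappy columns score low."""
    n_sequences = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / n_sequences)
    return scores


def protein_pos_to_column(aligned_seq, position):
    """Map a 1-based residue position in the ungapped protein to the
    0-based column index in the aligned (gapped) sequence."""
    seen = 0
    for col, residue in enumerate(aligned_seq):
        if residue != "-":
            seen += 1
            if seen == position:
                return col
    raise ValueError("position lies beyond the end of the sequence")


# Toy alignment: invented sequences of equal length, for illustration only.
alignment = [
    "MK-LVNST",
    "MKALVNSA",
    "MR-LVNGT",
    "MK-LINST",
]

scores = column_conservation(alignment)

# Hypothetical variant at residue 3 of the first sequence.
col = protein_pos_to_column(alignment[0], 3)
print(f"residue 3 maps to column {col}, conservation {scores[col]:.2f}")
```

Because databases usually report variant positions relative to the ungapped protein sequence, the mapping step matters: once gaps are inserted, residue numbers and alignment columns no longer coincide.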

Links

BLAST servers:

Sequence analysis webservers:

(source: http://www.csm.ornl.gov/SC99/GENwall.html)

A comparison of a mouse that is unable to produce leptin, resulting in obesity (left), and a normal mouse (right)

Hints

Many of the hints from the previous Computer exercises could prove useful. Here are some additional hints:

You can try out the STRING server for extra analysis to support/check your phylogenetic analysis.
Think beforehand about which sequences you want to use for analyzing conservation between very different species: protein or DNA. When considering which BLAST database might be most suitable to find reliable, non-redundant homologs, remember that the GenBank "non-redundant" database (NR) has non-redundant sequence identifiers but may have many redundant sequences. Conversely, RefSeq has sequences from carefully curated reference genomes.
The OMIM database contains information about which SNPs are associated with diseases like obesity. The locations of most of these SNPs are given with respect to the protein sequence.
You can get a FASTA file with the amino acid sequences of a selection of BLAST hits by selecting the checkbox in front of your hits of interest in the alignment part of the NCBI BLAST output, followed by clicking the "get selected sequences" link at the bottom of the page. On the next page, change the Display pull-down menu to "fasta" and select "Text" or "File" from the Send-to menu.
If you want to make a phylogenetic tree of a protein with a lot of BLAST hits, don't simply take the top hits, but consider including hits with varying degrees of similarity.
You can select how BLAST output is sorted by clicking on the different headers of the summary table. Sorting by "Query_coverage" can be helpful in getting rid of partial hits and synthetic constructs (which are often very short).
When you are selecting BLAST hits to put into a multiple sequence alignment, try to select proteins which are similar to the BLAST query over their entire length.
Making multiple sequence alignments of long proteins can take quite a while, so it is probably a good idea to limit the number of sequences to align (max 15).
Jalview is a nice tool for identifying specific positions and conserved regions in multiple sequence alignments; you can find a link to Jalview near the top of the Clustal output.
An easy way to include pictures in your report is to press the print-screen button, followed by pasting in a program like MS Word or Open Office Writer.
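The hints above on choosing BLAST hits (prefer hits that cover the whole query, include varying degrees of similarity, and align at most 15 sequences) can also be applied with a small script instead of by hand. The sketch below is only an illustration under stated assumptions: it assumes you have a tab-separated table with three columns (hit identifier, percent identity, query coverage), the function names `parse_hits` and `select_hits` and the 90% coverage threshold are hypothetical choices, and the hit identifiers and values are invented.

```python
def parse_hits(text):
    """Parse tab-separated lines of: hit id, percent identity, query coverage.
    This column layout is an assumption for the sketch."""
    hits = []
    for line in text.strip().splitlines():
        hit_id, identity, coverage = line.split("\t")
        hits.append((hit_id, float(identity), float(coverage)))
    return hits


def select_hits(hits, min_coverage=90.0, max_hits=15):
    """Drop partial hits, then pick up to max_hits spread evenly over the
    range of percent identities rather than only the top hits."""
    full_length = [h for h in hits if h[2] >= min_coverage]
    full_length.sort(key=lambda h: h[1], reverse=True)
    if len(full_length) <= max_hits:
        return full_length
    if max_hits == 1:
        return full_length[:1]
    # Evenly spaced indices from the most to the least similar hit.
    step = (len(full_length) - 1) / (max_hits - 1)
    indices = sorted({round(i * step) for i in range(max_hits)})
    return [full_length[i] for i in indices]


# Invented example table; hitD is a short partial hit (35% coverage).
table = "\n".join([
    "hitA\t100.0\t100",
    "hitB\t92.5\t99",
    "hitC\t80.1\t98",
    "hitD\t99.0\t35",
    "hitE\t61.3\t95",
    "hitF\t45.0\t91",
    "hitG\t70.0\t97",
    "hitH\t55.0\t93",
])

selected = select_hits(parse_hits(table), min_coverage=90.0, max_hits=3)
for hit_id, identity, coverage in selected:
    print(hit_id, identity, coverage)
```

With `max_hits=15` the same function would enforce the suggested limit for the multiple sequence alignment. Whatever thresholds you use, state them in your report.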