Systems Biology - Bioinformatic Data Analysis

Bioinformatic Data Analysis

Errata Reader
  • Page 2: For all R modules, the reader tells you to look in the back of the reader, but the best place to look is on Blackboard under "Course Content". Blackboard also contains the R selftests and the deadlines (see the calendar, also on Blackboard).
  • Page 2: On Tuesday February 4th, 11:00-12:45, you can work on the exercises of Chapters 1 & 2 with the teaching assistants (TAs); for the location, see MyTimeTable. After 15:00 you can work through R modules 4-6 in self-study (see Blackboard).
  • Page 35: In q5, add: Hint: remember, you can always use '?' in R to find help on any function (see Section 2.5.1).
  • Page 46: In q3d, "Does this tell you something about the alpaca" should be "What does this tell you about the alpaca". (This change is meant to make you think a bit harder and to avoid giving you a 50% chance of guessing the answer.)
  • Page 48: q5 is an extra exercise to practice with R - you may skip it until later.
  • Page 48: q6 d-h - you may skip these until later.
  • Page 62: q9b - "(i.e. star representation)" should be "(i.e. radial representation)".
  • Page 72: q4 - forget the comment about Jukes-Cantor correction - this is only explained in Chapter 7.
  • Page 73: q6 - In the next version of the reader, I will break down this exercise into steps. First, identify the phylogenetically informative positions. Second, create a distance matrix listing the number of sites that differ between each pair of sequences. Third, consider all possible trees that could be made from these six sequences. In each tree, indicate which mutations occurred on the branches. Fourth, the maximum parsimony tree is the one that requires the lowest total number of mutations. Hint: use the heuristic tree-searching approach. Once you have found a tree that is pretty good (requires a low number of mutations), see if you can improve it by swapping branches around. Once you are confident that no branch-swaps lead to a lower number of assumed mutations, you might just have found the maximum parsimony tree!
  • Page 79: Change sentence to "In such cases, you will find more synonymous mutations, and fewer non-synonymous mutations in the DNA, so dN/dS < 1."
  • Page 85: q10c - Note that "T" is used as a variable name here, not as the R shorthand for "TRUE".
  • Page 86: q12a - Do not take the length of the sequence into account for now.
  • Page 93: q4b - Hint: we use the seq command to get evenly spaced values along the X-axis. This allows us to nicely plot the graph in R.
  • Page 104: Section 8.7: Check out the very helpful Wikipedia page on the Needleman-Wunsch algorithm.
  • Page 106: q6k - Hint: click the link to the Interpro domain database.
  • Page 118: q11a - "Error! Reference source not found." should be "5".
  • Page 133: In the answer to q1b, you could refer to Figure 12 to see that 10 different phyla were measured. If we assume that this is the same data that the PCA in Figure 13 was based on (it is from the same publication after all), then the original data had 10 dimensions - one for every phylum. Every metagenome is a point in this 10-dimensional space depending on how much of that phylum was measured, and Figure 13 is a two-dimensional projection of this space.
  • Page 134: Extra explanation to q3d. We know:
    1. 0 < D < 1
    2. -1 < r < 1
    3. Correlation (r) can be viewed as a similarity (S) measure (r~S).
    4. D = 1 - S
    Thus we could say D = 1 - r. However, that would violate restriction (1) that 0 < D < 1, for example if r = -0.6 (which would give D = 1.6). Therefore we first need to scale r, so that the values it takes lead to meaningful distance calculations.
    Starting with (2):
    -1 < r < 1
    0 < r+1 < 2 (add 1 to all members of the inequality)
    0 < (r+1) / 2 < 1 (divide all by 2)
    Thus, if r_scaled = (r+1) / 2:
    D = 1 - r_scaled
      = 1 - (r+1) / 2
      = (2 - r - 1) / 2
      = (1 - r) / 2
    so D = 0.5 - r/2
  • Page 137: In the answer to q5b, note that this function is basically identical to Equation 1 on page 41.
  • Page 140: In the answer to q6g, 88% of the variation should be 89% of the variation.
  • Page 144: In the answer to 2c: (((A:0.1,B:0.1):0.075,C:0.175):0.04,D:215); should be (((A:0.1,B:0.1):0.075,C:0.175):0.04,D:0.215);.
  • Page 144: In the answer to q3, "P (Phenylalanine)" should be "P (Proline)". (Phenylalanine is F.)
  • Page 146: In the answer to q10d, the ("Human COX1","Sheep COX1") branch in the tree on the left has a bootstrap value of 75. Note that this number could be lower if any of the other bootstrap trees had a ("Sheep COX1","Sheep COX1") branch.
  • Page 148: The answer to q5 could be A, B, or E, depending on the specific mutation. If the 3-nucleotide insertion (A) or deletion (E) occurred in frame, this would lead to a single change in the protein sequence, i.e. the insertion or deletion of a single amino acid, respectively. If these mutations occurred out of frame, the mutation could lead to two amino acid changes in the protein sequence. The 4-nucleotide substitution (answer B) could lead to 0, 1, or 2 amino acid changes in the protein sequence, depending on which codons are affected.
  • Page 150: In the answer to q1a, S is the least conserved in the PAM250 matrix.
  • Page 151: In the answer to q3, the observed/expected ratio is 2^(7/2) = 11.3. Thus, these two sequences are 11.3 times more likely to be well-aligned homologs than unaligned sequences.
  • Page 151: In the answer to q4d, note that D is always smaller than d because back mutations could make two sequences look more alike.
  • Page 152: In the answer to q1, the alignment score of all three alignments is -2. They are all optimal and you should report all three of them.
  • Page 156: In the answer to q2b, "specific to your novel fungus" should be "specific to your novel fungus and not found in other genomes".
  • Page 157 at the top: In the answer to q2c, "in a blastx search" should be "in a blastx or tblastn search".
  • Page 158: In the answer to q8b, "If you use a blastx" should be "If you use a blastx against a protein database".
  • Page 159: In the answer to q8d, "we can find homologs in Xenopus frogs with blastn if the number of target sequences is set sufficiently high, but not with megablast".
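
The scaling in the extra explanation to q3d (page 134) can be checked numerically. The snippet below is a minimal sketch in Python (the course itself uses R); the function name correlation_to_distance is hypothetical and is not taken from the reader. It follows the two steps of the derivation: rescale r to the range 0 to 1, then apply D = 1 - S.

```python
def correlation_to_distance(r):
    """Map a correlation r in [-1, 1] to a distance D in [0, 1]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    r_scaled = (r + 1) / 2   # rescale r from [-1, 1] to [0, 1]
    return 1 - r_scaled      # D = 1 - S, using r_scaled as the similarity S


# Print D for a few correlations, next to the closed form 0.5 - r/2.
for r in (-1.0, -0.6, 0.0, 0.6, 1.0):
    print(r, correlation_to_distance(r), 0.5 - r / 2)
```

With r = -0.6, the naive D = 1 - r would give 1.6, whereas the scaled version stays within the allowed range.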
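
The observed/expected ratio in the answer to q3 (page 151) can be reproduced with a short calculation. This is a sketch in Python under one assumption: the exponent 7/2 is read as a log-odds score of 7 expressed in half-bit units, which is why the score is divided by 2. The function name odds_ratio and the parameter units_per_bit are hypothetical and do not come from the reader.

```python
def odds_ratio(score, units_per_bit=2):
    """Convert a base-2 log-odds score into an observed/expected ratio.

    Assumption: the score is given in half-bits, so units_per_bit = 2.
    """
    return 2 ** (score / units_per_bit)


ratio = odds_ratio(7)
print(round(ratio, 1))  # 11.3
```

A score of 0 gives a ratio of 1, meaning the aligned pair is observed exactly as often as expected by chance.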

Course Description
In the last two decades, biology has become a data-driven science. One example of the large datasets that have become an integral part of mainstream biology is the DNA sequence data generated by Next Generation Sequencing machines. Bioinformatics is an essential tool to interpret and understand these and other datasets, and to extract relevant biological insights. Thus, bioinformatics is critical in all areas of biology, including the study of molecules (molecular biology), tissues (physiology), biological communities (ecology), and everything in between. Evolution leaves recognizable signals in all these biological systems, and these evolutionary signals provide bioinformatics with its great power to understand biology.

During this course, students will be introduced to several concepts and methods in the field of Bioinformatic Data Analysis. Our emphasis will be on studying the function and evolution of proteins and genomes. Students will learn the basics of important bioinformatic methods that will be used throughout their careers, including database searches, sequence alignments, phylogeny, and clustering. Moreover, students will perform a mini-project and write a mini-article to experience how bioinformatics is used in modern biological research.