Topics in Systematics and Evolution: Bioinformatics for Evolutionary Biology

Daily Assignments

These are the questions selected for marking. Reminder: answers should be no more than one paragraph long. Send answers to gregory.lawrence.owens@gmail.com

Questions 2-10 are due on Friday August 9th at 10 am.

  1. NO QUESTION

  2. What is one task for which you’d rather use an R script than a shell script? Why? What is one task for which you’d rather use a shell script than an R script? Why?

  3. Try different filtering options for the GBS data (see http://prinseq.sourceforge.net/manual.html for options) and plot QC graphs. Discuss which options you would choose to implement if these were your data, and why.

  4. What are two ways that could be used to evaluate which aligner is best?

  5. Calculate assembly metrics for your first assembly, the one you ran without any options. Then pick different sets of parameters and rerun the assembly. Compare the resulting assemblies and discuss which parameter sets seemed to improve the assembly and why that might be.

  6. What expression measure would you use to compare gene expression between different genes and why? Is it appropriate to compare the raw expression counts? Can you get more appropriate data from RSEM?

  7. You’re trying to create a very stringent set of SNPs for measuring population structure in a PCA. Based on the site information GATK produces, what filters would you use? Include the actual VCF abbreviations.

  8. For a site that is invariant in both populations (i.e. a locus with no variation), what is Fst?

  9. If you have a dataset of 100 samples and 100,000 SNPs, what is the maximum number of PC axes? PCs can also be called eigenvectors. HINT: Here’s an explanation of PCAs

  10. What does it mean when something has 50% bootstrap support? What are two possible reasons that a node may have low support? Include one biological and one methodological reason.
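
For Question 5, one commonly reported assembly metric is N50: the length of the contig at which the contigs, sorted from longest to shortest, first cover at least half of the total assembly length. The sketch below is only one possible way to compute it and is not the course's official script. The function names `contig_lengths` and `n50` are made up for illustration, and the code assumes the assembly is a plain FASTA file that has already been read into a list of lines.

```python
def contig_lengths(fasta_lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the contig being read; None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # A header line starts a new contig; store the previous one first.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a non-empty collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running sum to half the total is the N50.
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy example with hypothetical contigs, not real assembly output.
    toy_fasta = [">contig1", "ACGTACGT", "AC", ">contig2", "GGGG", ">contig3", "TT"]
    sizes = contig_lengths(toy_fasta)
    print("contig lengths:", sizes)
    print("N50:", n50(sizes))
```

N50 alone can be misleading, so it is worth reporting it alongside other summaries such as the number of contigs and the total assembly length when comparing parameter sets.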