Objectives
- To learn how to download data from the SRA
- To learn how to filter a fastq file
- To understand the structure and format of genome files
- To learn tricks for manipulating a fastq file
Today we’re going to learn how to access sequence data and prepare it for analysis. Please login to the “fossa” server and go to your home directory to start.
All students will be accessing the server using their UVic Netlink ID. You will use the ssh
or Secure Shell
command to login to the remote server, while logged into the UVic VPN.
Make sure that you are using the “student” profile when logging into the VPN, not “employee”.
ssh your_id@fossa.rcs.uvic.ca
You will be prompted to use Duo, two-factor authentication before you enter your password. After approving the authentication on your phone, you will use your netlink ID to login.
(base) p165-072:~ gregoryowens$ ssh grego@fossa.rcs.uvic.ca
*
* Acceptable Use of Electronic Information Resources
*
* Access to this system is prohibited except for authorized University of
* Victoria students, faculty, and staff.
*
* All activity may be logged/monitored and unauthorized access will be handled
* per the provisions of UVic's Policy IM7200: Acceptable Use of Electronic
* Information Resources.
*
* https://www.uvic.ca/universitysecretary/assets/docs/policies/IM7200_6030_.pdf
*
(grego@fossa.rcs.uvic.ca) Duo two-factor login for grego
Enter a passcode or select one of the following options:
1. Duo Push to XXX-XXX-XXXX
Passcode or option (1-1):
After logging in, you will see a screen showing something like this:
To access CVMFS modules please source the appropriate profile.
For example: 'source /cvmfs/soft.computecanada.ca/config/profile/bash.sh'
Last failed login: Thu Oct 5 10:44:07 PDT 2023 from 142.104.165.72 on ssh:notty
There was 1 failed login attempt since the last successful login.
Last login: Thu Oct 5 09:48:45 2023 from 142.104.165.72
As a first step, we need to download some sequence data. The North American repository of sequence data is the Sequence Read Archive (SRA). Whenever a researcher wants to publish a genomic analysis, they need to upload their raw data to the SRA so that other researchers can use replicate their results or use it for other purposes. The sample we’re going to work with is identified as DRR053219.
Activity
- What species is this sample from?
- What project was this sample sequenced for?
- Each data file in the SRA has multiple layers of metadata. Specifically Bioproject, Biosample, Experiment and Run. Explain the difference between each of these layers.
To download data we are going to use fasterq-dump, a program that can pull reads from the SRA based on their ID number.
#First we need to load the module system on the server, if you didn't modify your .bashrc in previous lessons.
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh
#Then we go into our working directory
cd /project/biol470-grego
#Next make a directory for your own testing and enter it
mkdir YOUR_NAME
cd YOUR_NAME
#Then we load the sra-toolkit, which has fasterq-dump
module load StdEnv/2023 gcc/12.3 sra-toolkit/3.0.9
#Then we download the sample, splitting up the data into two files for the forward and reverse reads
fasterq-dump --skip-technical -S DRR053219
We now have two read files for this sample. It’s good pratice to exam the quality of the data using a program like FastQC. This program will process your samples and output a variety of stats about the data.
#First load the program
module load fastqc/0.12.1
#Then run fastqc on our samples.
fastqc DRR053219_1.fastq DRR053219_2.fastq
It will produce an html report for each sample titled DRR053219_1_fastqc.html and DRR053219_2_fastqc.html. To view those, you’ll need to download them to your desktop. Open a new terminal and run scp.
scp YOUR_NAME@fossa.rcs.uvic.ca:/project/biol470-grego/YOUR_NAME/DRR*html Desktop/
Take a look at the output and answer the following questions
Activity
- Which read has better base quality?
- What is an overrepresented sequence and what is the likely source? Think about how this sequence got there.
- Look at the “Per base sequence content” and compare it to an example in the FastQC manual (https://dnacore.missouri.edu/PDF/FastQC_Manual.pdf). What is this result telling us about the sequencing library? If this were a normal whole genome sequencing library, would this be a good or bad sign?
If you don’t see any red flags in fastqc, the next step is to trim the reads. This will do three different things: 1) Remove adapter sequences, which were added to your DNA but aren’t from your target sample’s genome. 2) Remove low quality reads. 3) Trim poor quality bases off the ends of reads. Base quality tends to decline so the ends of reads can be very low quality.
#First load the program
module load trimmomatic
#Then run trimmomatic on our samples.
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar PE DRR053219_1.fastq DRR053219_2.fastq DRR053219_1.trim.fastq DRR053219_1.unpaired.fastq DRR053219_2.trim.fastq DRR053219_2.unpaired.fastq ILLUMINACLIP:$EBROOTTRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36
Activity
- Look at the manual for trimmomatic and figure out what the arguments in the trimmomatic call mean
- If you wanted to make your filtering more stringent, what arguments would you change?
- What is TruSeq3-PE? Would you use this for every library?
- How many reads were removed by the filtering? Can you calculate that just by comparing the files?
We also need a reference genome to compare our sample with. Reference genomes are often available on the NCBI website. Here’s a link to the yeast (S. cerevisiae) reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000146045.2/
Activity
- Using the NCBI website, find the GPX1 gene in the yeast reference genome. Where is it? What is the nucleotide sequence of the gene? What is the function of the gene?
- What are the genes next to GPX1 in the genome?
- Using BLAST, find where this gene is in the genome:
>Unknown gene ATGGTCAAATTAACTTCAATCGCCGCTGGTGTCGCTGCCATCGCTGCTACTGCTTCTGCAACCACCACTC TAGCTCAATCTGACGAAAGAGTCAACTTGGTGGAATTGGGTGTCTACGTCTCTGATATCAGAGCTCACTT AGCCCAATACTACATGTTCCAAGCCGCCCACCCAACTGAAACCTACCCAGTCGAAGTTGCTGAAGCCGTT TTCAACTACGGTGACTTCACCACCATGTTGACCGGTATTGCTCCAGACCAAGTGACCAGAATGATCACCG GTGTTCCATGGTACTCCAGCAGATTAAAGCCAGCCATCTCCAGTGCTCTATCCAAGGACGGTATCTACAC TATCGCAAACTAG
To work with the yeast reference genome, we have to download it to the server. Go back to the first yeast genome page and click on the “Curl” button. It will show you a URL, which you should copy. We can use that URL and the “curl” command to download the reference genome files
curl -o yeast_genome.zip 'https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000146045.2/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT'
unzip yeast_genome.zip
Now lets take a look at what we got.
cd ncbi_dataset/data/GCF_000146045.2
ls
cds_from_genomic.fna GCF_000146045.2_R64_genomic.fna genomic.gff protein.faa rna.fna sequence_report.jsonl
First lets talk about “GCF_000146045.2_R64_genomic.fna”. This is the genome sequence in fasta format. It uses a .fna file suffix because it’s a fasta with only nucleotide information (and not amino acids, for example).
head GCF_000146045.2_R64_genomic.fna
>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence
ccacaccacacccacacacccacacaccacaccacacaccacaccacacccacacacacacatCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCAT
TCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATAT
TGAAACGCTAACAAATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCAC
CCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGG
TCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTcccaaat
attgtataaCTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTC
Each chromosome (or contig) will be represented by a different fasta entry. You can see that this has an ID for the first chromosome “NC_001133.9” as well as a description of that entry. Some of the bases are lowercase, while others are uppercase. The lowercase bases have been identified as being repetitive and are what we call “soft-masked”. Some programs will ignore soft-masked regions of the genome.
We can manipulate the reference fasta file using samtools.
#First we load the samtools module
module load samtools/1.18
#Then we need to index the fasta file.
samtools faidx GCF_000146045.2_R64_genomic.fna
#This creates the index file "GCF_000146045.2_R64_genomic.fna.fai".
#Sometimes index files are only machine-readable, but in this case the index file is actually helpful
head GCF_000146045.2_R64_genomic.fna.fai
NC_001133.9 230218 76 80 81
NC_001134.8 813184 233249 80 81
NC_001135.5 316620 1056676 80 81
NC_001136.10 1531933 1377332 80 81
NC_001137.3 576874 2928491 80 81
NC_001138.5 270161 3512653 80 81
NC_001139.9 1090940 3786270 80 81
NC_001140.6 562643 4890926 80 81
NC_001141.2 439888 5460680 80 81
NC_001142.9 745751 5906143 80 81
This shows all of the contigs in the fasta file (first column) and the second column shows the number of basepairs in each contig.
Activity
- Sometimes you will want to extract the sequence from a specific region of the genome. You can do this using samtools faidx. Figure out how to output the region NC_001135.5:10030-10130.
- Other times, you’ll want to work on only a subset of chromosomes. Write a script using bash that finds the two longest chromosomes and outputs only those two chromosomes to a separate fasta file.
The directory also contains information on the annotated genes in the genome. Specifically: 1) “cds_from_genomic.fna”. This is the coding sequence for all genes 2) “rna.fna”. This is the RNA sequence for all genes 3) “protein.faa” This is the translated amino acid sequence for all genes. 4) “genomic.gff” This includes the locations and descriptions of all features in the genome (not only genes)
The cds and rna are very similar, but the cds only includes bases that code for amino acids while the rna includes transcribed bases upstream of the start codon and downstream of the stop codon.
Some genome assemblies have additional information files and you can find their definitions here: https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/
Lets take a look a the gff file.
#We're piping it to `less -S` to prevent line wrapping, because the lines on this
#file can be quite long
head -n 20 genomic.gff | less -S
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build R64
#!genome-build-accession NCBI_Assembly:GCF_000146045.2
#!annotation-source SGD R64-4-1
##sequence-region NC_001133.9 1 230218
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292
NC_001133.9 RefSeq region 1 230218 . + . ID=NC_001133.9:1..230218;Dbxref=taxon:559292;Na>
NC_001133.9 RefSeq telomere 1 801 . - . ID=id-NC_001133.9:1..801;Dbxref=SGD:S00>
NC_001133.9 RefSeq origin_of_replication 707 776 . + . ID=id-NC_001133.9:707..776;Dbxr>
NC_001133.9 RefSeq gene 1807 2169 . - . ID=gene-YAL068C;Dbxref=GeneID:851229;Name=PAU8;>
NC_001133.9 RefSeq mRNA 1807 2169 . - . ID=rna-NM_001180043.1;Parent=gene-YAL068C;Dbxre>
NC_001133.9 RefSeq exon 1807 2169 . - . ID=exon-NM_001180043.1-1;Parent=rna-NM_00118004>
NC_001133.9 RefSeq CDS 1807 2169 . - 0 ID=cds-NP_009332.1;Parent=rna-NM_001180043.1;Db>
NC_001133.9 RefSeq gene 2480 2707 . + . ID=gene-YAL067W-A;Dbxref=GeneID:1466426;Name=YA>
NC_001133.9 RefSeq mRNA 2480 2707 . + . ID=rna-NM_001184582.1;Parent=gene-YAL067W-A;Dbx>
NC_001133.9 RefSeq exon 2480 2707 . + . ID=exon-NM_001184582.1-1;Parent=rna-NM_00118458>
NC_001133.9 RefSeq CDS 2480 2707 . + 0 ID=cds-NP_878038.1;Parent=rna-NM_001184582.1;Db>
NC_001133.9 RefSeq gene 7235 9016 . - . ID=gene-YAL067C;Dbxref=GeneID:851230;Name=SEO1;>
The start of the gff is a header, marked by the “#”. It tells you which specific gff version you have and where it came from. The first few columns tell us:
- Chromosome or contig
- Source of the feature
- What type of feature
- The start of the feature
- The end of the feature
- The score for the feature (in this case blank)
- The strand the feature is on
- The phase of the feature (in this case, what codon frame the CDS is in)
- Additional info
You can read more about how gff format works here: GFF3 Specification
In many cases, you’ll want to know which features are in a particular part of the genome. Since a gff file is actually just text, we can use some basic unix tools to filter it. A convenient option is “awk”.
awk '$1 == "NC_001133.9" && $4 >= 10290 && $5 <= 27968' genomic.gff
This filter for features on chromosome NC_001133.9 that start after 10290 and end before 27968.
This can work well for a single region, but you often will want to select information on multiple regions. Imagine a scenario where you have some test for each site across the genome, and you want to find all the genes that overlap with those sites. In this case, you can use bedtools.
module load bedtools
We need to make a bed file that includes the regions. This file should have three columns, chromosome, start and stop. Use nano to make this table, putting tabs between columns, and save it as regions.bed.
NC_001133.9 0 5000
NC_001134.8 0 5000
One thing to keep in mind is that bed files are zero-based positions. Which means that if you want the first five bases you need to range 0-5. If you did 1-5, it would skip the first base. When you work with VCFs later See this link for more information.
bedtools intersect -a genomic.gff -b regions.bed > selected.gff
Activity
- Find all the features on NC_001148.4 from 1 to 10000 bp.
- Find all the origin of replications on NC_001133.9. How many are there? HINT: awk or grep can be used to pull out lines matching a string.
- How would you use bedtools to excluded specific regions?