Genome Assembly
Today we’re going to use two different genome assembly methods and assess assembly quality. Log into the server and run the source command.
cd /project/biol470-grego/YOUR_NAME
mkdir lab_10
cd lab_10
mkdir velvet
wget https://zenodo.org/api/records/582600/files-archive -O velvet_files.zip
unzip velvet_files.zip
We’re going to run a de novo assembly of a simulated Staphylococcus aureus genome. This species has a genome only 197,394 bp so its relatively simple. The first step is to filter our reads. If we don’t remove low quality reads or reads with adapter sequences we can incorporate errors into our assembly.
module load fastp
fastp --in1 mutant_R1.fastq --in2 mutant_R2.fastq --stdout > mutant_bothreads.fastq
You may notice we’re outputting only a single file from fastp. In this case, we want to generate an interleaved fastq file that has both R1 and R2 in the same file. This is needed for velvet.
module load velvet
#This first step breaks our reads into kmers of length 29
velveth mutant 29 -shortPaired -fastq -interleaved mutant_bothreads.fastq
#This second step assembles the kmers
velvetg mutant
We’ve now assembled our reads! Hurray! The main output of this is the contigs.
cd mutant
head contigs.fa
>NODE_1_length_29_cov_1.000000
AGTCGGTTGCTGAAGTATCACAGGGCCATGTCCCTCAACTTCTCACAATAAGAAATA
>NODE_2_length_171_cov_26.801170
GTTGGTAACATTTATACAGGATCATTATACTTAAGTTTAATTTCGTTATTACAGAACCAC
ACATTCCAACCAGAAGAGAAAGTATGTCTATTTAGTTATGGTTCAGGAGCAGTAGGAGAA
ATCTTTAGTGGTTCAATCGTTAAAGGATATGACAAAGCATTAGATAAAGAGAAACACTTA
AATATGCTAGAATCTAGAG
>NODE_5_length_29_cov_1.000000
ATACTGTGCAACCGCTTCAATATCTTTAAGAAATTGACGGTCATATGTAATGGATGG
>NODE_6_length_40_cov_1.000000
This is your assembly in fasta format. Each contig is labelled with a node number (just so it has an ID number), the length and the coverage. We can also look at the stats table. I’m using a cut command to only print out the first 7 columns, because the columns after that are not useful.
head stats.txt | cut -f 1-7
ID lgth out in long_cov short1_cov short1_Ocov short2_cov
1 29 1 1 0.000000 1.000000 1.000000 0.000000
2 171 2 2 0.000000 26.801170 24.830409 0.000000
3 9 2 1 0.000000 28.000000 27.000000 0.000000
4 4 2 1 0.000000 26.250000 25.250000 0.000000
5 29 1 1 0.000000 1.000000 1.000000 0.000000
6 40 1 1 0.000000 1.000000 1.000000 0.000000
7 29 1 1 0.000000 1.000000 1.000000 0.000000
8 29 1 1 0.000000 9.103448 9.103448 0.000000
9 29 1 1 0.000000 1.000000 1.000000 0.000000
This tells you the length of your contigs and the amount of coverage or the number of times was sequenced.
Questions
- What is the total length of the contigs? How does that compare to the expected length of 197,394 bp?
We can also use another program to get more stats on the assembly. We’re going to use QUAST (Quality Assessment Tool). QUAST generates reports about the size of your contigs, but additionally you can give it a reference genome and it will compare your assembly to the reference. This can show you if your genome assembly is the full size you expect it to be. We’re going to give QUAST the wild type reference genome and see how our assembly compares.
cd ..
module load StdEnv/2020 gcc/9.3.0 quast/5.2.0
quast -o mutant_quast -r wildtype.fna mutant/contigs.fa
The output of QUAST is text files, but it also has HTML reports. Download the entire mutant_quast directory to your desktop using scp, and then open icarus.html. In the QUAST report section, it includes some important graphs.
This graph shows the cumulative length of contigs starting with the longest and moving shorter, in the red line. The grey line is the total length of the reference genome. In an
ideal world, the red line would reach the grey line quickly, showing that the de novo assembly
is the same length as the reference genome and that most of that length is in only a few contigs.
This is Nx plot. Here we are again ordering our contigs from largest to smallest. The X axis is the contig percentile (i.e. longest to shortest), and the y axis is the length of the contig
in that position. We often report the N50 number, is defined by the length of the shortest contig for which longer and equal length contigs cover at least 50 % of the assembly.
Question
- What is N90 number for this assembly?
- Look at the contig alignment viewer. What is the most obvious difference between this assembly and the reference?
- What is an L50, and what is the L50 for your assembly?
- How many mismatches are there between your assembly and the reference?
For our next section, we’re going to use hifi data to assemble an E. coli genome. We’re not going to do filtering for this set because the quality is already great and it doesn’t actually remove anything, but you should run it for your own projects.
cp /project/ctb-grego/sharing/lab_10/ecoli_hifi.fastq.gz ./
module load StdEnv/2020 hifiasm
hifiasm ecoli_hifi.fastq.gz -l0 -t 10 -o ecoli_hifi
awk '/^S/{print ">"$2;print $3}' ecoli_hifi.bp.p_ctg.gfa > ecoli_hifi.bp.p_ctg.fa
Since the hifi reads are so long and accurate, it allows for the independent assembly of both parental haplotypes. Since this is a bacteria, it is haploid so we’ve given, it the -l0 option to prevent it from producing two assemblies. The -t 10 option tells it to use 10 cores.
You’ll notice as hifiasm runs it produces a histogram of sequence depths. In most cases, the mode of that distribution is the coverage of the sample. From that, we can see that we’ve got roughly 15X coverage, similar to the illumina data we used earlier. The last awk command converts the gfa (which contains extra information) into a normal fasta file.
Lets use quast to look at genome quality, but first we should download a genome to compare it to.
module load StdEnv/2023 gcc/12.3 sra-toolkit/3.0.9
curl -o ecoli_ref.zip 'https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000005845.2/download?include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GFF&include_annotation_type=RNA_FASTA&include_annotation_type=CDS_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=FULLY_HYDRATED'
unzip ecoli_ref.zip
quast -o hifi_quast -r ncbi_dataset/data/GCF_000005845.2/GCF_000005845.2_ASM584v2_genomic.fna ecoli_hifi.bp.p_ctg.fa
Download the hifi_quast directory to your desktop, open icarus.html.
Questions
- How many contigs are in your assembly?
- What is the N50 of the assembly?
- Would you say this is a better or worse genome assembly than the velvet assembly?
Now that we have a genome the next step is to see where genes are. To do this, we need to annotate the genome. This is a challenging problem. There are several different approaches: 1) Sequencing mRNA and using that to guide what genomic regions are transcribed, including splicing patterns. 2) Using a model to look for gene-like sequence motifs. 3) Comparing to other known gene models. 4) Comparing between closely related species genomes and looking for homologous sequences.
There are multiple different programs that take different combinations of the previous approaches. For eukaryotes, MAKER is a common choice. For prokaryotes, the situation is simpler because the genomes are smaller, more gene dense and don’t contain introns. We’re going to use PROKKA to annotate our new genome.
module load StdEnv/2020 gcc/9.3.0 prokka
prokka ecoli_hifi.bp.p_ctg.fa
Prokka compares the sequences to a set of known prokaryotic genes in UniProt and it also looks for more distant similarity to any gene in an annotated prokaryotic genome. Based on this similarity, it labels genes by their likely function.
Questions
- How many genes did prokka annotate?
- Find a gene annotated in both the reference genome and your new genome. Find a gene that is in your new genome, but not the reference genome.
In the last section, we’re going to align two whole genome sequences. For short reads, we used BWA-mem, but for longer sequences we need to use a different program, for example minimap2. For this section, we’re going to download two different genomes from the Arabidopsis genus.
mkdir genome_comparison
cd genome_comparison/
curl --output a_thaliana.zip 'https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000001735.4/download?include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GFF&include_annotation_type=RNA_FASTA&include_annotation_type=CDS_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=FULLY_HYDRATED'
curl --output a_arenosa.zip 'https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_905216605.1/download?include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GFF&include_annotation_type=RNA_FASTA&include_annotation_type=CDS_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=FULLY_HYDRATED'
unzip a_arenosa.zip
unzip a_thaliana.zip
#Note, when asked if you want to overwrite, type "All"
module load minimap2
#Now we're using minimap to align one genome to the other. We use minimap2 instead of bwa2 because minimap is better designed for aligning long sequences (like chromosomes),
#We use the -cx asm20 flag because we are aligning whole genomes of two different species. There are other presets if you're aligning reads instead of contigs, or if you're aligning to the same species.
minimap2 -cx asm20 \
ncbi_dataset/data/GCA_905216605.1/GCA_905216605.1_AARE701a_genomic.fna \
ncbi_dataset/data/GCF_000001735.4/GCF_000001735.4_TAIR10.1_genomic.fna \
> arabidopsis_aligned.paf
head arabidopsis_aligned.paf | cut -f 1-12
NC_003070.9 30427671 302984 342050 - LR999451.1 25224288 123630 170102 35580 47816 60
NC_003070.9 30427671 25069147 25095539 + LR999452.1 14316399 7694740 7721381 24575 27066 60
NC_003070.9 30427671 5562728 5595170 + LR999451.1 25224288 6335141 6371781 29511 37557 60
NC_003070.9 30427671 701442 728409 + LR999451.1 25224288 820561 849097 25132 29082 60
NC_003070.9 30427671 27917258 27951438 + LR999452.1 14316399 11374022 11403129 26332 35248 60
NC_003070.9 30427671 5224185 5256648 + LR999451.1 25224288 5964211 5997757 29202 34577 60
NC_003070.9 30427671 11246938 11276499 - LR999451.1 25224288 15608823 15639766 26174 32415 60
NC_003070.9 30427671 17732524 17756114 + LR999451.1 25224288 18733693 18757375 21903 24129 60
NC_003070.9 30427671 5733060 5756911 + LR999451.1 25224288 6524999 6548507 21628 24502 60
NC_003070.9 30427671 29038801 29064917 + LR999452.1 14316399 12577980 12605368 23429 28506 60
This produces a .paf file, or paired mapping format file. The alignment is in many small pieces, representing
small chunks of the genomes where they are aligned.
By default, it doesn’t actually output the aligned sequences (unlike SAM format), it just tells you were the two sequences are aligned. This is useful for telling where orthologous sequences are between the two genomes. In almost any case, you will also get extra alignment matches between other random chunks of the genome. To understand the broad synteny patterns (i.e. which chromosome in A. arenosa is orthologous to which chromosome in A. thaliana) you need to visualize the matches.
We’re going to the pafr package to show how the two genomes align. Download the paf file and open R.
install.packages("pafr")
library(pafr)
ali <- read_paf("/project/ctb-grego/owens/test/lab_10/genome_comparison/arabidopsis_aligned.paf")
ali
pafr object with 17766 alignments (73.9Mb)
7 query seqs
8 target seqs
12 tags: NM, ms, AS, nn, tp, cm, s1, s2, de, zd, rl .
The pafr package has some tools for visualizing the alignments. First is a dotplot, which shows where alignments are relative to each genome.
dotplot(ali, label_seqs=TRUE, order_by="qstart") + theme_bw()
Another way of representing the alignment is a coverage plot. For this, every spot of the target genome is colored by which part of the query genome aligns to it.
plot_coverage(ali, fill='qname') +
scale_fill_brewer(palette="Set1")
We can see that contig LR999451.1 has blue sequences aligning to it, which correspond to contig NC_003070.9 in the query genome. There is a fair amount of white spots in this plot, suggesting there areas of the reference genome where nothing aligns. This is likely because these two species are fairly divergent and it is challenging to align. An alternative program for more divergent genomes is anchorwave.
Lastly, since the paf data is being stored as a dataframe, we can use our normal filtering and plotting skills. Here we filter down to one chromosome in the query genome.
library(dplyr)
ali %>%
filter(qname == "NC_003070.9") %>%
plot_coverage(., fill='qname') +
scale_fill_brewer(palette="Set1")