Alignment · Introduction to Computational Genomics

Objectives

To learn how to align sequence data to a reference.
To learn how to interpret sam and bam files.
To learn how to manipulate a bam file.

In this lab we’re going to learn how to align illumina data to a reference genome. This is the first step in many genomic analyses. We start off with a large number of anonymous reads from somewhere in the genome (or perhaps a different genome all together) and we need to know where they come from before we can say anything about genetic variation or gene expression. Log into the server to get started, and source.

After logging in, let’s make a new directory for this lab.

#First we source our bash profile to make sure we have access to the modules.
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh

cd /project/biol470-grego/2026/YOUR_NAME
#Next make your new directory
mkdir lab_6
#And enter it
cd lab_6

I’ve put the required files in a shared directory on the system. We’ll copy it over to our current directory.

cp /project/ctb-grego/sharing/lab6_data.tar.gz ./

This is a tar.gz file. A tar file (originally named for tape archive) is a way of putting multiple files into a single file. In our case, we are also gzipping it, which compresses the size. Most programs you download will be stored in something like a tar.gz file. Before we can access the files in the tar.gz file, we have to open and unzip it.

tar -xzf lab6_data.tar.gz

It’s hard to remember the specific letter flags to use when unzipping a tar.gz file, so my trick is to picture someone with a thick german accent saying “Extract Zee Files!”, for X, Z, F

ls

lab6_data.tar.gz  data

The files we need are in the data directory. Let’s take a look.

cd data
ls

B123.C10.1.fq.gz  B123.C10.2.fq.gz  B127.C10.1.fq.gz  B127.C10.2.fq.gz l_densiflorus.chr1.fa.gz

Here we have sequence data for an endangered species, Lupinus densiflorus, in Victoria. There are two samples labelled based on what barcodes were used. In this case *.1.fq.gz is read one, and *.2.fq.gz is read two from paired-end sequencing. In general, it’s better to have a long and descriptive name (e.g. “SpeciesX.Pop102.Sample53.Tissue39”), than a short non-descriptive one ( e.g. “Sample_A”). When plotting the data, you can switch to easier to read sample names if necessary.

We also have a reference genome, l_densiflorus.chr1.fa.gz. This contains just the first chromosome to speed up our processing. Normally, your reference genome should include the entire genome sequence.

Before we can start aligning, we need to index our reference genome. Last week we indexed our reference genomes using samtools, which we’re doing again. The alignment program we’re using also needs to index the reference genome.

module load StdEnv/2023
module load samtools/1.18
module load bwa-mem2/2.2.1

samtools faidx l_densiflorus.chr1.fa.gz
bwa-mem2 index l_densiflorus.chr1.fa.gz

We now have index files for our reference genome. Take a look at them. In contrast to the .fai file we worked with last lab, these files are not actually useful to us on their own. They’re necessary for the alignment program, but they’re only machine readable. Regardless, lets align our first sample!

bwa-mem2 mem -t 2 l_densiflorus.chr1.fa.gz B123.C10.1.fq.gz B123.C10.2.fq.gz > B123.C10.sam

This will output a lot of text as it progresses. For most projects, this step will take many minutes or hours, but here we’re working with a small dataset, it should be finished within a minute.

We now have our data in “sam” format. This stands for Sequence Alignment/Map format. In the sam file, we have our base calls for each read, along with the quality score. This means a sam file has all the information in our original fastq format, plus additional info! Specifically, the sam file contains information about whether and where each read has mapped in the genome.

head B123.C10.sam | less -S

@SQ     SN:ptg000008l   LN:11744959
@PG     ID:bwa-mem2     PN:bwa-mem2     VN:2.2.1        CL:bwa-mem2 mem -t 2 l_densiflorus.chr1.fa.gz B123.C10.1.fq.gz B123.C10.2.fq.gz
290_1_1101_4179_1042    77      *       0       0       *       *       0       0       TGCAGAGGTG
290_1_1101_4179_1042    141     *       0       0       *       *       0       0       CGGGTGGCAT
290_1_1101_32834_1210   77      *       0       0       *       *       0       0       TGCAGGCTGT
290_1_1101_32834_1210   141     *       0       0       *       *       0       0       CGGATTGGAT
290_1_1101_36920_1280   77      *       0       0       *       *       0       0       TGCAGCTCAA
290_1_1101_36920_1280   141     *       0       0       *       *       0       0       CGGGATATGG
290_1_1101_26969_1336   77      *       0       0       *       *       0       0       TGCAGGAACA
290_1_1101_26969_1336   141     *       0       0       *       *       0       0       CGGTTTTAGC

The first line is a header, which list all the contigs in the reference genome (only one in this case), along with their size. After that, we have more header lines that specify all the programs and commands used to create this file. In this case, it has the bwa-mem2 call (and specific information about the flags used for that command call). As we modify the sam file, more lines will get added to the header recording the steps we did. This is very helpful if you forget what you did exactly.

Next, we have one line per read with a bunch of columns.

The query template name. This is a unique name for each read or read pair.
The bitwise flag. This number tells us about the read and its mapping. There are multiple different possible things the read could have, and the total value of the number can the translated into that information. For example, if the read is paired, add + 1. If the read is unmapped, add + 8. Therefore, if the final number is 9, that means it is paired and unmapped. The decoding of all possible numbers can be found here.
The contig that the read is mapped to
The starting location where the read maps to, on the contig.
The mapping quality. 60 is the highest (generally) and means high confidence in where the read is mapped. The lowest value is 0, which means it is equally likely to map elsewhere.
The CIGAR string describing how the read is aligned.
The ID of the read pair mate (in this case, the same as the original)
The mapping location of the read pair mate.
The distance between read map pairs (sort of)
The sequence of the read
The base quality of the read
Any extra flags.

Questions

For one read, imagine you find a CIGAR string “12M1I113M”. Refer to the sam format manual and figure out what this means for how the read is aligned.
Find a read with the highest mapping quality in your dataset using command line tools.
What proportion of your reads are aligned? You can do this using a samtools function, or calculate it directly from the sam file.
What are three possible reasons why a read could have very low mapping quality?

As you can tell, reads are not ordered based on where they are aligned in the genome. Most programs require that reads be ordered by their position in the genome because it allows them to be rapidly accessed. Imagine if the reads were randomly ordered but we wanted to get all the reads from the start of chromosome 1. The program would need to search through the entire file to find those reads before it could use them (which can take a while). When reads are ordered, programs can know exactly where to find all the reads for a particular region.

Therefore, we are going to sort the sam file by read position. At the same time, we’ll convert it to a “bam” file. This is a binary version of the sam file, so the file size is smaller, but we will need to use samtools to convert it to a human readable version if we later want to look at it.

samtools sort B123.C10.sam > B123.C10.sort.bam

We’ve now changed the name to end with .sort.bam to remind us that this version of the file is sorted and it is a bam file. To access the file now, using standard text viewing programs won’t work. We need to use samtools view. Lets try that.

#This will print weird characters, because its in binary
head B123.C10.sort.bam

#This will convert the file to standard text and then print it to screen
samtools view -h B123.C10.sort.bam | head

Another important step in processing alignment files is to mark duplicates. These are cases where we have sequenced the same molecule more than once. Since they don’t represent independent observations of the genome, we need to mark them so that we only use one copy in our calculations. We’ll use picardtools for this.

module load picard/3.1.0
java -jar $EBROOTPICARD/picard.jar MarkDuplicates I=B123.C10.sort.bam O=B123.C10.sort.markdup.bam M=B123.C10.sort.dupmetrics.txt

In cases where there are many options to a single command, it can be easier to put each option on its own line. As long as we end each line with \ the system will treat it as if it was a single line. Be careful not to have a space or anything after the \, because then it will think that the line really has ended (and your command will mess up).

#Alternate formatting for command
java -jar $EBROOTPICARD/picard.jar MarkDuplicates \
  I=B123.C10.sort.bam \
  O=B123.C10.sort.markdup.bam \
  M=B123.C10.sort.dupmetrics.txt

Although we are running mark duplicates to learn how to do it, the data we’re working on is Genotype By Sequencing (GBS). This approach uses restriction enzymes to cut DNA before sequencing, so all reads start at a cut site. This results in many more reads being classified as duplicates, than a library preparation approach that uses random DNA shearing like a typical whole genome sequencing library.

Lastly, we need to index the bam file. Many programs require the file to be indexed, before it can be used. The index files are prefixed with .bai for bam index.

samtools index B123.C10.sort.markdup.bam

Unlike when working with .fai files, the .bai is not human readable and isn’t helpful to look at.

We now have a nicely sorted and labelled bam file. Lets get some stats on it, using samtools commands.

samtools coverage B123.C10.sort.markdup.bam

This shows you a summary of the depth and coverage for your sample. Depth indicates how many reads are on each base (on average) and coverage tells you how much of the reference sequence is covered by at least one base.

Questions

Use samtools flagstats to find out what percent of reads are mapped to the genome.

We can also look at the alignment itself by using samtools tview.

samtools tview B123.C10.sort.markdup.bam l_densiflorus.chr1.fa.gz

Once you’re inside, use the “?” key to bring up the help menu. Then use keys to move and look at your alignment. Most of the genome does not have data, so use the “g” key to move to a specific location and go to base “260610”. Here you can see some aligned reads.

We often sequence many samples, and each of them needs to be processed individually. We could make one long script that does each step for each sample individually, but it’s much more manageable to use loops to repeat the same steps for many different samples. Lets go through a toy example to learn some tricks.

First lets start with a list of fastq files. Make this file in nano by copying and pasting and name it “fastq_files.txt”. In the future you could make a file this using the ls command.

cat fastq_files.txt

sample01.1.fq.gz
sample01.2.fq.gz
sample02.1.fq.gz
sample02.2.fq.gz
sample29.1.fq.gz
sample29.2.fq.gz

First thing is that we have both R1 and R2 files, but we’re going to be treating those together as a single sample. So lets filter out the R2 lines using “grep”. The “grep” command will keep any line that has the matching string. We can also reverse this pattern search with the -v flag, causing it to exclude any line that matches the pattern.

cat fastq_files.txt | grep -v 2.fq.gz > fastq_files.R1.txt

Now we have a list of files that we want to work on. Here are two ways we can loop through each line in that file.

Option 1:

for x in `cat fastq_files.R1.txt`
do
  echo $x
done

In this option, we’re doing a creating our list of $x variables by using a mini command call in the `` marks. We could modify it on the fly with additional filters. For example Option 1a:

for x in `cat fastq_files.R1.txt | grep -v sample29`
do
  echo $x
done

Here we’ve excluded sample29.

In the second option, we use a “while” loop and the “read” command. The file that it is reading is specified at the end of the while loop. Option 2:

while read x; do
  echo $x
done < fastq_files.R1.txt

A second challenge when making shell scripts is that we often want to make a series of files with similar names but different suffixes. In this case we start off with “sample29.1.fq.gz”. The bam file we create from this should be sample29.bam, not sample29.1.fq.gz.bam.

Here are two options for removing suffixes. Option 1:

x=sample01.1.fq.gz
y=$(basename $x .1.fq.gz)
echo $y

The “basename” removes a suffix that you specify for a string. It can also function to remove directories before a filename.

x=fastq/data/sample01.1.fq.gz
y=$(basename $x)
echo $y

A more general approach is to use sed (a.k.a. stream editor) to modify a variable. Sed can be used to find and substitute any character pattern in a string. It works like this: s/character to find/character to replace it with/g The “s” at the start is for subtitution. The “g” at the end is for global (which means it finds all instances of the replacement, not just the first).

Take a look at these examples and see if you understand how they work:

echo "cats" | sed s/a/o/g

echo "cats" | sed s/a//g

echo "cats" | sed s/a/aaaaaa/g

echo "cats" | sed s/ats/ats_are_funky/g

We can use this to rename variables.

x=sample01.1.fq.gz
y=$(echo $x | sed s/.1.fq.gz//g)
echo $y

Lastly, don’t forget you can build prefixes onto variable based names by wrapping the variable in ‘{}’.

x=sample01.1.fq.gz
y=$(echo $x | sed s/.1.fq.gz//g)
echo $y
echo ${y}.fq.gz

By default, a variable name doesn’t expect to have a “.” in it. So $y.txt will take the $y variable and add “.txt”. Using {} around a variable is a safe way of adding to a variable string, but isn’t always

Questions

Use multiple sed commands to change the first sentence into the second sentence. “My favorite city is Vancouver because Vancouver is much better than Victoria” into “My favorite city is Victoria because Victoria is much better than Vancouver”.
Extract the sample name using a command from the string “fastq/pool_1/Pool_1.sample902_R1.fastq.gz”

For the lab questions, you’ll be working on a small set of real data. Lets copy that into your directory

cd /project/biol470-grego/2026/YOUR_NAME
mkdir lab_6_questions
cd lab_6_questions
cp /project/ctb-grego/sharing/lab_6/* ./