Filtering a SAM file generated by TopHat to find uniquely mapped, concordant read pairs: AWK vs SAMtools

April 15, 2015 by Keith Bradnam

Software suites like SAMtools (or should that be [SAMsam]tools?) offer a powerful way to slice and dice files in the SAM, BAM, and CRAM file formats. But sometimes other approaches work just as well.

If you have aligned paired RNA-Seq read data to a genome or transcriptome using TopHat you may be interested in filtering the resulting SAM/BAM file to keep reads that are:

a) uniquely aligned (only match one place in target sequence)
b) mapped as concordant read pairs (both pairs map to same sequence in correct orientation with suitable distance between them)

TopHat has a --no-discordant command-line option which only report read alignments if both reads in a pair can be mapped, but the name of option is somewhat misleading, as you still end up with discordantly mapped read pairs in the final output (always good to check what difference a command-line option actually makes to your output!.

So if you have a TopHat SAM file, and you wanted to filter it to only keep uniqueley mapped, concordant read pairs, you could use two of the options that the samtools view command provides:

-q 50 — This filters on the MAPQ field (5th column of SAM file). TopHat uses a value of 50 in this field to denote unique mappings (this important piece of information is not in the TopHat manual).
-f 0x2 — This option filters on the bitwise FLAG score (column 2 of SAM file), and will extract only those mappings where the 2nd bit is set. From the SAM documentation, this is described as each segment properly aligned according to the aligner. In practice this means looking for values that are either 83, 89, 147, or 163 (see this helpful Biobeat blog post for more information about this).

So if you have a SAM file, in essence you just need to filter that file based on matching certain numbers in two different columns. This is something that the Unix AWK tool excels at, and unlike SAMtools, AWK is installed on just about every Unix/Linux system by default. So do both tools give ou the same result? Only one way to find out:

Using SAMtools

The 'unfiltered.sam' file is the result of a TopHat run that used the --no-discordant and --no-mixed options. The SAM file contained 34,340,754 lines of mapping information:

time samtools  view -q 50 -f 0x2 unfiltered.sam > filtered_by_samtools.sam

real    1m57.068s
user    1m18.705s
sys     0m13.712s

Using AWK

time awk '$5 == 50 && ($2 == 163 || $2 == 147 || $2 == 83 || $2 == 99) {print}' unfiltered.sam  > filtered_by_awk.sam

real    1m31.734s
user    0m46.855s
sys     0m15.775s

Does it make a difference?

wc -l filtered_by_*.sam
31710476 filtered_by_awk.sam
31710476 filtered_by_samtools.sam

diff filtered_by_samtools.sam filtered_by_awk.sam

No difference in the final output files, with AWK running quite a bit quicker than SAMtools. In this situation, filtering throws away about 8% of the mapped reads.

Inconsistent bioinformatics branding: SAMtools vs Samtools vs samtools

April 13, 2015 by Keith Bradnam

The popular Sequence Alignment Map format, SAM, has given rise to an equally popular toolkit for working with SAM files (and BAM, CRAM too). But what is the name of this tool?

SAMtools?

If we read the official publication, then we see this software described as 'SAMtools' (also described by Wikipedia in this manner).

Samtools?

Head to the official website and we see consistent references to 'Samtools'.

samtools?

Head to the official GitHub repository and we see consistent references to 'samtools'.

This is not exactly a problem that is halting the important work of bioinformaticians around the world, but I find it surprising that all of these names are in use by the people that developed the software. Unix-based software is typically — but not always — implemented as a set of lower-case commands and this can add one level of confusion when comparing a tool's name to the actual commands that are run ('samtools' is what you type at the terminal). However, you can still be consistent in your documentation!

More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)

March 17, 2015 by Keith Bradnam

I have previously written about the range of mapping quality scores (MAPQ) that you might see in BAM/SAM files, as produced by popular read mapping programs. A very quick recap:

Bowtie 2 generates MAPQ scores between 0–42
BWA generates MAPQ scores between 0–37
Neither piece of software describes the range of possible scores in their documentation
The SAM specification defines the possible ranges of the MAPQ score as 0–255 (though 255 should indicate that mapping quality was not available)
I advocated that you should always take a look at your mapped sequence data to see what ranges of scores are present before doing anything else with your BAM/SAM files

So what is my latest gripe? Well, I've recently been running TopHat (version 2.0.13) to map some RNA-Seq reads to a genome sequence. TopHat uses Bowtie (or Bowtie 2) as the tool to do the intial mapping of reads to the genome, so you might expect it to generate the same range of MAPQ scores as the standalone version of Bowtie.

But it doesn't.

From my initial testing, it seems that the BAM/SAM output file from TopHat only contains MAPQ scores of 0, 1, 3, or 50. I find this puzzling and incongruous. Why produce only four MAPQ scores (compared to >30 different values that Bowtie 2 can produce), and why change the maximum possible value to 50? I turned to the TopHat manual, but found no explanation regarding MAPQ scores.

Turning to Google, I found this useful Biostars post which suggests that five MAPQ values are possible with TopHat (you can also have a value of 2 which I didn't see in my data), and that these values correspond to the following:

0 = maps to 10 or more locations
1 = maps to 4-9 locations
2 = maps to 3 locations
3 = maps to 2 locations
50 = unique mapping

The post also reveals that, confusingly, TopHat previously used a value of 255 to indicate uniquely mapped reads. However, I then found another Biostars post which says that a MAPQ score of 2 isn't possible with TopHat, and that the meaning of the scores are as follows:

0 = maps to 5 or more locations
1 = maps to 3-4 locations
3 = maps to 2 locations
255 = unique mapping

This post was in reference to an older version of TopHat (1.4.1) which probably explains the use of the 255 score rather than 50. The comments on this post reflect some of the confusion over this topic. Going back to the original Biostars post, I then noticed a recent comment suggesting that MAPQ scores of 24, 28, 41, 42, and 44 are also possible with TopHat (version 2.0.13).

As this situation shows, when there is no official explanation that fully describes how a piece of software should work, it can lead to mass speculation by others. Such speculation can sometimes be inconsistant which can end up making things even more confusing. This is what drives bioinformaticians crazy.

I find it deeply frustrating when so much of this confusion could be removed with better documentation by the people that developed the original software. In this case the documentation needs just one paragraph added; something along the lines of…

Mapping Quality scores (MAPQ)
TopHat outputs MAPQ scores in the BAM/SAM files with possible values 0, 1, 2, or 50. The first three values indicate mappings to 5, 3–4, or 2 locations, whereas a value of 50 represents a unique match. Please note that older versions of TopHat used a value of 255 for unique matches. Further note that standalone versions of Bowtie and Bowie 2 (used by TopHat) produce a different range of MAPQ scores (0–42).

Would that be so hard?

Understanding MAPQ scores in SAM files: does 37 = 42?

December 16, 2014 by Keith Bradnam

The official specification for the Sequence Alignment Map (SAM) format outlines what is stored in each column of this tab-separated value file format. The fifth column of a SAM file stores MAPping Quality (MAPQ) values. From the SAM specification:

MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

So if you happened to know that the probability of correctly mapping some random read was 0.99, then the MAPQ score should be 20 (i.e. log10 of 0.01 * -10). If the probability of a correct match increased to 0.999, the MAPQ score would increase to 30. So the upper bounds of a MAPQ score depends on the level of precision of your probability (though elswhere in the SAM spec, it defines an upper limit of 255 for this value). Conversely, as the probability of a correct match tends towards zero, so does the MAPQ score.

So I'm sure that the first thing that everyone does after generating a SAM file is to assess the spread of MAPQ scores in your dataset. Right? Anyone?

< sound of crickets >

Okay, so maybe you don't do this. Maybe you don't really care, and you are happy to trust the default output of whatever short read alignment program that you used to generate your SAM file. Why should it matter? Will these scores really vary all that much?

Here is a frequency distribution of MAPQ scores from two mapping experiments. The bottom panel zooms in to more clearly show the distribution of low frequency MAPQ scores:

Distribution of MAPQ scores from two experiments: bottom panel shows zoomed in view of MAPQ scores with frequencies < 1%. Click to enlarge.

What might we conclude from this? There seems to be some clear differences between both experiments. The most frequent MAPQ scores in the first experiment are 42 followed by 1. In the second experiment, scores only reach a maximum value of 37, and scores of 0 are the second most frequent value.

These two experiments reflect some real world data. Experiment 1 is based on data from mouse, and experiment 2 uses data from Arabidopsis thaliana. However, that is probably not why the distributions are different. The mouse data is based on unpaired Illumina reads from a DNase-Seq experiment, wheras the A. thaliana data is from paired Illumina reads from whole genome sequencing. However, that still probably isn't the reason for the differences.

The reason for the different distributions is that experiment 1 used Bowtie 2 to map the reads whereas experiment 2 used BWA. It turns out that different mapping programs calculate MAPQ scores in different ways and you shouldn't really compare these values unless they came from the same program.

The maximum MAPQ value that Bowtie 2 generates is 42 (though it doesn't say this anywhere in the documentation). In contrast, the maximum MAPQ value that BWA will generate is 37 (though once again, you — frustratingly — won't find this information in the manual).

The data for Experiment 1 is based on a sample of over 25 million mapped reads. However, you never see MAPQ scores of 9, 10, or 20, something that presumably reflects some aspect of how Bowtie 2 calculates these scores.

In the absence of any helpful information in the manuals of these two popular aligners, others have tried doing their own experimentation to work out what the values correspond to. Dave Tang has a useful post on Mappinq Qualities on his Musings from a PhD Candidate blog. There are also lots of posts about mapping quality on the SEQanswers site (e.g. see here, here or here). However, the prize for the most detailed investigation of MAPQ scores — from Bowtie 2 at least — goes to John Urban who has written a fantastic post on his Biofinysics blog:

How does Bowtie 2 assign MAPQ scores?

So in conclusion, there are 3 important take home messages:

MAPQ scores vary between different programs and you should not directly compare results from, say, Bowtie 2 and BWA.
You should look at your MAPQ scores and potentially filter out the really bad alignments.
Bioinformatics software documentation can often omit some really important details (see also my last blog post on this subject).

How user-friendly should bioinformatics documentation be?

November 25, 2014 by Keith Bradnam

Imagine that you have never seen a SAM output file before. Now imagine that you are relatively new to bioinformatics, perhaps you are PhD student doing a rotation in a bioinformatics lab. If you are asked to work with some SAM files, you might reasonably want to look at the SAM documentation to understand the structure of this 11-column plain text file format.

Let's consider just the second column of a SAM output file. You've been looking at the SAM file that your boss provided to you and you notice that column 2 is full of integer values, mostly 0, 4, and 16. You want to know what these mean and so you turn to the relevant section of the SAM documentation to find out more about column 2:

Column 2 — FLAG: bitwise FLAG

Each bit is explained in the following table:

Bit — Description
0x1 — template having multiple segments in sequencing
0x2 — each segment properly aligned according to the aligner
0x4 — segment unmapped
0x8 — next segment in the template unmapped
0x10 — SEQ being reverse complemented
0x20 — SEQ of the next segment in the template being reversed
0x40 — the first segment in the template
0x80 — the last segment in the template
0x100 — secondary alignment
0x200 — not passing quality controls
0x400 — PCR or optical duplicate
0x800 — supplementary alignment

For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’. This line is called the primary line of the read.

Bit 0x100 marks the alignment not to be used in certain analyses when the tools in use are aware of this bit. It is typically used to flag alternative mappings when multiple mappings are presented in a SAM.

Bit 0x800 indicates that the corresponding alignment line is part of a chimeric alignment. A line flagged with 0x800 is called as a supplementary line.

Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.

If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or the index is lost in data processing.

If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.

So having read all of this, my question to you is: what does a value of zero in your SAM file correspond to?

To me this is far from clear from the documentation. You first have to understand what bitwise actually means. You then need to understand that these bitwise flag values will be represented as an integer value in the SAM file (this is mentioned in passing elsewhere in the documentation).

Finally, you must deduce that a value of zero in your SAM output file means that no bitwise flags have been set. So, if the 3rd 'segment unmapped' bit isn't set, then that means that your segment (i.e. sequence) was mapped. Likewise, the lack of a 5th bit (reverse complemented) means that your sequence match must be on the forward strand.

Phew. I find this to be be frustratingly opaque and in desperate need of some examples. Particularly because zero values in a SAM output file are among the most common things that a user will see. The above table could also benefit from including equivalent integer values, to make it clearer than 0x10 = 16, 0x20 = 32 etc.

I've raised a GitHub issue regarding these points. The larger issue here is that I think software developers sometimes assume too much about the skill set of their end users and fail to write their documentation in terms that mere mortals will understand.