The dangers of default parameters in bioinformatics: lessons from Bowtie and TopHat

Updated 2015-04-28 to include a reference to a relevant blog post by Nick Loman.

Most bioinformatics tools are equipped with a vast array of command-line options which let the user configure the inputs, outputs, and performance of the software. You may not wish to explore every possible option when using a particular piece of software, but you should always try to have a look at the manual. In fact I would go so far as to say:

Never use a piece of bioinformatics software for the first time without looking to see what command-line options are available and what default parameters are being used

If a bioinformatics tool provides command-line options, it's a good bet that some of those options will have default parameters associated with them. The most common default parameter that you should always check for might be called something like -p, -t, or --threads. I.e. you should always check to see if a program can make use of multiple CPUs/cores to speed up its operation. The default value for such parameter is usually '1'.

Failure to check (and change) this option will merely result in your program taking longer than it otherwise could, so no harm done. More serious issues arise when you stick with default parameters which might actually make you miss interesting results, or produce misleading results.

E.g. the TopHat RNA-Seq aligner has an command-line option -i which lets you set a minimum intron length. The manual reveals that TopHat will ignore donor/acceptor pairs closer than this many bases apart. The default value for this parameter is 70 bp which is probably fine for all vertebrate genomes, but is probably not suitable for genomes such as Arabidopsis thaliana or Caenorhabditis elegans where there are many introns shorter than this length.

Now let's look at some specific examples which shed some light on what can happen if you stick with the default parameters.

Testing Bowtie

I used Bowtie 2 to align two FASTQ files of bovine RNA-Seq data to the reference cow transcriptome. These two files represented paired read data, about 25 million reads in each file. Following alignment, I filtered the data to only keep uniquely mapped, concordant read pairs. Furthermore, I filtered the data to only keep reads pairs that aligned with the highest possible mapping quality (for Bowtie, this means MAPQ values of 42).

From the resulting set of aligned reads, I calculated the inner read pair distance (see this great post about the difference between this and insert sizes), which I then plotted:

Thicker red vertical line indicates mean.

You will notice that most read pairs have a negative inner read pair distance, i.e. the reads overlap. But do you notice anything odd about these results? Look at the far right of the distribution…the maximum inner read pair distance cuts off abruptly at 300 bp. To me this is very suspicious. After looking in the Bowtie manual I discovered this:

-X <int> The maximum fragment length for valid paired-end alignments. Default: 500 bp.

Aha! My reads are 100 bp, so the cutoff at 300 bp in the above graph represents fragment lengths of 500 bp. If there are reads that correctly map further than 300 bp apart, I am not going to see them because of the default value of -X. So what happens if I increase this default parameter to 2,000 bp?

Thicker red line indicates mean of dataset with -X 500, thicker blue line indicates mean of dataset at -X 2000.

I have colored in the new data in blue; we now see many more paired reads that map further than 300 bp apart. There are actually still some reads that correctly map even further than 1,800 bp apart (the limit imposed by -X 2000). So not only does this default parameter of Bowtie mean that I lose some data, it also means that I lose the type of data that might be particularly useful in some applications where we especially want the reads that map far apart (e.g. transcriptome assembly).

It turns out that Nick Loman was writing about the dangers of the -X parameter of Bowtie back in 2013, and I encourage everyone to read his blog post: Use X with bowtie2 to set minimum and maximum insert sizes for Nextera libraries.

Testing TopHat

The TopHat program actually uses Bowtie for the alignment phase, and there are a couple of default parameters used by TopHat that might cause issues for you. From the TopHat manual:

-r <int>  This is the expected (mean) inner distance between mate pairs. The default is 50 bp.

--mate-std-dev <int> The standard deviation for the distribution on inner distances between mate pairs. The default is 20 bp.

The default value for --mate-std-dev seems worryingly low; how many biological datasets ever have standard deviations that are less than half the mean? So I did some testing…

I used TopHat to map a small sample of my RNA-Seq data (200,000 paired reads) to the cow transcriptome (using the -T option of TopHat). In addition to using the defaults, I also tried various runs that used different values for the -r and --mate-std-dev options. My run with default parameters ended up with 150,235 reads mapped to the transcriptome, whereas increasing those parameters (-r 100 and --mate-std-dev 125) produced fewer reads in my output (138,970). So you might conclude that the default parameters are performing better. However, of the 150,235 reads mapped with default parameters, only 84,996 represent high quality alignments (uniquely mapped, concordant read pairs). My custom parameters produced 105,748 high quality alignments. So the custom parameters are giving me better results overall.

Conclusions

It takes time and effort to be prepared to test command-line options, especially for programs like Bowtie and TopHat which have a lot of options. For TopHat, I ended up doing 40 different test runs in order to investigate the effects of various parameters to work out what was most suitable (and effective) for my input data.

Bioinformatics programs should never be treated as 'black box' devices. You may get results out of them, but you will never know whether you could have more and/or better results from changing some of those default parameters.

Filtering a SAM file generated by TopHat to find uniquely mapped, concordant read pairs: AWK vs SAMtools

Software suites like SAMtools (or should that be [SAMsam]tools?) offer a powerful way to slice and dice files in the SAM, BAM, and CRAM file formats. But sometimes other approaches work just as well.

If you have aligned paired RNA-Seq read data to a genome or transcriptome using TopHat you may be interested in filtering the resulting SAM/BAM file to keep reads that are:

  • a) uniquely aligned (only match one place in target sequence)
  • b) mapped as concordant read pairs (both pairs map to same sequence in correct orientation with suitable distance between them)

TopHat has a --no-discordant command-line option which only report read alignments if both reads in a pair can be mapped, but the name of option is somewhat misleading, as you still end up with discordantly mapped read pairs in the final output (always good to check what difference a command-line option actually makes to your output!.

So if you have a TopHat SAM file, and you wanted to filter it to only keep uniqueley mapped, concordant read pairs, you could use two of the options that the samtools view command provides:

  • -q 50 — This filters on the MAPQ field (5th column of SAM file). TopHat uses a value of 50 in this field to denote unique mappings (this important piece of information is not in the TopHat manual).
  • -f 0x2 — This option filters on the bitwise FLAG score (column 2 of SAM file), and will extract only those mappings where the 2nd bit is set. From the SAM documentation, this is described as each segment properly aligned according to the aligner. In practice this means looking for values that are either 83, 89, 147, or 163 (see this helpful Biobeat blog post for more information about this).

So if you have a SAM file, in essence you just need to filter that file based on matching certain numbers in two different columns. This is something that the Unix AWK tool excels at, and unlike SAMtools, AWK is installed on just about every Unix/Linux system by default. So do both tools give ou the same result? Only one way to find out:

Using SAMtools

The 'unfiltered.sam' file is the result of a TopHat run that used the --no-discordant and --no-mixed options. The SAM file contained 34,340,754 lines of mapping information:

time samtools  view -q 50 -f 0x2 unfiltered.sam > filtered_by_samtools.sam

real    1m57.068s
user    1m18.705s
sys     0m13.712s

Using AWK

time awk '$5 == 50 && ($2 == 163 || $2 == 147 || $2 == 83 || $2 == 99) {print}' unfiltered.sam  > filtered_by_awk.sam

real    1m31.734s
user    0m46.855s
sys     0m15.775s

Does it make a difference?

wc -l filtered_by_*.sam
31710476 filtered_by_awk.sam
31710476 filtered_by_samtools.sam

diff filtered_by_samtools.sam filtered_by_awk.sam

No difference in the final output files, with AWK running quite a bit quicker than SAMtools. In this situation, filtering throws away about 8% of the mapped reads.

More madness with MAPQ scores (a.k.a. why bioinformaticians hate poor and incomplete software documentation)

I have previously written about the range of mapping quality scores (MAPQ) that you might see in BAM/SAM files, as produced by popular read mapping programs. A very quick recap:

  1. Bowtie 2 generates MAPQ scores between 0–42
  2. BWA generates MAPQ scores between 0–37
  3. Neither piece of software describes the range of possible scores in their documentation
  4. The SAM specification defines the possible ranges of the MAPQ score as 0–255 (though 255 should indicate that mapping quality was not available)
  5. I advocated that you should always take a look at your mapped sequence data to see what ranges of scores are present before doing anything else with your BAM/SAM files

So what is my latest gripe? Well, I've recently been running TopHat (version 2.0.13) to map some RNA-Seq reads to a genome sequence. TopHat uses Bowtie (or Bowtie 2) as the tool to do the intial mapping of reads to the genome, so you might expect it to generate the same range of MAPQ scores as the standalone version of Bowtie.

But it doesn't.

From my initial testing, it seems that the BAM/SAM output file from TopHat only contains MAPQ scores of 0, 1, 3, or 50. I find this puzzling and incongruous. Why produce only four MAPQ scores (compared to >30 different values that Bowtie 2 can produce), and why change the maximum possible value to 50? I turned to the TopHat manual, but found no explanation regarding MAPQ scores.

Turning to Google, I found this useful Biostars post which suggests that five MAPQ values are possible with TopHat (you can also have a value of 2 which I didn't see in my data), and that these values correspond to the following:

  • 0 = maps to 10 or more locations
  • 1 = maps to 4-9 locations
  • 2 = maps to 3 locations
  • 3 = maps to 2 locations
  • 50 = unique mapping

The post also reveals that, confusingly, TopHat previously used a value of 255 to indicate uniquely mapped reads. However, I then found another Biostars post which says that a MAPQ score of 2 isn't possible with TopHat, and that the meaning of the scores are as follows:

  • 0 = maps to 5 or more locations
  • 1 = maps to 3-4 locations
  • 3 = maps to 2 locations
  • 255 = unique mapping

This post was in reference to an older version of TopHat (1.4.1) which probably explains the use of the 255 score rather than 50. The comments on this post reflect some of the confusion over this topic. Going back to the original Biostars post, I then noticed a recent comment suggesting that MAPQ scores of 24, 28, 41, 42, and 44 are also possible with TopHat (version 2.0.13).

As this situation shows, when there is no official explanation that fully describes how a piece of software should work, it can lead to mass speculation by others. Such speculation can sometimes be inconsistant which can end up making things even more confusing. This is what drives bioinformaticians crazy.

I find it deeply frustrating when so much of this confusion could be removed with better documentation by the people that developed the original software. In this case the documentation needs just one paragraph added; something along the lines of…

Mapping Quality scores (MAPQ)
TopHat outputs MAPQ scores in the BAM/SAM files with possible values 0, 1, 2, or 50. The first three values indicate mappings to 5, 3–4, or 2 locations, whereas a value of 50 represents a unique match. Please note that older versions of TopHat used a value of 255 for unique matches. Further note that standalone versions of Bowtie and Bowie 2 (used by TopHat) produce a different range of MAPQ scores (0–42).

Would that be so hard?