The dangers of default parameters in bioinformatics: lessons from Bowtie and TopHat

Updated 2015-04-28 to include a reference to a relevant blog post by Nick Loman.

Most bioinformatics tools are equipped with a vast array of command-line options which let the user configure the inputs, outputs, and performance of the software. You may not wish to explore every possible option when using a particular piece of software, but you should always try to have a look at the manual. In fact I would go so far as to say:

Never use a piece of bioinformatics software for the first time without looking to see what command-line options are available and what default parameters are being used

If a bioinformatics tool provides command-line options, it's a good bet that some of those options will have default parameters associated with them. The most common default parameter to check for is usually called something like -p, -t, or --threads: i.e. you should always see whether a program can make use of multiple CPUs/cores to speed up its operation. The default value for such a parameter is usually '1'.
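For example, with an aligner such as Bowtie 2 the relevant option is -p (a sketch; the index and file names here are hypothetical):

# -p 8 uses 8 threads instead of the default of 1
bowtie2 -p 8 -x my_index -1 reads_1.fastq -2 reads_2.fastq -S output.sam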

Failure to check (and change) this option will merely result in your program taking longer than it otherwise could, so no harm done. More serious issues arise when you stick with default parameters which might actually make you miss interesting results, or produce misleading results.

For example, the TopHat RNA-Seq aligner has a command-line option -i which lets you set a minimum intron length; the manual reveals that TopHat will ignore donor/acceptor pairs closer than this many bases apart. The default value for this parameter is 70 bp, which is probably fine for vertebrate genomes but is probably not suitable for genomes such as Arabidopsis thaliana or Caenorhabditis elegans, where many introns are shorter than this.
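If you work on a species with short introns, you might therefore lower this value (a hedged sketch; the index and file names are hypothetical, and a suitable value will depend on your species):

# -i 30 lowers the minimum intron length from the default of 70 bp
tophat -i 30 -o tophat_out worm_index reads_1.fastq reads_2.fastq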

Now let's look at some specific examples which shed some light on what can happen if you stick with the default parameters.

Testing Bowtie

I used Bowtie 2 to align two FASTQ files of bovine RNA-Seq data to the reference cow transcriptome. These two files represented paired read data, with about 25 million reads in each file. Following alignment, I filtered the data to keep only uniquely mapped, concordant read pairs that aligned with the highest possible mapping quality (for Bowtie 2, this means a MAPQ value of 42).
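In outline, the steps looked something like this (a sketch with hypothetical file names, not my exact commands):

# align paired reads to the transcriptome index
bowtie2 -x cow_tx -1 reads_1.fastq -2 reads_2.fastq -S aligned.sam
# keep only concordant pairs (-f 0x2) at the maximum MAPQ of 42 (-q 42)
samtools view -h -q 42 -f 0x2 aligned.sam > filtered.sam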

From the resulting set of aligned reads, I calculated the inner read pair distance (see this great post about the difference between this and insert sizes), which I then plotted:

Thicker red vertical line indicates mean.

You will notice that most read pairs have a negative inner read pair distance, i.e. the reads overlap. But do you notice anything odd about these results? Look at the far right of the distribution…the maximum inner read pair distance cuts off abruptly at 300 bp. To me this is very suspicious. After looking in the Bowtie manual I discovered this:

-X <int> The maximum fragment length for valid paired-end alignments. Default: 500 bp.

Aha! My reads are 100 bp, so the cutoff at 300 bp in the above graph represents fragment lengths of 500 bp (100 bp + 300 bp + 100 bp). If there are read pairs that correctly map further than 300 bp apart, I am not going to see them because of the default value of -X. So what happens if I increase this default parameter to 2,000 bp?
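Re-running the alignment with a larger limit just means adding the option (same hypothetical file names as before):

# allow fragments up to 2,000 bp instead of the default 500 bp
bowtie2 -x cow_tx -X 2000 -1 reads_1.fastq -2 reads_2.fastq -S aligned_X2000.sam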

Thicker red line indicates mean of dataset with -X 500, thicker blue line indicates mean of dataset at -X 2000.

I have colored the new data in blue; we now see many more read pairs that map further than 300 bp apart. There are actually still some reads that correctly map even further than 1,800 bp apart (the inner-distance limit that -X 2000 imposes for 100 bp reads). So not only does this default parameter of Bowtie 2 mean that I lose some data, it also means that I lose the type of data that might be particularly useful in applications where we especially want the reads that map far apart (e.g. transcriptome assembly).

It turns out that Nick Loman was writing about the dangers of the -X parameter of Bowtie back in 2013, and I encourage everyone to read his blog post: Use -X with bowtie2 to set minimum and maximum insert sizes for Nextera libraries.

Testing TopHat

The TopHat program actually uses Bowtie for the alignment phase, and there are a couple of default parameters used by TopHat that might cause issues for you. From the TopHat manual:

-r <int>  This is the expected (mean) inner distance between mate pairs. The default is 50 bp.

--mate-std-dev <int> The standard deviation for the distribution on inner distances between mate pairs. The default is 20 bp.

The default value for --mate-std-dev seems worryingly low; how many biological datasets ever have standard deviations that are less than half the mean? So I did some testing…

I used TopHat to map a small sample of my RNA-Seq data (200,000 paired reads) to the cow transcriptome (using the -T option of TopHat). In addition to using the defaults, I also tried various runs with different values for the -r and --mate-std-dev options. My run with default parameters ended up with 150,235 reads mapped to the transcriptome, whereas increasing those parameters (-r 100 and --mate-std-dev 125) produced fewer reads in my output (138,970). So you might conclude that the default parameters are performing better.

However, of the 150,235 reads mapped with default parameters, only 84,996 represent high-quality alignments (uniquely mapped, concordant read pairs). My custom parameters produced 105,748 high-quality alignments, so the custom parameters are giving me better results overall.
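The two kinds of runs being compared look something like this (a sketch; the file names are hypothetical, and note that -T requires a transcriptome annotation supplied via -G):

# run with the default -r (50) and --mate-std-dev (20)
tophat -T -G cow_genes.gtf -o default_out bt_index reads_1.fastq reads_2.fastq
# run with custom inner-distance parameters
tophat -T -G cow_genes.gtf -r 100 --mate-std-dev 125 -o custom_out bt_index reads_1.fastq reads_2.fastq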

Conclusions

Being prepared to test command-line options takes time and effort, especially for programs like Bowtie and TopHat which have a lot of options. For TopHat, I ended up doing 40 different test runs to investigate the effects of various parameters and work out what was most suitable (and effective) for my input data.

Bioinformatics programs should never be treated as 'black box' devices. You may get results out of them, but you will never know whether you could have obtained more and/or better results by changing some of those default parameters.

7 things that brilliant scientists do: the last item might surprise you!

I was presenting in our weekly lab meeting and I used the occasion to talk on a few different subjects (each of which will probably end up as a separate blog post). I started by trying to get people to think about the different things that scientists do and I offered my own list of suggestions:

The last item on the list is maybe not something that everyone thinks about. I believe that scientists should have some curiosity about their field (and about science in general), and one way that this can manifest itself is by asking questions.

I then sprang a surprise on the lab by revealing that for the last year, I've secretly been tracking who asks questions in our lab meetings. The data is incomplete: I have probably missed some questions and not everyone has been to every lab meeting. Also, we have some new people in the lab who weren't here a year ago. But in any case, I showed the (anonymized) data, which shows how many times (out of 41 lab meetings) each person asked at least one question:

I stressed that it isn't automatically a good thing if you are on the left-hand side of the graph, and it isn't automatically a bad thing if you are on the right-hand side of the graph. But clearly, we have some people in our lab meetings who ask more questions than others. I then shared some reasons why I think many people do not ask questions:

The 2nd point is very common for younger scientists, and I fully understand that people can be nervous about asking what they feel might be a stupid question (especially to people that they perceive as being 'above them'). But if you don't understand something…then it is not a stupid question. It is only stupid if you never ask and instead sit through meetings being confused.

The 3rd point sometimes happens when the speaker is the junior person and the audience doesn't want to risk giving them a hard time. It is laudable to be kind to others, but this doesn't mean that you can't ask questions. If nobody ever asked questions, then scientists would never get the useful feedback about which parts of their talks are confusing or unclear.

Learning to ask (and answer) questions is a useful, and necessary, skill for scientists. If you can't practice this skill in the relative safety of a lab meeting, then when will you practice it? I ended this section of my lab talk by pointing out that your scientific curiosity (or lack thereof) is something that may be used when people assess you (for job interviews, promotions, etc.):

Viewing online figures in a Nucleic Acids Research article…an exercise in frustration

I found a Nucleic Acids Research article that I wanted to read online. The article in question had — like most articles — some figures. Here's how my web browser displays one of the figures:

I've blurred out the surrounding text so as not to risk any copyright issues (and also to let you focus on just the figure). If your eyesight is like mine, you may feel that the three subpanels of this figure are too small to be of much use.

So I clicked the image…

The white rectangle enclosing the figure increases in size from about 250 x 130 pixels to about 460 x 210. If I squint, I can just about make out some of the figure text, but it remains far from clear.

So I clicked the image…

Now I see exactly the same image as before, but it's on a new webpage which has a wider figure legend. The new legend is actually a little bit harder to read than before as there are more words per line. I'm really not sure what the point of this page is.

So I clicked the image…

Hooray! Now I can finally see the figure at a decent size where all of the figure text is legible. It only took me three clicks to get there. But when I turned to the legend to make sense of the figure, I found that there was no legend!

So you can have your figure at a reasonable size without the legend, or you can have the legend but only with a small version of the figure. It is obviously beyond the journal's ability to give you both.

Great post by Lex Nederbragt about why we need graph-based representations of genome sequences

In a recent blog post, Lex Nederbragt explains why we all need to be moving to graph-based representations of genome sequences (and getting away from linear representations).

In this post I will provide some background, explain the reasons for moving towards graph-based representations, and indicate some challenges associated with this development.

After listing the many challenges involved in moving towards a graph-based future, he refers to the fact that current efforts have not been widely adopted:

Two file formats to represent the graph have been developed: FASTG and GFA. FASTG has limited uptake; only two assembly programs (ALLPATHS-LG and SPAdes) will output in that format. GFA parsing is currently only experimental in the ABySS assembler, and [the VG program] is able to output it.

The lack of a widely recognized and supported standard for representing variation inherent in a genome sequence is, in my opinion, a major barrier to moving forward. Almost all bioinformatics software that works with genome sequence data expects a single sequence with no variation. It will require a whole new generation of tools to work with a variant-based format, but tool developers will be reluctant to write new code if there is no clear agreement on what is the new de facto file format.
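To make the discussion slightly more concrete, this is roughly what a minimal GFA file looks like (an illustrative, hand-written fragment; fields are tab-separated): two sequence segments (S lines) joined by a link (L line), which is how a graph can record a connection that a single linear sequence cannot:

H	VN:Z:1.0
S	1	ACGTACGTACGT
S	2	TTGCATT
L	1	+	2	+	0M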

101 questions with a bioinformatician #25: Alex Bateman

Alex Bateman is in charge of Protein Sequence Resources at the European Bioinformatics Institute (EBI). You might know him from his role in developing such popular bioinformatics tools as Pfam, Rfam, and TreeFam (rumors abound that their planned database of Ox genome resources had to be cancelled because they couldn’t secure the desired name).

A recipient of the Benjamin Franklin Award for Open Access in the Life Sciences, Alex has also been an enthusiastic advocate of Wikipedia as a resource that scientists should utilize more. To borrow a couple of British-isms, Alex is a ‘jolly nice chap’ and a ‘thoroughly decent bloke’, and I say that as someone who once had to stand on a milk crate with him as part of a management training exercise (don’t ask!). I sometimes think that he has found the elixir of life as he never seems to age. Oh, and you should also be aware that he has a black belt in origami.

You can find out more about Alex by visiting his group’s website at the EBI, or by following him on Twitter (@alexbateman1). And now, on to the 101 questions…


Filtering a SAM file generated by TopHat to find uniquely mapped, concordant read pairs: AWK vs SAMtools

Software suites like SAMtools (or should that be (SAM|Sam|sam)tools?) offer a powerful way to slice and dice files in the SAM, BAM, and CRAM file formats. But sometimes other approaches work just as well.

If you have aligned paired RNA-Seq read data to a genome or transcriptome using TopHat you may be interested in filtering the resulting SAM/BAM file to keep reads that are:

  • a) uniquely aligned (only match one place in target sequence)
  • b) mapped as concordant read pairs (both pairs map to same sequence in correct orientation with suitable distance between them)

TopHat has a --no-discordant command-line option which only reports read alignments if both reads in a pair can be mapped, but the name of this option is somewhat misleading, as you still end up with discordantly mapped read pairs in the final output (it's always good to check what difference a command-line option actually makes to your output!).
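One quick way to check this for yourself (a sketch; accepted_hits.bam is TopHat's standard output file) is to count the alignments that are not flagged as properly paired:

# count alignments where the 'properly paired' FLAG bit (0x2) is NOT set
samtools view -c -F 0x2 accepted_hits.bam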

So if you have a TopHat SAM file and you want to filter it to keep only uniquely mapped, concordant read pairs, you could use two of the options that the samtools view command provides:

  • -q 50 — This filters on the MAPQ field (5th column of SAM file). TopHat uses a value of 50 in this field to denote unique mappings (this important piece of information is not in the TopHat manual).
  • -f 0x2 — This option filters on the bitwise FLAG score (column 2 of SAM file), and will extract only those mappings where the 2nd bit is set. The SAM documentation describes this as 'each segment properly aligned according to the aligner'. In practice this means looking for FLAG values of 83, 99, 147, or 163 (see this helpful Biobeat blog post for more information, and the note after this list).
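If you don't want to decode FLAG values by hand, recent versions of SAMtools include a flags subcommand that will do it for you (the exact output format may vary):

# decode a FLAG value; 99 = 1 + 2 + 32 + 64, i.e. paired + properly
# paired + mate on reverse strand + first in pair
samtools flags 99
# e.g. 0x63  99  PAIRED,PROPER_PAIR,MREVERSE,READ1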

So if you have a SAM file, in essence you just need to filter that file based on matching certain numbers in two different columns. This is something that the Unix AWK tool excels at, and unlike SAMtools, AWK is installed on just about every Unix/Linux system by default. So do both tools give you the same result? Only one way to find out:

Using SAMtools

The 'unfiltered.sam' file is the result of a TopHat run that used the --no-discordant and --no-mixed options. The SAM file contained 34,340,754 lines of mapping information:

time samtools view -q 50 -f 0x2 unfiltered.sam > filtered_by_samtools.sam

real    1m57.068s
user    1m18.705s
sys     0m13.712s

Using AWK

time awk '$5 == 50 && ($2 == 163 || $2 == 147 || $2 == 83 || $2 == 99) {print}' unfiltered.sam  > filtered_by_awk.sam

real    1m31.734s
user    0m46.855s
sys     0m15.775s

Does it make a difference?

wc -l filtered_by_*.sam
31710476 filtered_by_awk.sam
31710476 filtered_by_samtools.sam

diff filtered_by_samtools.sam filtered_by_awk.sam

No difference in the final output files, with AWK running quite a bit quicker than SAMtools. In this situation, filtering throws away about 8% of the mapped reads.

More duplicate names for bioinformatics software: a tale of two HIPPIES

Thanks to Sara Gosline (@sargoshoe) for bringing this to my attention. Compare and contrast the following:

The former tool, published in 2012 in PLOS ONE, takes its name from 'Human Integrated Protein-Protein Interaction rEference' (it was doing so well until it reached the last letter). The latter tool ('High-throughput Identification Pipeline for Promoter Interacting Enhancer elements') was published in 2014 in the journal Bioinformatics.

Leaving aside the issue of whether these names are worthy of a JABBA award, the issue here is that we have yet another duplicate set of software names for two different bioinformatics tools. The authors of the 2nd paper could, and should, have checked for 'prior art'.

If you are planning to develop a new bioinformatics tool and have thought of a possible name, please take the time to do the following:

  1. Visit http://google.com (or your preferred web search engine)
  2. In the search box type the proposed name of your tool followed by a space
  3. Then add the word 'bioinformatics'
  4. Click search
  5. That's it

Inconsistent bioinformatics branding: SAMtools vs Samtools vs samtools

The popular Sequence Alignment Map format, SAM, has given rise to an equally popular toolkit for working with SAM files (and BAM, CRAM too). But what is the name of this tool?

SAMtools?

If we read the official publication, then we see this software described as 'SAMtools' (also described by Wikipedia in this manner).

Samtools?

Head to the official website and we see consistent references to 'Samtools'.

samtools?

Head to the official GitHub repository and we see consistent references to 'samtools'.

This is not exactly a problem that is halting the important work of bioinformaticians around the world, but I find it surprising that all of these names are in use by the people who developed the software. Unix-based software is typically — but not always — implemented as a set of lower-case commands, and this can add a level of confusion when comparing a tool's name to the actual commands that are run ('samtools' is what you type at the terminal). However, you can still be consistent in your documentation!