How to ask for bioinformatics help online

Part two of a two-part series.

In part one I covered where to ask for bioinformatics help. Now it is time to turn to how you should go about asking for it. Hat tip to reader Venu Thatikonda (@nerd_yie) for pointing me to this 2011 PLOS Computational Biology article, which covers similar ground to this blog post. Here are my five main suggestions, with the last one broken down into nine different tips:

  1. Be polite. Posting a question to an online forum does not mean that you deserve to be answered. If people do answer, they are usually doing so by giving up their own free time to try to help you. Don't berate people for their answers, or insult them in any way.
  2. Be relevant. Choose the right forum in which to ask your question. Sites like SEQanswers have different forums that discuss particular topics, so don't post your PacBio question in the Ion Torrent forum.
  3. Be aware of the rules. Most online forums will have some rules, guidelines, and/or an FAQ which covers general posting etiquette and other things that you should know. It is a good idea to check this before posting on a site for the first time.
  4. Be clever. Search the forum before asking your question; there is often a good chance that it has already been asked (and answered) by others.
  5. Be helpful. The biggest thing you can probably do to get a useful answer to a question is to provide as many useful details as possible. These include:
    1. Type of operating system and version number, e.g. Mac OS X 10.10.5.
    2. Version number/name of software tool(s) you are using, e.g. NCBI BLAST+ v2.2.26, Perl v5.18.2 etc. A good bioinformatics or Unix tool will have a -v, -V, or --version command-line option that will give you this information.
    3. Any error message that you saw. Report the full error message exactly as it appeared.
    4. Where possible, provide steps that would let someone else reproduce the problem (assuming it is reproducible).
    5. Outline the steps that you have tried, if any, to fix the problem. Don't wait for someone to suggest 'quit and restart your terminal' before you reply 'Already tried that'.
    6. A description of what you were expecting to happen. Some perceived errors are not actually errors at all (the software was doing exactly what was asked of it, though this may not be what the user was expecting).
    7. Any other information that could help someone troubleshoot your problem, e.g. a listing of your Unix terminal before and/or after you ran a command which caused a problem.
    8. A snippet of your data that would allow others to reproduce the problem. You may not be able to upload data to the website in question, but small data snippets could be shared via a Dropbox or Google Drive link, or on sites like GitHub Gist.
    9. Attach a screenshot that illustrates the problem. Many forum sites allow you to add image files to a post.
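
Tips 1 and 2 above (operating system and tool versions) are easy to collect automatically. Here is a minimal Python sketch of one way to do it. The function names are hypothetical, and the tools and flags in the example call are only illustrations; substitute whatever you actually ran and whichever version flag each tool documents.

```python
import platform
import shutil
import subprocess


def tool_version(command, flag):
    """Return the first line a tool prints for its version flag."""
    if shutil.which(command) is None:
        return f"{command}: not found on PATH"
    try:
        result = subprocess.run(
            [command, flag],
            capture_output=True,
            text=True,
            timeout=10,
            stdin=subprocess.DEVNULL,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{command}: could not run ({exc})"
    # Some tools print version information to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    first_line = output.splitlines()[0] if output else "no output"
    return f"{command} {flag}: {first_line}"


def environment_report(tools):
    """Build a plain-text block that can be pasted into a forum post.

    `tools` is a list of (command, version_flag) pairs.
    """
    lines = [
        f"OS: {platform.system()} {platform.release()}",
        f"Python: {platform.python_version()}",
    ]
    lines.extend(tool_version(command, flag) for command, flag in tools)
    return "\n".join(lines)


if __name__ == "__main__":
    # Example tools only; list the ones relevant to your question.
    print(environment_report([("perl", "-v"), ("blastn", "-version")]))
```

Pasting the resulting block at the top of a question saves at least one round of 'which version are you using?' replies.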

Any other suggestions?



2015-11-08 09.44: Added link to PLOS Computational Biology article

Where to ask for bioinformatics help online

Part one of a two-part series. In part two I tackle the issue of how to ask for help online.

You have many options when seeking bioinformatics help online. Here are eleven possible places to ask for help, loosely arranged by their usefulness (as perceived by me):

  1. SEQanswers — the most popular online forum devoted to bioinformatics?
  2. Biostars — another very popular forum.
  3. Mailing lists — many useful bioinformatics tools have their own mailing lists where you can ask questions and get help from the developers or from other users, e.g. SAMtools and Bioconductor. Also note that resources such as Ensembl have their own mailing lists for developers.
  4. Google Discussion Groups — as well as having very general discussion groups, e.g. Bioinformatics, there are also groups like Tuxedo Tool Users…the perfect place to ask your TopHat or Cufflinks question.
  5. Stack Overflow — more suited for questions related to programming languages or Unix/Linux.
  6. Google — I'm including this here because I have solved countless bioinformatics problems just by searching Google with an error message.
  7. Reddit — try asking in r/bioinformatics or r/genome.
  8. Twitter — this may be more useful if you have enough followers who know something about bioinformatics, but it is potentially a good place to ask a question, though not a great forum for long questions (or replies). Try using the hashtag #askabioinformatician (this was @sjcockell's idea).
  9. Voat — Voat is like reddit's younger, hipster nephew. However, the bioinformatics 'subverse' is not very active.
  10. ResearchGate — you may know it better as 'that site that sends me email every day', but some people use this site to ask questions about science. Surprisingly, they have 15 different categories relating to bioinformatics.
  11. LinkedIn — Another generator of too many emails, but they do have discussion groups for bioinformatics geeks and NGS.

Other suggestions welcome.



2015-11-02 09.53: Added twitter at the suggestion of Stephen Turner (@nextgenseek).

Welcome to the JABBA menagerie: a collection of animal-themed, bogus bioinformatics names…that have nothing to do with animals!

Bioinformaticians make the worst zookeepers:

Other suggestions welcome! Only requirements are that:

  1. The name is bogus, i.e. not a straightforward acronym and worthy of a JABBA award
  2. The acronym is named after an animal (or animal grouping)
  3. The software/tool has nothing to do with the animal in question

Understanding MAPQ scores in SAM files: does 37 = 42?

The official specification for the Sequence Alignment Map (SAM) format outlines what is stored in each column of this tab-separated value file format. The fifth column of a SAM file stores MAPping Quality (MAPQ) values. From the SAM specification:

MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

So if you happened to know that the probability of correctly mapping some random read was 0.99, then the MAPQ score should be 20 (i.e. -10 × log10(0.01)). If the probability of a correct match increased to 0.999, the MAPQ score would increase to 30. So the upper bound of a MAPQ score depends on the level of precision of your probability (though elsewhere in the SAM spec, it defines an upper limit of 255 for this value). Conversely, as the probability of a correct match tends towards zero, so does the MAPQ score.
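
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula quoted from the SAM specification; the function names are made up for illustration, and it says nothing about how any particular aligner actually arrives at its scores.

```python
import math


def mapq_from_error_probability(p_wrong):
    """MAPQ = -10 * log10(Pr{mapping position is wrong}), rounded to an integer."""
    if not 0 < p_wrong <= 1:
        raise ValueError("p_wrong must be greater than 0 and at most 1")
    return round(-10 * math.log10(p_wrong))


def error_probability_from_mapq(mapq):
    """Invert the formula: the error probability implied by a MAPQ score."""
    return 10 ** (-mapq / 10)


# A read mapped correctly with probability 0.99 has an error probability of 0.01.
print(mapq_from_error_probability(1 - 0.99))   # 20
print(mapq_from_error_probability(1 - 0.999))  # 30
# If the mapping is certainly wrong, the score is zero.
print(mapq_from_error_probability(1.0))        # 0
```

Going the other way, `error_probability_from_mapq(37)` is roughly 0.0002, which is the error probability that a score of 37 would imply if it were taken at face value.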

So I'm sure that the first thing that everyone does after generating a SAM file is to assess the spread of MAPQ scores in your dataset. Right? Anyone?

< sound of crickets >

Okay, so maybe you don't do this. Maybe you don't really care, and you are happy to trust the default output of whatever short read alignment program that you used to generate your SAM file. Why should it matter? Will these scores really vary all that much?

Here is a frequency distribution of MAPQ scores from two mapping experiments. The bottom panel zooms in to more clearly show the distribution of low frequency MAPQ scores:

Distribution of MAPQ scores from two experiments: bottom panel shows zoomed in view of MAPQ scores with frequencies < 1%.
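
If you want to produce this kind of frequency distribution for your own data, the counts can be tallied directly from the fifth column of a SAM file. The following Python sketch shows one way to do it. It is not the code used to make the figure; the function names are hypothetical, and it assumes an uncompressed, text-format SAM file.

```python
from collections import Counter


def mapq_counts(sam_lines):
    """Count MAPQ values (column 5) across mapped reads in SAM-format lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM alignment lines have 11 mandatory fields
            continue
        flag = int(fields[1])
        if flag & 0x4:  # FLAG bit 0x4 marks an unmapped read
            continue
        counts[int(fields[4])] += 1
    return counts


def mapq_percentages(counts):
    """Convert raw counts into percentages, ordered by MAPQ value."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mapq: 100 * n / total for mapq, n in sorted(counts.items())}


# Usage with a hypothetical file name:
#     with open("alignments.sam") as handle:
#         for mapq, pct in mapq_percentages(mapq_counts(handle)).items():
#             print(mapq, f"{pct:.2f}%")
```

Unmapped reads are skipped here so that the percentages describe mapped reads only; whether that is the right choice depends on what you want the distribution to show.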

What might we conclude from this? There seem to be some clear differences between the two experiments. The most frequent MAPQ scores in the first experiment are 42 followed by 1. In the second experiment, scores only reach a maximum value of 37, and scores of 0 are the second most frequent value.

These two experiments reflect some real world data. Experiment 1 is based on data from mouse, and experiment 2 uses data from Arabidopsis thaliana. However, that is probably not why the distributions are different. The mouse data is based on unpaired Illumina reads from a DNase-Seq experiment, whereas the A. thaliana data is from paired Illumina reads from whole genome sequencing. However, that still probably isn't the reason for the differences.

The reason for the different distributions is that experiment 1 used Bowtie 2 to map the reads whereas experiment 2 used BWA. It turns out that different mapping programs calculate MAPQ scores in different ways and you shouldn't really compare these values unless they came from the same program.

The maximum MAPQ value that Bowtie 2 generates is 42 (though it doesn't say this anywhere in the documentation). In contrast, the maximum MAPQ value that BWA will generate is 37 (though once again, you — frustratingly — won't find this information in the manual).

The data for Experiment 1 is based on a sample of over 25 million mapped reads. However, you never see MAPQ scores of 9, 10, or 20, something that presumably reflects some aspect of how Bowtie 2 calculates these scores.

In the absence of any helpful information in the manuals of these two popular aligners, others have tried doing their own experimentation to work out what the values correspond to. Dave Tang has a useful post on Mapping Qualities on his Musings from a PhD Candidate blog. There are also lots of posts about mapping quality on the SEQanswers site (e.g. see here, here or here). However, the prize for the most detailed investigation of MAPQ scores (from Bowtie 2 at least) goes to John Urban, who has written a fantastic post on his Biofinysics blog.

So in conclusion, there are three important take-home messages:

  1. MAPQ scores vary between different programs and you should not directly compare results from, say, Bowtie 2 and BWA.
  2. You should look at your MAPQ scores and potentially filter out the really bad alignments.
  3. Bioinformatics software documentation can often omit some really important details (see also my last blog post on this subject).
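
As a rough illustration of the second point, here is a minimal Python sketch that drops alignments below a chosen MAPQ threshold from SAM-format text. The function name is hypothetical, and the threshold is deliberately left as a parameter: given the first point, a sensible cut-off depends on which aligner produced the file.

```python
def filter_by_mapq(sam_lines, min_mapq):
    """Yield header lines plus alignment lines with MAPQ >= min_mapq.

    Note that a MAPQ of 255 means 'not available' in the SAM specification,
    so such records pass any threshold here; treat them separately if needed.
    """
    for line in sam_lines:
        if line.startswith("@"):  # keep the header intact
            yield line
            continue
        fields = line.split("\t")
        if len(fields) >= 5 and int(fields[4]) >= min_mapq:
            yield line


# Usage with hypothetical file names:
#     with open("alignments.sam") as src, open("filtered.sam", "w") as dst:
#         dst.writelines(filter_by_mapq(src, min_mapq=10))
```

In practice you would more likely use `samtools view -q`, which skips alignments with a MAPQ below the value you give it; the sketch simply makes explicit what such a filter does.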