Teaser: a solution for our read mapping dilemma?

October 27, 2015 by Keith Bradnam

A paper recently published in Genome Biology by Smolka et al. may offer some help with the problem of choosing which read mapping program to use in order to align a set of sequencing reads to a genome:

Teaser: Individualized benchmarking and optimization of read mapping results for NGS data

The paper starts by neatly summarising the problem:

Recent and ongoing advances in sequencing technologies and applications lead to a rapid growth of methods that align next generation sequencing reads to a reference genome (read mapping). By mid 2015, nearly 100 different mappers are available, although not all are equally suited for a given application or dataset.

The program Teaser attempts to automate the benchmarking of not just different mappers, but also (some of) the different parameters that are available to these programs. The latter problem should not be underestimated. The Bowtie 2 program describes almost 100 different command-line options in its documentation and many of these options control how Bowtie runs and/or what output it generates.

Teaser uses small sets of simulated read data, leading to very quick run times (< 30 minutes for many comparisons), but you can also supply real data to it. By default, Teaser will test the performance of five read mapping programs: BWA, BWA-MEM, BWA-SW, Bowtie2, and NextGenMap.
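
To get a feel for how quickly the number of test runs grows once parameters are included, here is a minimal Python sketch that enumerates mapper and parameter combinations. It is not Teaser's configuration format or its default parameter set: the grid and the function name are hypothetical, and the options shown (Bowtie 2's -L and -N, BWA-MEM's -k) are just a few real options with illustrative values.

```python
from itertools import product

# Hypothetical grid: a few real options per mapper, with illustrative values.
# This is NOT Teaser's configuration format or its default parameter set.
PARAMETER_GRID = {
    "bowtie2": {"-L": [20, 22], "-N": [0, 1]},  # seed length, seed mismatches
    "bwa mem": {"-k": [15, 19]},                # minimum seed length
}


def enumerate_runs(grid):
    """Yield (mapper, options) pairs, one per parameter combination."""
    for mapper, params in grid.items():
        flags = list(params)
        for values in product(*(params[flag] for flag in flags)):
            yield mapper, dict(zip(flags, values))


runs = list(enumerate_runs(PARAMETER_GRID))
for mapper, options in runs:
    rendered = " ".join(f"{flag} {value}" for flag, value in options.items())
    print(f"{mapper} {rendered}")
```

Even this toy grid gives six separate runs, which is why automating the comparison on small simulated datasets is attractive.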

Impressively, you can run Teaser on the web as well as running it as a standalone program. The web output includes results displayed graphically for many different test datasets (x-axis):

The paper concludes by asking the community to submit optimal parameter combinations to the Teaser GitHub repository:

Teaser is easy to use and at the same time extendable to other methods and parameters combinations. Future work will include the incorporation of benchmarking RNA-Seq mappers and variant calling methods. We furthermore encourage the scientific community to contribute the optimal parameter combinations they detected to our github repository (available at github.com/Cibiv/Teaser) for their particular organism of interest. This will help others to quickly select the optimal combination of mapper and parameter values using Teaser.

I can't wait for the companion program Firecat!

2015-10-26 11.05: Updated to remove specific references to software versions of mapping tools.

Help us do science! I’ve teamed up with researcher Paige Brown Jarreau to create a survey of ACGT readers. By participating, you’ll be helping me improve ACGT and contributing to the SCIENCE on blog readership. You will also get FREE science art from Paige's Photography for participating, as well as a chance to win a t-shirt and other perks! It should only take 10–15 minutes to complete.

You can find the survey here: http://bit.ly/mysciblogreaders

10 years of Open Access at the Wellcome Trust in 10 numbers [Link] →

October 27, 2015 by Keith Bradnam

A great summary of how the Wellcome Trust has helped drive big changes in open access publishing. Of the ten numbers that the post uses to summarise the last decade, this one surprised me the most:

20% – the volume of UK-funded research which is freely available at the time of publication
A recent study commissioned by Universities UK found that 20% of articles authored by UK researchers and published in the last two years were freely accessible upon publication. This figure increases to 24% within six months of publication, and 32% within 12 months.

If you had asked me to guess what this number would be, I think I would have been far too optimistic. Even the figure of 32% of articles being free within 12 months seems lower than I would imagine. Lots of progress still to be made!

ORCID: binding the (academic) galaxy together

October 26, 2015 by Keith Bradnam

Adapted from picture by flickr user Jim & Rachel McArthur

I am a supporter of ORCID's goals to help establish unique identifiers for researchers. Such identifiers can then be used to help connect a researcher with all of their inputs and outputs that surround their career. Most fundamentally, these inputs and outputs are grants and papers, but there is the potential for ORCID identifiers to link a person to much more, e.g. the organisations that they work for, manuscript reviews, code repositories, published slides, even blog posts.

For ORCID to succeed it has to be global and connect all parts of the academic network, a network that spans national boundaries. On this point, I am very impressed by the effort that ORCID makes in ensuring that their excellent outreach materials are not only available in English. As shown below, ORCID's 'Distinguish yourself' flyer is available in 9 different languages. Other material is also available in Russian, Greek, Turkish, and Danish. If your desired language is not available, they welcome volunteers to help translate their message into more languages. Email community@orcid.org if you want to help.

Welcome to the JABBA menagerie: a collection of animal-themed, bogus bioinformatics names…that have nothing to do with animals!

October 23, 2015 by Keith Bradnam

Bioinformaticians make the worst zookeepers:

A

2011 — ANTELOPE: Analysis of Networks through TEmporal-LOgic sPEcifications …has nothing to do with antelopes

B

2014 — BISON: BISulfite alignment On Nodes of a cluster …has nothing to do with bisons

C

2011 — CORAL: CORrection with ALignments …has nothing to do with corals

D

2010 — DODO: DOmain based Detection of Orthologs …has nothing to do with dodos

E

2011 — EMU: Extractor of MUtations …has nothing to do with emus
2014 — EAGLE: Enhanced Artificial Genome Engine (no 'L'?!?) …has nothing to do with eagles

F

2014 — FALCON: FAst Localization algorithm based on a CONtinuous-space formulation …has nothing to do with falcons
2015 — FROG: FingeRprinting Ontology of Genomic variations …has nothing to do with frogs
2017 — FROGS: Find, Rapidly, OTUs with Galaxy Solution …also has nothing to do with frogs

G

2015 — GECKO: GEnome Comparison with K-mers Out-of-core …has nothing to do with geckos
2009 — GORILLA: Gene Ontology enRIchment anaLysis and visuaLizAtion …has nothing to do with gorillas
2018 — GRASShopPER: GPU overlap GRaph ASSembler using Paired End Reads …has nothing to do with grasshoppers

H

2010 — HAMSTeRS: Haemophilia A Mutation, Structure, Test, and Resource Site …has nothing to do with hamsters

I

2013 — INSECT: IN-silico SEarch for Co-occurring Transcription factors …has nothing to do with insects

J

2014 — JAGuaR: Junction Alignments to Genome for RNA-seq reads …has nothing to do with jaguars

K

L

M

2002 — MOUSE: Mitochondrial and Other Useful SEquences …has nothing to do with mice
2014 — MONGOOSE: MetabOlic Network GrOwth Optimization Solved Exactly …has nothing to do with mongooses

N

O

2013 — ORCA: mOdel-dRiven disCovery and Analysis …has nothing to do with orcas

P

2015 — PANDA: Pathway AND Annotation explorer …has nothing to do with pandas
2005 — PANTHER: Protein ANalysis THrough Evolutionary Relationships …has nothing to do with panthers
2014 — PIGEONS: Photographically InteGrated En-suite for the OligoNucleotide Screening …has nothing to do with pigeons
2014 — PuFFIN: Positioning for Fuzzy and FIxed Nucleosomes …has nothing to do with puffins

Q

R

2013 — RAVEN: Reconstruction, Analysis and Visualization of mEtabolic Networks …has nothing to do with ravens

S

2006 — SPIDer: Saccharomyces Protein-protein Interaction Database …has nothing to do with spiders
2009 — SHRiMP: SHort Read Mapping Package …has nothing to do with shrimps

T

2008 — TiGER: Tissue-specific Gene Expression and Regulation …has nothing to do with tigers

U

V

W

X

Y

Z

2004 — ZEBRA: Zebra finch Expression BRain Atlas …has nothing to do with zebras

Other suggestions welcome! Only requirements are that:

The name is bogus, i.e. not a straightforward acronym and worthy of a JABBA award
The acronym is named after an animal (or animal grouping)
The software/tool has nothing to do with the animal in question

Great Scott! Five fun facts about DNA sequencing from 1985

October 21, 2015 by Keith Bradnam

As everyone is celebrating a certain 2015–themed calendar event today, I thought we could instead go back to the ~~future~~ past of DNA sequencing.

1.

Thirty years ago there were no automated sequencing machines. However, Sanger sequencing technology could still provide longer reads than most of Illumina's machines today, e.g. from this paper (A rapid procedure for DNA sequencing using transposon-promoted deletions in Escherichia coli):

The length of the sequence that could be read from each gel in a single run varied from 175 to 200 nt.

2.

The idea of sequencing nuclear genomes was still largely a pipe dream, but smaller genomes were tractable. 1985 saw the addition of the Xenopus laevis mitochondrial genome to the tiny collection of organelle genome sequences. Figure 3 of this paper displayed the full sequence, spread over six pages that looked like this:

Including long DNA sequences in journal articles was a surprisingly common practice at this time.

3.

There were two releases of GenBank in 1985. The second release saw the database grow to an astounding set of 5,700 sequences, totalling 5,204,420 bp. For comparison, this year also saw the release of the Commodore 128 home computer which came with 128 KB of RAM. The first 3.5" hard drives were only a couple of years old, and could store 10 MB (so capable of storing the DNA sequences in GenBank, but possibly not the associated annotation).
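
As a rough check on those storage numbers, here is a back-of-the-envelope sketch in Python. It assumes one byte per base for plain text, two bits per base for a packed encoding (only A, C, G, and T, no ambiguity codes), and decimal megabytes for the drive; none of these assumptions come from the GenBank release notes.

```python
GENBANK_BP = 5_204_420       # total bases in the second 1985 release (from the text)
DRIVE_BYTES = 10 * 1000**2   # 10 MB hard drive, assuming decimal megabytes
RAM_BYTES = 128 * 1024       # Commodore 128: 128 KB of RAM

# Plain text: one byte per base.
ascii_bytes = GENBANK_BP

# Packed: two bits per base, so four bases per byte (rounded up).
packed_bytes = -(-GENBANK_BP // 4)

print(f"Plain text: {ascii_bytes:,} bytes ({ascii_bytes / DRIVE_BYTES:.0%} of the drive)")
print(f"2-bit packed: {packed_bytes:,} bytes")
print(f"Plain text is {ascii_bytes / RAM_BYTES:.1f} times the Commodore 128's RAM")
```

Under these assumptions the raw sequence fills about half of the drive, and it would need to be read in many small pieces on a machine with 128 KB of memory.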

4.

The SEQ-ED program was published, allowing the handling of 'long DNA sequences' that were 'up to 200 Kbp'.

5.

Somewhat amazingly, people were writing bioinformatics software for Apple computers. The journal CABIOS included this paper:

PEGASE: a machine language program for DNA sequence analysis on Apple II microcomputer using a binary coding of nucleotides

But how did people distribute software in the days when there was no GitHub, SourceForge, or indeed…no world wide web?

For both code and source of PEGASE, please send two blank 5" diskettes and indicate precisely your system configuration (there is a slight difference between the Apple II+ and the Apple IIe version which depends on the availability of lower case characters).

BINGO, DINGO, PINGO, RINGO, and SPINGO

October 21, 2015 by Keith Bradnam

Sounds like these should be characters in a children's TV show.

Dovetail takes flight [Link] →

October 21, 2015 by Keith Bradnam

If you ever want to know about the latest developments in sequencing, you owe it to yourself to follow Keith Robison's blog. In his latest post he talks about the launch of the new de novo assembly service from Dovetail Genomics. Keith concludes:

Personally, a pure service offering is very attractive, since that means not having to find internal resources to learn the new technology and then execute on it. I checked with Dovetail, and while I don't have $40K burning a hole in my pocket, if I did I could grab something out of the garden or from the local seafood market, I really could have a complex genome scaffold of my very own in about two months. That's an exciting vision, and perhaps will be a major force in the sunsetting of science's tolerance for highly fragmented draft genomes.

Readers may also enjoy Bio-IT World's report on this new Dovetail service.

Another survey on bioinformatics practices

October 21, 2015 by Keith Bradnam

I recently wrote about the bioinformatics survey that Nick Loman and Tom Connor published. Well, if people are interested, there is another bioinformatics survey happening, organised by Elia Brodsky (@EliaBrodsky).

Elia works at Pine Biotech and he says that the results of the survey will be publicized on their website.

You can take the survey here and you can read more details about it on Elia's LinkedIn post: Bioinformatics - useful or just frustrating?

Another hard-to-pronounce bioinformatics software name

October 21, 2015 by Keith Bradnam

This was from a few months ago, published in the journal Nucleic Acids Research:

CATH FunFHMMer web server: protein functional annotations using functional family assignments

So how do you pronounce 'FunFHMMer'? I can imagine several possibilities:

1. Fun-eff-aitch-em-em-er
2. Fun-eff-aitch-em-mer
3. Fun-eff-hammer
4. Fünf-hammer

Reading the manuscript suggests that 'FunF' stems from 'FunFam(s)' which in turn is derived from 'functional families'. This would suggest that options 1 or 3 above might be the correct way to pronounce this software's name.

The fully expanded description of this web server's name becomes a bit of a mouthful:

Class Architecture Topology Homologous Superfamily Functional Families Hidden Markov Model (maker?)

We asked 272 bioinformaticians…name something that makes you angry: more reflections on the poor state of software documentation.

October 19, 2015 by Keith Bradnam

I'd like to share the details of a recent survey conducted by Nick Loman and Thomas Connor that tried to understand current issues with bioinformatics practice and training.

The survey was announced on Twitter and attracted almost 300 responses. Nick and Tom have kindly placed the results of the survey on Figshare so that others can play with the data (it seems fitting to talk about this today as it is International Open Access Week):

Bioinformatics infrastructure and training survey http://dx.doi.org/10.6084/m9.figshare.1572287

When you ask a bunch of bioinformaticians the question "What things most frustrate you or limit your ability to carry out bioinformatics analysis?", you can be sure that you will attract some passionate, and often amusing, answers (I particularly liked someone's response to this question: "Not enough Heng Li").

I was struck by how many people raised the issue of poor, incomplete, or otherwise terrible software documentation as a problem (there were at least 42 responses that mentioned this). The availability of 'good documentation' was also listed as the 2nd most important factor when choosing software to use.

I recently wrote about whether this problem is something that really needs to be dealt with by journals and by the review process. It shouldn't be enough that software is available and that it works; we should have some minimal expectation for what documentation should accompany bioinformatics software.

Keith's 10 point checklist for reviewing software

If you are ever in a position to review a software-based manuscript, please check for the following:

Is there a plain text README file that accompanies the software and which explains what the program does and who created it?
Is there a comprehensive manual available somewhere that describes what every option of the program does?
Is there a clear version number or release date for the software?
Does the software provide clear installation instructions (where relevant) that actually work?
Is the software accompanied by an appropriate license?
For command-line programs, does the program give some sensible output when no arguments are provided?
For command-line programs, does the program give some sensible output when -h and/or --help is specified (see this old post of mine for more on this topic)?
For command-line programs, does the built-in help/documentation agree with the external documentation (text/PDF), i.e. do they both list the same features/options?
For script based software (Perl, Python etc.), does the code contain a reasonable level of comments that allow someone with relevant coding experience to understand what the major sections of the program are trying to do?
Is there a contact email address (or link to support web page) provided so that a user can ask questions and get more help?
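
The command-line items on this checklist are cheap to satisfy. As a minimal sketch, assuming a Python script built on the standard argparse module, the following prints its help text when run with no arguments, supports -h/--help automatically, and reports a version. The program name seqtool, the version string, and the FASTA-counting task are all hypothetical, chosen only for illustration.

```python
import argparse
import sys

__version__ = "0.1.0"  # hypothetical version number


def build_parser():
    """Create the parser; argparse adds -h/--help automatically."""
    parser = argparse.ArgumentParser(
        prog="seqtool",  # hypothetical program name
        description="Count the sequences in a FASTA file.",
    )
    parser.add_argument("fasta", help="path to the input FASTA file")
    parser.add_argument(
        "--version", action="version", version=f"%(prog)s {__version__}"
    )
    return parser


def count_records(path):
    """Count FASTA header lines (those starting with '>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def main(argv=None):
    parser = build_parser()
    arguments = sys.argv[1:] if argv is None else argv
    if not arguments:
        # No arguments at all: show the full help rather than a terse error.
        parser.print_help()
        return 1
    args = parser.parse_args(arguments)
    print(count_records(args.fasta))
    return 0
```

A real script would finish by calling sys.exit(main()) under the usual if __name__ == "__main__": guard. Because the built-in help is generated from the same parser that handles the options, it cannot drift out of sync with what the program actually accepts, which helps with the point about built-in and external documentation agreeing.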

I'm not expecting every piece of bioinformatics software to tick all 10 of these boxes, but most of these are relatively low-hanging fruit. If you are not prepared to provide useful documentation for your software, then you should also be prepared for people to choose not to use your software, and for reviewers to reject your manuscript!