Teaser: a solution for our read mapping dilemma?

A paper recently published in Genome Biology by Smolka et al. may help with the problem of choosing which read mapping program to use to align a set of sequencing reads to a genome:

The paper starts by neatly summarising the problem:

Recent and ongoing advances in sequencing technologies and applications lead to a rapid growth of methods that align next generation sequencing reads to a reference genome (read mapping). By mid 2015, nearly 100 different mappers are available, although not all are equally suited for a given application or dataset.

The program Teaser attempts to automate the benchmarking of not just different mappers, but also (some of) the different parameters that are available to these programs. The latter problem should not be underestimated: the Bowtie 2 documentation describes almost 100 different command-line options, and many of these options control how Bowtie 2 runs and/or what output it generates.

Teaser uses small sets of simulated read data, leading to very quick run times (< 30 minutes for many comparisons), but you can also supply real data to it. By default, Teaser will test the performance of five read mapping programs: BWA, BWA-MEM, BWA-SW, Bowtie2, and NextGenMap.
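The combinatorial sweep that Teaser automates can be sketched in a few lines of Python. This is purely illustrative: the mapper presets and file names below are hypothetical, not Teaser's actual configuration.

```python
# Hypothetical benchmark grid: two mappers, each with a few parameter
# presets (illustrative values, not Teaser's real defaults).
MAPPERS = {
    "bowtie2": ["--very-fast", "--sensitive", "--very-sensitive"],
    "bwa mem": ["-k 19", "-k 25"],
}

def command_lines(reads="simulated_reads.fq", ref="ref.fa"):
    """Yield one shell command per (mapper, preset) combination."""
    for mapper, presets in MAPPERS.items():
        for preset in presets:
            yield f"{mapper} {preset} {ref} {reads}"

commands = list(command_lines())
# Five combinations to benchmark in total (3 bowtie2 + 2 bwa mem).
```

Each generated command would then be run on the same simulated reads, and the mapped output scored against the known read origins.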

Impressively, you can run Teaser on the web as well as use it as a standalone program. The web output includes results displayed graphically for many different test datasets (x-axis):

The paper concludes by asking the community to submit optimal parameter combinations to the Teaser GitHub repository:

Teaser is easy to use and at the same time extendable to other methods and parameters combinations. Future work will include the incorporation of benchmarking RNA-Seq mappers and variant calling methods. We furthermore encourage the scientific community to contribute the optimal parameter combinations they detected to our github repository (available at github.com/Cibiv/Teaser) for their particular organism of interest. This will help others to quickly select the optimal combination of mapper and parameter values using Teaser.

I can't wait for the companion program Firecat!

 

2015-10-26 11.05: Updated to remove specific references to software versions of mapping tools.


Help us do science! I’ve teamed up with researcher Paige Brown Jarreau to create a survey of ACGT readers. By participating, you’ll be helping me improve ACGT and contributing to the SCIENCE on blog readership. You will also get FREE science art from Paige's Photography for participating, as well as a chance to win a t-shirt and other perks! It should only take 10–15 minutes to complete.

You can find the survey here: http://bit.ly/mysciblogreaders

The Bioboxes paper is now out of the box! [Link]

The Bioboxes project now has their first formal publication, with the software being described today in the journal GigaScience:

I love the concise abstract:

Software is now both central and essential to modern biology, yet lack of availability, difficult installations, and complex user interfaces make software hard to obtain and use. Containerisation, as exemplified by the Docker platform, has the potential to solve the problems associated with sharing software. We propose bioboxes: containers with standardised interfaces to make bioinformatics software interchangeable.

Congratulations to Michael Barton, Peter Belmann, and the rest of the Bioboxes team!

 

Updated 2015-10-15 18.18: Added specific acknowledgement of Peter Belmann.

New paper provides a great overview of the current state of genome assembly

The following paper by Stephen Richards and Shwetha Murali has just appeared in the journal Current Opinion in Insect Science:

Best practices in insect genome sequencing: what works and what doesn’t

In some ways I wish they had chosen a different title, as the focus of this paper is much more on genome assembly than on genome sequencing. Indeed, it provides a great overview of all of the current strategies in genome assembly, which should be of interest to any non-insect researchers who want to know the best way of putting a genome together. Here is part of the legend from a very informative table in the paper:

Table 1 — De novo genome assembly strategies:
Assembly software is designed for a specific sequencing and assembly strategy. Thus sequence must be generated with the assembly software and algorithm in mind, choosing a sequence strategy designed for a different assembly algorithm, or sequencing without thinking about assembly is usually a recipe for poor un-publishable assemblies. Here we survey different assembly strategies, with different sequence and library construction requirements.

Are there too many biological databases?

The annual 'Database' issue of Nucleic Acids Research (N.A.R.) was recently published. It contains a mammoth set of 172 papers that describe 56 new biological databases as well as updates to 115 others. I've already briefly commented on one of these papers, and expect that I'll be nominating several others for JABBA awards.

In this post I just wanted to comment on the seemingly inexorable growth of these computational resources. There are databases for just about everything these days. Different species, different diseases, different types of sequence, different biological mechanisms…every possible biological topic has a relevant database, and sometimes several.

It is increasingly hard to even stay on top of just how many databases are out there. Wikipedia has a listing of biological databases as well as a category for biological databases, but both of these barely scratch the surface of what is out there.

So maybe one might turn to 'DBD': a Database of Biological Databases, or even MetaBase, which also describes itself as a 'Database of Biological Databases' (please don't start thinking about creating 'DBDBBDB': A Database of Databases of Biological Databases!).

However, the home pages of these two sites were last updated in 2008 and 2011 respectively, perfectly reflecting one of the problems in the world of biological databases…they often don't get removed when they go out of date. In a past life, I was a developer of several databases at something called UK CropNet. Curation of these databases, particularly the Arabidopsis Genome Resource, effectively stopped when I left the job in 2001, but the databases were only taken offline in 2013!

So old, out-of-date databases are part of the problem, but the other issue is that there seem to be some independent databases that — in an ideal world — should really be merged with similar databases. E.g. there is a database called BeetleBase that describes its remit as follows:

BeetleBase is a comprehensive sequence database and important community resource for Tribolium genetics, genomics and developmental biology.

This database has been around since at least 2007 though I'm not entirely sure if it is still being actively developed. However, I was still surprised to see this paper as part of the N.A.R. Database issue:

iBeetle-Base has seemingly been developed by a separate group of people from BeetleBase. Is it helpful to the wider community to have two databases like this, with confusingly similar names? It's possible that the iBeetle-Base people tried reaching out to the BeetleBase folks to include their data in the pre-existing database, but were rebuffed, or found out that BeetleBase is no longer a going concern. Who knows, but it just seems a shame to have so much genomics information for a species split across multiple databases.

I'm not sure what could, or should, be done to tackle these issues. Should we discourage new databases if there are already existing resources that cover much of the subject matter? Should we require the people who run databases to 'wind up' the resources in a better way when funding runs out (i.e. retire databases or make it abundantly clear that a resource is no longer being updated)? Is it even possible to set some minimum standards for database usage that must be met in order for subsequent 'update papers' to get published (i.e. 'X' DB accesses per month)?

diArk – the database for eukaryotic genome and transcriptome assemblies in 2014

A new paper in Nucleic Acids Research describes a database that I was not aware of. The abstract features an eye-catching, not to mention ambitious, claim (the emphasis is mine):

The database…has been developed with the aim to provide access to all available assembled genomes and transcriptomes.

The diArk database currently features data on 2,771 species. There are many options to filter your search queries, including filtering by 'sequencing type' and by completion status. So when I search for 'completed' genome sequencing projects, it reports that there are 3,626 projects corresponding to 1,848 species. The FAQ has this to say regarding 'completeness':

The term completeness is intended to describe the coverage of the genome and the chance to find all homologs of the gene of interest.

I was a bit put off by the interface to this database. As far as I can tell, diArk mostly contains links to other resources (rather than hosting any sequence information). There are lots of very small icons everywhere which are hard to understand (unless you mouse over each icon). When I went to the page for Caenorhabditis elegans, I was struck by the confusing nature of just posting links to every C. elegans resource on the web. There are 12 'Project' links listed. Which one gives you access to the latest version of the genome sequence?

diArk summary of Caenorhabditis elegans data

As a final comment, I noticed that the latest entry on the diArk news page is from September 2011, which is a bit worrying (has nothing newsworthy happened in the last three years?).

Comparisons of computational methods for differential alternative splicing detection using RNA-seq in plant systems

Marc Robinson-Rechavi (@marc_rr) tweeted about this great new paper in BMC Bioinformatics by Ruolin Liu, Ann Loraine, and Julie Dickerson. From the abstract:

The goal of this paper is to benchmark existing computational differential splicing (or transcription) detection methods so that biologists can choose the most suitable tools to accomplish their goals.

Like so many other areas of bioinformatics, there are many methods available for detecting alternative splicing, and it is far from clear which — if any — is the best. This paper attempts to compare eight of them, and the abstract contains a sobering conclusion:

No single method performs the best in all situations

Figure 5 from the paper is especially depressing. It looks at the overlap of differentially spliced genes as detected by five different methods. There are zero differentially spliced genes that all methods agreed on.
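The overlap analysis behind a figure like this is just set arithmetic. Here is a toy sketch with made-up gene IDs and three methods, rather than the paper's five methods and real data:

```python
# Differentially spliced genes reported by three hypothetical
# methods (made-up gene IDs, not the paper's data).
calls = {
    "methodA": {"g1", "g2", "g3"},
    "methodB": {"g2", "g3", "g4"},
    "methodC": {"g3", "g5"},
}

# Genes that every method agrees on: the intersection of all call sets.
consensus = set.intersection(*calls.values())

# Genes reported by at least one method: the union of all call sets.
any_method = set.union(*calls.values())
```

In the paper's Figure 5, the equivalent of `consensus` across all five methods was empty.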

Liu et al. BMC Bioinformatics 2014 15:364   doi:10.1186/s12859-014-0364-4

Ten recommendations for software engineering in research

The list of recommendations in this new GigaScience paper by Janna Hastings et al. is not aimed at bioinformatics in particular, but many bioinformaticians would benefit from reading it. I can particularly relate to this suggestion:

Document everything
Comprehensive documentation helps other developers who may take over your code, and will also help you in the future. Use code comments for in-line documentation, especially for any technically challenging blocks, and public interface methods. However, there is no need for comments that mirror the exact detail of code line-by-line.
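As a small illustration of this advice (my own sketch, not code from the paper): a docstring on the public function, a comment only on the one non-obvious step, and no line-by-line narration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest; the contig that pushes the
    # running total past the halfway point defines the N50.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```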

3 word summary of new PLOS Computational Biology paper: genome assemblies suck

Okay, I guess a more accurate four word summary would be 'genome assemblies sometimes suck'.

The paper that I'm referring to, by Denton et al., looks at the problem of fragmentation (of genes) in many draft genome assemblies:

As they note in their introduction:

Our results suggest that low-quality assemblies can result in huge numbers of both added and missing genes, and that most of the additional genes are due to genome fragmentation (“cleaved” gene models).

One section of this paper looks at the quality of different versions of the chicken genome and CEGMA is one of the tools they use in this analysis. I am a co-author of CEGMA, and reading this paper brought back some memories of when we were also looking at similar issues.

In our second CEGMA paper, we tried to find out why 36 core genes were not present in the v2.1 version of the chicken genome (6.6x coverage). It turns out that there were ESTs available for 29 of those genes, indicating that they are not absent from the genome, just from the genome assembly. This led us to find pieces of these missing genes in the unanchored set of sequences that were included as part of the set of genome sequences (these often appear as a 'ChrUn' FASTA file in genome sequence releases).
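Pulling out those unanchored sequences for a targeted search is straightforward. A minimal sketch (header naming is assembly-specific; 'chrUn' is common but not universal, and the file name below is hypothetical):

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def unplaced(records):
    """Keep only records whose header marks them as unanchored."""
    return [(h, s) for h, s in records if h.lower().startswith("chrun")]

# Usage (hypothetical file name):
# with open("chicken_genome.fa") as fh:
#     chr_un = unplaced(read_fasta(fh))
```

The `chr_un` records could then be searched (e.g. with BLAST) for fragments of the genes that appear to be missing from the anchored chromosomes.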

Something else that we reported on in our CEGMA paper is that, sometimes, a newer version of a genome assembly can actually be worse than what it replaces (at least in terms of genic content). Version 1.95 of the Ciona intestinalis genome contained several core genes that subsequently disappeared in the v2.0 release.

In conclusion — and echoing some of the findings of this new paper by Denton et al.:

  1. Many genomes are poorly assembled
  2. Many genomes are poorly annotated (often a result of the poor assembly)
  3. Newer versions of genome assemblies are not always better