Which 'omics' assembly tools are currently the most popular?

I recently organized an online poll to find out which tools for genome, transcriptome, and metagenome assembly are currently the most popular with researchers. After a week or so of collecting results, I ended up with 116 responses that describe over 30 different assembly tools.

Thanks to everyone who took part. I've posted the results to Figshare as a PDF report, and have also embedded this below (I suggest downloading the PDF so that you can use all of the embedded hyperlinks in the report).

Impactstory: Publications are an important part of research…but they’re not the only part

I'm a great fan of the Impactstory service, which makes it easy to aggregate all of your research output in one place and then see how people are engaging with your research. I like it so much that I signed up to be an Impactstory Advisor.

Today I'm giving a talk at UC Davis about Impactstory, and so that everyone can see why I like this service so much, I've made a video version of my presentation. 

Visit the Impactstory website to find out more or follow them on twitter (@Impactstory). For an example of the types of things that Impactstory can track, have a look at my own Impactstory page (impactstory.org/KeithBradnam).

Some slides from a recent talk about genome assembly (and thoughts on evolving slide decks)

Last week I presented a talk about genome assembly at a UC Davis Bioinformatics Core workshop. The first time that I gave this talk, it concentrated almost exclusively on the results from our Assemblathon 2 paper. In the handful of times that I've subsequently given this talk, it has always evolved:

  1. More background to better explain some of the more common terminology in this field
  2. Less detail about the specifics of the Assemblathon 2 results
  3. New information relating to the latest developments in sequencing and assembly
  4. Added an 'intermission' so I can explain why I think Next-generation sequencing must die

Even if I didn't add any new content to my talk, and even if I was giving the same talk twice in the same week, I would still almost certainly change some aspect of my presentation. Here are some reasons why I often end up changing things:

  1. Things which seemed like a good idea when planning and making slides don't always work as well in front of an actual audience. Sometimes this might be unnecessary detail which slows things down, or it might be something which is no longer as relevant (or exciting) as when you first gave a talk on this topic.
  2. Inevitably there will be some parts of my talk which don't flow as well as others. Sometimes I will switch the order of sections, or drop sections altogether.
  3. If people ask me questions during the talk, then this is often because something is unclear. I try to make mental reminders about this, as it might mean that there is something I can better explain.
  4. Some visual elements will look great on my screen, and even on certain projectors, but then I will give a talk somewhere where a different projector makes a slide look horrible. Most commonly, this happens when two colors end up looking far too similar. It's always a good idea to change things so that they look clear on any projector.
  5. If I know that the audience for a talk may contain many people that don't speak English as their primary language, I might add more text content on key slides.
  6. A final reason for changing content is just to keep your talk fresh. It's possible that you become stale when you give the exact same talk over and over again. Changing the order of sections, or adding/removing content, means that you have to re-engage with your own material.

But hey, enough of my yakking…here are the slides. Note that I include two versions; the first version doesn't have any notes (harder to follow as I often prefer to talk around what's on my slides). The second version has notes added below each slide (these notes try to capture the gist of what I talk about on each slide). Also, don't be alarmed by the high slide count, each animation step appears as a separate slide (so that you can almost capture all of the animated fun of a real Keith Bradnam presentation).



Designing a musical motif for the UC Davis Genome Center

Over the last month, I have spent much of my time helping to develop a new website for the UC Davis Genome Center (a site which will hopefully be launched very soon). In trying to bring the website into the modern era, I've been trying to set things up so that we can better promote any news that arises from the work of the talented faculty, staff, and students that we have.  

In particular, I'm keen to feature some video clips on the new site, and that made me think that we should have our own Genome Center 'ident' to use in any videos. Idents are a bit like stingers on radio stations, something that gives an audio signature that people might come to recognize (and maybe even like).

I have a smattering of music knowledge so I thought it might be fun to create something based on DNA. As there are four canonical DNA bases (A, C, G, and T), I thought that the musical motif should have four principal notes. I then decided to arrange the notes with musical intervals based on the intervals between the alphabet positions of A, C, G, and T. If you start this sequence on a C note, you end up with C, D, F# and G (one octave up). This progression feels like it needs to be resolved, and a basic G major chord seems to work.
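The mapping from bases to notes can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and it assumes that each step in the alphabet corresponds to one semitone, which is the reading that reproduces the C, D, F#, G (one octave up) sequence described above.

```python
# Chromatic scale starting on C; index = semitones above C.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def alphabet_position(letter):
    """Return the 1-based position of a letter in the alphabet (A=1, T=20)."""
    return ord(letter.upper()) - ord("A") + 1


def motif(bases="ACGT", start_note="C"):
    """Turn a string of letters into (note name, octave offset) pairs.

    Assumption: a gap of one alphabet position equals one semitone.
    """
    positions = [alphabet_position(b) for b in bases]
    semitone = NOTE_NAMES.index(start_note)
    notes = [(NOTE_NAMES[semitone % 12], semitone // 12)]
    for previous, current in zip(positions, positions[1:]):
        semitone += current - previous
        notes.append((NOTE_NAMES[semitone % 12], semitone // 12))
    return notes


print(motif())
# [('C', 0), ('D', 0), ('F#', 0), ('G', 1)]
```

The alphabet gaps between A, C, G, and T are 2, 4, and 13, so the last jump is an octave plus one semitone, which is why the final G lands one octave up.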

So this is what I have come up with so far. This may end up being vetoed by the powers-that-be, but I'm still pretty happy with it:

Update: just to add that this piece was made entirely using GarageBand on my Mac. There are: three tracks that use Classic Electric Piano (I was using the onscreen keyboard which is why I ended up doing these as three separate tracks); one Tonewheel Organ track; one Upright Studio Bass track; one Classic Analog Pad track; and one String Ensemble track. The latter three tracks combine to form the final chord.

What's in a name? Some thoughts on the 'exSPAnder' assembly tool

This week a new tool was published in the Bioinformatics journal:

ExSPAnder: a universal repeat resolver for DNA fragment assembly

The tool's name really refers to the name of an algorithm that is implemented as part of the SPAdes genome assembler. I don't think that this is particularly obvious from the title of the paper. The results section in the paper further complicates this somewhat. E.g. this is how the comparative assembler results are reported in Table 2 of the paper:

The entry called 'SPAdes 2.4' refers to a version of the SPAdes assembler that doesn't use the exSPAnder algorithm, whereas the entry marked 'EXPANDER' refers to a newer version of the SPAdes assembler that does include the algorithm. I find this confusing and it is one of three issues that I have concerning the use of the exSPAnder name:

1. Do we really need to start giving names to algorithms that are part of another tool? This has the potential to create a lot more confusion for people. Particularly when there is no tool called 'exSPAnder' that you can download from anywhere. If somebody implemented the algorithm as part of another piece of software would they be expected to retain the exSPAnder name somewhere (MegaAssembler featuring exSPAnder)?

2. You would hope that the website that the paper links to gives you more information about exSPAnder. But that's not the case:

  • Number of mentions of exSPAnder in the publication: 35
  • Number of mentions of exSPAnder in the linked software web page: 0
  • Number of mentions of exSPAnder in the latest SPAdes v3.1.0 manual: 0

Again, I think this can only lead to confusion. The mention of exSPAnder as if it was its own separate tool suggests that this is software that you can download. E.g. this is from the Conclusion section of the paper:

Benchmarks across eight popular assemblers demonstrate that exSPAnder produces high-quality assemblies for datasets of different types.

But exSPAnder is not an assembler that anyone can download and use at the moment. Rather, you can download the SPAdes assembler which may or may not feature the exSPAnder algorithm (I don't know because the website and the manual don't say).

3. My final issue is perhaps the most minor one and it relates to this horrible trend of using mixed capitalization for bioinformatics tool names. If you are going to do this, please be consistent and please realize that journal formatting conventions may mess up your planned use of capitalization. Here are the different ways you can see 'exSPAnder' referred to in this paper:

  • ExSPAnder: 1
  • exSPAnder: 1
  • EXPANDER: 1
  • EXSPANDER: 28

So I'm assuming that the latter format is the one that the authors are really using and the other variations are due to problems of the journal formatting the article. Using small caps like this is a great way to guarantee that no-one else will bother to format the name like this. Okay, time to finish this post as I need to go and work on my new assembly tool:

MaSSEMbLerXL— an assembler that assigns different font sizes to each DNA base
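A tally of capitalization variants like the one above can be reproduced on the plain text of any document with a short Python sketch. The function name, the regular expression, and the sample sentence here are illustrative assumptions and are not taken from the paper; note also that small caps formatting does not survive conversion to plain text, so counts may differ from what you see in a PDF.

```python
import re
from collections import Counter


def count_case_variants(text, pattern):
    """Count each distinct spelling matched by a case-insensitive pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return Counter(match.group(0) for match in regex.finditer(text))


# Hypothetical sample text, written only to exercise the function.
sample = (
    "ExSPAnder is new. EXSPANDER works; exSPAnder too. "
    "EXSPANDER again, and EXPANDER."
)

# The optional 's' also catches the 'EXPANDER' spelling.
variants = count_case_variants(sample, r"\bexs?pander\b")
for spelling, count in variants.most_common():
    print(spelling, count)
```

Making the 's' optional is a deliberate choice, as a plain case-insensitive search for 'exspander' would silently miss the 'EXPANDER' variant.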

 

101 questions with a bioinformatician #11: Ewan Birney

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

This is a special 'If we need that extra push over the cliff' edition of '101 questions with a bioinformatician'.


This week's interviewee is the Associate Director of the European Bioinformatics Institute and also the head of CTTV (the Centre for Therapeutic Target Validation). Furthermore, he is a winner of the prestigious Benjamin Franklin Award, a winner of the Overton Prize, and he was recently elected as a Fellow of the Royal Society. He helped found Ensembl, was an early supporter/developer of BioPerl, and has contributed to an improbably large number of genomics and bioinformatics projects.

And yet, these are not the accomplishments of Ewan Birney that I am most impressed by. Rather, I am in awe that he helped define life to Douglas Adams (a very special kind of DNA). And most impressively, his entry on Wikipedia lists one of his roles/accomplishments as follows: 

[Ewan] acted as a bookmaker to the genomics community, taking bets on different estimates of the total number of genes in the human genome.

Ewan Birney, a man who will help satisfy your genomic gambling needs! You can find out more about Ewan by following him on twitter (@ewanbirney), or reading his blog (Bioinformatician at large), or attending any bioinformatics conference (he asks all the questions at all the conferences…it's in his contract).

And now, on to the 101 questions...

 

 

001. What's something that you enjoy about current bioinformatics research?

Learning new stuff. My research is heading towards leveraging outbred genetics to understand basic biological processes in a variety of metazoans (fly, medaka, human). I’m learning about statistical genetics (population structure, other stuff…), fly development, medaka anatomy and human physiology along with my students. It’s great.

 

010. What's something that you *don't* enjoy about current bioinformatics research?

Broadening this out to life sciences, not just bioinformatics, it’s about a chronic set-point problem in our collective investment into information infrastructure. Fundamentally, biology is an information science; we need to understand how things work and we need to pass data, information and knowledge both on in the future and between people now. Writing papers is part of this, but other parts, just as important (perhaps more?), are both raw and derived sets of information. To do this we need robust information infrastructures. Collectively we are happy that multiple billions are spent on data generation/experiments/analysis and yet we often agonise on the millions spent on aggregating (over space) and propagating (over time) this information. This is a fundamental mindset that we need to change.

 

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

  1. Get your head around statistics — frequentists, bayesian, etc. You will really need this.
  2. Be open (the few times I’ve not been, I’ve regretted it).

 

100. What's your all-time favorite piece of bioinformatics software, and why?

 Hmmm. Tough one.

 Perl? (though the cool kids do Python now)

 R? (though the insanity of the function conventions drive me mad…)

 

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

I’ve always had a soft spot for Y. Perhaps that’s because I first got into bioinformatics through splicing and the 3’ splice sites have the polypyrimidine tract beforehand. Or that T and C seem like the underdogs in the nucleotide world…

This JABBA-award-winning bioinformatics tool should be 'detonated'

Another week, another JABBA award for Just Another Bogus Bioinformatics Acronym. The title of the paper — published on bioRxiv.org — that describes this tool, does not reveal the name:

Neither does the abstract, but when you get to the end of the Introduction, it is finally revealed:

We improve upon the state-of-the-art in transcriptome assembly evaluation by presenting DETONATE: DE novo TranscriptOme rNa-seq Assembly with or without the Truth Evaluation 

Wow. Three of the eight letters in this name do not come from the initial letters of words, and five out of eleven words in the full name of the tool do not contribute to the acronym at all. I particularly like the 'with or without' part.

While I can understand why they didn't want to use the full acronym (DNTRAWOWTTE), I'm sure that they could have come up with something else instead  — how about TETRA: Truth Evaluation Transcriptome RNAseq Assembler? But really, this is yet another example where you don't need to make an acronym! Just call the tool 'Detonate' and be done with it.
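The letter counts quoted above are easy to check with a few lines of Python. This is just a sketch with made-up function and variable names; it relies on the capitalization of the full name exactly as it is quoted from the paper's Introduction.

```python
FULL_NAME = (
    "DE novo TranscriptOme rNa-seq Assembly "
    "with or without the Truth Evaluation"
)


def analyse_acronym(full_name):
    """Break a capitalized tool name down into its acronym components."""
    words = full_name.split()
    # The published acronym: every uppercase letter, in order.
    acronym = "".join(ch for ch in full_name if ch.isupper())
    # The 'honest' acronym: the first letter of every word.
    initialism = "".join(word[0].upper() for word in words)
    # Acronym letters that are not the first letter of their word.
    non_initial = [ch for word in words for ch in word[1:] if ch.isupper()]
    # Words that contribute nothing to the acronym.
    silent_words = [w for w in words if not any(ch.isupper() for ch in w)]
    return acronym, initialism, non_initial, silent_words


acronym, initialism, non_initial, silent_words = analyse_acronym(FULL_NAME)
print(acronym)       # DETONATE
print(initialism)    # DNTRAWOWTTE
print(non_initial)   # ['E', 'O', 'N']
print(silent_words)  # ['novo', 'with', 'or', 'without', 'the']
```

Three non-initial letters and five silent words out of eleven match the counts given in the post.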

MIRA, MIRA on the wall: the problem of duplicated names in bioinformatics

So in addition to lots of bioinformatics tools that use bogus acronyms for their names, or which have very unpronounceable names, we now have a new problem…duplicate names. Rachel Glover (@rach_glover) tweeted this today:

The new MIRA tool (Mutual Information-based Reporter Algorithm for metabolic networks) is entirely unrelated to the existing MIRA tool which is a genome assembler that's been around for over 15 years.

It is not uncommon to need to search online for a bioinformatics tool. This can be complicated by the fact that many tools have names that are more commonly associated with other things (e.g. SHRiMP, ICEberg, HAMSTeRS, Pigeons, MOUSE, INSECT etc.). The first three examples also highlight that using mixed capitalization to help distinguish your bioinformatics tool from other things doesn't really help when you use a web search engine.

One solution to this problem has always been to add the word 'bioinformatics' to your web search. However, if we start seeing more tools that share the same name, then this might not be that useful either.

Following Rachel's tweet, Torsten Seemann (@torstenseemann) had a suggestion:

I can't imagine that this would be an easy undertaking, but Alastair Kerr (@alastair_kerr) made a good follow-up point:

I think this is a great suggestion. Bioinformatics journals should perhaps state in their author guidelines that people should not duplicate the name of an existing (published) bioinformatics tool. Reviewer guidelines could also prompt the reviewer to check if this has happened (a simple web search of '<tool name> bioinformatics|genomics' would probably suffice).
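As a rough illustration of what such a check might look like, here is a minimal Python sketch. Everything in it is hypothetical: there is no central registry of tool names that it queries, the list of existing names is simply typed in from examples mentioned in this post, and the function names are made up. It compares names case-insensitively, since mixed capitalization does not help with search engines, and it builds the search string suggested above.

```python
def build_search_query(tool_name):
    """Build the web search string suggested in the post."""
    return f"{tool_name} bioinformatics|genomics"


def find_name_clashes(new_name, existing_names):
    """Return existing tool names that match new_name, ignoring case."""
    key = new_name.casefold()
    return [name for name in existing_names if name.casefold() == key]


# Hypothetical list of already published tool names (not a real registry).
published_tools = ["MIRA", "SHRiMP", "ICEberg"]

print(build_search_query("MIRA"))
print(find_name_clashes("Mira", published_tools))
print(find_name_clashes("Detonate", published_tools))
```

An exact, case-insensitive match is the simplest possible rule; a real check by a reviewer would still need a web search, because no single list of tool names is complete.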