What's in a name? Some thoughts on the 'exSPAnder' assembly tool

June 19, 2014 by Keith Bradnam

This week a new tool was published in the Bioinformatics journal:

ExSPAnder: a universal repeat resolver for DNA fragment assembly

The tool's name really refers to the name of an algorithm that is implemented as part of the SPAdes genome assembler. I don't think that this is particularly obvious from the title of the paper. The results section in the paper further complicates this somewhat. E.g. this is how the comparative assembler results are reported in Table 2 of the paper:

The entry called 'SPAdes 2.4' refers to a version of the SPAdes assembler that doesn't use the exSPAnder algorithm, whereas the entry marked 'EXPANDER' refers to a newer version of the SPAdes assembler that does include the algorithm. I find this confusing and it is one of three issues that I have concerning the use of the exSPAnder name:

1. Do we really need to start giving names to algorithms that are part of another tool? This has the potential to create a lot more confusion for people. Particularly when there is no tool called 'exSPAnder' that you can download from anywhere. If somebody implemented the algorithm as part of another piece of software would they be expected to retain the exSPAnder name somewhere (MegaAssembler featuring exSPAnder)?

2. You would hope that the website that the paper links to gives you more information about exSPAnder. But that's not the case:

Number of mentions of exSPAnder in the publication: 35
Number of mentions of exSPAnder in the linked software web page: 0
Number of mentions of exSPAnder in the latest SPAdes v3.1.0 manual: 0

Again, I think this can only lead to confusion. The mention of exSPAnder as if it was its own separate tool suggests that this is software that you can download. E.g. this is from the Conclusion section of the paper:

Benchmarks across eight popular assemblers demonstrate that exSPAnder produces high-quality assemblies for datasets of different types.

But exSPAnder is not an assembler that anyone can download and use at the moment. Rather, you can download the SPAdes assembler, which may or may not feature the exSPAnder algorithm (I don't know because the website and the manual don't say).
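Counts like the ones above are easy to reproduce. Here is a minimal sketch, assuming you have the text of a paper or manual as a plain string; the `count_name_variants` helper and the sample text are made up for illustration and are not taken from the paper itself. It finds every mention of a name regardless of capitalization and tallies each spelling separately.

```python
import re
from collections import Counter


def count_name_variants(text: str, name: str) -> Counter:
    """Tally each capitalization variant of `name` that appears in `text`."""
    # Case-insensitive, whole-word match so that every spelling is caught.
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    return Counter(match.group(0) for match in pattern.finditer(text))


# Hypothetical sample text standing in for a paper or a manual.
sample = (
    "ExSPAnder: a universal repeat resolver. "
    "We show that EXSPANDER improves assemblies; EXSPANDER is part of SPAdes. "
    "The exSPAnder algorithm is described below."
)

variants = count_name_variants(sample, "exspander")
total_mentions = sum(variants.values())

print(f"Total mentions: {total_mentions}")
for spelling, count in variants.most_common():
    print(f"{spelling}: {count}")
```

Running the same function over the text of a web page or manual would give the zero counts reported above, if the name really is absent.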

3. My final issue is perhaps the most minor one and it relates to this horrible trend of using mixed capitalization for bioinformatics tool names. If you are going to do this, please be consistent and please realize that journal formatting conventions may mess up your planned use of capitalization. Here are the different ways you can see 'exSPAnder' referred to in this paper:

  
ExSPAnder: 1
exSPAnder: 1
EXPANDER: 1
EXSPANDER: 28

So I'm assuming that the last format is the one that the authors are really using, and that the other variations are due to the journal's formatting of the article. Using small caps like this is a great way to guarantee that no-one else will bother to format the name like this. Okay, time to finish this post as I need to go and work on my new assembly tool:

MaSSEMbLerXL: an assembler that assigns different font sizes to each DNA base

101 questions with a bioinformatician #11: Ewan Birney

June 18, 2014 by Keith Bradnam

This post is part of a series that interviews some notable bioinformaticians to get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

This is a special 'If we need that extra push over the cliff' edition of '101 questions with a bioinformatician'.

This week's interviewee is the Associate Director of the European Bioinformatics Institute and also the head of CTTV (the Centre for Therapeutic Target Validation). Furthermore, he is a winner of the prestigious Benjamin Franklin Award, a winner of the Overton Prize, and he was recently elected as a Fellow of the Royal Society. He helped found Ensembl, was an early supporter/developer of BioPerl, and has contributed to an improbably large number of genomics and bioinformatics projects.

And yet, these are not the accomplishments of Ewan Birney that I am most impressed by. Rather, I am in awe that he helped define life to Douglas Adams (a very special kind of DNA). And most impressively, his entry on Wikipedia lists one of his roles/accomplishments as follows:

[Ewan] acted as a bookmaker to the genomics community, taking bets on different estimates of the total number of genes in the human genome.

Ewan Birney, a man who will help satisfy your genomic gambling needs! You can find out more about Ewan by following him on twitter (@ewanbirney), or reading his blog (Bioinformatician at large), or attending any bioinformatics conference (he asks all the questions at all the conferences…it's in his contract).

And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

Learning new stuff. My research is heading towards leveraging outbred genetics to understand basic biological processes in a variety of metazoans (fly, medaka, human). I’m learning about statistical genetics (population structure, other stuff…), fly development, medaka anatomy and human physiology along with my students. It’s great.

010. What's something that you *don't* enjoy about current bioinformatics research?

Broadening this out to life sciences, not just bioinformatics, it’s about a chronic set-point problem in our collective investment into information infrastructure. Fundamentally, biology is an information science; we need to understand how things work and we need to pass on data, information and knowledge both into the future and between people now. Writing papers is part of this, but other parts, just as important (perhaps more?), are both raw and derived sets of information. To do this we need robust information infrastructures. Collectively we are happy that multiple billions are spent on data generation/experiments/analysis and yet we often agonise over the millions spent on aggregating (over space) and propagating (over time) this information. This is a fundamental mindset that we need to change.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Get your head around statistics (frequentist, Bayesian, etc.). You will really need this.
Be open (the few times I’ve not been, I’ve regretted it).

100. What's your all-time favorite piece of bioinformatics software, and why?

Hmmm. Tough one.

Perl? (though the cool kids do Python now)

R? (though the insanity of the function conventions drives me mad…)

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

I’ve always had a soft spot for Y. Perhaps that’s because I first got into bioinformatics through splicing and the 3’ splice sites have the polypyrimidine tract beforehand. Or that T and C seem like the underdogs in the nucleotide world…

This JABBA-award-winning bioinformatics tool should be 'detonated'

June 17, 2014 by Keith Bradnam

Another week, another JABBA award for Just Another Bogus Bioinformatics Acronym. The title of the paper that describes this tool (published on bioRxiv.org) does not reveal the name:

Evaluation of de novo transcriptome assemblies from RNA-Seq data

Neither does the abstract, but when you get to the end of the Introduction, it is finally revealed:

We improve upon the state-of-the-art in transcriptome assembly evaluation by presenting DETONATE: DE novo TranscriptOme rNa-seq Assembly with or without the Truth Evaluation

Wow. Three of the eight letters in this name do not come from the initial letters of words, and five out of eleven words in the full name of the tool do not contribute to the acronym at all. I particularly like the 'with or without' part.

While I can understand why they didn't want to use the full acronym (DNTRAWOWTTE), I'm sure that they could have come up with something else instead — how about TETRA: Truth Evaluation Transcriptome RNAseq Assembler? But really, this is yet another example where you don't need to make an acronym! Just call the tool 'Detonate' and be done with it.
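The letter counting above can be checked mechanically. This is a minimal sketch, assuming the full name is written exactly as quoted from the paper, with capital letters marking the letters that feed the acronym; the `analyse_acronym` function is a made-up helper for this post, not part of any published tool.

```python
def analyse_acronym(styled_name: str) -> dict:
    """Analyse a tool name whose capital letters mark the acronym letters."""
    words = styled_name.split()

    # The acronym is simply every capital letter, in order.
    acronym = "".join(ch for ch in styled_name if ch.isupper())

    # Capital letters that are not the first character of their word.
    non_initial = sum(
        1
        for word in words
        for position, ch in enumerate(word)
        if ch.isupper() and position > 0
    )

    # Words that contribute no letter at all to the acronym.
    unused_words = [w for w in words if not any(ch.isupper() for ch in w)]

    # The name you would get if every word contributed its first letter.
    full_initialism = "".join(w[0].upper() for w in words)

    return {
        "acronym": acronym,
        "non_initial_letters": non_initial,
        "total_words": len(words),
        "unused_words": unused_words,
        "full_initialism": full_initialism,
    }


styled = (
    "DE novo TranscriptOme rNa-seq Assembly "
    "with or without the Truth Evaluation"
)
report = analyse_acronym(styled)
for key, value in report.items():
    print(f"{key}: {value}")
```

The same function could be pointed at any other name that uses capitals to signal where its acronym comes from.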

MIRA, MIRA on the wall: the problem of duplicated names in bioinformatics

June 16, 2014 by Keith Bradnam

So in addition to lots of bioinformatics tools that use bogus acronyms for their names, or which have unpronounceable names, we now have a new problem…duplicate names. Rachel Glover (@rach_glover) tweeted this today:

So now there are two MIRA's in bioinformatics... http://t.co/h8e9GCDBOK
— Rachel Glover (@rach_glover) June 16, 2014

The new MIRA tool (Mutual Information-based Reporter Algorithm for metabolic networks) is entirely unrelated to the existing MIRA tool, which is a genome assembler that's been around for over 15 years.

It is not uncommon to need to search online for a bioinformatics tool. This can be complicated by the fact that many tools have names that are more commonly associated with other things (e.g. SHRiMP, ICEberg, HAMSTeRS, Pigeons, MOUSE, INSECT etc.). The first three examples also highlight that using mixed capitalization to help distinguish your bioinformatics tool from other things doesn't really help when you use a web search engine.

One solution to this problem has always been to add the word 'bioinformatics' to your web search. However, if we start seeing more tools that share the same name, then this might not be that useful either.

Following Rachel's tweet, Torsten Seemann (@torstenseemann) had a suggestion:

@pathogenomenick @rach_glover @kbradnam I think we need a central tool name registry to stop both stupid names and duplicate names!
— Torsten Seemann (@torstenseemann) June 16, 2014

I can't imagine that this would be an easy undertaking, but Alastair Kerr (@alastair_kerr) made a good follow-up point:

@kbradnam @torstenseemann Shouldn't reviewers already do this job? Perhaps better editorial guidelines are needed?
— Alastair Kerr (@alastair_kerr) June 16, 2014

I think this is a great suggestion. Bioinformatics journals should perhaps state in their author guidelines that people should not duplicate the name of an existing (published) bioinformatics tool. Reviewer guidelines could also prompt the reviewer to check if this has happened (a simple web search of '<tool name> bioinformatics|genomics' would probably suffice).
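As a rough illustration of how little work such a check would be, here is a minimal sketch. It assumes a reviewer has some list of existing tool names to hand; the `known_tools` list below is a hypothetical stand-in, since no central registry exists. The comparison ignores capitalization, because mixed capitalization does nothing to make a name unique.

```python
def search_query(tool_name: str) -> str:
    """Build the kind of web search string suggested above."""
    return f"{tool_name} bioinformatics|genomics"


def find_clashes(proposed: str, known_names: list[str]) -> list[str]:
    """Return existing names that match `proposed`, ignoring capitalization."""
    key = proposed.casefold()
    return [name for name in known_names if name.casefold() == key]


# Hypothetical list of already-published tool names.
known_tools = ["MIRA", "SHRiMP", "ICEberg", "SPAdes"]

clashes = find_clashes("Mira", known_tools)
query = search_query("MIRA")

print(clashes)
print(query)
```

A name that comes back with no clashes would still deserve the web search, as any such list is bound to be incomplete.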

Unpronounceable — why can't people give bioinformatics tools sensible names?

June 13, 2014 by Keith Bradnam

Okay, so many of you know that I have a bit of an issue with bioinformatics tools with names that are formed from very tenuous acronyms or initialisms. I've handed out many JABBA awards for cases of 'Just Another Bogus Bioinformatics Acronym'. But now there is another blight on the landscape of bioinformatics nomenclature…that of unpronounceable names.

If you develop bioinformatics tools, you would hopefully want to promote those tools to others. This could be in a formal publication, or at a conference presentation, or even over a cup of coffee with a colleague. In all of these situations, you would hope that the name of your bioinformatics tool should be memorable. One way of making it memorable is to make it pronounceable. Surely, that's not asking that much? And yet…

GO2MSIG, an automated GO based multi-species gene set generator for gene set enrichment analysis – This is not so hard to pronounce (go-to-em-sig), but it is a little awkward and not very memorable.
AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data — I guess this only has one obvious pronunciation (abs-see-en-seq), but again not particularly memorable.
QCGWAS: A flexible R package for automated quality control of genome-wide association results — This sort of works if you separate out the two commonly used initialisms (QC + GWAS), but maybe not everyone will spot this straight away (especially if you are not familiar with GWAS). I still find this a bit of mouthful to say (cue-see-gee-was).
CMGRN: a web server for constructing multilevel gene regulatory networks using ChIP-seq and gene expression data – The lack of vowels means that it can only ever be pronounced by uttering every consonant separately (see-em-gee-ar-en).
iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition — I don't know where to start with this one! Imagine that you had to spell this out to a journalist over the phone (something that can happen in science!): "The software name? Yes, it's aye (lower-case), en (upper-case), you-see (lower-case), hyphen, pee (upper-case), ess-ee (lower-case), and kay-en-see (upper-case)…hello, are you still there?".
MFSPSSMpred: identifying short disorder-to-order binding regions in disordered proteins based on contextual local evolutionary conservation — Couldn't be simpler really. I look forward to telling my colleagues about em-eff-ess-pee-ess-ess-em-pred.
mRMRe: an R package for parallelized mRMR ensemble feature selection – This is not as long as some of the others, but try saying this five times fast (em-ar-em-ar-ee).
LoQAtE—Localization and Quantitation ATlas of the yeast proteomE. A new tool for multiparametric dissection of single-protein behavior in response to biological perturbations in yeast — I get the feeling that this is meant to be pronounced 'LOCATE', but that's only a guess. Maybe it's really pronounced low-queue-at-ee? It's clumsy, ugly, and also an incredibly tenuous initialism.
HoPaCI-DB: host-Pseudomonas and Coxiella interaction database — This, like many of the above entries, also featured as a JABBA award recipient. This is not as bad an acronym/initialism as others, but it ranks highly for its lack of obvious pronunciation. Is it ho-pa-cee-aye-dee-bee, hop-pah-cee-aye-dee-bee, ho-pa-sigh-dee-bee, or even ho-pack-ee-dee-bee???

There is a lot of bioinformatics software in this world. If you choose to add to this ever growing software catalog, then it will be in your interest to make your software easy to discover and easy to promote. For your own sake, and for the sake of any potential users of your software, I strongly urge you to ask yourself the following five questions:

Is the name memorable?
Does the name have one obvious pronunciation?
Could I easily spell the name out to a journalist over the phone?
Is the name of my database or tool free from any needless mixed capitalization?
Have I considered whether my software name is based on such a tenuous acronym or initialism that it will probably end up receiving a JABBA award?
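Most of these questions need a human to answer them, but a couple can be roughly screened in code. The sketch below is only a crude heuristic under my own assumptions: it treats any mixture of cases beyond a single leading capital as 'mixed capitalization', and it treats a total lack of vowels as a sign that a name can only be spelled out letter by letter. Both function names are invented for this example.

```python
def has_mixed_capitalization(name: str) -> bool:
    """True if the name mixes cases beyond a single leading capital."""
    letters = [ch for ch in name if ch.isalpha()]
    if not letters:
        return False
    all_upper = all(ch.isupper() for ch in letters)
    lower_after_first = all(ch.islower() for ch in letters[1:])
    # 'MIRA', 'Detonate' and 'detonate' all pass; 'LoQAtE' does not.
    return not (all_upper or lower_after_first)


def lacks_vowels(name: str) -> bool:
    """True if the name contains no vowels at all (e.g. 'CMGRN')."""
    return not any(ch.lower() in "aeiou" for ch in name)


# Names taken from the list earlier in this post, plus 'Detonate'.
names = ["iNuc-PseKNC", "CMGRN", "mRMRe", "LoQAtE", "Detonate"]
for name in names:
    print(name, has_mixed_capitalization(name), lacks_vowels(name))
```

Passing both checks says nothing about whether a name is memorable, so this is a first filter at best.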

101 questions with a bioinformatician #10: Lex Nederbragt

June 11, 2014 by Keith Bradnam

This is the third 'binary' post in this series — where the interviewee number consists of just ones and/or zeros. If this fact makes you excited, then you probably need to get out more.

Lex Nederbragt works as a Bioinformatician at the Norwegian Sequencing Centre (where they probably do more than just sequence Norwegians). He is also an Associate Professor at the Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo.

As a Dutchman living in the least populous of the three Scandinavian Kingdoms, Lex can take comfort in knowing that the Netherlands retain the upper hand in their battles with Norway on the football field.

Away from football — and this is the last chance you'll have to get away from football for the next few weeks — Lex is someone who posts fantastic amounts of useful information on his blog. If you have any interest in high-throughput sequencing and assembly, then you owe it to yourself to follow his blog updates.

You can find out more about Lex by following him on twitter (@lexnederbragt), or reading his aforementioned blog (In between lines of code) or his other blog…presumably the world's only blog devoted to the Newbler assembler.

And so on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

The increasing focus on reproducibility and reusability. Making sure others can reproduce your work is such a fundamental aspect of science, and computational work should be easy to reproduce in principle. It is fascinating to see how difficult this turns out to be in practice — even in cases where the description of the work is very complete.

010. What's something that you *don't* enjoy about current bioinformatics research?

I'm not the first one to complain about the seemingly unlimited growth in tools meant for the same job, e.g., short read mappers. My field of interest is de novo genome assembly, and there too new tools appear regularly. I think it is about time we settle on a set of tools that appear to be best suited for the job, and move on to finding ways to determine which tools work best for each individual dataset and research question. In the case of assembly, we basically already know the set of programs that generally perform well. Now we need to develop and implement evaluation tools that tell a researcher which assembly of the data is the best one for their purposes.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

I am a bit ambivalent here. It took me a long time to realize that I wanted to become a bioinformatician; I missed a lot of signals of how much I enjoyed programming, for example. So, I would like to tell myself to explore computational science much more than I did. On the other hand, waiting this long to make the switch to bioinformatics meant I have acquired a very firm background in biology. I find this essential for my work, as it allows me to make connections between the technological aspects of high-throughput sequencing experiments and data analysis, and the biological questions that inspired the experiments in the first place. So, I would also like to tell myself to keep on studying biology.

100. What's your all-time favorite piece of bioinformatics software, and why?

The Newbler assembly and mapping program from Roche/454 Life Sciences. It is not the program per se (it's good, but not necessarily the best; nor is it open source, for that matter). However, it is through the use of this program I was propelled into bioinformatics. I became very familiar with it and started scripting to massage its output. I even wrote a user-oriented manual for Newbler. These days, I use many more assembly programs besides Newbler, but my bioinformatics 'roots' will always be Newbler.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

B, as it stands for 'C or G or T', so it is flexible, allowing several alternatives and keeping options open. But it also means knowing your limits, not everything goes. I also like to have a 'plan B' in the back of my head.
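For anyone unfamiliar with the IUPAC codes behind question 101, here is a minimal sketch of how the ambiguity codes map onto DNA bases. The table covers the four bases and the standard ambiguity codes only; it leaves out U and the gap symbols, so it is not the full set that the question refers to. The `code_matches` helper is a name made up for this example.

```python
# IUPAC nucleotide codes mapped to the DNA bases they can stand for.
# U and the gap symbols are deliberately omitted from this sketch.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def code_matches(code: str, base: str) -> bool:
    """True if the IUPAC `code` can represent the given DNA `base`."""
    return base.upper() in IUPAC_CODES[code.upper()]


print(IUPAC_CODES["B"])        # the bases that B can stand for
print(code_matches("B", "A"))  # B keeps options open, but not every option
print(code_matches("Y", "T"))
```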

How to make your genomics website more suitable for an English-speaking audience

June 05, 2014 by Keith Bradnam

Today I visited the website of the Beijing Institute of Genomics (BIG) for the first time. BIG is not to be confused with BGI (which was formerly known as the Beijing Genomics Institute). If you look at just about any web page on this site other than the home page (which contains an unusual visual element), you'll see the following image:

My sharp, British-born, eyes quickly recognized this as the UK's Houses of Parliament in London (well technically it's the Palace of Westminster). See this image for a comparison. I then noticed that this image doesn't feature on the Chinese language version of the website (which has a completely different design).

I can only assume that some web designer thought that an image like this would be fitting because it is the English-language version of the website, and that they therefore chose an image of something (incorrectly) deemed to be English. At this point, I feel obliged to share the following video which offers a definitive explanation as to the differences between England, Great Britain, and the United Kingdom:

Reflections on my '101 questions with a bioinformatician' series

June 04, 2014 by Keith Bradnam

This is in lieu of a regular '101 questions with a bioinformatician' post which has been delayed (hopefully by only a day). This series of interviews has now been running for over 2 months and — judging by my web stats — it seems to be popular. In fact, these posts now account for the majority of traffic to this site.

Thanks to everyone who has contributed so far, and to everyone who has been reading these interviews. It's been fun doing this and I've enjoyed seeing the variety of answers that people have provided.

I should confess that I'm solely responsible for adding hyperlinks to the answers that people provide, and in addition to adding links for obvious items like pieces of bioinformatics software, I sometimes like to have a bit of fun with what I choose to link to. E.g. see the links I added to question 101 in my interview with Holly Bik.

To finish off, here are some relevant numbers about this series:

10 — number of interviews posted
2 — number of interviews finished and (almost) ready to be posted
6 — number of people who have agreed to be interviewed but haven't yet sent me their answers (cough, cough).
81 — my current list of 'potential interviewees'

The last point means that hopefully I can keep this series going for a while longer. I guess that I now have to aim for an interviewee #101 (which would be the 102nd interview…obviously).

Still collecting results for my survey about gender bias in bioinformatics

May 30, 2014 by Keith Bradnam

A quick post just to say that although I published some preliminary results from my survey about gender bias in bioinformatics, I left the survey live so that others could still add their responses. So far, I've had 28 more responses on top of the original 370.

I also tweaked the survey form to allow ex-bioinformaticians to respond (and I asked whether they left bioinformatics as a career because of gender bias). If you haven't done so, please complete the form (embedded below, or available here). I'll try to update the main results on Figshare in a few weeks. Hopefully, with some more results it will be possible to see if there are other notable patterns in the results.

101 questions with a bioinformatician #9: Tuuli Lappalainen

May 28, 2014 by Keith Bradnam

Tuuli Lappalainen is a group leader at the New York Genome Center, an institution that's so new that their Illumina HiSeq X Ten is counted as one of their older sequencing machines. In addition to having possibly the coolest logo for a genomics/bioinformatics institute, they also have an impressive set of green credentials. And did I mention that it's in New York, New York? Start spreading the newwwss…

Sorry, I got distracted.

Tuuli is also an assistant professor at the Department of Systems Biology at Columbia University. Her work focuses on using high-throughput sequencing data to study functional genetic variation in human populations. Her website — paraphrasing Dobzhansky — puts it like this:

Nothing in the genome makes sense except in the light of the transcriptome

You can find out more about Tuuli by following her on twitter (@tuuliel) or by checking out her lab's website. Oh, and Tuuli is looking for a talented post-doc to join her lab (she didn't ask me to say that, it's all part of the service). And now, on to the 101 questions...

001. What's something that you enjoy about current bioinformatics research?

I have very little interest in methods for the sake of methods; for me it's all about understanding biology, and bioinformatics provides fantastic opportunities for that.

010. What's something that you *don't* enjoy about current bioinformatics research?

The working environment that is local when data and analyses are increasingly global is driving me insane. I've done (and still do) a lot of consortium work, where all of us still end up copying large data files to our local servers, and having locally optimized pipelines and scripts that are impossible to transfer to colleagues. I know that many people are trying to solve the problem, and I hope we'll be able to make it happen soon. And then there are the complications of applying for and getting access to various datasets. Privacy concerns are important, but does dbGaP really need to be so difficult to use? Our open access data set from GEUVADIS (Genetic European Variation in Health and Disease) is a great exception to this.

011. If you could go back in time and visit yourself as an 18 year old, what single piece of advice would you give yourself to help your future bioinformatics career?

Learn more stats, math, proper programming. It's great to see how the younger generations have formal training in so many of the skills that I've had to just pick up along the way. I'm a biologist by training and proud of it, but in the early 2000s computational biology was still very marginal.

100. What's your all-time favorite piece of bioinformatics software, and why?

My two current favorites are pysam for handling BAM/SAM files — fast, great syntax, and much more versatile than alternatives — and Matrix eQTL for very fast eQTL analysis.

101. IUPAC describes a set of 18 single-character nucleotide codes that can represent a DNA base: which one best reflects your personality?

T for Tuuli!