Does your bioinformatics software pass the 'elevator test'?

July 28, 2015 by Keith Bradnam

The name of your bioinformatics software is important. A good name should be clear, unambiguous, pronounceable, memorable, and meaningful. Sadly many (most?) names of existing tools do not satisfy all of these criteria. Here is a simple thought experiment that you can use when trying to decide on a new name for your software; it might help you avoid many common naming problems.

Imagine that you are in an elevator going from the 6th floor of a building to the ground floor. The elevator stops at the 5th floor and a visiting bioinformatics/genomics scholar steps in. He/she is someone that you admire and someone who you would really like to know about the latest software tool that you've been working on.

They press the button for the 2nd floor. You have maybe 30 seconds to introduce the tool and hopefully make them curious enough to check it out when they next get back to their computer. You say something like:

Hi. I'm a big fan of your work. I wanted to let you know that I've been working on a tool that you might be interested in…it's called 'X'

In this example, we will assume that you may never see this person again and that you don't know when they will have time to look up your software tool. It might be days, so the name has to be something that they will remember. The more meaningful and pronounceable the name, the more chance that it will be memorable.

Now, let's consider the names of some recently published bioinformatics tools…do these pass the elevator test? You should always consider how you might have to spell out the name of your software:

tmle.npvi — tee-em-el-ee-dot-en-pee-vee-aye
EW_dmGWAS — Ee-double-you-underscore-dee-em-gee-was
do_x3dna — dee-oh- (or do?) -underscore-ex-three-DNA
R3D-2-MSA — ar-three-dee-dash-two-dash-em-ess-ay
Pse-in-One — pee-ess-ee- (or see?) -dash-in-dash-one
(PS)² — open-parentheses-pee-ess-close-parentheses-superscript-two

In these examples you would probably choose to omit details of the dots, dashes, underscores, parentheses, and superscript characters that are part of the name. So you should ask yourself whether you really need to include them in the first place.
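As a rough way to run this check on your own candidate name, here is a minimal Python sketch that lists every symbol you would have to say out loud. The function name symbols_to_say and the lookup table are hypothetical, made up for illustration; the spoken forms simply follow the spellings used in the list above.

```python
# Hypothetical helper: list the punctuation you would have to speak aloud.
# The spoken forms mirror the spellings used in the examples above.
SPOKEN_SYMBOLS = {
    ".": "dot",
    "_": "underscore",
    "-": "dash",
    "(": "open-parentheses",
    ")": "close-parentheses",
    "²": "superscript-two",
}


def symbols_to_say(name):
    """Return the spoken form of each character that is not a plain letter or digit."""
    spoken = []
    for ch in name:
        # Only plain ASCII letters and digits are treated as 'easy' characters;
        # everything else (including superscripts) has to be spelled out.
        if ch.isascii() and ch.isalnum():
            continue
        spoken.append(SPOKEN_SYMBOLS.get(ch, repr(ch)))
    return spoken


for candidate in ["tmle.npvi", "EW_dmGWAS", "R3D-2-MSA", "(PS)²"]:
    print(candidate, symbols_to_say(candidate))
```

If the list that comes back is not empty, that is a hint that part of the name will probably be dropped whenever somebody says it out loud.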

The bottom line is that it is not enough for the name of your software to be comprehensible when read from a screen or page…it should also sound good!

And the award for needless use of subscript in the name of a bioinformatics tool goes to…

June 01, 2015 by Keith Bradnam

The following paper can be found in the latest issue of Bioinformatics:

Computational identification of MoRFs in protein sequences

MoRFs are molecular recognition features, and the tool that the authors developed to identify them is called:

MoRF_CHiBi

So the tool's name includes a subscripted version of 'CHiBi', a name which is taken from the shorthand name for the Center for High-Throughput Biology at the University of British Columbia (this is where the software was presumably developed). The website for MoRF_CHiBi goes one step further by describing something called the MoRF_ChiBi,mc predictor. I'm glad that they felt that some italicized text was just the thing to complement the subscripted, mixed case name.

The subscript seems to serve no useful purpose and just makes the software name harder to read, particularly because it combines a lot of mixed capitalization. It also doesn't help that 'ChiBi' can be read as 'kai-bye' or 'chee-bee'. I'm curious whether CHiBi will be adding their name as a subscripted suffix to all of their software, or just this one.

More duplicate names for bioinformatics software: a tale of two HIPPIES

April 14, 2015 by Keith Bradnam

Thanks to Sara Gosline (@sargoshoe) for bringing this to my attention. Compare and contrast the following:

The former tool, published in 2012 in PLOS ONE, takes its name from 'Human Integrated Protein-Protein Interaction rEference' (it was doing so well until it reached the last letter). The latter tool ('High-throughput Identification Pipeline for Promoter Interacting Enhancer elements') was published in 2014 in the journal Bioinformatics.

Leaving aside the issue of whether these names are worthy of a JABBA award, the issue here is that we have yet another duplicate set of software names for two different bioinformatics tools. The authors of the 2nd paper could, and should, have checked for 'prior art'.

If you are planning to develop a new bioinformatics tool and have thought of a possible name, please take the time to do the following:

Visit http://google.com (or your preferred web search engine)
In the search box type the proposed name of your tool followed by a space
Then add the word 'bioinformatics'
Click search
That's it
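
If you would rather script those steps, the following is a minimal Python sketch that builds the search link for you. The function name prior_art_search_url is hypothetical, and the /search?q= form of the Google address is an assumption on my part rather than anything stated above; swap in whatever your preferred search engine expects.

```python
from urllib.parse import urlencode


def prior_art_search_url(tool_name):
    """Build a search link for '<tool name> bioinformatics'.

    Assumption: the search engine accepts the query in a 'q' parameter
    at /search, as Google's public search page does.
    """
    query = f"{tool_name} bioinformatics"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Paste the printed link into a browser and see who got there first.
print(prior_art_search_url("HIPPIE"))
```

The script only builds the link; reading the first page of results is still up to you.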

Thinking of naming your bioinformatics software? Be afraid (of me), be very afraid (of me)

March 20, 2015 by Keith Bradnam

Saw this tweet today by Jessica Chong (@jxchong):

Hm… our group just spent 1.5 hrs brainstorming a name/initialism/acronym for a database/tool. Paranoid thanks to @kbradnam
— Jessica Chong (@jxchong) March 19, 2015

I like the idea that people might be fearful of choosing a name that could provoke me into writing a JABBA post. If my never-ending tirade about bogus bioinformatics acronyms causes some people to at least think twice about their intended name, then I will take that as a minor victory for this website!

Bioinformatics software names: the good, the bad, and the ugly

March 12, 2015 by Keith Bradnam

The Good

Given that I spend so much time criticising bad bioinformatics names, I probably should make more of an effort to flag those occasional names that I actually like! Here are a few:

RNAcentral: an international database of ncRNA sequences

A good reminder that a bioinformatics tool doesn't have to use acronyms or initialisms! The name is easy to remember and makes it fairly obvious as to what you might expect to find in this database.

KnotProt: a database of proteins with knots and slipknots

A simple, clever, and memorable, name. And once again, no acronym!

WormBase and FlyBase

Some personal bias here (I spent four years working at WormBase), but you have to admire the simplicity and elegance of the names. 'WormBase' sort of replaced its predecessor ACeDB (A Caenorhabditis elegans DataBase). I say 'sort of' because ACeDB was the name for both the underlying software (which continued to be used by WormBase) and the specific instance of the database that contained C. elegans data. This led to the somewhat confusing situation (circa 2000) of there being many public ACeDB databases for many different species, only one of which was the actual ACeDB resource with worm data.

The Bad

These are all worthy of a JABBA award:

The human DEPhOsphorylation database DEPOD: a 2015 update

I find it amusing that they couldn't even get the acronym correctly capitalized in the title of the paper. As the abstract confirms, the second 'D' in 'DEPOD' comes from the word 'database', which should be capitalized. So it is another tenuous selection of letters to form the name of the database, but I guess at least the name is unique and Google searches for depod database don't have any trouble finding this resource.

IMGT®, the international ImMunoGeneTics information system® 25 years on

It's a registered trademark and that little R appears at every mention of the name in the paper. This initialism is the first I've seen where all letters of the short name come from one word in the full name.

DoGSD: the dog and wolf genome SNP database

I have several issues with this:

It's a poor acronym (not explicitly stated in the paper): Dog and wolf Genome Snp Database
The word 'dog' contributes a 'D' to the name, but then you end up with 'DoG' in the final name. It looks odd.
What did the poor wolf do to not get featured in the database name?
The lower-case 'O' means that you potentially can read this as dog-ess-dee or do-gee-ess-dee.
Why focus the name on just two types of canine species? What if they wanted to add SNPs from Jackals or Coyotes, are they going to change the name of the database? They could have just called this something like 'The Canine SNP Database' and avoided all of these problems.

The Ugly

Maybe not JABBA-award winners, but they come with their own problems:

MulRF: a software package for phylogenetic analysis using multi-copy gene trees

Sometimes using the lower-case letter 'L' in any name is just asking for trouble. Depending on the font, it can look like the number 1 or even a pipe character '|'. The second issue concerns the pronounceability of this name. Mull-urf? Mull-ar-eff? It doesn't trip off the tongue.

DupliPHY-Web: a web server for DupliPHY and DupliPHY-ML

This tool is all about looking for gene duplications from a phylogenetic perspective, hence 'Dupli' + 'PHY'. I actually think this is quite a good choice of name, except for the inconsistent, and visually jarring, use of mixed case. Why not just 'Dupliphy'?

ChiTaRS 2.1—an improved database of the chimeric transcripts and RNA-seq data with novel sense–antisense chimeric RNA transcripts

It's not spelt out in detail, but one can assume that 'ChiTaRS' derives from the following letters: CHImeric Transcripts And Rna-Seq data. So it is not a bogus bioinformatics acronym in that respect. But I find it visually unappealing. Mixed capitalization like this never scans well.

DoRiNA 2.0—upgrading the doRiNA database of RNA interactions in post-transcriptional regulation

The paper doesn't explicitly state how the word 'DoRiNA' is formed other than saying:

we have built the database of RNA interactions (doRiNA)

So one can assume that those letters are derived from 'Database Of Rna INterActions'. On the plus side, it is a unique name easily searchable with Google. On the negative side, it seems strange to have 'RNA' as part of your database name, only with an additional letter inserted in between.

Choosing names for bioinformatics software: it's a snap

March 02, 2015 by Keith Bradnam

Compare the following published bioinformatics resources:

SNAP: Semi-HMM-based Nucleic Acid Parser (published 2004)
SNAP: Suite of Nucleotide Analysis Programs (published 2005)
SNAP: SNP Annotation And Proxy search (published 2008)
SNAP: Screening for NonAcceptable Polymorphisms (published 2008)
SNAP: Scalable Nucleotide Alignment Program (published 2011)

Every new bioinformatics tool that decides to reuse an existing name — either wilfully or by ignorance — makes it that little bit harder for people to find one of the other similarly-named-tools that they might be searching for.
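
If you keep your own list of published tool names, spotting clashes like this is easy to automate. Below is a minimal Python sketch that uses the five SNAP entries above as its data. The function name find_name_clashes is hypothetical, and comparing names without regard to case is my own assumption about what counts as a clash.

```python
from collections import defaultdict

# (name, expansion, year published), taken from the list above.
published = [
    ("SNAP", "Semi-HMM-based Nucleic Acid Parser", 2004),
    ("SNAP", "Suite of Nucleotide Analysis Programs", 2005),
    ("SNAP", "SNP Annotation And Proxy search", 2008),
    ("SNAP", "Screening for NonAcceptable Polymorphisms", 2008),
    ("SNAP", "Scalable Nucleotide Alignment Program", 2011),
]


def find_name_clashes(tools):
    """Group tools by upper-cased name and keep names used more than once."""
    by_name = defaultdict(list)
    for name, expansion, year in tools:
        # Assumption: 'Seed' and 'SEED' should count as the same name.
        by_name[name.upper()].append((year, expansion))
    return {name: sorted(uses) for name, uses in by_name.items() if len(uses) > 1}


for name, uses in find_name_clashes(published).items():
    print(f"{name} has been used {len(uses)} times:")
    for year, expansion in uses:
        print(f"  {year}: {expansion}")
```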

h/t to @byuhobbes for bringing some of these duplicates to my attention.

Sowing the seeds of bad bioinformatics names

February 17, 2015 by Keith Bradnam

Here are two simple pieces of advice for people who are looking for a name for their latest bioinformatics tool/database/resource:

Avoid common words which might cause people searching for your tool to find something else instead.
Choose a name that hasn't been used before by the bioinformatics community.

Having said that, let's look at a new paper in the journal Bioinformatics:

Seed: a user-friendly tool for exploring and visualizing microbial community data

The name 'Seed' is a not-too-offensive acronym for Simple Exploration of Ecological Data. So what's my beef with it?

The problem is that words like seed are going to appear all over the Internet. My standard test for the 'searchability' of a bioinformatics tool is to search for the tool name followed by the word 'bioinformatics'. Your resource's website or publication should hopefully be the number one result (or somewhere on the first page). However, that is not what happens here.

And searching for 'seed bioinformatics' raises more problems by clashing with my second piece of advice. E.g. here are a couple of papers that were in my first page of Google results:

2010: Accessing the SEED Genome Databases via Web Services API: Tools for Programmers

2011: SEED: efficient clustering of next-generation sequences

So what happens if you include 'microbial' into your search terms? Won't that help?

Nope. Turns out that the SEED (not an acronym as far as I can tell) is an annotation environment for microbial genomes that has been around for a decade, and which has spawned many papers, e.g.:

2014: The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)

All of which means that people looking to find the newly published Seed tool are not going to have much luck when using search engines.

Can you say the name of this new bioinformatics method three times fast?

February 12, 2015 by Keith Bradnam

New in the journal Bioinformatics:

jNMFMA: a joint non-negative matrix factorization meta-analysis of transcriptomics data

JNMF stands for Joint Non-negative Matrix Factorization. Throw in some meta-analysis and randomly decide to make the 'J' lower-case as well as italicized and you end up with the trips-off-the-tongue name of jNMFMA. Try saying it three times fast! Actually, I had trouble pronouncing this just once.

From CASP to Poreathon: what makes for a good bioinformatics 'brand' name?

January 23, 2015 by Keith Bradnam

One of my more significant contributions to the world of bioinformatics is that I came up with the name for The Assemblathon.

Towards the end of 2010, our group at the UC Davis Genome Center was tasked with helping organize a new competition to assess software in the field of genome assembly. I remember a midweek meeting with my boss (Ian Korf) where he informed me that by the end of the week we had to come up with a name for the project, set up a website, and have a mailing list up and running…and by 'we' he meant 'me'.

I was aware that there had been several other comparative software assessments in the field of bioinformatics, and that a certain theme had arisen in the naming of such exercises:

CASP - Critical Assessment of protein Structure Prediction: running since 1994 and organized by a team that are also in the Genome Center
GASP - Genome Annotation aSsessment Project (later renamed GASP1): a 1999 attempt to assess annotation in a region of the Drosophila melanogaster genome
EGASP - the human ENCODE Genome Annotation aSsessment Project: 2005–2006
nGASP - nematode Genome Annotation aSsessment Project: 2006–2008
RGASP - RNA-seq Genome Annotation aSsessment Project: 2005–2013 (RGASP1 and RGASP2 were designed to evaluate computational methods for RNA-seq data analysis whereas the latest RGASP3 is focusing on comparing RNA-seq read alignment software)
dnGASP - de novo Genome Assembly aSsessment Project: 2010–2011 (something that ran in parallel with Assemblathon 1)

It seems amazing to me that after GASP decided to make a bogus acronym by including the 'S' from 'aSsessment', all subsequent evaluation exercises followed suit (although you could also argue that CASP could have worked equally well as 'CAPS').

I felt quite strongly that the world did not need another '…ASP' style of name and so I came up with 'The Assemblathon'. Although many might shudder at this, I was really thinking of it as a 'brand' name, rather than just another forgettable scientific project name. The Assemblathon name ticked several boxes:

Memorable
Different
Pronounceable
Website name was available
Twitter account name was available

The last two items are kind of obvious when you realize that this is a completely new word. You may disagree, but I think that these are important — but not essential — aspects of naming a scientific project.

So what has happened since I bequeathed the Assemblathon brand to the world? Well, we've now had:

Alignathon - A collaborative competition to assess the state of the art in whole genome sequence alignment (published in 2014)
Variathon - A challenge to analyze existing or new pipelines for variant calling in terms of accuracy and efficiency (completed in 2013, but not published yet as far as I can tell)
Poreathon - Assessment of bioinformatics pipelines relating to Oxford Nanopore sequencing data (announced by Nick Loman this week)

I don't have any issues with 'Alignathon', as the name is based on a verb and the goal of the project is probably guessable by any bioinformatician. Like Assemblathon, it is a portmanteau that just seems to work.

In contrast, I find 'Variathon' a horrible name. The name doesn't scan well and may not make as much sense to others. If you search Google for this name you will see the following:

Not a good sign if your project name is regarded as a spelling mistake!

So what about 'Poreathon'? While I find this less offensive than Variathon, I still don't think it is a particularly snappy name…a bit of a snoreathon perhaps? ;-) Pore is both a noun and a verb, so the dual meaning of the word somewhat dilutes its impact as a project name.

5 suggestions for naming scientific projects

You should not feel committed to naming something in order to continue a previous naming trend
Acronyms are not the only option for the name of a scientific project!
If there is any confusion as to how your project name is spelt or pronounced, this will not help you promote the name among your peers.
Consider treating the intended name as a brand, and explore the issues that arise (how discoverable is the name, how similar to other 'brands', can you trademark it, is your name offensive in other languages, can you buy a suitable domain name? etc.)
At the very least, perform a Google search for your intended name to see if others in your field have already used it (see my post on Identical Classifications In Science)

Unpronounceable bioinformatics database names

January 21, 2015 by Keith Bradnam

First a quick reminder that an acronym is something that is meant to be pronounced as an entire word (e.g. NATO, AIDS etc.). Sometimes these end up becoming regular, non-capitalized, words (e.g. radar, laser).

In contrast, an initialism is something where the component letters are read out individually (e.g. BBC, CPU). In bioinformatics, there are also names which are part acronym and part initialism (e.g. GWAS…which I have only ever heard pronounced as gee-was).

Most initialisms that we use in everyday life tend to be short (2–4 letters) because this makes them easier to read and to pronounce. As you move past 4 letters, you run the risk of making your initialism unpronounceable and unmemorable.
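
That rule of thumb is simple enough to check in a couple of lines. The following is a deliberately naive Python sketch: both function names are hypothetical, and it assumes that every letter or digit in the name gets read out individually, so it cannot tell when part of a name is pronounced as a word.

```python
def spoken_character_count(name):
    """Count the letters and digits that would be read out one at a time."""
    return sum(1 for ch in name if ch.isalnum())


def is_long_initialism(name, limit=4):
    """True if the name has more spoken characters than the 2-4 letter comfort zone."""
    return spoken_character_count(name) > limit


for candidate in ["BBC", "CPU", "EHFPI", "DBTMEE"]:
    verdict = "too long?" if is_long_initialism(candidate) else "fine"
    print(candidate, spoken_character_count(candidate), verdict)
```

The limit of 4 comes straight from the paragraph above; treat a 'too long?' verdict as a prompt to say the name out loud, not as a final judgement.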

So here are some recently published bioinformatics tools with names that are a bit cumbersome to repeat. For each one I include how someone might try to pronounce them. Try repeating these names quickly and for an added test, see how many of these names you can remember 5 minutes after you read this:

5 characters

CeCaFDB: a curated database for the documentation, visualization and comparative analysis of central carbon metabolic flux distributions explored by 13C-fluxomics: cee-car-eff-dee-bee? — this assumes that 'Ce' and 'Ca' are not treated separately as two letters…one could argue that if it is not clear how your bioinformatics tool name should be pronounced, then it does not have a good name.
EHFPI: a database and analysis resource of essential host factors for pathogenic infection: ee-aitch-eff-pee-aye
PAIDB v2.0: exploration and analysis of pathogenicity and resistance islands: pee-ay-aye-dee-bee — this is a particularly bad choice of name as it will read to many as 'paid-bee'
rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development: ar-ar-en-dee-bee (the first 3 characters are not easy to say quickly!)
The TTSMI database: a catalog of triplex target DNA sites associated with genes and regulatory elements in the human genome: tee-tee-ess-em-aye

6 characters

DBTMEE: a database of transcriptome in mouse early embryos: dee-bee-tee-em-ee-ee (I accept that maybe this one is just pronounced dee-bee-tee-me, but once again, do you really want there to be uncertainty as to how the name of your bioinformatics tool is read by others?)
euL1db: the European database of L1HS retrotransposon insertions in humans: ee-you-ell-one-dee-bee
SASBDB, a repository for biological small-angle scattering data: ess-ay-ess-bee-dee-bee
WDSPdb: a database for WD40-repeat proteins: dub-ball-you-dee-ess-pee-dee-bee

7 characters

BCCTBbp: the Breast Cancer Campaign Tissue Bank bioinformatics portal: bee-cee-cee-tee-bee-bee-pee
PFP/ESG: automated protein function prediction servers enhanced with Gene Ontology visualization tool: pee-eff-pee-slash-ee-ess-gee (only 6 characters if you omit the slash I guess)
PHI-DAC: protein homology database through dihedral angle conservation: pee-aitch-aye-dash-dee-ay-cee (shorter if you omit dash and/or pronounce 'DAC' as a word)

And the winner goes to…

BioVLAB-MMIA-NGS: microRNA–mRNA integrated analysis using high-throughput sequencing data: this is a 7-letter initialism that comes after a three syllable (non-standard) word, so to pronounce this you have to say bio-vee-lab-em-em-aye-ay-en-gee-ess!!!

Conclusions

If you want people to actually use your bioinformatics tools, then you should aim to give them names that are memorable and pronounceable.
