Bioinformatics software names: the good, the bad, and the ugly

The Good

Given that I spend so much time criticising bad bioinformatics names, I should probably make more of an effort to flag those occasional names that I actually like! Here are a few:

RNAcentral: an international database of ncRNA sequences

A good reminder that a bioinformatics tool doesn't have to use acronyms or initialisms! The name is easy to remember and makes it fairly obvious what you might expect to find in this database.


KnotProt: a database of proteins with knots and slipknots

A simple, clever, and memorable name. And once again, no acronym!


WormBase and FlyBase

Some personal bias here (I spent four years working at WormBase), but you have to admire the simplicity and elegance of the names. 'WormBase' sort of replaced its predecessor ACeDB (A Caenorhabditis elegans DataBase). I say 'sort of' because ACeDB was the name for both the underlying software (which continued to be used by WormBase) and the specific instance of the database that contained C. elegans data. This led to the somewhat confusing situation (circa 2000) of there being many public ACeDB databases for many different species, only one of which was the actual ACeDB resource with worm data.


The Bad

These are all worthy of a JABBA award:

The human DEPhOsphorylation database DEPOD: a 2015 update

I find it amusing that they couldn't even get the acronym correctly capitalized in the title of the paper. As the abstract confirms, the second 'D' in 'DEPOD' comes from the word 'database', which should therefore be capitalized. So it is another tenuous selection of letters to form the name of the database, but I guess at least the name is unique, and Google searches for 'depod database' don't have any trouble finding this resource.


IMGT®, the international ImMunoGeneTics information system® 25 years on

It's a registered trademark and that little R appears at every mention of the name in the paper. This initialism is the first I've seen where all letters of the short name come from one word in the full name.


DoGSD: the dog and wolf genome SNP database

I have several issues with this:

  1. It's a poor acronym (not explicitly stated in the paper): Dog and wolf Genome Snp Database
  2. The word 'dog' contributes a 'D' to the name, but then you end up with 'DoG' in the final name. It looks odd.
  3. What did the poor wolf do to not get featured in the database name?
  4. The lower-case 'O' means that you potentially can read this as dog-ess-dee or do-gee-ess-dee.
  5. Why focus the name on just two types of canine? What if they wanted to add SNPs from jackals or coyotes? Would they change the name of the database? They could have just called this something like 'The Canine SNP Database' and avoided all of these problems.

The Ugly

Maybe not JABBA-award winners, but they come with their own problems:

MulRF: a software package for phylogenetic analysis using multi-copy gene trees

Sometimes using the lower-case letter 'L' in any name is just asking for trouble. Depending on the font, it can look like the number 1 or even a pipe character '|'. The second issue concerns the pronounceability of this name. Mull-urf? Mull-ar-eff? It doesn't trip off the tongue.
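If you wanted to screen a candidate name for this kind of problem, a minimal Python sketch might look like the following. The set of 'confusable' characters is my own illustrative (and incomplete) choice, and the function name is hypothetical; which characters actually look alike depends entirely on the font.

```python
# Characters that are easy to confuse in many fonts.
# This is an illustrative, incomplete set chosen for this sketch.
CONFUSABLE = {
    "l": "lower-case L",
    "I": "upper-case i",
    "1": "digit one",
    "|": "pipe",
    "O": "upper-case o",
    "0": "digit zero",
}


def confusable_characters(name):
    """Return (position, character, description) for each confusable character."""
    return [
        (i, ch, CONFUSABLE[ch])
        for i, ch in enumerate(name)
        if ch in CONFUSABLE
    ]


for tool in ["MulRF", "KnotProt"]:
    print(tool, confusable_characters(tool))
```

With these assumptions, 'MulRF' is flagged for its lower-case 'L' whereas 'KnotProt' passes cleanly.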


DupliPHY-Web: a web server for DupliPHY and DupliPHY-ML

This tool is all about looking for gene duplications from a phylogenetic perspective, hence 'Dupli' + 'PHY'. I actually think this is quite a good choice of name, except for the inconsistent, and visually jarring, use of mixed case. Why not just 'Dupliphy'?


ChiTaRS 2.1—an improved database of the chimeric transcripts and RNA-seq data with novel sense–antisense chimeric RNA transcripts

It's not spelt out in detail, but one can assume that 'ChiTaRS' derives from the following letters: CHImeric Transcripts And Rna-Seq data. So it is not a bogus bioinformatics acronym in that respect. But I find it visually unappealing. Mixed capitalization like this never scans well.


DoRiNA 2.0—upgrading the doRiNA database of RNA interactions in post-transcriptional regulation

The paper doesn't explicitly state how the word 'DoRiNA' is formed other than saying:

we have built the database of RNA interactions (doRiNA)

So one can assume that those letters are derived from 'Database Of Rna INterActions'. On the plus side, it is a unique name easily searchable with Google. On the negative side, it seems strange to have 'RNA' as part of your database name, only with an additional letter inserted in between.
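The guessed expansions in this post can be checked mechanically. Here is a rough Python sketch, assuming the convention I have used throughout (the letters that contribute to the name are written in upper case in the expansion). The function names are my own, and this says nothing about how the authors actually arrived at their names.

```python
def capitals(expansion):
    """Collect the upper-case letters of an expansion, in order."""
    return "".join(ch for ch in expansion if ch.isupper())


def expansion_matches(name, expansion):
    """True if the capitalised letters of the expansion spell the name (ignoring case)."""
    return capitals(expansion).lower() == name.lower()


# Expansions as guessed in this post, not as stated by the papers.
examples = {
    "ChiTaRS": "CHImeric Transcripts And Rna-Seq data",
    "DoRiNA": "Database Of Rna INterActions",
    "DoGSD": "Dog and wolf Genome Snp Database",
}

for name, expansion in examples.items():
    print(name, capitals(expansion), expansion_matches(name, expansion))
```

ChiTaRS and DoRiNA both pass this check. DoGSD does not, because the capitalised letters only spell 'DGSD': the 'o' has to be borrowed from 'Dog' without being capitalised, which is exactly why the name looks odd.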

Metassembler: Merging and optimizing de novo genome assemblies

There's a great new paper in bioRxiv by Alejandro Hernandez Wences and Michael Schatz. They directly address something I wondered about as we were running the Assemblathon contests. Namely, can you combine some of the submitted assemblies to make an even better assembly? Well, the answer seems to be a resounding 'yes'.

For each of three species in the Assemblathon 2 project we applied our algorithm to the top 6 assemblies as ranked by the cumulative Z-score reported in the paper…

We evaluated the correctness and contiguity of the metassembly at each merging step using the metrics used by the Assemblathon 2 evaluation…

In all three species, the contiguity statistics are significantly improved by our metassembly algorithm

Hopefully their Metassembler tool will be useful in improving the many other poor-quality assemblies that are out there!

My favorite bioinformatics blogs of 2014

Yeah, so I’m a little late in getting around to writing this! The following are not presented in any particular order…


Blog: In between lines of code
Creator: Lex Nederbragt (@lexnederbragt)
Frequency of updates: maybe 1 post a month on average (but it can vary)
Recommended blog post: Developments in next generation sequencing – June 2014 edition

This blog primarily focuses on developments in modern sequencing technology and genome assembly. Required reading if you have an interest in the state of current sequencing technologies, and more importantly, where they are heading.


Blog: Bioinformatician at large
Creator: Ewan Birney (@ewanbirney)
Frequency of updates: less than 1 post a month on average
Recommended blog post: A cheat’s guide to histone modifications

Ewan doesn’t update the blog very often, but when he does post, he usually takes the time to provide us with a very detailed look at some aspect of genomics. Many of his posts explore the underlying science that people are addressing through various genomics/bioinformatics approaches.


Blog: Loman Labs Blog
Creator: Nick Loman (@pathogenomenick)
Frequency of updates: 1–2 posts a month on average
Recommended blog post: The infinite lie of easy bioinformatics

Nick covers lots of material relating to metagenomics and sequencing in general. As a self-confessed Oxford Nanopore fan boy he has some interesting thoughts and observations to share about this nascent sequencing technology, but he writes about most modern sequencing technologies. He also likes a rant from time to time (don’t we all?).


Blog: Living in an Ivory Basement: Stochastic thoughts on science, testing, and programming
Creator: C. Titus Brown (@ctitusbrown), a.k.a. Chuck Norris
Frequency of updates: several posts a week
Recommended blog post: Some myths of reproducible computational research

Titus covers a lot of different material on his blog. Many posts see him ‘thinking out loud’ on an issue, keeping people updated with developments with his training courses, or frequently asking people for their thoughts or suggestions on a topic. There are also detailed scientific posts relating to his interests in kmer-based approaches relating to genome assembly. Being a keen advocate (and practitioner) of open and reproducible science, Titus also uses his blog to write on these topics.


Blog: Omics! Omics! A computational biologist’s personal views on new technologies & publications on genomics & proteomics and their impact on drug discovery
Creator: Keith Robison (@omicsomicsblog)
Frequency of updates: 1–2 posts per month (but many more during AGBT!)
Recommended blog post: A Sunset for Draft Genomes?

This blog is predominantly focused on the latest developments in sequencing technology. Keith provides great insight into future developments in the world of sequencing, and also tries to make sense of the claims and marketing hype that sometimes surround the announcements of new technologies. During the annual Advances in Genome Biology and Technology (AGBT) meeting, you can rely on Keith to provide great commentary on what is happening (and not happening) at the meeting.


Blog: Opiniomics: bioinformatics, genomes, biology etc. “I don’t mean to sound angry and cynical, but I am, so that’s how it comes across”
Creator: Mick Watson (@biomickwatson)
Frequency of updates: 1–2 posts per week
Recommended blog post: Why anonymous peer review is bad for science

Of all the blogs that I’m including here, this is probably my favorite. I greatly enjoy Mick’s writing; not so much for the detailed technical posts about sequencing technology — good though these are — but for the fantastic pieces he writes about the wider field of bioinformatics. Mick has insightful views on such topics as peer review, training of bioinformaticians, and reproducible science. I particularly like Mick’s frequently humorous — and sometimes slightly ranty — style of writing. Oh, I should also point out that Mick’s site is best viewed on an iPad.

Slides: Thoughts on the feasibility of Assemblathon 3

The slides below represent the draft assembly version of the talk that Ian Korf will be giving today at the Genome 10K meeting. That is, these are slides that I made for him to use as the basis of his talk. I expect his final version will differ somewhat.

After I made these slides I discovered that two of the species that I listed as potential candidates for Assemblathon 3 already have genome projects. The tuatara genome project is actually the subject of another talk at the Genome 10K meeting, and a colleague tells me that there is also a California condor genome project.

Thoughts on a possible Assemblathon 3

Lex Nederbragt has written a post outlining his thoughts on what any Assemblathon 3 contest should look like. This is something that Ian Korf will be talking about today at the Genome 10K meeting which is happening at the moment (though it seems that there has been a lot of discussion about this in other sessions). From his post:

I believe it is here the Assemblathon 3 could make a contribution. By switching the focus from the assembly developers to the assembly users, Assemblathon 3 could help to answer the question:

How to choose the ‘right’ assembly from a set of generated assemblies

Trying to download the cow genome (again): where's the beef (again)?

Almost a year ago, I blogged about my frustrations regarding the extremely confusing nature of the cow genome and the many genome assemblies that are out there. Much of that frustration was due to websites and FTP sites that had broken links, misleading information, and woefully incomplete documentation.

One year on and I hear a rumor that a new version of the cow genome is available. So I went off in search of 'UMD 3.1.1'. My first stop was bovinegenome.org which is one place where you can find the previous 'UMD 3.1' assembly. But alas, they do not list UMD 3.1.1.

After some Google searching I managed to find this information at the UCSC Genome Bioinformatics news archive:

We are pleased to announce the release of a Genome Browser for the June 2014 assembly of cow, Bos taurus (Bos_taurus_UMD 3.1.1, UCSC version bosTau8). This updated cow assembly was provided by the UMD Center for Bioinformatics and Computational Biology (CBCB). This assembly is an update to the previous UMD 3.1 (bosTau6) assembly. UMD 3.1 contained 138 unlocalized contigs that were found to be contaminants. These have been suppressed in UMD 3.1.1.

This reveals that the update is pretty minor (removal of contaminant contigs which were never part of any chromosome sequence anyway). In any case, the UCSC FTP site contains the UMD 3.1.1 assembly so that's great.

But out of curiosity I followed UCSC's link to the UMD Center for Bioinformatics and Computational Biology (CBCB) website. The home page doesn't make it easy to find the cow genome data. Searching the site for 'UMD 3.1.1' didn't help but searching for 'cow genome' did take me to their Assembly data page which lists the cow genome. Unfortunately the link for the Bos taurus genome takes you to 'page not found'. In contrast, the 'data download' link does work and takes you to their FTP site which fails to include the new assembly (but it does list all of the older cow genome assemblies).

Plus ça change, plus c'est la même chose.

Community annotation — by any name — still isn’t a part of the research process. It should be

In order for community annotation efforts to succeed, they need to become part of the established research process: mine annotations, generate hypotheses, do experiments, write manuscripts, submit annotations. Rinse and repeat.

A thoughtful post by Todd Harris on his blog which lists some suggestions for how to fix the failure of community annotation projects.

I particularly like Todd's 3rd suggestion:

We need to recognize the efforts of people who do [community annotation]. This system must have professional currency to it, akin to writing a review paper, and should be citable…

Tales of drafty genomes: part 3 – all genomes are complete…except for those that aren't

This is the third post in an infrequent series that looks at the world of unfinished genomes.

One of the many, many resources at the NCBI is their Genome database. Here's how they describe themselves:

The Genome database contains sequence and map data from the whole genomes of over 1000 species or strains. The genomes represent both completely sequenced genomes and those with sequencing in-progress. All three main domains of life (bacteria, archaea, and eukaryota) are represented, as well as many viruses, phages, viroids, plasmids, and organelles.

This text could probably be updated because the size of the database is now wrong by an order of magnitude…there are currently 11,322 genomes represented in this database. But how many of them are 'completely sequenced' and how many are at the 'sequencing in-progress' stage?

Luckily, the NCBI classifies all genomes into one of four 'levels':

  • Complete
  • Chromosome
  • Scaffold
  • Contig

I couldn't find any definitions for these categories within the NCBI Genome database, but elsewhere on the NCBI website I found the following definitions for the latter three categories:

Chromosome - there is sequence for one or more chromosomes. This could be a completely sequenced chromosome (gapless) or a chromosome containing scaffolds with unlinked gaps between them.

Scaffold - some sequence contigs have been connected across gaps to create scaffolds, but the scaffolds are all unplaced or unlocalized.

Contig - nothing is assembled beyond the level of sequence contigs

So considering just the 2,032 Eukaryotic species in the NCBI Genome Database, we can ask…how many of them are complete?

Completion status of 2,032 eukaryotic genomes, as classified by NCBI

The somewhat depressing answer is that only a meagre 24 eukaryotic genomes are listed as complete, about 1% of the total. Even if we include genomes with chromosome sequences, we are still only talking about 13% of all genomes. You might imagine that the state of completion would be markedly better when looking at prokaryotes. However, only 11.5% of the 31,696 prokaryotic genomes are classified as complete.
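As a quick sanity check on those percentages, here is a minimal Python sketch. The counts (24, 2,032, and 31,696) and the 11.5% figure come from the text above; the function and variable names are mine, and the prokaryote count is only an estimate back-calculated from the rounded percentage.

```python
def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100 * part / whole


# Figures quoted in the text above.
eukaryotes_total = 2032
eukaryotes_complete = 24
prokaryotes_total = 31696
prokaryotes_complete_fraction = 0.115  # "11.5%" as quoted, already rounded

eukaryote_pct = percent(eukaryotes_complete, eukaryotes_total)
print(f"Complete eukaryotic genomes: {eukaryote_pct:.1f}%")

# Approximate count only, because the quoted percentage is rounded.
prokaryotes_complete_estimate = round(prokaryotes_complete_fraction * prokaryotes_total)
print(f"Complete prokaryotic genomes: roughly {prokaryotes_complete_estimate:,}")
```

So 'about 1%' is, if anything, slightly generous to the prokaryotes by comparison: the eukaryote figure works out at 1.2%, while 11.5% of the prokaryotic genomes is somewhere in the region of 3,600 genomes.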

In the last post in this series, I included a dictionary definition of the word 'draft'. This time, let's look to see how Merriam-Webster defines 'complete':

having all necessary parts : not lacking anything

not limited in any way

not requiring more work : entirely done or completed

By this definition, I think we could all agree that very few genomes are actually complete.

Choosing names for bioinformatics software: it's a snap

Image from flickr user plashingvole

Compare the following published bioinformatics resources:

  1. SNAP: Semi-HMM-based Nucleic Acid Parser (published 2004)
  2. SNAP: Suite of Nucleotide Analysis Programs (published 2005)
  3. SNAP: SNP Annotation And Proxy search (published 2008)
  4. SNAP: Screening for NonAcceptable Polymorphisms (published 2008)
  5. SNAP: Scalable Nucleotide Alignment Program (published 2011)

Every new bioinformatics tool that decides to reuse an existing name, either wilfully or through ignorance, makes it that little bit harder for people to find one of the other similarly named tools that they might be searching for.
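Spotting this sort of clash before you publish is not hard. Below is a minimal Python sketch that groups tools by name and reports any name used more than once. The five records are the ones listed above; the function name and the idea of keeping such a list in the first place are my own assumptions (in practice you would more likely just search the literature for your proposed name).

```python
from collections import defaultdict

# (name, expansion, year published), taken from the list above.
tools = [
    ("SNAP", "Semi-HMM-based Nucleic Acid Parser", 2004),
    ("SNAP", "Suite of Nucleotide Analysis Programs", 2005),
    ("SNAP", "SNP Annotation And Proxy search", 2008),
    ("SNAP", "Screening for NonAcceptable Polymorphisms", 2008),
    ("SNAP", "Scalable Nucleotide Alignment Program", 2011),
]


def find_name_collisions(records):
    """Group records by case-insensitive name; keep names used more than once."""
    by_name = defaultdict(list)
    for name, expansion, year in records:
        by_name[name.upper()].append((year, expansion))
    return {
        name: sorted(entries)
        for name, entries in by_name.items()
        if len(entries) > 1
    }


for name, entries in find_name_collisions(tools).items():
    print(f"{name} is used by {len(entries)} different tools:")
    for year, expansion in entries:
        print(f"  {year}: {expansion}")
```

Names are compared case-insensitively here, on the assumption that 'Snap' and 'SNAP' would be just as hard to tell apart in a search engine.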

h/t to @byuhobbes for bringing some of these duplicates to my attention.