Celebrating an unsung hero of genomics: how Albrecht Kossel saved bioinformatics from a world of hurt

The following image is from one of the first publications to ever depict a DNA sequence in a textual manner:

From Wu & Kaiser, Journal of Molecular Biology, 1968.

This is taken from the 1968 paper by Wu and Kaiser: Structure and base sequence in the cohesive ends of bacteriophage lambda DNA. In the nearly half a century since this publication, it has become the norm to represent the sequence of any DNA molecule simply as a string of characters, one character per base. But have you ever stopped to consider why these bases have the names that they do?

Most of the work to isolate and describe the purines and pyrimidines that comprise nucleic acids was carried out by the German biochemist Albrecht Kossel. Between 1885 and 1901 he characterized the principal nucleobases of the nucleic acids: adenine, cytosine, guanine, and thymine (though guanine had first been isolated in 1844 by Heinrich Gustav Magnus). The fifth nucleobase, uracil, was discovered by Alberto Ascoli, a student of Kossel.

Kossel's work would later be recognized with the 1910 Nobel Prize in Physiology or Medicine. It should be noted that Kossel didn't just help isolate and describe these bases; he was also chiefly responsible for most of their names.

Albrecht Kossel. Image from Wikimedia.

Guanine had already been named for the place where it was first discovered (the excrement of seabirds known as guano). Adenine was so named by Kossel because it was isolated from the pancreas gland ('adenas' in Greek). Thymine was named because it was isolated from nucleic acids of the thymus of a calf. Cytosine — the last of the four DNA bases to be characterized — was also discovered from hydrolysis of calf thymus. Its name comes from the original German ('cytosin') and simply refers to the Greek prefix for cell ('cyto').

While this last naming choice might seem a little dull, all bioinformaticians owe a huge debt of thanks to Albrecht Kossel. Thankfully, he ensured that all DNA bases have names that start with different letters. This greatly facilitates their representation in silico. Imagine if he had — in a fit of vanity — instead chosen to name these last two bases that he characterized after himself and his daughter Gertrude. If that had been the case then maybe we would today be talking about the bases adenine, albrechtine, guanine, and gertrudine. Not an insurmountable problem to represent with single characters — we already deal with the minor headache of representing purines and pyrimidines differently (using R and Y respectively) — but frankly, it would be a royal pain in the ass.
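
Those unique initial letters are exactly what makes sequences trivial to handle in code. A minimal sketch in Python: the R/Y ambiguity codes are from the standard IUPAC nucleotide alphabet, and the function itself is purely illustrative.

```python
# The four DNA bases: thanks to Kossel's naming, each starts with a
# unique letter, so one character per base is unambiguous.
BASES = {"A": "adenine", "C": "cytosine", "G": "guanine", "T": "thymine"}

# IUPAC ambiguity codes for when one character must stand for several bases
AMBIGUITY = {
    "R": {"A", "G"},  # puRines
    "Y": {"C", "T"},  # pYrimidines
}

def is_valid_dna(seq):
    """Return True if every character is a base or an R/Y ambiguity code."""
    allowed = set(BASES) | set(AMBIGUITY)
    return all(ch in allowed for ch in seq.upper())

print(is_valid_dna("GATCRY"))   # True
print(is_valid_dna("GATTAQA"))  # False: Q is not a valid code
```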

Thank you Albrecht. The world of bioinformatics is in your debt.

The growth of bioinformatics papers that mention 'big data'

I very much enjoyed Stephen Turner's recent blog post There is no Such Thing as Biomedical "Big Data" and I agree with his central point that a lot of the talk about 'big data' is not really what others would consider 'big'. Out of curiosity, I had a quick dive into Google Scholar to see just how popular this particular cliche is becoming. My search term was "big data" biology|genomics|bioinformatics.

Growth of bioinformatics papers on Google Scholar that mention "big data".

Clearly, this term is on the rise and might become as much of an annoyance as another phrase I loathe: 'next generation sequencing', a phrase that has been used to describe everything from 25 bp reads from early Solexa technology (circa 2005) to PacBio subreads that can exceed 25,000 bp.

As more people use N50 as a metric, fewer genomes seem to be 'completed'

If you search Google Scholar for the term genome contig|scaffold|sequence +"N50 size|length" and then filter by year, you can see that papers which mention N50 length have increased dramatically in recent years:

Google Scholar results for papers that mention N50 length. 2000–2013.

I'm sure that my search term doesn't capture all mentions of N50, and it probably includes a few false positives as well. N50 doesn't appear to be mentioned at all before 2001, and I think that the 2001 Nature human genome paper may have been the first publication to use this metric.
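
For anyone new to the metric: N50 is the length of the shortest sequence in the smallest set of longest contigs (or scaffolds) that together contain at least half of the total assembly length. A minimal sketch in Python, not any particular tool's implementation:

```python
def n50(lengths):
    """Return the N50 of a collection of contig/scaffold lengths.

    Sort lengths from longest to shortest and accumulate until the
    running total reaches half the assembly size; the length that
    crosses the halfway point is the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Total here is 280 bp; the 100 bp and 80 bp contigs together hold
# 180 bp (more than half of 280), so the N50 is 80.
print(n50([100, 80, 50, 30, 20]))  # 80
```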

Obviously, part of this growth simply reflects the fact that more people are sequencing genomes (or at least writing about sequenced genomes), and therefore feel the need to include some form of genome assembly metric. A Google Scholar search term for "genome sequence|assembly" shows another pattern of growth, but this time with a notable spurt in 2013:

Google Scholar results for papers that mention genome sequences or assemblies. 2000–2013.

Okay, so more and more people are sequencing genomes. This is good news, but only if those genomes are actually usable. This led me to my next query: how many people refer to their published genome sequence as complete? That is, I searched Google Scholar for "complete|completed genome sequence|assembly". Again, this is not a perfect search term, and I'm sure it will miss some descriptions of what people consider to be complete genomes. At the same time, it probably filters out all of the 'draft genomes' that have been published. The results are a little depressing:

Google Scholar results for papers that mention genome sequences or assemblies vs those that make mention of 'completed' genome sequences or assemblies. 2000–2013.

So although there were nearly 90,000 publications last year that mentioned a genome sequence (or assembly), only around 7,500 papers mentioned the C-word. This is a little easier to visualize if you show the number of 'completed' genome publications as a percentage of the number of publications that mention 'genome sequence' (irrespective of completion status):

Numbers of publications that mention 'completed' genomes as percentage of those that mention genomes. 2000–2013.

Maybe journal reviewers are more stringent about not letting people use the word 'completed' if the genome isn't really complete (which, depending on your definition of 'complete', may describe most genomes)? Or maybe people are just happier these days to sequence something, throw it through an assembler, and then publish it, regardless of how incomplete it is?

2013 ended with a bumper crop of new JABBA awards for bogus bioinformatics acronyms

I have a huge backlog of JABBA awards to hand out. These are all collated from the annual publication of the (voluminous) 2013 Nucleic Acids Research Database Issue. So without further ado, here are the recipients of the latest batch of JABBA awards:

Honorable mention:

The three ages of CEGMA: thoughts on the slow burning popularity of some bioinformatics tools

The past

CEGMA is a bioinformatics tool that was originally developed to annotate a small set of genes in novel genome sequences that lack any annotation. The logic is that if you can annotate at least a small number of genes and have some confidence about their gene structure, you can then use them as a training set for an ab initio gene finder to go on and annotate the rest of the genome.

This tool was developed in 2005 and it took rejections from two different journals before the paper was finally published in 2007. We soon realized that the set of highly conserved eukaryotic genes that CEGMA used could also be adapted to assess the completeness of genome assemblies. Strictly speaking, we can use CEGMA to assess the completeness of the 'gene space' of genomes. Another publication followed in 2009, but CEGMA still didn't gain much attention.

CEGMA was then used as one of the assessment tools in the 2011 Assemblathon competition, and then again for the Assemblathon 2 contest. It's possible that these publications led to an explosion in the popularity of CEGMA. Alternatively, it may have become more popular just because more and more people have started to sequence genomes, and there is a growing need for tools to assess whether genome assemblies are any good.

The following graph shows the increase in citations to our two CEGMA papers since they were first published. I think it is unusual to see this sort of delayed growth in citations to a paper. The citations accrued so far in 2014 suggest that this year's count will be double that of 2013.

Citations per year to the two CEGMA papers.

The present

All of this is both good and bad. It is always good to see a bioinformatics tool actively being used, and it is always nice to have your papers cited. However, it's bad because the principal developer left our group many years ago and I have been left to support CEGMA without sufficient time or resources to do so. I will be honest and say that it can be a real pain even to get CEGMA installed (especially on some flavors of Linux). You need to separately install NCBI BLAST+, HMMER, geneid, and genewise, and you can't just use any version of these tools either. These installation problems led me to recently set up a way for people to submit CEGMA jobs to us, which I then run locally on their behalf.
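
Since the installation pain is mostly in the external dependencies, a pre-flight check can save a lot of grief. A hedged sketch: the executable names below are my assumptions about what each dependency installs, so check what your versions actually provide.

```python
import shutil

# Executable names are assumptions about what each dependency installs
# (NCBI BLAST+, HMMER, geneid, genewise); adjust to match your system.
REQUIRED_TOOLS = ["blastn", "hmmsearch", "geneid", "genewise"]

def missing_tools(tools):
    """Return the subset of required executables not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Install these before running CEGMA:", ", ".join(missing))
else:
    print("All required tools found.")
```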

These submission requests made me realize that many people are using CEGMA to assess the quality of transcriptomes as well as genomes. This is not something we ever advocated, but it seems to work. These submissions have also let me take a look at whether CEGMA is performing as expected with respect to the N50 lengths of the genomes/transcriptomes being assessed (I can't use NG50 which I would prefer, because I don't know the expected size for these genomes).

CEGMA results plotted against the N50 lengths of submitted genomes and transcriptomes.

Generally speaking, the longer the sequences in your assembly, the greater the chance that they contain some of the 248 core genes that CEGMA searches for. This is not exactly rocket science, but I still find it surprising — not to mention worrying — that there are a lot of extremely low-quality genome assemblies out there, which might not be useful for anything.

The future

We are currently preparing a grant that would hopefully allow us to give CEGMA a much needed overhaul. Currently we have insufficient resources to really support the current version of CEGMA, but we have many good ideas for how we could improve it. Most notably, we would want to make new sets of core genes based on modern resources such as the eggNOG database. The core genes that CEGMA uses were determined from an analysis of the KOGs database which is now over a decade old! A lot has changed in the world of genomics since then.

The problem that arises when Google Scholar indexes papers published to pre-print servers

The Assemblathon 2 paper, on which I was lead author, was ultimately published in the online journal Gigascience. However, like an increasing number of papers, it was first released to the arXiv.org pre-print server.

If you use the very useful Google Scholar service and have published a paper that appears in two places, then you may have run into the same problem that I have: namely, Google Scholar appears to only track citations to the first place where the paper was published.

It should be said that it is great that Google tracks citations to these pre-print articles at all (though see another post of mine that illustrates just how powerful, and somewhat creepy, Google Scholar's indexing can be). However, most people would expect that when a paper is formally published, Google Scholar should track citations to that version as well, preferably separately from the pre-print version of the article.

For a long time with the Assemblathon 2 paper, Google Scholar only seemed to show citations to the pre-print version of the paper, even when I knew that others were citing the Gigascience version. So I contacted Google about this, and after a bit of a wait, I heard back from them:

Hi Keith,

It still get indexed though the information is not yet shown:

http://scholar.google.com/scholar?q=http%3A%2F%2Fwww.gigasciencejournal.com%2Fcontent%2F2%2F1%2F10+&btnG=

If one version (the arXiv one in this case) was discovered before the last major index update, the information for the other versions found after the major update would not appear before the next major update.

Their answer still raises some issues, and I'm waiting to hear back on my follow-up question: how often does the index get updated? Checking Google Scholar today, it initially appears as if they are still only tracking the pre-print version of our paper:

Screenshot of the Google Scholar entry for the Assemblathon 2 paper (taken 2014-01-27).

However, on closer inspection I see that 9 of the 10 most recent citations cite the Gigascience version of the paper. So, in conclusion:

  1. Google Scholar will start to track formal versions of a publication even after the paper was first published on a pre-print server.
  2. Somewhat annoyingly, they do not separate out the citations and so one Google Scholar entry ends up tracking two versions of a paper.
  3. The Google Scholar entry that is tracking the combined citations only lists the pre-print server in the 'Journal' name field; you have to check individual citations to see if they are citing the formal version of the publication.
  4. Google Scholar has a 'major' indexing cycle and you may have to wait for the latest version of the index to be updated before you see any changes.

JABBA, ORCA, and more bad bioinformatics acronyms

JABBA awards — Just Another Bogus Bioinformatics Acronym — are my attempt to poke a little bit of fun at the crazy (and often nonsensical) acronyms and initialisms that are sometimes used in bioinformatics and genomics. When I first started handing out these awards in June 2013, I didn't realize that I was not alone in drawing attention to these unusual epithets.

http://orcacronyms.blogspot.com

http://orcacronyms.blogspot.com

ORCA is the Organization for Really Contrived Acronyms, a fun blog set up by an old colleague of mine, Richard Edwards. ORCA sets out to highlight strange acronyms across many different disciplines, whereas my JABBA awards focus on bioinformatics. Occasionally there is some overlap, and so I will point you to the latest ORCA post, which details a particularly strange initialism for a bioinformatics database:

ADAN - prediction of protein-protein interAction of moDular domAiNs

Be sure to read Richard's thoughts on this name, as well as checking out some of the other great ORCA posts, including one of my favorites (GARFIELD).

ACGT: a new home for my science-related blog posts

Over the last year I've increasingly found myself blogging about science — and about genomics and bioinformatics in particular — on my main website (keithbradnam.com). This has led to a very disjointed blog portfolio: posts about my disdain for contrived bioinformatics acronyms would sit alongside pictures of my bacon extravaganza.

No longer will this be the case. ACGT will be the new home for all of my scientific contemplations. So what is ACGT all about? Maybe you are wondering Are Completed Genomes True? Or maybe you are just on the lookout to see someone Assessing Computational Genomics Tools. If so, then ACGT may be a home for such things (as well as Arbitrary, Contrived, Genome Tittle-Tattle perhaps).

I've imported all of the relevant posts from my main blog (I'll leave the originals in place for now), and hopefully all of the links work. Please let me know if this is not the case. Now that I have a new home for my scientific musings — particularly those relating to bioinformatics — I hope this will encourage me to write more. See you around!

Keith Bradnam

Paper review: anybody who works in bioinformatics and/or genomics should read this paper!

I rarely blog about specific papers but felt moved to write about a new paper by Jonathan Mudge, Adam Frankish, and Jennifer Harrow who work in the Vertebrate Annotation group at the Wellcome Trust Sanger Institute.

Their paper, now out in Genome Research, is titled: Functional transcriptomics in the post-ENCODE era.

They brilliantly, and comprehensively, list the various ways in which gene architecture — and by extension gene annotation — is incredibly complex and far from a solved problem. However, they also provide an exhaustive description of all the various experimental technologies that are starting to shine a lot more light on this, at times, dimly lit field of genomics.

In their summary, they state:

Modern genomics (and indeed medicine) demands to understand the entirety of the genome and transcriptome right now

I'd go so far as to say that many people in genomics assume that genomes and transcriptomes are already understood. I often feel that too many people enter this field with the false belief that many genomes are complete and that we know about all of the genes in these genomes. Jonathan Mudge et al. start their paper by firmly pointing out that even the simple question of 'what is a gene?' is something we are far from certain about.

Reading this paper, I was impressed by how comprehensively they have reviewed the relevant literature, pulling in numerous examples that indicate just how complex genes are, and which show that we need to move away from the very protein-centric world view that has dominated much of the history of this field.

LncRNAs, microRNAs, and piwi-interacting RNAs are three categories of RNA that you probably wouldn't find mentioned anywhere in textbooks from a decade ago, but which now — along with 'traditional' non-coding RNAs such as rRNAs, tRNAs, snoRNAs, etc. — probably outnumber protein-coding genes in the human genome. Many parts of this paper tackle the issue of transcriptional complexity, particularly trying to address the all-important question: how much of this is functional?

I found that so many parts of this paper touched on previous, current, and possible future projects in our lab. Producing an accurate catalog of genes, understanding alternative splicing, examining the relationship between mRNA and protein abundances, looking for conservation of signals between species...these are all topics that are near and dear to people in our lab.

Even if you have no interest in the importance of gene annotation — and shame on you if that is how you feel — this paper also serves as a fantastic catalog of the latest experimental techniques that can be used to capture and study genes (e.g. CAGE, ribosome profiling, polyA-seq, etc.).

If you have ever worked with a set of genes from a well curated organism, spare a thought for the huge amount of work that goes into trying to provide those annotations and keep them up to date. I'll leave you with the last couple of sentences from the paper...please repeat this every morning as your new mantra:

Finally, no one knows what proportion of the transcriptome is functional at the present time; therefore, the appropriate scientific position to take is to be open-minded. We thus do not claim that the annotation of the human genome is close to completion. If anything, it seems as if the hard work is just beginning.

More JABBA awards for inventive bioinformatics acronyms

A quick set of new JABBA award recipients. Once again these are drawn from the journal Bioinformatics.

  1. NetWeAvers: an R package for integrative biological network analysis with mass spectrometry data - the mixed capitalization of this software tool's name is a little uneasy on the eye. But more importantly, a Google search for 'netweavers' returns lots of links about something else entirely: NetWeavers (and NetWeaving) is already a recognized term in another field.
  2. GIM3E: condition-specific models of cellular metabolism developed from metabolomics and expression data - the 3 in this algorithm's name is deliberately written in superscript by the authors. This implies 'cubed', but it really refers to three 'M'-related words, because the full name of the algorithm is 'Gene Inactivation Moderated by Metabolism, Metabolomics and Expression'. GIM3E is not particularly easy to say quickly, though it is much more Google-friendly than NetWeavers.
  3. INSECT: IN-silico SEarch for Co-occurring Transcription factors - turning an acronym into the name of a plant or animal is quite common in bioinformatics. A couple of examples are worth mentioning: there is the MOUSE resource (Mitochondria and Other Useful SEquences) and also something called HAMSTeRS (the Haemophilia A Mutation, Structure, Test and Resource Site). The main problem with acronyms like these is that they can be too hard to find using online search tools (try Googling for hamster resources). A secondary issue is that the name just doesn't connect to what the resource/database/algorithm is about: the INSECT database contains information about 14 different species, only one of which is an insect.

I'll no doubt be posting again the next time I come across some more dubious acronyms.