Next-generation sequencing must die!

I hate the phrase next-generation sequencing (NGS) with a passion. Here's why...

The first published use of this phrase (that I can find) is from an article in Drug Discovery Today: Technologies by Thomas Jarvie in 2005. This paper had the succinct title Next generation sequencing technologies, and while this may represent the first time this phrase made it into print, it certainly wouldn't be the last.

Illumina sequencing may be the most obvious technology that springs to mind when people think of NGS, but there is also Pyrosequencing (developed circa 1996), Massively Parallel Signature Sequencing (circa 2000), ABI SOLiD sequencing (circa 2008), and Ion semiconductor sequencing (circa 2010).

Of course, we also have single-molecule real-time sequencing from Pacific Biosciences. They were founded in 2004 but didn't launch their PacBio RS machines until 2010. The current darling of the sequencing world is nanopore technology, which has been in development since the mid-1990s.

So do we refer to this entire period (from development to finished technology) as the NGS era? If so, then NGS technologies have already been around for almost 20 years. It doesn't strike me as particularly helpful to keep on labeling all of these different technologies with the same name.

Some people have tried to make things clearer by introducing yet more levels of obfuscation. This has led to some of these technologies being variously referred to as second-generation, third-generation, fourth-generation, next-next-generation, and even next-next-next-generation sequencing. Of course, these definitions are all subjective, and one person's third-generation technology is another person's fourth-generation technology.

Other alternatives to NGS such as high-throughput sequencing or long-read sequencing are equally useless because 'high' and 'long' are both relative terms. The output from a high-throughput sequencing platform of 2008 might seem like 'low-throughput' today. The weakness of length-based descriptions is the reason why the 'Short Read Archive' was (thankfully) reborn as the Sequence Read Archive.

So here is my proposed three-step solution to rid the world of this madness:

  1. Don't use any of these terms, ever again.
  2. Just refer to a technology by a name that describes the methodology (e.g. sequencing-by-synthesis) or by the name of a company that has developed a specific product (e.g. Oxford Nanopore).
  3. You could even just use the term 'current sequencing technologies'...as long as your paper/talk/blog/book has a date associated with it, then I'm confident people will know what you mean.

Update 12th March: I have added a follow-up post to this one.

Some more JABBA awards to highlight bad bioinformatics acronyms

More JABBA awards for bogus bioinformatics acronyms. These all come from a recent issue of Bioinformatics:

Honorable mention:

Celebrating an unsung hero of genomics: how Albrecht Kossel saved bioinformatics from a world of hurt

The following image is from one of the first publications to ever depict a DNA sequence in a textual manner:

From Wu & Kaiser, Journal of Molecular Biology, 1968.

This is taken from the 1968 paper by Wu and Kaiser: Structure and base sequence in the cohesive ends of bacteriophage lambda DNA. In the almost half-century since this publication, it has become the norm to represent the sequence of any DNA molecule as a string of characters, one character per base. But have you ever stopped to consider why these bases have the names that they do?

Most of the work to isolate and describe the purines and pyrimidines that comprise nucleic acids came from the German biochemist Albrecht Kossel. Between 1885 and 1901 he characterized the principal nucleobases of the nucleic acids: adenine, cytosine, guanine, and thymine (though guanine had first been isolated in 1844 by Heinrich Gustav Magnus). The fifth nucleobase (uracil) was discovered by Alberto Ascoli, a student of Kossel.

Kossel's work would later be recognized with the 1910 Nobel Prize in Physiology or Medicine. It should be noted that Kossel didn't just help isolate and describe these bases; he was also chiefly responsible for most of their names.

Albrecht Kossel. Image from Wikimedia.

Guanine had already been named after the substance in which it was first discovered (the excrement of seabirds known as guano). Adenine was so named by Kossel because it was isolated from the pancreas gland ('adenas' in Greek). Thymine was named because it was isolated from nucleic acids of the thymus of a calf. Cytosine, the last of the four DNA bases to be characterized, was also discovered from hydrolysis of calf thymus. Its name comes from the original German name ('cytosin') and simply refers to the Greek prefix for cell ('cyto').

While this last naming choice might seem a little dull, all bioinformaticians owe a huge debt of thanks to Albrecht Kossel. Thankfully, he ensured that all DNA bases have names that start with different letters. This greatly facilitates their representation in silico. Imagine if he had — in a fit of vanity — instead chosen to name these last two bases that he characterized after himself and his daughter Gertrude. If that had been the case then maybe we would today be talking about the bases adenine, albrechtine, guanine, and gertrudine. Not an insurmountable problem to represent with single characters — we already deal with the minor headache of representing purines and pyrimidines differently (using R and Y respectively) — but frankly, it would be a royal pain in the ass.

Thank you Albrecht. The world of bioinformatics is in your debt.

The growth of bioinformatics papers that mention 'big data'

I very much enjoyed Stephen Turner's recent blog post There is no Such Thing as Biomedical "Big Data" and I agree with his central point that a lot of the talk about 'big data' is not really what others would consider 'big'. Out of curiosity, I had a quick dive into Google Scholar to see just how popular this particular cliche is becoming. My search term was "big data" biology|genomics|bioinformatics.

Growth of bioinformatics papers on Google Scholar that mention "big data".

Clearly, this term is on the rise and might become as much of an annoyance as another phrase I loathe: next-generation sequencing, a phrase that has been used to describe everything from 25 bp reads from early Solexa technology (circa 2005) to PacBio subreads that can exceed 25,000 bp.

As more people use N50 as a metric, fewer genomes seem to be 'completed'

If you search Google Scholar for the term genome contig|scaffold|sequence +"N50 size|length" and then filter by year, you can see that papers which mention N50 length have increased dramatically in recent years:

Google Scholar results for papers that mention N50 length. 2000–2013.

I'm sure that my search term doesn't capture all mentions of N50, and it probably includes a few false positives as well. N50 doesn't appear to be mentioned at all before 2001, and I think that the 2001 Nature human genome paper may have been the first publication to use this metric.
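For readers who haven't met the metric: N50 is the length L such that contigs of length L or longer together account for at least half of the total assembly length. A minimal sketch of the usual calculation (the function name and example contig lengths are illustrative, not taken from any of the papers discussed here):

```python
def n50(lengths):
    """Return the N50 of a list of contig/scaffold lengths: the length L
    at which the running total of sorted lengths (longest first) reaches
    at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Five contigs totalling 400 bp; half is 200 bp, reached at the 80 bp contig.
contigs = [100, 90, 80, 70, 60]
print(n50(contigs))  # 80
```

Note that N50 says nothing about correctness: an assembler that aggressively joins sequences can inflate N50 while producing misassemblies, which is one reason N50 alone is a poor proxy for a 'completed' genome.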

Obviously, part of this growth simply reflects the fact that more people are sequencing genomes (or at least writing about sequenced genomes), and therefore feel the need to include some form of genome assembly metric. A Google Scholar search term for "genome sequence|assembly" shows another pattern of growth, but this time with a notable spurt in 2013:

Google Scholar results for papers that mention genome sequences or assemblies. 2000–2013.

Okay, so more and more people are sequencing genomes. This is good news, but only if those genomes are actually usable. This led me to my next query: how many people refer to their published genome sequence as complete? That is, I searched Google Scholar for "complete|completed genome sequence|assembly". Again, this is not a perfect search term, and I'm sure it will miss some descriptions of what people consider to be complete genomes. But at the same time it probably filters out all of the 'draft genomes' that have been published. The results are a little depressing:

Google Scholar results for papers that mention genome sequences or assemblies vs those that mention 'completed' genome sequences or assemblies. 2000–2013.

So although there were nearly 90,000 publications last year that mentioned a genome sequence (or assembly), only about 7,500 papers mentioned the C-word. This is a little easier to visualize if you plot the number of 'completed' genome publications as a percentage of the number of publications that mention 'genome sequence' (irrespective of completion status):

Numbers of publications that mention 'completed' genomes as a percentage of those that mention genomes. 2000–2013.

Maybe journal reviewers are more stringent about not letting people use the word 'completed' if the genome isn't really complete (which, depending on your definition of 'complete', may describe most genomes)? Or maybe people are just happier these days to sequence something, run it through an assembler, and publish it, regardless of how incomplete it is?

2013 ended with a bumper crop of new JABBA awards for bogus bioinformatics acronyms

I have a huge backlog of JABBA awards to hand out. These are all collated from the annual publication of the (voluminous) 2013 Nucleic Acids Research Database Issue. So without further ado, here are the recipients of the latest batch of JABBA awards:

Honorable mention:

The three ages of CEGMA: thoughts on the slow burning popularity of some bioinformatics tools

The past

CEGMA is a bioinformatics tool that was originally developed to annotate a small set of genes in novel genome sequences that lack any annotation. The logic being that if you can at least annotate a small number of genes and have some confidence about their gene structure, you can then use them as a training set for an ab initio gene finder to go on and annotate the rest of the genome.

This tool was developed in 2005 and it took rejections from two different journals before the paper was finally published in 2007. We soon realized that the set of highly conserved eukaryotic genes that CEGMA used could also be adapted to assess the completeness of genome assemblies. Strictly speaking, we can use CEGMA to assess the completeness of the 'gene space' of genomes. Another publication followed in 2009, but CEGMA still didn't gain much attention.

CEGMA was then used as one of the assessment tools in the 2011 Assemblathon competition, and then again for the Assemblathon 2 contest. It's possible that these publications led to an explosion in the popularity of CEGMA. Alternatively, it may have become more popular just because more and more people have started to sequence genomes, and there is a growing need for tools to assess whether genome assemblies are any good.

The following graph shows the increase in citations to our two CEGMA papers since they were first published. I think it is unusual to see this sort of delayed growth in citations to a paper. The current citations from 2014 suggest that this year will see the citation count double compared to 2013.

Citations per year to the two CEGMA papers since publication.

The present

All of this is both good and bad. It is always good to see a bioinformatics tool actively being used, and it is always nice to have your papers cited. However, it's bad because the principal developer left our group many years ago and I have been left to support CEGMA without sufficient time or resources to do so. I will be honest and say that it can be a real pain to even get CEGMA installed (especially on some flavors of Linux). You need to separately install NCBI BLAST+, HMMER, geneid, and genewise, and you can't just use any version of these tools either. These installation problems recently led me to make it easier for people to submit jobs to us, so that I can run CEGMA locally on their behalf.
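As a rough illustration of the dependency problem, a pre-flight check of the PATH can save a failed install. The executable names below are assumptions on my part; the exact binaries (and acceptable versions) vary with the releases of BLAST+, HMMER, geneid, and genewise that a given CEGMA version expects, so check the documentation for your version:

```python
import shutil

def missing_tools(tools):
    """Return the subset of executable names not found on the PATH."""
    return [t for t in tools if shutil.which(t) is None]

# Hypothetical executable names for CEGMA's external dependencies;
# substitute the binaries your CEGMA version actually requires.
required = ["tblastn", "makeblastdb", "hmmsearch", "geneid", "genewise"]
print(missing_tools(required))
```

A check like this only confirms that the tools are present, not that they are the right versions, which is where most of the real installation pain lies.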

These submission requests made me realize that many people are using CEGMA to assess the quality of transcriptomes as well as genomes. This is not something we ever advocated, but it seems to work. These submissions have also let me take a look at whether CEGMA is performing as expected with respect to the N50 lengths of the genomes/transcriptomes being assessed (I can't use NG50 which I would prefer, because I don't know the expected size for these genomes).

CEGMA results versus N50 length of submitted genome/transcriptome assemblies.

Generally speaking, if your genome contains longer sequences, then you have more chance of those sequences containing some of the 248 core genes that CEGMA searches for. This is not exactly rocket science, but I still find it surprising — not to mention worrying — that there are a lot of extremely low quality genome assemblies out there, which might not be useful for anything.

The future

We are currently preparing a grant that would hopefully allow us to give CEGMA a much needed overhaul. Currently we have insufficient resources to really support the current version of CEGMA, but we have many good ideas for how we could improve it. Most notably, we would want to make new sets of core genes based on modern resources such as the eggNOG database. The core genes that CEGMA uses were determined from an analysis of the KOGs database which is now over a decade old! A lot has changed in the world of genomics since then.

The problem that arises when Google Scholar indexes papers published to pre-print servers

The Assemblathon 2 paper, on which I was lead author, was ultimately published with the online journal Gigascience. However, like an increasing number of papers, it was first released to the arXiv.org pre-print server.

If you are a user of the very useful Google Scholar service and you have also published a paper such that it appears in two places, then you may have run into the same problems that I have. Namely, Google Scholar appears to only track citations to the first place where the paper was published.

It should be said that it is great that Google tracks citations to these pre-print articles at all (though see another post of mine that illustrates just how powerful, and somewhat creepy, Google Scholar's indexing can be). However, most people would expect that, when a paper is formally published, Google Scholar should track citations to that version as well, preferably separately from the pre-print version of the article.

For a long time with the Assemblathon 2 paper, Google Scholar only seemed to show citations to the pre-print version of the paper, even when I knew that others were citing the Gigascience version. So I contacted Google about this, and after a bit of a wait, I heard back from them:

Hi Keith,

It still get indexed though the information is not yet shown:

http://scholar.google.com/scholar?q=http%3A%2F%2Fwww.gigasciencejournal.com%2Fcontent%2F2%2F1%2F10+&btnG=

If one version (the arXiv one in this case) was discovered before the last major index update, the information for the other versions found after the major update would not appear before the next major update.

Their answer still raises some issues, and I'm waiting to hear back on my follow-up question: how often does the index get updated? Checking Google Scholar today, it initially appears as if they are still only tracking the pre-print version of our paper:

Google Scholar entry for the Assemblathon 2 paper, 27th January 2014.

However, on closer inspection I see that 9 of the 10 most recent citations are citing the Gigascience version of the paper. So in conclusion:

  1. Google Scholar will start to track formal versions of a publication even after the paper was first published on a pre-print server.
  2. Somewhat annoyingly, they do not separate out the citations and so one Google Scholar entry ends up tracking two versions of a paper.
  3. The Google Scholar entry that is tracking the combined citations only lists the pre-print server in the 'Journal' name field; you have to check individual citations to see if they are citing the formal version of the publication.
  4. Google Scholar has a 'major' indexing cycle and you may have to wait for the latest version of the index to be updated before you see any changes.

JABBA, ORCA, and more bad bioinformatics acronyms

JABBA awards — Just Another Bogus Bioinformatics Acronym — are my attempt to poke a little bit of fun at the crazy (and often nonsensical) acronyms and initialisms that are sometimes used in bioinformatics and genomics. When I first started handing out these awards in June 2013, I didn't realize that I was not alone in drawing attention to these unusual epithets.

http://orcacronyms.blogspot.com

ORCA is the Organization for Really Contrived Acronyms, a fun blog set up by an old colleague of mine, Richard Edwards. ORCA sets out to highlight strange acronyms across many different disciplines, whereas my JABBA awards focus on bioinformatics. Occasionally there is some overlap, so I will point you to the latest ORCA post, which details a particularly strange initialism for a bioinformatics database:

ADAN - prediction of protein-protein interAction of moDular domAiNs

Be sure to read Richard's thoughts on this name, as well as checking out some of the other great ORCA posts, including one of my favorites (GARFIELD).

ACGT: a new home for my science-related blog posts

Over the last year I've increasingly found myself blogging about science, and about genomics and bioinformatics in particular, on my main website (keithbradnam.com). This has led to a very disjointed blog portfolio: posts about my disdain for contrived bioinformatics acronyms sat alongside pictures of my bacon extravaganza.

No longer will this be the case. ACGT will be the new home for all of my scientific contemplations. So what is ACGT all about? Maybe you are wondering Are Completed Genomes True? Or maybe you are just on the lookout for someone Assessing Computational Genomics Tools. If so, then ACGT may be a home for such things (as well as Arbitrary, Contrived, Genome Tittle-Tattle, perhaps).

I've imported all of the relevant posts from my main blog (I'll leave the originals in place for now), and hopefully all of the links work. Please let me know if this is not the case. Now that I have a new home for my scientific musings, particularly those relating to bioinformatics, I hope this will encourage me to write more. See you around!

Keith Bradnam