Tales of drafty genomes: part 3 – all genomes are complete…except for those that aren't

This is the third post in an infrequent series that looks at the world of unfinished genomes.

One of the many, many resources at the NCBI is their Genome database. Here's how they describe themselves:

The Genome database contains sequence and map data from the whole genomes of over 1000 species or strains. The genomes represent both completely sequenced genomes and those with sequencing in-progress. All three main domains of life (bacteria, archaea, and eukaryota) are represented, as well as many viruses, phages, viroids, plasmids, and organelles.

This text could probably be updated because the size of the database is now wrong by an order of magnitude…there are currently 11,322 genomes represented in this database. But how many of them are 'completely sequenced' and how many are at the 'sequencing in-progress' stage?

Luckily, the NCBI classifies all genomes into one of four 'levels':

  • Complete
  • Chromosome
  • Scaffold
  • Contig

I couldn't find any definitions for these categories within the NCBI Genome database, but elsewhere on the NCBI website I found the following definitions for the latter three categories:

Chromosome - there is sequence for one or more chromosomes. This could be a completely sequenced chromosome (gapless) or a chromosome containing scaffolds with unlinked gaps between them.

Scaffold - some sequence contigs have been connected across gaps to create scaffolds, but the scaffolds are all unplaced or unlocalized.

Contig - nothing is assembled beyond the level of sequence contigs

So considering just the 2,032 Eukaryotic species in the NCBI Genome Database, we can ask…how many of them are complete?

Completion status of 2,032 eukaryotic genomes, as classified by NCBI

The somewhat depressing answer is that only a meagre 24 eukaryotic genomes are listed as complete, about 1% of the total. Even if we include genomes with chromosome sequences, we are still only talking about 13% of all genomes. You might imagine that the state of completion would be markedly better when looking at prokaryotes. However, only 11.5% of the 31,696 prokaryotic genomes are classified as complete.

In the last post in this series, I included a dictionary definition of the word 'draft'. This time, let's look to see how Merriam-Webster defines 'complete':

having all necessary parts : not lacking anything

not limited in any way

not requiring more work : entirely done or completed

By this definition, I think we could all agree that very few genomes are actually complete.

Tales of drafty genomes: part 2 — when draft genomes took over the world

This is the second post in an infrequent series that looks at draft genomes.

At the time of writing, Google has indexed almost 400,000 pages that include a mention of the phrase draft genome. Prior to the year 2000, there are zero mentions of this phrase in the tech giant’s search index.

The phrase ‘draft genome’ came to prominence with the publication of the ‘working draft’ version of the human genome[1]. But referring to published genomes as anything other than ‘complete’ was still atypical at this time. This can be seen if you search Google Scholar for papers that include in their titles either the phrase draft genome sequence or complete genome sequence. When you look at how these results change over time, an interesting pattern emerges:

Number of papers indexed by Google Scholar that include the phrases 'Complete genome sequence' or 'Draft genome sequence' in their titles.

Around 2000–2003, there were a small number of papers mentioning draft genome sequences. These are nearly all related to the draft sequences of the human or rice genomes. Usage of the phrase (in paper titles) didn’t break double digits until 2011. ‘Draft genome sequence’ then became a much more widely used phrase in 2012, and by 2013 it had overtaken ‘complete genome sequence’.

I find this reveals something about the nature of sequencing and genome assembly. It almost feels like we are giving up our ambition to finish genomes (whatever ‘finished’ actually means) and are more willing to settle for something that is clearly incomplete.

A definition of ‘draft’ provided by Merriam-Webster is as follows:

A version of something (such as a document) that you make before you make the final version

In an ideal world, I would hope that all of these draft genomes would also end up being replaced by ‘final versions’. But I’m doubtful that many of these published sequences will be completed any time soon.

  1. See part 1 in this series for more details about the drafty nature of the human genome.  ↩

Tales of drafty genomes: part 1 — The Human Genome

One of my recent blog posts discussed this new paper in PLOS Computational Biology:

There has also been a lot of chatter on twitter about this paper. Here is just part of an exchange that I was involved in yesterday:

The issue of what is or isn’t a draft genome, and whether this even matters, is something on which I have much to say. It’s worth mentioning that there are a lot of draft genomes out there: Google Scholar reports that there are 1,440 articles that mention the phrase ‘draft genome’ in their title [1]. In the first part of this series, I’ll take a look at one of the most well-studied genome sequences in existence…the human genome.

The most famous example of a draft genome is probably the ‘working draft’ of the human genome that was announced, with much fanfare, in July 2000 [2]. At this time, the assembly was reported as consisting of “overlapping fragments covering 97 percent of the human genome”. By the time the working draft was formally published in Nature in January 2001, the assembly was reported as covering “about 94% of the human genome” (incidentally, this Nature paper seems to be the first published use of the N50 statistic).
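Since the N50 statistic comes up here, a minimal sketch of how it is usually computed may be useful: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy lengths are my own illustration, not anything taken from the Nature paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases as we go.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy lengths (hypothetical): total is 290, so we need 145 bases.
# 80 + 70 = 150 reaches that, so the N50 is 70.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # 70
```

Note that N50 says nothing about how much of the genome is missing; an assembly can have a large N50 and still be very drafty.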

On April 14, 2003 the National Human Genome Research Institute and the Department of Energy announced the “successful completion of the Human Genome Project” (emphasis mine). This was followed by the October 2004 Nature paper that discussed the ongoing work in finishing the euchromatic portion of the human genome[3]. Now, the genome was being referred to as ‘near-complete’ and if you focus on the euchromatic portion, it was indeed about 99% complete. However, if you look at the genome as a whole, it was still only 93.5% complete [4].

Of course, the work to correctly sequence, assemble, and annotate the human genome has never stopped, and probably will not stop for some time yet. As of October 14, 2014, the latest version of the human genome reference sequence is GRCh38.p1[5], lovingly maintained by the Genome Reference Consortium (GRC). The size of the human genome has increased just a little bit compared to the earlier publications from a decade ago[6], but there are still several things that we don’t know about this ‘complete/near-complete/finished’ genome. Unknown bases still account for 5% of the total size (that’s over 150 million bp). Furthermore, there are almost 11 million bp of unplaced scaffolds that are still waiting to be given a (chromosomal) home. Finally, there remain 875 gaps in the genome (526 spanned gaps and 349 unspanned gaps[7]).
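For anyone who wants to check this kind of figure on an assembly of their own, here is a minimal sketch that counts unknown bases and runs of unknown bases in a sequence string. It assumes that unknown bases are written as `N` and treats each uninterrupted run of `N` as one gap, which is only an approximation. The GRC's spanned/unspanned distinction depends on linkage evidence and cannot be recovered from the sequence alone. The function name and the toy sequence are hypothetical.

```python
import re


def summarize_unknown_bases(sequence):
    """Count unknown (N) bases and runs of N in a single sequence string."""
    seq = sequence.upper()
    # Each maximal run of N characters is treated as one gap (an assumption).
    n_runs = re.findall(r"N+", seq)
    unknown_bp = sum(len(run) for run in n_runs)
    return {
        "total_bp": len(seq),
        "unknown_bp": unknown_bp,
        "unknown_pct": 100 * unknown_bp / len(seq) if seq else 0.0,
        "gap_count": len(n_runs),
    }


# Toy sequence (hypothetical): 22 bp with two runs of N (4 bp and 2 bp).
toy = "ACGTNNNNACGTACGTNNACGT"
print(summarize_unknown_bases(toy))
```

On a real assembly you would run this per chromosome or scaffold and add up the results, rather than loading the whole genome into one string.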

If we leave aside other problematic issues in deciding what a reference genome actually is, and what it should contain[8], we can ask the simple question: is the current human genome a draft genome? Clearly I think everyone would say ‘no’. But what if I asked: is the current human genome complete? I’m curious how many people would say ‘yes’ and how many people would ask me to first define ‘complete’.

Here are some results for how many hits you get when Googling for the following phrases:

Scientists and journalists don’t help the situation by maybe being too eager to overhype the state of completion of the human genome[9]. In conclusion, the human genome is no longer a draft genome, but it is still just a little bit drafty. More on this topic of drafty genomes in part 2!

  1. There are 1,570 if you don’t require the words ‘draft’ and ‘genome’ to be together in the article title.  ↩

  2. The use of the ‘working draft’ as a phrase had been in use since at least late 1998.  ↩

  3. There is also the entire batch of chromosome-specific papers published between 2001 and 2006.  ↩

  4. This percentage is based on the following line in the paper: “The euchromatic genome is thus ~2.88 Gb and the overall human genome is ~3.08 Gb”  ↩

  5. This is the first patch update to version 38 of the reference sequence  ↩

  6. There are 3,212,670,709 bp in the latest assembly  ↩

  7. The GRC defines the two categories as follows:

    Spanned gaps are found within scaffolds and there is some evidence suggesting linkage between the two sequences flanking the gap. Unspanned gaps are found between scaffolds and there is no evidence of linkage.  ↩

  8. Remember, human genomes are diploid and not only vary between individuals but can also vary from cell-to-cell. The idea of a ‘reference’ sequence is therefore a nebulous one. How much known variation do you try to represent (the GRC represents many alternative loci)? How should a reference sequence represent things like ribosomal DNA arrays or other tandem repeats?  ↩

  9. Jonathan Eisen wrote a great blog post on this: Some history of hype regarding the human genome project and genomics  ↩

When is a genome complete...and does it even matter? Part 1: the 1% rule vs Sydney Brenner's CAP criteria

This will be the first in a new series of blog posts that discuss my thoughts on the utility of genomes at various stages of completion (both in terms of genome assembly and annotation). These posts will mostly be addressing issues that pertain to eukaryotic genomes...are there any other kind? ;-)

I often find myself torn between two conflicting viewpoints about the utility of unfinished genomes. First, let's look at the any-amount-of-sequence-is-better-than-no-sequence-at-all argument. This is clearly true in many cases. If you sequence only 1% of a genome, and if that 1% contains something you're interested in (gene, repeat, binding site, sequence variant etc), then you may well think that the sequencing effort was tremendously useful.

Indeed, one of my all-time favorite papers in science is an early bioinformatics analysis of gene sequences in GenBank. Published way back in 1980, this paper (Codon catalog usage and the genome hypothesis) studied "all published mRNA sequences of more than about 50 codons". Today, that would be a daunting exercise. Back then, the dataset comprised just 90 genes! Most of these were viral sequences, with just six vertebrate species represented (and only four sequences from human).

The abstract of this paper concluded:

Each gene in a genome tends to conform to its species' usage of the codon catalog; this is our genome hypothesis.

This mostly remains true today and the original work on this tiny dataset established a pattern that spawned an entire sub-discipline of genomics, that of codon-usage bias (now with over 7,000 publications). So clearly, you can do lots of great and useful science with only a tiny amount of genome sequence information. So what's the problem?
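To make the idea of 'usage of the codon catalog' concrete, here is a minimal sketch that tabulates codon frequencies for one coding sequence. It is my own illustration and not the method used in the 1980 paper. The function name and the toy sequence are hypothetical, and it assumes the input is already in frame, starting at the first base of a codon.

```python
from collections import Counter


def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    # Ignore any trailing bases that do not form a complete codon.
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy sequence (hypothetical): ATG GCT GCT GCC TAA
usage = codon_usage("ATGGCTGCTGCCTAA")
print(usage["GCT"])  # 0.4, i.e. two of the five codons
```

Comparing tables like this across many genes from the same species, and then across species, is the kind of analysis the genome hypothesis is about.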


Well, 1% of a genome may be better than 0%, and 2% is better than 1%, and so on. But I want 100% of a genome (yes, I'm greedy like that). However, I begrudgingly accept that generating a complete and accurate genome assembly (not to mention a complete set of gene annotations) currently falls into the nice-idea-kid-but-we-can't-all-be-dreamers category.

The danger in not getting to 100% completion is that there is a perception (by scientists as well as the general public) that these genomes are indeed all finished. This disconnect between the actual state of completion and the perceived state of completion can lead to reactions of the wait-a-minute-I-thought-this-was-meant-to-be-finished!?! variety. Indeed, it can be highly confusing when people go to download the genome of their species of interest, under the impression that the genome was 'finished' many years ago, only to find that they can't find what they're looking for.

Someone might be looking for their favorite gene annotation, but maybe this 'finished' genome hasn't actually been annotated. Or maybe it's been annotated by four different gene finders and left in a state where the user has to decide which ones to trust. Maybe the researcher is interested in chromosome evolution and is surprised to find that the genome doesn't consist of chromosome sequences, just scaffolds. Maybe they find that there are two completely different versions of the same genome, that were assembled by different groups. Or maybe they find that the download link provided by the paper no longer works and they can't even find the genome in question.

The great biologist Sydney Brenner has often spoken of the need to achieve CAP criteria in efforts such as genome sequencing. What are these criteria?

  • C - Complete: if you're going to do it, do a thorough job so that someone doesn't have to come along later to redo it.
  • A - Accurate: this is kind of obvious but there are so many published genomes out there that are far from accurate.
  • P - Permanent: do it once, and forever.

The last point is probably not something that is thought about as much as the first two criteria. It relates to where these genomes end up being stored and the file formats that people use. But it also applies to other subtle issues. For example, let's assume that research group 'X' has sequenced a genome to an impressive depth but that they made a terrible assembly. As long as their raw reads remain available, someone else can (in theory) attempt a better assembly, or attempt to remake the exact same assembly (science should be reproducible, right?).

However, reproducibility is not always easy in bioinformatics. Even if all of the methodologies are carefully documented, the software involved may no longer be available, or it may only run on an architecture that no longer exists. If you are attempting to make a better genome assembly, you could face issues if some critical piece of information was missing from the SRA Experiment metadata. A potentially more problematic situation would be if the metadata was incorrect in some way (e.g. a wrong insert size was listed).

In subsequent posts, I'll explore how different genomes hold up to these criteria. I will also suggest my own 'five levels of genome completeness' criteria (for genome sequences and annotations).