Improving genomes is an increasing trend

To check whether there really are more ‘improved’ sequences being described, I looked in Google Scholar to see how many papers feature the terms ‘complete genome|assembly’ vs ‘draft genome|assembly’ vs ‘improved genome|assembly’ (these Google Scholar links reveal the slightly more complex query that I used). In gathering data I went back to 1995 (the date of the first published genome sequence).

As always with Google Scholar, these are not perfect search terms, and they all pull in matches that are not strictly what I’m after, but the results do reveal an interesting picture:

Why improve a genome?

I wonder how much of this growth reflects the sad truth that many genomes published in the post-Sanger, pre-nanopore era (approximately 2005–2015) were just not very good. Many people rushed to adopt the powerful new sequencing technologies provided by Illumina and others, and many of the genomes published using those technologies are now being given makeovers with newer sequencing, scaffolding, and mapping technologies.

The updated pine genome (the last publication on the list above) says as much in its abstract (emphasis mine):

The 22-gigabase genome of loblolly pine (Pinus taeda) is one of the largest ever sequenced. The draft assembly published in 2014 was built entirely from short Illumina reads, with lengths ranging from 100 to 250 base pairs (bp). The assembly was quite fragmented, containing over 11 million contigs whose weighted average (N50) size was 8206 bp. To improve this result, we generated approximately 12-fold coverage in long reads using the Single Molecule Real Time sequencing technology developed at Pacific Biosciences. We assembled the long and short reads together using the MaSuRCA mega-reads assembly algorithm, which produced a substantially better assembly, P. taeda version 2.0. The new assembly has an N50 contig size of 25 361, more than three times as large as achieved in the original assembly, and an N50 scaffold size of 107 821, 61% larger than the previous assembly.
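For anyone unfamiliar with the N50 statistic quoted in that abstract, here is a minimal sketch of how it is usually computed: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy lengths are my own illustration, not anything taken from the pine paper or from MaSuRCA.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all bases in the assembly.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down and stop once half of the
    # total assembly length has been covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical toy assembly (total length 54, so the halfway point is 27).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))  # 10 + 9 + 8 = 27, so the N50 is 8
```

The same arithmetic explains the abstract's "more than three times" claim: 25 361 / 8 206 is roughly 3.09.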

Perhaps I’m being a bit harsh in saying that the first versions of many of these genomes that have been subsequently improved were not very good. The more important lesson to bear in mind is that, in reality, a genome is never finished and that all published sequences represent ‘works in progress’.

Note: this blog post was edited on 2026-05-30 to remove an embedded code block that no longer renders because storify.com no longer exists.

One of my recent blog posts discussed this new paper in PLOS Computational Biology:

Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies

There has also been a lot of chatter on Twitter about this paper.

The issue of what is or isn’t a draft genome (and whether this even matters) is something on which I have much to say. It’s worth mentioning that there are a lot of draft genomes out there: Google Scholar reports that there are 1,440 articles that mention the phrase ‘draft genome’ in their title [1]. In the first part of this series, I’ll take a look at one of the most well-studied genome sequences in existence…the human genome.

The most famous example of a draft genome is probably the ‘working draft’ of the human genome that was announced, with much fanfare, in July 2000 [2]. At this time, the assembly was reported as consisting of “overlapping fragments covering 97 percent of the human genome”. By the time the working draft was formally published in Nature in January 2001, the assembly was reported as covering “about 94% of the human genome” (incidentally, this Nature paper seems to be the first published use of the N50 statistic).

On April 14, 2003 the National Human Genome Research Institute and the Department of Energy announced the “successful completion of the Human Genome Project” (emphasis mine). This was followed by the October 2004 Nature paper that discussed the ongoing work in finishing the euchromatic portion of the human genome[3]. Now, the genome was being referred to as ‘near-complete’ and if you focus on the euchromatic portion, it was indeed about 99% complete. However, if you look at the genome as a whole, it was still only 93.5% complete [4].
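The 93.5% figure is just the ratio of the two genome sizes quoted in the 2004 paper (see the footnote). As a quick sanity check, and assuming nothing beyond those two numbers, the arithmetic looks like this (the variable names are mine):

```python
# Sizes quoted in the 2004 Nature paper, in gigabases.
euchromatic_gb = 2.88
whole_genome_gb = 3.08

# Share of the whole genome represented by the euchromatic portion.
percent_complete = round(100 * euchromatic_gb / whole_genome_gb, 1)
print(f"{percent_complete}% of the whole genome")  # 93.5% of the whole genome
```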

Of course, the work to correctly sequence, assemble, and annotate the human genome has never stopped, and probably will not stop for some time yet. As of October 14, 2014, the latest version of the human genome reference sequence is GRCh38.p1[5], lovingly maintained by the Genome Reference Consortium (GRC). The size of the human genome has increased just a little bit compared to the earlier publications from a decade ago[6], but there are still several things that we don’t know about this ‘complete/near-complete/finished’ genome. Unknown bases still account for 5% of the total size (that’s over 150 million bp). Furthermore, there are almost 11 million bp of unplaced scaffolds that are still waiting to be given a (chromosomal) home. Finally, there remain 875 gaps in the genome (526 spanned gaps and 349 unspanned gaps[7]).

If we leave aside other problematic issues in deciding what a reference genome actually is, and what it should contain[8], we can ask the simple question: is the current human genome a draft genome? I think everyone would clearly say ‘no’. But what if I asked: is the current human genome complete? I’m curious how many people would say ‘yes’ and how many people would ask me to first define ‘complete’.

Here are some results for how many hits you get when Googling for the following phrases:

251,000 — finished human genome sequence
171,000 — “almost complete” human genome sequence
69,400 — “near complete” human genome sequence
26,200 — “essentially complete” human genome sequence

Scientists and journalists don’t help the situation by perhaps being too eager to overhype the state of completion of the human genome[9]. In conclusion, the human genome is no longer a draft genome, but it is still just a little bit drafty. More on this topic of drafty genomes in part 2!

[1] There are 1,570 if you don’t require the words ‘draft’ and ‘genome’ to be together in the article title.
[2] The phrase ‘working draft’ had been in use since at least late 1998.
[3] There is also the entire batch of chromosome-specific papers published between 2001 and 2006.
[4] This percentage is based on the following line in the paper: “The euchromatic genome is thus ~2.88 Gb and the overall human genome is ~3.08 Gb”
[5] This is the first patch update to version 38 of the reference sequence.
[6] There are 3,212,670,709 bp in the latest assembly.
[7] The GRC defines the two categories as follows: Spanned gaps are found within scaffolds and there is some evidence suggesting linkage between the two sequences flanking the gap. Unspanned gaps are found between scaffolds and there is no evidence of linkage.
[8] Remember, human genomes are diploid and not only vary between individuals but can also vary from cell to cell. The idea of a ‘reference’ sequence is therefore a nebulous one. How much known variation do you try to represent (the GRC represents many alternative loci)? How should a reference sequence represent things like ribosomal DNA arrays or other tandem repeats?
[9] Jonathan Eisen wrote a great blog post on this: Some history of hype regarding the human genome project and genomics
